Decoupled Knowledge Distillation Based on Diffusion Model

Authors: Wang Pengyu, Zhu Ziqi

Fund Project: Science and Technology Program of the Ministry of Public Security (2022JSM08)


Abstract:

Knowledge distillation (KD) is a technique that transfers knowledge from a complex model (the teacher) to a simpler model (the student). Most popular distillation methods still operate on intermediate feature layers, but since decoupled knowledge distillation (DKD) was proposed, response-based (logit-based) distillation has returned to the state of the art. DKD imposes a strong consistency constraint that splits the classic KD loss into two parts, target-class and non-target-class knowledge distillation, resolving their high coupling. However, this approach overlooks the large representation gap that arises when teacher and student architectures differ greatly, so a small student model cannot effectively learn the teacher's knowledge. To address this problem, this study uses a diffusion model to narrow the representation gap between teacher and student: teacher features are fed into a diffusion model for training, and a lightweight diffusion model then denoises the student features, shrinking the gap. Extensive experiments show that the proposed method yields significant improvements over baseline methods on the CIFAR-100 and ImageNet datasets and maintains good performance even when the architectural gap between teacher and student is large.
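For reference, the two-part loss that the abstract attributes to DKD can be sketched for a single sample as follows. This is a minimal illustration of the published DKD decomposition into target-class (TCKD) and non-target-class (NCKD) terms, not this paper's diffusion-based method; the hyperparameter values `alpha`, `beta`, and `T` are illustrative defaults:

```python
import numpy as np

def softmax(logits, T):
    """Temperature-scaled softmax over a 1-D logit vector."""
    z = logits / T
    e = np.exp(z - z.max())
    return e / e.sum()

def dkd_loss(student_logits, teacher_logits, target, alpha=1.0, beta=8.0, T=4.0):
    """Decoupled KD loss for one sample: alpha * TCKD + beta * NCKD."""
    ps = softmax(student_logits, T)
    pt = softmax(teacher_logits, T)
    # TCKD: binary KL divergence between (target, non-target) probability pairs.
    bs = np.array([ps[target], 1.0 - ps[target]])
    bt = np.array([pt[target], 1.0 - pt[target]])
    tckd = float(np.sum(bt * np.log(bt / bs)))
    # NCKD: KL divergence over the non-target classes only,
    # with each distribution renormalized to sum to 1.
    mask = np.ones_like(ps, dtype=bool)
    mask[target] = False
    ns = ps[mask] / ps[mask].sum()
    nt = pt[mask] / pt[mask].sum()
    nckd = float(np.sum(nt * np.log(nt / ns)))
    return alpha * tckd + beta * nckd
```

Because the two KL terms are weighted independently, NCKD can be emphasized (here `beta=8.0`) without being suppressed by a confident teacher prediction on the target class, which is the decoupling the abstract refers to.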

Cite this article:

Wang PY, Zhu ZQ. Decoupled knowledge distillation based on diffusion model. 计算机系统应用 (Computer Systems & Applications), 2024, 33(9): 58-64.

History
  • Received: 2024-03-13
  • Revised: 2024-04-10
  • Published online: 2024-07-24