Received: March 13, 2024; Revised: April 10, 2024
Abstract: Knowledge distillation (KD) is a technique that transfers the knowledge of a complex model (the teacher) to a simpler model (the student). Most popular distillation methods operate on intermediate feature layers, but since decoupled knowledge distillation (DKD) was proposed, response-based distillation has returned to the SOTA ranks: by imposing strong consistency constraints, DKD splits the classical KD loss into two parts and thereby resolves its high coupling. However, this approach overlooks the large representation gap that arises when the teacher and student architectures differ greatly, so that the smaller student model cannot effectively learn the teacher's knowledge. To address this problem, this study uses a diffusion model to narrow the representation gap between teacher and student: teacher features are fed into a diffusion model for training, and a lightweight diffusion model then denoises the student features, reducing the gap between the two representations. Extensive experiments show that, compared with baseline methods, this approach yields substantial improvements on the CIFAR-100 and ImageNet datasets and maintains good performance even when the teacher-student architecture gap is large.
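For context, the "two parts" referred to above are the target-class and non-target-class terms of the DKD reformulation (Zhao et al., CVPR 2022), in which the classical KD loss is rewritten so that the weight on the non-target term is exposed and can be decoupled into a free hyperparameter:

```latex
% DKD reformulation of the classical KD loss (Zhao et al., CVPR 2022):
% b = binary (target vs. non-target) probabilities, \hat{p} = probabilities
% over the non-target classes only, p_t^T = teacher's target-class probability.
\mathrm{KD}
  = \underbrace{\mathrm{KL}\bigl(\mathbf{b}^{T}\,\|\,\mathbf{b}^{S}\bigr)}_{\mathrm{TCKD}}
  + \bigl(1 - p_{t}^{T}\bigr)\,
    \underbrace{\mathrm{KL}\bigl(\hat{\mathbf{p}}^{T}\,\|\,\hat{\mathbf{p}}^{S}\bigr)}_{\mathrm{NCKD}}
% DKD replaces the coupled weight (1 - p_t^T) with independent hyperparameters:
\mathcal{L}_{\mathrm{DKD}} = \alpha\,\mathrm{TCKD} + \beta\,\mathrm{NCKD}
```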
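The denoising step is only sketched in the abstract; below is a minimal PyTorch illustration of the general idea, not the paper's implementation. All names (LightDenoiser, diffuse, denoise_student), the linear noise schedule, and the fixed denoising step t_s are assumptions made for this sketch: a small denoiser is trained on teacher features with a noise-prediction objective, and the student's features are then treated as noised teacher features and denoised toward the teacher's representation.

```python
# Minimal sketch, assuming pooled feature vectors and a linear noise schedule;
# the paper's actual architecture, schedule, and training recipe may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightDenoiser(nn.Module):
    """Hypothetical lightweight noise-prediction network over feature vectors."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden),  # +1 for the diffusion-step scalar
            nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        # x: (B, dim) features; t: (B,) diffusion step in [0, 1]
        return self.net(torch.cat([x, t[:, None]], dim=1))

def diffuse(x0, t, noise):
    # Forward process under a simple linear schedule:
    # x_t = sqrt(1 - t) * x0 + sqrt(t) * eps
    a = (1 - t)[:, None].sqrt()
    s = t[:, None].sqrt()
    return a * x0 + s * noise

def denoiser_loss(denoiser, teacher_feat):
    # Train the denoiser on teacher features with an epsilon-prediction objective.
    t = torch.rand(teacher_feat.size(0), device=teacher_feat.device)
    eps = torch.randn_like(teacher_feat)
    x_t = diffuse(teacher_feat, t, eps)
    return F.mse_loss(denoiser(x_t, t), eps)

def denoise_student(denoiser, student_feat, t_s=0.5):
    # At distillation time, treat the student feature as a noised teacher
    # feature at a fixed step t_s and invert one step of the forward process.
    t = torch.full((student_feat.size(0),), t_s, device=student_feat.device)
    eps_hat = denoiser(student_feat, t)
    a = (1 - t)[:, None].sqrt()
    s = t[:, None].sqrt()
    return (student_feat - s * eps_hat) / a  # estimate of the clean feature
```

In a full pipeline, the denoised student feature could then be aligned with the teacher feature (e.g., by an MSE term) alongside the DKD loss; the weighting between the two terms is another assumption left open here.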
Keywords: knowledge distillation (KD); decoupled knowledge distillation; diffusion model; representation gap; teacher-student network
Foundation item: Science and Technology Program of the Ministry of Public Security (2022JSM08)
Citation:
WANG Peng-Yu, ZHU Zi-Qi. Decoupled Knowledge Distillation Based on Diffusion Model. Computer Systems Applications, 2024, 33(9): 58-64