Adaptation Fine-tuning Based on Singular Value Decomposition
Authors: 林志鹏, 郭峥嵘, 张伟志, 郭躬德
Funding: National Natural Science Foundation of China (61976053, 62171131)
    Abstract:

    The rise of large language models has profoundly impacted natural language processing. As computational resources grow and model sizes expand, the application potential of large language models in natural language processing becomes increasingly evident. However, the widely used low-rank adaptation (LoRA) method faces challenges in fine-tuning efficiency and storage cost as model size increases. To address this issue, this study proposes an adaptation fine-tuning method based on singular value decomposition (SVD). The method keeps only the diagonal matrix and scaling vectors obtained from SVD as trainable parameters, reducing training costs while improving performance on multiple natural language processing tasks. Experimental results show that the proposed method outperforms methods with a comparable number of trainable parameters on the GLUE and E2E benchmarks. Compared with commonly used parameter-efficient fine-tuning methods, it offers clear advantages in reducing the number of trainable parameters and improving fine-tuning efficiency, and it achieves the highest performance gain in the experiments on per-parameter fine-tuning efficiency. Future work will focus on further optimizing the method to achieve more efficient fine-tuning across a broader range of tasks and larger models.
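
    The description above amounts to factoring a pretrained weight matrix as W = UΣVᵀ, freezing the singular vectors U and V, and training only the singular values in Σ together with a scaling vector. The sketch below is a minimal PyTorch illustration of that idea, not the authors' exact formulation: the one-time SVD of a frozen linear layer, the identity-preserving initialization, and the name SVDAdaptedLinear are assumptions made for this example.

        # Minimal sketch (assumed details, not the paper's exact method): factor a frozen
        # nn.Linear weight with SVD, freeze the singular vectors, and train only the
        # singular values plus a per-output scaling vector.
        import torch
        import torch.nn as nn

        class SVDAdaptedLinear(nn.Module):
            def __init__(self, frozen_linear: nn.Linear):
                super().__init__()
                w = frozen_linear.weight.detach()                   # (out_features, in_features)
                u, s, vh = torch.linalg.svd(w, full_matrices=False)
                self.register_buffer("u", u)                        # frozen left singular vectors
                self.register_buffer("vh", vh)                      # frozen right singular vectors
                self.sigma = nn.Parameter(s.clone())                # trainable singular values (diagonal of Σ)
                self.scale = nn.Parameter(torch.ones(w.shape[0]))   # trainable per-output scaling vector
                self.bias = frozen_linear.bias                      # reuse the original (frozen) bias, if any

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                # y = scale ⊙ (U diag(σ) Vᵀ x) + bias
                y = (x @ self.vh.T) * self.sigma                    # project onto right singular vectors, rescale
                y = (y @ self.u.T) * self.scale                     # map back to the output space, apply scaling
                return y if self.bias is None else y + self.bias

        if __name__ == "__main__":
            base = nn.Linear(768, 768)
            for p in base.parameters():
                p.requires_grad_(False)
            adapted = SVDAdaptedLinear(base)
            x = torch.randn(4, 768)
            print(torch.allclose(adapted(x), base(x), atol=1e-4))   # True: initialization reproduces the frozen layer
            trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
            print(f"trainable parameters: {trainable}")             # 768 singular values + 768 scales = 1536

    In this sketch only about out_features + min(out_features, in_features) scalars per adapted layer are trainable, so the trainable-parameter count grows linearly with layer width rather than quadratically as in full fine-tuning.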

Cite this article:

林志鹏, 郭峥嵘, 张伟志, 郭躬德. 基于奇异值分解的适应微调 (Adaptation Fine-tuning Based on Singular Value Decomposition). 计算机系统应用, 2025, 34(1): 276–284.

History
  • Received: 2024-06-06
  • Revised: 2024-07-10
  • Published online: 2024-11-25