Performance Optimization of Object Detection Algorithm Based on Key-value Attention Mechanism

Authors: Zhang Zhengxin, Zhang Duzhen

Funding: Postgraduate Research and Practice Innovation Program of Jiangsu Normal University (2024XKT2604)

    Abstract:

    With the widespread application of attention mechanisms in object detection, further enhancing feature extraction capability has become a focus of research. A novel attention mechanism is proposed to optimize the feature interaction process and improve detection performance. The mechanism removes the query operation of traditional self-attention, employs depthwise separable convolution to efficiently extract both local and global information, and realizes feature aggregation through the weighted fusion of keys and values. The method effectively reduces computational complexity and enhances the model's ability to capture important features. Validation on five datasets of different types shows that the proposed attention mechanism performs well on small object detection, occlusion handling, and complex scenes, significantly improving detection accuracy and efficiency. Visualization analysis further confirms its effectiveness in feature extraction.
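    The core idea in the abstract — dropping the query branch of self-attention and fusing keys and values produced by depthwise separable convolutions — can be sketched as a small PyTorch module. This is a hypothetical reconstruction from the abstract only, not the paper's actual architecture: the module name, the softmax-based weighting of keys, and the residual fusion are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class KeyValueAttention(nn.Module):
    """Sketch of a query-free key-value attention block (assumed design).

    Keys and values come from depthwise separable convolutions; the output
    aggregates values with key-derived weights, avoiding the O(N^2)
    query-key similarity matrix of standard self-attention.
    """

    def __init__(self, channels: int):
        super().__init__()

        # Depthwise separable conv: per-channel (depthwise) 3x3 followed by
        # a pointwise 1x1 that mixes channels.
        def dsconv() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
                nn.Conv2d(channels, channels, 1),
            )

        self.to_key = dsconv()
        self.to_value = dsconv()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        k = self.to_key(x).reshape(b, c, h * w)    # keys,   (B, C, N)
        v = self.to_value(x).reshape(b, c, h * w)  # values, (B, C, N)

        # Softmax over spatial positions turns keys into per-position weights,
        # so the key branch plays the role the query-key product usually plays.
        weights = k.softmax(dim=-1)

        # Weighted fusion of values: one global context vector per channel,
        # computed in O(N) rather than O(N^2).
        context = (weights * v).sum(dim=-1, keepdim=True)  # (B, C, 1)

        # Broadcast the context back over the feature map as a residual.
        return x + context.reshape(b, c, 1, 1)
```

    Because the spatial reduction happens as a single weighted sum, the cost is linear in the number of positions, which is consistent with the abstract's claim of reduced computational complexity; the real method may differ in how keys and values are fused.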

Cite this article:

Zhang ZX, Zhang DZ. Performance optimization of object detection algorithm based on key-value attention mechanism. Computer Systems & Applications, 2025, 34(4): 195–206.

History
  • Received: 2024-09-13
  • Revised: 2024-10-30
  • Published online: 2025-02-18