Self-supervised Learning Based on Multi-modal Arbitrary Rotation for RGB-D Semantic Segmentation

Authors: 李鸿宇, 张宜飞, 杨东宝

Funding: National Natural Science Foundation of China, General Program (62376266); Chinese Academy of Sciences Basic Frontier Science Research Program, "From 0 to 1" Original Innovation Project (ZDBS-LY-7024)
Code: https://github.com/Physu/ArbRot

Abstract:

Self-supervised learning on RGB-D data has attracted extensive attention. However, most methods focus on global-level representation learning, which tends to lose the local details that are crucial for recognizing objects. The geometric consistency between the image and depth modalities of RGB-D data can serve as a clue to guide self-supervised feature learning. This study proposes ArbRot, which not only rotates inputs by arbitrary, unrestricted angles and generates multiple pseudo-labels for pretext tasks, but also establishes the relationship between global and local context. ArbRot can be jointly trained with contrastive learning methods to build a multi-modal, multi-pretext-task self-supervised learning framework, which enforces feature-representation consistency between the image and depth views and thereby provides an effective initialization for RGB-D semantic segmentation. Experimental results on SUN RGB-D and NYU Depth Dataset V2 show that the feature representations obtained by multi-modal arbitrary-rotation self-supervised learning are of higher quality than those of the baseline models.
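The pretext task described above can be sketched roughly as follows: both modalities of an RGB-D pair are rotated by the same unrestricted angle, and a discretised pseudo-label is derived from that angle for a rotation-prediction head. This is a minimal illustration in PyTorch with hypothetical function names, not the authors' released implementation (see the repository linked above for that):

```python
import math
import random

import torch
import torch.nn.functional as F


def arbitrary_rotate(x, angle_deg):
    """Rotate a (C, H, W) tensor by an arbitrary angle via an affine grid."""
    theta = math.radians(angle_deg)
    # 2x3 affine matrix for a rotation about the image centre
    # (affine_grid works in normalised [-1, 1] coordinates).
    rot = torch.tensor([[math.cos(theta), -math.sin(theta), 0.0],
                        [math.sin(theta),  math.cos(theta), 0.0]])
    grid = F.affine_grid(rot.unsqueeze(0), [1, *x.shape], align_corners=False)
    return F.grid_sample(x.unsqueeze(0), grid, align_corners=False).squeeze(0)


def make_rotation_sample(rgb, depth, num_bins=360):
    """Apply the same unrestricted rotation to both modalities and
    return a discretised angle pseudo-label for the pretext task."""
    angle = random.uniform(0.0, 360.0)   # arbitrary angle, not limited to multiples of 90
    label = int(angle) % num_bins        # pseudo-label for rotation classification
    return arbitrary_rotate(rgb, angle), arbitrary_rotate(depth, angle), label
```

Because the rotation angle is shared across the RGB and depth views, predicting it from either modality exploits exactly the geometric consistency the abstract mentions; in the full framework this rotation loss would be combined with a contrastive loss over the two modalities.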

Cite this article:

李鸿宇, 张宜飞, 杨东宝. Self-supervised learning based on multi-modal arbitrary rotation for RGB-D semantic segmentation. 计算机系统应用 (Computer Systems & Applications), 2024, 33(1): 219–230.
History:
  • Received: 2023-06-29
  • Revised: 2023-07-27
  • Published online: 2023-11-24
  • Published: 2024-01-05