3D Dense Captioning Method Based on Multi-level Context Voting
Abstract:

Traditional three-dimensional (3D) dense captioning methods suffer from insufficient use of point-cloud context, loss of feature information, and impoverished hidden-state information. To address these problems, a multi-level context voting network is proposed. It uses a self-attention mechanism to capture point-cloud context during the voting process and exploits this context at multiple levels to improve object-detection accuracy. Meanwhile, a temporal fusion module for hidden states and attention is designed: it fuses the hidden state at the current moment with the attention result from the previous moment, enriching the hidden-state information and thereby improving the expressiveness of the model. In addition, the model adopts a two-stage training strategy that effectively filters out low-quality object proposals and improves the quality of the generated descriptions. Extensive experiments on the official ScanNet and ScanRefer datasets show that this method achieves more competitive results than baseline methods.
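The abstract names two mechanisms without giving their exact formulation: self-attention that injects point-cloud context into the vote features, and a temporal fusion step that combines the decoder's current hidden state with the attention result from the previous time step. The PyTorch sketch below is a minimal reading of that description, not the authors' implementation; all module names, dimensions, and the residual/fusion layouts are assumptions.

# A minimal PyTorch sketch (not the authors' code) of the two mechanisms
# named in the abstract. Names, sizes, and the exact fusion layout are
# illustrative assumptions.
import torch
import torch.nn as nn

class VoteSelfAttention(nn.Module):
    # Self-attention over vote features, so each vote is refined by the
    # context of all other votes before object proposals are generated.
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, votes):                        # votes: (B, N, dim)
        ctx, _ = self.attn(votes, votes, votes)
        return votes + ctx                           # residual connection (assumed)

class FusedDecoderStep(nn.Module):
    # One captioning step: the current hidden state h_t is fused with the
    # attention result attn_prev computed at step t-1, enriching h_t.
    def __init__(self, dim=256):
        super().__init__()
        self.gru = nn.GRUCell(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, word_emb, h_prev, attn_prev):  # all: (B, dim)
        h_t = self.gru(word_emb, h_prev)
        return torch.tanh(self.fuse(torch.cat([h_t, attn_prev], dim=-1)))

# Toy shapes, only to show the data flow.
B, N, D = 2, 128, 256
enriched = VoteSelfAttention(D)(torch.randn(B, N, D))      # (B, N, D)
h_t = FusedDecoderStep(D)(torch.randn(B, D), torch.randn(B, D),
                          enriched.mean(dim=1))            # (B, D)

For the first decoding step, attn_prev can plausibly be a zero vector; the two-stage training the abstract mentions would then score the enriched proposals and caption only the high-quality ones.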

Get Citation

Wu CL, Hao YQ, Li Y. 3D dense captioning method based on multi-level context voting. Computer Systems & Applications, 2023, 32(3): 291–299. (in Chinese)

History
  • Received: August 03, 2022
  • Revised: September 07, 2022
  • Online: December 09, 2022