Survey on Video Semantic Segmentation Based on Deep Learning
    Abstract:

    Research on video semantic segmentation currently proceeds along two main lines. The first exploits the temporal information between video frames to improve segmentation accuracy; the second exploits the similarity between frames to select key frames, reducing the amount of computation and increasing the running speed of the model. To improve segmentation accuracy, new modules are typically designed and combined with existing convolutional neural networks (CNNs). To reduce computation, the correlation of low-level features across the frame sequence is used to select key frames, which cuts both computation and running time. This paper first introduces the development background of video semantic segmentation and the commonly used datasets Cityscapes and CamVid, then reviews existing video semantic segmentation methods, and finally summarizes the current state of the field and offers prospects and suggestions for future development.

Get Citation

Han LL, Meng ZH. Survey on video semantic segmentation based on deep learning. Computer Systems & Applications, 2019, 28(12): 1-8.

History
  • Received: May 16, 2019
  • Revised: May 31, 2019
  • Online: December 13, 2019
  • Published: December 15, 2019