Computer Systems & Applications, 2020, Vol. 29, Issue (10): 133–140


XU Rui1,2, FENG Rui1,2
1. School of Computer Science, Fudan University, Shanghai 201203, China;
2. Shanghai Engineering Research Center for Video Technology and System, Shanghai 201203, China
Foundation item: National Key Research and Development Program of China (2017YFC0803702)
Abstract: To improve the accuracy of convolutional neural networks on the human pose estimation task, we propose an improved loss function based on the Mean Squared Error (MSE) that addresses the pixel imbalance between the foreground (the Gaussian kernel) and the background in heatmaps. It assigns different weights to the loss according to the pixel values in the foreground and background, and we name it the Focused Mean Squared Error (FMSE). Compared with the mean squared loss, the proposed focused mean squared loss effectively reduces the impact of foreground-background pixel imbalance on network performance, helps the network localize the spatial positions of keypoints, improves accuracy, and converges faster during training. Experiments on public datasets verify the effectiveness of the proposed focused mean squared loss function.
Key words: deep learning; loss function; human pose estimation; keypoint detection; sample imbalance

1 Related Work

2 Focused Mean Squared Loss Function

 $f = e^{ - \frac{{(x - x_0)}^2 + {(y - y_0)}^2}{2\delta^2}}$ (1)
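Eq. (1) can be rendered directly on a pixel grid. The following NumPy sketch (the function name and heatmap size are illustrative, not taken from the paper) builds the ground-truth heatmap for a single keypoint:

```python
import numpy as np

def gaussian_heatmap(height, width, x0, y0, delta=2.0):
    """Evaluate Eq. (1) on an integer pixel grid: a 2-D Gaussian
    kernel peaking at the keypoint (x0, y0), decaying toward 0."""
    xs = np.arange(width)            # x coordinates, shape (width,)
    ys = np.arange(height)[:, None]  # y coordinates, shape (height, 1)
    return np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * delta ** 2))

heatmap = gaussian_heatmap(64, 48, x0=24, y0=32)
```

The peak value at the keypoint is exactly 1, and δ controls how quickly the kernel decays toward the background value 0.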

 Figure 1 Heatmap

 $Cross\_Entropy\_Loss = \left\{ {\begin{array}{*{20}{l}} { - {{\log }_2}y'}&{y = 1}\\ { - {{\log }_2}(1 - y')}&{y = 0} \end{array}} \right.$ (2)
 $Focal\_Loss = \left\{ {\begin{array}{*{20}{l}} { - \alpha {{\left( {1 - y'} \right)}^\gamma }{{\log }_2}y'}&{y = 1}\\ { - (1 - \alpha ){{y'}^\gamma }{{\log }_2}\left( {1 - y'} \right)}&{y = 0} \end{array}} \right.$ (3)
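For comparison, Eq. (3) can be sketched per pixel as follows. This is a NumPy illustration with the commonly used defaults α = 0.25 and γ = 2, which are assumptions here rather than settings from the paper:

```python
import numpy as np

def focal_loss(y_pred, y_true, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss of Eq. (3): the factors (1 - y')^gamma and
    y'^gamma shrink the loss of easy, well-classified pixels so that
    hard pixels dominate the gradient."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    pos = -alpha * (1.0 - y_pred) ** gamma * np.log2(y_pred)
    neg = -(1.0 - alpha) * y_pred ** gamma * np.log2(1.0 - y_pred)
    return np.where(y_true == 1, pos, neg).sum()
```

A confident correct prediction (e.g. y' = 0.9 for a positive pixel) incurs a much smaller loss than an uncertain one (y' = 0.5), which is exactly the focusing behavior the text describes.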

 $MSE\_Loss = \frac{1}{2}\sum\limits_{i = 1}^n {{{\left( {{{y'}_i} - {y_i}} \right)}^2}}$ (4)
 $FMSE\_Loss = \frac{1}{2}\sum\limits_{i = 1}^n {{{\left( {{y_i} + \delta } \right)}^\gamma }{{\left( {{{y'}_i} - {y_i}} \right)}^2}}$ (5)
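A minimal NumPy sketch of Eq. (5) follows; the values δ = 0.1 and γ = 2 are illustrative placeholders, since the effect of γ is what the section goes on to discuss:

```python
import numpy as np

def fmse_loss(y_pred, y_true, delta=0.1, gamma=2.0):
    """Focused MSE of Eq. (5): each squared residual is weighted by
    (y + delta)^gamma, so foreground pixels (large Gaussian-kernel
    target values) outweigh the abundant near-zero background pixels.
    delta keeps background weights nonzero; gamma controls focusing."""
    weights = (y_true + delta) ** gamma
    return 0.5 * np.sum(weights * (y_pred - y_true) ** 2)
```

Setting γ = 0 makes every weight 1 and recovers the plain MSE of Eq. (4) exactly.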

 Figure 2 Curves of the mean squared loss and the focused mean squared loss

 Figure 3 Effect of the γ value on the focused mean squared loss

3 Experiments and Analysis

3.1 Networks Used in the Experiments

 Figure 4 Stacked hourglass network architecture

 Figure 5 High-resolution network (HRNet) architecture

The HRNet architecture spans two dimensions: depth (vertical) and scale (horizontal). Horizontally, subnetworks at different resolutions run in parallel; vertically, multi-resolution information is fused. From top to bottom, each stage halves the resolution and doubles the number of channels.
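The halving-and-doubling rule above can be sketched as a quick shape calculation; the base resolution and channel count here are illustrative, not HRNet's exact configuration:

```python
def hrnet_branch_shapes(base_hw=(64, 48), base_channels=32, num_branches=4):
    """List (height, width, channels) for each parallel branch:
    spatial resolution halves and channel count doubles with each
    new, lower-resolution branch."""
    h, w = base_hw
    return [(h >> s, w >> s, base_channels << s) for s in range(num_branches)]
```

For the assumed base of 64×48×32, this yields branches of 64×48×32, 32×24×64, 16×12×128, and 8×6×256.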

3.2 The MPII and MS COCO Datasets

3.3 Training and Testing Details

3.4 Experimental Environment

3.5 Experimental Results and Analysis

 Figure 6 Training and validation information on the MS COCO dataset

 Figure 7 Training and validation information on the MS COCO dataset

 Figure 8 Examples of keypoint detection results

4 Conclusion and Future Work
