计算机系统应用 (Computer Systems & Applications), 2019, Vol. 28, Issue (7): 240-245

Gesture Estimation of Cattle Face Based on Cascade Structure
GOU Xian-Tai, HUANG Wei, LIU Qi-Fen
School of Electrical Engineering, Southwest Jiaotong University, Chengdu 611756, China
Foundation item: Science and Technology Major Program of Sichuan Province (18ZDZX0162); Major Science and Technology Research and Development Plan of Sichuan Province (2017GZ0159)
Abstract: Because the cattle population keeps growing, identity authentication based on single-angle feature coding can no longer meet the demand in terms of data capacity. A cascade structure is used to first detect the cattle's face and then estimate its pose angles, building a solid feature basis for multi-angle feature coding. Experimental results show that the cascade structure achieves high accuracy in both the face detection and pose angle estimation tasks.
Key words: cascade; cattle face detection; gesture angle

1 Composition and Principle of the Cascade Structure

The overall workflow of the cascade structure is shown in Fig. 1.

Fig. 1 Main flowchart of the cascade structure

1.1 Livestock Face Detection

1.1.1 SSD Network Structure

The SSD network structure is shown in Fig. 2. The model is based on VGG-16: the last two fully connected layers are converted into convolutional layers, and additional fully convolutional (FCN) layers are appended for feature extraction. The feature maps of the Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, and Conv11_2 layers are extracted for multi-scale detection.
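For the standard SSD300 configuration (per the original SSD paper, not measured in this article), the six feature maps above have spatial sizes 38, 19, 10, 5, 3, and 1, with 4 or 6 default boxes per cell. A quick sanity check of the total number of default boxes:

```python
# Per-layer feature map sizes and default boxes per cell for SSD300
# (Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, Conv11_2).
layers = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

total_boxes = sum(size * size * boxes for size, boxes in layers)
print(total_boxes)  # 8732 default boxes in total
```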

Fig. 2 SSD network structure

1.1.2 Loss Function

The SSD loss function consists of two parts: the classification confidence loss ${L_{conf}}$ and the localization loss ${L_{loc}}$.

 $L\left( {x,c,l,g} \right) = \frac{1}{N}\left( {{L_{conf}}\left( {x,c} \right) + \alpha {L_{loc}}\left( {x,l,g} \right)} \right)$ (1)

 ${L_{conf}}\left( {x,c} \right) = - \sum\limits_{i \in Pos}^N x_{ij}^p\log \left( {\hat c_i^p} \right) - \sum\limits_{i \in Neg} \log \left( {\hat c_i^0} \right),\;{\text{where}}\;\hat c_i^p = \frac{{\exp \left( {c_i^p} \right)}}{{\sum\nolimits_p {\exp \left( {c_i^p} \right)} }}$ (2)
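The confidence term of Eq. (2) can be sketched in plain NumPy (an illustrative snippet, not the paper's code; the function name `conf_loss` and its list-based inputs are hypothetical):

```python
import numpy as np

def softmax(c):
    """Softmax over class scores, as in Eq. (2)."""
    e = np.exp(c - c.max())  # subtract the max for numerical stability
    return e / e.sum()

def conf_loss(pos_scores, pos_labels, neg_scores):
    """Confidence loss: cross-entropy over positive matches plus
    background (class 0) log-loss over the selected negatives."""
    loss = 0.0
    for scores, p in zip(pos_scores, pos_labels):
        loss -= np.log(softmax(scores)[p])
    for scores in neg_scores:
        loss -= np.log(softmax(scores)[0])
    return loss
```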

 $\left\{ \begin{aligned} &{L_{loc}}\left( {x,l,g} \right) = \sum\limits_{i \in Pos}^N \sum\limits_{m \in \left\{ {cx,cy,w,h} \right\}} x_{ij}^k\,{\text{smooth}}_{L1}\left( {l_i^m - \hat g_j^m} \right)\\ &\hat g_j^{cx} = \left( {g_j^{cx} - d_i^{cx}} \right)/d_i^w,\quad \hat g_j^{cy} = \left( {g_j^{cy} - d_i^{cy}} \right)/d_i^h\\ &\hat g_j^w = \log \left( {g_j^w/d_i^w} \right),\quad \hat g_j^h = \log \left( {g_j^h/d_i^h} \right)\\ &{\text{where}}\;{\text{smooth}}_{L1}\left( x \right) = \begin{cases} 0.5{x^2}, & {\text{if}}\;\left| x \right| < 1\\ \left| x \right| - 0.5, & {\text{otherwise}} \end{cases} \end{aligned} \right.$ (3)
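The smooth L1 function and the box encoding of Eq. (3) can be written directly (a minimal sketch; the function names are illustrative, and boxes are assumed to be (cx, cy, w, h) tuples):

```python
import math

def smooth_l1(x):
    """Smooth L1 from Eq. (3): quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def encode_box(g, d):
    """Encode a ground-truth box g relative to a default box d,
    both given as (cx, cy, w, h), as in Eq. (3)."""
    return ((g[0] - d[0]) / d[2],
            (g[1] - d[1]) / d[3],
            math.log(g[2] / d[2]),
            math.log(g[3] / d[3]))
```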

1.1.3 Hard Negative Mining

1.2 Cattle Face Pose Estimation

Fig. 3 MobileNet network model

The MobileNet architecture is shown in Fig. 3. It makes extensive use of 1×1 convolutions and depthwise separable convolutions, which greatly reduces the number of parameters.

1.2.1 Depthwise Separable Convolution

Fig. 4 Depthwise separable convolution

For an input feature map of size ${D_F} \times {D_F}$ with $M$ channels, a ${D_k} \times {D_k}$ depthwise kernel, and $N$ output channels, the cost of a depthwise separable convolution is the depthwise term plus the 1×1 pointwise term:

 ${D_k} \times {D_k} \times M \times {D_F} \times {D_F} + N \times M \times {D_F} \times {D_F}$ (4)
Fig. 5 1×1 convolution

A standard convolution with the same kernel and channel configuration costs:

 ${D_k} \times {D_k} \times N \times M \times {D_F} \times {D_F}$ (5)

The ratio of the two quantifies the computational saving:

 $\frac{{{D_k} \times {D_k} \times M \times {D_F} \times {D_F} + N \times M \times {D_F} \times {D_F}}}{{{D_k} \times {D_k} \times M \times N \times {D_F} \times {D_F}}} = \frac{1}{N} + \frac{1}{{D_k^2}}$ (6)

After applying the width multiplier $\alpha$ and the resolution multiplier $\beta$, the cost becomes:

 ${D_k} \times {D_k} \times \alpha M \times \beta {D_F} \times \beta {D_F} + \alpha N \times \alpha M \times \beta {D_F} \times \beta {D_F}$ (7)
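As a numerical check of Eqs. (4)–(6), assuming an example layer with a 3×3 kernel, a 14×14 feature map, and 512 input and output channels (illustrative values, not from the paper):

```python
def standard_cost(dk, m, n, df):
    """Multiply-adds of a standard convolution, Eq. (5)."""
    return dk * dk * m * n * df * df

def separable_cost(dk, m, n, df):
    """Multiply-adds of a depthwise separable convolution, Eq. (4):
    depthwise pass plus 1x1 pointwise pass."""
    return dk * dk * m * df * df + n * m * df * df

dk, m, n, df = 3, 512, 512, 14
ratio = separable_cost(dk, m, n, df) / standard_cost(dk, m, n, df)
print(ratio)  # equals 1/n + 1/dk**2 ~= 0.113, i.e. roughly 8.9x fewer operations
```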

1.2.2 Loss Function

 ${L_\delta }\left( {y,f\left( x \right)} \right) = \begin{cases} \dfrac{1}{2}{{\left( {y - f\left( x \right)} \right)}^2}, & {\text{if}}\;\left| {y - f\left( x \right)} \right| \le \delta \\ \delta \cdot \left( {\left| {y - f\left( x \right)} \right| - \dfrac{1}{2}\delta } \right), & {\text{otherwise}} \end{cases}$

The Huber loss is more robust than the traditional L2 loss: when the residual is small, the loss behaves like the L2 norm; when the residual is large, it becomes a linear function of the L1 norm. The Huber loss is therefore insensitive to outliers and less prone to exploding gradients. Moreover, the hyperparameter $\delta$ adjusts the shape of the loss curve, making it better suited to model training.
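The piecewise definition above translates directly into code (a minimal sketch; the function name `huber` is illustrative):

```python
def huber(y, f, delta=1.0):
    """Huber loss: quadratic for small residuals (L2-like),
    linear for large ones (L1-like), switching at delta."""
    r = abs(y - f)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)
```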

2 Experimental Results and Analysis

2.1 Training and Test Sets

2.2 Experimental Platform

2.3 Training

2.4 SSD Model Experimental Results and Analysis

Detection performance is evaluated with precision $P$, recall $R$, and their harmonic mean $F$:

 $P = \dfrac{{\text{number of correctly predicted cattle face boxes}}}{{\text{number of predicted cattle face boxes}}}$
 $R = \dfrac{{\text{number of correctly predicted cattle face boxes}}}{{\text{number of ground-truth cattle face boxes}}}$
 ${{F}} = \dfrac{{2 \cdot P \cdot R}}{{P + R}}$
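The three metrics follow directly from the box counts (a small sketch; the function name and example counts are illustrative):

```python
def detection_metrics(correct, predicted, ground_truth):
    """Precision, recall and F-score from cattle face box counts,
    per the formulas above."""
    p = correct / predicted
    r = correct / ground_truth
    f = 2 * p * r / (p + r)
    return p, r, f

# Example: 8 correct boxes out of 10 predictions, 16 ground-truth boxes.
p, r, f = detection_metrics(8, 10, 16)
```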

Fig. 6 SSD detection results

2.5 MobileNet Angle Prediction Results and Analysis

The MobileNet training data are obtained by cropping the SSD training images according to the ground truth; the cropped cattle faces are then annotated with the Blender software to obtain the x, y, and z angles of each face. To enrich the comparison, different width multipliers α and resolution multipliers β are used. The evaluation metric is the mean error between the predicted x, y, and z angles and their corresponding ground truth. The experimental results are as follows:
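The per-axis mean error metric can be computed as follows (a sketch; the function name and the tuple representation of the angles are assumptions):

```python
def mean_angle_error(pred, gt):
    """Mean absolute error per axis over (x, y, z) angle triples.
    pred and gt are equal-length lists of (x, y, z) tuples in degrees."""
    n = len(pred)
    return tuple(sum(abs(p[a] - g[a]) for p, g in zip(pred, gt)) / n
                 for a in range(3))
```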

3 Conclusion and Future Work
