Specific Audio Retrieval Method Based on Compressed Sensing and Audio Fingerprint
Computer Systems & Applications, 2020, Vol. 29, Issue (8): 165-172

ZHAO Wen-Bing, JIA Mao-Shen, WANG Qi
Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
Foundation item: National Natural Science Foundation of China (61971015)
Abstract: To address the large data volume and slow retrieval speed of existing audio retrieval systems, this study proposes a specific audio retrieval method based on compressed sensing and audio fingerprint dimensionality reduction. In the training stage, the sample audio signal is sparsified, the sparse audio data is compressed with a compressed sensing algorithm, and audio fingerprints are extracted from the compressed data. A discrete Gini coefficient is then computed for each dimension of the audio fingerprint and used to reduce the fingerprint dimensionality. In the recognition stage, the audio under test is processed with the same algorithm as in the training stage and matched against the sample audio fingerprints. Experimental results show that the proposed method greatly reduces the storage required by the sample audio database and improves retrieval speed while maintaining good retrieval accuracy.
Key words: audio retrieval; compressed sensing; discrete Gini coefficient; audio fingerprinting

1 Construction of the Audio Feature Library Based on Compressed Sensing

1.1 Audio Preprocessing

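1.2 Compression of the Audio Signal

 Fig. 1 Number of retained time-frequency coefficients for six classes of audio signals under different frame energy retention ratios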

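Let ${{x}} = [{x_n}(1),{x_n}(2),\cdots,{x_n}(N)]$ denote the n-th frame of the preprocessed audio signal. According to the sparse coding model, the frequency-domain coefficients α of the audio signal ${{{x}}_n}(p)$ in the DCT domain can be expressed by Eq. (1):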

 ${{\alpha}} = {{\psi x}}$ (1)

 ${{\bar x = }}{{{\psi }}^{{\rm T}}}{{\alpha }}^{\prime}$ (2)
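
The sparsification in Eqs. (1) and (2) can be sketched as follows. This is a minimal illustration rather than the authors' implementation. It assumes that ψ is the orthonormal DCT-II matrix and that α′ keeps the largest-energy coefficients of α until a chosen fraction of the frame energy is retained. The function names and the `energy_ratio` parameter are hypothetical.

```python
import math

def dct_matrix(N):
    """Orthonormal DCT-II matrix psi as a list of N rows (psi * psi^T = I)."""
    psi = []
    for k in range(N):
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        psi.append([scale * math.cos(math.pi * (2 * p + 1) * k / (2 * N))
                    for p in range(N)])
    return psi

def sparsify_frame(x, energy_ratio=0.9):
    """Return (alpha_prime, x_bar) for one frame x.

    alpha = psi x              (Eq. 1)
    x_bar = psi^T alpha_prime  (Eq. 2)
    """
    N = len(x)
    psi = dct_matrix(N)
    alpha = [sum(psi[k][p] * x[p] for p in range(N)) for k in range(N)]

    # Assumption: keep the largest-energy coefficients until the retained
    # energy reaches energy_ratio of the frame energy; zero out the rest.
    total = sum(a * a for a in alpha)
    order = sorted(range(N), key=lambda k: alpha[k] ** 2, reverse=True)
    alpha_prime = [0.0] * N
    retained = 0.0
    for k in order:
        if total == 0.0 or retained >= energy_ratio * total:
            break
        alpha_prime[k] = alpha[k]
        retained += alpha[k] ** 2

    # Back to the time domain with the transpose of psi.
    x_bar = [sum(psi[k][p] * alpha_prime[k] for k in range(N)) for p in range(N)]
    return alpha_prime, x_bar
```

A frame that coincides with a single DCT basis vector is reproduced from one retained coefficient, while a broadband frame needs more coefficients to reach the same energy ratio.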

 ${{Y = \Phi \bar X}}$ (3)

 ${{\Phi }} = {\left[ \begin{array}{c} 111000000000\cdots0 \\ 000111000000\cdots0 \\ 000000111000\cdots0 \\ \vdots \\ 000000000\cdots0111 \\ \end{array} \right]_{H \times N}}$ (4)

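After the $N \times 1$ sparse audio signal ${{\bar X}}$ is compressed by the measurement matrix Φ, an H×1 observation signal Y is obtained, which reduces the amount of audio sequence data.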
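
The compression step of Eqs. (3) and (4) can be illustrated with a short sketch. It assumes that each row of Φ holds a run of consecutive ones that does not overlap with the other rows, so that N = 3H for the matrix shown in Eq. (4). The `width` parameter is a hypothetical generalization added for illustration, and the function names are not from the paper.

```python
def measurement_matrix(H, width=3):
    """H x N block matrix as in Eq. (4), with N = H * width.

    Row h has ones in columns h*width .. h*width + width - 1, zeros elsewhere.
    """
    N = H * width
    return [[1 if h * width <= n < (h + 1) * width else 0 for n in range(N)]
            for h in range(H)]

def compress(x_bar, phi):
    """Observation signal Y = phi x_bar (Eq. 3)."""
    return [sum(p * v for p, v in zip(row, x_bar)) for row in phi]

# Usage: a 6-sample sparse frame is reduced to 2 observations.
phi = measurement_matrix(2)
y = compress([1, 2, 3, 4, 5, 6], phi)   # y == [6, 15]
```

With this structure each observation is the sum of three adjacent samples, so the product never needs the full matrix in practice. The explicit matrix is kept here only to mirror Eq. (4).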

1.3 Sparse Audio Fingerprint Feature Extraction

 $F(n,m) = \left\{ {\begin{array}{*{20}{l}} 1,\;{ {\rm if}\;{t(n,m)} - t(n - 1,m) > 0} \\ 0,\;{ {\rm if}\;{t(n,m) - t(n - 1,m) \le 0} } \end{array}} \right.$ (5)
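
The bit rule of Eq. (5) is sketched below. The sketch assumes that t(n, m) has already been computed for every frame n and dimension m. Its definition is not reproduced in this section, so the input is treated as a plain table of numbers, and the function name is hypothetical.

```python
def fingerprint_bits(t):
    """Binary fingerprint F(n, m) from Eq. (5).

    t is a list of frames, each a list of M values t(n, m).
    The first frame has no predecessor, so bits are returned for
    frames n = 1 .. len(t) - 1 only.
    """
    return [[1 if cur[m] - prev[m] > 0 else 0 for m in range(len(cur))]
            for prev, cur in zip(t, t[1:])]

# Usage: three frames with two dimensions each.
bits = fingerprint_bits([[1, 5], [2, 5], [0, 9]])   # bits == [[1, 0], [0, 1]]
```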

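1.4 Audio Fingerprint Dimensionality Reduction

(1) Compute the discrete Lorenz curve of the audio fingerprint. The discrete Lorenz curve is the key curve for computing the discrete Gini coefficient. It is formed by the elements of the cumulative fingerprint data proportion vector ${\mathop {{{W}}}\limits^ \rightharpoonup} ^j$, where j denotes the index of the fingerprint dimension, j=1,2,…, M–1. The cumulative fingerprint data proportion vector ${\mathop {{{W}}}\limits^ \rightharpoonup} ^j$ is computed as follows: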

 $\mathop {{C}}\limits^{\rightharpoonup}{^j} = [c_1^j,c_2^j,\cdots,c_L^j]$ (6)

 $w_1^j = \frac{{c_1^j}}{{{{\left\| {\mathop {{{C}}}\limits^ \rightharpoonup} ^j \right\|}_1}}},\;w_2^j = \frac{{c_1^j + c_2^j}}{{{{\left\| {\mathop {{{C}}}\limits^ \rightharpoonup} ^j \right\|}_1}}},\cdots, w_L^j = \frac{{c_1^j + c_2^j + \cdots + c_L^j}}{{{{\left\| {\mathop {{{C}}}\limits^ \rightharpoonup} ^j \right\|}_1}}} = 1$ (7)

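(2) Taking the discrete Lorenz curve obtained above as the boundary, the Gini coefficient of the j-th dimension of the audio fingerprint is given by: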

 ${G^j} = \frac{{{S_a}}}{{{S_a} + {S_b}}}$ (8)

 Fig. 2 Schematic of the discrete Gini coefficient of the audio fingerprint

 ${\tilde G^j} = 1 - \frac{1}{L}\left( 2\sum\nolimits_{i = 1}^{L - 1} w_i^j + 1 \right)$ (9)
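
Eqs. (6), (7), and (9) can be combined into a short routine. This is a sketch under two assumptions that are not stated in the surviving text: the elements of the vector C^j are nonnegative, and they are sorted in ascending order before accumulation, as is usual for a Lorenz curve. The function name and the `sort_values` flag are hypothetical.

```python
def discrete_gini(c, sort_values=True):
    """Discrete Gini coefficient of one fingerprint dimension.

    c holds the L elements of the vector C^j in Eq. (6).
    """
    # Assumption: ascending order, as in a standard Lorenz curve.
    values = sorted(c) if sort_values else list(c)
    L = len(values)
    norm1 = sum(abs(v) for v in values)     # L1 norm of C^j
    if L == 0 or norm1 == 0:
        return 0.0
    w = []
    running = 0.0
    for v in values:
        running += v
        w.append(running / norm1)           # Eq. (7); the last entry is 1
    # Eq. (9): the "+ 1" sits outside the sum over i = 1 .. L-1.
    return 1.0 - (2.0 * sum(w[:-1]) + 1.0) / L

# Usage: evenly spread data gives 0, fully concentrated data gives (L-1)/L.
g_even = discrete_gini([1, 1, 1, 1])    # 0.0
g_peak = discrete_gini([0, 0, 0, 1])    # 0.75
```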

2 Audio Feature Retrieval

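(1) The audio under test is preprocessed, sparsified, and compressed as described above to obtain its observation sequence signal ${{\bar Y}}$.

(2) The compressed observation sequence signal ${{\bar Y}}$ then undergoes fingerprint feature extraction and fingerprint dimensionality reduction to yield the test audio fingerprint ${F_d}(n,r)$, where ${F_d}(n,r)$ denotes the r-th bit of the audio fingerprint of the n-th frame of the test audio signal sequence.

(3) The test audio fingerprint is matched for similarity against the audio fingerprints in the sample audio fingerprint library. This paper uses the bit error rate (BER) as the matching measure for the similarity between two audio segments, computed as follows: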

 $BER = \frac{{\displaystyle\sum\nolimits_{n = 1}^T {\displaystyle\sum\nolimits_{r = 1}^R {{F_d}(n,r) \oplus F^{\prime} (n,r)} } }}{{T \times R}}$ (10)

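(4) A BER threshold is set and the BER is computed. If the BER is below the threshold, the audio under test is highly similar to the audio in the sample audio library. Otherwise, the similarity is low. The detection result is obtained accordingly.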
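
Steps (3) and (4) can be sketched as follows. The code assumes both fingerprints are T×R tables of 0/1 bits, as in Eq. (10). The threshold value is not given in this section, so it is left as a required argument, and the function names are hypothetical.

```python
def bit_error_rate(f_query, f_sample):
    """BER of Eq. (10) between two T x R binary fingerprint blocks."""
    T = len(f_query)
    R = len(f_query[0])
    same_shape = (len(f_sample) == T
                  and all(len(row) == R for row in f_query)
                  and all(len(row) == R for row in f_sample))
    if not same_shape:
        raise ValueError("fingerprints must have the same T x R shape")
    # XOR is 1 exactly where the two bits differ.
    errors = sum(a ^ b
                 for row_q, row_s in zip(f_query, f_sample)
                 for a, b in zip(row_q, row_s))
    return errors / (T * R)

def is_match(f_query, f_sample, threshold):
    """Step (4): declare a match when the BER is below the threshold."""
    return bit_error_rate(f_query, f_sample) < threshold

# Usage: 2 differing bits out of 8 give a BER of 0.25.
query = [[1, 0, 1, 0], [1, 1, 0, 0]]
sample = [[1, 0, 0, 0], [1, 1, 0, 1]]
ber = bit_error_rate(query, sample)
```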

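3 Experimental Results and Analysis

3.1 Performance Evaluation Metrics

3.2 Analysis of Experimental Results

3.2.1 Degree of Audio Fingerprint Dimensionality Reduction

 Fig. 3 Discrete Gini coefficients of each audio fingerprint dimension for speech and song data

3.2.2 Effect of Sample Compression Ratio and Fingerprint Dimensionality Reduction on Retrieval Performance

(1) Effect of different degrees of sample compression on retrieval performance

(2) Effect of fingerprint dimensionality on retrieval performance

(3) Combined effect of sample compression and fingerprint dimensionality reduction on retrieval performance

3.2.3 Comparison of Audio Retrieval Performance of Different Algorithms at Different Signal-to-Noise Ratios

 Fig. 4 Recall trends of the three algorithms

 Fig. 5 Precision trends of the three algorithms

4 Conclusion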
