Robust Text Categorization Using Covariates to Control Confounding Factors
Computer Systems & Applications, 2020, Vol. 29, Issue 3: 155-160
DONG Yuan-Yuan
Qilu Normal University, Jinan 250013, China
Foundation item: Social Science Planning Research Project of Shandong Province (17CTYJ03)
Abstract: Many document categorization methods seldom control for confounding variables and are therefore not robust to changes in the data distribution. To address this problem, a document (text) categorization method based on covariate adjustment is proposed. First, the confounding factors (variables) in text categorization are assumed to be observable in the training stage but not in the testing stage. Then, in the prediction stage, the prediction is summed over the confounding factors, using their distribution as observed in the training stage. Finally, based on Pearl's covariate adjustment, the effect of the text features and the class variable on classifier accuracy is examined while controlling for the confounding factors. The performance of the proposed method is verified on a microblog data set and the IMDB data set. The experimental results show that the proposed method achieves higher classification accuracy and greater robustness to confounding variables than the other methods compared.
Key words: covariate adjustment; confounding variables; text classification; text features; robustness

1 Introduction

2 Covariate Adjustment for Text Classifiers

2.1 Covariate Adjustment in Text Classification

$p(y \mid do(x)) = \sum\limits_{z \in Z} p(y \mid x, z)\, p(z)$ (1)

Figure 1 Directed graphical model of the proposed method

$p(z = k) = \frac{\displaystyle \sum\limits_{i \in D} 1[z_i = k]}{|D|}$ (2)
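Equations (1) and (2) can be read together as a two-step recipe: estimate the confounder prior from training counts, then average the conditional prediction over that prior. The following is a minimal sketch under stated assumptions, not the paper's implementation. The names `confounder_prior`, `adjusted_probability` and `toy_cond_model` are hypothetical, and the lookup table stands in for whatever fitted model supplies $p(y \mid x, z)$.

```python
from collections import Counter


def confounder_prior(z_train):
    """Eq. (2): p(z = k) is the share of training instances with z_i = k."""
    counts = Counter(z_train)
    total = len(z_train)
    return {k: count / total for k, count in counts.items()}


def adjusted_probability(cond_model, x, prior):
    """Eq. (1): p(y | do(x)) = sum over z of p(y | x, z) * p(z)."""
    return sum(cond_model(x, z) * p_z for z, p_z in prior.items())


def toy_cond_model(x, z):
    # Hypothetical lookup table standing in for a fitted p(y = 1 | x, z).
    table = {("good", 0): 0.75, ("good", 1): 0.5}
    return table[(x, z)]


# Confounder values observed for four training instances (made-up data).
prior = confounder_prior([0, 0, 0, 1])
p_adjusted = adjusted_probability(toy_cond_model, "good", prior)
print(prior, p_adjusted)
```

Because only the training-time prior enters the sum, the confounder value of a test instance is never needed, which matches the assumption that $z$ is observed during training but not during testing.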

2.2 Tuning the Adjustment Strength

$\begin{aligned}[b] L(D,\theta) = & \sum\limits_{i \in D} \log p_\theta(y_i \mid x_i, z_i) \\ & - \lambda_x \sum\limits_k (\theta_k^x)^2 - \lambda_z \sum\limits_k (\theta_k^z)^2 \end{aligned}$ (3)
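The objective in Eq. (3) can be evaluated directly once a form for $p_\theta$ is chosen. The sketch below assumes a binary label, a logistic model over the concatenated text features and a one-hot encoding of $z$; these choices, and the function name `objective`, are illustrative assumptions rather than details confirmed by the text.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def objective(theta_x, theta_z, data, lam_x, lam_z):
    """Eq. (3): log-likelihood minus separate L2 penalties on theta^x and theta^z.

    data is a list of (x, z, y) triples; x and z are feature lists, y is 0 or 1.
    Assumes p_theta(y = 1 | x, z) = sigmoid(theta_x . x + theta_z . z).
    """
    log_lik = 0.0
    for x, z, y in data:
        score = sum(w * v for w, v in zip(theta_x, x))
        score += sum(w * v for w, v in zip(theta_z, z))
        p1 = sigmoid(score)
        log_lik += math.log(p1 if y == 1 else 1.0 - p1)
    penalty_x = lam_x * sum(w * w for w in theta_x)
    penalty_z = lam_z * sum(w * w for w in theta_z)
    return log_lik - penalty_x - penalty_z


# Two made-up instances: (text features, one-hot confounder, label).
data = [([1.0, 0.0], [1.0, 0.0], 1), ([0.0, 1.0], [0.0, 1.0], 0)]
base = objective([0.0, 0.0], [0.0, 0.0], data, lam_x=1.0, lam_z=0.1)
print(base)
```

Keeping $\lambda_x$ and $\lambda_z$ as separate arguments is the point of the sketch: the penalty on the confounder coefficients can be changed without touching the penalty on the text-feature coefficients.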

3 Experiments and Analysis

1) $P_{\rm train}(y = 1 \mid z = 1) = b_{\rm train}$;

2) $P_{\rm test}(y = 1 \mid z = 1) = b_{\rm test}$;

3) $P_{\rm train}(Y) = P_{\rm test}(Y)$;

4) $P_{\rm train}(Z) = P_{\rm test}(Z)$.
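For binary $y$ and $z$, the four constraints above pin down the full joint distribution $P(y, z)$ of a sample once a bias value for $P(y = 1 \mid z = 1)$ and the two marginals are fixed. The sketch below works out the four cell probabilities; the function name `joint_cells` and the numeric values are hypothetical and chosen only for illustration.

```python
def joint_cells(b, p_y, p_z):
    """Joint P(y, z) for binary y, z given P(y=1|z=1)=b, P(y=1)=p_y, P(z=1)=p_z.

    Keys of the returned dict are (y, z) pairs.
    """
    p11 = b * p_z                  # y = 1, z = 1
    p01 = (1.0 - b) * p_z          # y = 0, z = 1
    p10 = p_y - p11                # y = 1, z = 0
    p00 = 1.0 - p11 - p01 - p10    # y = 0, z = 0
    cells = {(1, 1): p11, (0, 1): p01, (1, 0): p10, (0, 0): p00}
    if min(cells.values()) < -1e-12:
        raise ValueError("infeasible combination of b, P(y=1) and P(z=1)")
    return cells


# Same marginals in both samples, different bias (made-up values).
train = joint_cells(b=0.9, p_y=0.5, p_z=0.5)
test = joint_cells(b=0.1, p_y=0.5, p_z=0.5)
print(train)
print(test)
```

Multiplying each cell by the desired sample size gives the number of instances to draw from each (y, z) group. Not every combination is feasible: a large bias with a small $P(y = 1)$ would require a negative cell, which the function reports as an error.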

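3.1 Experimental Data and Setup

3.2 Models Compared

Logistic regression (LR): The baseline in this study is a standard L2-regularized logistic regression classifier. It makes no adjustment for confounding factors and simply models $P(Y|X)$.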

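3.3 Results and Analysis

3.3.1 Microblog Experiments

Figure 2 Experimental results on the microblog data

3.3.2 IMDB Experiments

Figure 3 Results based on the chi-squared statistic

Figure 4 Percentage of features exhibiting Simpson's paradox

3.4 Parameter Analysis

Figure 5 Experimental results on the IMDB data

Figure 6 Confounder feature coefficients and accuracy

4 Conclusion and Outlook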
