Spam Message Recognition Based on TFIDF and Self-Attention-Based Bi-LSTM
Author: Wu Sihui, Chen Shiping
Abstract:

Mobile phone text messaging has become an increasingly important means of daily communication, so identifying spam messages has practical significance. A self-attention-based Bi-LSTM neural network model combined with TFIDF is proposed for this purpose. The model first feeds the short message, represented as word vectors, into the Bi-LSTM layer; after feature extraction, the Bi-LSTM output is combined with the information from the TFIDF and self-attention layers to obtain the final feature vector. Finally, the feature vector is classified by a Softmax classifier to produce the classification result. The experimental results show that, compared with traditional classification models, the self-attention-based Bi-LSTM model combined with TFIDF improves text recognition accuracy by 2.1%–4.6% and reduces running time by 0.6 s–10.2 s.
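As a concrete illustration of the pipeline described in the abstract, below is a minimal PyTorch sketch. It assumes single-head self-attention over the Bi-LSTM output sequence, mean-pooling of the attended states, and fusion with the TF-IDF vector by concatenation; the class name `TfidfAttnBiLSTM`, all layer sizes, and the fusion step are illustrative assumptions, since the abstract does not specify the exact architecture.

```python
# Minimal sketch of the described pipeline (PyTorch). All names, layer
# sizes, and the concatenation-based fusion of TF-IDF features are
# illustrative assumptions; the abstract does not fix these details.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TfidfAttnBiLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=64,
                 tfidf_dim=1000, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Bi-LSTM extracts contextual features from the message vectors.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Single-head self-attention over the Bi-LSTM output sequence.
        self.self_attn = nn.MultiheadAttention(2 * hidden_dim, num_heads=1,
                                               batch_first=True)
        # Softmax classifier over [attended features ; TF-IDF vector].
        self.fc = nn.Linear(2 * hidden_dim + tfidf_dim, num_classes)

    def forward(self, token_ids, tfidf_vec):
        # token_ids: (batch, seq_len) word indices
        # tfidf_vec: (batch, tfidf_dim) per-message TF-IDF features
        h, _ = self.bilstm(self.embedding(token_ids))   # (B, T, 2H)
        a, _ = self.self_attn(h, h, h)                  # self-attention
        context = a.mean(dim=1)                         # pool to (B, 2H)
        fused = torch.cat([context, tfidf_vec], dim=1)  # fuse with TF-IDF
        return F.softmax(self.fc(fused), dim=1)         # class probabilities

# Usage sketch: one batch of 8 messages, 30 tokens each.
model = TfidfAttnBiLSTM(vocab_size=5000)
probs = model(torch.randint(0, 5000, (8, 30)), torch.rand(8, 1000))
print(probs.shape)  # torch.Size([8, 2])
```

In practice the TF-IDF vector would come from a vectorizer fitted on the training messages, and training would use a cross-entropy-style loss on the classifier output.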

Get Citation

Wu SH, Chen SP. Spam message recognition based on TFIDF and self-attention-based Bi-LSTM. Computer Systems & Applications, 2020, 29(9): 171-177. (in Chinese)
History
  • Received: December 12, 2019
  • Revised: January 03, 2020
  • Online: September 07, 2020
  • Published: September 15, 2020