Computer Systems Applications, 2019, 28(11): 238-244
Interferential Line Elimination in Document Image Based on Greedy Algorithm
(1. School of Mathematics and Computer Science, Wuyi University, Wuyishan 354300, China; 2. School of Computer Science and Technology, Soochow University, Suzhou 215006, China; 3. Fujian Provincial Key Laboratory of Cognitive Computing and Intelligent Information Processing, Wuyishan 354300, China; 4. School of Information Science and Technology, Nantong University, Nantong 226019, China)
Received:March 29, 2019    Revised:April 26, 2019
Abstract: Documents often contain ruled lines, hand-drawn strokes, and similar marks that serve various special purposes. When such documents are digitized, for example by scanning, and must then be recognized and converted into text encoding, these lines interfere with OCR and lower the recognition rate of the document content. This study proposes a new algorithm for removing interference lines from document images. First, the document image is binarized, with the binarization accounting for uneven illumination. Next, the foreground is thinned to single-pixel width to reduce the influence of line thickness. An improved greedy algorithm then computes the weights of line segments in the horizontal and vertical directions, and segments with higher weights are classified as interference lines. Finally, each foreground pixel in the image is assigned either to the interference lines or to the document content according to its distance from the interference lines, yielding a complete restored document image. Simulation experiments show that the proposed algorithm removes interference lines effectively, particularly where interference lines touch the text: it removes the lines with little effect on document image quality, runs quickly, and provides a good basis for subsequent OCR.
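The abstract does not give the weight formula or the details of the improved greedy algorithm, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes an already binarized and thinned image stored as a list of rows (1 = foreground), uses the traced length in pixels as a stand-in for the segment weight, and removes pixels by a simple row-distance rule. The names `trace_right`, `find_lines`, `remove_lines`, `min_len`, and `max_dist` are hypothetical.

```python
def trace_right(img, r, c, visited):
    """Greedily follow unvisited foreground pixels to the right from (r, c)."""
    h, w = len(img), len(img[0])
    path = [(r, c)]
    while c + 1 < w:
        # Greedy local choice: prefer the same row, then one row up or down.
        for nr in (r, r - 1, r + 1):
            if 0 <= nr < h and img[nr][c + 1] and (nr, c + 1) not in visited:
                r = nr
                break
        else:
            break  # no usable neighbour in the next column: segment ends
        c += 1
        path.append((r, c))
    return path


def find_lines(img, min_len):
    """Return traced paths whose weight (assumed here: pixel count) >= min_len."""
    visited = set()
    lines = []
    # Scan column by column so the leftmost pixel of a segment starts the trace.
    for c in range(len(img[0])):
        for r in range(len(img)):
            if img[r][c] and (r, c) not in visited:
                path = trace_right(img, r, c, visited)
                visited.update(path)
                if len(path) >= min_len:
                    lines.append(path)
    return lines


def remove_lines(img, lines, max_dist=0):
    """Return a copy of img with pixels within max_dist rows of a line cleared."""
    out = [row[:] for row in img]
    h = len(img)
    for path in lines:
        for r, c in path:
            for rr in range(max(0, r - max_dist), min(h, r + max_dist + 1)):
                out[rr][c] = 0
    return out


# Toy image: two vertical "text" strokes crossed by one horizontal line (row 3).
img = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
]
lines = find_lines(img, min_len=6)
cleaned = remove_lines(img, lines)
for row in cleaned:
    print("".join("#" if v else "." for v in row))
```

Vertical segments could be handled the same way by transposing the image first. Note that this naive removal also deletes the pixels where the strokes cross the line; the distance-based pixel attribution described in the abstract is intended to limit exactly that kind of damage and is not reproduced here.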
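Funding: National Natural Science Foundation of China (61672369); Central Government Guided Local Science and Technology Development Special Project (2018L3013); Natural Science Foundation of Fujian Province, General Program (2015J01669, 2017J01651); Young and Middle-aged Teachers Project of the Fujian Provincial Department of Education (JA15522)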
Citation:
WANG Ping, ZHANG Xiao-Feng, WANG Yi-Huai, CHENG Ren-Gui. Interferential Line Elimination in Document Image Based on Greedy Algorithm. Computer Systems Applications, 2019, 28(11): 238-244