Interferential Line Elimination in Document Image Based on Greedy Algorithm
WANG Ping1,3,4, ZHANG Xiao-Feng2, WANG Yi-Huai3, CHENG Ren-Gui1,4
1. School of Mathematics and Computer Science, Wuyi University, Wuyishan 354300, China;
2. School of Information Science and Technology, Nantong University, Nantong 226019, China;
3. School of Computer Science and Technology, Soochow University, Suzhou 215006, China;
4. Fujian Provincial Key Laboratory of Cognitive Computing and Intelligent Information Processing, Wuyishan 354300, China
Foundation item: National Natural Science Foundation of China (61672369); Special Fund of Central Government for Local Science and Technology Development (2018L3013); General Program of Natural Science Foundation of Fujian Province (2015J01669, 2017J01651); Mid-Aged and Young Faculty Program of Education Bureau, Fujian Province (JA15522)
Abstract: Documents often contain horizontal lines, hand-drawn lines, and similar marks that serve various special functions. When such documents are scanned into a computer and must be further recognized and converted into text codes, these lines interfere with OCR and lower the recognition rate of the document content. This study proposes a new algorithm for removing interference lines from document images. The algorithm first binarizes the document image, taking the effects of uneven illumination into account; it then thins the foreground to single-pixel width, reducing the thickness of the lines. Next, an improved greedy algorithm computes the weights of the horizontal and vertical line segments, and the segments with higher weights are identified as interference lines. Finally, each foreground pixel in the image is classified according to its distance from the interference line, yielding a complete restored document image. Simulation results show that the proposed algorithm removes interference lines effectively, especially where the lines adhere to the text, with little effect on the quality of the document image, and that it offers high computing speed and a good removal effect. The removal result provides a good basis for subsequent OCR of the image.
Key words: binarization     interferential line elimination     greedy algorithm     OCR

1 Research Overview

2 Image Preprocessing

2.1 Image Binarization

 $B\left( {i,j} \right) = \left\{ \begin{gathered} 0,\;\;{\rm{if }}\;\;I(i,j) > T(i,j) \\ 1,\;\;{\rm{if }}\;\;I(i,j) \le T(i,j) \\ \end{gathered} \right.$ (1)
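Equation (1) can be illustrated with a minimal sketch. The way the local threshold $T(i,j)$ is computed is not given alongside the equation, so the sketch assumes one common choice: the mean gray level of a square neighbourhood minus a small offset, which lets the threshold follow slow changes in illumination. The function names and the parameters `window` and `offset` are hypothetical and are used for illustration only.

```python
import numpy as np

def local_mean_threshold(gray, window=15, offset=5.0):
    """Hypothetical T(i, j): mean of a window x window neighbourhood minus offset.

    Assumes `window` is odd. This is an illustrative choice, not necessarily
    the threshold used in the paper.
    """
    pad = window // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    # Integral image with a leading row and column of zeros for simple box sums.
    integral = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    integral[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = gray.shape
    box = (integral[window:window + h, window:window + w]
           - integral[:h, window:window + w]
           - integral[window:window + h, :w]
           + integral[:h, :w])
    return box / (window * window) - offset

def binarize(gray, window=15, offset=5.0):
    """Eq. (1): B(i, j) = 1 (foreground) where I(i, j) <= T(i, j), otherwise 0."""
    threshold = local_mean_threshold(gray, window, offset)
    return (gray <= threshold).astype(np.uint8)
```

A possible usage is `B = binarize(gray)` for a 2-D gray-level array `gray`; dark strokes become 1 and the lighter background becomes 0, as in Eq. (1).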

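2.2 Image Thinning

3 Interference Line Detection Based on Greedy Algorithm

(1) Interference lines are generally horizontal, occasionally vertical, and very rarely rotated;

(2) Interference lines are generally long, far longer than the size of the text characters.

 Fig. 1 Image binarization

 Fig. 2 Image thinning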

 ${V_{li}} > {T_l}$ (2)

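(1) Initialization: set ${V_{li}} = 0\;(i = 1,\cdots,n)$ , where $n$ is the number of center lines contained in the image;

(2) Scan the pixels ahead of the current pixel and add the corresponding weight;

(3) Repeat step (2) until every pixel in the thinned image has been visited.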

 ${T_l} = 3{V_{lm}}$ (3)
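The three steps and Eqs. (2) and (3) can be sketched as follows for horizontal lines. Several details are assumptions made only for illustration: the pixels ahead are taken to be the three right-hand neighbours (straight, upper diagonal, lower diagonal), the weights `w_straight` and `w_diag` are hypothetical values, and $V_{lm}$ is assumed to be the mean of all line weights. The function names are likewise hypothetical.

```python
import numpy as np

def trace_horizontal_lines(skel, w_straight=1.0, w_diag=0.5):
    """Greedily follow each unvisited foreground pixel of a thinned image to the right.

    Returns a list of (weight, pixels) pairs, one per traced line.
    """
    h, w = skel.shape
    visited = np.zeros((h, w), dtype=bool)
    lines = []
    # Column-major scan, so every line is entered at its left-most pixel.
    for c in range(w):
        for r in range(h):
            if not skel[r, c] or visited[r, c]:
                continue
            weight, pixels = 0.0, [(r, c)]      # step (1): V_li starts at 0
            visited[r, c] = True
            cr, cc = r, c
            while True:
                # Step (2): look ahead, preferring the straight continuation.
                candidates = [(cr, cc + 1, w_straight),
                              (cr - 1, cc + 1, w_diag),
                              (cr + 1, cc + 1, w_diag)]
                step = next(((nr, nc, wt) for nr, nc, wt in candidates
                             if 0 <= nr < h and nc < w
                             and skel[nr, nc] and not visited[nr, nc]), None)
                if step is None:
                    break
                cr, cc, wt = step
                visited[cr, cc] = True
                pixels.append((cr, cc))
                weight += wt
            lines.append((weight, pixels))
    return lines                                 # step (3): all pixels visited

def select_interference_lines(lines):
    """Keep the lines whose weight satisfies Eq. (2) with the threshold of Eq. (3)."""
    if not lines:
        return []
    weights = np.array([weight for weight, _ in lines])
    v_lm = weights.mean()        # assumption: V_lm is the mean line weight
    t_l = 3.0 * v_lm             # Eq. (3)
    return [pixels for weight, pixels in lines if weight > t_l]   # Eq. (2)
```

Under the same assumptions, vertical lines could be handled by passing the transposed image `skel.T` and swapping the row and column of each returned pixel.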

 Fig. 3 Three cases of the pixels ahead of the current pixel

 Fig. 4 Detected interference lines

4 Interference Line Removal

 $I\left( {i,j} \right) \in \left\{ \begin{array}{l} {I_f},\;\;{\rm{if}}\;\;DCw\left( {i,j} \right) \le DCg\left( {i,j} \right) \\ {I_b},\;\;{\rm{if}}\;\;DCw\left( {i,j} \right) > DCg\left( {i,j} \right) \\ \end{array} \right.$ (4)

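(1) Initialize every position in the distance matrix (which has the same size as the image) to a very large value $max$ (10 000 can be used in experiments);

(2) Set the distance at the positions of all pixels in $Cw$ to 0;

(3) Set every position whose distance is $max$ and which is adjacent to a position with distance 0 to distance 1;

(4) Repeat step (3): set every position whose distance is $max$ and which is adjacent to a position with distance $i$ to distance $i + 1$ .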
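The four steps above, together with Eq. (4), can be sketched as a breadth-first propagation, which assigns the same distances as repeating step (3) level by level. Two points are assumptions: adjacency is taken to be 8-neighbourhood, and $Cw$ and $Cg$ are read as the center-line pixels of the text and of the interference lines respectively, with $DCw$ and $DCg$ their distance maps. The function names are hypothetical.

```python
from collections import deque
import numpy as np

MAX_DIST = 10_000  # the large initial value suggested in step (1)

def distance_map(centerline, max_dist=MAX_DIST):
    """Distance of every position to the nearest center-line pixel (8-adjacency assumed)."""
    h, w = centerline.shape
    dist = np.full((h, w), max_dist, dtype=np.int32)       # step (1)
    queue = deque()
    for r, c in zip(*np.nonzero(centerline)):              # step (2)
        dist[r, c] = 0
        queue.append((r, c))
    while queue:                                           # steps (3) and (4)
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                # Only positions still at max_dist are updated, so each is set once.
                if 0 <= nr < h and 0 <= nc < w and dist[nr, nc] == max_dist:
                    dist[nr, nc] = dist[r, c] + 1
                    queue.append((nr, nc))
    return dist

def remove_interference(binary, cw, cg):
    """Eq. (4): a foreground pixel stays foreground if DCw <= DCg, else becomes background."""
    d_cw = distance_map(cw)
    d_cg = distance_map(cg)
    keep = (binary > 0) & (d_cw <= d_cg)
    return keep.astype(np.uint8)
```

With this reading, pixels where a stroke touches the line are kept when they lie at least as close to the text center line as to the line center, which is how adhesion between text and interference lines would be resolved in the sketch.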

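 Fig. 5 Result of interference line removal

5 Experimental Analysis

 Fig. 6 Experimental results on synthetic images

 Fig. 7 Real scanned document images

 Fig. 8 Experimental results of interference line removal on real scanned images

6 Conclusion and Outlook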

