Abstract: Documents often contain ruled lines, hand-drawn lines, and similar marks that serve various special purposes. When such documents are scanned into a computer and must be further recognized and converted into text codes, these lines interfere with OCR and lower the recognition rate of the document content. This study proposes a new algorithm for removing interference lines from documents. The document image is first binarized, with the binarization accounting for the effects of uneven illumination; the foreground is then thinned to single-pixel width, reducing the thickness of the lines. Next, an improved greedy algorithm computes the weights of horizontal and vertical line segments, and segments with higher weights are identified as interference lines. Finally, each foreground pixel in the image is kept or removed according to its distance from the interference lines, yielding a complete restored document image. Simulation results show that the proposed algorithm removes interference lines effectively, especially where interference lines adhere to text, with little impact on the quality of the document image, and that it offers higher computing speed and a better removal effect. The result provides a good basis for subsequent OCR of the images.
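The last two stages of the pipeline described above (weighting line segments, then removing pixels near the detected lines) can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not specify the improved greedy algorithm, so run length is used here as a stand-in weight, only horizontal lines are handled, and the input is assumed to be already binarized and thinned. The names `horizontal_runs`, `remove_interference`, `min_weight`, and `max_dist` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # 1 = foreground (ink), 0 = background


def horizontal_runs(img: Image) -> List[Tuple[int, int, int]]:
    """Return (row, start_col, length) for each maximal horizontal foreground run."""
    runs = []
    for r, row in enumerate(img):
        c = 0
        while c < len(row):
            if row[c] == 1:
                start = c
                while c < len(row) and row[c] == 1:
                    c += 1
                runs.append((r, start, c - start))
            else:
                c += 1
    return runs


def remove_interference(img: Image, min_weight: int, max_dist: int = 0) -> Image:
    """Erase long horizontal runs, keeping pixels where a stroke crosses the line.

    Assumption: a run's weight is simply its length (a stand-in for the
    paper's greedy weighting). Pixels within max_dist rows of a detected
    line are cleared unless ink exists both above and below the band,
    which is treated as a text stroke adhering to the line.
    """
    height = len(img)
    out = [row[:] for row in img]
    for r, start, length in horizontal_runs(img):
        if length < min_weight:
            continue  # too short to count as an interference line
        top, bottom = r - max_dist - 1, r + max_dist + 1
        for c in range(start, start + length):
            above = top >= 0 and img[top][c] == 1
            below = bottom < height and img[bottom][c] == 1
            if above and below:
                continue  # a stroke crosses the line here: keep it
            for rr in range(max(0, r - max_dist), min(height, r + max_dist + 1)):
                out[rr][c] = 0
    return out


# A vertical stroke (column 2) crossing a horizontal line (row 2).
page = [
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
cleaned = remove_interference(page, min_weight=5)
```

In this toy example the line in row 2 is erased except at column 2, where the vertical stroke passes through it, so the stroke stays connected. A real implementation would also need the vertical case and the illumination-aware binarization and thinning steps, which are omitted here.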