An Adaptive Thresholding Algorithm-Based Optical Character Recognition System for Information Extraction in Complex Images

Daniel Akinbade; Adewale Opeoluwa Ogunde; Mba Obasi Odim; Bosede Oyenike Oguntunde

doi:10.3844/jcssp.2020.784.801

Research Article Open Access

Daniel Akinbade¹, Adewale Opeoluwa Ogunde¹, Mba Obasi Odim¹ and Bosede Oyenike Oguntunde¹

¹ Redeemer’s University, Nigeria

Abstract

Extracting text from images with complex backgrounds remains a major challenge. Many existing Optical Character Recognition (OCR) systems cannot handle this problem. As reported in the literature, the existing methods that can handle it still have major difficulty extracting text from images with sharply varying contours, from scanned documents with touching or skewed words, and from images with such complex backgrounds. There is, therefore, a need for new methods that can easily and efficiently extract text from images with complex backgrounds, which is the primary motivation for this work. This study collected image data and investigated the processes involved in image processing and the techniques applied for data segmentation. It applied an adaptive thresholding algorithm to the selected images to properly segment text characters from each image’s complex background. It then used Tesseract, a machine-learning-based OCR engine, to extract the text from the image file. The images used were coloured images sourced from the internet in different formats (jpg, png, webp) and at different resolutions. A custom adaptive algorithm was applied to the images to unify their complex backgrounds. This algorithm builds on the Gaussian thresholding algorithm but differs from the conventional version in that it dynamically generates the block size used to threshold the image. This ensured that, unlike conventional image segmentation, images were processed area-wise (in pixels) as specified by the algorithm at each instance. The system was implemented in Python 3.6. Experimentation involved fifty different images with complex backgrounds. The results showed that the system extracted English character-based text from images with complex backgrounds with 69.7% word-level accuracy and 81.9% character-level accuracy. The proposed method proved more effective than the existing methods, outperforming them in character-level accuracy.
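The pipeline described in the abstract (Gaussian adaptive thresholding with a block size derived from the image, followed by Tesseract) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state how the block size is generated, so the rule in `choose_block_size` (one twentieth of the shorter image side, forced odd) is a hypothetical stand-in, as are the function names, the constant `c` and the use of OpenCV and pytesseract as the underlying libraries.

```python
def choose_block_size(height: int, width: int, divisor: int = 20, minimum: int = 3) -> int:
    """Derive a thresholding block size from the image dimensions.

    Hypothetical rule (not taken from the paper): use a fixed fraction of the
    shorter side, then force the result to be odd and at least `minimum`,
    because OpenCV's adaptive threshold requires an odd block size > 1.
    """
    size = max(minimum, min(height, width) // divisor)
    return size if size % 2 == 1 else size + 1


def extract_text(image_path: str, c: int = 10) -> str:
    """Binarise an image with Gaussian adaptive thresholding, then run OCR.

    `c` is the constant subtracted from the Gaussian-weighted local mean;
    the value 10 is an assumed default, not a figure from the paper.
    """
    # Imported here so choose_block_size stays usable without OpenCV or
    # Tesseract installed.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(f"could not read image: {image_path}")

    # Adaptive thresholding operates on a single-channel image.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    block_size = choose_block_size(*gray.shape)

    # Each pixel is compared against the Gaussian-weighted mean of its
    # block_size x block_size neighbourhood, minus c.
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, block_size, c
    )

    # Hand the binarised image to Tesseract through pytesseract.
    return pytesseract.image_to_string(binary)
```

Calling `extract_text("sign.png")` (a hypothetical file name) would return the recognised text as a string. Deriving the block size from the image rather than fixing it lets the local neighbourhood scale with resolution, which is the general idea the abstract attributes to the custom algorithm.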

Journal of Computer Science

Volume 16 No. 6, 2020, 784-801

DOI: https://doi.org/10.3844/jcssp.2020.784.801

Submitted On: 6 April 2020 Published On: 14 June 2020

How to Cite: Akinbade, D., Ogunde, A. O., Odim, M. O. & Oguntunde, B. O. (2020). An Adaptive Thresholding Algorithm-Based Optical Character Recognition System for Information Extraction in Complex Images. Journal of Computer Science, 16(6), 784-801. https://doi.org/10.3844/jcssp.2020.784.801

Copyright: © 2020 Daniel Akinbade, Adewale Opeoluwa Ogunde, Mba Obasi Odim and Bosede Oyenike Oguntunde. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Keywords

Adaptive Threshold Algorithm
Complex Backgrounds
Images
Optical Character Recognition
Pattern Recognition