DAD: A Detailed Arabic Dataset for Online Text Recognition and Writer Identification, a New Type
- Jouf University, Saudi Arabia
Abstract
This paper presents a novel Arabic dataset that accounts for the characteristics of the Arabic language and fills gaps not covered by existing datasets. Conventional datasets treat Arabic in much the same way as Latin languages: they either delete diacritics and supplement marks, treating them as defects, or keep them without considering their actual meaning. Yet more than half of all Arabic characters carry diacritics above or below the character. To bridge these gaps, this work presents the Detailed Arabic Dataset (DAD). The additional marks included in the dataset are the single dot, two dots "-", three dots "^", Hamza, and two supplement marks: the bar for Tah or Zah and the complement bar for Kaf. A dedicated application, OFMArabicDatasetBuilder, was built to generate the dataset for Arabic online recognition and writer identification. In total, the ground truth contains 93,064 entries based on sub-words and letter parts, rather than on words or lines as in other datasets. The dataset gives researchers a strong tool for online Arabic text recognition, especially in the segmentation phase, and for writer identification. The paper also presents benchmarking results obtained by applying k-nearest neighbours machine learning to DAD.
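The k-nearest neighbours benchmarking mentioned in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the two-element feature vectors (normalized width and height of a mark), the label names, the toy samples and the `knn_predict` helper are hypothetical and are not taken from DAD or from the paper's experimental setup.

```python
import math
from collections import Counter

# Hypothetical toy samples: (normalized width, normalized height) -> mark label.
# These values are invented for illustration and do not come from DAD.
TRAIN = [
    ((0.05, 0.05), "single_dot"),
    ((0.06, 0.04), "single_dot"),
    ((0.20, 0.05), "two_dots"),
    ((0.22, 0.06), "two_dots"),
    ((0.20, 0.20), "three_dots"),
    ((0.21, 0.18), "three_dots"),
]


def knn_predict(train, query, k=3):
    """Return the majority label among the k training samples nearest to query."""
    # Rank training samples by Euclidean distance to the query vector.
    nearest = sorted(train, key=lambda sample: math.dist(sample[0], query))[:k]
    # Majority vote over the labels of the k nearest samples.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_predict(TRAIN, (0.05, 0.06), k=3))
    print(knn_predict(TRAIN, (0.21, 0.19), k=3))
```

In a real experiment the feature vectors would be derived from the recorded online strokes and the labels from the dataset's ground-truth entries; the choice of features and of k is left open here.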
DOI: https://doi.org/10.3844/jcssp.2021.19.32
Copyright: © 2021 Said S. Saloum. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Keywords
- Arabic Dataset
- Arabic Benchmark
- Arabic Recognition
- Arabic Writer Identification
- Diacritics Marks
- Hamza
- Supplement Marks
- Tah
- Zah