Research Article Open Access

DAD: A Detailed Arabic Dataset for Online Text Recognition and Writer Identification, a New Type

Said S. Saloum1
  • 1 Jouf University, Saudi Arabia

Abstract

This paper presents a novel Arabic dataset that takes the characteristics of the Arabic language into account, filling gaps not covered by existing datasets. Conventional datasets treat Arabic in much the same way as Latin languages: they either delete diacritic and supplement marks, considering them defects, or keep them without considering their actual meaning. Yet more than half of all Arabic characters have diacritics above or below the character. To bridge these gaps, this work presents the Detailed Arabic Dataset (DAD). The additional marks included in this dataset are the single dot, two dots "-", three dots "^", Hamza, and two supplement marks: the bar for Tah or Zah and the complement bar for Kaf. A special application, OFMArabicDatasetBuilder, was built to generate the dataset for Arabic online recognition and writer identification. In total, the ground truth contains 93,064 entries based on sub-words and letter parts, not on words or lines as in other datasets. This dataset provides researchers with a strong tool for online Arabic text recognition, especially in the segmentation phase, and for writer identification. The paper also presents benchmarking results obtained by applying k-nearest neighbours machine learning to DAD.

Journal of Computer Science
Volume 17 No. 1, 2021, 19-32

DOI: https://doi.org/10.3844/jcssp.2021.19.32

Submitted On: 2 November 2020 Published On: 21 January 2021

How to Cite: Saloum, S. S. (2021). DAD: A Detailed Arabic Dataset for Online Text Recognition and Writer Identification, a New Type. Journal of Computer Science, 17(1), 19-32. https://doi.org/10.3844/jcssp.2021.19.32

Keywords

  • Arabic Dataset
  • Arabic Benchmark
  • Arabic Recognition
  • Arabic Writer Identification
  • Diacritic Marks
  • Hamza
  • Supplement Marks
  • Tah
  • Zah