Heuristic Lemmatization for Arabic Texts Indexation and Classification

Faten Khalfallah Hammouda; Abdelsalam Abdelhamid Almarimi

doi:10.3844/jcssp.2010.660.665

Research Article Open Access

Abstract

Problem statement: This study proposes a system for indexing and classifying Arabic texts based on heuristic lemmatization. Such a system is needed by many NLP applications, such as information retrieval and automatic summarization. The system does not rely on any linguistic rules. The proposed method is limited to five domains: sports, medicine, politics, economics and agriculture. The main idea is to collect texts from the chosen domains and to study them by extracting their pertinent terms.

Approach: Every input text passes through a formatting stage, which removes words and letters that carry no importance for the meaning. The average of the term frequencies is then calculated in order to classify the text into its related domain.

Results: The ultimate purpose of the System of Indexation and Classification of Arabic Texts (SICAT) is to classify an unknown text into its suitable domain, that is, to detect the theme of the text. To do this, we applied a method based on the correspondence of pertinent terms: all pertinent terms of the text to classify are tested against the keywords of every domain in the corpus. The domain whose keywords match the majority of the text's terms is taken as the theme of the unknown text.

Conclusion: The system has two main parts: indexation and classification. The indexation stage comprises three parts: pre-learning, lemmatization and frequency calculation. The classification stage comprises two components: keyword extraction and classification of a new text. We ran many verification tests to validate the system. Evaluated on the chosen domains, the system achieves 90% precision and 85% recall.
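The classification by correspondence of pertinent terms described in the abstract can be illustrated with a minimal Python sketch. It is not the SICAT implementation: the stop-word list, the whitespace tokenization, the rule that a term is "pertinent" when its frequency reaches the average frequency, and the names `pertinent_terms` and `classify` are all assumptions made for illustration. The lemmatization step is omitted, and English placeholder words stand in for Arabic terms.

```python
from collections import Counter

# Hypothetical stop-word list; the paper's actual list is not given in the abstract.
STOP_WORDS = {"the", "a", "of", "in", "and"}


def pertinent_terms(text, stop_words=STOP_WORDS):
    """Return the terms whose frequency is at least the average term frequency.

    Assumption: "pertinent" means "at or above the average frequency".
    """
    # Formatting stage: lowercase, split on whitespace, drop unimportant words.
    tokens = [t for t in text.lower().split() if t not in stop_words]
    if not tokens:
        return set()
    counts = Counter(tokens)
    average = sum(counts.values()) / len(counts)
    return {term for term, n in counts.items() if n >= average}


def classify(text, domain_keywords):
    """Return (domain, scores): the domain whose keywords match the most
    pertinent terms of the text, or None when no domain matches at all."""
    terms = pertinent_terms(text)
    scores = {domain: len(terms & keywords)
              for domain, keywords in domain_keywords.items()}
    best = max(scores, key=scores.get)
    label = best if scores[best] > 0 else None
    return label, scores


# Toy keyword sets for two of the five domains (placeholder data).
domain_keywords = {
    "sports": {"match", "team", "goal"},
    "medicine": {"patient", "doctor", "treatment"},
}
text = "the team won the match and the team scored a goal in the match"
label, scores = classify(text, domain_keywords)
print(label, scores)
```

In this sketch a tie between domains is resolved in favor of the first domain in the dictionary; the abstract does not say how the actual system handles ties.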

Journal of Computer Science

Volume 6 No. 6, 2010, 660-665

Submitted On: 29 January 2010 Published On: 30 June 2010

How to Cite: Hammouda, F. K. & Almarimi, A. A. (2010). Heuristic Lemmatization for Arabic Texts Indexation and Classification. Journal of Computer Science, 6(6), 660-665. https://doi.org/10.3844/jcssp.2010.660.665

Copyright: © 2010 Faten Khalfallah Hammouda and Abdelsalam Abdelhamid Almarimi. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Keywords

Natural language processing
indexation and classification