Research on Deep Neural Network for Afaan-Oromo Language Text-to-Speech Synthesis
- 1 Department of Computer Science and Information Technology, Mattu University, Ethiopia
- 2 Department of Computational Sciences and Software Engineering, K. Zhubanov University, Kazakhstan
- 3 Department of Information Technology, Mattu University, Ethiopia
Abstract
Text-to-speech synthesis is the automatic conversion of unrestricted natural-language text into spoken form that closely mimics a native speaker's rendering of the same text. The purpose of a text-to-speech synthesizer is to generate comprehensible, natural-sounding human speech from text transcriptions. Despite the wide range of potential applications for text-to-speech systems, the field is language-dependent, and most efforts have concentrated on well-resourced languages, especially English. The linguistic resources required to generate speech from text are lacking for under-resourced languages such as Afaan Oromo. To develop an Afaan Oromo text-to-speech synthesizer, a speech dataset of 10,644 text-audio pairs was prepared and assembled from dependable sources. The proposed model handles nonstandard tokens, including acronyms, currencies, and numerals, in addition to common words and names. A deep neural network was selected for this study because of its ability to map complex textual input to acoustic feature parameters. A number of experiments were carried out to find the best-performing model. Model performance was assessed objectively by counting attention errors and subjectively with the Mean Opinion Score (MOS) test. In the objective evaluation, Deep Voice 3 (DV3) produced 18 attention errors on the 248 words of the evaluation sentence set, whereas Tacotron 2 produced only two. In the subjective evaluation, Tacotron 2 achieved MOS scores of 4.36 for naturalness and 4.33 for intelligibility out of 5, compared with 3.32 and 3.04, respectively, for DV3.
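The MOS evaluation mentioned above averages 1-5 listener ratings per system. A minimal sketch of that computation is shown below; the listener ratings are hypothetical illustrations, not the study's actual data:

```python
# Minimal sketch of a Mean Opinion Score (MOS) calculation.
# Each listener rates each synthesized utterance on a 1 (bad) to 5 (excellent)
# scale; the system's MOS is the arithmetic mean of all ratings it received.

def mean_opinion_score(ratings):
    """Average a list of 1-5 listener ratings for one system."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(ratings) / len(ratings)

# Hypothetical naturalness ratings from five listeners (illustrative only):
sample_ratings = [5, 4, 4, 5, 4]
print(f"MOS: {mean_opinion_score(sample_ratings):.2f}")  # MOS: 4.40
```

In practice a MOS study averages over many listeners and many utterances per system, and the reported values (e.g., 4.36 for Tacotron 2 naturalness) come from such larger rating pools.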
Therefore, the Tacotron 2 model yielded good and promising results compared with Deep Voice 3, making it suitable for a range of applications, such as smart education, telephone inquiry services, and recommendation systems.
DOI: https://doi.org/10.3844/jcssp.2025.1037.1048
Copyright: © 2025 Diriba Gichile Rundasa, Arulmurugan Ramu, Teshale Debushe Adugna, Chala Sembeta Teshome and Desalegn Tasew. This is an open access article distributed under the terms of the
Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Keywords
- Deep Neural Network
- Speech Processing
- Mean Opinion Score
- Afaan-Oromo
- Text to Speech
- Tacotron 2
- Deep Voice 3