Research on Deep Neural Network for Afaan-Oromo Language Text-to-Speech Synthesis
- 1 Department of Computer Science and Information Technology, Mattu University, Ethiopia
- 2 Department of Computational Sciences and Software Engineering, K. Zhubanov University, Kazakhstan
- 3 Department of Information Technology, Mattu University, Ethiopia
Abstract
Text-to-speech synthesis is the automatic conversion of unrestricted natural-language text into spoken form that closely mimics a native speaker's rendering of the same text. The purpose of a text-to-speech synthesizer is to generate comprehensible, natural-sounding human speech from text transcriptions. Despite the wide range of potential applications for text-to-speech systems, the field is language-dependent, and most efforts have concentrated on well-resourced languages, especially English. The linguistic resources required to generate speech from text are lacking for under-resourced languages such as Afaan Oromo. To develop an Afaan Oromo text-to-speech synthesizer, a speech dataset of 10,644 text-audio pairs was prepared and assembled from dependable sources. The proposed model handles nonstandard tokens, including acronyms, currencies, and numerals, in addition to common words and names. A deep neural network was selected for this study because of its ability to map complex textual input to acoustic feature parameters. A number of experiments were carried out to find the best-performing model. Model performance was assessed objectively by counting attention errors and subjectively with the Mean Opinion Score (MOS) test. In the objective evaluation, Deep Voice 3 (DV3) produced 18 attention errors on the 248 words of the evaluation sentence set, whereas Tacotron 2 produced only two. In the subjective evaluation, Tacotron 2 achieved MOS scores of 4.36 for naturalness and 4.33 for intelligibility out of 5, compared with 3.32 and 3.04, respectively, for DV3.
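The MOS evaluation mentioned above averages 1-5 listener ratings per system. A minimal sketch of that computation is shown below; the listener ratings are hypothetical illustrations, not the study's actual data:

```python
# Minimal sketch of a Mean Opinion Score (MOS) calculation.
# Each listener rates each synthesized utterance on a 1 (bad) to 5 (excellent)
# scale; the system's MOS is the arithmetic mean of all ratings it received.

def mean_opinion_score(ratings):
    """Average a list of 1-5 listener ratings for one system."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(ratings) / len(ratings)

# Hypothetical naturalness ratings from five listeners (illustrative only):
sample_ratings = [5, 4, 4, 5, 4]
print(f"MOS: {mean_opinion_score(sample_ratings):.2f}")  # MOS: 4.40
```

In practice a MOS study averages over many listeners and many utterances per system, and the reported values (e.g., 4.36 for Tacotron 2 naturalness) come from such larger rating pools.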
Therefore, the Tacotron 2 model yielded good and promising results compared with Deep Voice 3, making it suitable for a range of applications, such as smart education, telephone inquiry services, and recommendation systems.
DOI: https://doi.org/10.3844/jcssp.2025.1037.1048
Copyright: © 2025 Diriba Gichile Rundasa, Arulmurugan Ramu, Teshale Debushe Adugna, Chala Sembeta Teshome and Desalegn Tasew. This is an open access article distributed under the terms of the
Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Keywords
- Deep Neural Network
- Speech Processing
- Mean Opinion Score
- Afaan-Oromo
- Text to Speech
- Tacotron 2
- Deep Voice 3