Transformer-Based Neural Network Approaches for Speech Recognition and Synthesis in the Sakha Language
https://doi.org/10.25587/3034-7378-2025-4-56-78
Abstract
Recent breakthroughs in artificial intelligence and deep learning have fundamentally transformed the landscape of spoken language processing technologies. Automatic speech recognition (ASR) and text-to-speech (TTS) synthesis have emerged as essential components driving digital accessibility across diverse linguistic communities. The Sakha language, representing the northeastern branch of the Turkic language family, continues to face substantial technological barriers stemming from insufficient digital resources, limited annotated corpora, and the absence of production-ready speech processing systems. This investigation examines the feasibility and effectiveness of adapting contemporary transformer-based neural architectures for bidirectional speech conversion tasks in Sakha. Our research encompasses a detailed analysis of encoder-decoder frameworks, specifically OpenAI’s Whisper large-v3 and Meta’s Wav2Vec2-BERT for voice-to-text transformation, alongside Coqui’s XTTS-v2 system for text-to-voice generation. Particular emphasis is placed on the linguistic and technical obstacles inherent to Sakha, including its complex agglutinative morphology, systematic vowel harmony, and distinctive phonemic inventory featuring sounds absent from most Indo-European languages. Experimental evaluation demonstrates that comprehensive fine-tuning of Whisper large-v3 achieves a word error rate (WER) of 8%, while the self-supervised Wav2Vec2-BERT architecture attains 13% WER when augmented with statistical n-gram language modeling. The neural synthesis system performs robustly despite minimal training data, achieving an average loss of 2.49 after extended training optimization and practical deployment via a Telegram messaging bot. Additionally, an ensemble meta-stacking approach combining both recognition architectures achieves 27% WER, demonstrating complementarity through learned hypothesis arbitration. These findings validate transfer learning methodologies as viable pathways for developing speech technologies serving digitally underrepresented linguistic communities.
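To make the reported metric concrete, the following minimal Python sketch shows how a fine-tuned Whisper checkpoint could be scored for word error rate on a held-out set. It is an illustration only, not the authors’ code: the model identifier, audio file names, and reference transcripts are placeholder assumptions.

import jiwer
from transformers import pipeline

# Hypothetical checkpoint id; a Sakha fine-tuned model would be substituted here.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

# Placeholder evaluation pairs: (path to audio clip, reference transcript).
samples = [
    ("clip_001.wav", "reference transcript one"),
    ("clip_002.wav", "reference transcript two"),
]

references, hypotheses = [], []
for path, ref in samples:
    result = asr(path)                  # returns a dict like {"text": "..."}
    references.append(ref)
    hypotheses.append(result["text"])

# Corpus-level word error rate, the metric quoted in the abstract.
print("WER:", jiwer.wer(references, hypotheses))

Swapping the pipeline’s model id for a Wav2Vec2-BERT CTC checkpoint would let the same harness compare the two recognizers, which is how an evaluation like the one summarized above is typically run.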
About the Authors
S. P. Stepanov
Russia
Sergei P. Stepanov – Cand. Sci. (Physics and Mathematics), Head of the Laboratory “Computational Technologies and Artificial Intelligence”, Institute of Mathematics and Information Science
Yakutsk
WoS Researcher ID: F-7549-2017
Scopus Author ID: 56419440700
Elibrary AuthorID: 856700
Dong Zhang
China
Dong Zhang – Cand. Sci. (Physics and Mathematics), Associate Professor
Qufu, Shandong
WoS Researcher ID: ACW-5232-2022
Scopus Author ID: 57212194896
A. A. Alekseeva
Russia
Altana A. Alekseeva – Research Assistant, Laboratory “Computational Technologies and Artificial Intelligence”, Institute of Mathematics and Information Science
Yakutsk
V. L. Aprosimov
Russia
Vladislav L. Aprosimov – Research Assistant, Laboratory “Computational Technologies and Artificial Intelligence”, Institute of Mathematics and Information Science
Yakutsk
Dj. A. Fedorov
Russia
Djuluur A. Fedorov – Research Assistant, Laboratory “Computational Technologies and Artificial Intelligence”, Institute of Mathematics and Information Science
Yakutsk
V. S. Leveryev
Russia
Vladimir S. Leveryev – Research Assistant, Laboratory “Computational Technologies and Artificial Intelligence”, Institute of Mathematics and Information Science
Yakutsk
T. A. Novgorodov
Russia
Tuygun A. Novgorodov – Research Assistant, Laboratory “Computational Technologies and Artificial Intelligence”, Institute of Mathematics and Information Science
Yakutsk
E. S. Podorozhnaya
Russia
Ekaterina S. Podorozhnaya – Research Assistant, Laboratory “Computational Technologies and Artificial Intelligence”, Institute of Mathematics and Information Science
Yakutsk
T. Z. Zakharov
Russia
Timur Z. Zakharov – Research Assistant, Laboratory “Computational Technologies and Artificial Intelligence”, Institute of Mathematics and Information Science
Yakutsk
References
1. Besacier L, Barnard E, Karpov A, Schultz T. Automatic speech recognition for under-resourced languages: A survey. Speech Communication. 2014;56:85–100. DOI: https://doi.org/10.1016/j.specom.2013.07.008
2. Joshi P, Santy S, Budhiraja A, Bali K, Choudhury M. The state and fate of linguistic diversity and inclusion in the NLP world. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020:6282–6293.
3. Pakendorf B. Contact in the prehistory of the Sakha (Yakuts): Linguistic and genetic perspectives. LOT Publications: Utrecht. 2007.
4. Johanson L, Csató ÉA. The Turkic Languages. Routledge Language Family Series. Routledge: London. 2021. DOI: https://doi.org/10.4324/9781003243809
5. Dyachkovsky ND. Sound structure of the Yakut language. Part 1: Vocalism (Дьячковский Н.Д. Звуковой строй якутского языка. Вокализм). Yakutsk: Yakutsk publishing house. 1971 (in Russian).
6. Dyachkovsky ND. Sound structure of the Yakut language. Part 2: Consonantism (Дьячковский Н.Д. Звуковой строй якутского языка. Консонантизм). Yakutsk: Yakutsk publishing house. 1977 (in Russian).
7. Mussakhojayeva S, Dauletbek K, Yeshpanov R, Varol HA. Multilingual speech recognition for Turkic languages. Information. 2023;14(2): 74. DOI: https://doi.org/10.3390/info14020074
8. Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE. 1989;77(2):257–286. DOI: https://doi.org/10.1109/5.18626
9. Graves A, Mohamed A, Hinton G. Speech recognition with deep recurrent neural networks. Proceedings of International Conference on Acoustics, Speech and Signal Processing. 2013:6645–6649. DOI: https://doi.org/10.1109/ICASSP.2013.6638947
10. Conneau A, Baevski A, Collobert R, Mohamed A, Auli M. Unsupervised cross-lingual representation learning for speech recognition. In Proceedings of Interspeech. 2021:2426–2430. DOI: https://doi.org/10.48550/arXiv.2006.13979
11. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017:5998–6008. DOI: https://doi.org/10.48550/arXiv.1706.03762
12. Radford A, Kim JW, Xu T, Brockman G, McLeavey C, Sutskever I. Robust speech recognition via large-scale weak supervision. In Proceedings of International Conference on Machine Learning. 2023:28492–28518. DOI: https://doi.org/10.48550/arXiv.2212.04356
13. Du W, Maimaitiyiming Y, Nijat M, Li L, Hamdulla A, Wang D. Automatic speech recognition for Uyghur, Kazakh, and Kyrgyz: An overview. Applied Sciences. 2023;13(1):326. DOI: https://doi.org/10.3390/app13010326
14. Yeshpanov R, Mussakhojayeva S, Khassanov Y. Multilingual text-to-speech synthesis for Turkic languages using transliteration. In Proceedings of Interspeech. 2023:5521–5525. DOI: https://doi.org/10.48550/arXiv.2305.15749
15. Kim J, Kim S, Kong J, Yoon S. Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. In Proceedings of the International Conference on Neural Information Processing Systems. 2020:8067–8077. DOI: https://doi.org/10.48550/arXiv.2005.11129
16. Kim J, Kong J, Son J. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proceedings of the International Conference on Machine Learning. 2021:5530–5540. DOI: https://doi.org/10.48550/arXiv.2106.06103
17. Kong J, Kim J, Bae J. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. In Proceedings of the International Conference on Neural Information Processing Systems. 2020:17022–17033. DOI: https://doi.org/10.48550/arXiv.2010.05646
18. Shen J, Pang R, Weiss RJ, Schuster M, Jaitly N, Yang Z, Chen Z, Zhang Y, Wang Y, Skerry-Ryan R, et al. Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions. In Proceedings of IEEE ICASSP. 2018:4779–4783. DOI: https://doi.org/10.48550/arXiv.1712.05884
19. Ren Y, Hu C, Tan X, Qin T, Zhao S, Zhao Z, Liu T-Y. FastSpeech 2: Fast and high-quality end-to-end text to speech. In Proceedings of ICLR. 2021. DOI: https://doi.org/10.48550/arXiv.2006.04558
20. Karibayeva A, Karyukin V, Abduali B, Amirova D. Speech recognition and synthesis models and platforms for the Kazakh language. Information. 2025;16(10):879. DOI: https://doi.org/10.3390/info16100879
21. Ardila R, Branson M, Davis K, Kohler M, Meyer J, Henretty M, Morais R, Saunders L, Tyers F, Weber G. Common Voice: A massively-multilingual speech corpus. In Proceedings of LREC. 2020:4218–4222. DOI: https://doi.org/10.48550/arXiv.1912.06670
22. Park DS, Chan W, Zhang Y, Chiu C-C, Zoph B, Cubuk ED, Le QV. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proceedings of Interspeech. 2019:2613–2617. DOI: https://doi.org/10.21437/Interspeech.2019-2680
23. Kudo T, Richardson J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of EMNLP System Demonstrations. 2018:66–71. DOI: https://doi.org/10.48550/arXiv.1808.06226
24. Baevski A, Zhou Y, Mohamed A, Auli M. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020:12449–12460. DOI: https://doi.org/10.48550/arXiv.2006.11477
25. Graves A, Fernández S, Gomez F, Schmidhuber J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of ICML. 2006:369–376. DOI: https://doi.org/10.1145/1143844.1143891
26. Casanova E, Weber J, Shulby C, Junior AC, Gölge E, Müller MA. XTTS: A massively multilingual zero-shot text-to-speech model. In Proceedings of Interspeech. 2024. DOI: https://doi.org/10.48550/arXiv.2406.04904
27. Panayotov V, Chen G, Povey D, Khudanpur S. LibriSpeech: An ASR corpus based on public domain audio books. In Proceedings of IEEE ICASSP. 2015:5206–5210. DOI: https://doi.org/10.1109/ICASSP.2015.7178964
28. Dyakonov AG. Ensembles in machine learning: Methods and applications. Data Science Course Materials (Дьяконов А.Г. Ансамбли в машинном обучении). 2019. Available at: https://alexanderdyakonov.wordpress.com (accessed 07.09.2025).
29. Wolpert DH. Stacked generalization. Neural Networks. 1992;5(2):241–259. DOI: https://doi.org/10.1016/S0893-6080(05)80023-1
30. Breiman L. Stacked regressions. Machine Learning. 1996;24(1):49–64.
31. Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Louf R, Funtowicz M, et al. Transformers: State-of-the-art natural language processing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 2020:38–45. DOI: https://doi.org/10.18653/v1/2020.emnlp-demos.6
32. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the International Conference on Neural Information Processing Systems. 2019:8026–8037. DOI: https://doi.org/10.48550/arXiv.1912.01703
For citations:
Stepanov S.P., Zhang D., Alekseeva A.A., Aprosimov V.L., Fedorov D.A., Leveryev V.S., Novgorodov T.A., Podorozhnaya E.S., Zakharov T.Z. Transformer-Based Neural Network Approaches for Speech Recognition and Synthesis in the Sakha Language. Arctic XXI century. 2025;(4):56-78. https://doi.org/10.25587/3034-7378-2025-4-56-78