Arctic XXI century

Transformer-Based Neural Network Approaches for Speech Recognition and Synthesis in the Sakha Language

https://doi.org/10.25587/3034-7378-2025-4-56-78

Abstract

Recent breakthroughs in artificial intelligence and deep learning have fundamentally transformed the landscape of spoken language processing technologies. Automatic speech recognition (ASR) and text-to-speech (TTS) synthesis have emerged as essential components driving digital accessibility across diverse linguistic communities. The Sakha language, representing the northeastern branch of the Turkic language family, continues to face substantial technological barriers stemming from insufficient digital resources, limited annotated corpora, and the absence of production-ready speech processing systems. This comprehensive investigation examines the feasibility and effectiveness of adapting contemporary transformer-based neural architectures for bidirectional speech conversion tasks in Sakha. Our research encompasses detailed analysis of encoder-decoder frameworks, specifically OpenAI’s Whisper large-v3 and Meta’s Wav2Vec2-BERT for voice-to-text transformation, alongside Coqui’s XTTS-v2 system for text-to-voice generation. Particular emphasis is placed on addressing linguistic and technical obstacles inherent to Sakha, including its complex agglutinative morphological structure, systematic vowel harmony patterns, and a distinctive phonemic inventory featuring sounds absent from most Indo-European languages. Experimental evaluation demonstrates that comprehensive fine-tuning of Whisper-large-v3 achieves exceptional recognition accuracy with a word error rate (WER) of 8%, while the self-supervised Wav2Vec2-BERT architecture attains 13% WER when augmented with statistical n-gram language modeling. The neural synthesis system exhibits robust performance despite minimal training data, achieving an average loss of 2.49 after extended training optimization and reaching practical deployment via a Telegram messaging bot. Additionally, ensemble meta-stacking combining both recognition architectures achieves 27% WER, demonstrating effective complementarity through learned hypothesis arbitration. These findings validate transfer learning methodologies as viable pathways for developing speech technologies serving digitally underrepresented linguistic communities.
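To make the recognition pipeline concrete, the sketch below shows how a Whisper-large-v3 checkpoint fine-tuned in the manner described above would typically be loaded for Sakha transcription and scored with word error rate, using the Hugging Face Transformers library. This is a minimal illustration under stated assumptions, not the authors' released code: the base checkpoint is the public `openai/whisper-large-v3`, and the audio file and reference transcript are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code) of Sakha ASR inference and WER scoring.
# The audio file name and reference transcript below are placeholders; a real
# evaluation would use a Sakha fine-tuned checkpoint in place of the base model.
import librosa
from jiwer import wer
from transformers import WhisperForConditionalGeneration, WhisperProcessor

MODEL_ID = "openai/whisper-large-v3"  # swap in the fine-tuned Sakha checkpoint

processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)

# Whisper expects 16 kHz mono audio; librosa resamples on load.
audio, _ = librosa.load("sample_sakha.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Greedy decoding of log-Mel features into a token sequence.
predicted_ids = model.generate(inputs.input_features)
hypothesis = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]

# WER = (substitutions + deletions + insertions) / reference word count.
reference = "placeholder reference transcript"
print(f"hypothesis: {hypothesis}")
print(f"WER: {wer(reference, hypothesis):.2%}")
```

The same WER computation applies to the Wav2Vec2-BERT and ensemble outputs, which makes the 8%, 13%, and 27% figures in the abstract directly comparable.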

About the Authors

S. P. Stepanov
M.K. Ammosov North-Eastern Federal University
Russia

Sergei P. Stepanov – Cand. Sci. (Physics and Mathematics), Head of the Laboratory “Computational Technologies and Artificial Intelligence”, Institute of Mathematics and Information Science

Yakutsk

WoS Researcher ID: F-7549-2017

Scopus Author ID: 56419440700

Elibrary AuthorID: 856700



Dong Zhang
Qufu Normal University
China

Dong Zhang – Cand. Sci. (Physics and Mathematics), Associate Professor

Qufu, Shandong

WoS Researcher ID: ACW-5232-2022

Scopus Author ID: 57212194896



A. A. Alekseeva
M.K. Ammosov North-Eastern Federal University
Russia

Altana A. Alekseeva – Research Assistant, Laboratory “Computational Technologies and Artificial Intelligence”, Institute of Mathematics and Information Science

Yakutsk



V. L. Aprosimov
M.K. Ammosov North-Eastern Federal University
Russia

Vladislav L. Aprosimov – Research Assistant, Laboratory “Computational Technologies and Artificial Intelligence”, Institute of Mathematics and Information Science

Yakutsk



Dj. A. Fedorov
M.K. Ammosov North-Eastern Federal University
Russia

Djuluur A. Fedorov – Research Assistant, Laboratory “Computational Technologies and Artificial Intelligence”, Institute of Mathematics and Information Science

Yakutsk



V. S. Leveryev
M.K. Ammosov North-Eastern Federal University
Russia

Vladimir S. Leveryev – Research Assistant, Laboratory “Computational Technologies and Artificial Intelligence”, Institute of Mathematics and Information Science

Yakutsk



T. A. Novgorodov
M.K. Ammosov North-Eastern Federal University
Russia

Tuygun A. Novgorodov – Research Assistant, Laboratory “Computational Technologies and Artificial Intelligence”, Institute of Mathematics and Information Science

Yakutsk



E. S. Podorozhnaya
M.K. Ammosov North-Eastern Federal University
Russia

Ekaterina S. Podorozhnaya – Research Assistant, Laboratory “Computational Technologies and Artificial Intelligence”, Institute of Mathematics and Information Science

Yakutsk



T. Z. Zakharov
M.K. Ammosov North-Eastern Federal University
Russia

Timur Z. Zakharov – Research Assistant, Laboratory “Computational Technologies and Artificial Intelligence”, Institute of Mathematics and Information Science

Yakutsk




For citations:


Stepanov S.P., Zhang D., Alekseeva A.A., Aprosimov V.L., Fedorov Dj.A., Leveryev V.S., Novgorodov T.A., Podorozhnaya E.S., Zakharov T.Z. Transformer-Based Neural Network Approaches for Speech Recognition and Synthesis in the Sakha Language. Arctic XXI century. 2025;(4):56-78. https://doi.org/10.25587/3034-7378-2025-4-56-78

ISSN 3034-7378 (Print)
ISSN 3034-7386 (Online)