Natural Language Processing in Low-Resource Languages: Progress and Prospects
Author(s): Ritul Phukan1, Monalisa Daimari2, Anupam Kharghoria3, Biman Basumatary3
Affiliation: 1,2,3Department of Computer Science and Engineering, Assam Down Town University, Guwahati, India
Page No: 4-8
Volume issue & Publishing Year: Volume 2, Issue 9, Sep 2025
Journal: International Journal of Advanced Multidisciplinary Application (IJAMA)
ISSN NO: 3048-9350
DOI: https://doi.org/10.5281/zenodo.17582873
Abstract:
Low-resource languages, i.e., languages with limited annotated corpora, lexicons, and digital resources, pose major challenges for modern natural language processing (NLP). Recent progress in transfer learning, multilingual pretraining, parameter-efficient adaptation, data augmentation, and community-driven dataset creation has substantially improved capabilities for many such languages, yet large performance gaps remain compared to high-resource languages. This article surveys the technical advances that enable NLP for low-resource languages, including unsupervised and weakly supervised methods, multilingual and massively multilingual models, few-shot and in-context learning with large language models, and adapter/LoRA-style parameter-efficient fine-tuning. We examine practical pipelines for tasks such as machine translation, speech recognition, OCR, and information extraction; describe prominent dataset and community projects; summarize typical evaluation strategies and their pitfalls; and outline promising research directions, including community data collection, privacy-preserving methods, on-device adaptation, and ethics-aware deployment. The review highlights approaches that balance performance, compute cost, and data efficiency, and recommends research and deployment practices to accelerate inclusive language technology.
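As a concrete illustration of the adapter/LoRA-style parameter-efficient fine-tuning the abstract mentions, the sketch below wraps a frozen linear layer with a trainable low-rank update in PyTorch. It is a minimal toy, not code from any surveyed system; the rank r=8, scaling alpha=16, and 768-dimensional projection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update:
    h = W0 x + (alpha / r) * B(A(x)), with only A and B receiving gradients."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze pretrained weights W0
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # down-projection A
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # up-projection B
        nn.init.normal_(self.lora_a.weight, std=0.01)
        nn.init.zeros_(self.lora_b.weight)   # B = 0: the wrapped layer starts out
                                             # identical to the pretrained one
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# Toy usage: adapting one 768-dim projection trains ~12k parameters
# instead of ~590k for full fine-tuning of the same layer.
layer = LoRALinear(nn.Linear(768, 768), r=8)
x = torch.randn(2, 768)
print(layer(x).shape)                                                  # torch.Size([2, 768])
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))   # 12288
```

In a full model, wrappers like this would typically replace the attention projections, leaving the multilingual backbone frozen so that each low-resource language needs only a small set of adapter weights.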
Keywords: Low-resource languages, transfer learning, multilingual pretraining, few-shot learning, LoRA / adapters, data augmentation, machine translation, speech datasets, Masakhane, Common Voice
References:
- [1] A. Conneau et al., "Unsupervised cross-lingual representation learning at scale," in Proc. ACL, 2020, pp. 8440–8451.
- [2] S. Ruder, I. Vulić, and A. Søgaard, "A survey of cross-lingual word embedding models," J. Artif. Intell. Res., vol. 65, pp. 569–631, 2019.
- [3] J. Tiedemann, "Parallel data, tools and interfaces in OPUS," in Proc. LREC, 2012, pp. 2214–2218.
- [4] W. Nekoto et al., "Participatory research for low-resourced machine translation: A case study in African languages," in Findings of ACL: EMNLP 2020, pp. 2144–2160.
- [5] K. Heffernan, A. Salesky, and A. Post, "Bitext mining using distant supervision for low-resource languages," in Proc. NAACL-HLT, 2021, pp. 3617–3629.
- [6] R. Ardila et al., "Common Voice: A massively-multilingual speech corpus," in Proc. LREC, 2020, pp. 4218–4226.