Abstract
Low-resource languages, i.e., those with limited annotated corpora, lexicons, and other digital resources, pose major challenges for modern natural language processing (NLP). Recent progress in transfer learning, multilingual pretraining, parameter-efficient adaptation, data augmentation, and community-driven dataset creation has substantially improved capabilities for many such languages, yet large performance gaps remain relative to high-resource languages. This article surveys the technical advances that enable NLP for low-resource languages, including unsupervised and weakly supervised methods, multilingual and massively multilingual models, few-shot and in-context learning with large language models, and adapter/LoRA-style parameter-efficient fine-tuning. We examine practical pipelines for tasks such as machine translation, speech recognition, OCR, and information extraction; describe prominent dataset and community projects; summarize typical evaluation strategies and their pitfalls; and outline promising research directions, including community data collection, privacy-preserving methods, on-device adaptation, and ethics-aware deployment. The review highlights approaches that balance performance, compute cost, and data efficiency, and recommends research and deployment practices to accelerate inclusive language technology.
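To make the adapter/LoRA-style parameter-efficient fine-tuning mentioned above concrete, the minimal sketch below adapts a massively multilingual encoder to a target-language classification task. It assumes the Hugging Face transformers and peft libraries; the model choice (xlm-roberta-base), the task, and all hyperparameters are illustrative assumptions, not recommendations from this survey.

```python
# Minimal sketch (illustrative, not the survey's prescribed recipe) of
# LoRA-style parameter-efficient fine-tuning of a multilingual encoder.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

# XLM-R is one of the massively multilingual models the abstract alludes to;
# any similar encoder could be substituted.
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Train small low-rank update matrices instead of the full model: this is
# the "parameter-efficient" part, which matters when target-language data
# and compute are scarce.
lora_config = LoraConfig(
    r=8,                                # rank of the low-rank updates
    lora_alpha=16,                      # scaling factor for the updates
    target_modules=["query", "value"],  # attention projections to adapt
    lora_dropout=0.1,
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

Because only the low-rank update matrices (plus the classification head) are trained, the memory and data requirements fit the budgets typical of low-resource settings, while the frozen multilingual backbone supplies cross-lingual transfer.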
Keywords
Low-resource languages, transfer learning, multilingual pretraining, few-shot learning, LoRA/adapters, data augmentation, machine translation, speech datasets, Masakhane, Common Voice