Abstract
Low-resource languages, i.e., those with limited annotated corpora, lexicons, and other digital resources, pose major challenges for modern natural language processing (NLP). Recent progress in transfer learning, multilingual pretraining, parameter-efficient adaptation, data augmentation, and community-driven dataset creation has substantially improved capabilities for many such languages, yet large performance gaps remain relative to high-resource languages. This article surveys the technical advances that enable NLP for low-resource languages, including unsupervised and weakly supervised methods, multilingual and massively multilingual models, few-shot and in-context learning with large language models, and adapter/LoRA-style parameter-efficient fine-tuning. We examine practical pipelines for tasks such as machine translation, speech recognition, OCR, and information extraction; describe prominent dataset and community projects; summarize typical evaluation strategies and their pitfalls; and outline promising research directions, including community data collection, privacy-preserving methods, on-device adaptation, and ethics-aware deployment. The review highlights approaches that balance performance, compute cost, and data efficiency, and recommends research and deployment practices to accelerate inclusive language technology.
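To make the adapter/LoRA-style parameter-efficient fine-tuning mentioned above concrete, the minimal sketch below adapts a massively multilingual encoder to a target-language classification task. It assumes the Hugging Face transformers and peft libraries; the model choice (xlm-roberta-base), the task, and all hyperparameters are illustrative assumptions, not recommendations from this survey.

```python
# Minimal sketch (illustrative, not the survey's prescribed recipe) of
# LoRA-style parameter-efficient fine-tuning of a multilingual encoder.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

# XLM-R is one of the massively multilingual models the abstract alludes to;
# any similar encoder could be substituted.
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Train small low-rank update matrices instead of the full model: this is
# the "parameter-efficient" part, which matters when target-language data
# and compute are scarce.
lora_config = LoraConfig(
    r=8,                                # rank of the low-rank updates
    lora_alpha=16,                      # scaling factor for the updates
    target_modules=["query", "value"],  # attention projections to adapt
    lora_dropout=0.1,
    task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

Because only the low-rank update matrices (plus the classification head) are trained, the memory and data requirements fit the budgets typical of low-resource settings, while the frozen multilingual backbone supplies cross-lingual transfer.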
Keywords
Low-resource languages, transfer learning, multilingual pretraining, few-shot learning, LoRA/adapters, data augmentation, machine translation, speech datasets, Masakhane, Common Voice