Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review

Josh McGiff, Nikola S. Nikolov · May 07, 2025

Summary

This systematic review of 54 studies examines strategies for addressing data scarcity in generative language modelling for low-resource languages. It highlights the field's heavy reliance on Transformer-based models, its narrow focus on a small set of languages, and its inconsistent evaluation methods. The authors recommend broadening these methods to support a wider range of languages, with the goal of more inclusive AI tools. Reported strategies include data augmentation, training on related languages, and mass translation. Models for languages such as Khmer, Minnan, and Bengali were trained on datasets ranging from 70k to 1.2 million sentences, and evaluation most commonly relies on metrics such as BLEU, ROUGE, and perplexity.
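
Since the review singles out BLEU, ROUGE, and perplexity as the most common evaluation metrics, a minimal sketch of how they are typically computed follows, using the sacrebleu, rouge_score, and Hugging Face transformers libraries. The sample sentences and the gpt2 checkpoint are illustrative assumptions, not drawn from the paper.

```python
# Minimal sketch: computing BLEU, ROUGE, and perplexity for generated text.
# Assumes `pip install sacrebleu rouge-score torch transformers`; the sample
# sentences and the gpt2 checkpoint are illustrative, not from the review.
import math

import sacrebleu
import torch
from rouge_score import rouge_scorer
from transformers import AutoModelForCausalLM, AutoTokenizer

hypotheses = ["the model generates a short sentence"]
references = [["the model generated a short sentence"]]  # one reference stream, parallel to hypotheses

# BLEU: n-gram precision overlap between system output and references.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")

# ROUGE: recall-oriented overlap, widely used for generation and summarisation.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(references[0][0], hypotheses[0])
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")

# Perplexity: exp of the mean negative log-likelihood under a language model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
enc = tokenizer(hypotheses[0], return_tensors="pt")
with torch.no_grad():
    loss = model(**enc, labels=enc["input_ids"]).loss  # mean NLL per token
print(f"Perplexity: {math.exp(loss.item()):.2f}")
```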
