SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages

Gayane Ghazaryan, Erik Arakelyan, Pasquale Minervini, Isabelle Augenstein · June 20, 2024

Summary

The paper presents SynDARin, a method for generating and validating question-answering (QA) datasets for low-resource languages, with Armenian as the case study. It mines parallel content in English and the target language, generates synthetic multiple-choice questions from the English content, then translates the questions and validates the translations for quality. The pipeline is designed to preserve content quality and factual accuracy while reducing annotation costs. The authors use it to build a 1.2K-sample Armenian QA dataset; human evaluation scores are high, which supports the effectiveness of the translation validation step. The dataset is then used to benchmark LLMs and exposes their limitations in low-resource settings. The contributions are a novel framework, a high-quality Armenian QA dataset, and insights into LLM performance. Future work could extend the method to more languages and address translation challenges for rare languages.
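
The translate-then-validate step described above can be pictured as a filter over translated items. The following is a minimal sketch, not the authors' implementation: the `MCQItem` fields and the substring-based `is_valid_translation` check are assumptions introduced here for illustration, since the summary only states that translated questions are validated for quality.

```python
from dataclasses import dataclass


@dataclass
class MCQItem:
    """Hypothetical container for one translated multiple-choice item."""
    paragraph: str      # translated context paragraph
    question: str       # translated question
    options: list[str]  # translated answer options
    answer: str         # translated correct answer


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join(text.lower().split())


def is_valid_translation(item: MCQItem) -> bool:
    """Assumed check: the answer must survive translation consistently.

    An item is kept only if the answer matches exactly one option and
    still appears verbatim in the translated paragraph.
    """
    answer = normalize(item.answer)
    if not answer:
        return False
    options = [normalize(option) for option in item.options]
    return options.count(answer) == 1 and answer in normalize(item.paragraph)


def filter_items(items: list[MCQItem]) -> list[MCQItem]:
    """Drop items whose translation failed the consistency check."""
    return [item for item in items if is_valid_translation(item)]
```

A stricter or fuzzier matching rule could be swapped into `is_valid_translation` without changing the rest of the sketch; the exact rule is a design choice this summary does not specify.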

Introduction
  Background
    Overview of low-resource language challenges in QA datasets
    Importance of parallel content mining
  Objective
    To develop SynDARin: a novel method for QA dataset creation
    Aim to improve factual accuracy and reduce annotation costs
    Focus on Armenian as a case study
Method
  Data Collection
    Parallel Content Mining
      English-Armenian parallel corpora extraction
      Selection of relevant content for QA generation
    Question Template Extraction
      English QA dataset analysis
      Identifying question patterns and answer types
  Data Synthesis
    Synthetic multiple-choice question generation
    Integration of parallel content and question templates
  Data Preprocessing and Translation
    Translation from English to Armenian
      Machine translation techniques
      Post-editing for quality control
    Fact Checking and Validation
      Human-in-the-loop validation process
      Ensuring content quality and factual accuracy
  Dataset Creation
    1.2K Armenian QA dataset production
    Human evaluation for dataset quality
Experiments and Evaluation
  Benchmarking LLMs
    Using the Armenian dataset to assess LLM performance
    Identifying limitations in low-resource settings
  Human Evaluation
    High evaluation scores and dataset quality assessment
Contributions
  Novel framework for low-resource QA dataset generation
  High-quality Armenian QA dataset
  Insights into LLM performance in low-resource languages
Future Work
  Expansion to more languages
  Addressing translation challenges for rare languages
  Continuous improvement of the methodology
Conclusion
  Summary of key findings and implications
  Potential applications and future research directions
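
The "Benchmarking LLMs" step in the outline above comes down to scoring model predictions on multiple-choice items. Here is a minimal sketch, assuming predictions and gold answers are given as option indices; `accuracy` and `chance_accuracy` are hypothetical helper names, and the paper's actual evaluation protocol is not described in this summary.

```python
def accuracy(predictions: list[int], gold: list[int]) -> float:
    """Fraction of items where the predicted option index matches the gold one."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty dataset")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


def chance_accuracy(num_options: list[int]) -> float:
    """Expected accuracy of uniform random guessing, given options per item."""
    if not num_options:
        raise ValueError("cannot score an empty dataset")
    return sum(1 / n for n in num_options) / len(num_options)
```

Comparing a model's `accuracy` against `chance_accuracy` for the same items gives a quick sense of whether it performs meaningfully above random guessing.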
Basic info
papers
computation and language
machine learning
artificial intelligence
Insights
What insights does the research provide regarding LLMs in low-resource settings?
What is the primary focus of SynDARin?
How does SynDARin generate QA datasets for low-resource languages?
What is the significance of the 1.2K Armenian QA dataset created by SynDARin?