Vocabulary Expansion of Chat Models with Unlabeled Target Language Data

Atsuki Yamaguchi, Terufumi Morishita, Aline Villavicencio, Nikolaos Aletras
December 16, 2024

Summary

This paper explores vocabulary expansion of chat models using unlabeled target language data to improve their performance in underrepresented languages. Off-the-shelf vocabulary expansion generally performs well across target language tasks and models, but adapted models fall short when the source chat model is already strong. To close this gap, the paper proposes post-hoc techniques that inject information from the source model into the adapted model without any further training, yielding performance improvements in 87% of cases. These techniques, Chat Template, Copy, and Merge, are most effective on generative tasks, with notable gains for models such as Gemma 2. Adapted chat models also outperform adapted base models in 85% of settings, suggesting that chat models are a powerful alternative to base models as a starting point for target language adaptation.
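
For context, off-the-shelf vocabulary expansion typically extends the tokenizer with target language tokens, resizes the model's embedding matrices, and initializes the new rows. Below is a minimal sketch, assuming a Hugging Face causal LM; the checkpoint name and placeholder token are illustrative, and the mean-of-subwords initialization is a common heuristic rather than necessarily the paper's exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any Hugging Face chat model would do.
name = "google/gemma-2-2b-it"
model = AutoModelForCausalLM.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

# In practice the new tokens come from a tokenizer trained on the
# unlabeled target language corpus; one placeholder token is shown.
new_tokens = ["kwaheri"]

# Record how the source tokenizer decomposes each new token *before*
# the vocabulary is expanded.
old_pieces = {t: tokenizer(t, add_special_tokens=False).input_ids for t in new_tokens}

tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Initialize each new row as the mean of its old subword embeddings,
# a common heuristic for vocabulary expansion. For models with tied
# input/output embeddings the two writes coincide.
with torch.no_grad():
    emb_in = model.get_input_embeddings().weight
    emb_out = model.get_output_embeddings().weight
    for tok, ids in old_pieces.items():
        new_id = tokenizer.convert_tokens_to_ids(tok)
        emb_in[new_id] = emb_in[ids].mean(dim=0)
        emb_out[new_id] = emb_out[ids].mean(dim=0)
```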

Paper outline


Introduction
  Background
    Overview of chat models and their importance in natural language processing
    Challenges in adapting chat models to underrepresented languages
  Objective
    Aim of the research: improving chat model performance through vocabulary expansion on unlabeled target language data
Method
  Data Collection
    Sources of unlabeled target language data
    Techniques for collecting data relevant to vocabulary expansion
  Data Preprocessing
    Cleaning and formatting the collected data
    Preparing the data for integration into chat models
  Post-Hoc Techniques
    Chat Template
      Description of the technique
      How it injects formatting information from the source model into the target model (see the sketch after this outline)
    Copy
      Explanation of the method
      How it duplicates source model vocabulary information in the target model (see the sketch after this outline)
    Merge
      Overview of the process
      How it combines the source and target models effectively (see the sketch after this outline)
  Evaluation
    Metrics used to assess the performance of adapted models
    Comparison of adapted chat models against adapted base models
Results
  Performance Improvements
    Quantitative analysis of performance gains across tasks
    Percentage of cases in which adapted chat models outperform adapted base models
  Task-Specific Analysis
    Detailed examination of improvements on generative tasks
    Case studies highlighting the effectiveness of the proposed techniques
Discussion
  Generalizability
    Applicability of the techniques across different languages and tasks
  Limitations
    Challenges and limitations in applying the techniques
  Future Work
    Directions for further research and potential improvements
Conclusion
  Summary of Findings
    Recap of the main contributions and results
  Implications
    Impact of the research on natural language processing
  Call to Action
    Recommendations for practitioners and researchers in the field
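
Technique sketches

Chat Template keeps the source chat model's conversation formatting when prompting the adapted model. A minimal sketch, assuming both tokenizers are Hugging Face tokenizers whose chat_template attribute holds the Jinja template; the checkpoint paths are illustrative.

```python
from transformers import AutoTokenizer

# Illustrative paths: the original chat model and its vocabulary expanded version.
src_tok = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
tgt_tok = AutoTokenizer.from_pretrained("path/to/adapted-model")

# Reuse the source model's chat template on the adapted tokenizer so that
# prompts are formatted exactly as the source chat model expects.
tgt_tok.chat_template = src_tok.chat_template

messages = [{"role": "user", "content": "Hello!"}]
prompt = tgt_tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```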
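
Copy can be read as transplanting embedding rows from the source model into the adapted model for tokens that both vocabularies contain, such as the chat template's control tokens. A hedged sketch; the function name, the token list, and the choice to copy both input and output embeddings are assumptions made for illustration.

```python
import torch

@torch.no_grad()
def copy_shared_rows(src_model, tgt_model, src_tok, tgt_tok, tokens):
    """Copy input/output embedding rows for `tokens` from source to target."""
    src_in = src_model.get_input_embeddings().weight
    tgt_in = tgt_model.get_input_embeddings().weight
    src_out = src_model.get_output_embeddings().weight
    tgt_out = tgt_model.get_output_embeddings().weight
    for tok in tokens:
        s = src_tok.convert_tokens_to_ids(tok)
        t = tgt_tok.convert_tokens_to_ids(tok)
        # Skip tokens that either vocabulary does not actually contain.
        if s in (None, src_tok.unk_token_id) or t in (None, tgt_tok.unk_token_id):
            continue
        tgt_in[t] = src_in[s]
        tgt_out[t] = src_out[s]

# Illustrative usage with Gemma-style chat control tokens:
# copy_shared_rows(src_model, tgt_model, src_tok, tgt_tok,
#                  ["<start_of_turn>", "<end_of_turn>"])
```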
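
Merge can be implemented as linear interpolation between matching parameter tensors of the source chat model and the adapted model, leaving the expanded (shape-mismatched) embedding and output layers untouched. A sketch under those assumptions; the interpolation weight alpha is an illustrative hyperparameter, not a value taken from the paper.

```python
import torch

@torch.no_grad()
def merge_models(src_model, tgt_model, alpha=0.5):
    """In-place interpolation: target <- (1 - alpha) * target + alpha * source."""
    src_params = dict(src_model.named_parameters())
    for name, p_tgt in tgt_model.named_parameters():
        p_src = src_params.get(name)
        if p_src is None or p_src.shape != p_tgt.shape:
            # e.g., the expanded embedding matrix has extra rows in the target.
            continue
        p_tgt.copy_((1 - alpha) * p_tgt + alpha * p_src)
```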
Basic info
Categories: computation and language, artificial intelligence