Teaching Smaller Language Models To Generalise To Unseen Compositional Questions (Full Thesis)

Tim Hartill · November 25, 2024

Summary

Timothy John Hartill's 2024 PhD thesis in Computer Science at The University of Auckland addresses teaching smaller language models to generalise to unseen compositional questions, i.e. novel questions that can only be answered by combining multiple pieces of information. The models under evaluation are assumed to run on local compute with no internet connectivity. They are trained to answer diverse questions by reasoning over context acquired in two ways: retrieval from a local Wikipedia corpus, and rationales generated by a larger Language Model that is itself run resource-efficiently.

Key contributions include novel evaluation methods that assess question-answering performance while controlling for memorisation, baseline results on unseen evaluation datasets, and a significant performance improvement from retrieval-augmented training. The proposed system comprises components such as the Iterator and the Reasoning Model: the Iterator repeatedly retrieves, reranks, and scores evidence sentences, and the Reasoning Model generates answers from the accumulated context. The thesis also adapts an existing method to determine whether an observed improvement in a question-answering model is attributable to memorisation or to the added TDND training datasets, using semantic similarity scores between training and evaluation samples to identify evaluation samples that are unlikely to have been memorised, with particular attention to reading-comprehension contexts and multi-choice answer options.

Experiments draw on a wide range of datasets, including AdversarialQA, ARC, BoolQ, MCTest, NarrativeQA, NewsQA, OpenBookQA, PhysicalIQA, PubmedQA, QAConv, QASC, QuAIL, Quoref, RACE, ReClor, ReCoRD, ROPES, SocialIQA, SQuAD 1.1, SQuAD 2, TweetQA, and Winogrande. The thesis additionally reviews related work in natural language processing and computational linguistics, covering topics such as generative models, statistical analysis, word representations, and factual precision in text generation.
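The Iterator described above is an iterative retrieve-rank-score loop feeding a Reasoning Model. The sketch below is a minimal, self-contained illustration of that loop, not the thesis implementation: `retrieve` and `score` stand in for the trained retriever, reranker, and evidence-scoring components, and all names are assumptions for illustration.

```python
# Minimal sketch of an Iterator-style retrieve-rank-score loop.
# `retrieve` and `score` are stand-ins for trained neural components.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sentence:
    text: str
    score: float = 0.0

def run_iterator(
    question: str,
    retrieve: Callable[[str], List[str]],  # query -> candidate evidence sentences
    score: Callable[[str, str], float],    # (query, sentence) -> relevance score
    max_hops: int = 4,
    top_k: int = 5,
) -> List[Sentence]:
    """Accumulate evidence sentences over several retrieval hops."""
    query, evidence, seen = question, [], set()
    for _ in range(max_hops):
        candidates = [s for s in retrieve(query) if s not in seen]
        if not candidates:
            break
        ranked = sorted((Sentence(s, score(query, s)) for s in candidates),
                        key=lambda x: x.score, reverse=True)[:top_k]
        evidence.extend(ranked)
        seen.update(s.text for s in ranked)
        # Reformulate the next-hop query from the question plus evidence so far.
        query = question + " " + " ".join(s.text for s in evidence)
    return sorted(evidence, key=lambda s: s.score, reverse=True)

# Toy demonstration with word-overlap scoring over a tiny corpus.
corpus = [
    "The Ninth Symphony was composed by Beethoven.",
    "Beethoven was born in Bonn.",
    "Bonn lies on the river Rhine.",
]

def toy_retrieve(query: str) -> List[str]:
    q = set(query.lower().split())
    return [s for s in corpus if q & set(s.lower().rstrip(".").split())]

def toy_score(query: str, sentence: str) -> float:
    q, s = set(query.lower().split()), set(sentence.lower().split())
    return len(q & s) / len(q | s)

context = run_iterator("Where was the composer of the Ninth Symphony born?",
                       toy_retrieve, toy_score)
print([s.text for s in context])
```

In the thesis these components are trained neural models and the accumulated evidence is passed to the Reasoning Model as context; word overlap is used here only so that the sketch runs on its own.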

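A significant part of the reported improvement comes from retrieval-augmented training, i.e. training the Reasoning Model on samples whose context was produced by retrieval. Below is a hedged sketch of how such a training sample might be serialised; the `question: ... context: ...` layout, the optional rationale slot, and the helper name are illustrative assumptions rather than the thesis's exact encoding.

```python
# Hypothetical serialisation of a retrieval-augmented training sample.
from typing import List, Optional

def build_training_sample(question: str, evidence: List[str],
                          rationale: Optional[str] = None,
                          max_chars: int = 4000) -> str:
    """Combine a question with retrieved context (and an optional rationale
    from a larger LM) into one input string for a seq2seq reasoning model."""
    context = " ".join(evidence)
    if rationale:  # rationale generated by a larger Language Model, if available
        context = rationale + " " + context
    return f"question: {question} context: {context}"[:max_chars]

sample = build_training_sample(
    "Where was the composer of the Ninth Symphony born?",
    ["The Ninth Symphony was composed by Beethoven.",
     "Beethoven was born in Bonn."],
)
print(sample)
```

Training on many such samples encourages the model to reason over imperfect retrieved context rather than relying solely on knowledge memorised in its parameters.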

Introduction
    Background
        Overview of the current state of AI in handling novel linguistic queries
        Importance of teaching smaller language models to generalise to unseen questions
    Objective
        Aim of the research: to develop methods for training smaller language models to answer diverse questions by reasoning over local context and rationales
        Focus on models with limited compute capacity and no internet connectivity
Method
    Data Collection
        Use of a local Wikipedia corpus for context
        Incorporation of rationales generated by a larger Language Model
    Data Preprocessing
        Preparation of the local Wikipedia corpus for model training
        Selection and processing of rationales for effective training
    Model Training
        Training process for the Iterator and the Reasoning Model
        Iterative retrieval, ranking, and scoring of sentences to generate answers (see the Iterator sketch above)
Evaluation
    Novel methods for assessing question answering while controlling for memorisation
    Establishment of baseline results on unseen datasets
    Demonstration of significant performance improvement with retrieval-augmented training (see the serialisation sketch above)
Contributions
    System Components
        Description of the Iterator and the Reasoning Model
        Explanation of the iterative process for answer generation
    Evaluation Method Adaptation
        Method to evaluate whether question-answering improvement stems from memorisation or from learning on the added TDND datasets
        Use of semantic similarity scores to identify unmemorisable evaluation samples (see the sketch after this outline)
    Datasets and Studies
        Overview of the datasets used in training and evaluation
        Discussion of related studies in natural language processing and computational linguistics
        Topics include generative models, statistical analysis, word representation, and factual precision in text generation
Conclusion
    Summary of key findings and contributions
    Implications for the field of AI and language models
    Future directions for research
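The semantic-similarity contribution in the outline above (Evaluation Method Adaptation) identifies evaluation samples that are least similar to any training sample, so that gains on them cannot plausibly be explained by memorisation. The sketch below is a minimal stand-in: a bag-of-words cosine replaces the learned semantic embeddings, and the function names and the `keep_fraction` parameter are assumptions for illustration.

```python
# Sketch: keep only the evaluation samples least similar to any training sample,
# so improvement on them is unlikely to be explained by memorisation.
# Bag-of-words cosine is a self-contained stand-in for a learned sentence encoder.
from collections import Counter
from typing import List, Tuple
import numpy as np

def bow_vector(text: str, vocab: dict) -> np.ndarray:
    v = np.zeros(len(vocab))
    for word, count in Counter(text.lower().split()).items():
        if word in vocab:
            v[vocab[word]] = count
    return v

def least_similar_subset(train: List[str], evaluation: List[str],
                         keep_fraction: float = 0.2) -> Tuple[List[str], np.ndarray]:
    """Return the eval samples whose best match in the training set is weakest."""
    words = {w for text in train + evaluation for w in text.lower().split()}
    vocab = {w: i for i, w in enumerate(sorted(words))}
    T = np.stack([bow_vector(t, vocab) for t in train])
    E = np.stack([bow_vector(e, vocab) for e in evaluation])
    T /= np.linalg.norm(T, axis=1, keepdims=True) + 1e-9
    E /= np.linalg.norm(E, axis=1, keepdims=True) + 1e-9
    max_sim = (E @ T.T).max(axis=1)  # similarity to the closest training sample
    keep = np.argsort(max_sim)[: max(1, int(len(evaluation) * keep_fraction))]
    return [evaluation[i] for i in keep], max_sim

kept, sims = least_similar_subset(
    train=["who wrote hamlet", "capital of france"],
    evaluation=["who wrote macbeth", "boiling point of water"],
    keep_fraction=0.5,
)
print(kept)  # -> ["boiling point of water"], the sample least similar to training
```

If accuracy on the retained least-similar subset still improves after the TDND datasets are added to training, the gain is more plausibly attributed to the added data than to memorisation.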