On A Scale From 1 to 5: Quantifying Hallucination in Faithfulness Evaluation

Xiaonan Jing, Srinivas Billa, Danny Godbout · October 16, 2024

Summary

The paper explores automated faithfulness evaluation in guided natural language generation (NLG). It develops a rubric template and uses large language models (LLMs) to score generation quality on a quantifiable 1-to-5 scale. The study compares popular LLMs and natural language inference (NLI) models, evaluates their scoring accuracy and sensitivity, and introduces methods for generating synthetic unfaithful data. Experiments on four travel-domain datasets show that GPT-4 can accurately judge and explain factual consistency, and that tuning NLI models on synthetic data improves their performance. The paper also analyzes how scores progress as the percentage of hallucinated content increases and how unfaithful content affects the resulting scores.

Introduction
Background
Overview of natural language generation (NLG) and its applications
Importance of faithfulness in NLG outputs
Objective
Aim of the research: developing a rubric template for automated faithfulness evaluation
Focus on using large language models (LLMs) for scoring generation quality
Method
Data Collection
Selection of datasets for evaluation
Characteristics of the four travel-domain datasets used
Data Preprocessing
Preparation of datasets for model training and evaluation
Handling of unfaithful content in synthetic data generation
Model Evaluation
Comparison of popular LLMs and NLI models
Assessment of performance in scoring and sensitivity
Techniques for generating synthetic unfaithful data
Analysis
Progression of scores based on the percentage of hallucinated content
Impact of unfaithful content on scores
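The synthetic unfaithful data generation and the controlled hallucination-percentage analysis outlined above can be illustrated with a simple perturbation sketch. This is one plausible strategy (number swapping over a chosen fraction of sentences), not necessarily the paper's actual generation method; all function names here are assumptions.

```python
import random
import re

def swap_numbers(sentence: str, rng: random.Random) -> str:
    """Replace each digit sequence with a different random number,
    creating a factual inconsistency with the source sentence."""
    def repl(match: re.Match) -> str:
        old = int(match.group())
        new = old
        while new == old:  # guarantee the value actually changes
            new = rng.randint(0, max(10, old * 2))
        return str(new)
    return re.sub(r"\d+", repl, sentence)

def inject_hallucination(sentences: list[str], ratio: float,
                         rng: random.Random) -> list[str]:
    """Perturb a given fraction of sentences, so the hallucination
    percentage of the resulting passage can be controlled."""
    n = max(1, round(ratio * len(sentences)))
    chosen = set(rng.sample(range(len(sentences)), n))
    return [swap_numbers(s, rng) if i in chosen else s
            for i, s in enumerate(sentences)]
```

Varying `ratio` yields passages with a known proportion of unfaithful sentences, which is the kind of controlled input needed to study how evaluator scores degrade as hallucination increases.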
Results
Performance of LLMs and NLI Models
Detailed comparison of model performance
Insights into strengths and weaknesses
Synthetic Data Impact
Effectiveness of synthetic data in improving NLI model performance
Faithfulness Evaluation
Accuracy of GPT-4 in judging and explaining factual consistency
Sensitivity analysis of models to unfaithful content
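To compare NLI models against LLM judges on a common footing, per-claim entailment probabilities have to be mapped onto the same 1-to-5 scale. The sketch below shows one simple binning scheme; the thresholds and the averaging aggregation are assumptions for illustration, not the paper's calibration.

```python
def nli_to_rubric_score(entailment_prob: float) -> int:
    """Map an NLI entailment probability onto a 1-to-5 faithfulness
    scale using evenly spaced bins (thresholds are illustrative):
    [0, 0.2) -> 1, [0.2, 0.4) -> 2, ..., [0.8, 1.0] -> 5."""
    if not 0.0 <= entailment_prob <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return min(5, int(entailment_prob * 5) + 1)

def aggregate_claim_scores(probs: list[float]) -> float:
    """Average per-claim rubric scores into a passage-level score."""
    return sum(nli_to_rubric_score(p) for p in probs) / len(probs)
```

In practice the probabilities would come from an NLI classifier scoring each generated claim against the source; with a shared scale, score progression and sensitivity can be compared directly across model families.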
Discussion
Theoretical Implications
Contribution to the field of NLG and faithfulness evaluation
Potential for future research
Practical Applications
Real-world implications of automated faithfulness evaluation
Integration of findings into NLG systems
Conclusion
Summary of Findings
Key outcomes and their significance
Future Directions
Areas for further exploration
Recommendations for practitioners and researchers
Basic info

Categories: Computation and Language, Artificial Intelligence
Insights
What methods does the study introduce for generating synthetic unfaithful data, and how are these utilized in the evaluation process?
What are the key findings on how scores progress with the percentage of hallucinated content, and how does unfaithful content affect scores, particularly with respect to GPT-4's performance in judging factual consistency?
How does the paper compare popular large language models (LLMs) and natural language inference (NLI) models in terms of scoring generation quality?