On A Scale From 1 to 5: Quantifying Hallucination in Faithfulness Evaluation

Xiaonan Jing, Srinivas Billa, Danny Godbout · October 16, 2024

Summary

The paper explores automated faithfulness evaluation in guided natural language generation (NLG). It develops a rubric template and uses large language models (LLMs) to score generation quality on a quantifiable 1-to-5 scale. The study compares popular LLMs and natural language inference (NLI) models, evaluates their scoring accuracy and sensitivity, and introduces methods for generating synthetic unfaithful data. Experiments on four travel-domain datasets show that GPT-4 can accurately judge and explain factual consistency, and that tuning NLI models on synthetic data improves their performance. The paper also analyzes how scores progress as the percentage of hallucinated content increases and how unfaithful content affects the resulting scores.

Introduction
Background
Overview of natural language generation (NLG) and its applications
Importance of faithfulness in NLG outputs
Objective
Aim of the research: developing a rubric template for automated faithfulness evaluation
Focus on using large language models (LLMs) for scoring generation quality
Method
Data Collection
Selection of datasets for evaluation
Characteristics of the four travel-domain datasets used
Data Preprocessing
Preparation of datasets for model training and evaluation
Handling of unfaithful content in synthetic data generation
Model Evaluation
Comparison of popular LLMs and NLI models
Assessment of performance in scoring and sensitivity
Techniques for generating synthetic unfaithful data
Analysis
Progression of scores based on the percentage of hallucinated content
Impact of unfaithful content on scores
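The synthetic unfaithful data generation and the controlled hallucination-percentage analysis outlined above can be illustrated with a simple perturbation sketch. This is one plausible strategy (number swapping over a chosen fraction of sentences), not necessarily the paper's actual generation method; all function names here are assumptions.

```python
import random
import re

def swap_numbers(sentence: str, rng: random.Random) -> str:
    """Replace each digit sequence with a different random number,
    creating a factual inconsistency with the source sentence."""
    def repl(match: re.Match) -> str:
        old = int(match.group())
        new = old
        while new == old:  # guarantee the value actually changes
            new = rng.randint(0, max(10, old * 2))
        return str(new)
    return re.sub(r"\d+", repl, sentence)

def inject_hallucination(sentences: list[str], ratio: float,
                         rng: random.Random) -> list[str]:
    """Perturb a given fraction of sentences, so the hallucination
    percentage of the resulting passage can be controlled."""
    n = max(1, round(ratio * len(sentences)))
    chosen = set(rng.sample(range(len(sentences)), n))
    return [swap_numbers(s, rng) if i in chosen else s
            for i, s in enumerate(sentences)]
```

Varying `ratio` yields passages with a known proportion of unfaithful sentences, which is the kind of controlled input needed to study how evaluator scores degrade as hallucination increases.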
Results
Performance of LLMs and NLI Models
Detailed comparison of model performance
Insights into strengths and weaknesses
Synthetic Data Impact
Effectiveness of synthetic data in improving NLI model performance
Faithfulness Evaluation
Accuracy of GPT-4 in judging and explaining factual consistency
Sensitivity analysis of models to unfaithful content
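To compare NLI models against LLM judges on a common footing, per-claim entailment probabilities have to be mapped onto the same 1-to-5 scale. The sketch below shows one simple binning scheme; the thresholds and the averaging aggregation are assumptions for illustration, not the paper's calibration.

```python
def nli_to_rubric_score(entailment_prob: float) -> int:
    """Map an NLI entailment probability onto a 1-to-5 faithfulness
    scale using evenly spaced bins (thresholds are illustrative):
    [0, 0.2) -> 1, [0.2, 0.4) -> 2, ..., [0.8, 1.0] -> 5."""
    if not 0.0 <= entailment_prob <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    return min(5, int(entailment_prob * 5) + 1)

def aggregate_claim_scores(probs: list[float]) -> float:
    """Average per-claim rubric scores into a passage-level score."""
    return sum(nli_to_rubric_score(p) for p in probs) / len(probs)
```

In practice the probabilities would come from an NLI classifier scoring each generated claim against the source; with a shared scale, score progression and sensitivity can be compared directly across model families.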
Discussion
Theoretical Implications
Contribution to the field of NLG and faithfulness evaluation
Potential for future research
Practical Applications
Real-world implications of automated faithfulness evaluation
Integration of findings into NLG systems
Conclusion
Summary of Findings
Key outcomes and their significance
Future Directions
Areas for further exploration
Recommendations for practitioners and researchers
Basic info

Categories: Computation and Language, Artificial Intelligence
Insights
What methods does the study introduce for generating synthetic unfaithful data, and how are these utilized in the evaluation process?
What are the key findings on how scores progress with the percentage of hallucinated content, and how does unfaithful content affect scores, particularly with respect to GPT-4's performance in judging factual consistency?
How does the paper compare popular large language models (LLMs) and natural language inference (NLI) models in terms of scoring generation quality?