From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries
Summary
Paper digest
What problem does the paper attempt to solve? Is this a new problem?
The paper investigates the benefits of using RAG context as an external knowledge source to complement the parametric knowledge stored in language models for factual queries. Specifically, it explores the utility of parametric memory and the interplay between parametric and non-parametric memory during retrieval augmented generation. Understanding how language models favor external knowledge over parametric information for factual queries is a novel problem in natural language processing.
What scientific hypothesis does this paper seek to validate?
The paper seeks to validate the hypothesis that RAG context, used as an external knowledge source, complements the parametric knowledge stored in language models and reduces their reliance on parametric memory for factual recall. The study examines the interplay between parametric and non-parametric memory, observing a reduced reliance on the subject token and its associated MLP activations when the context is augmented with RAG. It probes the mechanisms underlying language models' preference for information provided via RAG over their internal parametric knowledge, arguing that large language models exhibit a "shortcut mechanism" when supplied with non-parametric knowledge through a RAG system.
What new ideas, methods, or models does the paper propose? What are the characteristics and advantages compared to previous methods?
The paper introduces several novel ideas, methods, and models related to language models and factual queries:
- RAG Pipeline and Attention Contribution: The paper discusses the use of a RAG pipeline to introduce variability in retrieved documents, which is sensitive to the retrieval model and its hyperparameters. It also introduces Attention Contribution, which leverages self-attention patterns to predict constraint satisfaction and factual errors in generated text by measuring how much different components contribute to the model's predictions.
- Causal Tracing and Attention Knockouts: The paper presents Causal Tracing, which identifies hidden states that significantly influence factual predictions by comparing clean, corrupted, and corrupted-with-restoration runs to determine the impact of specific text spans on model predictions. It also discusses Attention Knockouts, which study the effect of removing attention from one token position to another, highlighting critical attention edges essential for maintaining prediction quality in transformer-based models.
- Parametric vs. Non-Parametric Memory: The study explores the interplay between parametric and non-parametric memory in retrieval augmented generation, observing a reduced reliance on subject tokens and the associated MLP activations when the prompt is augmented with RAG context. It also examines the influence of RAG on factual queries in the Phi-2 and LLaMA-2 models, highlighting the importance of RAG for factual recall across a range of scenarios involving these models.
- Correlation with Factual Correctness and Quality Checks: The paper analyzes the correlation between aggregated attention norms and the factual correctness of model outputs, finding that higher attention norms to constraint tokens correlate with increased factual accuracy. It also describes quality checks on the generated synthetic data, ensuring that the attribute token occurs exactly once within the context to maintain data integrity.
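The attention-knockout idea described above can be illustrated with a toy single-head self-attention layer: the attention edge from the last token to a "subject" position is removed before the softmax, and the change in the last token's representation measures how much that edge mattered. This is a minimal sketch with random toy weights, not the paper's implementation; the layer, token positions, and dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, knockout=None):
    """Single-head self-attention. `knockout` is an optional list of
    (query_pos, key_pos) edges whose pre-softmax score is set to -inf,
    i.e. the attention edge is removed."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if knockout:
        for qi, ki in knockout:
            scores[qi, ki] = -np.inf
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
T, d = 5, 8                      # 5 tokens, hidden size 8 (toy values)
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))

clean = attention(q, k, v)
# Knock out attention from the last token (pos 4) to the "subject" token (pos 1).
knocked = attention(q, k, v, knockout=[(4, 1)])

# The shift in the last token's output measures how much that edge mattered.
effect = np.linalg.norm(clean[-1] - knocked[-1])
print(f"knockout effect on last-token representation: {effect:.4f}")
```

In the paper's setting, a large drop in prediction quality after such a knockout indicates that the edge (e.g. last token attending to the subject or to the RAG context) is critical for the answer.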
These proposed ideas, methods, and models contribute to a deeper understanding of how language models combine external knowledge with parametric information for factual queries, shedding light on the mechanisms underlying LMs' preference for information provided via RAG systems.

Compared to previous methods, the paper offers the following characteristics and advantages:
- Utilization of RAG Context: The study examines the benefits of incorporating Retrieval Augmented Generation (RAG) context as an external knowledge source to complement the parametric knowledge stored in models for factual queries. With RAG context available, the models rely less on parametric memory for factual recall, showing a preference for external context over internal knowledge.
- Mechanistic Probing Methods: The paper employs mechanistic probing methods, namely Attention Contribution, Attention Knockouts, and Causal Tracing, to understand how language models prioritize information provided via RAG systems over their parametric knowledge. These methods identify the influence of RAG on factual queries and shed light on the mechanisms underlying LMs' preference for external context.
- Interplay Between Parametric and Non-Parametric Memory: The study highlights a reduced reliance on subject tokens and MLP activations when the prompt is augmented with RAG context, providing insight into how language models balance external context against internal knowledge for factual reasoning.
- Correlation with Factual Correctness: The paper analyzes the correlation between aggregated attention norms and the factual correctness of model outputs, revealing that higher attention norms to constraint tokens correlate with increased factual accuracy. This correlation serves as a predictive measure for evaluating the reliability of the model's responses.
- Quality Checks on Generated Data: By verifying that the attribute token occurs exactly once within each generated context, the study maintains the quality and reliability of the synthetic data, contributing to the robustness of the analysis.
Overall, the paper's combination of mechanistic probing, correlation studies, and quality checks provides a comprehensive picture of how language models prioritize external knowledge over parametric information for factual queries, offering valuable insight into how these models handle factual reasoning tasks.
Does any related research exist? Who are the noteworthy researchers in this field? What is the key to the solution mentioned in the paper?
Several related research studies exist on language models utilizing external knowledge for factual queries. Noteworthy researchers in this area include Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom, Mengru Wang, Ningyu Zhang, Ziwen Xu, Zekun Xi, Shumin Deng, Yunzhi Yao, Qishen Zhang, Linyi Yang, Jindong Wang, Huajun Chen, Arnab Sen Sharma, David Atkinson, David Bau, Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston, among others.
The key to the solution is the mechanistic probing of language models to understand how they use external knowledge from retrieval augmented generation (RAG) context to complement their parametric knowledge on factual queries. The study finds that language models exhibit a "shortcut mechanism": when answering questions, they rely on the external context supplied by RAG rather than on their parametric memory. This behavior is established through causal mediation analysis, attention contributions, and attention knockouts, all of which show a reduced reliance on parametric memory when the context is augmented with RAG.
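The causal-mediation idea behind this analysis can be sketched on a toy two-layer network: run the model cleanly, corrupt the subject embedding, then re-run with the corrupted input while restoring part of a hidden state to its clean value. This is a minimal illustration with random toy weights, not the paper's models; the network, noise scale, and restored span are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))

def layer1(x):
    return np.tanh(W1 @ x)          # hidden state

def layer2(h):
    return W2 @ h                   # "logits"

subject_emb = rng.normal(size=d)

# 1) Clean run: record the hidden state and output.
h_clean = layer1(subject_emb)
out_clean = layer2(h_clean)

# 2) Corrupted run: add noise to the subject embedding.
h_corrupt = layer1(subject_emb + rng.normal(scale=3.0, size=d))
out_corrupt = layer2(h_corrupt)

# 3) Corrupted-with-restoration run: keep the corrupted input, but patch part
#    of the hidden state back to its clean value.
h_patched = h_corrupt.copy()
h_patched[: d // 2] = h_clean[: d // 2]
out_restored = layer2(h_patched)

# Indirect effect: how much restoring the state moves the output back toward
# the clean run. Large positive values mean the state is causally important.
ie = np.linalg.norm(out_corrupt - out_clean) - np.linalg.norm(out_restored - out_clean)
print(f"indirect effect of restoring half the hidden state: {ie:.3f}")
```

In the paper's setting the same three-run comparison is made over real transformer hidden states and specific token spans, localizing which states mediate the factual prediction.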
How were the experiments in the paper designed?
The experiments were designed to measure the benefits of using RAG context as an external knowledge source alongside the parametric knowledge stored in the models for factual queries. The study used three mechanistic probing methods, attention contributions, attention knockouts, and causal tracing, to explore the utility of parametric memory and the interplay between parametric and non-parametric memory during retrieval augmented generation. The experiments observed a reduced reliance on the subject token, and on the MLP activations associated with it, when the context was augmented with RAG. The impact of long context, the positions of subject and attribute tokens, and proximity and recency biases were identified as directions for future work.
What is the dataset used for quantitative evaluation? Is the code open source?
The dataset used for quantitative evaluation is the Knowns Fact Dataset, which consists of 1209 factual queries stored as (subject, relation, object/attribute) records.
Regarding the code, the paper states that the code for generating RAG context from the Knowns Fact Dataset using GPT-4 is available in Appendix E. The study applied quality-assurance checks to ensure that the generated context satisfied specific constraints.
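The "attribute occurs exactly once" constraint mentioned throughout the digest can be sketched as a simple string-level filter. In practice the check would operate on tokenized text; the function name and the example context below are illustrative, not from the paper:

```python
def attribute_occurs_once(context: str, attribute: str) -> bool:
    """Sketch of the data quality check: keep a generated context only if
    the attribute string appears exactly once in it. (A real pipeline would
    count tokenizer tokens rather than raw substrings.)"""
    return context.count(attribute) == 1

# Illustrative (subject, relation, attribute) example: (Eiffel Tower, located in, Paris).
context = "The Eiffel Tower, a landmark finished in 1889, stands in Paris."
assert attribute_occurs_once(context, "Paris")                 # exactly one mention
assert not attribute_occurs_once(context, "London")            # zero mentions fail
assert not attribute_occurs_once("Paris and Paris", "Paris")   # duplicates fail
```

Filtering out contexts with zero or duplicate attribute mentions keeps the downstream attention and tracing measurements unambiguous about which token position carries the answer.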
Do the experiments and results in the paper provide good support for the scientific hypotheses that need to be verified? Please analyze.
The experiments and results provide strong support for the scientific hypotheses under investigation. The study examines the mechanisms underlying large language models' preference for information provided by Retrieval-Augmented Generation (RAG) contexts over their parametric knowledge. Through causal tracing, attention contributions, and attention knockouts, the paper systematically explores how language models use external knowledge to enhance factual reasoning. The analysis shows that higher attention norms to constraint tokens correlate with increased factual accuracy, and that models rely less on subject tokens and parametric memory when the prompt is augmented with RAG context. Together, these findings provide robust evidence for the paper's hypotheses.
What are the contributions of this paper?
The contributions of the paper include:
- Introducing three mechanistic probing methods to understand the benefits of using RAG context as an external knowledge source alongside parametric knowledge for factual queries.
- Exploring the utility of parametric memory and the interaction between parametric and non-parametric memory in retrieval augmented generation, showing that parametric memory becomes less critical for factual recall when RAG context is added to the prompt.
- Analyzing attention contributions and attention knockouts to show the reduced reliance on the subject token and MLP activations when the context is augmented with RAG, indicating that language models primarily rely on the context rather than parametric memory to answer questions.
- Correlating higher attention norms to constraint tokens with increased factual accuracy, providing a predictive measure for evaluating the reliability of the model's responses.
- Studying the effect of knocking out attention from the last token to the subject token in autoregressive models, in both the RAG and vanilla settings, to gauge the model's reliance on parametric memory for factual queries.
- Probing two state-of-the-art language models, Phi-2 and LLaMA-2, trained on different corpora, to comprehensively assess the influence of RAG on factual queries and to measure causal mediation.
What work can be continued in depth?
Further research in this area can delve deeper into several aspects:
- Investigating the impact of long context: exploring how language models handle longer contexts and how the positions of the subject and attribute tokens matter, including factors such as proximity and recency bias.
- Analyzing instruction-tuned models: studying models fine-tuned with objectives such as RLHF to understand their performance and behavior on factual queries.
- Examining the quality of retrievers and rankers: understanding how the quality of retrievers, rankers, and hyperparameters affects the noisy and sensitive nature of retrieved outputs, which is crucial for improving model reasoning and accuracy.