R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models

Shangqing Tu, Yuanchun Wang, Jifan Yu, Yuyang Xie, Yaran Shi, Xiaozhi Wang, Jing Zhang, Lei Hou, Juanzi Li · June 17, 2024

Summary

R-Eval is a user-friendly Python toolkit for evaluating retrieval-augmented large language models (RALLMs) on their domain knowledge and performance across different workflows, tasks, and domains. The paper presents a comparative analysis of 21 RALLMs, including GPT-4 and GPT-3.5, across three task levels and two domains (Wikipedia and Aminer). Key findings reveal clear variations in performance: GPT-4 performs well on Knowledge Seeking and Application tasks, while the PAL workflow produces more tool-using errors. R-Eval highlights the importance of considering both task and domain when selecting a RAG workflow and LLM combination, and aims to support research and industry by providing a platform for continuous improvement and standardized evaluation at <https://github.com/THU-KEG/R-Eval>. The study also notes the need for fine-tuning and optimization based on specific application requirements.

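To make the evaluation setup concrete, the sketch below shows what a loop over workflows, tasks, and domains could look like. All names (`EvalCase`, `evaluate`, the task and domain strings) are illustrative assumptions and do not reflect R-Eval's actual API; consult the GitHub repository for the real interface.

```python
# Hypothetical sketch of evaluating one RALLM workflow across domains and task
# levels; none of these names come from the R-Eval codebase.
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    domain: str     # e.g. "wikipedia" or "aminer"
    task: str       # e.g. "knowledge_seeking", "application", "tool_using"
    question: str
    reference: str  # gold answer


def evaluate(workflow: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run one workflow over all cases and report per-(domain, task) accuracy."""
    scores: dict[tuple[str, str], list[int]] = {}
    for case in cases:
        prediction = workflow(case.question)
        correct = int(prediction.strip().lower() == case.reference.strip().lower())
        scores.setdefault((case.domain, case.task), []).append(correct)
    return {key: sum(vals) / len(vals) for key, vals in scores.items()}
```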

Introduction
  Background
    - Rise of RALLMs and their significance
    - Need for standardized evaluation tools
  Objective
    - To assess and compare RALLMs' performance
    - To identify best practices for workflow and domain selection
    - To promote continuous improvement and standardization

Methodology
  Data Collection
    Model Selection
      - 21 RALLMs, including GPT-4 and GPT-3.5
    Task and Domain Selection
      - Three task levels: Knowledge Seeking, Application, and Tool-Using
      - Two domains: Wikipedia and Aminer
  Evaluation Framework
    - R-Eval toolkit design
    - Performance metrics (accuracy, efficiency, domain-specific knowledge); a metric sketch follows this section
  Experiment Setup
    - Workflow analysis: PAL workflow
    - Error analysis: tool-using errors
  Data Preprocessing
    - Dataset preparation for benchmarking
    - Standardization of input and evaluation queries
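
The outline above lists accuracy among the performance metrics. As a point of reference, the sketch below shows two standard QA scoring functions, exact match and token-level F1; whether R-Eval uses exactly these definitions is an assumption, so treat them as illustrative.

```python
# Standard QA scoring functions (exact match and token-level F1), shown as an
# illustration of accuracy metrics; the paper's exact definitions may differ.
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into tokens."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```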

Comparative Analysis
  Task Performance
    - Knowledge Seeking tasks
    - Application tasks
    - Tool-Using tasks (with a focus on the PAL workflow)
  Domain-wise Evaluation
    - Wikipedia domain
    - Aminer domain
  GPT-4 vs. GPT-3.5 Comparison
    - Strengths and weaknesses; see the tabulation sketch after this section
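
One way to organize such a cross-model, cross-domain comparison is a simple score grid. The helper below is a generic sketch, not code from the R-Eval repository, and it contains no numbers from the paper.

```python
# Sketch of tabulating per-model accuracies for cross-task and cross-domain
# comparison; inputs would come from an evaluation loop like the one above,
# and nothing here reproduces results reported in the paper.
def comparison_table(results: dict[str, dict[tuple[str, str], float]]) -> str:
    """results maps model name -> {(domain, task): accuracy}; returns a plain-text grid."""
    columns = sorted({col for scores in results.values() for col in scores})
    header = "model".ljust(12) + "".join(f"{dom}:{task}".rjust(30) for dom, task in columns)
    rows = [header]
    for model, scores in results.items():
        cells = "".join(f"{scores.get(col, float('nan')):>30.3f}" for col in columns)
        rows.append(model.ljust(12) + cells)
    return "\n".join(rows)
```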

Findings and Insights
  - Variations in model performance across tasks and domains
  - Importance of task and domain considerations
  - Fine-tuning and optimization recommendations

R-Eval Platform
  - Open-source repository: <https://github.com/THU-KEG/R-Eval>
  - Standardized evaluation process (a hypothetical workflow interface is sketched below)
  - Facilitating research and industry adoption
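
A standardized harness typically needs a uniform interface that every evaluated workflow implements. The minimal sketch below illustrates that idea with a hypothetical `RALLMWorkflow` base class and a toy retrieve-then-read implementation; it is not the plug-in API of the R-Eval repository.

```python
# Hypothetical workflow interface showing how an evaluation harness can stay
# agnostic to the underlying RALLM; this is not R-Eval's actual plug-in API.
from abc import ABC, abstractmethod


class RALLMWorkflow(ABC):
    """A retrieval-augmented workflow that answers one question at a time."""

    @abstractmethod
    def answer(self, question: str, domain: str) -> str:
        """Return the workflow's final answer for a domain-specific question."""


class RetrieveThenRead(RALLMWorkflow):
    """Toy example: retrieve supporting passages, then prompt an LLM to answer."""

    def __init__(self, retriever, llm):
        self.retriever = retriever  # callable: (query, domain) -> list[str]
        self.llm = llm              # callable: prompt -> str

    def answer(self, question: str, domain: str) -> str:
        passages = self.retriever(question, domain)
        prompt = "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {question}\nAnswer:"
        return self.llm(prompt)
```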

Conclusion
  - Summary of key takeaways
  - Future directions for RALLM research and development
  - Open challenges and opportunities

References
  - List of cited literature and resources

Basic info
  - Type: paper
  - Subject areas: computation and language; artificial intelligence