R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models
Summary
Paper digest
What problem does the paper attempt to solve? Is this a new problem?
The paper addresses the challenge of evaluating Retrieval-Augmented Large Language Models (RALLMs) by introducing the R-Eval toolkit, which streamlines the evaluation of different RAG workflows in conjunction with LLMs on domain-specific tasks. The problem is not entirely new: prior evaluation work has assessed the capabilities of large language models in specific domains, but it lacked comprehensive mining of domain knowledge and did not explore the various combinations of LLMs and RAG workflows. The R-Eval toolkit fills this gap by offering a user-friendly, modular, and extensible platform for evaluating RALLMs with a focus on domain knowledge.
What scientific hypothesis does this paper seek to validate?
This paper seeks to validate the hypothesis that Retrieval-Augmented Large Language Models (RALLMs) can address the shortcomings of Large Language Models (LLMs) on domain-specific tasks by incorporating domain knowledge through retrieval-augmented generation (RAG). The study emphasizes that both task and domain requirements must be considered when selecting a RAG workflow and LLM combination, and it evaluates the effectiveness of RALLMs across tasks and domains with an eye toward domain-specific applications such as AI healthcare assistants.
What new ideas, methods, or models does the paper propose? What are the characteristics and advantages compared to previous methods?
The paper "R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models" introduces several innovative ideas, methods, and models in the domain of Retrieval-Augmented Large Language Models (RALLMs) . One key proposal is the technique of retrieval augmented generation (RAG), which aims to adapt Large Language Models (LLMs) for domain-specific applications, such as AI healthcare assistants, by mitigating the generation of hallucinated responses . This approach involves retrieving relevant resources based on user input and synthesizing outputs from independently executed tools or adopting a sequential execution and prompt-based reasoning process .
The paper emphasizes the importance of considering both task and domain requirements when selecting a RAG workflow and LLM combination. It introduces R-Eval, a Python toolkit designed to streamline the evaluation of different RAG workflows in conjunction with LLMs; the toolkit supports popular built-in RAG workflows and allows customized testing data for specific domains to be incorporated. This makes it possible to evaluate RALLMs across different tasks and domains, revealing significant variations in their effectiveness.
Furthermore, the paper categorizes previous retrieval workflows for LLMs into two main categories: Planned Retrieval and Interactive Retrieval. In Planned Retrieval, the retriever plans what knowledge to fetch based on the question, while Interactive Retrieval lets the LLM iteratively refine the retrieval process, addressing challenges of accuracy and comprehensiveness. These categories provide insight into how domain knowledge can be effectively used in RALLMs; a minimal sketch contrasting the two appears below.
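To make the two categories concrete, below is a minimal, self-contained Python sketch contrasting them. The retriever and LLM here are toy stand-ins written for this illustration; none of the function names are taken from the R-Eval toolkit.

```python
from typing import List

# Toy stand-ins for a retriever and an LLM so the sketch runs end-to-end;
# a real RALLM would call a search index and a language model instead.
def retrieve(query: str) -> str:
    return f"<document about: {query}>"

def llm_generate(question: str, evidence: List[str]) -> str:
    return f"Answer to '{question}' grounded in {len(evidence)} retrieved document(s)."

def planned_retrieval(question: str) -> str:
    """Planned Retrieval: decide up front what to fetch, then answer once."""
    queries = [question]  # a real planner would decompose the question into sub-queries
    evidence = [retrieve(q) for q in queries]
    return llm_generate(question, evidence)

def interactive_retrieval(question: str, max_steps: int = 3) -> str:
    """Interactive Retrieval: the LLM iteratively refines what to retrieve."""
    evidence: List[str] = []
    query = question
    for step in range(max_steps):
        evidence.append(retrieve(query))
        # A real LLM would inspect the evidence here and either stop
        # or issue a sharper follow-up query.
        query = f"{question} (refined at step {step + 1})"
    return llm_generate(question, evidence)

print(planned_retrieval("Which university did the inventor of the WWW attend?"))
print(interactive_retrieval("Which university did the inventor of the WWW attend?"))
```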
Overall, the paper presents a comprehensive framework for evaluating RALLMs, highlighting recent advances in RAG workflows for LLMs and the need to tailor these systems to specific domains and tasks. By introducing new methodologies and tools such as the R-Eval toolkit, the paper contributes to the ongoing exploration and enhancement of RALLMs for domain-specific applications.
Compared to previous methods, the R-Eval toolkit provides a comprehensive evaluation framework for RALLMs, emphasizing the importance of considering both task and domain requirements when selecting a RAG workflow and LLM combination. The toolkit supports popular built-in RAG workflows such as ReAct, PAL, DFSDT, and function calling, and allows customized testing data for specific domains to be incorporated through template-based question generation (sketched below). This user-friendly, modular, and extensible toolkit facilitates fair comparisons and promotes wider adoption of RALLM systems for domain-specific applications.
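As a rough illustration of template-based question generation for custom domain data, here is a small sketch; the templates and the knowledge records are invented for this example and are not the ones shipped with R-Eval.

```python
import itertools

# Hedged sketch of template-based question generation from domain records.
# Both the templates and the records below are illustrative placeholders.
TEMPLATES = [
    "Who is the first author of {paper}?",
    "Which institution is {author} affiliated with?",
]

RECORDS = [
    {"paper": "R-Eval", "author": "Shangqing Tu", "institution": "Tsinghua University"},
]

def generate_questions(templates, records):
    """Fill each template with every record that provides the required slots."""
    questions = []
    for template, record in itertools.product(templates, records):
        try:
            questions.append(template.format(**record))
        except KeyError:
            continue  # skip records that lack a slot this template needs
    return questions

print(generate_questions(TEMPLATES, RECORDS))
```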
The paper highlights the significance of evaluating domain knowledge in RALLMs, a gap in existing evaluation tools, which often do not explore the various combinations of LLMs and RAG workflows. By evaluating 21 RALLMs across different tasks and domains, the study reveals significant variations in their effectiveness, underscoring the need to consider both task and domain requirements when choosing a RAG workflow and LLM combination, and emphasizing the importance of models that can handle both broad, open-domain knowledge and domain-specific knowledge.
Furthermore, the paper presents a multifaceted analysis of the compatibility between RAG workflows and LLMs, of error types, and of the trade-off between effectiveness and efficiency. The combination of the ReAct workflow with the GPT-4-1106 LLM exhibits exceptional performance across tasks and domains, offering a strong balance of fact retrieval, knowledge understanding, and application. While this combination stands out in the evaluation, the "best" combination may vary with the specific task or domain, again highlighting the importance of considering both factors when selecting a RAG workflow and LLM combination.
Does any related research exist? Who are the noteworthy researchers on this topic? What is the key to the solution mentioned in the paper?
Several related research efforts exist in the field of evaluating the domain knowledge of retrieval-augmented large language models (RALLMs). Noteworthy researchers in this field include Shangqing Tu, Yuanchun Wang, Jifan Yu, Yuyang Xie, Jing Zhang, Lei Hou, Juanzi Li, and others.
The key solution mentioned in the paper is the R-Eval toolkit, a Python toolkit designed to streamline the evaluation of different RAG workflows in conjunction with LLMs. It supports popular built-in RAG workflows and allows customized testing data for specific domains to be incorporated; a hypothetical usage sketch follows.
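To give a sense of how such a toolkit might be driven, the following is a hypothetical sketch of an evaluation loop over workflow/LLM pairs; the names `WORKFLOWS`, `LLMS`, and `evaluate` are assumptions made for this illustration and do not reflect the actual R-Eval API.

```python
# Hypothetical evaluation loop over RAG workflow / LLM combinations.
# None of these names come from the real R-Eval interface; the scorer is a stub.
WORKFLOWS = ["ReAct", "PAL", "DFSDT", "FunctionCalling"]
LLMS = ["gpt-4-1106", "llama2-13b-chat", "vicuna-13b"]

def evaluate(workflow: str, llm: str, domain: str) -> float:
    """Stand-in scorer; a real run would execute the workflow on benchmark tasks."""
    return 0.0  # placeholder score

results = {
    (wf, llm): evaluate(wf, llm, domain="wikipedia")
    for wf in WORKFLOWS
    for llm in LLMS
}
best_pair = max(results, key=results.get)
print(f"Best combination on this (placeholder) run: {best_pair}")
```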
How were the experiments in the paper designed?
The experiments were designed around evaluating Retrieval-Augmented Large Language Models (RALLMs) across different tasks and domains. They assessed the effectiveness of RALLMs across three levels of tasks and two representative domains, highlighting significant variations in performance depending on task and domain requirements. To make this evaluation tractable, the authors introduced the R-Eval toolkit, a Python toolkit that streamlines the evaluation of different Retrieval-Augmented Generation (RAG) workflows in conjunction with Large Language Models (LLMs). In total, 21 RALLMs were evaluated across the task levels and domains, emphasizing the importance of considering both task and domain requirements when selecting a RAG workflow and LLM combination. The experiments also examined performance on the Wikipedia domain, where combinations such as ReAct with GPT-4-1106 achieved strong performance across all three task levels.
What is the dataset used for quantitative evaluation? Is the code open source?
The quantitative evaluation uses the R-Eval benchmark, which comprises 12 tasks designed to assess three levels of cognitive ability across two representative domains. The open-source models used in the evaluation, including Llama2-chat, Tulu, Vicuna, Llama2, CodeLlama-instruct, and ToolLlama-2, are publicly available and can be accessed for evaluation purposes; a loading sketch for one of them appears below.
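As an assumption-laden example of accessing one of these open-source models, the sketch below loads a Llama-2 chat checkpoint with the Hugging Face transformers library; the repository id is an example, and gated models require accepting the license and authenticating before download.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example only: loading one of the open-source LLMs mentioned above for evaluation.
# The repository id below is an assumption; gated models need prior license acceptance.
model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Who proposed the theory of general relativity?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```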
Do the experiments and results in the paper provide good support for the scientific hypotheses that need to be verified? Please analyze.
The experiments and results provide substantial support for the hypotheses under investigation. The study evaluates the effectiveness of Retrieval-Augmented Large Language Models (RALLMs) across different tasks and domains, highlighting significant variations in performance depending on task and domain requirements and underscoring the importance of task and domain specificity when selecting a RAG workflow and LLM combination. The paper also discusses the evolution of LLMs and the emergence of retrieval-augmented generation (RAG) techniques for addressing the limitations of LLMs on domain-specific tasks by leveraging domain knowledge.
Furthermore, the experiments cover RAG workflows tailored to large language models, such as the program-aided language model workflow (PAL) and sequential-execution, prompt-based reasoning processes like DFSDT and ReAct. These approaches aim to improve the adaptability of LLMs for domain-specific applications, particularly in fields like AI healthcare assistance, by mitigating hallucinated responses. A simplified illustration of the PAL pattern is sketched below.
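For concreteness, here is a simplified illustration of the PAL pattern, where the model emits executable code and the program's result becomes the answer; the `fake_llm` function is a hard-coded stand-in for an actual model call and is not taken from the paper.

```python
# Simplified PAL (program-aided language model) pattern: the LLM writes Python
# code for the question, and executing that code yields the answer.
# fake_llm is a hard-coded stand-in for a real model call.

def fake_llm(prompt: str) -> str:
    # A real LLM would generate this program from the question in the prompt.
    return "result = (17 - 3) * 2"

def pal_answer(question: str):
    prompt = f"Write Python code that computes the answer.\nQuestion: {question}\n"
    code = fake_llm(prompt)
    namespace: dict = {}
    exec(code, namespace)       # run the generated program
    return namespace["result"]  # the program's result is the final answer

print(pal_answer("What is twice the difference between 17 and 3?"))  # -> 28
```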
Moreover, the study reports detailed performance of different RALLMs across tasks and domains, including the effectiveness of workflows such as ReAct, DFSDT, and FC (function calling) on tasks within the Wikipedia domain. Models using the ReAct workflow, such as ReAct with GPT-4-1106, exhibit strong performance across task levels in the Wikipedia domain, outperforming workflows like DFSDT and FC. This comparative analysis underscores the importance of workflow selection in optimizing RALLM performance in specific domains.
Overall, the experiments and results offer valuable insights into the performance and adaptability of RALLMs across diverse tasks and domains, providing robust support for the scientific hypotheses investigated in the study.
What are the contributions of this paper?
The paper "R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models" makes several contributions:
- It introduces the R-Eval toolkit, a Python toolkit designed to streamline the evaluation of different RAG workflows in conjunction with LLMs, emphasizing the importance of considering both task and domain requirements when choosing a RAG workflow and LLM combination.
- It evaluates 21 RALLMs across three task levels and two representative domains, revealing significant variations in effectiveness across tasks and domains and highlighting the challenges of evaluating RALLMs that tools like R-Eval are designed to address.
- It addresses the shortcomings of large language models (LLMs) on domain-specific tasks by focusing on Retrieval-Augmented Large Language Models (RALLMs) as a way to mitigate the propensity of LLMs to generate hallucinated responses, particularly in domain-specific applications such as AI healthcare assistants.
- The work is supported by several research grants and institutions, including the National Key Research & Development Plan, the Institute for Guo Qiang at Tsinghua University, the Tsinghua University Initiative Scientific Research Program, Zhipu AI, and the NSF of China.
- The paper is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License and was presented at the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '24) in Barcelona, Spain.
What work can be continued in depth?
Further research can delve deeper into the evaluation of Retrieval-Augmented Large Language Models (RALLMs) by exploring the effectiveness of different RAG workflows in conjunction with LLMs across various domains and tasks. This includes investigating the correlation between a model's performance on Knowledge Seeking (KS), Knowledge Understanding (KU), and Knowledge Application (KA) tasks to understand how well models recall facts, comprehend inherent knowledge, and apply retrieved knowledge in reasoning (a sketch of such an analysis appears below). There is also room to explore the impact of different RAG workflows, such as ReAct, PAL, DFSDT, and function calling, on the performance of LLMs in retrieving domain-specific knowledge and generating responses. A comprehensive analysis of domain-knowledge evaluation in RALLMs can further clarify the variations in effectiveness across tasks and domains, highlighting the need to consider both task and domain requirements when selecting RAG workflows and LLM combinations.
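One concrete way to pursue the KS/KU/KA question is to compute pairwise correlations over per-model scores, as in the sketch below; the scores are arbitrary placeholders for illustration and are not results reported in the paper.

```python
from statistics import correlation  # requires Python 3.10+

# Placeholder per-model scores on the three task levels; these numbers are
# invented for illustration and are not taken from the R-Eval results.
scores = {
    "model_a": {"KS": 0.62, "KU": 0.55, "KA": 0.41},
    "model_b": {"KS": 0.48, "KU": 0.51, "KA": 0.39},
    "model_c": {"KS": 0.71, "KU": 0.60, "KA": 0.52},
    "model_d": {"KS": 0.35, "KU": 0.42, "KA": 0.30},
}

def level_scores(level: str) -> list:
    return [scores[model][level] for model in sorted(scores)]

for a, b in [("KS", "KU"), ("KS", "KA"), ("KU", "KA")]:
    r = correlation(level_scores(a), level_scores(b))
    print(f"Pearson r({a}, {b}) = {r:.2f}")
```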