An Empirical Comparison of Text Summarization: A Multi-Dimensional Evaluation of Large Language Models
Anantharaman Janakiraman, Behnaz Ghoraani · April 06, 2025
Summary
This study evaluates 17 large language models for text summarization using metrics including factual consistency, semantic similarity, and human-like quality. Performance varies considerably, with different models excelling on different dimensions. The work introduces a multidimensional evaluation framework and uses it to recommend models such as deepseek-v3, claude-3-5-sonnet, gpt-3.5-turbo, and gemini-1.5-flash for different applications, emphasizing the need to balance these factors when building effective summarization systems. Dataset-specific performance visualizations highlight how each model's effectiveness shifts across metrics and domains.
Introduction
Background
Overview of text summarization and its importance
Brief history and evolution of large language models
Objective
To evaluate 17 large language models for text summarization based on metrics like factual consistency, semantic similarity, and human-like quality
Method
Data Collection
Description of the dataset used for evaluation
Criteria for selecting the models for evaluation
Data Preprocessing
Techniques used for preparing the data for evaluation
Handling of missing or incomplete data (see the illustrative sketch below)
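The paper's exact preprocessing steps are not spelled out in this outline, so the following is only a minimal Python sketch of how records with missing or incomplete fields might be dropped and overly long inputs capped before evaluation; the field names and character limit are assumptions, not the authors' pipeline.

```python
# Hypothetical preprocessing sketch: keep only complete records, cap article length.
from typing import TypedDict

class Record(TypedDict):
    article: str
    reference_summary: str

def preprocess(records: list[dict], max_chars: int = 8000) -> list[Record]:
    """Drop records with a missing article or reference summary; truncate long articles."""
    cleaned: list[Record] = []
    for r in records:
        article = (r.get("article") or "").strip()
        reference = (r.get("reference_summary") or "").strip()
        if not article or not reference:   # skip missing/incomplete entries
            continue
        cleaned.append({"article": article[:max_chars],
                        "reference_summary": reference})
    return cleaned

# Example: the second record is dropped because its article field is empty.
print(preprocess([
    {"article": "Some long news article ...", "reference_summary": "A short summary."},
    {"article": "", "reference_summary": "Incomplete record."},
]))
```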
Evaluation Metrics
Detailed explanation of the metrics used (factual consistency, semantic similarity, human-like quality)
Weighting and balancing of these metrics in the evaluation framework (see the sketch below)
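The weights the paper assigns to each metric are not given in this outline, so the snippet below is only a hedged sketch of how normalized per-metric scores could be combined into a single composite; the 0.4/0.3/0.3 weights are placeholders chosen to illustrate the balancing step, not the study's actual weighting.

```python
# Hypothetical weighted composite of the three evaluation dimensions.
def composite_score(factual_consistency: float,
                    semantic_similarity: float,
                    human_likeness: float,
                    weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """Combine per-metric scores (each normalized to [0, 1]) into one number."""
    w_fc, w_ss, w_hl = weights
    assert abs(w_fc + w_ss + w_hl - 1.0) < 1e-9, "weights should sum to 1"
    return w_fc * factual_consistency + w_ss * semantic_similarity + w_hl * human_likeness

# Example: a model strong on factual consistency but weaker on human-likeness.
print(composite_score(0.92, 0.85, 0.70))  # -> 0.833
```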
Results
Model Performance
Overview of the models' performance across the evaluation metrics
Identification of models excelling in specific areas
Model Comparison
Comparative analysis of the models based on their performance
Discussion on the trade-offs between different models
Discussion
Recommendations
Based on the evaluation, recommendations for different use cases
Importance of balancing factors for effective summarization systems
Multidimensional Evaluation Framework
Introduction of a framework that considers multiple dimensions of model performance
Models recommended under the framework for different applications (e.g., deepseek-v3, claude-3-5-sonnet, gpt-3.5-turbo, gemini-1.5-flash); an illustrative sketch follows below
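As an illustration of how such a framework can turn per-dimension scores into use-case recommendations, the sketch below selects the model that leads on the dimension most relevant to each application. The scores and the use-case mapping are hypothetical, not the paper's reported results; only the model names come from the summary above.

```python
# Hypothetical per-dimension scores (placeholders, not the paper's numbers).
scores = {
    "deepseek-v3":       {"factual_consistency": 0.93, "semantic_similarity": 0.88, "human_likeness": 0.81},
    "claude-3-5-sonnet": {"factual_consistency": 0.90, "semantic_similarity": 0.86, "human_likeness": 0.89},
    "gpt-3.5-turbo":     {"factual_consistency": 0.84, "semantic_similarity": 0.87, "human_likeness": 0.83},
    "gemini-1.5-flash":  {"factual_consistency": 0.86, "semantic_similarity": 0.85, "human_likeness": 0.80},
}

# Hypothetical mapping from use case to the dimension it prioritizes.
use_case_priority = {
    "accuracy-critical summarization": "factual_consistency",
    "meaning-preserving rewriting":    "semantic_similarity",
    "reader-facing copy":              "human_likeness",
}

for use_case, dim in use_case_priority.items():
    best = max(scores, key=lambda m: scores[m][dim])
    print(f"{use_case}: prioritize {dim} -> {best}")
```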
Dataset-Specific Performance Visualizations
Visual representation of models' effectiveness across different metrics
Insights into models' performance in specific contexts or domains (see the plotting sketch below)
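The paper's actual figures are not reproduced here; the snippet below is a minimal matplotlib sketch of the kind of dataset-specific visualization described, using hypothetical scores for a single model on two unnamed datasets.

```python
# Minimal grouped-bar sketch of per-dataset, per-metric scores (hypothetical values).
import numpy as np
import matplotlib.pyplot as plt

metrics = ["factual_consistency", "semantic_similarity", "human_likeness"]
datasets = {"Dataset A": [0.90, 0.86, 0.80], "Dataset B": [0.84, 0.88, 0.83]}

x = np.arange(len(metrics))
width = 0.35
fig, ax = plt.subplots()
for i, (name, vals) in enumerate(datasets.items()):
    ax.bar(x + i * width, vals, width, label=name)
ax.set_xticks(x + width / 2)
ax.set_xticklabels(metrics)
ax.set_ylim(0, 1)
ax.set_ylabel("score")
ax.set_title("One model's scores per metric and dataset (illustrative)")
ax.legend()
plt.tight_layout()
plt.show()
```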
Conclusion
Summary of Findings
Recap of the key insights from the evaluation
Future Directions
Potential areas for further research
Implications for the development of summarization systems
Basic info
Categories: Computation and Language, Machine Learning, Artificial Intelligence
Insights
How does the multidimensional evaluation framework assess the performance of models like deepseek-v3 and gpt-3.5-turbo?
What are the main objectives and findings of the study evaluating large language models for text summarization?
What innovative approaches does the study introduce for evaluating text summarization models?
What limitations are identified in the study regarding the evaluation of large language models for summarization?