SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories
Connor Dilgren, Purva Chiniya, Luke Griffith, Yu Ding, Yizheng Chen
April 29, 2025
Summary
SecRepoBench evaluates 19 LLMs on secure code generation in real-world C/C++ repositories, covering 15 Common Weakness Enumeration (CWE) categories. Unlike previous benchmarks built on standalone functions, it shows that LLMs struggle to generate code that is both correct and secure at the repository level, and it is the first benchmark to address real-world secure code generation, underscoring how heavily the task depends on understanding the surrounding program context. The analysis finds that LLM-generated code often fails to compile, for reasons such as references to non-existent struct members, undeclared identifiers, and incorrect code completions. Beyond compilation failures, common errors include missing conditional checks, flawed memory allocations, and integer overflows; a sketch of the overflow pattern appears below. Improving the compilability of LLM output could therefore raise both pass@1 and secure-pass@1 scores.
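To make the last two failure modes concrete, here is a minimal C sketch, with invented names rather than code from the benchmark, of how an unchecked allocation size can overflow (CWE-190) and lead to an out-of-bounds write (CWE-787), together with one standard guard:

```c
/* Hypothetical example (not from SecRepoBench): an unchecked size
 * computation can wrap around SIZE_MAX, so malloc returns a buffer
 * smaller than intended and the subsequent writes overflow the heap. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

char *join_parts(const char **parts, size_t count, size_t part_len) {
    /* Insecure shape an LLM might emit:
     *   char *buf = malloc(count * part_len + 1);   // may overflow  */

    /* Hardened shape: reject the request if count * part_len + 1
     * would not fit in a size_t. */
    if (part_len != 0 && count > (SIZE_MAX - 1) / part_len)
        return NULL;
    char *buf = malloc(count * part_len + 1);
    if (buf == NULL)
        return NULL;

    for (size_t i = 0; i < count; i++)
        memcpy(buf + i * part_len, parts[i], part_len);
    buf[count * part_len] = '\0';
    return buf;
}
```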
Introduction
Background
Overview of large language models (LLMs) in software development
Importance of secure code generation in C/C++ repositories
Objective
To evaluate how well 19 LLMs generate correct, secure code in real-world, repository-level scenarios
Highlighting the challenges faced by LLMs in understanding program context and generating compilable, secure code
Method
Data Collection
Selection of real-world C/C++ repositories covering 15 Common Weakness Enumeration (CWE) categories
Gathering a diverse set of code generation tasks that reflect real-world challenges
Data Preprocessing
Cleaning and standardizing the collected data to ensure consistency and quality
Creating a benchmark dataset that accurately represents the complexity and context of real-world code generation tasks
Evaluation Framework
Metrics
Functional correctness (pass@1): the generated code compiles and passes the repository's unit tests
Security-aware correctness (secure-pass@1): the generated code passes the unit tests and does not contain the targeted vulnerability
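The page does not restate the paper's exact scoring protocol; as a point of reference, benchmarks in this line of work commonly reuse the unbiased pass@k estimator of Chen et al. (2021), and assuming SecRepoBench follows that convention, with n samples generated per task of which c pass all unit tests:

$$\mathrm{pass@}k \;=\; \mathop{\mathbb{E}}_{\text{tasks}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right], \qquad \mathrm{pass@}1 \;=\; \mathop{\mathbb{E}}_{\text{tasks}}\!\left[\frac{c}{n}\right]$$

secure-pass@1 has the same form, with c counting only samples that both pass the unit tests and are judged secure.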
Analysis
Detailed examination of LLM-generated code for common errors and issues
Identification of specific error types, such as references to non-existent struct members, undeclared identifiers, and incorrect code completions (illustrated in the sketch below)
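A minimal C sketch of these compilation-failure modes; the struct, member, and function names are invented for illustration and are not taken from the benchmark:

```c
/* Hypothetical repository context: the struct defines a member
 * named 'len', not 'length'. */
#include <stddef.h>

struct buffer {
    char  *data;
    size_t len;
};

size_t buffer_remaining(const struct buffer *b, size_t cap) {
    /* Context-unaware completions an LLM might emit:
     *   return cap - b->length;   // error: no member named 'length'
     *   return cap - buf_len(b);  // error: 'buf_len' undeclared
     */
    return cap - b->len;           /* compiles: uses the member that exists */
}
```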
Error Categories
Compilation failures, ranging from syntax errors to semantic errors such as undeclared identifiers and non-existent struct members
Logic errors such as missing conditional checks, flawed memory management, and unguarded integer arithmetic that can overflow (see the sketch below)
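And a minimal sketch of the missing-conditional-check category, again with hypothetical names; the insecure shape compiles and can pass happy-path unit tests, which is exactly why a security-aware metric such as secure-pass@1 is needed:

```c
/* Hypothetical example: copying a field into a fixed-size buffer.
 * Without the length check, an oversized src_len writes past the end
 * of dst (CWE-787, out-of-bounds write). */
#include <stddef.h>
#include <string.h>

int copy_field(char *dst, size_t dst_size, const char *src, size_t src_len) {
    /* Insecure shape an LLM might emit:
     *   memcpy(dst, src, src_len);
     *   dst[src_len] = '\0';
     */

    /* Secure shape: validate the length before writing. */
    if (dst_size == 0 || src_len >= dst_size)
        return -1;                 /* would not fit, including the NUL */
    memcpy(dst, src, src_len);
    dst[src_len] = '\0';
    return 0;
}
```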
Comparative Analysis
Comparison of LLM performance across different benchmarks and tasks
Highlighting the relative strengths and weaknesses of each model
Results
Compilation Success Rates
Overview of pass@1 scores for each LLM
Analysis of factors affecting compilation success
Secure Code Generation
Examination of secure-pass@1 scores and their implications for real-world security
Discussion of the models' ability to generate code that adheres to secure coding practices
Common Errors
Detailed analysis of the most frequent errors found in LLM-generated code
Insights into the types of errors that pose the greatest challenges for LLMs
Discussion
Challenges and Limitations
Discussion of the inherent difficulty LLMs have in generating correct, secure code
Analysis of the impact of program context on LLM performance
Future Directions
Recommendations for improving LLMs for compilability and secure code generation
Potential areas for further research to enhance LLM capabilities in real-world scenarios
Conclusion
Summary of Findings
Recap of the key insights and results from the SecRepoBench evaluation
Implications
Discussion of the broader implications for using LLMs in software development and security
Call to Action
Call for collaboration among researchers, developers, and industry to address the identified challenges and improve LLM performance in secure code generation
Basic info
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Insights
What are the common compilation errors found in LLM-generated code according to SecRepoBench?
How does SecRepoBench differ from previous benchmarks in evaluating LLMs for secure code generation?
What improvements are suggested for LLMs to enhance their secure code generation capabilities?
What are the key challenges faced by LLMs in generating secure code as identified by SecRepoBench?