SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories
Connor Dilgren, Purva Chiniya, Luke Griffith, Yu Ding, Yizheng Chen
April 29, 2025
Summary
SecRepoBench evaluates 19 LLMs on secure code generation in real-world C/C++ repositories, covering 15 Common Weakness Enumeration (CWE) categories. Unlike previous benchmarks built on standalone functions, it shows that LLMs struggle to generate code that is both correct and secure at the repository level, and it is the first benchmark to address real-world secure code generation, underscoring how heavily the task depends on understanding the surrounding program context. The analysis finds that LLM-generated code often fails to compile, for reasons such as references to non-existent struct members, undeclared identifiers, and incorrect code completions. Beyond compilation failures, common errors include missing conditional checks, flawed memory allocations, and integer overflows; a sketch of the overflow pattern appears below. Improving the compilability of LLM output could therefore raise both pass@1 and secure-pass@1 scores.
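To make the last two failure modes concrete, here is a minimal C sketch, with invented names rather than code from the benchmark, of how an unchecked allocation size can overflow (CWE-190) and lead to an out-of-bounds write (CWE-787), together with one standard guard:

```c
/* Hypothetical example (not from SecRepoBench): an unchecked size
 * computation can wrap around SIZE_MAX, so malloc returns a buffer
 * smaller than intended and the subsequent writes overflow the heap. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

char *join_parts(const char **parts, size_t count, size_t part_len) {
    /* Insecure shape an LLM might emit:
     *   char *buf = malloc(count * part_len + 1);   // may overflow  */

    /* Hardened shape: reject the request if count * part_len + 1
     * would not fit in a size_t. */
    if (part_len != 0 && count > (SIZE_MAX - 1) / part_len)
        return NULL;
    char *buf = malloc(count * part_len + 1);
    if (buf == NULL)
        return NULL;

    for (size_t i = 0; i < count; i++)
        memcpy(buf + i * part_len, parts[i], part_len);
    buf[count * part_len] = '\0';
    return buf;
}
```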
Introduction
Background
Overview of large language models (LLMs) in software development
Importance of secure code generation in C/C++ repositories
Objective
To evaluate how well 19 LLMs generate correct, secure code in real-world, repository-level scenarios
Highlighting the challenges faced by LLMs in understanding program context and generating compilable, secure code
Method
Data Collection
Selection of real-world C/C++ repositories covering 15 Common Weakness Enumeration (CWE) categories
Gathering a diverse set of code generation tasks that reflect real-world challenges
Data Preprocessing
Cleaning and standardizing the collected data to ensure consistency and quality
Creating a benchmark dataset that accurately represents the complexity and context of real-world code generation tasks
Evaluation Framework
Metrics
Functional correctness (pass@1): the generated code compiles and passes the repository's unit tests
Security-aware correctness (secure-pass@1): the generated code passes the unit tests and does not contain the targeted vulnerability
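The page does not restate the paper's exact scoring protocol; as a point of reference, benchmarks in this line of work commonly reuse the unbiased pass@k estimator of Chen et al. (2021), and assuming SecRepoBench follows that convention, with n samples generated per task of which c pass all unit tests:

$$\mathrm{pass@}k \;=\; \mathop{\mathbb{E}}_{\text{tasks}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right], \qquad \mathrm{pass@}1 \;=\; \mathop{\mathbb{E}}_{\text{tasks}}\!\left[\frac{c}{n}\right]$$

secure-pass@1 has the same form, with c counting only samples that both pass the unit tests and are judged secure.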
Analysis
Detailed examination of LLM-generated code for common errors and issues
Identification of specific error types, such as references to non-existent struct members, undeclared identifiers, and incorrect code completions (illustrated in the sketch below)
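A minimal C sketch of these compilation-failure modes; the struct, member, and function names are invented for illustration and are not taken from the benchmark:

```c
/* Hypothetical repository context: the struct defines a member
 * named 'len', not 'length'. */
#include <stddef.h>

struct buffer {
    char  *data;
    size_t len;
};

size_t buffer_remaining(const struct buffer *b, size_t cap) {
    /* Context-unaware completions an LLM might emit:
     *   return cap - b->length;   // error: no member named 'length'
     *   return cap - buf_len(b);  // error: 'buf_len' undeclared
     */
    return cap - b->len;           /* compiles: uses the member that exists */
}
```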
Error Categories
Compilation failures, ranging from syntax errors to semantic errors such as undeclared identifiers and non-existent struct members
Logic errors such as missing conditional checks, flawed memory management, and unguarded integer arithmetic that can overflow (see the sketch below)
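And a minimal sketch of the missing-conditional-check category, again with hypothetical names; the insecure shape compiles and can pass happy-path unit tests, which is exactly why a security-aware metric such as secure-pass@1 is needed:

```c
/* Hypothetical example: copying a field into a fixed-size buffer.
 * Without the length check, an oversized src_len writes past the end
 * of dst (CWE-787, out-of-bounds write). */
#include <stddef.h>
#include <string.h>

int copy_field(char *dst, size_t dst_size, const char *src, size_t src_len) {
    /* Insecure shape an LLM might emit:
     *   memcpy(dst, src, src_len);
     *   dst[src_len] = '\0';
     */

    /* Secure shape: validate the length before writing. */
    if (dst_size == 0 || src_len >= dst_size)
        return -1;                 /* would not fit, including the NUL */
    memcpy(dst, src, src_len);
    dst[src_len] = '\0';
    return 0;
}
```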
Comparative Analysis
Comparison of LLM performance across different benchmarks and tasks
Highlighting the relative strengths and weaknesses of each model
Results
Compilation Success Rates
Overview of pass@1 scores for each LLM
Analysis of factors affecting compilation success
Secure Code Generation
Examination of secure-pass@1 scores and their implications for real-world security
Discussion of the models' ability to generate code that adheres to secure coding practices
Common Errors
Detailed analysis of the most frequent errors found in LLM-generated code
Insights into the types of errors that pose the greatest challenges for LLMs
Discussion
Challenges and Limitations
Discussion of the inherent difficulty LLMs have in generating correct, secure code
Analysis of the impact of program context on LLM performance
Future Directions
Recommendations for improving LLMs for compilability and secure code generation
Potential areas for further research to enhance LLM capabilities in real-world scenarios
Conclusion
Summary of Findings
Recap of the key insights and results from the SecRepoBench evaluation
Implications
Discussion of the broader implications for using LLMs in software development and security
Call to Action
Call for collaboration among researchers, developers, and industry to address the identified challenges and improve LLM performance in secure code generation
Basic info
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Insights
What are the common compilation errors found in LLM-generated code according to SecRepoBench?
How does SecRepoBench differ from previous benchmarks in evaluating LLMs for secure code generation?
What improvements are suggested for LLMs to enhance their secure code generation capabilities?
What are the key challenges faced by LLMs in generating secure code as identified by SecRepoBench?