FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving

Xiaohan Lin, Qingxing Cao, Yinya Huang, Haiming Wang, Jianqiao Lu, Zhengying Liu, Linqi Song, Xiaodan Liang·June 20, 2024

Summary

The paper presents FVEL, an interactive formal verification environment that combines large language models (LLMs) with Isabelle for automated theorem proving. FVEL addresses the limitations of traditional verifiers by pairing Isabelle's formal rules with LLMs' reasoning capabilities. The authors introduce FVELER, a large-scale dataset of code, dependencies, and verification processes, which is used to fine-tune LLMs for improved performance. Benchmarks show that FVEL with fine-tuned LLMs such as Mistral-7B and Llama-3-8B significantly improves problem-solving in SV-COMP, with higher accuracy and fewer proof errors. FVEL thus demonstrates the potential of LLMs in handling formal tasks, and FVELER provides a comprehensive benchmark for future research.


Paper digest

What problem does the paper attempt to solve? Is this a new problem?

The paper addresses the challenge of formally verifying programs, particularly in the context of program synthesis with large language models (LLMs), by proposing FVEL, an Interactive Formal Verification Environment with LLMs. It introduces a novel approach that combines the rigor of formal languages for automated theorem proving, such as Isabelle, with the reasoning capabilities of LLMs. Integrating LLMs into formal verification is a relatively new direction, leveraging the strengths of both traditional formal methods and cutting-edge language models.


What scientific hypothesis does this paper seek to validate?

This paper seeks to validate the hypothesis that evolving large language models (LLMs) can be leveraged for formal verification, particularly in the context of program synthesis, to overcome limitations of current formal verification methods. The proposed approach transforms code into Isabelle and conducts verification via neural automated theorem proving with an LLM, combining the rigor of Isabelle's rules with the flexibility of cutting-edge LLMs. The paper validates this approach by fine-tuning LLMs on a dataset called FVELER, evaluating them on Code2Inv and SV-COMP, and demonstrating improved problem-solving and reduced proof errors.
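The transform-then-prove loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `translate_to_isabelle`, `generate_proof`, and `check_proof` are hypothetical stand-ins for the C-to-Isabelle translation, the fine-tuned LLM, and the Isabelle proof checker, respectively.

```python
# Hypothetical sketch of an FVEL-style verification loop. All function names
# are illustrative stand-ins, not the paper's actual API.

def translate_to_isabelle(c_source: str) -> str:
    """Placeholder: in FVEL this step turns C code into an Isabelle theory."""
    # Here we simply wrap the program in a lemma-statement string.
    return f'lemma prog_safe: "spec_holds ({c_source!r})"'

def generate_proof(lemma: str) -> str:
    """Placeholder for the fine-tuned LLM proposing a proof script."""
    return "by auto"

def check_proof(lemma: str, proof: str) -> bool:
    """Placeholder for submitting the proof to Isabelle and reading the verdict."""
    return proof.startswith("by")

lemma = translate_to_isabelle("int inc(int x) { return x + 1; }")
proof = generate_proof(lemma)
verified = check_proof(lemma, proof)
```

In the real system, each of these placeholders is a substantial component; the sketch only shows how they compose into a pipeline.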


What new ideas, methods, or models does the paper propose? What are the characteristics and advantages compared to previous methods?

The paper "FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving" proposes several innovative ideas, methods, and models in the field of formal verification and theorem proving . Here are some key contributions outlined in the paper:

  1. Loop Invariant Inference through SMT Solving Enhanced Reinforcement Learning: work on inferring loop invariants using Satisfiability Modulo Theories (SMT) solving enhanced by reinforcement learning.

  2. Unified Efficient Fine-Tuning of Language Models: "LlamaFactory," a unified, efficient framework for fine-tuning over 100 language models, used for the paper's fine-tuning experiments.

  3. Low-Rank Adaptation of Large Language Models: "LoRA," a parameter-efficient method for adapting large language models.

  4. Uniform Synthesis of Theorem and Proof Data: "MUSTARD," a framework for mastering uniform synthesis of theorem and proof data.

  5. Process-Driven Autoformalization and Evol-Instruct: work on process-driven autoformalization in Lean 4, and "WizardCoder," which empowers code LLMs with Evol-Instruct.

  6. Automated Theorem Proving with Dynamic-Tree Sampling: "DT-Solver," an approach to automated theorem proving guided by a proof-level value function.

  7. Theorem Proving with Retrieval-Augmented Language Models: "LeanDojo," which combines theorem proving with retrieval-augmented language models.

  8. Neural Theorem Proving with Growing Libraries: "LEGO-Prover," a neural theorem proving approach that incorporates growing libraries.

These related ideas, methods, and models situate the paper within ongoing work on formal verification, theorem proving, and the integration of large language models into automated program verification. Compared to previous methods, FVEL has the following characteristics and advantages:

  1. Characteristics:

    • Integration of Language Models: FVEL leverages large language models to enhance formal verification, enabling advanced language models to be used for theorem proving.
    • Interactive Environment: FVEL provides an interactive formal verification environment that incorporates large language models, allowing for more efficient and effective verification.
    • Fine-Tuning and Training: Language models such as Mistral and Llama-3 are fine-tuned on the FVELER training set, improving the correctness of their proofs.
    • Challenges Addressed: FVEL addresses data scarcity in formal verification and the need for more extensive training data for language models to perform better on theorem proving tasks.
  2. Advantages Compared to Previous Methods:

    • Improved Correctness: The fine-tuned Mistral-7B and Llama-3-8B models in FVEL significantly improve proof correctness over previous methods, raising pass rates by 4.5% to 8.1% on the FVELER test split.
    • Interactive Correction: FVEL interacts with tools like PISA and self-corrects based on the error messages returned during verification, improving overall verification accuracy.
    • Efficiency in Verification: Integrating large language models lets FVEL perform better on formal verification tasks, especially on complex looping conditions and challenging benchmarks like SV-COMP-47.
    • Comparison with Symbolic Solvers: FVEL has advantages over symbolic solvers like UAUTOMIZER and ESBMC, providing a witness during verification and effectively verifying loop properties and recursive functions.

In summary, the characteristics of FVEL, such as its integration of language models, interactive environment, and fine-tuning processes, offer distinct advantages over previous methods by improving correctness, efficiency, and interaction in the formal verification and theorem proving domain.
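The interactive correction described above (propose a proof, submit it, read the error, retry) can be sketched as follows. `ProofChecker` is a toy stand-in for the PISA interface, and the error-message format and retry policy are illustrative assumptions, not FVEL's actual implementation.

```python
# Hedged sketch of an error-driven self-correction loop for proof search.
from typing import Optional

class ProofChecker:
    """Toy checker standing in for PISA: accepts only the proof script 'by simp'."""
    def check(self, proof: str) -> Optional[str]:
        # Return None on success, or an error message on failure.
        return None if proof == "by simp" else f"Failed to apply: {proof}"

def prove_with_self_correction(candidates, checker, max_rounds=3):
    """Try candidate proofs in order, collecting error feedback along the way."""
    errors = []
    for proof in candidates[:max_rounds]:
        err = checker.check(proof)
        if err is None:
            return proof, errors
        # In FVEL, this error message would condition the LLM's next attempt;
        # here we only record it.
        errors.append(err)
    return None, errors

proof, errors = prove_with_self_correction(["by auto", "by simp"], ProofChecker())
```

The key design point is that failed attempts are not discarded: each checker error becomes extra context for the next proof attempt.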


Does any related research exist? Who are the noteworthy researchers in this field? What is the key to the solution mentioned in the paper?

In the field of interactive formal verification with large language models via theorem proving, several related works have been conducted by noteworthy researchers, including:

  • Albert Qiaochu Jiang
  • Wenda Li
  • Jesse Michael Han
  • Yuhuai Wu
  • Adharsh Kamath
  • Aditya Senthilnathan
  • Saikat Chakraborty
  • Pantazis Deligiannis
  • Shuvendu K. Lahiri
  • Akash Lal
  • Aseem Rastogi
  • Subhajit Roy
  • Rahul Sharma
  • Gerwin Klein
  • Kevin Elphinstone
  • Gernot Heiser
  • June Andronick
  • David A. Cock
  • Philip Derrin
  • Dhammika Elkaduwe
  • Kai Engelhardt
  • Rafal Kolanski
  • Michael Norrish
  • Thomas Sewell
  • Harvey Tuch
  • Simon Winwood

The key to the solution is the development of an interactive formal verification environment that leverages large language models via theorem proving. FVEL integrates language models with automated theorem provers to enhance the formal verification process. The study highlights the effectiveness of fine-tuning language models on the FVELER dataset to improve accuracy on theorem proving and verification tasks such as SV-COMP. It also emphasizes the importance of generating accurate lemma specifications and addressing validation errors to ensure the correctness of the verification process.
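To make "accurate lemma specifications" concrete, the toy example below pairs a C function with a hypothetical Isabelle-style lemma and applies a minimal sanity check that the lemma actually mentions the function it specifies. The function, lemma, and heuristic are all illustrative assumptions, not taken from the paper.

```python
# Illustrative only: a toy C function with a candidate lemma specification,
# and a rough consistency check between the two. All names are hypothetical.

c_function = "int my_abs(int x) { return x < 0 ? -x : x; }"
lemma_spec = 'lemma my_abs_nonneg: "my_abs x >= 0"'

def lemma_mentions_function(lemma: str, c_src: str) -> bool:
    # Extract the function name from the C signature (very rough heuristic:
    # the last token before the opening parenthesis).
    name = c_src.split("(")[0].split()[-1]
    return name in lemma

ok = lemma_mentions_function(lemma_spec, c_function)
```

A real system would need far stronger semantic alignment between lemma statements and program behavior; the paper lists exactly that alignment as a direction for future work.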


How were the experiments in the paper designed?

The experiments fine-tune Mistral and Llama-3 models on the FVELER training set and test them on the FVELER test and test-hard sets. The fine-tuned Mistral-7B and Llama-3-8B models significantly improve proof correctness: on the FVELER test split, Mistral-7B improves by 4.5% and Llama-3-8B by 8.1%; on the harder FVELER test-hard split, Mistral-7B improves by 5.8% and Llama-3-8B by 7.5%. The experiments use the Mistral-7B-Instruct-v0.2 and Llama-3-8B-Instruct models on the FVELER test sets. The models were fine-tuned with the LLaMA-Factory framework, with training samples transformed into the alpaca format under specific length-filtering and batch-size settings.
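A fine-tuning sample in the alpaca format might look like the sketch below. The field names follow the common alpaca convention (`instruction`, `input`, `output`); the instruction wording and the length filter shown here are assumptions for illustration, not the paper's actual settings.

```python
# Hypothetical construction of an alpaca-format training sample for
# theorem-proving fine-tuning. The length threshold is an assumed stand-in
# for the paper's unspecified length filtering.
import json
from typing import Optional

MAX_LEN = 2048  # assumed filter, measured in serialized characters here

def make_alpaca_sample(theorem: str, proof: str) -> Optional[dict]:
    sample = {
        "instruction": "Complete the following Isabelle proof.",
        "input": theorem,
        "output": proof,
    }
    # Drop overlong samples, mirroring the length filtering mentioned above.
    if len(json.dumps(sample)) > MAX_LEN:
        return None
    return sample

sample = make_alpaca_sample('lemma add_comm: "a + b = b + (a::nat)"', "by simp")
```

Frameworks such as LLaMA-Factory consume JSON files of such records directly, which is why the paper's proof states and proof steps are flattened into this shape before training.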


What is the dataset used for quantitative evaluation? Is the code open source?

The datasets used for quantitative evaluation in the FVELER benchmark are Code2Inv, which contains 133 C programs, and SV-COMP, which comprises over 23,000 C programs. The code for data extraction and experiments, with corresponding documentation, is open source at https://github.com/FVELER/FVEL.


Do the experiments and results in the paper provide good support for the scientific hypotheses that need to be verified? Please analyze.

The experiments and results provide substantial support for the paper's hypotheses. The paper introduces FVEL, an interactive formal verification environment with large language models via theorem proving, which aims to enhance formal verification using LLMs and automated theorem proving. The experiments fine-tune LLMs on the FVELER dataset and evaluate them on Code2Inv and SV-COMP. The results show significant gains in problem solving: the FVEL fine-tuned Llama-3-8B solves 17.39% more problems and Mistral-7B solves 12% more problems on SV-COMP. The proportion of proof errors is also reduced, indicating the effectiveness of the approach.

Furthermore, the paper discusses the limitations of current formal verification methods, such as symbolic verifiers and hand-crafted rules, in leveraging the advanced reasoning abilities of modern LLMs. By integrating a rigorous formal language like Isabelle with LLMs, FVEL bridges the gap between traditional verification methods and cutting-edge language models, offering a more comprehensive and flexible approach. Automated theorem proving in a formal language provides a solid foundation for rigorous verification.

In conclusion, the experiments and results offer strong support for the hypothesis that formal verification can be enhanced with large language models and automated theorem proving. The improvements in problem solving and the reduction in proof errors demonstrate the effectiveness of FVEL in advancing the field.


What are the contributions of this paper?

The contributions of the paper include:

  • Interactive Formal Verification Environment: The paper introduces FVEL, an interactive formal verification environment that utilizes large language models via theorem proving.
  • FVELER Benchmark: It presents FVELER, a benchmark for the environment, with detailed construction steps, splits, statistics, and distributions.
  • Formal Verification Results: The study reports formal verification results achieved with the proposed methods within the environment.
  • Ablation Study: An ablation study analyzes the effectiveness and impact of the different components of the formal verification process.
  • Positioning among Related Systems: The paper situates FVEL relative to related systems such as DT-Solver (dynamic-tree sampling guided by a proof-level value function), Lemur (large language models for automated program verification), and LeanDojo (theorem proving with retrieval-augmented language models).

What work can be continued in depth?

Future work includes further testing FVEL on C code verification, extending FVEL and FVELER to support more programming languages, and exploring the semantic alignment between lemma statements and program specifications. More broadly, advances in formal verification, automated theorem proving, AI for math, and software engineering can further enhance the capabilities of large language models in formal verification, contributing to more reliable software development.


Introduction
Background
Evolution of formal verification tools
Limitations of traditional verifiers
Objective
Introducing FVEL: LLMs + Isabelle integration
Aim to enhance verification capabilities
Method
Data Collection and Preparation
FVELER Dataset
Creation of FVELER: Code, dependencies, and verification processes
Scale and diversity of the dataset
Fine-Tuning Large Language Models
LLM Selection
Mistral-7B and Llama3-8B as fine-tuned models
Model Training
Integration of Isabelle formal rules into LLM training
FVEL: LLM-Isabelle Integration
Automated Theorem Proving
LLM-assisted proof generation
Handling formal rules and reasoning
Enhanced Verification Capabilities
Improved problem-solving in SV-COMP
Benchmarks and performance metrics
Results and Evaluation
Benchmarking
SV-COMP: Accuracy and proof error reduction
Comparison with traditional verifiers
Impact on Formal Verification
Advancements in handling formal tasks
Potential for future research
Conclusion
Contribution of FVEL to formal verification landscape
Opportunities and challenges with LLM integration
Directions for future development in the field
Basic info

Categories: Computation and Language, Machine Learning, Artificial Intelligence
