Model Checking Using Large Language Models—Evaluation and Future Directions

Sotiris Batsakis, Ilias Tachmazidis, Matthew Mantle, Nikolaos Papadakis, Grigoris Antoniou

Research output: Contribution to journal › Article › peer-review

Abstract

Large language models (LLMs) such as ChatGPT have recently risen to prominence, creating a need to analyze their strengths and limitations across tasks. The objective of this work was to evaluate the performance of large language models on model checking, which is used extensively in critical tasks such as software and hardware verification. A set of problems was proposed as a benchmark, and three LLMs (GPT-4, Claude, and Gemini) were evaluated on their ability to solve them. The evaluation was conducted by comparing the responses of the three LLMs against the gold standard produced by model checking tools. The results illustrate the limitations of LLMs on these tasks and identify directions for future research. Specifically, the best overall performance (the ratio of problems solved correctly) was 60%, indicating a high probability of reasoning errors by the LLMs, especially in more complex scenarios requiring many reasoning steps. The LLMs also typically performed better when generating scripts for solving the problems than when solving them directly.
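The abstract describes comparing LLM answers against the exact results produced by model checking tools. As a purely illustrative sketch of what such a tool computes (this is not code from the paper; the two-process turn-based protocol, the safety property, and all identifiers below are assumptions chosen for illustration), the following minimal explicit-state model checker in Python exhaustively explores the reachable states of a mutual-exclusion protocol and checks that both processes are never in the critical section simultaneously:

from collections import deque

# Each state: (loc0, loc1, turn), where each loc is "idle", "wait", or "crit".
INITIAL = ("idle", "idle", 0)

def successors(state):
    loc0, loc1, turn = state
    locs = [loc0, loc1]
    for p in (0, 1):
        other = 1 - p
        if locs[p] == "idle":                   # request entry to critical section
            nxt = locs.copy(); nxt[p] = "wait"
            yield (nxt[0], nxt[1], turn)
        elif locs[p] == "wait" and turn == p:   # enter only when it is our turn
            nxt = locs.copy(); nxt[p] = "crit"
            yield (nxt[0], nxt[1], turn)
        elif locs[p] == "crit":                 # leave and hand the turn over
            nxt = locs.copy(); nxt[p] = "idle"
            yield (nxt[0], nxt[1], other)

def safe(state):
    # Safety property: mutual exclusion (never both in "crit").
    return not (state[0] == "crit" and state[1] == "crit")

def check():
    seen, frontier = {INITIAL}, deque([INITIAL])
    while frontier:                             # breadth-first state-space search
        state = frontier.popleft()
        if not safe(state):
            return f"violation in state {state}"
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return f"property holds over {len(seen)} reachable states"

if __name__ == "__main__":
    print(check())

Dedicated tools such as NuSMV perform this kind of analysis symbolically over far larger state spaces and support full temporal-logic specifications; exhaustive exploration of this sort is what makes their verdicts a reliable gold standard against which LLM answers can be scored.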

Original language: English
Article number: 401
Number of pages: 34
Journal: Electronics (Switzerland)
Volume: 14
Issue number: 2
DOIs
Publication status: Published - 20 Jan 2025
