TY - JOUR
T1 - Model Checking Using Large Language Models—Evaluation and Future Directions
AU - Batsakis, Sotiris
AU - Tachmazidis, Ilias
AU - Mantle, Matthew
AU - Papadakis, Nikolaos
AU - Antoniou, Grigoris
N1 - Publisher Copyright:
© 2025 by the authors.
PY - 2025/1/20
Y1 - 2025/1/20
N2 - Large language models (LLMs) such as ChatGPT have recently risen in prominence, creating a need to analyze their strengths and limitations across various tasks. The objective of this work was to evaluate the performance of large language models for model checking, which is used extensively in critical tasks such as software and hardware verification. A set of problems was proposed as a benchmark in this work, and three LLMs (GPT-4, Claude, and Gemini) were evaluated with respect to their ability to solve these problems. The evaluation was conducted by comparing the responses of the three LLMs with the gold standard provided by model checking tools. The results illustrate the limitations of LLMs in these tasks and identify directions for future research. Specifically, the best overall performance (the ratio of problems solved correctly) was 60%, indicating a high probability of reasoning errors by the LLMs, especially in more complex scenarios requiring many reasoning steps. The LLMs typically performed better when generating scripts to solve the problems than when solving them directly.
AB - Large language models (LLMs) such as ChatGPT have recently risen in prominence, creating a need to analyze their strengths and limitations across various tasks. The objective of this work was to evaluate the performance of large language models for model checking, which is used extensively in critical tasks such as software and hardware verification. A set of problems was proposed as a benchmark in this work, and three LLMs (GPT-4, Claude, and Gemini) were evaluated with respect to their ability to solve these problems. The evaluation was conducted by comparing the responses of the three LLMs with the gold standard provided by model checking tools. The results illustrate the limitations of LLMs in these tasks and identify directions for future research. Specifically, the best overall performance (the ratio of problems solved correctly) was 60%, indicating a high probability of reasoning errors by the LLMs, especially in more complex scenarios requiring many reasoning steps. The LLMs typically performed better when generating scripts to solve the problems than when solving them directly.
KW - large language models
KW - model checking
KW - non-monotonic reasoning
UR - http://www.scopus.com/inward/record.url?scp=85216080701&partnerID=8YFLogxK
U2 - 10.3390/electronics14020401
DO - 10.3390/electronics14020401
M3 - Article
AN - SCOPUS:85216080701
VL - 14
JO - Electronics (Switzerland)
JF - Electronics (Switzerland)
SN - 2079-9292
IS - 2
M1 - 401
ER -