TY - JOUR
T1 - Document Retrieval Augmented Fine-Tuning (DRAFT) for safety-critical software assessment
AU - Bolton, Regan
AU - Sheikhfathollahi, Mohammadreza
AU - Parkinson, Simon
AU - Vulovic, Vanessa
AU - Bamford, Gary
AU - Basher, Dan
AU - Parkinson, Howard
PY - 2026/1/15
Y1 - 2026/1/15
N2 - Evaluating safety-critical software requires robust assessment against complex regulatory frameworks, a process traditionally limited by manual review. This paper presents Document Retrieval-Augmented Fine-Tuning (DRAFT), a novel approach that enhances the capabilities of a large language model (LLM) for safety-critical compliance assessment. DRAFT builds upon existing Retrieval-Augmented Generation (RAG) techniques by introducing a fine-tuning framework that accommodates our dual-retrieval architecture, which simultaneously accesses both software documentation and applicable reference standards. To fine-tune DRAFT, we develop a semi-automated dataset generation methodology that incorporates variable numbers of relevant documents with meaningful distractors, closely mirroring real-world assessment scenarios. Experiments with GPT-4o-mini demonstrate an improvement in correctness over the baseline model, with qualitative improvements in evidence handling, response structure, and domain-specific reasoning. DRAFT represents a practical approach to improving compliance assessment systems while maintaining the transparency and evidence-based reasoning essential in regulatory domains.
KW - Operational Technology
KW - Software Assessments
KW - Large Language Models
KW - Retrieval-Augmented Generation
UR - https://www.scopus.com/pages/publications/105027670660
U2 - 10.1109/ACCESS.2026.3651717
DO - 10.1109/ACCESS.2026.3651717
M3 - Article
SN - 2169-3536
VL - 14
SP - 7152
EP - 7163
JO - IEEE Access
JF - IEEE Access
M1 - 11338754
ER -