Large Language Model Performance in UK Advice & Guidance: A Pilot Study in Neurology

AI Insight

This pilot study evaluated the performance of MedGemma 4B-IT, a locally deployed open-weight large language model (LLM), against specialist neurologist responses across 50 real-world NHS Advice & Guidance cases in adult neurology. When responses were assessed blindly, no statistically significant differences were found between the LLM and human specialists in outcome accuracy (82% vs 84%), safety, efficacy, or feasibility scores. However, 10% of LLM responses received concerningly low scores compared with 0% of human responses, revealing a clinically meaningful tail risk that aggregate statistics obscured. Human responses were also consistently preferred when raters were unblinded.


While LLMs show surface-level parity with specialists in routine clinical queries, the presence of unpredictable low-quality responses poses a patient safety concern that would need to be resolved before any deployment in real clinical workflows such as the NHS Advice & Guidance service.


⚠️ Preprint: not yet peer-reviewed

This article has not yet been reviewed by independent experts. The results are preliminary and should be interpreted with caution.

Background: Large language models (LLMs) demonstrate strong performance in controlled medical environments such as multiple-choice exams, but their utility in real-world clinical workflows remains unproven. The NHS Advice & Guidance (A&G) service, where primary care clinicians can submit text-based queries to specialists, provides an environment for evaluating the clinical performance of LLMs in a specialist role.

Methods: We compared responses from MedGemma 4B-IT, an open-weight model deployed locally on hospital infrastructure, against specialist neurologist responses across 50 adult neurology A&G cases from University College London Hospital. Two neurologists and two GPs rated 80 blinded and 20 unblinded responses for outcome, safety, efficacy, and feasibility using standardised criteria; outcome was scored as binary correct/incorrect, while the other domains were scored 1-5. Inter-rater reliability was assessed using intraclass correlation coefficients (ICC).

Results: There were no statistically significant differences between blinded specialist neurologist and LLM responses in any domain (outcome: 84% vs 82%, p=0.67; safety: 3.98 vs 4.02, p=0.85; efficacy: 4.06 vs 3.98, p=0.61; feasibility: 4.39 vs 4.20, p=0.45). However, 10% of LLM responses received concerning scores (average score ≤2) compared with 0% of human responses, indicating a potentially clinically important tail risk. Furthermore, unblinded ratings favoured human responses across all domains. Only 51% of binary outcomes had unanimous agreement, and inter-rater agreement was moderate across the other domains (ICC 0.50-0.52).

Conclusions: In this pilot study, aggregate scores for blinded human and LLM responses were similar, and no statistically significant differences were detected in this exploratory sample. However, aggregate metrics masked clinically important edge-case failures in LLM responses. Pronounced inter-rater variability and the potential impact of LLM/human syntax on blinded rater judgements highlight the challenges in establishing robust evaluation frameworks for clinical LLM deployment.
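The point that similar averages can hide a low-scoring tail can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function name `summarise`, the variable names, and the toy ratings are hypothetical and are not study data. The only detail taken from the abstract is the threshold of an average score of 2 or below on the 1-5 scale.

```python
from statistics import mean


def summarise(ratings_per_response, threshold=2.0):
    """Return (overall mean, fraction of responses whose mean rating <= threshold).

    ratings_per_response: one inner list of 1-5 rater scores per response.
    threshold: cut-off for a "concerning" response (assumed <= 2, as in the abstract).
    """
    per_response = [mean(r) for r in ratings_per_response]
    overall = mean(per_response)
    tail = sum(1 for m in per_response if m <= threshold) / len(per_response)
    return overall, tail


# Hypothetical toy ratings (NOT study data): four raters per response.
llm_like = [[5, 4, 5, 4]] * 9 + [[1, 2, 2, 1]]  # nine good responses, one poor
human_like = [[4, 4, 5, 4]] * 10                # uniformly good responses

overall_llm, tail_llm = summarise(llm_like)
overall_human, tail_human = summarise(human_like)

print(f"LLM-like:   mean={overall_llm:.2f}, share <= 2: {tail_llm:.0%}")
print(f"Human-like: mean={overall_human:.2f}, share <= 2: {tail_human:.0%}")
```

In this toy example the two groups have nearly identical means (4.20 and 4.25), yet only the first contains a response averaging 2 or below. Reporting a tail proportion alongside the mean makes that difference visible.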

Source: Large Language Model Performance in UK Advice & Guidance: A Pilot Study in Neurology