AI Insight
This study evaluated eight embedding models on two clinical NLP tasks: detecting semantic differences in synthetic discharge summaries (English, Finnish, Swedish) and retrieving relevant data from real Finnish vascular surgery records. Content deletion and modification produced larger increases in vector distance than paraphrasing, and model choice had a clear effect on performance: Qwen3-Embedding-8B made zero directional errors in semantic change detection, while multilingual-E5-large erred in 13.8% of cases. For some models, as few as 0.6–1.2% of vector dimensions were enough to replicate full-vector accuracy, a result that neither principal component analysis nor coordinate-level analysis accounted for.
Why it matters
Embedding models are foundational components of clinical AI systems such as retrieval-augmented generation and automated record analysis. This work shows that model selection can determine whether clinically relevant information reaches healthcare providers, and that model weaknesses can be both task-specific and context-dependent.
⚠️ Preprint – Not yet peer-reviewed
This article has not yet been reviewed by independent experts. The results are preliminary and should be interpreted with caution.
Background: Embedding models are an integral part of generative AI architectures, transforming text into embedding vectors that represent semantic content in numerical form. Despite their central role, their performance in clinical settings remains underexplored. We evaluate embedding models across two tasks: semantic difference detection in clinical texts, and data retrieval from patient records.
Methods: Eight models were applied to synthetic discharge summaries in English, Finnish, and Swedish. Semantic sensitivity was assessed by introducing controlled perturbations (deletion, modification, and paraphrasing) at three levels of severity; cosine similarity, and L1 and Euclidean distances were computed between the vectors of the original and perturbed texts. Partial vectors were compared to explore dimensionality reduction. Two models with the biggest contrast in semantic difference detection were evaluated on retrieval of relevant information from real Finnish vascular surgery records.
Results: Embedding vectors captured semantic differences in clinical text: content deletion and modification produced larger increases in vector distance than paraphrasing. On average, models detected the direction of semantic change correctly, but case-level performance varied considerably. Qwen3-Embedding-8B was the only model with zero directional errors, while multilingual-E5-large erred in 13.8% of cases. In data retrieval, Qwen3-Embedding-8B again outperformed multilingual-E5-large, though the margin was narrower: sufficiency scores were 3.25 vs. 3.17 out of 5 for the first query and 2.25 vs. 1.15 out of 5 for the second query. For some models, as few as 0.6–1.2% of dimensions sufficed to replicate full-vector accuracy; principal component analysis and coordinate-level analysis did not account for this finding.
Conclusions: Our results show that the choice of embedding model is important: performance differences between models can be large enough to determine whether clinically relevant information reaches the end user, and model weaknesses can be both task-specific and context-dependent.
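The three measures named in the Methods (cosine similarity, L1 distance, and Euclidean distance) and the idea of comparing partial vectors can be illustrated with a minimal Python sketch. This is not the authors' code. The function names, the choice to form a partial vector from the first k coordinates, and the toy vectors are all assumptions made for illustration; the abstract does not say how dimensions were selected, and real embeddings would come from one of the evaluated models.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def _check_same_length(a: Vector, b: Vector) -> None:
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between a and b (1.0 means same direction)."""
    _check_same_length(a, b)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def l1_distance(a: Vector, b: Vector) -> float:
    """Sum of absolute coordinate differences (Manhattan distance)."""
    _check_same_length(a, b)
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean_distance(a: Vector, b: Vector) -> float:
    """Square root of the sum of squared coordinate differences."""
    _check_same_length(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def truncate(v: Vector, k: int) -> list:
    """Hypothetical partial vector: keep only the first k coordinates.

    Assumption for illustration; the study may select dimensions differently.
    """
    return list(v[:k])


# Toy 4-dimensional vectors, invented for illustration (not data from the study).
ORIGINAL = [0.9, 0.1, 0.3, 0.2]      # embedding of an original text
PARAPHRASE = [0.88, 0.12, 0.31, 0.2]  # meaning preserved, wording changed
DELETION = [0.5, 0.4, 0.1, 0.6]       # content removed

if __name__ == "__main__":
    for name, vec in (("paraphrase", PARAPHRASE), ("deletion", DELETION)):
        print(
            name,
            "cos:", cosine_similarity(ORIGINAL, vec),
            "L1:", l1_distance(ORIGINAL, vec),
            "L2:", euclidean_distance(ORIGINAL, vec),
        )

    # Repeat the comparison on a partial vector (first 2 of 4 dimensions).
    k = 2
    for name, vec in (("paraphrase", PARAPHRASE), ("deletion", DELETION)):
        print(
            name, f"first {k} dims, L2:",
            euclidean_distance(truncate(ORIGINAL, k), truncate(vec, k)),
        )
```

In this toy setup the deletion vector lies farther from the original than the paraphrase vector under all three measures, which mirrors the qualitative pattern reported in the Results. Whether a small subset of dimensions preserves that ordering for real embeddings is the empirical question the study examines.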