Automated Macrolinguistic Discourse Analysis for Transdiagnostic Detection of Language Impairments

medRxiv (Preprint) 21 May 2026

AI Insight

Researchers developed an automated pipeline for macrolinguistic discourse analysis that combines automatic speech recognition, utterance segmentation, sentence-level embeddings, and rule-based coherence classification to assess speech in individuals with aphasia and dementia. Applied to story retellings from 309 participants, the system identified main concepts with 83% accuracy and extracted features including semantic distance to a narrative centroid, main concept coverage, and coherence error rates. Logistic regression classifiers trained on these features distinguished aphasia from controls with high accuracy (AUC approximately 0.94), though separation between dementia and other groups was considerably weaker (AUC approximately 0.58 to 0.66).

Why it matters

Manual discourse analysis is time-consuming and difficult to scale in clinical settings, so an automated pipeline could support faster, more accessible screening and longitudinal monitoring of communication disorders in neurological populations. However, the limited diagnostic separation between dementia and other groups suggests the tool requires further refinement before it can reliably serve transdiagnostic clinical purposes.

Confidence

6/10 | Preprint, not yet peer-reviewed | Medicine

⚠️ Preprint: not yet peer-reviewed

This article has not yet been reviewed by independent experts. The results are preliminary and should be interpreted with caution.

Macrolinguistic discourse analysis offers valuable insight into how patients with neurogenic communication disorders organize and produce informative speech, yet it remains a largely manual and labor-intensive process. We report an automated pipeline for macrolinguistic discourse analysis in individuals with aphasia and dementia that integrates automatic speech recognition (ASR), utterance segmentation, sentence-level embeddings, centroid-based main-concept matching, and rule-based coherence error classification. The pipeline was applied to Cinderella story retellings from 309 participants: 113 controls, 102 people with post-stroke aphasia (PWA), and 94 people with dementia. It reliably identified main concepts (83% accuracy against human labels) and derived interpretable features such as semantic distance to a main-concept centroid, main-concept coverage, and coherence error rates. Logistic-regression classifiers trained on 10 macrolinguistic features distinguished aphasia from controls with high accuracy (AUC ≈ 0.94) but showed weaker separation for dementia (controls vs. dementia AUC ≈ 0.66; aphasia vs. dementia AUC ≈ 0.58). Semantic distance to the centroid emerged as a robust, informative predictor for diagnostic classification, indicating that the ability to produce narrative-aligned speech is clinically important. The automated pipeline enables scalable macrolinguistic discourse analysis that could support screening and longitudinal monitoring of discourse impairments across neurogenic populations.
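
To make the centroid-based features concrete, the following is a minimal sketch of how main-concept coverage and semantic distance to a centroid could be computed. It is not the authors' implementation. The toy 3-dimensional vectors stand in for real sentence embeddings, and the concept names, the function `macro_features`, and the similarity threshold of 0.8 are all hypothetical choices made for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def macro_features(utterances, concept_refs, threshold=0.8):
    """Compute two illustrative macrolinguistic features for one retelling.

    utterances:   list of utterance embeddings (assumed to be precomputed).
    concept_refs: dict mapping a main-concept name to reference embeddings.
    threshold:    hypothetical similarity cutoff for counting a match.
    """
    if not utterances:
        raise ValueError("at least one utterance embedding is required")

    # One centroid per main concept, plus one centroid for the whole narrative.
    concept_centroids = {name: centroid(refs) for name, refs in concept_refs.items()}
    narrative_centroid = centroid(list(concept_centroids.values()))

    # An utterance covers its best-matching concept if it is similar enough.
    covered = set()
    for u in utterances:
        best_name, best_sim = max(
            ((name, cosine_similarity(u, c)) for name, c in concept_centroids.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            covered.add(best_name)

    # Semantic distance here is cosine distance to the narrative centroid.
    distances = [1.0 - cosine_similarity(u, narrative_centroid) for u in utterances]

    return {
        "coverage": len(covered) / len(concept_centroids),
        "mean_distance": sum(distances) / len(distances),
        "covered": sorted(covered),
    }


# Toy reference embeddings for two hypothetical Cinderella main concepts.
concept_refs = {
    "goes_to_ball": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "loses_slipper": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
}

# A retelling that stays on the story versus one that drifts off topic.
on_topic = macro_features([[1.0, 0.05, 0.0], [0.0, 1.0, 0.1]], concept_refs)
off_topic = macro_features([[0.0, 0.0, 1.0]], concept_refs)
```

In this toy setup the on-topic retelling covers both concepts and stays closer to the narrative centroid than the off-topic one. In a fuller pipeline, per-participant features of this kind would be the inputs to a classifier such as logistic regression, as the abstract describes.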

Source: Automated Macrolinguistic Discourse Analysis for Transdiagnostic Detection of Language Impairments