cadmus: a robust pipeline for scalable retrieval of full-text biomedical literature

bioRxiv (Preprint), 19 May 2026

AI Insight

cadmus is an open-source Python toolkit that automates the retrieval and processing of full-text biomedical literature from multiple sources, including PubMed, Crossref, and Europe PMC. In a large-scale test on 204,043 publications about developmental disorders, the tool achieved a full-text retrieval rate of 85.2% with institutional access and 54.4% without. Retrieved documents closely matched their originals: across 44,264 open-access PubMed Central files, the mean cosine similarity between the reference file and the cadmus-retrieved file was 0.98. Rarefaction analyses indicate that full-text corpora capture roughly twice as many unique biomedical concepts as abstract-only corpora, a substantial informational advantage.
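
The rarefaction comparison mentioned above can be illustrated with a minimal sketch. It assumes each document has already been reduced to a set of concept identifiers; the function name, the toy identifiers, and the permutation count are hypothetical and are not taken from cadmus or the preprint.

```python
import random


def rarefaction_curve(docs, steps, n_permutations=20, seed=0):
    """Mean number of unique concepts seen after sampling k documents.

    docs: list of sets of concept identifiers, one set per document.
    steps: sample sizes k, each between 0 and len(docs).
    """
    rng = random.Random(seed)
    steps = sorted(steps)
    totals = {k: 0 for k in steps}
    for _ in range(n_permutations):
        order = docs[:]
        rng.shuffle(order)  # random document order for this permutation
        seen = set()
        idx = 0
        for k in steps:
            # Extend the sample to k documents, accumulating concepts.
            while idx < k:
                seen |= order[idx]
                idx += 1
            totals[k] += len(seen)
    return {k: totals[k] / n_permutations for k in steps}


# Toy data (hypothetical): the same three papers as abstracts and as full texts.
abstracts = [{"C1"}, {"C2"}, {"C1"}]
full_texts = [{"C1", "C3"}, {"C2", "C4"}, {"C1", "C5"}]

abstract_curve = rarefaction_curve(abstracts, steps=[0, 1, 2, 3])
full_text_curve = rarefaction_curve(full_texts, steps=[0, 1, 2, 3])
```

Comparing the two curves at the same sample size shows how many more unique concepts the full-text version of a corpus exposes than the abstract-only version.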

Why it matters

This tool substantially lowers the barrier to constructing large, domain-specific biomedical text corpora, which can accelerate natural language processing, literature mining, and knowledge discovery in biomedical research. Improved access to full-text content over abstracts alone could meaningfully enhance the quality of downstream analyses in fields such as genomics, clinical research, and drug discovery.

Confidence

6/10 | Preprint, not yet peer-reviewed | Biology

⚠️ Preprint: not yet peer-reviewed

This article has not yet been reviewed by independent experts. The results are preliminary and should be interpreted with caution.

cadmus is an open-source Python toolkit for automated retrieval and processing of full-text biomedical literature. It utilises programmatic access to PubMed, Crossref, Europe PMC, PMC, and publisher APIs, allowing users to construct large, domain-specific corpora with minimal manual intervention. cadmus parses PDF, HTML, XML, and plain text files, standardising them for downstream biomedical text mining. During the retrieval of a Developmental Disorders Corpus (204,043 publications), it achieved an 85.2% full-text retrieval rate with institutional subscriptions and 54.4% without. To test the fidelity of retrieved full-texts, we used ScispaCy to infer the similarity of paired documents from 44,264 open-access PubMed Central files and the files retrieved from cadmus, resulting in an average cosine similarity score of 0.98. Rarefaction analyses demonstrated that full-text corpora double the coverage of unique biomedical concepts over abstracts, resulting in better access to the depth of the biomedical information available.
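
The fidelity check described above scores each pair of documents by cosine similarity and then averages the scores. The sketch below illustrates the metric only. It swaps ScispaCy's document vectors for plain bag-of-words counts (an assumption made to keep the example dependency-free), and all function names are hypothetical rather than part of cadmus.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens.

    Stand-in for an embedding model; not what the preprint used.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(pairs) -> float:
    """Average cosine similarity over (reference, retrieved) text pairs."""
    scores = [
        cosine_similarity(bag_of_words(ref), bag_of_words(got))
        for ref, got in pairs
    ]
    return sum(scores) / len(scores)


# Toy pairs (hypothetical): a reference text and the retrieved version of it.
pairs = [
    ("Gene variants cause developmental delay.",
     "Gene variants cause developmental delay."),
    ("Seizures were reported in two patients.",
     "Seizures were reported in two patients"),
]
mean_score = mean_pairwise_similarity(pairs)
```

A mean score close to 1 indicates that the retrieved texts are nearly identical in content to their reference versions.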

Source: cadmus: a robust pipeline for scalable retrieval of full-text biomedical literature
