Biology

New Protein Function Characterization for Human Paralog Discovery, Scraping the Bottom of the Genomics Barrel

AI Insight

Researchers developed a computational framework that combines multiple protein analysis methods (sequence-based, structure-based, and embedding-based) with machine learning models to identify previously uncharacterized protein paralogs in the human genome. Applied to well-studied protein families (proteases and kinases), the approach identified 7 new protease and 3 new kinase paralogs that had lacked proper functional annotation. The method achieved high classification accuracy with ROC-AUC of 0.99 and F1-score of 0.92 on test datasets.


This integrated approach provides a systematic way to discover remaining uncharacterized proteins even in well-studied organisms like humans, which could help complete our understanding of the human proteome. The generalizable methodology can be applied to less-studied organisms and protein families, potentially accelerating functional annotation of proteins across biology.


⚠️ Preprint: Not yet peer-reviewed

This article has not yet been reviewed by independent experts. The results are preliminary and should be interpreted with caution.

Increasing the number of known related protein paralogs is important for fully understanding protein relationships, yet detection remains challenging for sequences in the twilight zone of low sequence similarity. Here, we present an integrated homolog detection framework that combines sequence-based (BLASTp, MMseqs2), structure-based (Foldseek), and embedding-distance-based (PROST) similarity metrics to identify additional paralogs. To characterize functionally related protein pairs, we develop protein-family-specific supervised logistic regression models trained on curated functional annotations from MEROPS proteases and KinHub kinases. The resulting model successfully classifies proteins, with a ROC-AUC of 0.99 and an F1-score of 0.92 on test datasets. Applying this model, we initially identify 686 new protease candidates and 298 new kinase candidates. Subsequent structural validation and comparison with previous annotations yield 7 new protease and 3 new kinase paralogs in the human proteome, most of which lack prior functional characterization. An additional outcome is the structural identification of catalytically important residues for a larger number of proteases and kinases. Although the number of new paralogs is small for such well-studied families, our results demonstrate that integrating orthogonal homolog detection approaches with family-specific regression models provides a robust, scalable strategy for discovering new functionally related proteins. The approach is generalizable to novel protein function discovery and can be applied more broadly to under-annotated proteomes.
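To make the classification step more concrete, the following is a minimal sketch of how orthogonal similarity metrics for a protein pair could feed a logistic regression model that is then scored with ROC-AUC and F1. It is not the authors' implementation. The feature layout (one BLASTp, one MMseqs2, one Foldseek, and one PROST value per pair, each scaled to 0..1), the synthetic data, the hyperparameters, and all function names are assumptions made purely for illustration; a real run would train on curated MEROPS or KinHub annotations.

```python
import math
import random

# ASSUMED feature order for one protein pair, each value scaled to 0..1:
# [BLASTp similarity, MMseqs2 similarity, Foldseek structural score, PROST distance]
def make_synthetic_pairs(n_per_class, rng):
    """Generate toy feature vectors. This is NOT real MEROPS or KinHub data."""
    X, y = [], []
    for label in (1, 0):
        sim_mean = 0.7 if label else 0.3   # related pairs look more similar
        dist_mean = 0.3 if label else 0.7  # and lie closer in embedding space
        for _ in range(n_per_class):
            row = [rng.gauss(sim_mean, 0.1) for _ in range(3)]
            row.append(rng.gauss(dist_mean, 0.1))
            X.append(row)
            y.append(label)
    return X, y

def sigmoid(z):
    # Branch on the sign of z so math.exp never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, row):
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by full-batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            err = predict_proba(weights, bias, row) - label
            for j in range(d):
                grad_w[j] += err * row[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def roc_auc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def f1_score(scores, labels, threshold=0.5):
    """F1 = 2*TP / (2*TP + FP + FN) for predictions at the given threshold."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
    fn = sum(1 for s, l in zip(scores, labels) if s < threshold and l == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

rng = random.Random(0)  # fixed seed so the toy run is reproducible
X_train, y_train = make_synthetic_pairs(100, rng)
X_test, y_test = make_synthetic_pairs(50, rng)

weights, bias = train_logistic_regression(X_train, y_train)
test_scores = [predict_proba(weights, bias, row) for row in X_test]
auc = roc_auc(test_scores, y_test)
f1 = f1_score(test_scores, y_test)
print(f"ROC-AUC={auc:.3f}  F1={f1:.3f}")
```

In this sketch a separate model would be fitted per protein family, mirroring the family-specific models described above. Candidate pairs scoring above the threshold would still need the structural validation and annotation comparison steps, which are not modeled here.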

Source: New Protein Function Characterization for Human Paralog Discovery, Scraping the Bottom of the Genomics Barrel