AI Insight
Researchers developed machine learning and deep learning models to classify antibodies into five antigen-specific categories (anti-dengue virus, anti-influenza virus, anti-tetanus bacillus, anti-SARS-CoV-2, and anti-Mycobacterium tuberculosis) using features extracted from heavy chain sequences, including physicochemical properties, structural composition, sequence order, and evolutionary information. Among the models tested, a stacking ensemble approach achieved the highest accuracy of 0.7803, outperforming individual tree-based classifiers and a Feature-Based Transformer architecture. SHAP analysis revealed that sequence order and evolutionary features were the most informative predictors, with cysteine identified as the single most influential amino acid in determining antibody specificity.
Why it matters
These computational tools offer a scalable method for characterizing large antibody datasets generated by high-throughput sequencing, which could accelerate the identification and development of therapeutic antibodies against infectious diseases. Sequence-based classification approaches may reduce the need for costly and time-consuming experimental screening in early-stage antibody discovery pipelines.
by Jia Lin, Jiaqi Chen, Linxuan Wan, Weinan He, Yuxin Zhu, Mu Qiao, Fancun Meng, Di Lin, Yan Che, Zicheng Cao
Background
Antibodies play a critical role in immune defense. Their antigen specificity, which is primarily governed by the unique sequences of their heavy chains, makes them invaluable tools in research and diagnostics. High-throughput sequencing technologies have enabled comprehensive profiling of the immune repertoire, generating vast antibody sequence datasets that require advanced analytical methods.
Methods
In this study, we utilized curated antibody sequences from NCBI databases to develop computational classification models for categorizing antibodies into predefined antigen classes. We extracted multifaceted features from the heavy chain sequences, encompassing physicochemical properties, structural composition, sequence order, and evolutionary information. These features were input into machine-learning classifiers to predict antigen specificity across five classes of antibodies: anti-dengue virus, anti-influenza virus, anti-tetanus bacillus, anti-SARS-CoV-2, and anti-Mycobacterium tuberculosis.
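The abstract does not name the exact descriptors used, so the following is only a minimal sketch of what sequence-derived features of this kind can look like. It computes amino acid composition and a simple gapped pair frequency (one possible sequence-order style descriptor) from a heavy chain sequence. The function names, the `gap` parameter, and the example fragment are illustrative assumptions, not the study's feature set.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the fraction of each standard residue in seq (20 values)."""
    seq = seq.upper()
    counts = Counter(res for res in seq if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def gapped_pair_frequency(seq, first, second, gap=1):
    """Fraction of positions i with seq[i] == first and seq[i + gap] == second.

    A crude sequence-order descriptor: unlike plain composition,
    it depends on where residues sit relative to one another.
    """
    seq = seq.upper()
    n_pairs = len(seq) - gap
    if n_pairs <= 0:
        return 0.0
    hits = sum(
        1 for i in range(n_pairs)
        if seq[i] == first and seq[i + gap] == second
    )
    return hits / n_pairs


# Hypothetical heavy chain fragment, used only for illustration.
fragment = "EVQLVESGGGLVQPGGSLRLSCAAS"
aac = amino_acid_composition(fragment)
gg_adjacent = gapped_pair_frequency(fragment, "G", "G", gap=1)

# One flat numeric vector per sequence, ready for a tree-based classifier.
feature_vector = [aac[aa] for aa in AMINO_ACIDS] + [gg_adjacent]
```

Each sequence becomes a fixed-length numeric vector regardless of its length, which is what allows tree-based classifiers to consume heavy chains of varying size.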
Results
Five tree-based machine-learning models were evaluated, with CatBoost achieving the highest accuracy among them (0.7713). To further improve predictive performance, we developed a stacking model that combines multiple algorithms, raising accuracy to 0.7803. Additionally, a Feature-Based Transformer deep-learning architecture was implemented, yielding an accuracy of 0.7399 and an F1-score of 0.6761. To elucidate the key determinants of antibody-antigen interactions, we applied the SHAP framework to assess feature importance. Among the top 30 contributing features, those representing sequence order and evolutionary information predominated, with amino acids such as cysteine (C), isoleucine (I), histidine (H), and phenylalanine (F) exhibiting notable SHAP values. Cysteine emerged as the most influential feature, underscoring its critical role in antibody structure and function. Individual antibody classes contributed to these key features to varying degrees; for instance, anti-tuberculosis antibodies accounted for approximately 11% of a sequence order feature associated with alanine (A), while anti-SARS-CoV-2 antibodies contributed about 9.26% to a feature associated with isoleucine (I).
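Accuracy and F1-score can diverge, as they do for the Transformer model above, when classes are unevenly represented or unevenly hard to predict. The sketch below computes both metrics from scratch on a made-up two-class example. The abstract does not state which F1 averaging scheme was used, so macro averaging is an assumption here, and the labels and predictions are invented purely for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented, imbalanced toy data: six "influenza" and two "tuberculosis" labels.
y_true = ["influenza"] * 6 + ["tuberculosis"] * 2
y_pred = ["influenza"] * 6 + ["influenza", "tuberculosis"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

In this toy case one error on the minority class costs little accuracy but lowers macro F1 more sharply, because every class counts equally in the average.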
Conclusions
Our study demonstrates the efficacy of machine-learning and deep-learning approaches in classifying antibodies into specific antigen categories, providing sequence-based insights into features associated with antibody specificity. These findings have significant implications for the mechanistic understanding, isolation, and development of potential therapeutic antibodies.
Source: Computational models for the classification of antibody specificity using heavy chain features