AI Insight
This study developed a multi-feature machine learning framework to investigate associations between ten long non-coding RNAs (lncRNAs) and Type 2 Diabetes Mellitus (T2D) using RNA sequencing data from two independent population cohorts. By integrating expression, secondary-structure, and sequence features for each lncRNA, and evaluating eight classifiers under multiple validation strategies, the authors identified cohort-specific associations, notably GAS5 and XIST expression in one cohort and MALAT1 expression and KCNQ1OT1 sequence features in the other, with SHAP identifying MEG3 as the dominant lncRNA in both. SHAP analysis also enabled subject-level interpretation of these associations, and the machine learning results were consistent with established statistical methods.
Why it matters
Identifying lncRNA-based molecular signatures associated with T2D at both population and individual levels could support the development of more targeted, precision medicine approaches for diagnosis and therapeutic intervention. This framework may also be adaptable to other chronic diseases where non-coding RNA regulation plays a role.
arXiv:2605.20747v1 Announce Type: new
Abstract: Long non-coding RNAs (lncRNAs) are emerging regulatory molecules implicated in chronic disease pathogenesis, including Type 2 Diabetes Mellitus (T2D). We investigated ten literature-reported lncRNAs associated with T2D (MALAT1, MEG3, MIAT, ANRIL, GAS5, KCNQ1OT1, H19, BCYRN1, XIST, and HOTAIR) across two independent population-based RNA-seq cohorts. Because single-omics approaches provide an incomplete view of disease biology, an integrative multi-feature framework was developed that extracts expression, secondary-structure, and sequence features for each lncRNA. Eight machine learning (ML) classifiers were evaluated under stratified k-fold, leave-one-out cross-validation (LOOCV), and repeated hold-out schemes to ensure robust performance estimation. SHAP analysis was applied for subject-level association interpretation. In one cohort, GAS5 and XIST expression features, along with GAS5, MEG3, and ANRIL sequence features, were found to be associated with T2D, while in the second cohort MALAT1 expression and KCNQ1OT1, ANRIL, and MEG3 sequence features were associated with T2D. MEG3 was identified by SHAP as the dominant lncRNA in both cohorts. ML results were consistent with established statistical methods while additionally providing population- and subject-level disease association profiles linked to specific molecular feature types. The proposed framework advances mechanistic understanding of T2D and supports lncRNA-based precision medicine.
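The three validation schemes named in the abstract differ only in how subjects are partitioned into training and test sets. The following is a minimal sketch of those partitions using the Python standard library only. It is not the authors' code: the function names, the fold count, the test fraction, the number of repeats, and the toy labels are all assumptions chosen for illustration, and the abstract does not state the values actually used.

```python
import random
from collections import defaultdict


def _indices_by_class(labels):
    """Group subject indices by class label (e.g. 0 = control, 1 = T2D)."""
    groups = defaultdict(list)
    for i, y in enumerate(labels):
        groups[y].append(i)
    return groups


def stratified_k_fold(labels, k, seed=0):
    """Yield (train, test) index lists; each fold keeps roughly the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idxs in _indices_by_class(labels).values():
        rng.shuffle(idxs)
        # Deal each class's subjects round-robin across the k folds.
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def leave_one_out(n):
    """Yield n splits, each holding out exactly one subject."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]


def repeated_holdout(labels, test_fraction, repeats, seed=0):
    """Yield `repeats` independent stratified random train/test splits."""
    rng = random.Random(seed)
    groups = _indices_by_class(labels)
    for _ in range(repeats):
        test = []
        for idxs in groups.values():
            n_test = max(1, round(len(idxs) * test_fraction))
            test.extend(rng.sample(idxs, n_test))
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, sorted(test)


# Toy cohort (hypothetical): 6 controls and 4 T2D subjects.
labels = [0] * 6 + [1] * 4
skf_splits = list(stratified_k_fold(labels, k=2))
loo_splits = list(leave_one_out(len(labels)))
rho_splits = list(repeated_holdout(labels, test_fraction=0.2, repeats=5))
```

In practice a classifier would be fitted on each training index list and scored on the matching test list, with scores averaged per scheme. LOOCV uses the most training data per split but is the most expensive, while repeated hold-out trades some bias for an estimate of split-to-split variance.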