Biology

Aligning protein sequences becomes provably harder when incorporating structural information

AI Insight

This computational study introduces a formal mathematical model for multiple sequence alignment that incorporates structural information, specifically contact-map data from protein structures. The researchers prove that this structure-informed alignment problem is NP-complete, meaning it cannot be solved efficiently for arbitrary inputs, and demonstrate that even approximate solutions cannot be computed in polynomial time unless P equals NP. These hardness results hold across broad classes of scoring schemes and even for simplified cases involving just two sequences.


The findings establish theoretical limits on what can be efficiently computed when aligning biological sequences with structural constraints, which is relevant for protein structure prediction and comparative genomics. Understanding these computational boundaries helps researchers develop more realistic algorithms and set appropriate expectations for structure-informed alignment tools used in drug design and evolutionary studies.


arXiv:2606.02408v1 Announce Type: cross
Abstract: We formulate a structure-informed multiple sequence alignment problem, denoted MSA-S. The model abstracts biological sequences as strings and structural information as designated position-pairs. It augments a fixed pairwise string score, defined by a fixed non-gap symbol-pair scoring rule and fixed affine gap penalties, with a binary overlap score on designated position-pairs, which can be interpreted as a contact-map overlap score in structural applications. This yields a fixed-score, integer-valued optimization model suitable for complexity-theoretic analysis.
Under this formulation, we show that the decision problem MSA-S-DEC is NP-complete for a broad class of fixed pairwise string scoring schemes. We also show that NP-hardness persists even under the restriction that every designated position-pair set is nonempty and the pair-overlap threshold is strictly positive. For the associated scalarized optimization problem MSA-S-OPT(lambda) with any fixed rational constant lambda >= 1, we further show that, under the canonical unit scheme for the non-gap symbol-pair scoring rule, MSA-S-OPT(lambda) admits no polynomial-time approximation scheme (PTAS) even for two input strings (k = 2), unless P = NP.
These results establish a formal complexity-theoretic baseline for structure-informed multiple sequence alignment.

Source: Structure-Informed Multiple Sequence Alignment: A Formal Model and Hardness Results