AI Insight
This study introduces a new mathematical framework for evaluating genome assemblies in highly repetitive regions like centromeres, where traditional sequence alignment methods fail. Instead of comparing nucleotide sequences directly, the researchers developed a distribution-based metric that compares distances between functional motifs using KL divergence, creating what they call a "centeny representation." When applied to complete human telomere-to-telomere genomes, this method successfully ranks assembly accuracy both genome-wide and for individual chromosomes.
Why it matters
This framework addresses a critical gap in genome assembly quality assessment, particularly for repetitive DNA regions that have historically been difficult to sequence and validate. The method provides a quantitative standard for comparing chromosome-level assemblies and could accelerate validation of newly sequenced genomes, improving our understanding of centromeric regions that play crucial roles in cell division and chromosome stability.
arXiv:2606.11276v1 Announce Type: new
Abstract: Accurate evaluation of genome assemblies within highly repetitive regions, such as centromeres, remains a major open challenge in genomics. Conventional benchmarking relies on sequence alignment, which becomes problematic in regions of high homogeneity and divergence. Here, we frame centromere assembly evaluation as a comparative distribution problem in a compact centeny representation by computing genomic distances between functional motifs, rather than relying on nucleotide sequence. Our distribution-based metric assesses agreement between a query and a target chromosome by comparing the distributions of their centromeric inter-motif distances using KL divergence. When applied genome-wide to currently available human telomere-to-telomere (T2T) genomes, this approach yields an accuracy ranking for the entire assembly and for each individual chromosome. Altogether, we present a rapid and robust scoring system, based on a numerical rendering of each genome's inter-motif distances, that provides a quantitative standard of assembly integrity in repetitive DNA regions and establishes a bona fide framework for chromosome-level genome-to-genome comparison.
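As a rough illustration of the idea described in the abstract, the sketch below compares two sets of motif positions by binning their inter-motif distances and computing the KL divergence between the resulting histograms. It is a minimal sketch and not the authors' implementation: the function names, the fixed bin width, the number of bins, the pseudocount smoothing, and the direction of the divergence (query relative to target) are all assumptions, since the abstract does not specify them.

```python
import math


def inter_motif_distances(positions):
    """Distances between consecutive motif positions, after sorting."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def histogram(distances, bin_width, n_bins, pseudocount=1.0):
    """Normalized histogram over fixed-width bins.

    The pseudocount (an assumption, not from the source) keeps every bin
    non-zero so the KL divergence stays finite. Distances beyond the last
    bin are clamped into it.
    """
    counts = [pseudocount] * n_bins
    for d in distances:
        idx = min(d // bin_width, n_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def centromere_score(query_positions, target_positions,
                     bin_width=1000, n_bins=50):
    """Hypothetical score: lower means the query's inter-motif distance
    distribution is closer to the target's."""
    p = histogram(inter_motif_distances(query_positions), bin_width, n_bins)
    q = histogram(inter_motif_distances(target_positions), bin_width, n_bins)
    return kl_divergence(p, q)


if __name__ == "__main__":
    # Made-up motif coordinates, for illustration only.
    target = [0, 1000, 2500, 4000, 9000]
    query = [0, 300, 600, 900, 9000]
    print(centromere_score(target, target))  # identical inputs
    print(centromere_score(query, target))   # differing spacing
```

Identical inputs give a score of zero, and the score grows as the spacing between motifs diverges. Because KL divergence is asymmetric, swapping query and target generally changes the value; a ranking across assemblies would need a fixed convention for which genome serves as the target.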
Source: A mathematical framework for centromere-aware evaluation of human genome assemblies