Preprint, 2026, arXiv

Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions

The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. To date, most biosequence transformers have been trained on single-omic data (either proteins or nucleic acids) and have seen incredible success in downstream tasks in each domain, with particularly noteworthy breakthroughs in protein structural modeling. However, single-omic pretraining limits the ability of these models to capture cross-modal interactions. Here we present OmniBioTE, the largest open-source multi-omic model, trained on over 250 billion tokens of mixed protein and nucleic acid data. We show that, despite being trained only on unlabeled sequence data, OmniBioTE learns joint representations mapping genes to their corresponding protein sequences. We further demonstrate that OmniBioTE achieves state-of-the-art results in predicting the change in Gibbs free energy (ΔG) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any a priori structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Compared to single-omic controls trained with identical compute, OmniBioTE also demonstrates superior performance-per-FLOP across both multi-omic and single-omic benchmarks. Together, these results highlight the power of a unified modeling approach for biological sequences and establish OmniBioTE as a foundation model for multi-omic discovery.

Robert J. Steele

