What is LucaVirus?
LucaVirus is a unified nucleic acid–protein language model developed by Alibaba Cloud’s LucaGroup specifically for viruses. It is trained on 25.4 billion nucleotide and amino acid tokens, covering nearly all known viruses. The model learns biologically meaningful representations of the relationships between nucleotide and amino acid sequences. Based on these representations, downstream models can address key challenges in virology, such as identifying viruses hidden in genomic “dark matter,” characterizing unknown protein enzymatic activities, predicting viral evolutionary capacity, and discovering antibody drugs against emerging viruses.
Its protein embeddings distinguish protein families at high resolution, correlate strongly with genetic distance, and capture rich evolutionary information. In antibody–antigen binding prediction, the model is reported to outperform existing models, and even structure-based prediction methods, on accuracy and related metrics.
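As a rough illustration of what "correlation between embedding distance and genetic distance" means, the sketch below compares pairwise cosine distances between embedding vectors with pairwise p-distances (the fraction of mismatched positions) between aligned sequences. It is a minimal sketch under stated assumptions: the function names are hypothetical, the embeddings are random placeholders rather than LucaVirus output, and the p-distance is only one simple choice of genetic distance.

```python
import numpy as np

def cosine_distance_matrix(embeddings):
    """Pairwise cosine distance (1 - cosine similarity) between row vectors."""
    x = np.asarray(embeddings, dtype=float)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

def p_distance_matrix(aligned_seqs):
    """Pairwise fraction of mismatched positions between equal-length sequences."""
    arr = np.array([list(s) for s in aligned_seqs])
    return (arr[:, None, :] != arr[None, :, :]).mean(axis=2)

def pairwise_pearson(dist_a, dist_b):
    """Pearson correlation over the unique (upper-triangle) sequence pairs."""
    iu = np.triu_indices_from(dist_a, k=1)
    return float(np.corrcoef(dist_a[iu], dist_b[iu])[0, 1])

# Toy inputs (illustrative only, not real viral data or model output).
aligned = ["ACGT", "ACGA", "TTGA"]
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(3, 8))  # stand-in for per-sequence embeddings

r = pairwise_pearson(cosine_distance_matrix(toy_embeddings),
                     p_distance_matrix(aligned))
print(f"Pearson r on toy data: {r:.3f}")
```

With real embeddings in place of the random placeholders, a value of r close to 1 would indicate that sequences far apart genetically are also far apart in embedding space.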

Key Features of LucaVirus
- Virus Discovery: Identifies viruses hidden in genomic “dark matter,” enabling scientists to uncover new viral sequences from complex genomic data and expand knowledge of viral diversity.
- Function Prediction: Characterizes the enzymatic activity of unknown proteins by analyzing their sequences to predict potential biochemical functions, providing clues for understanding viral pathogenic mechanisms and developing antiviral drugs.
- Evolutionary Analysis: Predicts viral evolutionary potential by modeling the evolutionary information in viral sequences, helping researchers understand mutation trends and evolutionary pathways, which is crucial for public health monitoring and control.
- Drug Discovery: Identifies potential antibody drugs against emerging viruses by predicting antigen–antibody binding, accelerating antibody drug development and improving preparedness for new infectious diseases.

Technical Principles of LucaVirus
- Multimodal Data Integration: Combines nucleotide and amino acid sequence data in a single unified nucleic acid–protein language model that learns the relationships between the two.
- Large-Scale Data Training: Trained on 25.4 billion nucleotide and amino acid tokens covering nearly all known viruses, which supports broad generalization across viral diversity.
- Evolutionary Information Modeling: Embeddings incorporate viral evolutionary information, enabling the model to capture divergence and homology across viral sequences and support evolutionary analyses.
- Interpretable Embeddings: Produces embeddings that distinguish protein families with high resolution and correlate with genetic distance, offering biologically interpretable representations for virology research.
- Downstream Task Adaptation: Includes specialized downstream models optimized for virus discovery, function prediction, evolutionary analysis, and drug discovery, improving performance in real-world applications.

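To make the idea of a downstream model built on top of embeddings concrete, here is a minimal sketch of one of the simplest possible heads: a nearest-centroid classifier that assigns a protein to the family whose mean embedding is closest. This is an illustrative stand-in, not the architecture of LucaVirus's own downstream models; the function names, the two-dimensional toy vectors, and the family labels are all hypothetical.

```python
import numpy as np

def fit_centroids(embeddings, labels):
    """Compute one mean embedding (centroid) per class label."""
    x = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids = np.stack([x[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(embeddings, classes, centroids):
    """Assign each embedding to the class with the nearest centroid (Euclidean)."""
    x = np.asarray(embeddings, dtype=float)
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

# Toy 2-D "embeddings" with hypothetical family labels (illustrative only).
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = ["capsid", "capsid", "polymerase", "polymerase"]

classes, centroids = fit_centroids(train_x, train_y)
print(predict([[0.2, 0.4], [5.5, 5.1]], classes, centroids))
```

In practice the inputs would be fixed-length embeddings produced by the pretrained model, and a trained neural head would usually replace the centroid rule; the sketch only shows how separable embeddings make a downstream task tractable.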
Project Resources
- GitHub Repository: https://github.com/LucaOne/LucaVirus
- HuggingFace Model Hub: https://huggingface.co/collections/LucaGroup/lucavirus-689d9382d0cc09780f380958

Application Scenarios of LucaVirus
- Public Health Surveillance: Rapidly identifies emerging viruses and monitors evolutionary trends, providing early warnings to public health authorities to guide effective prevention strategies and reduce risks of outbreaks.
- Disease Diagnosis: Assists medical professionals in more accurately diagnosing viral infections, especially for diseases with similar symptoms caused by different viruses, improving diagnostic accuracy and efficiency.
- Vaccine Development: Provides critical insights for vaccine design, such as predicting viral antigenic changes, helping to develop more effective vaccines adaptable to viral mutations and enhancing protective efficacy.
- Drug Development: Accelerates antiviral drug discovery by predicting viral protein functions and drug targets, offering theoretical support for new drug design while reducing R&D costs and time.
- Biosafety Defense: Detects and identifies potential biological threats such as novel viruses, providing technical support for national and regional biosafety, safeguarding public health and social stability.