If you'd like to collaborate, get in touch at macwiatrak@gmail.com, or DM me directly.
π§΅ 16/n
If you'd like to collaborate, get in touch at macwiatrak@gmail.com, or DM me directly.
π§΅ 16/n
Most importantly, huge shoutout to everyone who made this possible! Thanks to them, the project was first and foremost immense fun!
π§΅ 15/n
We hope that Bacformer will accelerate microbial discovery and we are excited for the future work!
Thatβs it! Thanks for sticking with me through this thread!
π§΅ 14/n
Finally, we used Bacformer to generate sequences of protein families given a prompt. Bacformer generates sequences which span essential functions and resemble real genomes. We also used Bacformer to generate sequences for a desired traits, like oxygen requirement.
π§΅ 13/n
Accurately predicting phenotypic traits opens a possibility to discover the genes that are likely causally associated with specific traits. We used gradient-based attribution to identify the genes involved in sporulation and host adaptation.
π§΅ 12/n
Bacformer can also predict diverse phenotypic traits! We trained Bacformer to predict 139 phenotypes from the genome alone.
We then used high performing phenotypes to annotate our corpus of >1.3M genomes with 32 diverse phenotypic traits.
π§΅ 11/n
By leveraging the whole-genome context, we show how Bacformer boosts performance on gene essentiality and protein function annotation task.
To us, this make sense as gene's essentiality and function is often tied with its genomic neighborhoud in bacteria.
π§΅ 10/n
It can also be used to predict the protein-protein interactions across diverse bacteria.
To do it, we fine-tuned the model on STRING DB and used it to predict the interactome of P. aeruginosa, with the top scoring pairs showing high-confidence interfaces based on AF3.
π§΅ 9/n
It also nails operon detection! We validated our predictions with long-read RNA sequencing, showing how Bacformer can be used for operon identification even in a zero-shot setup.
π§΅ 8/n
Bacformer can be used for a range of bacterial genomics tasks zero-shot or finetuned.
We examined whether Bacformer can uncover the evolutionary relationships by examining if the genome embeddings from Bacformer can be used for clustering without any "species" token.
π§΅ 7/n
If each token is a protein, how do we do pretraining if the space of possible proteins is effectively unbound? We leverage the similarities between proteins and create a discrete vocabulary of 50,000 βprotein familyβ clusters using protein embeddings from a pLM.
π§΅ 6/n
To capture patterns across diverse bacteria, we pretrained Bacformer on a curated corpus of over 1.3M metagenomes spanning over 25,000 species across diverse environments and containing almost 3B proteins.
π§΅ 5/n
Bacformer represents each genome as a sequence of proteins, ordered by their location on the genome, providing a unified representation across bacterial species, and allowing us to learn evolutionary patterns across all bacteria, rather than a single species.
π§΅ 4/n
Why make ML models for bacteriaπ¦ ? 1οΈβ£Bacteria shape every ecosystem, and our own health. 2οΈβ£They are easier to model than mammalian cells: their genomes are small, compact and mostly codingβ; they have no real epigenome.3οΈβ£We have a lot of data to (thanks to metagenomics)!
π§΅ 3/n
Code and tutorials π»: github.com/macwiatrak/B...
Blog βοΈ: macwiatrak.github.io/posts/2025/i...
Pretrained weights π€: huggingface.co/macwiatrak
π§΅ 2/n
π₯ Excited to introduce Bacformer π¦ - the first foundation model for bacterial genomics. Bacformer represents genomes as sequences of ordered proteins, learning the βgrammarβ of how genes are arranged, interact and evolve.
Preprint π: biorxiv.org/content/10.1...
π§΅ 1/n