Maciek Wiatrak

@macwiatrak

PhD student @ Cambridge Centre for AI in Medicine (CCAIM). I do 💻 🧬 💊 and love ⛰️.

44 followers · 1 following · 16 posts · joined 17.07.2025

Latest posts by Maciek Wiatrak @macwiatrak

If you'd like to collaborate, get in touch at macwiatrak@gmail.com, or DM me directly.

🧵 16/n

21.07.2025 09:55 👍 0 🔁 0 💬 0 📌 0

Most importantly, huge shoutout to everyone who made this possible! Thanks to them, the project was first and foremost immense fun!

🧵 15/n

21.07.2025 09:55 👍 0 🔁 0 💬 1 📌 0

We hope that Bacformer will accelerate microbial discovery, and we are excited about future work!

That’s it! Thanks for sticking with me through this thread!

🧵 14/n

21.07.2025 09:55 👍 0 🔁 0 💬 1 📌 0

Finally, we used Bacformer to generate sequences of protein families given a prompt. Bacformer generates sequences that span essential functions and resemble real genomes. We also used Bacformer to generate sequences for desired traits, like oxygen requirement.

🧵 13/n

21.07.2025 09:55 👍 0 🔁 0 💬 1 📌 0

Accurately predicting phenotypic traits opens up the possibility of discovering the genes that are likely causally associated with specific traits. We used gradient-based attribution to identify the genes involved in sporulation and host adaptation.
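
As a rough illustration of gradient-based attribution (not Bacformer's actual code), here is a minimal sketch. It assumes a toy logistic model over mean-pooled per-gene feature vectors and uses the "gradient × input" variant; the thread does not say which attribution method was used. The gene names, feature vectors and weights are all made up for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(gene_vectors, weights):
    """Toy trait predictor: mean-pool gene vectors, then logistic regression."""
    n = len(gene_vectors)
    pooled = [sum(v[d] for v in gene_vectors) / n for d in range(len(weights))]
    return sigmoid(sum(w * x for w, x in zip(weights, pooled)))

def gradient_x_input(genes, weights):
    """Score each gene by (d score / d gene_vector) dotted with gene_vector."""
    names = list(genes)
    vectors = [genes[g] for g in names]
    s = predict(vectors, weights)
    # For this toy model the gradient is analytic and identical for every gene:
    # d(score)/d(x_i) = s * (1 - s) * w / n
    scale = s * (1.0 - s) / len(vectors)
    grad = [scale * w for w in weights]
    return {g: sum(gr * x for gr, x in zip(grad, genes[g])) for g in names}

# Hypothetical genome with three genes and 2-D features.
genes = {"geneA": [1.0, 0.0], "geneB": [0.0, 1.0], "geneC": [0.5, 0.5]}
weights = [2.0, -1.0]  # made-up trained weights

attributions = gradient_x_input(genes, weights)
ranked = sorted(attributions, key=attributions.get, reverse=True)
print(ranked)  # ['geneA', 'geneC', 'geneB']
```

With a deep model the gradient would come from automatic differentiation rather than a closed form, but the ranking step would look the same.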

🧵 12/n

21.07.2025 09:55 👍 0 🔁 0 💬 1 📌 0

Bacformer can also predict diverse phenotypic traits! We trained Bacformer to predict 139 phenotypes from the genome alone.

We then used the high-performing phenotype models to annotate our corpus of >1.3M genomes with 32 diverse phenotypic traits.

🧵 11/n

21.07.2025 09:55 👍 0 🔁 0 💬 1 📌 0

By leveraging the whole-genome context, we show how Bacformer boosts performance on gene essentiality and protein function annotation tasks.

To us, this makes sense, as a gene's essentiality and function are often tied to its genomic neighborhood in bacteria.

🧵 10/n

21.07.2025 09:55 👍 0 🔁 0 💬 1 📌 0

It can also be used to predict protein-protein interactions across diverse bacteria.

To do this, we fine-tuned the model on STRING DB and used it to predict the interactome of P. aeruginosa, with the top-scoring pairs showing high-confidence interfaces based on AF3.

🧵 9/n

21.07.2025 09:55 👍 0 🔁 0 💬 1 📌 0

It also nails operon detection! We validated our predictions with long-read RNA sequencing, showing how Bacformer can be used for operon identification even in a zero-shot setup.

🧵 8/n

21.07.2025 09:55 👍 0 🔁 0 💬 1 📌 0

Bacformer can be used for a range of bacterial genomics tasks, either zero-shot or fine-tuned.

We tested whether Bacformer can uncover evolutionary relationships by checking whether its genome embeddings can be used for clustering without any "species" token.
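
To make the idea of comparing genome embeddings concrete, here is a minimal sketch. It assumes a genome embedding is the mean of its per-protein embeddings and that similarity is measured with cosine similarity; both choices are assumptions for illustration, and the tiny 2-D vectors stand in for real model outputs.

```python
import math

def mean_pool(protein_embeddings):
    """Average per-protein vectors into a single genome-level vector."""
    n = len(protein_embeddings)
    dim = len(protein_embeddings[0])
    return [sum(v[d] for v in protein_embeddings) / n for d in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical per-protein embeddings for three genomes.
genome_a = mean_pool([[1.0, 0.0], [0.8, 0.2]])
genome_b = mean_pool([[0.9, 0.1], [1.0, 0.0]])  # meant to resemble genome_a
genome_c = mean_pool([[0.0, 1.0], [0.2, 0.8]])  # meant to be a distant genome

sim_ab = cosine(genome_a, genome_b)
sim_ac = cosine(genome_a, genome_c)
print(sim_ab > sim_ac)  # True
```

Pairwise similarities like these can then be fed to any standard clustering method to see whether genomes from the same species end up together.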

🧵 7/n

21.07.2025 09:55 👍 0 🔁 0 💬 1 📌 0

If each token is a protein, how do we pretrain when the space of possible proteins is effectively unbounded? We leverage the similarities between proteins and create a discrete vocabulary of 50,000 “protein family” clusters using protein embeddings from a protein language model (pLM).
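
The clustering step can be sketched as follows. This is a toy illustration, not the actual pipeline: it assumes plain k-means (the thread does not name the clustering algorithm), uses 2-D points in place of real pLM embeddings, and builds 2 clusters instead of 50,000.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the closest centroid, used as the protein-family token."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def kmeans(points, k, n_iter=10):
    """Plain Lloyd's k-means with a naive deterministic initialisation."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids

# Hypothetical protein embeddings forming two obvious groups.
embeddings = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (0.1, 0.2)]
centroids = kmeans(embeddings, k=2)

tokens = [nearest(e, centroids) for e in embeddings]
print(tokens)  # [0, 0, 1, 1, 0]

# A protein not seen during clustering still maps to a token.
new_token = nearest((4.8, 5.0), centroids)
print(new_token)  # 1
```

The point of the discretisation is that any protein, including an unseen one, maps to one of a fixed number of tokens, which gives the model a finite set of targets during pretraining.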

🧵 6/n

21.07.2025 09:55 👍 0 🔁 0 💬 1 📌 0

To capture patterns across diverse bacteria, we pretrained Bacformer on a curated corpus of over 1.3M metagenomes spanning over 25,000 species across diverse environments and containing almost 3B proteins.

🧵 5/n

21.07.2025 09:55 👍 1 🔁 0 💬 1 📌 0

Bacformer represents each genome as a sequence of proteins, ordered by their location on the genome, providing a unified representation across bacterial species, and allowing us to learn evolutionary patterns across all bacteria, rather than a single species.
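
A minimal sketch of this representation, assuming each protein record carries a contig name and a start coordinate (the `Protein` class and its fields are hypothetical, not taken from the Bacformer codebase):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Protein:
    protein_id: str  # hypothetical identifier
    contig: str      # contig or chromosome the gene sits on
    start: int       # start coordinate of the coding sequence

def order_proteins(proteins):
    """Return protein IDs sorted by contig, then by position on the contig."""
    ordered = sorted(proteins, key=lambda p: (p.contig, p.start))
    return [p.protein_id for p in ordered]

# Made-up annotations, deliberately out of order.
genome = [
    Protein("p3", "contig_1", 5200),
    Protein("p1", "contig_1", 100),
    Protein("p4", "contig_2", 50),
    Protein("p2", "contig_1", 1800),
]

print(order_proteins(genome))  # ['p1', 'p2', 'p3', 'p4']
```

The ordered list is what the model would consume, one protein per position, regardless of which species the genome comes from.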

🧵 4/n

21.07.2025 09:55 👍 1 🔁 0 💬 1 📌 0

Why make ML models for bacteria 🦠? 1️⃣ Bacteria shape every ecosystem, and our own health. 2️⃣ They are easier to model than mammalian cells: their genomes are small, compact and mostly coding, and they have no real epigenome. 3️⃣ We have a lot of data (thanks to metagenomics)!

🧵 3/n

21.07.2025 09:55 👍 1 🔁 0 💬 1 📌 0
GitHub - macwiatrak/Bacformer: Modeling whole bacterial genome as a sequence of proteins.

Code and tutorials 💻: github.com/macwiatrak/B...
Blog ✍️: macwiatrak.github.io/posts/2025/i...
Pretrained weights 🤖: huggingface.co/macwiatrak

🧵 2/n

21.07.2025 09:55 👍 1 🔁 1 💬 1 📌 0

💥 Excited to introduce Bacformer 🦠 - the first foundation model for bacterial genomics. Bacformer represents genomes as sequences of ordered proteins, learning the “grammar” of how genes are arranged, interact and evolve.

Preprint 📝: biorxiv.org/content/10.1...

🧵 1/n

21.07.2025 09:55 👍 91 🔁 59 💬 3 📌 2