
Damiano Sgarbossa

@damianosg

PhD in Computational Biology & ML for Proteins @EPFL https://sites.google.com/view/damiano-sgarbossa

750 Followers · 219 Following · 19 Posts · Joined 07.12.2023

Latest posts by Damiano Sgarbossa @damianosg

With this, the last bit of my PhD at @embl.org is finally out!
We developed salad (sparse all-atom denoising), a family of blazing fast protein structure diffusion models.
Paper: nature.com/articles/s42256-…
Code: github.com/mjendrusch/salad
Data: zenodo.org/records/14711580
1/🧵

24.09.2025 11:58 👍 24 🔁 9 💬 1 📌 0

Two exciting openings with us! 🤖🧬🆎🧫💉
- AI Scientist 👉 lnkd.in/eDXHH4E8
- AI Scientist, Drug Creation 👉 lnkd.in/eEvGyaTR

You'll work on antibody sequence/structure design, antibody-antigen co-folding, antibody-antigen binding prediction, physics-based methodologies, and more!

DMs welcome!

29.08.2025 08:13 👍 3 🔁 2 💬 0 📌 0

🎉 Excited to share that the last paper of my PhD is now published in PRX Life!

We introduce RAG-ESM, a retrieval-augmented framework that makes pretrained protein language models (like ESM2) homology-aware with minimal training cost.

📄 Paper: journals.aps.org/prxlife/abst...

21.08.2025 16:13 👍 7 🔁 2 💬 0 📌 0
ProteomeLM: A proteome-scale language model allowing fast prediction of protein-protein interactions and gene essentiality across taxa

Language models starting from biological sequence data are advancing many inference problems, both at the scale of single proteins and at the scale of genomic neighborhoods. In this paper, we introduce ProteomeLM, a transformer-based language model that reasons on entire proteomes from species spanning the tree of life. Leveraging protein language model embeddings, ProteomeLM is trained to reconstruct masked protein embeddings using the whole proteomic context. It thus learns contextualized protein representations reflecting proteome-scale functional constraints. We show that ProteomeLM spontaneously captures protein-protein interactions (PPI) in its attention coefficients. We demonstrate that it screens whole interactomes orders of magnitude faster than amino-acid coevolution-based methods, and substantially outperforms them. We further develop ProteomeLM-PPI, a supervised PPI prediction network that combines ProteomeLM embeddings and attention coefficients, and achieves state-of-the-art performance across species and benchmarks. Finally, we introduce ProteomeLM-Ess, a supervised predictor of gene essentiality that generalizes across diverse taxa. Our results highlight the power of proteome-scale language models for addressing function and interactions at the organism level.
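The masked-embedding objective described in the abstract above can be pictured in a few lines. Below is a minimal, hypothetical PyTorch sketch: module names, dimensions, and the masking rate are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a masked-embedding reconstruction objective over a
# proteome, as described in the ProteomeLM abstract; all names are illustrative.
import torch
import torch.nn as nn

class ProteomeEncoder(nn.Module):
    """Transformer over per-protein embeddings of one proteome (hypothetical)."""
    def __init__(self, dim=1280, depth=6, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.mask_token = nn.Parameter(torch.zeros(dim))

    def forward(self, prot_embs, mask):
        # prot_embs: (batch, n_proteins, dim) protein-LM embeddings
        # mask: (batch, n_proteins) bool, True where the protein is masked
        x = torch.where(mask.unsqueeze(-1), self.mask_token, prot_embs)
        return self.encoder(x)

model = ProteomeEncoder()
embs = torch.randn(2, 100, 1280)             # embeddings for 100 proteins
mask = torch.rand(2, 100) < 0.15             # mask ~15% of proteins
recon = model(embs, mask)
loss = ((recon - embs)[mask]).pow(2).mean()  # reconstruct masked embeddings
```

Training this way forces each protein's reconstruction to draw on its proteomic context, which is where the PPI signal in the attention coefficients would come from.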

[1/8] 📄 New preprint! With Gionata Paolo Zalaffi & Anne-Florence Bitbol, we introduce ProteomeLM, a transformer that processes entire proteomes (prokaryotes and eukaryotes), enabling ultra-fast protein–protein interaction (PPI) prediction across the tree of life.
🔗 www.biorxiv.org/content/10.1...

21.08.2025 13:55 👍 17 🔁 3 💬 1 📌 1

📈 Despite its smaller size, ProtMamba outperforms state-of-the-art models on conditional sequence generation and is competitive with other protein language models on fitness prediction, showing the importance of long-context conditioning.

Read it here: doi.org/10.1093/bioi...
Github repo: github.com/Bitbol-Lab/P...

07.07.2025 16:48 👍 0 🔁 0 💬 0 📌 0

🧬 ProtMamba applications include:
- Generating novel protein sequences conditioned on a given set of homologs,
- Inpainting specific regions within sequences,
- Modeling disordered regions of different protein sequences,
- Predicting the fitness of protein variants.

07.07.2025 16:48 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

⚙️ ProtMamba is based on Mamba, a state space model that efficiently handles very long sequences. The model uses a fill-in-the-middle training objective, combining autoregressive modeling and masked language modeling to predict amino acids conditioned on the given homologs.

07.07.2025 16:48 👍 0 🔁 0 💬 1 📌 0
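For readers curious what fill-in-the-middle looks like in practice, here is a minimal, hypothetical sketch of the data formatting: the sentinel tokens and span-sampling scheme are illustrative assumptions, not ProtMamba's actual tokenization.

```python
# A minimal sketch of fill-in-the-middle (FIM) formatting as described in the
# post above; token names and layout are illustrative, not ProtMamba's scheme.
import random

def make_fim_example(homologs, target, sep="<sep>", fim="<fim>"):
    """Concatenate homologs as context, then the target with a middle span
    cut out and appended after a sentinel, for autoregressive training."""
    i, j = sorted(random.sample(range(1, len(target)), 2))
    prefix, middle, suffix = target[:i], target[i:j], target[j:]
    context = sep.join(homologs)
    # The model is trained to predict `middle` autoregressively, conditioned
    # on the unaligned homologs and the surrounding prefix/suffix.
    return f"{context}{sep}{prefix}{fim}{suffix}{fim}{middle}"

example = make_fim_example(["MKVLA...", "MKILS..."], "MKVLSAGHKQW")
```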

🔍 ProtMamba is homology-aware yet alignment-free, meaning it captures evolutionary information without relying on multiple sequence alignments. This allows it to avoid the imperfections of MSAs while still using the information from other homologs to condition generation!

07.07.2025 16:48 👍 0 🔁 0 💬 1 📌 0

Happy to announce that our paper, "ProtMamba: a homology-aware but alignment-free protein state space model", has been published in Bioinformatics! 🎉

doi.org/10.1093/bioi...

07.07.2025 16:48 👍 6 🔁 2 💬 1 📌 0

Also, a huge thanks to my supervisor Anne-Florence and my defense committee: Bruno Correia, @pschwllr.bsky.social, @sokrypton.org, and Thomas Lemmin

30.06.2025 11:42 👍 1 🔁 0 💬 0 📌 0

I'm really happy to share that, after 4 years at EPFL, I finally have my PhD! 🎉🎓

Last Friday I defended my thesis, titled "Revealing and Exploiting Coevolution through Protein Language Models".

It was an amazing journey where I met some incredible people. Thank you all ❤️

30.06.2025 11:42 👍 4 🔁 0 💬 1 📌 0

New preprint from @trono-lab.bsky.social and part of my PhD work!
By modulating SWI/SNF remodeling at ancient transposable elements (LINE/L2s and SINE/MIRs), a "noncanonical" KZFP called ZNF436 protects cardiomyocytes from losing their identity.
🫀 heartbeat on 🔁 repeat
www.biorxiv.org/content/10.1... #TEsky

19.05.2025 14:01 👍 53 🔁 23 💬 2 📌 4
Limits of deep-learning-based RNA prediction methods
Motivation: In recent years, tremendous advances have been made in predicting protein structures and protein-protein interactions. However, progress in predicting the structure of RNA, either alone or...

In this evaluation of AlphaFold3 (and other methods), we show that (i) accurate predictions are limited to RNA structures/complexes with structural similarity to the PDB, and (ii) current methods are poor at estimating the accuracy of their predictions. www.biorxiv.org/content/10.1...

05.05.2025 13:01 👍 86 🔁 31 💬 1 📌 5

This is work I did in collaboration with Anne-Florence Bitbol @epfl-ai-center.bsky.social. #CompBio #DeepLearning #ProteinEngineering #AI #MachineLearning #ICLR2025

11.04.2025 14:54 👍 0 🔁 0 💬 0 📌 0

RAG-ESM is simple to implement, compatible with pretrained ESM2 checkpoints, and efficient to train (~50–120 GPU hours).

Come check out my poster (spotlight) at the MLGenX workshop at ICLR in Singapore!

Code (still WIP): github.com/Bitbol-Lab/r...
Preprint: doi.org/10.1101/2025...

7/7

11.04.2025 14:47 👍 0 🔁 0 💬 1 📌 0
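As a starting point of the kind the post above describes, a pretrained ESM2 backbone can be loaded and frozen via the fair-esm package. This is a generic sketch, not RAG-ESM's training code (see the linked repo), and the checkpoint choice is illustrative.

```python
# Generic sketch: load a pretrained ESM2 checkpoint with the fair-esm package
# and freeze it, so that only newly added layers would be trained.
import esm

model, alphabet = esm.pretrained.esm2_t12_35M_UR50D()  # 35M-parameter ESM2
for p in model.parameters():
    p.requires_grad = False  # backbone frozen; train only added layers

# Tokenize a sequence with the accompanying alphabet
batch_converter = alphabet.get_batch_converter()
_, _, tokens = batch_converter([("query", "MKTVRQERLKSIVRILERSKEPVSGAQ")])
```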

RAG-ESM is trained with a discrete diffusion objective, giving it generative capabilities. RAG-ESM achieves SOTA among sequence-based models for conditional generation and motif scaffolding. It outperforms DPLM (650M), EvoDiff-MSA, and ProtMamba on key benchmarks.

6/7

11.04.2025 14:47 👍 0 🔁 0 💬 1 📌 0
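A minimal sketch of what an absorbing-state discrete diffusion training step can look like; the noise schedule and unweighted cross-entropy here are simplifying assumptions, not the paper's exact objective.

```python
# Illustrative absorbing-state discrete diffusion step: corrupt a random
# fraction of tokens to a mask id, then denoise with cross-entropy.
import torch
import torch.nn.functional as F

def diffusion_step(model, tokens, mask_id):
    # tokens: (batch, length) integer amino-acid tokens
    t = torch.rand(tokens.size(0), 1)                   # per-sequence noise level
    corrupt = torch.rand_like(tokens, dtype=torch.float) < t
    noisy = torch.where(corrupt, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy)                               # (batch, length, vocab)
    return F.cross_entropy(logits[corrupt], tokens[corrupt])
```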

An unexpected result: Several cross-attention heads naturally learn to align the input and context sequences, even though the model is trained on unaligned data. This alignment capability emerges purely from the training objective (no explicit alignment supervision).

5/7

11.04.2025 14:47 👍 1 🔁 0 💬 1 📌 0

Using just one homolog as context, RAG-ESM models (12M and 165M params) outperform ESM2 (650M) on masked token prediction. We obtain a 40–50% reduction in perplexity despite using far fewer parameters.

4/7

11.04.2025 14:47 👍 0 🔁 0 💬 1 📌 0

Conditioning on homologs reduces the effective dimensionality of the search space during inference. Instead of encoding information about entire protein families internally, the model can focus its weights on more nuanced biological features.

3/7

11.04.2025 14:47 👍 0 🔁 0 💬 1 📌 0

What does RAG-ESM do?
It augments ESM2 with a few lightweight cross-attention layers that let us condition the model on retrieved homologous sequences. This allows the model to leverage evolutionary information during inference without retraining.

2/7

11.04.2025 14:47 👍 1 🔁 0 💬 1 📌 0
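A hypothetical sketch of such a lightweight cross-attention block; names and dimensions are illustrative (480 matches a small ESM2 width), not RAG-ESM's actual code.

```python
# Illustrative block letting input-sequence states attend to a retrieved
# homolog's states, added on top of a frozen backbone.
import torch
import torch.nn as nn

class HomologCrossAttention(nn.Module):
    def __init__(self, dim=480, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, context):
        # x: (batch, L, dim) hidden states of the input sequence
        # context: (batch, Lc, dim) hidden states of a retrieved homolog
        attended, _ = self.attn(query=self.norm(x), key=context, value=context)
        return x + attended  # residual: backbone layers stay frozen

block = HomologCrossAttention()
x, ctx = torch.randn(1, 120, 480), torch.randn(1, 95, 480)
out = block(x, ctx)  # (1, 120, 480)
```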
RAG-ESM logo

📢 Our new preprint is out on bioRxiv! We introduce RAG-ESM, a retrieval-augmented framework that improves pretrained protein language models like ESM2 by making them homology-aware with minimal additional training cost.
🔗 doi.org/10.1101/2025...
💻 github.com/Bitbol-Lab/r...

1/7

11.04.2025 14:47 👍 5 🔁 3 💬 1 📌 1

📢📢 "Proteina: Scaling Flow-based Protein Structure Generative Models"

#ICLR2025 (Oral Presentation)

🔥 Project page: research.nvidia.com/labs/genair/...
📜 Paper: arxiv.org/abs/2503.00710
🛠️ Code and weights: github.com/NVIDIA-Digit...

🧵 Details in thread...

(1/n)

04.03.2025 17:09 👍 39 🔁 10 💬 1 📌 4
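For context, flow-based generative training of the kind named in the title typically minimizes a velocity-matching loss. This generic sketch uses the standard linear interpolation path and a placeholder network, not necessarily Proteina's exact parametrization.

```python
# Generic flow-matching training step on 3D coordinates; `velocity_net` is a
# placeholder callable predicting a velocity field from (x_t, t).
import torch

def flow_matching_loss(velocity_net, x1):
    # x1: (batch, n_residues, 3) target structure coordinates
    x0 = torch.randn_like(x1)                # noise sample
    t = torch.rand(x1.size(0), 1, 1)         # time in [0, 1]
    xt = (1 - t) * x0 + t * x1               # linear interpolation path
    target_v = x1 - x0                       # constant target velocity
    pred_v = velocity_net(xt, t.flatten())   # model predicts the velocity
    return (pred_v - target_v).pow(2).mean()
```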

Excited to share PoET-2, our next breakthrough in protein language modeling. It represents a fundamental shift in how AI learns from evolutionary sequences. 🧵 1/13

11.02.2025 14:30 👍 31 🔁 15 💬 1 📌 0
Post image

I'll get straight to the point.

We trained 2 new models. Like BERT, but modern. ModernBERT.

Not some hypey GenAI thing, but a proper workhorse model, for retrieval, classification, etc. Real practical stuff.

It's much faster, more accurate, longer-context, and more useful. 🧵

19.12.2024 16:45 👍 619 🔁 147 💬 19 📌 34
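A quick way to try it with Hugging Face transformers; the answerdotai/ModernBERT-base checkpoint id is the published one, but check the model card for current version requirements.

```python
# Masked-token prediction with ModernBERT via the transformers pipeline.
from transformers import pipeline

fill = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
print(fill("Paris is the [MASK] of France.")[0]["token_str"])
```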
Structure-based drug design with equivariant diffusion models - Nature Computational Science
This work applies diffusion models to conditional molecule generation and shows how they can be used to tackle various structure-based drug design problems.

Extremely pleased to announce that after *checks notes* 2 years, our paper on Structure-based Drug Design with diffusion models has been published in Nature Computational Science (@natcomputsci.bsky.social)!!

Thanks a lot to the great co-authors! Especially @rne.bsky.social & Yuanqi Du.

10.12.2024 15:10 👍 40 🔁 3 💬 1 📌 0
overview of results for PLAID!

1/🧬 Excited to share PLAID, our new approach for co-generating sequence and all-atom protein structures by sampling from the latent space of ESMFold. This requires only sequences during training, which unlocks more data and annotations:

bit.ly/plaid-proteins
🧵

06.12.2024 17:44 👍 121 🔁 37 💬 1 📌 3
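Schematically, the pipeline the post above describes looks like the following; every name here is a hypothetical placeholder, not PLAID's API.

```python
# Schematic PLAID-style generation: sample from a diffusion model trained on
# ESMFold's latent space, then decode to sequence and structure. All callables
# are hypothetical placeholders.
def generate(latent_diffusion, sequence_decoder, structure_decoder, n=1):
    z = latent_diffusion.sample(n)     # latent drawn from the trained prior
    seq = sequence_decoder(z)          # decode amino-acid sequence
    coords = structure_decoder(z)      # decode all-atom structure
    return seq, coords
```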
Model Scale vs. Performance curves for ESM C models, with comparisons to ESM2 and other protein LMs. ESMC performs better than existing state of the art for the same model parameter scale.

Introducing ESM Cambrian, a new family of protein language models, focused on creating representations of the underlying biology of proteins.

04.12.2024 17:45 👍 50 🔁 16 💬 1 📌 2

*Very late correction: AMLD will take place in February 2025, not 2024. Of course, you are still in time to submit your abstract for a contributed talk!

03.12.2024 11:47 👍 0 🔁 0 💬 0 📌 0

If you need any other information you can contact me or the other organizers: @rebeccaneeser.bsky.social, @rne.bsky.social, Jeff Guo, Julius Wenckstern, @pschwllr.bsky.social and Bruno Correia

22.11.2024 08:57 👍 2 🔁 0 💬 0 📌 0

We're happy to announce that our track "AI & the Molecular World" at @appliedmldays.org will take place this year too! Join us in Lausanne on February 13, 2024!

The call for talks is now open! Submit your abstract by January 5, 2024, at:
forms.gle/hu6BEWMN1BcR...

22.11.2024 08:53 👍 9 🔁 3 💬 2 📌 0