Shae Mclaughlin (@shaemcl)

Nucleotide GPT: Sequence-Based Deep Learning Prediction of Nuclear Subcompartment-Associated Genome Architecture
The spatial organization of the genome within the nucleus is partially determined by its interactions with distinct nuclear subcompartments, such as the nuclear lamina and nuclear speckles, which play...

Here is the model it was trained on! Preprint went up on bioRxiv last week: www.biorxiv.org/content/10.1...

12.12.2024 02:48 👍 1 🔁 1 💬 0 📌 0

There's much more work to be done in evaluating and interpreting the features, but these early findings suggest this could be a valuable approach for understanding what these models learn about the language of life 🧬 8/8

12.12.2024 02:47 👍 0 🔁 0 💬 1 📌 0

The features have distinct but sometimes overlapping activation patterns, suggesting they might detect different parts of the same Alu element or regulatory sequence. There are a lot of these! I've only investigated a handful of features, but these initial results are promising! 7/8

12.12.2024 02:47 👍 0 🔁 0 💬 1 📌 0

To visualize where these features activate across the genome, I uploaded the activation sites to the UCSC Genome Browser. Comparing against genomic annotations reveals that these features tend to activate in consistent patterns, often near SINE elements 6/8

12.12.2024 02:47 👍 0 🔁 0 💬 1 📌 0
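
For readers who want to try the genome browser step from post 6/8, here is a minimal sketch of turning activation sites into a BED custom track, which is a text format the UCSC Genome Browser accepts for upload. The thread does not say how the sites were exported, so the function name `activations_to_bed`, the tuple layout, and the choice to scale activations into the 0-1000 BED score column are all assumptions made for illustration.

```python
def activations_to_bed(sites, track_name="sae_feature"):
    """Format activation sites as a BED custom track (hypothetical helper).

    sites: iterable of (chrom, start, end, feature_id, activation), with
    0-based start and exclusive end, as BED expects.
    """
    sites = list(sites)
    # Scale activations into BED's 0-1000 score range (an assumed convention).
    max_act = max((s[4] for s in sites), default=0.0) or 1.0
    lines = [
        f'track name="{track_name}" '
        'description="SAE feature activation sites" useScore=1'
    ]
    # Sort by chromosome name, then by start coordinate.
    for chrom, start, end, feature_id, act in sorted(sites, key=lambda s: (s[0], s[1])):
        score = max(0, min(1000, int(round(1000 * act / max_act))))
        lines.append(f"{chrom}\t{start}\t{end}\tfeature_{feature_id}\t{score}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up coordinates, purely to show the output shape.
    demo = [("chr1", 100, 130, 1685, 2.0), ("chr1", 50, 80, 1685, 1.0)]
    print(activations_to_bed(demo), end="")
```

Saving the returned text to a file and loading it as a custom track should shade each site by relative activation strength, since `useScore=1` asks the browser to use the score column for shading.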

Or this (less significant) alignment with a MEF2A binding site for feature 3990 5/8

12.12.2024 02:47 👍 0 🔁 0 💬 1 📌 0

Some of these also look like they may be transcription factor binding sites, such as feature 1685 here, which gets a highly significant alignment with this ZNF460 binding site 4/8

12.12.2024 02:47 👍 0 🔁 0 💬 1 📌 0

Many of these detect Alu elements: stretches of DNA about 300 bp long that became abundant through copy-paste events during evolution. Making up over 10% of our genome, they are the most common mobile genetic elements in humans 3/8

12.12.2024 02:47 👍 0 🔁 0 💬 1 📌 0

To identify these motifs, I looked at the 30 bp sequence windows surrounding each feature's strongest activation sites. Examined together, the top activating sequences showed clear patterns, indicating that these features reliably detect specific DNA sequences 2/8

12.12.2024 02:47 👍 0 🔁 0 💬 1 📌 0
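
The window extraction described in post 2/8 could be sketched roughly as follows. This assumes one activation value per base for a single feature, which may not match how the actual model tokenizes DNA. The function name `top_activating_windows` and its parameters are hypothetical.

```python
def top_activating_windows(sequence, activations, k=5, window=30):
    """Return (position, activation, window_sequence) for the k strongest sites.

    Assumes `activations[i]` is one feature's activation at base i of `sequence`.
    """
    if len(sequence) != len(activations):
        raise ValueError("sequence and activations must have the same length")
    half = window // 2
    # Rank positions by activation strength, strongest first.
    ranked = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    results = []
    for pos in ranked[:k]:
        # Centre the window on the site, shifting it inward at sequence edges.
        start = max(0, pos - half)
        end = min(len(sequence), start + window)
        start = max(0, end - window)
        results.append((pos, activations[pos], sequence[start:end]))
    return results


if __name__ == "__main__":
    seq = "ACGT" * 25                      # toy 100 bp sequence
    acts = [0.0] * len(seq)
    acts[50], acts[2] = 1.0, 0.5           # two made-up activation peaks
    for pos, act, win in top_activating_windows(seq, acts, k=2):
        print(pos, act, win)
```

Stacking the returned windows for one feature is then enough to eyeball a shared motif or to pass them to a motif-discovery tool.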

I trained a sparse autoencoder on the middle-layer residual stream of my genome language model and found human-interpretable latent features that consistently detect specific DNA motifs!

🧵1/8

12.12.2024 02:47 👍 3 🔁 2 💬 2 📌 0
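
For context on post 1/8, below is a dependency-free toy sketch of one common sparse autoencoder formulation: a ReLU encoder, a linear decoder, and a loss that adds an L1 sparsity penalty to the reconstruction error. The thread does not give the architecture, sizes, or loss actually used, so everything here (the class name, dimensions, initialization, and `l1_coeff`) is an assumption. It only computes the forward pass and loss; real training would use an autodiff framework on residual-stream activations.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SparseAutoencoder:
    """Toy SAE: f = ReLU(W_enc (x - b_dec) + b_enc), x_hat = W_dec f + b_dec."""

    def __init__(self, d_model, d_hidden, seed=0):
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
        self.b_enc = [0.0] * d_hidden
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # Subtract the decoder bias, project up, then apply ReLU.
        centred = [xi - b for xi, b in zip(x, self.b_dec)]
        pre = matvec(self.w_enc, centred)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, features):
        return [r + b for r, b in zip(matvec(self.w_dec, features), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        features = self.encode(x)
        x_hat = self.decode(features)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        # Features are non-negative after ReLU, so their sum is the L1 norm.
        return mse + l1_coeff * sum(features)


if __name__ == "__main__":
    sae = SparseAutoencoder(d_model=8, d_hidden=32)
    x = [0.5] * 8                          # stand-in for one residual-stream vector
    print(len(sae.encode(x)), sae.loss(x))
```

The hidden layer is wider than the input so that, once the L1 term pushes most features to zero, each remaining active feature has room to specialize on one pattern.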
