Here is the model it was trained on! Preprint went up on bioRxiv last week: www.biorxiv.org/content/10.1...
There's much more work to be done in evaluating and interpreting the features, but these early findings suggest this could be a valuable approach for understanding what these models learn about the language of life 🧬 8/8
The features have distinct but sometimes overlapping activation patterns, suggesting they might detect different parts of the same Alu element or regulatory sequence. There are a lot of these! I've only investigated a handful of features, but these initial results are promising! 7/8
To visualize where these features activate across the genome, I uploaded the activation sites to the UCSC Genome Browser. Comparing against genomic annotations reveals that these features tend to activate in consistent patterns, often near SINE elements 6/8
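For anyone wanting to try the browser step: a rough sketch of turning activation sites into a BED custom track that the UCSC Genome Browser can load. This is not necessarily how the upload above was done. The function name, the `(chrom, start, end, activation)` tuple layout, and the scaling of activations to the 0-1000 BED score range are all assumptions for illustration.

```python
def activations_to_bed(sites, feature_id):
    """Build BED track text from activation sites.

    sites: iterable of (chrom, start, end, activation), using 0-based,
    half-open coordinates as BED expects. (Hypothetical input layout.)
    """
    sites = list(sites)
    # Scale activations so the strongest site gets the maximum BED score.
    max_act = max((s[3] for s in sites), default=0.0) or 1.0
    lines = [
        f'track name="SAE_feature_{feature_id}" '
        f'description="SAE feature {feature_id} activations" useScore=1'
    ]
    for chrom, start, end, act in sorted(sites, key=lambda s: (s[0], s[1])):
        # BED scores are integers between 0 and 1000.
        score = max(0, min(1000, int(round(1000 * act / max_act))))
        lines.append(f"{chrom}\t{start}\t{end}\tf{feature_id}\t{score}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    example = [("chr1", 100, 130, 2.0), ("chr1", 50, 80, 1.0)]  # made-up sites
    print(activations_to_bed(example, 1685), end="")
```

The resulting text can be pasted or uploaded as a custom track; with `useScore=1` the browser shades each site by its score.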
Or this (less significant) alignment with a MEF2A binding site for feature 3990 5/8
Some of these also look like they may be transcription factor binding sites, such as feature 1685 here, which gets a highly significant result for alignment with this ZNF460 binding site 4/8
Many of these detect Alu elements: stretches of DNA about 300bp long that became abundant through copy-paste events during evolution. Making up over 10% of our genome, these are the most common mobile genetic elements in humans 3/8
To identify these motifs, I looked at the 30bp sequence windows surrounding each feature's strongest activation sites. When examining the top activating sequences together, clear patterns emerged, showing that these features reliably detect specific DNA sequences 2/8
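A minimal sketch of that windowing step, for the curious. It assumes one activation value per base, which may not match the model's tokenization, and it uses a simple per-column majority vote as a stand-in for proper motif analysis. The function names and the default `k` are made up for illustration.

```python
from collections import Counter


def top_windows(sequence, activations, k=3, window=30):
    """Return the window-bp slices centered on the k strongest activation sites."""
    half = window // 2
    # Rank positions by activation, strongest first.
    ranked = sorted(range(len(activations)),
                    key=lambda i: activations[i], reverse=True)
    windows = []
    for pos in ranked:
        start, end = pos - half, pos + half
        if start < 0 or end > len(sequence):
            continue  # skip sites too close to the sequence edge
        windows.append(sequence[start:end])
        if len(windows) == k:
            break
    return windows


def consensus(windows):
    """Most common base at each column of equal-length windows."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*windows))
```

Stacking the top windows like this is what makes a shared motif visible; in practice neighbouring high-activation positions overlap, so some de-duplication of nearby sites would likely be needed.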
I trained a sparse autoencoder on the middle layer residual stream of my genome language model and found human-interpretable latent features that consistently detect specific DNA motifs!
🧵 1/8
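For readers new to sparse autoencoders, here is a dependency-free sketch of one forward pass on a single residual-stream vector. It uses a common formulation (ReLU encoder, linear decoder, reconstruction error plus an L1 sparsity penalty); the thread does not say which variant was actually trained, so the parameter names, the decoder-bias centering, and `l1_coeff` are assumptions.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=1e-3):
    """One sparse-autoencoder forward pass.

    x:     residual-stream vector, length d_model
    W_enc: n_features rows of length d_model
    W_dec: d_model rows of length n_features
    """
    # Encode: sparse, non-negative feature activations via ReLU.
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    feats = [max(0.0, h + b) for h, b in zip(matvec(W_enc, centered), b_enc)]
    # Decode: reconstruct the residual-stream vector from the features.
    recon = [r + b for r, b in zip(matvec(W_dec, feats), b_dec)]
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    # feats are non-negative, so their sum is the L1 norm.
    loss = mse + l1_coeff * sum(feats)
    return feats, recon, loss
```

The entries of `feats` are the per-feature activations that get inspected for motifs; a real training run would use an autodiff framework and batches rather than plain lists.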