
Abdul Muntakim Rafi

@muntakimrafi

PhD candidate @SBME_UBC | Machine Learning | Gene regulation

110
Followers
542
Following
33
Posts
12.11.2024
Joined

Latest posts by Abdul Muntakim Rafi @muntakimrafi

cool work!

25.02.2026 09:39 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Refining the cis-regulatory grammar learned by sequence-to-activity models by increasing model resolution Chromatin accessibility can be measured genome-wide with ATAC-seq, enabling the discovery of regulatory regions that control gene expression and determine cell type. Deep genomic sequence-to-function ...

a lot of important benchmarks shown here
@lxsasse.bsky.social @saramostafavi.bsky.social

11.03.2025 06:55 πŸ‘ 2 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Amazing. Will get back to you.

30.01.2025 19:42 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

- we haven't yet tried a varied set of downstream tasks. I believe some downstream tasks would have more leakage than others
- we haven't tried a varied set of pretrained models
- need to integrate hashFrag. we used chrom. splits before; there might be a reason we didn't see much difference in magnitude

30.01.2025 19:41 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

this is a very important point. The expression levels for the different test subsets span a wide range (they are not only sequences with high expression or sequences with low expression).

30.01.2025 19:27 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

- for models that overfit to different degrees, the drop in performance would be different.
- the drop in performance would vary by dataset and task as well.

30.01.2025 19:23 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

I think it's possible. Actually, this was and remains on our to-do list.

30.01.2025 18:38 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Beware of Data Leakage from Protein LLM Pretraining Pretrained protein language models are becoming increasingly popular as a backbone for protein property inference tasks such as structure prediction or function annotation, accelerating biological res...

therefore, I cannot provide any evidence of leakage happening in transfer learning (for now). But I would suggest avoiding it where possible.
take a look at @jmbartoszewicz.bsky.social & Melania's
www.biorxiv.org/content/10.1...

30.01.2025 18:32 πŸ‘ 3 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

this was a problem I was particularly interested in: showing that leakage can occur when testing a fine-tuned model on pretraining data. A student from our group pursued it for some time, but we were unable to detect leakage in transfer learning to a degree where everyone would care.

30.01.2025 18:32 πŸ‘ 1 πŸ” 0 πŸ’¬ 2 πŸ“Œ 0

New (and hotly anticipated - at least by me) preprint from my group describing a better way to partition training data for genomic-trained models to solve the long-neglected problem of homology-based data leakage. Thread from first author @muntakimrafi.bsky.social 👇

27.01.2025 23:48 πŸ‘ 26 πŸ” 8 πŸ’¬ 3 πŸ“Œ 1
GitHub - de-Boer-Lab/hashFrag: A command-line tool to mitigate homology-based data leakage in sequence-to-expression models A command-line tool to mitigate homology-based data leakage in sequence-to-expression models - de-Boer-Lab/hashFrag

10/ hashFrag is openly available and accessible, so it's win, win, win: more accurate performance estimates, better performance overall, and easy to use. We hope that hashFrag sets the new standard for how data are split for training genome models
Github: github.com/de-Boer-Lab/...
Paper: www.biorxiv.org/content/10.1...

27.01.2025 23:04 πŸ‘ 2 πŸ” 0 πŸ’¬ 2 πŸ“Œ 0
A. Histogram showing the number of test sequences (y-axis) with corresponding maximum pairwise SW local alignment scores with the training sequences (x-axis) for both chromosomal splits (blue) and hashFrag-pure (red), with approximately 80% of the sequences for training and 20% for test sets. B. hashFrag-split trained models outperform chromosomal split trained models. Performances across 100 replicates (points; y-axes) of different models (columns) on the designed sequences from Gosai et al. (20) when trained on different chromosomal and hashFrag splits (x-axes). Statistical significance between hashFrag and chromosomally trained models was calculated using the two-sample t-test.

9/ Not only do hashFrag-generated train-test splits effectively mitigate leakage, but hashFrag-trained models even outperformed chromosomal split-trained models, showing that chromosomal splitting not only introduces train-test leakage but also creates inferior train-val splits.

27.01.2025 23:04 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
hashFrag removes overestimation of model performance. Model performance (Pearson r²; y-axes) across different models (columns) for different chromosomal splits (rows) following the removal of similar sequences using hashFrag-pure at different maximum SW score thresholds (x-axes).

8/ We applied hashFrag to test datasets. Across models tested, model performance was inflated by the presence of test sequences that were similar to training sequences. hashFrag revealed more reliable performance measures.

27.01.2025 23:04 πŸ‘ 0 πŸ” 0 πŸ’¬ 2 πŸ“Œ 0
Overview of the hashFrag method. Each sequence in the dataset is subjected to the BLASTn algorithm to identify candidate homologous sequences in the dataset. False-positive candidates (denoted with a red 'X') are subsequently removed based on their SW local alignment scores according to a specified threshold, resulting in a network where only probable homologs are connected (solid lines in the network). Cases of detected homology can be used to either filter out homologs from test data for existing data splits, further stratify the test split into subsets based on similarity to the train split, or create new orthogonal data splits.

7/ To detect and avoid homology-based leakage, we created hashFrag, which leverages BLAST to identify similar sequences and then either (1) filters out the leaked sequences from the test set, (2) stratifies the test set into subgroups by distance, or (3) creates leakage-free train-test splits.

27.01.2025 23:04 πŸ‘ 2 πŸ” 1 πŸ’¬ 1 πŸ“Œ 0
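The third option above can be illustrated with a minimal sketch of a leakage-free split. This is an assumed simplification, not the actual hashFrag implementation: given pairwise homology hits (the real pipeline finds candidates with BLASTn and verifies them with Smith-Waterman scores), sequences are grouped into homology clusters with union-find, and whole clusters are assigned to train or test so that homologs never straddle the split.

```python
def homology_aware_split(n_seqs, hits, test_frac=0.2):
    """Sketch of a cluster-level train/test split.

    hits: iterable of (i, j) index pairs whose alignment score exceeded
    the homology threshold (assumed precomputed, e.g. by BLASTn + SW).
    """
    # Union-find with path halving to merge sequences connected by homology.
    parent = list(range(n_seqs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in hits:
        parent[find(i)] = find(j)

    # Collect clusters of mutually homologous sequences.
    clusters = {}
    for i in range(n_seqs):
        clusters.setdefault(find(i), []).append(i)

    # Fill the test set cluster-by-cluster; a cluster is never divided,
    # so no test sequence has a homolog in train.
    test, train = [], []
    target = int(test_frac * n_seqs)
    for members in sorted(clusters.values(), key=len):
        (test if len(test) < target else train).extend(members)
    return train, test
```

Filling the test set with the smallest clusters first is an arbitrary choice for this sketch; any cluster-level assignment preserves the no-straddling guarantee.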
A. Percentage of GWAS SNVs (y-axis) with SNV doppelgängers of each sequence length (x-axis). B. Number of fine-mapped GWAS SNVs (y-axes) with the corresponding number of SNV doppelgängers (x-axes) on other chromosomes in the genome for 41 bp regions

6/ We analyzed GWAS SNVs from OpenTarget with PIP>0.1 and found a substantial percentage of these SNVs have their alternate alleles, along with their flanking sequences, replicated on other chromosomes, often many times.

27.01.2025 23:04 πŸ‘ 0 πŸ” 0 πŸ’¬ 2 πŸ“Œ 0
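A toy version of this doppelgänger scan, under stated assumptions: chromosomes fit in memory as plain strings, and a "doppelgänger" is an exact copy of the alternate allele plus its flanking sequence (for 41 bp regions, 20 bp on each side). The function name and interface are illustrative, not from the paper's code.

```python
def count_doppelgangers(chroms, chrom, pos, alt, flank=20):
    """Count exact copies of alt allele + flanks elsewhere in the genome.

    chroms: dict mapping chromosome name -> sequence string (assumed).
    pos: 0-based position of the SNV on `chrom`; alt: alternate base.
    """
    seq = chroms[chrom]
    # Build the alternate-allele window: left flank + alt base + right flank.
    window = seq[pos - flank:pos] + alt + seq[pos + 1:pos + 1 + flank]
    hits = 0
    for name, s in chroms.items():
        start = 0
        while (i := s.find(window, start)) != -1:
            # Skip the SNV's own locus (it carries the reference allele,
            # so an exact match there only occurs if alt == ref).
            if not (name == chrom and i == pos - flank):
                hits += 1
            start = i + 1
    return hits
```

Any nonzero count means a model could "predict" the variant's context from a memorized copy elsewhere in the genome rather than from regulatory logic.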
Illustration of (i) homology across chromosomes, (ii) SNVs associated with diseases, and (iii) SNV doppelgängers, sequences elsewhere in the genome with an identical sequence to the GWAS alternate allele, including its flanking region.

5/ An important application of models is to predict the effect of variants. However, variants, along with their flanking regions, can be replicated throughout the genome. Without accounting for homology, you can't tell if the model's prediction is based on learned cis-regulatory logic or memorization.

27.01.2025 23:04 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Neural networks trained on different chromosomal splits show the same trend of varying levels of performance on different degrees of homology. Performance of different models (columns) in Pearson r² (y-axes) during model training (x-axes) for different chromosomal splits.

4/ We saw a very interesting trend where models fit to the most similar test sequences early during training, faster than they fit the overall training data, making these sequences unreliable for evaluating actual performance.

27.01.2025 23:04 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Model performance on test data depends on similarity to training data. Performance comparison (Pearson r²; y-axes) of different models (OverfitNN, DREAM-CNN, DREAM-RNN, DREAM-Attn, and MPRAnn; colors) across varying levels of homology (SW alignment score, x-axes) in different chromosomal folds.

3/ We created the cheeky OverfitNN as a maximally overfit benchmark, which is nearest neighbor-based and has no understanding of cis-regulation. As expected, OverfitNN only works well for closely related sequences, but even neural networks work best for sequences that are similar to their training data.

27.01.2025 23:04 πŸ‘ 0 πŸ” 0 πŸ’¬ 2 πŸ“Œ 0
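A sketch of an OverfitNN-style nearest-neighbor baseline (an assumed reimplementation, not the paper's code): predict a test sequence's activity as the measured activity of its most similar training sequence. The paper scores similarity with Smith-Waterman local alignment; `difflib`'s ratio is swapped in here to keep the example self-contained.

```python
from difflib import SequenceMatcher


def overfit_nn_predict(train_seqs, train_labels, test_seq):
    """Return the label of the training sequence most similar to test_seq.

    Similarity here is difflib's matching-block ratio, a stand-in for the
    Smith-Waterman alignment score used in the actual benchmark.
    """
    sims = [SequenceMatcher(None, test_seq, s).ratio() for s in train_seqs]
    best = max(range(len(train_seqs)), key=sims.__getitem__)
    return train_labels[best]
```

Such a baseline has no cis-regulatory model at all, so any good performance it achieves on a test set is evidence of train-test similarity rather than learned biology.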
Homology is common between chromosomes. Histogram showing the number of test sequences (y-axis) with corresponding maximum pairwise SW local alignment scores with the training sequences (x-axis) for both genomic (blue) and dinucleotide shuffled (red) sequences, with training and test sets randomly sampled from distinct chromosome sets (20,000 each).

2/ We compared regulatory regions against each other using chromosomal splitting and found that many genomic sequences are very similar compared to unrelated sequences. We set out to investigate how this similarity could cause train-test leakage.

27.01.2025 23:04 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Homologous sequences from two different chromosomes can share functional genomic signals. ATAC-seq read counts (y-axis) for two homologous 1000 bp regions (x-axis) on chromosomes 9 and 16 (colours) in K562 cells.

1/ Typically, the genome is split into train & test by chromosome, without accounting for homologous sequences. Because similar sequences encode similar activities, a model could conceivably correctly predict the activity of test sequences that are very similar to train sequences just by memorizing them.

27.01.2025 23:04 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

0/ Essential reading for anyone training or using sequence-function models trained on genomic sequences! 🚨 In our new preprint, we explore the ways homology within genomes can cause leakage when training sequence-based models and ways to prevent it.

27.01.2025 23:04 πŸ‘ 26 πŸ” 12 πŸ’¬ 1 πŸ“Œ 3

Had a lot of fun at the CSHL Biological Data Science conference.

Thanks to the scholarship from the "James P. Taylor Foundation for open science" for making it possible.

#cshl

17.11.2024 23:02 πŸ‘ 2 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

I am attending the Biological Data Science Meeting at CSHL. I will be giving a talk this Friday morning on the results from the Random Promoter DREAM Challenge, and will also be presenting a poster on recent work where we address and solve homology-based leakage in genome-trained models.

14.11.2024 07:13 πŸ‘ 2 πŸ” 0 πŸ’¬ 0 πŸ“Œ 1

Amazing collaboration between de Boer lab (@CarldeBoerPhD, myself) and Yachie lab (@yachielab, @nzmyachie, Brett Kiyota)

14.11.2024 06:31 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
Kipoi Seminar - Abdul Muntakim Rafi (University of British Columbia)

Thrilled to share our research at the recent @KipoiZoo seminar! 🧬 We showed how chromosomal splitting of the genome can cause train-test leakage through sequence homology and proposed a scalable solution to tackle it. Preprint coming soon!

youtu.be/0_08qB0wLoM?...

14.11.2024 06:31 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
A community effort to optimize sequence-based deep learning models of gene regulation - Nature Biotechnology A benchmarking competition improves tools that predict how regulatory regions control gene expression.

9/ With so many cool technologies being developed, it's an exciting time to be involved in sequence modeling! Make sure to use and strive to beat the state-of-the-art when applying models.
📄 Paper: www.nature.com/articles/s41...
💻 GitHub: github.com/de-Boer-Lab/random-promoter-dream-challenge-2022

14.11.2024 06:25 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

8/ A massive thank you to everyone who was part of this long journey! I feel privileged to be involved in this project and excited that we've managed to present our extensive analysis in a way that's easy to digest.

14.11.2024 06:25 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

7/ We were also able to create even better models! These models not only beat SOTA on our yeast dataset but also outperformed benchmarks on Drosophila and human genomic data. We surpassed well-known models like DeepSTARR (249bp) and ChromBPNet (2114bp)!

14.11.2024 06:25 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

6/ We saw that how models are trained is even more important than the network architecture. Models with completely different architectures, when trained on our large dataset, started capturing biology similarly.

14.11.2024 06:25 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

5/ The overarching goal was to understand how different NN architectures and training strategies affect performance. So, after the challenge, we developed 'Prix Fixe': a framework that lets us mix and match different components from top models to build new ones and see what works best.

14.11.2024 06:25 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0