Sina Majidian (@sinamajidian)

Overview of the PlantCAD2 model. (A) Comparison of PlantCAD1 and PlantCAD2 model configurations. PlantCAD2 introduces a longer context window, upgraded architecture (Mamba2), expanded pre-training species set, and scaled model sizes (small: 88M, medium: 311M, large: 694M parameters), while maintaining single-nucleotide tokenization. (B) Schematic of the PlantCAD2 architecture based on Mamba2 with reverse-complement (RC) equivariance, convolutional and state space modules (SSM), and a masked language modeling objective applied to 8,192 bp input sequences. (C) Total throughput (sequences/second) of PlantCAD2 models on NVIDIA H100 80 GB PCIe GPU across batch sizes (1–64). Values on the bar represent mean throughput across batch sizes. (D) Effect of context window length on model performance. The y-axis shows the prediction accuracy of three models when masking the single central token in the held-out test set. (E) Phylogenetic distribution of the 65 angiosperm genomes across flowering plant orders. Numbers in parentheses indicate the number of species included from each order.
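A rough illustration of two ideas in the caption above: reverse-complementing a sequence (the symmetry that RC equivariance respects) and masking the single central token as in panel D. This is only a sketch; the helper names and the "[MASK]" symbol are assumptions for illustration, not PlantCAD2's actual API or vocabulary.

```python
# Hypothetical helpers; not part of the PlantCAD2 codebase.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def mask_center(seq, mask_token="[MASK]"):
    """Tokenize at single-nucleotide resolution and mask the central base.

    Returns (tokens, target_base, index). "[MASK]" is an assumed symbol.
    """
    tokens = list(seq.upper())
    i = len(tokens) // 2
    target = tokens[i]
    tokens[i] = mask_token
    return tokens, target, i


tokens, target, idx = mask_center("ACGTA")
rc = reverse_complement("ACGTT")
```

An RC-equivariant model should give consistent predictions for a sequence and its reverse complement, so a check of that property would compare the prediction at position `idx` on the forward strand with the complemented prediction at the mirrored position on the reverse strand.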

PlantCAD2: A Long-Context DNA Language Model for Cross-Species Functional Annotation in Angiosperms
www.biorxiv.org/content/10.1...

10.03.2026 15:47 👍 1 🔁 0 💬 0 📌 0

Figure 1 | GeneCAD architecture compared with existing annotation pipelines. (a) Three pipeline families. Evidence-supported tools (for example BRAKER3/AUGUSTUS) align RNA-seq and proteomic data to produce multi-isoform annotations. Ab initio deep-learning tools (for example Helixer, Tiberius) operate only on DNA and often blur feature boundaries. GeneCAD uses a conservation-aware, foundation-model strategy: it ingests PlantCAD2 embeddings and applies structured decoding to output one canonical, structurally coherent transcript per locus. (b) GeneCAD architecture. Genome sequence is embedded with PlantCAD2 and labeled at single-nucleotide resolution with BILOU tags. An eight-layer ModernBERT encoder produces contextual states that feed three task streams for transcription boundaries, splicing, and translation. A gated MLP fuses the streams to yield per-base logits. A chromosome-wide CRF with empirically derived transition constraints performs Viterbi decoding, enforcing valid feature order and splice consistency. Post-processing reconnects window-split loci when a single in-frame ORF is supported, and predicted CDSs are screened with ReelProtein. The final output is a GFF3 with one canonical transcript per gene.
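To make the constrained decoding step in the caption concrete, here is a minimal sketch of Viterbi decoding over BILOU tags with hard transition constraints. It assumes a single feature type and a hand-written table of allowed transitions; GeneCAD's constraints are described as empirically derived and cover several feature types, so the table, the function name, and the toy scores below are all illustrative assumptions.

```python
import math

TAGS = ["B", "I", "L", "O", "U"]

# Assumed textbook BILOU constraints (not GeneCAD's empirical table):
# a feature opened with B must continue with I or close with L.
ALLOWED = {
    "B": {"I", "L"},
    "I": {"I", "L"},
    "L": {"B", "O", "U"},
    "O": {"B", "O", "U"},
    "U": {"B", "O", "U"},
}
START = ("B", "O", "U")  # tags allowed at the first position
END = ("L", "O", "U")    # tags allowed at the last position


def viterbi(scores):
    """Return the best valid tag path.

    scores: list of dicts mapping each tag to a per-base log-space score.
    """
    best = {t: (scores[0][t] if t in START else -math.inf) for t in TAGS}
    back = []
    for pos in scores[1:]:
        new, ptr = {}, {}
        for t in TAGS:
            # Best predecessor among tags allowed to transition into t.
            prev = max((p for p in TAGS if t in ALLOWED[p]),
                       key=lambda p: best[p])
            new[t] = best[prev] + pos[t]
            ptr[t] = prev
        best = new
        back.append(ptr)
    # Trace back from the best valid final tag.
    path = [max(END, key=lambda t: best[t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def row(**kw):
    """Toy score row: unspecified tags get a low default score."""
    return {t: kw.get(t, -5.0) for t in TAGS}


toy = [row(O=0.0, B=-0.5), row(I=0.0, B=-1.0, O=-2.0),
       row(I=0.0), row(L=0.0), row(O=0.0)]
decoded = viterbi(toy)
```

In the toy input, picking the top-scoring tag at each base independently would give O, I, I, L, O, which starts an interior tag without an opening B. The constrained decoder returns B, I, I, L, O instead, which is the kind of boundary coherence the caption attributes to the CRF.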

GeneCAD: Plant Genome Annotation with a DNA Foundation Model
www.biorxiv.org/content/10.1...

GeneCAD improves transcript-level F1 by 8–10% on average over Helixer and BRAKER3!

10.03.2026 15:40 👍 2 🔁 1 💬 1 📌 0

1/ Excited to share my first first-author preprint from my PhD!

We introduce Perseus, a lineage-aware confidence estimation framework for taxonomic classification in long-read metagenomics.

Preprint: www.biorxiv.org/content/10.6...
Code: github.com/matnguyen/Pe...

09.03.2026 15:25 👍 13 🔁 8 💬 1 📌 0

Multi-context seeds enable fast and high-accuracy read mapping - Genome Biology A key step in sequence similarity search is to identify shared seeds between a query and a reference sequence. A well-known tradeoff is that longer seeds offer fast searches but reduce sensitivity in ...

1/ Our paper on Multi-Context Seeds is now out, with @tolyan.bsky.social spearheading the work and contributions from Nicolas and @marcelm.net. We introduce a new seeding concept that improves read alignment accuracy while maintaining speed.
link.springer.com/article/10.1...

09.03.2026 12:22 👍 18 🔁 12 💬 1 📌 0

Fig. 1 | Schematic of the OrthologTransformer model and downstream selection. a Input: a coding DNA sequence from Species A (source), prepended with a source_species token ssrc, is encoded. Output: the decoder, conditioned by a target_species token stgt, generates an orthologous coding sequence for Species B, permitting synonymous and conservative non‐synonymous substitutions and indels where supported by ortholog supervision. The model features a 20-layer encoder-decoder structure, with each layer equipped with Add & Normalization layers and Multi-head Attention mechanisms. Species tokens (ssrc, stgt) are prepended to the input sequence, enabling species-specific sequence conversion. b OrthologTransformer employs a two-stage learning approach consisting of pretraining and fine-tuning. In the pretraining phase, the model learns general sequence conversion patterns from many-to-many orthologous relationships across multiple species. In the fine-tuning phase, the model is specialized for specific one-to-one species pair conversions using targeted training data. c During candidate selection, a multi‐objective Monte Carlo Tree Search (MCTS) routine jointly optimizes GC content and mRNA secondary‐structure stability (MFE).

Fig. 5 | Predicted Structures, global and local structural conservations, and sequence-level properties of AI-designed PETase variants. a Predicted tertiary structures of twelve different PETase variants (AI-S1–AI-L5) generated by OrthologTransformer with various degrees of sequence modifications. The wild-type PETase structure (PDB entry 5XJH) is shown on the left for reference. The four numbers below each structure denote the counts of the modifications introduced in each variant in the following order: insertions/deletions/synonymous substitutions/non‐synonymous substitutions. b Global and local structural conservation of AI-designed PETase variants. TM‐score (global fold similarity), predicted structural stability, backbone RMSD, and per‐residue pLDDT are shown for AI‐designed variants (AI-S1–AI-L5), wild‐type (WT), and codon‐optimized (CO). The AI-designed variants, particularly those trained on broader datasets, achieved a favorable balance across these measures, indicating preservation of the PETase fold while permitting small, evolution-consistent modifications and highlighting the benefit of multi-objective optimization relative to conventional codon optimization. c Sequence-level properties. GC content and RNA secondary-structure free energy (ΔG) among AI-S1–AI-L5, WT, and CO are shown. The AI-designed variants converge toward the GC composition of B. subtilis (target host), whereas the wild-type I. sakaiensis PETase gene is substantially more GC-rich (~66.7%). The AI-designed sequences also exhibit favorable mRNA secondary-structure energetics. Source data for (b, c) is available in the Source Data file.

Cross-species gene redesign leveraging ortholog information and generative modeling
doi.org/10.1038/s414...

06.03.2026 20:51 👍 2 🔁 0 💬 1 📌 0

The ISCB platform is an excellent place to advertise and find positions in computational biology!
careers.iscb.org/jobs

04.03.2026 14:31 👍 3 🔁 1 💬 0 📌 0

Home

This looks absolutely great. For those of us interested in pangenomes, I am sure this will be a super place to get data and the interface is very clean (plotly). Congrats to the authors (I don't know if they are on bsky): pangbank.genoscope.cns.fr

04.03.2026 10:50 👍 8 🔁 4 💬 1 📌 0

Excited to share our pre-print on the curation of a new bioactivity dataset for metabolic transformations. 🧪🧑‍💻 We were surprised to find that roughly a quarter of our drug-metabolite-target combinations contain metabolites with retained or increased bioactivity relative to the parent drugs! #chemsky

02.03.2026 10:47 👍 10 🔁 3 💬 1 📌 0

Fig. 1 | FANTASIA pipeline overview. Input proteomes are preprocessed to remove sequences based on length and sequence similarity if needed. Then, embeddings are computed, and distance embedding similarity is calculated against the reference database (using two metrics at will). Optionally, it converts the standard GOPredSim output file to the input file format for topGO20 to facilitate its application in a wider biological workflow.

FANTASIA leverages language models to decode the functional dark proteome across the animal tree of life
www.nature.com/articles/s42...

01.03.2026 17:58 👍 2 🔁 0 💬 0 📌 0

Paralog interference contributes to the preservation of genetic redundancy Duplicated self-interacting proteins can interact and interfere with each other’s function. Cisneros, Mattenberger, et al. show that selection against interfering loss-of-function alleles extends the ...

New paper alert: Paralog interference contributes to the preservation of genetic redundancy www.cell.com/current-biol...

28.02.2026 02:00 👍 45 🔁 28 💬 0 📌 0

A very good list of Computational Biology Conferences and their deadlines!
databio.org/conferences/

27.02.2026 14:14 👍 11 🔁 0 💬 0 📌 0

Now out in @natcomms.nature.com :
versions 2.0 of both BiG-SCAPE and BiG-SLiCE! With significant speed and accuracy increases, as well as new interactive functionalities.
Read the full paper here #openaccess:
www.nature.com/articles/s41...

26.02.2026 12:18 👍 37 🔁 18 💬 1 📌 1

Assistant/Associate/Full Professor Computational Biology

Sounds like a great opportunity for a professor position in computational biology in the Netherlands careers.universiteitleiden.nl/job/Assistan...

25.02.2026 15:28 👍 13 🔁 13 💬 0 📌 0

Figure 1: Summary of TF-MoDISco

Figure 3: Continuous Jaccard similarity is preferable to cross-correlation for matching seqlets. Green checkmarks indicate matching positions.
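For readers unfamiliar with the metric named in the Figure 3 caption, here is a small sketch of a continuous Jaccard similarity between two importance-score vectors. It assumes the commonly cited formulation (sign-aware element-wise minimum of magnitudes over element-wise maximum of magnitudes); the function name and the choice to return 0.0 for all-zero inputs are assumptions, not taken from the TF-MoDISco code.

```python
def continuous_jaccard(x, y):
    """Sign-aware continuous Jaccard similarity of two equal-length vectors.

    Numerator: sum of min(|x_i|, |y_i|), negated where signs disagree.
    Denominator: sum of max(|x_i|, |y_i|).
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    num = 0.0
    den = 0.0
    for a, b in zip(x, y):
        sign = 1.0 if a * b >= 0 else -1.0
        num += sign * min(abs(a), abs(b))
        den += max(abs(a), abs(b))
    # Assumed convention: two all-zero vectors have similarity 0.0.
    return num / den if den > 0 else 0.0


same = continuous_jaccard([1.0, 2.0, 0.0], [1.0, 2.0, 0.0])
partial = continuous_jaccard([1.0, 2.0, 0.0], [1.0, 1.0, 0.0])
opposite = continuous_jaccard([1.0, 2.0], [-1.0, -2.0])
```

Unlike cross-correlation, this score is bounded between -1 and 1 and is not dominated by a single position with a very large magnitude, which is one plausible reason it behaves better when matching seqlets.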

TF-MoDISco: Transcription Factor Motif Discovery from
Importance Scores (2017) arxiv.org/abs/1811.00416

YouTube: youtube.com/watch?v=fXPGVJg956E
GitHub: github.com/kundajelab/t...

23.02.2026 12:54 👍 2 🔁 0 💬 0 📌 0

Can we simulate realistic evolutionary trajectories and “replay the tape of life”? In this work, we propose a flexible, generalizable deep learning framework for modeling how the entire protein sequence evolves over time while capturing complex interactions across sites. 1/n
doi.org/10.64898/202...

21.02.2026 17:13 👍 83 🔁 35 💬 3 📌 1

Structural Variants ESEB Special Topic Network

🧬We are launching STRiVE, a @eseb.bsky.social Special Topic Network on the evolutionary role of structural genomic variation.

🗓️Std:
29/04: Online seminar w/ L. Rieseberg
8-10/07: Kick-off in Porto

Join us: structuralvariantsstn.github.io #Evolution #Genomics #StructuralVariants #Biology #PopGen

20.02.2026 11:49 👍 14 🔁 9 💬 0 📌 0

Vacancies

My university (Chalmers University of Technology in 🇸🇪) is recruiting an assistant professor in data-driven cell & molecular biology, funded by the DDLS program @scilifelab.se #chemsky #facultychemjobs

The position comes with a nice start-up package

www.chalmers.se/en/about-cha...

19.02.2026 14:37 👍 11 🔁 6 💬 1 📌 0

Come join us again in a next round of this massive online open science community effort! 💪
Sign up using the link in the thread.
It’s great fun, and really helps the scientific community. What more can you ask? 🙂

20.02.2026 05:20 👍 13 🔁 7 💬 0 📌 0

kache-hash: A dynamic, concurrent, and cache-efficient hash table for streaming k-mer operations https://www.biorxiv.org/content/10.64898/2026.02.13.705625v1

17.02.2026 05:47 👍 10 🔁 7 💬 0 📌 0

Annotating genomes at increased scale and resolution Nature Reviews Genetics - In this Review, Ji et al. overview how rapidly advancing experimental and computational methods are enabling improved and automated annotation of gene structure and...

Our new review on genome annotation just appeared in @naturerevgenet.bsky.social, with a particular focus on the human genome, with Hayden Ji and Mihaela Pertea: rdcu.be/e4mI1

17.02.2026 12:46 👍 24 🔁 12 💬 0 📌 0

COMBINE-lab - The skeptic’s guide to generative AI assisted coding An easy-to-use, flexible website template for labs, with automatic citations, GitHub tag imports, pre-built components, and more.

I’ve written a post about my recent experiences (successes) with AI coding models; the experiences that caused me to re-evaluate my initial judgements, the surprise I had at what can be accomplished, & some fears I have about these tools. Discussion welcome! combine-lab.github.io/blog/2026/02...

15.02.2026 04:31 👍 51 🔁 15 💬 8 📌 5

Come to Ascona and attend talks from

Maria Brbic
Charlotte Bunne
Faisal Mahmood
Dana Pe’er
Barbara Engelhardt
Caroline Uhler
Julien Gagneur
Marinka Zitnik
Julie Josse (INRIA)
Basile Wicky
Fabian Fröhlich

with a beautiful view of the lake in the Swiss Alps! ascona2026.sciencesconf.org

16.01.2026 11:00 👍 3 🔁 1 💬 0 📌 0

If you are a scientist, working on biology, wondering where to submit your manuscript given the current issues with the academic publishing system, check out wheretopublish.github.io!
We did this thinking change is possible. Let’s make it happen!

11.02.2026 20:31 👍 7 🔁 6 💬 0 📌 0

Figure 1. The genome is enriched with active promoters relative to random DNA. (A) We cloned the random library of 150 bp N-mer sequences (n=17,129, purple), and the genomic library of 100-300 bp sequences (n=91,866, magenta) into the dual-reporter plasmid MR1 (pMR1), which drives the expression of green fluorescent protein (GFP, teal) from inserts on the top DNA strand, and that of red fluorescent protein (RFP, orange) on the bottom strand. We transformed E. coli cells with the plasmid libraries. (B) We sorted the bacterial libraries into fluorescence bins at four fluorescence strengths: none, weak, moderate, and strong for both GFP and RFP (eight bins total) with a cell-sorter. We bulk-sequenced the library inserts from each bin and calculated fluorescence scores in arbitrary units (a.u.) ranging between one (none) and four (strongest) (Methods). (C) The probability that a DNA sequence in the random (purple) and genomic (magenta) libraries is a promoter relative to its AT-content. (D) For 102 position-weight matrices (PWMs) for transcription factors and sigma (σ) factors, we plot the percentage of sequences in each library (purple: random, magenta: genome) that encode at least one putative factor binding site (vertical axis) against the respective PWM’s information content in bits. We test for equality of the frequency distributions between the random and genome libraries with a paired t-test (p=7.48×10⁻¹²).

De-novo promoters emerge more readily from random DNA than from genomic DNA
www.biorxiv.org/content/10.1...

10.02.2026 11:17 👍 13 🔁 5 💬 0 📌 0

Home | Timothy Fuqua

I'm looking for a Swiss department to host me for an SNSF Starting Grant.

I research how gene expression evolves and emerges by combining wet lab + computational work in a variety of model systems (E. coli, Drosophila, yeast). More: timothyfuqua.com

If your department might be a match, let’s chat!

10.02.2026 10:42 👍 4 🔁 4 💬 0 📌 0

Introducing The Structural History of Eukarya (SHE): The first proteome-scale phylogeny constructed entirely from 3D structure.
We computed 300 trillion alignments across 1,542 species to map the tree of life. 🧵👇 (1/5)

07.02.2026 08:50 👍 84 🔁 40 💬 2 📌 0

Compbio Asia

Please spread the word:

We invite applications to a two-week Computational Biology workshop in Singapore, June 14-27.

This NSF-funded workshop brings together 16-20 US grad students with international peers.
Apply by March 21: compbioasia.net
🧵 Details below:

05.02.2026 17:22 👍 3 🔁 9 💬 2 📌 1

Tenure-Track Assistant Professor / Associate Professor in Bioinformatics and/or Computational Biology at Aarhus University, Denmark - Vacancy at Aarhus University Vacancy at Department of Molecular Biology and Genetics - BiRC - Bioinformatics Research Center, Aarhus University

Aarhus University is seeking a Tenure-Track Assistant/Associate Professor in Bioinformatics/Co… https://nat.au.dk/en/about-the-faculty/vacant-positions-and-career/job/tenure-track-assistant-professor-associate-professor-in-bioinformatics-and-or-computational-biology-at-aarhus-university-denmark #job

07.02.2026 23:11 👍 19 🔁 33 💬 0 📌 0

Biodiversity Bioinformatics Summer School This School is co-organized by SIB/ELIXIR Switzerland and de.NBI/ELIXIR Germany Overview Biodiversity is fundamental to ecosystem functioning, yet it

#biodiversity #bioinformatics Summer School announcement ... 21-26 June in Siegen, Germany co-organised by @sib.swiss & @denbi.bsky.social www.sib.swiss/training/cou...
🟢 eDNA & ecosystems
🟣 pangenome diversity
🔵 population genetics
🟡 comparative genomics

07.02.2026 12:07 👍 9 🔁 9 💬 0 📌 0

🚨🧪 Announcing our #ICLR2026 Workshop, Generative AI in Genomics (Gen2): Barriers and Frontiers! @iclr-conf.bsky.social

📣Call for: Full workshop papers (5-8 pages) and Tiny papers (2-4 pages)
📅Submission deadline: 7 February 2026 AoE
🌐Learn more: genai-in-genomics.github.io
(1/7)

12.01.2026 03:15 👍 4 🔁 3 💬 1 📌 0
