Overview of the PlantCAD2 model.
(A) Comparison of PlantCAD1 and PlantCAD2 model configurations. PlantCAD2 introduces a longer context window, upgraded architecture (Mamba2), expanded pre-training species set, and scaled model sizes (small: 88M, medium: 311M, large: 694M parameters), while maintaining single-nucleotide tokenization. (B) Schematic of the PlantCAD2 architecture based on Mamba2 with reverse-complement (RC) equivariance, convolutional and state space modules (SSM), and a masked language modeling objective applied to 8,192 bp input sequences. (C) Total throughput (sequences/second) of PlantCAD2 models on NVIDIA H100 80 GB PCIe GPU across batch sizes (1–64). Values on the bar represent mean throughput across batch sizes. (D) Effect of context window length on model performance. The y-axis shows the prediction accuracy of three models when masking the single central token in the held-out test set. (E) Phylogenetic distribution of the 65 angiosperm genomes across flowering plant orders. Numbers in parentheses indicate the number of species included from each order.
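Two ideas named in panels B and D, reverse complementation and masking the single central token, can be sketched in a few lines. This is only a toy illustration: the function names and the "[MASK]" placeholder are assumptions made here, not PlantCAD2's actual tokenizer or API.

```python
# Illustrative helpers only; names and the mask symbol are assumptions,
# not taken from the PlantCAD2 code base.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mask_center(seq, mask="[MASK]"):
    """Tokenize one nucleotide per token and mask the central token.

    Returns (tokens, masked_index, true_base).
    """
    tokens = list(seq)
    center = len(tokens) // 2
    true_base = tokens[center]
    tokens[center] = mask
    return tokens, center, true_base


tokens, idx, base = mask_center("ACGTACG")
print(tokens, idx, base)
print(reverse_complement("ACGTACG"))
```

An RC-equivariant model should give consistent predictions for a sequence and its reverse complement; the helper above only produces the second input and says nothing about how the architecture enforces that property.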
PlantCAD2: A Long-Context DNA Language Model for Cross-Species Functional Annotation in Angiosperms
www.biorxiv.org/content/10.1...
10.03.2026 15:47
👍 1
🔁 0
💬 0
📌 0
Figure 1 | GeneCAD architecture compared with existing annotation pipelines. (a) Three pipeline families. Evidence-supported tools (for example BRAKER3/AUGUSTUS) align RNA-seq and proteomic data to produce multi-isoform annotations. Ab initio deep-learning tools (for example Helixer, Tiberius) operate only on DNA and often blur feature boundaries. GeneCAD uses a conservation-aware, foundation-model strategy: it ingests PlantCAD2 embeddings and applies structured decoding to output one canonical, structurally coherent transcript per locus.
(b) GeneCAD architecture. Genome sequence is embedded with PlantCAD2 and labeled at single-nucleotide resolution with BILOU tags. An eight-layer ModernBERT encoder produces contextual states that feed three task streams for transcription boundaries, splicing, and translation. A gated MLP fuses the streams to yield per-base logits. A chromosome-wide CRF with empirically derived transition constraints performs Viterbi decoding, enforcing valid feature order and splice consistency. Post-processing reconnects window-split loci when a single in-frame ORF is supported, and predicted CDSs are screened with ReelProtein. The final output is a GFF3 with one canonical transcript per gene.
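To make the role of constrained Viterbi decoding in panel (b) concrete, here is a minimal sketch under strong simplifying assumptions: a reduced tag set (O, B, I, L, without U), hand-written hard transition rules instead of empirically derived CRF scores, and made-up per-base log-scores. It is not GeneCAD's decoder, only an example of how transition constraints can override an invalid per-base argmax.

```python
import math


def viterbi(emissions, allowed, states):
    """Best-scoring state path under hard transition constraints.

    emissions: list of dicts {state: log-score}, one dict per position.
    allowed:   set of permitted (previous_state, next_state) pairs.
    """
    best = [{s: emissions[0][s] for s in states}]
    back = []
    for scores in emissions[1:]:
        cur, ptr = {}, {}
        for s in states:
            cands = [(best[-1][p], p) for p in states if (p, s) in allowed]
            prev_score, prev = max(cands) if cands else (-math.inf, None)
            cur[s] = prev_score + scores[s]
            ptr[s] = prev
        best.append(cur)
        back.append(ptr)
    # Backtrace from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical tag set and rules: a feature must start with B and end with L.
states = ["O", "B", "I", "L"]
allowed = {("O", "O"), ("O", "B"), ("B", "I"), ("B", "L"),
           ("I", "I"), ("I", "L"), ("L", "O"), ("L", "B")}

# Made-up log-scores; the per-base argmax would be O, I, I, L, O (invalid).
emissions = [
    {"O": 0.0, "B": -2.0, "I": -3.0, "L": -3.0},
    {"O": -2.0, "B": -1.0, "I": -0.5, "L": -3.0},
    {"O": -3.0, "B": -3.0, "I": 0.0, "L": -2.0},
    {"O": -3.0, "B": -3.0, "I": -2.0, "L": 0.0},
    {"O": 0.0, "B": -3.0, "I": -3.0, "L": -3.0},
]

greedy = [max(states, key=lambda s: e[s]) for e in emissions]
decoded = viterbi(emissions, allowed, states)
print("greedy :", greedy)
print("viterbi:", decoded)
```

In this toy case the greedy path places I directly after O, which the rules forbid, while the decoded path opens the feature with B. A real CRF would add learned transition scores on top of such constraints.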
GeneCAD: Plant Genome Annotation with a DNA Foundation Model
www.biorxiv.org/content/10.1...
GeneCAD improves transcript-level F1 by 8–10% on average over Helixer and BRAKER3!
10.03.2026 15:40
👍 2
🔁 1
💬 1
📌 0
1/ Excited to share my first first-author preprint from my PhD!
We introduce Perseus, a lineage-aware confidence estimation framework for taxonomic classification in long-read metagenomics.
Preprint: www.biorxiv.org/content/10.6...
Code: github.com/matnguyen/Pe...
09.03.2026 15:25
👍 13
🔁 8
💬 1
📌 0
Fig. 1 | Schematic of the OrthologTransformer model and downstream selection. a Input: a coding DNA sequence from Species A (source), prepended with a source_species token ssrc, is encoded. Output: the decoder, conditioned by a target_species token stgt, generates an orthologous coding sequence for Species B, permitting synonymous and conservative non‐synonymous substitutions and indels where supported by ortholog supervision. The model features a 20-layer encoder-decoder structure, with each layer equipped with Add & Normalization layers and Multi-head Attention mechanisms. Species tokens (ssrc, stgt) are prepended to the input sequence, enabling species-specific sequence conversion. b OrthologTransformer employs a two-stage learning approach consisting of pre-training and fine-tuning. In the pre-training phase, the model learns general sequence conversion patterns from many-to-many orthologous relationships across multiple species. In the fine-tuning phase, the model is specialized for specific one-to-one species pair conversions using targeted training data. c During candidate selection, a multi‐objective Monte Carlo Tree Search (MCTS) routine jointly optimizes GC content and mRNA secondary‐structure stability (MFE).
Fig. 5 | Predicted Structures, global and local structural conservations, and sequence-level properties of AI-designed PETase variants. a Predicted tertiary structures of twelve different PETase variants (AI-S1–AI-L5) generated by OrthologTransformer with various degrees of sequence modifications. The wild-type PETase structure (PDB entry 5XJH) is shown on the left for reference. The four numbers below each structure denote the counts of the modifications introduced in each variant in the following order: insertions/deletions/synonymous substitutions/non‐synonymous substitutions. b Global and local structural conservation of AI-designed PETase variants. TM‐score (global fold similarity), predicted structural stability, backbone RMSD, and per‐residue pLDDT are shown for AI‐designed variants (AI-S1–AI-L5), wild‐type (WT), and codon‐optimized (CO). The AI-designed variants, particularly those trained on broader datasets, achieved a favorable balance across these measures, indicating preservation of the PETase fold while permitting small, evolution-consistent modifications and highlighting the benefit of multi-objective optimization relative to conventional codon optimization.
c Sequence-level properties. GC content and RNA secondary-structure free energy (ΔG) among AI-S1–AI-L5, WT, and CO are shown. The AI-designed variants converge toward the GC composition of B. subtilis (target host), whereas the wild-type I. sakaiensis PETase gene is substantially more GC-rich (~66.7%). The AI-designed sequences also exhibit favorable mRNA secondary-structure energetics. Source data for (b, c) is available in the Source Data file.
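For readers who want to reproduce the GC-content comparison in panel c on their own sequences, a minimal sketch follows. The function name is an assumption made here, and the input is a toy string chosen for illustration, not the PETase gene.

```python
def gc_content(seq):
    """Fraction of G/C bases among the unambiguous A/C/G/T bases."""
    seq = seq.upper()
    gc = sum(seq.count(b) for b in "GC")
    total = sum(seq.count(b) for b in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return gc / total


# Toy input: 4 of 6 bases are G or C.
print(f"{gc_content('GGCCAT'):.1%}")  # prints 66.7%
```

Ambiguous bases such as N are ignored in the denominator here; whether the paper handles them the same way is not stated in the caption.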
Cross-species gene redesign leveraging ortholog information and generative modeling
doi.org/10.1038/s414...
06.03.2026 20:51
👍 2
🔁 0
💬 1
📌 0
The ISCB platform is an excellent place to advertise and find positions in computational biology!
careers.iscb.org/jobs
04.03.2026 14:31
👍 3
🔁 1
💬 0
📌 0
Home
This looks absolutely great. For those of us interested in pangenomes, I am sure this will be a super place to get data and the interface is very clean (plotly). Congrats to the authors (I don't know if they are on bsky): pangbank.genoscope.cns.fr
04.03.2026 10:50
👍 8
🔁 4
💬 1
📌 0
Excited to share our pre-print on the curation of a new bioactivity dataset for metabolic transformations. 🧪🧑‍💻 We were surprised to find that roughly a quarter of our drug-metabolite-target combinations contain metabolites with retained or increased bioactivity relative to the parent drugs! #chemsky
02.03.2026 10:47
👍 10
🔁 3
💬 1
📌 0
Fig. 1 | FANTASIA pipeline overview. Input proteomes are preprocessed to remove sequences based on length and sequence similarity if needed. Then, embeddings are computed, and distance embedding similarity is calculated against the reference database (using two metrics at will). Optionally, it converts the standard GOPredSim output file to the input file format for topGO20 to facilitate its application in a wider biological workflow.
FANTASIA leverages language models to decode the functional dark proteome across the animal tree of life
www.nature.com/articles/s42...
01.03.2026 17:58
👍 2
🔁 0
💬 0
📌 0
A very good list of Computational Biology Conferences and their deadlines!
databio.org/conferences/
27.02.2026 14:14
👍 11
🔁 0
💬 0
📌 0
Now out in @natcomms.nature.com :
versions 2.0 of both BiG-SCAPE and BiG-SLiCE! With significant speed and accuracy increases, as well as new interactive functionalities.
Read the full paper here #openaccess:
www.nature.com/articles/s41...
26.02.2026 12:18
👍 37
🔁 18
💬 1
📌 1
Figure 1: Summary of TF-MoDISco
Figure 3: Continuous Jaccard similarity is preferable to cross-correlation for matching seqlets. Green checkmarks indicate matching positions.
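As a rough illustration of what a continuous Jaccard similarity measures, the sketch below uses the generic weighted form for non-negative vectors: the sum of element-wise minima divided by the sum of element-wise maxima. How TF-MoDISco handles signed importance scores, normalization, and alignment offsets is not described in this post, so treat this as an assumption-laden toy rather than the library's implementation.

```python
def continuous_jaccard(a, b):
    """Weighted Jaccard similarity for two equal-length non-negative vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if any(x < 0 for x in a) or any(y < 0 for y in b):
        raise ValueError("this simplified form expects non-negative values")
    denom = sum(max(x, y) for x, y in zip(a, b))
    if denom == 0:
        return 0.0  # convention chosen here for two all-zero vectors
    return sum(min(x, y) for x, y in zip(a, b)) / denom


# Toy per-position importance magnitudes for two seqlets.
seqlet_a = [0.0, 1.0, 2.0, 0.5]
seqlet_b = [0.0, 0.5, 2.0, 1.0]
print(continuous_jaccard(seqlet_a, seqlet_b))  # 3.0 / 4.0 = 0.75
```

Unlike cross-correlation, this ratio is bounded between 0 and 1 for non-negative inputs and reaches 1 only when the two vectors are identical.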
TF-MoDISco: Transcription Factor Motif Discovery from Importance Scores (2017) arxiv.org/abs/1811.00416
YouTube: youtube.com/watch?v=fXPGVJg956E
GitHub: github.com/kundajelab/t...
23.02.2026 12:54
👍 2
🔁 0
💬 0
📌 0
Can we simulate realistic evolutionary trajectories and “replay the tape of life”? In this work, we propose a flexible, generalizable deep learning framework for modeling how the entire protein sequence evolves over time while capturing complex interactions across sites. 1/n
doi.org/10.64898/202...
21.02.2026 17:13
👍 83
🔁 35
💬 3
📌 1
Structural Variants ESEB Special Topic Network
ESEB Special Topic Network
🧬We are launching STRiVE, a @eseb.bsky.social Special Topic Network on the evolutionary role of structural genomic variation.
🗓️Save the dates:
29/04: Online seminar w/ L. Rieseberg
8-10/07: Kick-off in Porto
Join us: structuralvariantsstn.github.io #Evolution #Genomics #StructuralVariants #Biology #PopGen
20.02.2026 11:49
👍 14
🔁 9
💬 0
📌 0
Vacancies
My university (Chalmers University of Technology in 🇸🇪) is recruiting an assistant professor in data-driven cell & molecular biology, funded by the DDLS program @scilifelab.se #chemsky #facultychemjobs
The position comes with a nice start-up package
www.chalmers.se/en/about-cha...
19.02.2026 14:37
👍 11
🔁 6
💬 1
📌 0
Come join us again for the next round of this massive online open science community effort! 💪
Sign up using the link in the thread.
It’s great fun, and really helps the scientific community. What more can you ask? 🙂
20.02.2026 05:20
👍 13
🔁 7
💬 0
📌 0
kache-hash: A dynamic, concurrent, and cache-efficient hash table for streaming k-mer operations https://www.biorxiv.org/content/10.64898/2026.02.13.705625v1
17.02.2026 05:47
👍 10
🔁 7
💬 0
📌 0
COMBINE-lab - The skeptic’s guide to generative AI assisted coding
I’ve written a post about my recent experiences (successes) with AI coding models; the experiences that caused me to re-evaluate my initial judgements, the surprise I had at what can be accomplished, & some fears I have about these tools. Discussion welcome! combine-lab.github.io/blog/2026/02...
15.02.2026 04:31
👍 51
🔁 15
💬 8
📌 5
Come to Ascona and attend talks from
Maria Brbic
Charlotte Bunne
Faisal Mahmood
Dana Pe’er
Barbara Engelhardt
Caroline Uhler
Julien Gagneur
Marinka Zitnik
Julie Josse (INRIA)
Basile Wicky
Fabian Fröhlich
with a beautiful view of the lake in the Swiss Alps! ascona2026.sciencesconf.org
16.01.2026 11:00
👍 3
🔁 1
💬 0
📌 0
If you are a scientist, working on biology, wondering where to submit your manuscript given the current issues with the academic publishing system, check out wheretopublish.github.io!
We did this thinking change is possible. Let’s make it happen!
11.02.2026 20:31
👍 7
🔁 6
💬 0
📌 0
Figure 1. The genome is enriched with active promoters relative to random DNA.
(A) We cloned the random library of 150 bp N-mer sequences (n=17,129, purple), and the genomic library of 100-300 bp sequences (n=91,866, magenta) into the dual-reporter plasmid MR1 (pMR1), which drives the expression of green fluorescent protein (GFP, teal) from inserts on the top DNA strand, and that of red fluorescent protein (RFP, orange) on the bottom strand. We transformed E. coli cells with the plasmid libraries. (B) We sorted the bacterial libraries into fluorescence bins at four fluorescence strengths: none, weak, moderate, and strong for both GFP and RFP (eight bins total) with a cell-sorter. We bulk-sequenced the library inserts from each bin and calculated fluorescence scores in arbitrary units (a.u.) ranging between one (none) and four (strongest) (Methods). (C) The probability that a DNA sequence in the random (purple) and genomic (magenta) libraries is a promoter relative to its AT-content. (D) For 102 position-weight matrices (PWMs) for transcription factors and sigma (σ) factors, we plot the percentage of sequences in each library (purple: random, magenta: genome) that encode at least one putative factor binding site (vertical axis) against the respective PWM’s information content in bits. We test for equality of the frequency distributions between the random and genome libraries with a paired t-test (p=7.48×10⁻¹²).
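Panel D plots against each PWM's information content in bits. A common way to compute that quantity, sketched below, is the relative entropy of each column against a uniform background (0.25 per base), summed over positions. The caption does not say which background or formula the authors used, so the uniform background and the function name are assumptions.

```python
import math


def information_content(pwm, background=0.25):
    """Total information content in bits of a position probability matrix.

    pwm: list of columns, each a list of four base probabilities summing to 1.
    """
    total = 0.0
    for column in pwm:
        # Zero-probability entries contribute nothing (limit of p*log p is 0).
        total += sum(p * math.log2(p / background) for p in column if p > 0)
    return total


# Toy three-position motif: fully specific, two-fold degenerate, uninformative.
toy_pwm = [
    [1.0, 0.0, 0.0, 0.0],      # 2 bits
    [0.5, 0.5, 0.0, 0.0],      # 1 bit
    [0.25, 0.25, 0.25, 0.25],  # 0 bits
]
print(information_content(toy_pwm))  # 3.0
```

Higher information content means a more specific motif, which is expected to occur by chance in fewer sequences.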
De-novo promoters emerge more readily from random DNA than from genomic DNA
www.biorxiv.org/content/10.1...
10.02.2026 11:17
👍 13
🔁 5
💬 0
📌 0
Home | Timothy Fuqua
I'm looking for a Swiss department to host me for an SNSF Starting Grant.
I research how gene expression evolves and emerges by combining wet lab + computational work in a variety of model systems (E. coli, Drosophila, yeast). More: timothyfuqua.com
If your department might be a match, let’s chat!
10.02.2026 10:42
👍 4
🔁 4
💬 0
📌 0
Introducing The Structural History of Eukarya (SHE): The first proteome-scale phylogeny constructed entirely from 3D structure.
We computed 300 trillion alignments across 1,542 species to map the tree of life. 🧵👇 (1/5)
07.02.2026 08:50
👍 84
🔁 40
💬 2
📌 0
Compbio Asia
Please spread the word:
We invite applications to a two-week Computational Biology workshop in Singapore, June 14-27.
This NSF-funded workshop brings together 16-20 US grad students with international peers.
Apply by March 21: compbioasia.net
🧵 Details below:
05.02.2026 17:22
👍 3
🔁 9
💬 2
📌 1
Biodiversity Bioinformatics Summer School
This School is co-organized by SIB/ELIXIR Switzerland and de.NBI/ELIXIR Germany
#biodiversity #bioinformatics Summer School announcement ... 21-26 June in Siegen, Germany co-organised by @sib.swiss & @denbi.bsky.social www.sib.swiss/training/cou...
🟢 eDNA & ecosystems
🟣 pangenome diversity
🔵 population genetics
🟡 comparative genomics
07.02.2026 12:07
👍 9
🔁 9
💬 0
📌 0
🚨🧪 Announcing our #ICLR2026 Workshop, Generative AI in Genomics (Gen2): Barriers and Frontiers! @iclr-conf.bsky.social
📣Call for: Full workshop papers (5-8 pages) and Tiny papers (2-4 pages)
📅Submission deadline: 7 February 2026 AoE
🌐Learn more: genai-in-genomics.github.io
(1/7)
12.01.2026 03:15
👍 4
🔁 3
💬 1
📌 0