sina b (@sina.bio) — bluesky.baby

Gennady Gorin, Ph.D. Senior Scientist applying stochastic models for therapeutic discovery

Hi everyone, I am looking for a new industry role in computational biology! Check out my portfolio of genomics, statistics, ML, and biophysics work at gennadygorin.github.io, and reach out if you have any suggestions or open roles!

02.03.2026 20:04 👍 2 🔁 1 💬 1 📌 1

"We therefore argue that publication systems should optimize separately for the dissemination of data and results versus the conveying of novel ideas, and the former should be machine-readable." - This is such a good point. Machine readability is a goal too often viewed as just an addon to >

20.02.2026 10:02 👍 2 🔁 1 💬 1 📌 0

14/ In practice, extract passages from documents, align structured fields (gene names, lab values, etc.) with `taln`, and flag extractions that don't align. `taln` recovers alignments that other tools miss, reducing the number of extractions that need manual review.

11.02.2026 21:55 👍 1 🔁 0 💬 0 📌 0
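A minimal sketch of the review workflow described in 14/, assuming a hand-rolled ordered-subsequence check in place of `taln` (whose API is not shown here). The passage, the field names, and the helpers `tokenize`, `aligns`, and `flag_unaligned` are all hypothetical.

```python
import re

def tokenize(text):
    # Alphanumeric runs become tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def aligns(value, passage_tokens):
    # True if the value's tokens appear in the passage in order, gaps allowed.
    remaining = iter(passage_tokens)
    return all(tok in remaining for tok in tokenize(value))

def flag_unaligned(extraction, passage):
    # Return the names of extracted fields that cannot be aligned to the passage.
    passage_tokens = tokenize(passage)
    return [field for field, value in extraction.items()
            if not aligns(value, passage_tokens)]

# Hypothetical passage and LLM extraction, for illustration only.
passage = "Natural killer (NK) cells expressed CD96 at high levels."
extraction = {"cell_type": "natural killer cells", "gene": "CD96", "tissue": "spleen"}

flagged = flag_unaligned(extraction, passage)
print(flagged)  # ['tissue']
```

Only the flagged fields would go to manual review; the rest are grounded in the passage by construction.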

GitHub - sbooeshaghi/taln: non-contiguous token alignment. Contribute to sbooeshaghi/taln development by creating an account on GitHub.

13/ The BOAT datasets and `taln` tool can be found here: github.com/sbooeshaghi/...

11.02.2026 21:55 👍 1 🔁 0 💬 1 📌 0

12/ To summarize, verifying LLM-extracted text doesn't require another LLM. Subword tokenization + ordered alignment recovers ~50% more phrases than word-level tokenization, deterministically, in milliseconds. We release `taln`, BOAT, and BIO-BOAT as open-source tools and benchmarks.

11.02.2026 21:55 👍 1 🔁 0 💬 1 📌 0

11/ Unlike LCS and difflib, `taln` enumerates all valid alignments. On BOAT, that's ~8 and on BIO-BOAT ~90, likely because biomedical terms like "VEGF-A165b" get split into 5+ subword tokens, combinatorially increasing matches. Domain-specific tokenizers would help here.

11.02.2026 21:55 👍 0 🔁 0 💬 1 📌 0
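To make "enumerates all valid alignments" in 11/ concrete, here is a small sketch that lists every order-preserving match of a target in a source. It is an illustrative brute-force enumeration, not `taln`'s implementation; `all_ordered_alignments` and the toy tokens are assumptions.

```python
def all_ordered_alignments(target, source):
    """Yield every strictly increasing tuple of source indices that spells target."""
    def extend(t_idx, start, prefix):
        if t_idx == len(target):
            yield tuple(prefix)
            return
        for s_idx in range(start, len(source)):
            if source[s_idx] == target[t_idx]:
                # Fix this match, then look for the next target token further right.
                yield from extend(t_idx + 1, s_idx + 1, prefix + [s_idx])
    yield from extend(0, 0, [])

# Toy example: repeated tokens in the source create several valid alignments.
source = ["the", "cell", "and", "the", "cell"]
target = ["the", "cell"]

alignments = list(all_ordered_alignments(target, source))
print(alignments)  # [(0, 1), (0, 4), (3, 4)]
```

Each repeated token multiplies the options, which is consistent with the larger counts reported when terms are split into many short subword tokens.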

10/ The remaining errors mostly come from subword boundaries that don't match how the phrase was originally selected. Edge tokens often have inconsistent start/end positions that change where the tokenizer splits the phrase.

11.02.2026 21:55 👍 0 🔁 0 💬 1 📌 0

9/ For non-contiguous phrases, we ablated one interior token from each target and reran alignment. Naive matching failed almost entirely (<1%). Word-level tokenization recovered ~50%. Subword tokenization recovered >96%.

11.02.2026 21:55 👍 1 🔁 0 💬 1 📌 0
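A sketch of the ablation idea in 9/: drop one interior token from a target and see which matcher still finds it. This is not the benchmark code; the helper names and the toy sentence are assumptions.

```python
def contiguous_match(target, source):
    # True if target occurs as an unbroken run of tokens in source.
    m = len(target)
    return any(source[i:i + m] == target for i in range(len(source) - m + 1))

def ordered_match(target, source):
    # True if target tokens occur in source in order, gaps allowed.
    remaining = iter(source)
    return all(tok in remaining for tok in target)

def ablate_interior(tokens):
    # One copy of the token list per interior position, with that token removed.
    return [tokens[:i] + tokens[i + 1:] for i in range(1, len(tokens) - 1)]

source = "the quick brown fox jumps".split()
target = source[0:4]  # a contiguous phrase taken from the source

ablated = ablate_interior(target)
contiguous_hits = [contiguous_match(t, source) for t in ablated]
ordered_hits = [ordered_match(t, source) for t in ablated]
print(ablated)          # [['the', 'brown', 'fox'], ['the', 'quick', 'fox']]
print(contiguous_hits)  # [False, False]
print(ordered_hits)     # [True, True]
```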

8/ We tested word-level and subword tokenization with four alignment tools (taln, LCS, difflib, and naive matching). On contiguous phrases, naive succeeds as expected, but word-level tokenization yielded ~50% missed alignments. Subword tokenization brought accuracy back to ~99%.

11.02.2026 21:55 👍 0 🔁 0 💬 1 📌 0

7/ With ordered alignment selected, we asked: how much does tokenization matter? We built the BOAT dataset (Berkeley Ordered Alignment of Text) from the Stanford Question Answering Dataset (SQuAD, by Pranav Rajpurkar and Percy Liang), containing 35K source-target pairs (and BIO-BOAT from bioRxiv).

11.02.2026 21:55 👍 0 🔁 0 💬 1 📌 0

6/ Ordered alignment sits at the practical sweet spot in this hierarchy. It allows gaps while preserving order, keeping the number of alignments manageable. We implemented this in a Python tool called `taln`, inspired by pseudoalignment tools like kallisto.

github.com/sbooeshaghi/taln

11.02.2026 21:55 👍 0 🔁 0 💬 1 📌 0
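For readers who want to see what "allows gaps while preserving order" in 6/ means in code, here is a minimal greedy sketch that returns the leftmost ordered alignment. It is not `taln`'s API or algorithm; `leftmost_ordered_alignment` and the token list are assumptions.

```python
def leftmost_ordered_alignment(target, source):
    """Return source indices matching target in order (gaps allowed), or None."""
    positions = []
    start = 0
    for tok in target:
        try:
            # Search only to the right of the previous match to preserve order.
            idx = source.index(tok, start)
        except ValueError:
            return None
        positions.append(idx)
        start = idx + 1
    return positions

# Hypothetical pre-tokenized source, for illustration only.
source = ["natural", "killer", "(", "nk", ")", "cells", "express", "cd96"]

print(leftmost_ordered_alignment(["natural", "killer", "cells"], source))  # [0, 1, 5]
print(leftmost_ordered_alignment(["cells", "natural"], source))            # None
```

The second call fails because the tokens exist but in the wrong order, which is the constraint that separates ordered alignment from the looser levels of the hierarchy.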

5/ Alignment methods form a hierarchy based on the constraints they impose on this map. Contiguous alignment -> ordered -> permutation -> rearrangement. Each relaxation admits more alignments but the search space grows fast—from linear to factorial to exponential.

11.02.2026 21:55 👍 0 🔁 0 💬 1 📌 0

4/ There's a long history of alignment methods in genomics and NLP. Pseudoalignment (@lpachter, @pmelsted) uses set inclusion + k-mers. LCS uses order-preserving subsequences. To select an alignment method, we formalize alignment as a map from target to source.

11.02.2026 21:55 👍 1 🔁 0 💬 1 📌 0

3/ To find these phrases, you need an alignment method that allows gaps, and a tokenization strategy that doesn't merge the text you're looking for with surrounding punctuation (both matter!). We studied how much.

11.02.2026 21:55 👍 0 🔁 0 💬 1 📌 0

2/ Take this sentence from a scRNA-seq paper (by @ADHildreth). An LLM correctly extracts "natural killer cells" and "CD96" as a cell-type marker gene pair. Both exist in the sentence, but the parenthetical "(NK)" breaks contiguity, so naive string matching fails.

11.02.2026 21:55 👍 0 🔁 0 💬 1 📌 0
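A minimal sketch of why both gaps and tokenization matter, as described in 2/ and 3/. The sentence is made up for illustration (it is not the sentence from the paper), and the helpers are hand-rolled, not `taln`'s API.

```python
import re

def whitespace_tokens(text):
    # Word-level tokens: punctuation stays glued to neighbouring words.
    return text.lower().split()

def punct_aware_tokens(text):
    # Alphanumeric runs become tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def ordered_match(target, source):
    # True if target tokens appear in source in order, gaps allowed.
    remaining = iter(source)
    return all(tok in remaining for tok in target)

# Hypothetical sentence with a parenthetical that breaks contiguity.
source = "Markers included CD96, which was enriched in natural killer (NK) cells."
target = "natural killer cells"

naive_hit = target in source.lower()
word_hit = ordered_match(whitespace_tokens(target), whitespace_tokens(source))
punct_hit = ordered_match(punct_aware_tokens(target), punct_aware_tokens(source))
print(naive_hit, word_hit, punct_hit)  # False False True
```

Naive matching fails on "(NK)"; gapped matching on whitespace tokens still fails because "cells." is not "cells"; only gaps plus punctuation-aware tokens recover the phrase.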

1/ LLMs are great at text extraction, but sometimes they hallucinate. A simple way to catch hallucinations is to check if the extracted text actually exists in the source. Turns out this is harder than it sounds. (new paper with Aaron Streets)

www.biorxiv.org/content/10.6...

11.02.2026 21:55 👍 1 🔁 2 💬 1 📌 0

Interesting article on LLM text extraction and the connection of that problem to (computational biology) sequence alignment. www.biorxiv.org/content/10.6... by @sina.bio and Aaron Streets.

11.02.2026 17:53 👍 8 🔁 1 💬 0 📌 0

I agree that machine readability enables these analyses in principle. Developing robust, grounded benchmarks will be essential for evaluating the utility of these kinds of analysis.

04.02.2026 22:13 👍 0 🔁 0 💬 0 📌 0

"publication systems [should] distinguish between dissemination of results & communication of ideas, and optimize them separately. Results should be in explicit, machine-readable form, while narrative text serves as an interpretive layer for human readers" www.biorxiv.org/content/10.6...

03.02.2026 20:34 👍 15 🔁 6 💬 1 📌 2

If you work in AI for Science, take a moment to familiarize yourself with a common failure mode: paranormal citations (or paracites). Our paper describes them.

03.02.2026 17:40 👍 8 🔁 2 💬 0 📌 1

AI hallucinations in science manuscripts are a nuisance. Paranormal citations, or paracites, will be a nightmare.

www.biorxiv.org/content/10.6... (w/ @sina.bio & @lauraluebbert.com).

03.02.2026 17:19 👍 32 🔁 11 💬 2 📌 3

Peer review is often opaque and confusing. @elife.bsky.social worked to change that.

In a new preprint, we show how eLife’s Publish, Review, Curate model makes it possible to evaluate AI-generated reviews (with OpenEval) against human peer review. w/ @lauraluebbert.com and @lpachter.bsky.social

03.02.2026 17:08 👍 17 🔁 8 💬 2 📌 1

single-cell is a fun field. for instance, one of the heavily curated bixbench scenarios is about interpreting the results of a sc analysis and comparing to ground truth. this ground truth is, of course, based on DE analyses with some truly remarkable p-values for n=5

06.03.2025 02:44 👍 2 🔁 1 💬 2 📌 0

How Science Can Adapt to a New Normal Opinion | In the wake of attacks on the research enterprise, scientists need to focus on protecting its fragile infrastructure.

"No one is coming out of the sky to give you your grant money. Your citation portfolio won’t survive this market crash. Your credentials mean nothing. Everything is going to change."

New for @undark.org

undark.org/2025/03/06/o...

06.03.2025 14:35 👍 170 🔁 81 💬 5 📌 31

Is Tanimoto a metric? No. However, here we show how to generate a metric consistent with the Tanimoto similarity. We also explore new properties of this index, and how it relates to other popular alternatives. ...

Not really my field but I love the abstract!

www.biorxiv.org/content/10.1...

23.02.2025 20:15 👍 2 🔁 1 💬 0 📌 0

“blocking retro nasal sensation with a nose clip significantly reduces the subjective and objective neural responses to sucrose taste”

12.02.2025 18:28 👍 0 🔁 0 💬 0 📌 0

PCA can be thought of as fitting a p-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. If some axis of the ellipsoid is small, then the variance along that axis is also small.

Figure 1b from https://www.nature.com/articles/s41586-024-08531-5/figures/1

Amazing to me how useful looking at data in 2D PCA continues to be, even though the approach sounds crazy on paper—"p-dimensional ellipsoid", rantings of a madman. PCA is the cockroach of dimension reduction. I expect it to be present in any advanced galactic civilization.

10.02.2025 09:44 👍 88 🔁 12 💬 3 📌 1
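The ellipsoid picture can be checked by hand in 2D: the principal-component variances are the eigenvalues of the sample covariance matrix, and the ellipse's semi-axes scale with their square roots. A minimal pure-Python sketch, assuming toy points that lie on a line; `pca_2d_variances` is a hypothetical helper.

```python
import math

def pca_2d_variances(points):
    """Return (largest, smallest) eigenvalue of the 2x2 sample covariance matrix."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    centre = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    return centre + spread, centre - spread

# Toy data on the line y = x: all variance lies along one axis.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
major, minor = pca_2d_variances(points)
print(major, minor)
```

Here the minor-axis variance is zero, so the fitted ellipse collapses to a segment along y = x, the limiting case of "if some axis is small, the variance along it is small".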

The End of Science’s Peacetime Opinion | Defending the practice of science from its adversaries will require dealing with some uncomfortable truths.

"One thing is certain: The changes we make ourselves will be healthier than the ones our adversaries demand."

New work for @undark.org:
undark.org/2025/02/06/o...

06.02.2025 13:52 👍 131 🔁 78 💬 6 📌 14

Lastly, the fact that this announcement, from a historically legitimate scientific agency, was made exclusively on Twitter, on a Friday evening, is, frankly, shameful.

08.02.2025 00:21 👍 2 🔁 0 💬 0 📌 0

What are the implications of this rate cut? I'm no expert but these outcomes may be impacted:

- fewer faculty hires
- department closures
- fewer students (undergrad/grad), harder for int'l students
- loss of US scientific hegemony

08.02.2025 00:21 👍 1 🔁 0 💬 1 📌 0
