
Houjun Liu

@jemoka.com

NLP & POMDPs; CS@Stanford; gradient descent enthusiast www: jemoka.com ac: nlp.stanford.edu/~houjun/

293 Followers · 611 Following · 59 Posts · Joined 05.03.2024

Latest posts by Houjun Liu @jemoka.com

refresh button for address bar visiting openreview.net

meet me at this button friends

11.11.2025 21:01 πŸ‘ 18 πŸ” 1 πŸ’¬ 3 πŸ“Œ 1

LMAO, openreview down at 9pm UTC on the dot, right when ICLR results are supposed to be releasing. Coincidence?

11.11.2025 21:01 πŸ‘ 4 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0

Introducing π˜π—΅π—Όπ˜‚π—΄π—΅π˜π—―π˜‚π—―π—―π—Ήπ—²π˜€: a *fully unsupervised* LM for input-adaptive parallel latent reasoning

βœ… Learn yourself a reasoning model with normal pretraining
βœ… Better perplexity compared to fixed thinking tokens

No fancy loss, no chain of thought labels πŸš€

02.10.2025 15:54 πŸ‘ 7 πŸ” 3 πŸ’¬ 1 πŸ“Œ 1

I'm really excited about this. Because this model is trained with literally nothing but LM loss, it helps create a new reasoning paradigm where reasoning capabilities are baked right in at pretraining, unifying train and test time behaviors.
Look ma, no distribution shift! πŸ™

02.10.2025 15:54 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Better yet, without us teaching the model to do this at all, it learned to allocate more compute at tokens of higher entropy (even as measured by an independently trained model of the same architecture), and use less compute where there's either too little or too much entropy. 🀯

02.10.2025 15:54 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

By just using our approach, you don't have to do any extra work to get pretraining gains! We show, at matched scale AND matched compute, that our approach achieves better pretraining perplexity than both regular transformers and manually inserted non-adaptive thinking tokens. 🥳

02.10.2025 15:54 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

We design a transformer variant that uses a score-attenuated "forking" mechanism to clone useful residual streams the model wants to update and attend to, creating a 𝗯𝘂𝗯𝗯𝗹𝗲 of latent computation for those highly informative tokens.

02.10.2025 15:54 πŸ‘ 3 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Current approaches to scaling inference-time compute require supervision with explicit chain-of-thought data, which limits thoughts to being sequential and expressed only in human language. 😔
Wouldn't it be nice if you could do normal pretraining and somehow get latent thinking for free? 🤔

02.10.2025 15:54 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Thoughtbubbles: an Unsupervised Method for Parallel Thinking in Latent Space Current approaches for scaling inference-time compute in transformers rely on training them to emit explicit chain-of-thought tokens before producing an answer. While these methods are powerful, they…

Joint work with my wonderful collaborators @shikharmurty.bsky.social, @robertcsordas.bsky.social, and @chrmanning.bsky.social.

Paper: arxiv.org/abs/2510.00219.
Code and Package: github.com/stanfordnlp/....

02.10.2025 15:54 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

New Paper Day! For EMNLP Findings: in LM red-teaming, we show you have to optimize for **both** perplexity and toxicity to get high-probability, hard-to-filter, natural attacks!

20.08.2025 19:51 πŸ‘ 6 πŸ” 2 πŸ’¬ 1 πŸ“Œ 0

Thanks to @schmidtsciences.bsky.social and Lambda Labs for generously supporting our work :)

20.08.2025 19:51 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
GitHub - sisl/astra-rl: The Adaptive Stress Testing for Robust AI (ASTRA) toolbox provides tooling to support model developers and testers across the full life cycle of building more robust AI systems through adaptive stress testing and adversarial training.

πŸ€” think this is all too much? No worries, we are also dropping a **PACKAGE** to do this for you. Check it out: github.com/sisl/astra-rl

20.08.2025 19:51 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

☝️ And so... you should optimize for **BOTH** attack success and perplexity to get the most effective attacks!

20.08.2025 19:51 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Even across baseline methods, low-perplexity prompts result in more effective attacks, but optimizing for attack success alone results in high-perplexity prompts.

20.08.2025 19:51 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

In fact, our method allows us to discover a Pareto tradeoff (🀯) between attack success and prompt likelihood; tuning a single parameter in our method travels along the Pareto-optimal front.

20.08.2025 19:51 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Using the Adaptive Stress Testing (AST) framework as a reward signal for online DPO-based optimization, we present a method that discovers prompts which are **both** high-probability and successful as attacks.

20.08.2025 19:51 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Most gradient-based red-teaming approaches produce very low-probability prompts, which previous work has shown are both easier to filter and poor negative examples for downstream hardening.

20.08.2025 19:51 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

Done at the Stanford Intelligent Systems Laboratory with my joint first author Amelia Hardy, along with our wonderful collaborators Allie Griffith, @bernardlange.bsky.social, Duncan Eddy, and Mykel Kochenderfer.

Paper:
arxiv.org/pdf/2407.09447
Python package to do this for yourself:
github.com/sisl/astra-rl

20.08.2025 19:51 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Accepted Papers – FOCS 2025

The list of accepted papers at #FOCS2025 is up!

focs.computer.org/2025/accepte...

13.07.2025 22:59 πŸ‘ 37 πŸ” 15 πŸ’¬ 0 πŸ“Œ 0

You're not too dumb for Haskell, you just need a reason to practice. :)

21.06.2025 08:17 πŸ‘ 14 πŸ” 3 πŸ’¬ 0 πŸ“Œ 3

Just published in JOSS: 'Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers' https://doi.org/10.21105/joss.08183

03.07.2025 13:13 πŸ‘ 1 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0

OCaml @ocaml.org is in The Economist!

27.06.2025 19:32 πŸ‘ 54 πŸ” 9 πŸ’¬ 4 πŸ“Œ 0

We’re proud to announce three new tenure-track assistant professors joining TTIC in Fall 2026: Yossi Gandelsman, Will Merrill, and Nick Tomlin (@nickatomlin.bsky.social). Meet them here: buff.ly/JH1DFtT

27.06.2025 16:29 πŸ‘ 7 πŸ” 2 πŸ’¬ 0 πŸ“Œ 0

New paper on the generalization of Flow Matching www.arxiv.org/abs/2506.03719

🀯 Why does flow matching generalize? Did you know that the flow matching target you're trying to learn *can only generate training points*?

w @quentinbertrand.bsky.social @annegnx.bsky.social @remiemonet.bsky.social πŸ‘‡πŸ‘‡πŸ‘‡

18.06.2025 08:08 πŸ‘ 55 πŸ” 17 πŸ’¬ 2 πŸ“Œ 3

New Paper Day! For ACL 2025 Findings:

You should **drop dropout** when you are training your LMs AND MLMs!

02.06.2025 01:22 πŸ‘ 11 πŸ” 4 πŸ’¬ 1 πŸ“Œ 0

I think there's a very cool training dynamics situation going on here: if you are pretraining on webtext and driving loss down to 0, dispersed representations matter a LOT less, since your training corpus already regularizes decently.

02.06.2025 01:23 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

Through a 🀏 pinch of interp, we show that model editing success gets degraded by pretraining with dropout.

Dispersed representations built by dropout => less consistent representation of the world => worse models.

02.06.2025 01:23 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

BERTs and encoder models are not spared either, with MLM and SQuAD performance degraded by turning on just 10% dropout.

02.06.2025 01:23 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0