
Zhaofeng Wu

@zhaofengwu

PhD student @ MIT | Previously PYI @ AI2 | MS'21 BS'19 BA'19 @ UW | zhaofengwu.github.io

377 Followers Β· 124 Following Β· 15 Posts Β· Joined 17.11.2024

Latest posts by Zhaofeng Wu @zhaofengwu

reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs Reward models have become a staple in modern NLP, serving as not only a scalable text evaluator, but also an indispensable component in many alignment recipes and inference-time algorithms. However, w...

Paper: arxiv.org/abs/2503.11751

It has been a very fun project. Thanks so much to all my collaborators Michi, Andrew, Yoon, Asli, and Marjan!

18.03.2025 16:01 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

πŸ’‘A simple method improves robustness: including an aux loss that encourages reward similarity between paraphrases βš–οΈ This generalizes to improving RM perf on diverse reWordBench transformations. More surprisingly, during alignment, regularized RMs lead to better outputs too πŸ“ˆ

18.03.2025 16:01 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
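The regularizer described above can be sketched as follows. This is a minimal toy version, not the paper's actual code: the function names, the squared-difference form of the auxiliary term, and the length-based "reward model" are all illustrative assumptions.

```python
import math

def ranking_loss(r_chosen, r_rejected):
    # Standard Bradley-Terry-style RM loss: -log sigmoid(r_chosen - r_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def consistency_loss(reward_fn, text, paraphrase):
    # Auxiliary term: penalize reward divergence between a text and its paraphrase
    return (reward_fn(text) - reward_fn(paraphrase)) ** 2

def total_loss(reward_fn, chosen, rejected, chosen_paraphrase, aux_weight=0.1):
    main = ranking_loss(reward_fn(chosen), reward_fn(rejected))
    aux = consistency_loss(reward_fn, chosen, chosen_paraphrase)
    return main + aux_weight * aux

# Toy length-based "reward model", only to make the sketch runnable
toy_reward = lambda s: len(s) / 10.0
loss = total_loss(toy_reward, "a helpful answer", "an answer", "a helpful reply")
```

The auxiliary term vanishes exactly when the model assigns identical rewards to a text and its paraphrase, which is the invariance the post describes.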

We create a benchmark 🌟reWordBench🌟 that consists of systematically transformed instances from RewardBench that maintain their semantics/ranking πŸŽ› On it, all top RMs on RewardBench degrade in accuracy ⏬ regardless of their size and type (classifier vs. generative)

18.03.2025 16:01 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

E.g., all math instances in RewardBench have an artifact: the preferred responses have the results in \boxed{} and the rejected responses put the results after a `# Answer` markdown header πŸ’€ Flipping the format πŸ”„ consistently degrades SOTA RM accuracy, up to >22% πŸ“‰

18.03.2025 16:01 πŸ‘ 0 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
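The format flip above can be illustrated with a small string transformation. This is a hypothetical helper (`flip_math_format` is not from the paper) showing the general idea of swapping a `\boxed{}` answer for a `# Answer` markdown header and back:

```python
import re

def flip_math_format(response: str) -> str:
    """Swap answer formats: \\boxed{x} <-> a trailing '# Answer\\nx' header."""
    boxed = re.search(r"\\boxed\{([^}]*)\}", response)
    if boxed:
        # \boxed{x} -> plain text plus a '# Answer' markdown header
        stripped = re.sub(r"\\boxed\{[^}]*\}", boxed.group(1), response)
        return stripped.rstrip() + "\n\n# Answer\n" + boxed.group(1)
    header = re.search(r"# Answer\s*\n(.+)", response)
    if header:
        # '# Answer\nx' -> \boxed{x}
        answer = header.group(1).strip()
        body = response[: header.start()].rstrip()
        return body + "\nSo the result is \\boxed{" + answer + "}."
    return response

flipped = flip_math_format("The sum is \\boxed{42}")
```

Since the transformation preserves the answer itself, any accuracy drop after flipping isolates the RM's reliance on the surface format.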

Robust reward models are critical for alignment/inference-time algos, auto eval, etc. (e.g. to prevent reward hacking which could render alignment ineffective). ⚠️ But we found that SOTA RMs are brittle 🫧 and easily flip predictions when the inputs are slightly transformed πŸƒ 🧡

18.03.2025 16:01 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
Like human brains, large language models reason about diverse data in a general way MIT researchers find large language models process diverse types of data, like different languages, audio inputs, images, etc., similarly to how humans reason about complex problems. Like humans, LLMs...

Like human brains, large language models reason about diverse data in a general way.

A new study shows LLMs represent different data types based on their underlying meaning & reason about data in their dominant language: bit.ly/3QrZvyy

19.02.2025 22:30 πŸ‘ 36 πŸ” 8 πŸ’¬ 5 πŸ“Œ 1

To appear @ #ICLR2025! We show that LMs represent semantically-equivalent inputs across languages, modalities, etc. similarly. This shared representation space is structured by the LM's dominant language, which is also relevant to recent phenomena where LMs "think" in ChineseπŸ€„οΈ in EnglishπŸ”  contexts

22.01.2025 18:10 πŸ‘ 11 πŸ” 2 πŸ’¬ 0 πŸ“Œ 0

We have released our code at github.com/ZhaofengWu/s.... We hope it will be useful for future studies of how LMs work!

17.12.2024 15:26 πŸ‘ 3 πŸ” 1 πŸ’¬ 0 πŸ“Œ 0
poster for paper

excited to be at #NeurIPS2024! I'll be presenting our data mixture inference attack πŸ—“οΈ Thu 4:30pm w/ @jon.jon.ke β€” stop by to learn what trained tokenizers reveal about LLM development (‼️) and chat about all things tokenizers.

πŸ”— arxiv.org/abs/2407.16607

11.12.2024 22:08 πŸ‘ 13 πŸ” 4 πŸ’¬ 0 πŸ“Œ 0

31% of US adults use generative AI for healthcare 🀯But most AI systems answer questions assertivelyβ€”even when they don’t have the necessary context. Introducing #MediQ a framework that enables LLMs to recognize uncertaintyπŸ€”and ask the right questions❓when info is missing: 🧡

06.12.2024 22:51 πŸ‘ 68 πŸ” 14 πŸ’¬ 2 πŸ“Œ 2


We hope our observations could inspire more work on understanding πŸ” model representations & algorithms and on controlling models, eventually leading to better models. πŸ¦™

This has been a super fun project with co-authors
@velocityyu.bsky.social, Dani, Jiasen, and Yoon!

02.12.2024 18:08 πŸ‘ 1 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

πŸ“3️⃣ we can intervene in this β€œsemantic hub” using English tokens to predictably & reliably steer πŸŽ›οΈ model behavior, even with non-English/non-language inputs. This means that the β€œsemantic hub” is not a vestigial byproduct of pretraining, but it causally affects model output.

02.12.2024 18:08 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
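At its simplest, an intervention like the one described can be sketched as adding a scaled token-embedding direction to a hidden state. All numbers and names below are made up for illustration; the paper's actual intervention procedure may differ.

```python
def steer(hidden, direction, alpha):
    # Move the hidden state along a token-embedding direction by strength alpha
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Made-up hidden state and a made-up embedding direction for an English token
hidden = [0.2, 0.9, 0.0]
english_token_dir = [1.0, 0.0, 0.0]

steered = steer(hidden, english_token_dir, alpha=1.5)
# The representation now points more strongly toward that token's direction,
# which (per the post) predictably shifts model behavior downstream
```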

πŸ“2️⃣ this β€œsemantic hub” is scaffolded by tokens in English, which allows representations of inputs from other languages/modalities to be interpreted and controlled in English (e.g. in our main figure). πŸ“š

02.12.2024 18:08 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
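Interpreting non-English representations "in English" is commonly done with a logit-lens-style readout: dot an intermediate hidden state with the output (unembedding) rows and look at the top-scoring tokens. A toy sketch with made-up vectors (not the paper's code; real readouts also apply the model's final layer norm):

```python
def nearest_token(hidden, token_embeddings):
    # Logit-lens-style readout: score each token's unembedding row against
    # the hidden state and return the highest-scoring token
    scores = {tok: sum(h * e for h, e in zip(hidden, emb))
              for tok, emb in token_embeddings.items()}
    return max(scores, key=scores.get)

# Made-up unembedding rows for a few English tokens
vocab = {
    "cat": [1.0, 0.1, 0.0],
    "dog": [0.0, 1.0, 0.1],
    "car": [0.1, 0.0, 1.0],
}

# Pretend intermediate hidden state while processing, say, an image of a cat
hidden_state = [0.9, 0.2, 0.1]
top_token = nearest_token(hidden_state, vocab)  # -> "cat"
```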

For English-centric models (analogously for others)πŸ“1️⃣ semantically-equiv. inputs from distinct data types (e.g. English-Chinese parallel sentences; or an image & its caption) have similar repr. in intermediate transformer layers πŸ–‡, functioning as this transmodal β€œsemantic hub”

02.12.2024 18:08 πŸ‘ 1 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0
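The similarity claim above is typically measured with cosine similarity between per-layer hidden states of paired inputs. The vectors below are made-up toy numbers; in practice you would extract activations from the model for, e.g., an English sentence and its Chinese translation.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy per-layer "hidden states" for a semantically equivalent input pair
english_layers = [[1.0, 0.0, 0.2], [0.8, 0.5, 0.3], [0.7, 0.6, 0.4]]
chinese_layers = [[0.0, 1.0, 0.1], [0.7, 0.6, 0.2], [0.7, 0.6, 0.4]]

per_layer_sim = [cosine_similarity(e, c)
                 for e, c in zip(english_layers, chinese_layers)]
# Under the semantic-hub hypothesis, similarity rises in intermediate layers
```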

Neuroscience studies posit that the human brain follows a β€œhub-and-spoke” model where a transmodal semantic β€œhub” integrates info. from modality-specific β€œspokes” regions πŸ•Έ We hypothesize that LMs have a similar β€œsemantic hub” that abstractly processes info. (fig from Ralph+17)

02.12.2024 18:08 πŸ‘ 2 πŸ” 0 πŸ’¬ 1 πŸ“Œ 0

πŸ’‘We find that models β€œthink” πŸ’­ in English (or in general, their dominant language) when processing distinct non-English or even non-language data types 🀯 like texts in other languages, arithmetic expressions, code, visual inputs, & audio inputs‼️ πŸ§΅β¬‡οΈ arxiv.org/abs/2411.04986

02.12.2024 18:08 πŸ‘ 11 πŸ” 1 πŸ’¬ 1 πŸ“Œ 2

πŸ™‹πŸ»β€β™‚οΈ thanks!

25.11.2024 21:14 πŸ‘ 0 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0
22.11.2024 17:56 πŸ‘ 2 πŸ” 0 πŸ’¬ 0 πŸ“Œ 0

🚨New dataset + challenge🚨

We release ASL STEM Wiki: the first signing dataset of STEM articles!

πŸ“° 254 Wikipedia articles
πŸ“Ή ~300 hours of ASL interpretations
πŸ‘‹ New task: automatic sign suggestion to make STEM education more accessible

microsoft.com/en-us/resear...
🧡 #EMNLP2024

19.11.2024 00:18 πŸ‘ 37 πŸ” 10 πŸ’¬ 2 πŸ“Œ 0