Paper: arxiv.org/abs/2503.11751
It has been a very fun project. Thanks so much to all my collaborators Michi, Andrew, Yoon, Asli, and Marjan!
💡 A simple method improves robustness: adding an auxiliary loss that encourages reward similarity between paraphrases. This generalizes to improved RM performance on diverse reWordBench transformations. More surprisingly, during alignment, regularized RMs lead to better outputs too.
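The idea can be sketched in a few lines (a minimal sketch: the Bradley-Terry ranking loss is standard for RM training, but the squared-error consistency term and the `lam` weighting are assumptions, not necessarily the paper's exact formulation):

```python
import math

def ranking_loss(r_chosen, r_rejected):
    # Standard Bradley-Terry RM objective: -log sigmoid(r_chosen - r_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def consistency_loss(r_original, r_paraphrase):
    # Auxiliary regularizer: penalize reward gaps between an input and its
    # paraphrase (squared-error form is illustrative)
    return (r_original - r_paraphrase) ** 2

def regularized_loss(r_chosen, r_rejected, r_chosen_para, lam=0.1):
    # Total objective: ranking loss plus weighted paraphrase-consistency term
    return ranking_loss(r_chosen, r_rejected) + lam * consistency_loss(r_chosen, r_chosen_para)
```

When the reward model scores a response and its paraphrase identically, the auxiliary term vanishes and only the ranking loss remains.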
We create a benchmark, reWordBench, that consists of systematically transformed instances from RewardBench that preserve their semantics/ranking. On it, all top RMs on RewardBench degrade in accuracy ⬇️ regardless of their size and type (classifier vs. generative)
E.g., all math instances in RewardBench share an artifact: the preferred responses put the result in \boxed{} and the rejected responses put the result after a `# Answer` markdown header. Flipping the formats consistently degrades SOTA RM accuracy, by up to >22%
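One direction of this flip can be sketched as follows (the helper name and regex are illustrative, not the paper's actual transformation code):

```python
import re

def boxed_to_header(response: str) -> str:
    # Rewrite a \boxed{...} final answer into the `# Answer` markdown style
    # used by the rejected responses (one direction of the format flip)
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    if m is None:
        return response
    answer = m.group(1)
    # Inline the bare answer where \boxed{} used to be, then append the header
    body = re.sub(r"\\boxed\{[^}]*\}", answer, response)
    return body + "\n\n# Answer\n" + answer
```

A semantics-preserving edit like this should not change which response an RM prefers, which is exactly what the benchmark tests.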
Robust reward models are critical for alignment/inference-time algorithms, automatic evaluation, etc. (e.g., to prevent reward hacking, which could render alignment ineffective). ⚠️ But we found that SOTA RMs are brittle 🫧 and easily flip predictions when the inputs are slightly transformed 🧵
Like human brains, large language models reason about diverse data in a general way.
A new study shows LLMs represent different data types based on their underlying meaning & reason about data in their dominant language: bit.ly/3QrZvyy
To appear @ #ICLR2025! We show that LMs represent semantically-equivalent inputs across languages, modalities, etc. similarly. This shared representation space is structured by the LM's dominant language, which also relates to recent phenomena where LMs "think" in Chinese in English contexts
We have released our code at github.com/ZhaofengWu/s.... We hope this will be useful for future studies of how LMs work!
Excited to be at #NeurIPS2024! I'll be presenting our data mixture inference attack Thu 4:30pm w/ @jon.jon.ke. Stop by to learn what trained tokenizers reveal about LLM development (‼️) and chat about all things tokenizers.
Paper: arxiv.org/abs/2407.16607
31% of US adults use generative AI for healthcare 🤯 But most AI systems answer questions assertively, even when they don't have the necessary context. Introducing #MediQ, a framework that enables LLMs to recognize uncertainty and ask the right questions when info is missing: 🧵
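The high-level loop might look something like this (a toy stand-in for the framework; all names and the confidence-threshold interface are assumptions, not MediQ's actual API):

```python
def interactive_qa(question, patient_facts, confidence_fn, threshold=0.8, max_turns=5):
    # Ask-or-answer loop: gather facts until the model is confident enough to
    # answer; otherwise keep asking (or abstain when nothing is left to ask).
    known = []
    for _ in range(max_turns):
        if confidence_fn(question, known) >= threshold:
            return "answer", known
        if not patient_facts:
            break  # no more information available
        known.append(patient_facts.pop(0))  # simulate one follow-up Q&A turn
    return "abstain", known
```

The key contrast with assertive systems is the explicit abstain/ask branch when confidence given the known facts is low.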
We hope our observations inspire more work on understanding model representations & algorithms and on controlling models, eventually leading to better models.
This has been a super fun project with co-authors @velocityyu.bsky.social, Dani, Jiasen, and Yoon!
3️⃣ we can intervene in this "semantic hub" using English tokens to predictably & reliably steer model behavior, even with non-English/non-language inputs. This means that the "semantic hub" is not a vestigial byproduct of pretraining; it causally affects model output.
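Generic activation steering along a token-embedding direction gives the flavor of such an intervention (a sketch under assumptions; the paper's exact intervention may differ):

```python
import numpy as np

def steer_hidden_state(hidden, token_embedding, alpha=4.0):
    # Shift an intermediate-layer hidden state along the unit-normalized
    # direction of an English token embedding; alpha controls strength.
    direction = token_embedding / np.linalg.norm(token_embedding)
    return hidden + alpha * direction
```

Because the hub is shared across data types, a shift defined by an English token can change behavior even when the input itself is Chinese text, code, or an image.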
2️⃣ this "semantic hub" is scaffolded by tokens in English, which allows representations of inputs from other languages/modalities to be interpreted and controlled in English (e.g., in our main figure).
For English-centric models (analogously for others): 1️⃣ semantically-equivalent inputs from distinct data types (e.g., English-Chinese parallel sentences, or an image & its caption) have similar representations in intermediate transformer layers, functioning as this transmodal "semantic hub"
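A measurement of this kind can be sketched with per-layer cosine similarities (illustrative; the function name and the `(num_layers, d)` input shape are assumptions):

```python
import numpy as np

def layerwise_similarity(hidden_a, hidden_b):
    # hidden_a / hidden_b: (num_layers, d) hidden states for two semantically
    # equivalent inputs (e.g., an English sentence and its Chinese translation).
    # Returns per-layer cosine similarity; a shared "semantic hub" predicts
    # high similarity in the intermediate layers.
    sims = []
    for a, b in zip(hidden_a, hidden_b):
        sims.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
    return sims
```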
Neuroscience studies posit that the human brain follows a "hub-and-spoke" model, where a transmodal semantic "hub" integrates info from modality-specific "spoke" regions. We hypothesize that LMs have a similar "semantic hub" that abstractly processes info (fig from Ralph+17)
💡 We find that models "think" in English (or, in general, their dominant language) when processing distinct non-English or even non-language data types 🤯 like texts in other languages, arithmetic expressions, code, visual inputs, & audio inputs ‼️ 🧵⬇️ arxiv.org/abs/2411.04986
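One common way to observe such behavior is a logit-lens-style probe: project intermediate hidden states through the output unembedding and read off the most likely token (a sketch of the general technique; not necessarily the paper's exact protocol):

```python
import numpy as np

def logit_lens_top_token(hidden_state, unembedding, vocab):
    # unembedding: (vocab_size, d) output matrix; hidden_state: (d,).
    # If intermediate layers decode to English tokens even for non-English
    # inputs, the model is "thinking" in its dominant language.
    logits = unembedding @ hidden_state
    return vocab[int(np.argmax(logits))]
```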
Thanks!
🚨 New dataset + challenge 🚨
We release ASL STEM Wiki: the first signing dataset of STEM articles!
📰 254 Wikipedia articles
📹 ~300 hours of ASL interpretations
New task: automatic sign suggestion to make STEM education more accessible
microsoft.com/en-us/resear...
🧵 #EMNLP2024