Paper: arxiv.org/abs/2503.11751
It has been a very fun project. Thanks so much to all my collaborators Michi, Andrew, Yoon, Asli, and Marjan!
💡 A simple method improves robustness: adding an auxiliary loss that encourages reward similarity between paraphrases. This generalizes to improved RM performance on diverse reWordBench transformations. More surprisingly, during alignment, regularized RMs lead to better outputs too.
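The idea can be sketched in a few lines (a minimal sketch: the Bradley-Terry ranking loss is standard for RM training, but the squared-error consistency term and the `lam` weighting are assumptions, not necessarily the paper's exact formulation):

```python
import math

def ranking_loss(r_chosen, r_rejected):
    # Standard Bradley-Terry RM objective: -log sigmoid(r_chosen - r_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

def consistency_loss(r_original, r_paraphrase):
    # Auxiliary regularizer: penalize reward gaps between an input and its
    # paraphrase (squared-error form is illustrative)
    return (r_original - r_paraphrase) ** 2

def regularized_loss(r_chosen, r_rejected, r_chosen_para, lam=0.1):
    # Total objective: ranking loss plus weighted paraphrase-consistency term
    return ranking_loss(r_chosen, r_rejected) + lam * consistency_loss(r_chosen, r_chosen_para)
```

When the reward model scores a response and its paraphrase identically, the auxiliary term vanishes and only the ranking loss remains.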
We create a benchmark, reWordBench, that consists of systematically transformed instances from RewardBench that preserve their semantics/ranking. On it, all top RMs on RewardBench degrade in accuracy ⬇️ regardless of their size and type (classifier vs. generative)
E.g., all math instances in RewardBench share an artifact: the preferred responses put the result in \boxed{} and the rejected responses put the result after a `# Answer` markdown header. Flipping the formats consistently degrades SOTA RM accuracy, by up to >22%
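One direction of this flip can be sketched as follows (the helper name and regex are illustrative, not the paper's actual transformation code):

```python
import re

def boxed_to_header(response: str) -> str:
    # Rewrite a \boxed{...} final answer into the `# Answer` markdown style
    # used by the rejected responses (one direction of the format flip)
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    if m is None:
        return response
    answer = m.group(1)
    # Inline the bare answer where \boxed{} used to be, then append the header
    body = re.sub(r"\\boxed\{[^}]*\}", answer, response)
    return body + "\n\n# Answer\n" + answer
```

A semantics-preserving edit like this should not change which response an RM prefers, which is exactly what the benchmark tests.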
Robust reward models are critical for alignment/inference-time algorithms, automatic evaluation, etc. (e.g., to prevent reward hacking, which could render alignment ineffective). ⚠️ But we found that SOTA RMs are brittle 🫧 and easily flip predictions when the inputs are slightly transformed 🧵
Like human brains, large language models reason about diverse data in a general way.
A new study shows LLMs represent different data types based on their underlying meaning & reason about data in their dominant language: bit.ly/3QrZvyy
To appear @ #ICLR2025! We show that LMs represent semantically-equivalent inputs across languages, modalities, etc. similarly. This shared representation space is structured by the LM's dominant language, which also relates to recent phenomena where LMs "think" in Chinese in English contexts
We have released our code at github.com/ZhaofengWu/s.... We hope this will be useful for future studies of how LMs work!
Excited to be at #NeurIPS2024! I'll be presenting our data mixture inference attack Thu 4:30pm w/ @jon.jon.ke. Stop by to learn what trained tokenizers reveal about LLM development (‼️) and chat about all things tokenizers.
Paper: arxiv.org/abs/2407.16607
31% of US adults use generative AI for healthcare 🤯 But most AI systems answer questions assertively, even when they don't have the necessary context. Introducing #MediQ, a framework that enables LLMs to recognize uncertainty and ask the right questions when info is missing: 🧵
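The high-level loop might look something like this (a toy stand-in for the framework; all names and the confidence-threshold interface are assumptions, not MediQ's actual API):

```python
def interactive_qa(question, patient_facts, confidence_fn, threshold=0.8, max_turns=5):
    # Ask-or-answer loop: gather facts until the model is confident enough to
    # answer; otherwise keep asking (or abstain when nothing is left to ask).
    known = []
    for _ in range(max_turns):
        if confidence_fn(question, known) >= threshold:
            return "answer", known
        if not patient_facts:
            break  # no more information available
        known.append(patient_facts.pop(0))  # simulate one follow-up Q&A turn
    return "abstain", known
```

The key contrast with assertive systems is the explicit abstain/ask branch when confidence given the known facts is low.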
We hope our observations inspire more work on understanding model representations & algorithms and on controlling models, eventually leading to better models.
This has been a super fun project with co-authors @velocityyu.bsky.social, Dani, Jiasen, and Yoon!
3️⃣ we can intervene in this "semantic hub" using English tokens to predictably & reliably steer model behavior, even with non-English/non-language inputs. This means that the "semantic hub" is not a vestigial byproduct of pretraining; it causally affects model output.
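Generic activation steering along a token-embedding direction gives the flavor of such an intervention (a sketch under assumptions; the paper's exact intervention may differ):

```python
import numpy as np

def steer_hidden_state(hidden, token_embedding, alpha=4.0):
    # Shift an intermediate-layer hidden state along the unit-normalized
    # direction of an English token embedding; alpha controls strength.
    direction = token_embedding / np.linalg.norm(token_embedding)
    return hidden + alpha * direction
```

Because the hub is shared across data types, a shift defined by an English token can change behavior even when the input itself is Chinese text, code, or an image.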
2️⃣ this "semantic hub" is scaffolded by tokens in English, which allows representations of inputs from other languages/modalities to be interpreted and controlled in English (e.g., in our main figure).
For English-centric models (analogously for others): 1️⃣ semantically-equivalent inputs from distinct data types (e.g., English-Chinese parallel sentences, or an image & its caption) have similar representations in intermediate transformer layers, functioning as this transmodal "semantic hub"
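A measurement of this kind can be sketched with per-layer cosine similarities (illustrative; the function name and the `(num_layers, d)` input shape are assumptions):

```python
import numpy as np

def layerwise_similarity(hidden_a, hidden_b):
    # hidden_a / hidden_b: (num_layers, d) hidden states for two semantically
    # equivalent inputs (e.g., an English sentence and its Chinese translation).
    # Returns per-layer cosine similarity; a shared "semantic hub" predicts
    # high similarity in the intermediate layers.
    sims = []
    for a, b in zip(hidden_a, hidden_b):
        sims.append(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
    return sims
```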
Neuroscience studies posit that the human brain follows a "hub-and-spoke" model, where a transmodal semantic "hub" integrates info from modality-specific "spoke" regions. We hypothesize that LMs have a similar "semantic hub" that abstractly processes info (fig from Ralph+17)
💡 We find that models "think" in English (or, in general, their dominant language) when processing distinct non-English or even non-language data types 🤯 like texts in other languages, arithmetic expressions, code, visual inputs, & audio inputs ‼️ 🧵⬇️ arxiv.org/abs/2411.04986
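One common way to observe such behavior is a logit-lens-style probe: project intermediate hidden states through the output unembedding and read off the most likely token (a sketch of the general technique; not necessarily the paper's exact protocol):

```python
import numpy as np

def logit_lens_top_token(hidden_state, unembedding, vocab):
    # unembedding: (vocab_size, d) output matrix; hidden_state: (d,).
    # If intermediate layers decode to English tokens even for non-English
    # inputs, the model is "thinking" in its dominant language.
    logits = unembedding @ hidden_state
    return vocab[int(np.argmax(logits))]
```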
Thanks!
🚨 New dataset + challenge 🚨
We release ASL STEM Wiki: the first signing dataset of STEM articles!
📰 254 Wikipedia articles
📹 ~300 hours of ASL interpretations
New task: automatic sign suggestion to make STEM education more accessible
microsoft.com/en-us/resear...
🧵 #EMNLP2024