Did it.. work?!
Loved giving my second tutorial on steering vectors at #CardiffNLPWorkshop. Lots of enthusiastic participants! @cardiffnlp.bsky.social
#CardiffNLPWorkshop off to a flying start with talks from Jennifer Foster and Marianna Apidianaki @cardiffnlp.bsky.social
Pleased to say this has been accepted to ACL System Demos :)
Come to my hackathon! Last one was super fun I promise
Shoutout to supervisors Liam Turner and Luis Espinosa-Anke and @cardiffnlp.bsky.social. I'm also interested in future collaborations on the topic so please message if you are interested :)
I highly encourage people to play around: you can get started in just a few lines. Here's a Colab notebook:
tinyurl.com/yysmb45c
Note that the results from this Colab won't be the best because it's using a smaller model to reduce loading times. I would recommend using at least a 7B model.
As part of our validation, we see if we can reduce stereotypicality in outputs from Mistral 7B, using GPT-4o as a judge. There is a notable reduction compared to baselines and prompting, which is cool.
For those that are new to the topic, steering vectors are constructed using a set of paired sentences, where one elicits a 'positive' activation of neurons and the other elicits a 'negative' activation of neurons - by taking the difference, we isolate activations responsible for a certain 'concept'.
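To make the "taking the difference" step concrete, here's a minimal NumPy sketch. It assumes you already have one activation vector per sentence from a single layer, and it uses a plain mean of the paired differences; the function names and toy numbers are made up for illustration, and this is not necessarily how Dialz computes its vectors.

```python
import numpy as np


def steering_vector(pos_acts, neg_acts):
    """Average the per-pair difference between 'positive' and 'negative' activations.

    pos_acts, neg_acts: arrays of shape (n_pairs, hidden_size), one row per sentence.
    Assumption: a simple mean difference; real toolkits may use other methods.
    """
    pos = np.asarray(pos_acts, dtype=float)
    neg = np.asarray(neg_acts, dtype=float)
    if pos.shape != neg.shape:
        raise ValueError("positive and negative activations must be paired")
    return (pos - neg).mean(axis=0)


def apply_steering(hidden, vector, strength=1.0):
    """Shift a hidden state along the steering vector (negative strength steers away)."""
    return np.asarray(hidden, dtype=float) + strength * np.asarray(vector, dtype=float)


# Toy data: 3 sentence pairs, hidden size 4. The 'concept' lives in dimension 0.
rng = np.random.default_rng(0)
base = rng.normal(size=(3, 4))            # shared, concept-independent content
concept = np.array([1.0, 0.0, 0.0, 0.0])  # hypothetical concept direction
pos_acts = base + concept
neg_acts = base - concept

vec = steering_vector(pos_acts, neg_acts)
steered = apply_steering(np.zeros(4), vec, strength=-0.5)
print(vec)
print(steered)
```

The shared content cancels out in the subtraction, so what's left points along the concept direction. In practice the activations would come from a real model's hidden states rather than random numbers.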
🚨 NEW PAPER ALERT 🚨
Dialz: A Python Toolkit for Steering Vectors
ArXiv: arxiv.org/abs/2505.06262
Docs: cardiffnlp.github.io/dialz/
Repo: github.com/cardiffnlp/d...
A Python package to help you create, apply and visualise steering vectors for anything you want - from sycophancy to bias.
New friends! Old friends! Please register if you’d like 2 whole days packed with NLP fun
Super interesting!
Love this take: "Society appears far more willing to critically examine and address bias in AI systems than confront human bias directly"
I’d hire you
I am still in need of emergency reviewers for ARR this cycle for the computational social science track, please DM me if you have capacity
Do it! When interviewers ask me about them it’s usually a good sign that it’s a nice workplace.
The work presents the first systematic investigation of steering vectors for bias mitigation, and we demonstrate that SVE is a powerful and computationally efficient strategy for reducing bias in LLMs, with broader implications for enhancing AI safety.
Building on these promising results, we introduce Steering Vector Ensembles (SVE), a method that averages multiple individually optimized steering vectors, each targeting a specific bias axis such as age, race, or gender.
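As a rough illustration of the averaging idea, here's a minimal NumPy sketch. It assumes each bias axis already has its own steering vector of the same size and takes an unweighted mean; the function name and toy vectors are hypothetical, and the paper's actual implementation may normalise or weight the vectors differently.

```python
import numpy as np


def ensemble_steering_vector(vectors_by_axis):
    """Unweighted mean of per-axis steering vectors.

    vectors_by_axis: dict mapping a bias axis name to a 1-D vector.
    Assumption: all vectors share the same hidden size and come from the same layer.
    """
    if not vectors_by_axis:
        raise ValueError("need at least one steering vector")
    stacked = np.stack([np.asarray(v, dtype=float) for v in vectors_by_axis.values()])
    return stacked.mean(axis=0)


# Toy per-axis vectors (hidden size 3), made up for illustration only.
per_axis = {
    "age": [1.0, 0.0, 0.0],
    "race": [0.0, 1.0, 0.0],
    "gender": [0.0, 0.0, 1.0],
}
ensemble = ensemble_steering_vector(per_axis)
print(ensemble)
```

The ensemble has the same shape as each individual vector, so it can be added to the model's activations in the same way as a single steering vector.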
When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.2%, 4.7%, and 3.2% over the baseline for Mistral, Llama, and Qwen, respectively.
We present a novel approach to bias mitigation in large language models (LLMs) by applying steering vectors to modify model activations in forward passes. We employ Bayesian optimization to systematically identify effective contrastive pair datasets across nine bias axes.
NEW PAPER
Shifting Perspectives: Steering Vector Ensembles for Robust Bias Mitigation in LLMs
ArXiv: arxiv.org/abs/2503.05371
GitHub: github.com/groovychoons...
Extremely Unofficial Blog Post: zarasiddique.com/blog/shiftin...
Strongly encourage you to register for our free NLP workshop. We've previously had speakers from DeepMind, Microsoft, Amazon and top university NLP labs etc., and it’s looking like it’s going to be a great line-up this year too.
If you can’t make it, please share with others who may be interested!
We've created a Cardiff NLP Starter Pack to make it easy to follow #NLP researchers at Cardiff Uni.
Super interesting work!
OpenAI furious DeepSeek might have stolen all the data OpenAI stole from us
www.404media.co/openai-furio...
Severance episode 2, Traitors final AND this
It's a weekend of watching for me 🍿
Need to be spending less time on deepseek and more time on deep sleep 😴
Any thoughts on whether this would extend well to more traditional CompSci courses?
Welcome Wikipedian!
And totally agree.
+1 I would also like to see this.