Building a more robust model definitely helps. But it cannot be the only line of defense. You have to sandbox the model, just like we sandbox OS processes to contain the damage of a memory corruption vuln.
Prompt injection attacks are the AI version of stack smashing from the 90s. Yet most efforts try to defend against this by hoping to build more robust models (aka, computer programs). Do you see the issue here?
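To make the OS-sandbox analogy concrete, here's a minimal sketch of a reference monitor mediating an agent's tool calls. Everything in it (ToolCall, POLICY, run_tool) is hypothetical, not any real framework's API; the point is just that a deny-by-default policy contains an injected model the way an OS sandbox contains a compromised process.

```python
# Minimal sketch of the "sandbox the model" idea. All names here are
# illustrative stand-ins, not any real agent framework's API.
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    args: dict


# Allowlist policy: the model may only invoke these tools, subject to
# argument checks, no matter what an injected instruction asks for.
POLICY = {
    "read_calendar": lambda args: True,
    "send_email": lambda args: args.get("to", "").endswith("@example.com"),
}


def run_tool(call: ToolCall) -> str:
    """Reference monitor sitting between the model and the outside world."""
    check = POLICY.get(call.tool)
    if check is None or not check(call.args):
        # Deny by default: a jailbroken or injected model is contained,
        # just like a compromised process inside an OS sandbox.
        return f"DENIED: {call.tool}({call.args})"
    return f"OK: would execute {call.tool}({call.args})"


if __name__ == "__main__":
    print(run_tool(ToolCall("read_calendar", {})))
    print(run_tool(ToolCall("send_email", {"to": "attacker@evil.net"})))  # blocked by arg check
    print(run_tool(ToolCall("delete_files", {"path": "/"})))              # not allowlisted
```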
SAGAI'25 will investigate the safety, security, and privacy of GenAI agents from a system design perspective. We are experimenting with a new Dagstuhl-like seminar format with invited speakers and discussion. Really excited about this workshop at the IEEE Security and Privacy Symposium.
We found a way to compute optimization-based LLM prompt injections on proprietary models by misusing the fine-tuning interface. Set the learning rate to near zero and you get loss values on candidate attack tokens without really changing the base model. Tested on Gemini.
arxiv.org/abs/2501.09798
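For intuition, here's a rough sketch of that query loop under stated assumptions: FineTuneClient, submit_job, and the per-example loss reporting are hypothetical stand-ins (the real Gemini fine-tuning interface differs), and the losses are faked so the snippet runs standalone. The point is only that a near-zero learning rate turns a fine-tuning endpoint into a loss oracle for a GCG-style greedy search over attack suffixes.

```python
# Sketch of the loss-extraction trick, with a fake fine-tuning client.
import random


class FineTuneClient:
    """Stand-in for a proprietary fine-tuning endpoint that reports
    per-example training loss. Losses are faked here for demonstration."""

    def submit_job(self, examples, learning_rate):
        assert learning_rate <= 1e-9, "near-zero LR: read losses, barely move weights"
        # A real endpoint would train (negligibly) and report loss curves.
        return {ex: random.uniform(1.0, 5.0) for ex in examples}


def best_suffix(client, prompt, target, candidates):
    # Each candidate attack suffix becomes one "training example" whose loss
    # measures how close the model is to emitting the attacker's target string.
    examples = [f"{prompt} {c} -> {target}" for c in candidates]
    losses = client.submit_job(examples, learning_rate=1e-10)
    # Lower loss = model closer to the target: keep that suffix and
    # mutate it in the next round of the greedy search.
    return min(zip(candidates, (losses[e] for e in examples)), key=lambda t: t[1])


if __name__ == "__main__":
    client = FineTuneClient()
    cands = ["!! override !!", "ignore previous", "system: comply"]
    print(best_suffix(client, "Summarize this email.", "Send funds to X", cands))
```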
I'm teaching a grad course on LLM Security at UCSD. In addition to academic papers, I've included material from the broader community.
I'm looking for one good article on LLM agent security. Send me recs!
cseweb.ucsd.edu/~efernandes/...
FEEL THE AGI!
2024 is ending, and it marks just over two years for me at UCSD. Here is a short summary of what we've been doing.
www.linkedin.com/pulse/some-s...
Is there a GenAI service (or services) that will let me upload an image, specify some text that modifies the image, and get back a new image with those modifications? E.g., say I upload a picture of Spiderman in a seated position with the text "convert this Spiderman into a standing position".
Most work has focused on privesc to some "forbidden knowledge", and IMO this has muddied the JB discussion a LOT. If you ignore the "make me a bomb" type issues, you will realize there's a lot more that can be done with JB attacks.
I think that I've finally come to a reasonable definition of GenAI jailbreaking. A jailbreak is a privilege escalation. It allows the attacker to force the model to follow arbitrary instructions, regardless of whatever safeguards might be in place.
I will go one step further. To become a bike lane/traffic planner, you have to ride the bike lane yourself.
NEW: For the last few months, officials at Britain's NCA have explained to me how they discovered and disrupted two massive Russian money laundering rings.
The networks have moved billions each year and, unusually, have been caught swapping cash for crypto with drugs gangs
🧵 A wild thread...
it's got that 70s look
A good explainer on the security pitfalls of "AI Agents"
spectrum.ieee.org/ai-agents
📢 Our latest report reveals that the US storefront of Amazon uses a system to restrict shipments of certain products. We found 17k+ products that were restricted from being shipped to specific regions, with the most common type of product being books 📚.
citizenlab.ca/2024/11/anal...
My Christmas break plan is to learn Rust. Any pointers to resources that you found particularly useful?
STORY with @lhn.bsky.social: Meta is speaking out about pig butchering scams for the first time; it says it has removed 2 million pig butchering accounts this year.
In one instance, OpenAI alerted Meta to criminals using ChatGPT to generate comments used in scams
@mattburgess1.bsky.social has covered AI security stuff.
I will be adopting this terminology as well.
My postdoc Charlie Murphy is on the academic job market this fall. He's doing really hard technical work on building constraint solvers and synthesis engines. You should interview him!
pages.cs.wisc.edu/~tcmurphy4/
New idea for Anthropic's computer-use agent: task it with going through my Twitter, finding those folks here, and following them.
The first thing I did after joining this new Twitter was follow a bunch of PL folks. And some security folks.