New paper: Finetuning on narrow domains leaves traces behind. By looking at the difference in activations before and after finetuning, we can interpret what it was finetuned for. And so can our interpretability agent! 🧵
I think it's pretty hard to disentangle them really, I was initially skeptical of the (very convenient) argument from the labs about them not being orthogonal, but I'm increasingly buying it.
Right, predictions were 30+ years well into 2020.
www.metaculus.com/questions/51...
Banger
Congratulations!
It would have been an all-too-convenient refrain for the "I don't believe in that sci-fi nonsense" AI Safety scepticism line
She shall know your ways as if born to them
Truly an excellent milestone.
Although concessions do follow in (incorrect) episode preferences. Sleepytime falling out of favour was crushing.
Yeah I try to follow a similar approach
The fora draw a clear distinction between upvotes / agreement votes so I think the culture of upvoting contributions stems from that maybe?
Bizarre that was included in the screenshot, doesn't seem like it belongs to the same class as the others at all.
Crushing it
#Tradle #992 1/6
🟩🟩🟩🟩🟩
oec.world/en/games/tra...
Now grappling with whether I'd be in that group or not.
Not everyone, then we'd have to read them. If only those inclined to make one did then the mere existence of the doc would probably clarify 90% of scenarios.
"Oh they've got one of those docs, we are probably cool"
The Settlers of Catan Problem
I've been really noticing autumn this year
Prediction markets not giving great odds of this going through:
manifold.markets/ZviMowshowit...
Hoping this generalizes into alignment research
The HS2 bat tunnel is even worse value for money once you factor in the updated direct cash transfer effectiveness estimates.
It's an edit of EA forum debate week interface:
forum.effectivealtruism.org/topics/anima...
Good idea, for more details that might help: forum.effectivealtruism.org/topics/draft...
Oh yes, an entirely intentional error on my part in the spirit of #DAW 😬
deleting dating apps because i want to meet someone the old fashioned way (we caught a wild pig together while not sharing a common language, then met 12 years later under the tree we planted)
This is a Draft Amnesty Week #DAW draft. It may not be polished, up to my usual standards, fully thought through, or fully fact-checked.
FYI it's Draft Amnesty Week on TPOB, where users can publish scrappy, draft-y, or incomplete posts with impunity. #DAW
Circles, but it's just a different app for each emigrating TPOT
On new beginnings: This week I handed in my notice, ending 10 years in Product Management, capital markets to start as an Alignment Research Manager in January!
Good ~Morning Agus