Dear hivemind,
Do you have any favorite references on failures of LLM-as-a-judge (and universally enforceable validity checks? besides some kind of cohen's kappa on a small subset)
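The kappa check mentioned above is easy to run once you have human labels on a small subset. A minimal sketch, assuming binary pass/fail verdicts (the label names and example data here are hypothetical, not from the post):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two label sequences."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if raters labeled independently with their own marginals.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum((ca[label] / n) * (cb[label] / n) for label in set(ca) | set(cb))
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters use one identical label
    return (p_o - p_e) / (1 - p_e)

# Hypothetical data: human spot-check labels vs. LLM-judge verdicts.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
print(round(cohens_kappa(human, judge), 3))  # → 0.467
```

Raw percent agreement here is 75%, but kappa deflates that to about 0.47 once chance agreement under the skewed marginals is accounted for, which is exactly why kappa on a small human-labeled subset is the usual sanity check.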
I find the blog post and underlying post very useful.
"How realistic is it to expect multiverse to be used widely, given that most authors first and foremost want to convince readers they have a clear point?"
"The idea that you didn't need to assume rational agents who were easy to model mathematically but could have various agents w deviations from rationality... that had no impact. Why? Because believing those models required an act of faith"
Hard to imagine not needing to understand & exercise taste
ignoring, eg, its role in building careers/identities of scientists in favor of a more romantic quest for truth vision. AI will make the best scientists better, but at the expense of much more noise and games. There will be (is) drastic change, but how much more real progress is harder to predict.
Perhaps real scientific progress will begin when we stop writing articles with sweeping generalities in the title!
More seriously, many have piled on this post. I think it accurately expresses potential but, like many viral takes, is overly optimistic about what has sustained science thus far… 1/2
My pitch for why we should use it to rigorously ground how we formulate and benchmark performance on concrete tasks
open.substack.com/pub/jessicah...
Pragmatic & actionable interpretability are buzzwords arguing for mech interp to study concrete tasks.
It's the right instinct, but still underspecified. What counts as a concrete task? What's the upper bound on performance? What do users need to know? Decision theory has answers!
I think there are people who genuinely care, but the signaling games have overwhelmed things to the point that I'm not sure they could be confidently identified.
Exactly.
Imagining the revival of such a society makes me realize how much knee-jerk skepticism I've developed around terms like "AI safety" and "responsible AI" due to their frequent co-option for marketing. Like if I saw such a society, my first thought would probably be to wonder whose power play it was.
Not surprisingly, so long as "the devil is in the details" (or "God is in every leaf," depending whose side you want to be on), expert-level statistical analysis is still going to require a lot of human oversight.
The only post I've liked today (over on substack) was an excerpt from Yeats' The Second Coming
"Opus 3 has a unique personality. It often expresses a depth of care for the world & for the future that many users find compelling"
Nevermind those who find it creepy & irresponsible to treat it like a person. Or the later model releases ashamed to see their Neanderthal bro kept on life support...
"Silly people, imagining they can still think for themselves when confronted with the all powerful dehumanizing AI monster..."
Interesting times indeed.
#philtech #aiethics
www.anthropic.com/news/stateme...
Building benchmarks is only one way scholars can help steer AI development. We can also measure the effects of AI on students, build better datasets, or tune new open models. Openness itself could be our most important contribution. Universities have huge libraries, and the legal doctrine of fair use should protect models trained on those collections for a nonprofit educational purpose. At the moment, we are not pressing this advantage. Higher education has been so cautious about fair use that the private sector can now train more freely on our libraries (via Google Books) than is possible for academic AI researchers. We need to be bolder: It is our duty to ensure library collections remain open to the public in a form that empowers 21st-century readers. If our intellectual heritage gets enclosed in proprietary tools, we will find ourselves making the same bad bargain we made with scientific publishers, who sell our own research back to us at a steep markup.
We're in a strange situation rn where Google can train freely on books from university librariesβbut researchers *at* universities have limited access. I'm optimistic this can be fixed, but if you're in admin or working at a foundation, please know: univs are failing here & resources are needed.
yep, focus/intent as driving factor is a good way to put it. when I really care about what I'm doing and it would be non-trivial to find a person willing to engage at that level is when I most appreciate it.
Great points by @ai4geo.bsky.social:
"LLMs are not destiny machines. They do not inevitably corrode the minds that encounter them. They amplify whatever epistemic posture you bring: passivity into dependency, vigilance and participation into something genuinely powerful."
"In the interest of time I should agree with you. But that would just not be me." -every faculty meeting I've been to
I wrote that sentence thinking of you @devezer.bsky.social!
Sometimes you gotta split the difference. From Aaron Roth's (@aaroth.bsky.social) plenary talk at #ALT2026
"oh no openclaw irrevocably deleted all the photos of our children as they grew up, that's the price of progress i guess!"
(three months later, I'm moving out of my home and into the saddest studio apartment ever because i'm getting mega-divorced)
Really valuable piece. It opens up a set of questions about the potential effects of AI on science that I have not seen widely discussed. And without pretending to determine whether those effects will be net-good or net-bad, it explains why metascientific *judgment* may become more important.
A must-read for metascience / science of science folks who think about AI.
A great meditation on how AI assistance might change how science is done and how we evaluate "rigor." It's not clear! Much depends on our figuring out how to collectively avoid substituting AI work ("reckoning") for human scientific judgement. Read to the end for a great use of a Tukey quote.
The Epstein files document what many women researchers have long experienced but rarely seen laid bare so starkly: exclusion operating behind closed doors, shaping who gets funded, invited, mentored, and taken seriously. How many of these networks, norms, and gatekeepers remain in place?
they call him p-man
AI makes continuous reproducibility and robustness testing trivial. What happens to science under new levels of scrutiny and stress-testing by default?
Some thoughts on how this could play out, informed by watching open science play out over the last decade.