Super cool work! I hope this insight can help policymakers think more clearly about children's safety on tech / AI platforms.
'Corporate Childrearing' is forthcoming in Duke Law Journal (26-27). Corporations play a huge role in children's identity formation. The piece reshapes the family law triangle into a square to make their influence and their intrusion on family relationships explicit.
papers.ssrn.com/sol3/papers....
JHU computer scientists including @williamjurayj.bsky.social propose a method that lets #AI models spend more time thinking through problems & uses a confidence score to determine when the model should say "I don't know" rather than risk a wrong answer, which is crucial in high-stakes domains.
A 3D graph with compute budget on the X axis, accuracy on the Y axis, and confidence threshold on the Z axis. The chart shows that accuracy increases with higher compute and higher confidence thresholds, though the trade-off is fewer questions answered overall.
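For intuition, here is a minimal sketch of the kind of confidence-gated answering the post describes, assuming a model call that returns an answer plus a confidence score in [0, 1]. The function names and the 0.8 threshold are illustrative stand-ins, not from the paper.

```python
def answer_or_abstain(question, get_answer_and_confidence, threshold=0.8):
    """Return the model's answer only if its confidence clears `threshold`;
    otherwise abstain rather than risk a wrong answer.

    `get_answer_and_confidence` is a hypothetical callable standing in for
    any model interface that returns (answer, confidence in [0, 1]).
    """
    answer, confidence = get_answer_and_confidence(question)
    if confidence >= threshold:
        return answer
    return "I don't know"
```

Raising the threshold trades coverage for accuracy: the model answers fewer questions but is right more often on the ones it does answer, which is the trade-off the chart above shows.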
You can't just be right; you have to know you're right. Good advice for LLMs, according to new Johns Hopkins research. Sometimes no answer is better than a wrong one - life-or-death choices in medicine, for example, or big financial decisions. 🧵
and here I was thinking you were out at the Opera 🤯
It's been a joy working with @jeff-cheng.bsky.social & Ben Van Durme on this project. And huge thanks to @alexmartin314.bsky.social, @miriamsw.bsky.social, @marcmarone.com, @orionweller.bsky.social, and everyone else who gave very helpful feedback over the past weeks.
To our knowledge this is the first work to raise this point in the new area of LLM test-time scaling, but the broader community has long been aware of the underlying issue - e.g., the Watson effort on Jeopardy, and Jordan Boyd-Graber's push to reward systems that hold back dubious answers.
We propose "Jeopardy odds" as a standard evaluation format: win a point when you're right, lose a point when you're wrong. Here we see compute-scaling distinctions that were hidden when evaluating under a zero-risk setting. Selection functions matter!
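As a rough sketch of how "Jeopardy odds" scoring might be computed. One assumption on my part: abstentions score zero, which the post implies but doesn't state. The data below is made up for illustration.

```python
def jeopardy_odds_score(predictions):
    """Score a list of (answered, correct) booleans: +1 per correct answer,
    -1 per wrong answer, 0 for abstentions."""
    score = 0
    for answered, correct in predictions:
        if not answered:
            continue  # abstaining neither gains nor loses a point
        score += 1 if correct else -1
    return score

# Example: 3 right, 1 wrong, 1 abstention -> score of 2
print(jeopardy_odds_score([(True, True), (True, True), (True, True),
                           (True, False), (False, False)]))
```

Under a zero-risk setting a wrong answer costs nothing, so abstaining is never rewarded; under Jeopardy odds, holding back a dubious answer actually pays.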
We test DeepSeek-R1 and find that scaling test-time compute can substantially increase a model's confidence in correct answers, widening the gap between correct and incorrect answers.
🚨 You are only evaluating a slice of your test-time scaling model's performance! 🚨
We consider how models' confidence in their answers changes as test-time compute increases. Reasoning longer helps models answer more confidently!
📄: arxiv.org/abs/2502.13962
You might look into behavior cloning agents, which is a pretty robust space (e.g. arxiv.org/abs/2209.05451)
I could be misunderstanding what you're looking for though, since this feels very different from the CogAI/SOAR items you point to.
I'd say a key factor is whether a person has put in a good-faith effort to be right for the right reasons. But I'm open to other explanations!
In many ways, the Vision Pro hits on both categories.
At this point, I would probably buy a cellular phone that they made
I think the 17th-century English were more likely to be enjoying tea than coffee
Did you recently visit an Apple store?
I saw this happen live, it was tragic
I noticed a lot of starter packs skewed towards faculty/industry, so I made one of just NLP & ML students: go.bsky.app/vju2ux
Students do different research, go on the job market, and recruit other students. Ping me and I'll add you!