Large language models (LLMs) are increasingly used as automatic evaluators of generative AI outputs, a paradigm often referred to as "LLM-as-a-judge." In practice, LLM judges are imperfect predictors of the underlying truth and can exhibit systematic, non-random errors. Two main approaches have recently been proposed to address this issue: (i) direct measurement-error correction based on misclassification models, such as Rogan-Gladen-style estimators, and (ii) surrogate-outcome approaches such as prediction-powered inference (PPI), which correct bias by calibrating prediction residuals on a small set of gold-standard human labels. In this paper, we systematically study the performance of these two approaches for estimating mean parameters (e.g., average benchmark scores or pairwise win rates). Leveraging tools from semiparametric efficiency theory, we unify the two classes of estimators by deriving explicit forms of efficient influence function (EIF)-based efficient estimators, and we characterize conditions under which PPI-style estimators attain strictly smaller asymptotic variance than measurement-error corrections. We verify our theoretical results in simulations and demonstrate the methods on real-data examples. We provide an implementation of the benchmarked methods and comparison utilities at https://github.com/yiqunchen/debias-llm-as-a-judge.
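As a rough illustration of the two estimator families named in the abstract, here is a minimal sketch for a binary judge and a mean parameter. It uses the textbook forms of the Rogan-Gladen correction and the basic PPI point estimate, not the EIF-based estimators derived in the paper and not the code in the linked repository. The function names `rogan_gladen` and `ppi_mean` and the toy inputs are assumptions made for this example.

```python
from statistics import fmean


def rogan_gladen(judge_pos_rate, sensitivity, specificity):
    """Textbook Rogan-Gladen correction of an observed positive rate.

    Assumes the judge's sensitivity and specificity are known (in practice
    they would be estimated from a human-labeled subset).
    """
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge must beat chance: sensitivity + specificity > 1")
    estimate = (judge_pos_rate + specificity - 1.0) / denom
    # Clip to [0, 1], since the raw correction can leave the unit interval.
    return min(1.0, max(0.0, estimate))


def ppi_mean(judge_unlabeled, judge_labeled, human_labeled):
    """Basic PPI point estimate of a mean.

    Judge mean on the large unlabeled set, plus the average residual
    (human label minus judge label) on the small gold-standard set.
    """
    if len(judge_labeled) != len(human_labeled):
        raise ValueError("labeled judge and human scores must be paired")
    rectifier = fmean(h - j for h, j in zip(human_labeled, judge_labeled))
    return fmean(judge_unlabeled) + rectifier


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to exercise the two functions.
    print(rogan_gladen(0.62, sensitivity=0.9, specificity=0.8))
    print(ppi_mean([1, 1, 0, 1], [1, 0, 1, 1], [1, 0, 0, 1]))
```

Both functions return point estimates only. Comparing the two approaches as the abstract describes would also require their asymptotic variances, which this sketch does not compute.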
arXiv
Efficient Inference for Noisy LLM-as-a-Judge Evaluation
By Chen, Lu, Li et al
12.01.2026 16:56
Laptop screen showing mobile app appointment review screens with ratings, service details, date/time, location, and a completion confirmation.
Completed health appointment flow. ❤️‍🩹
#designsky #buildinpublic #indiehackers #webdesign #design #webdesigner #website #websitedesign #promote #spotlight
13.01.2026 07:01
Need some love for @labourlewis.bsky.social on the list
12.01.2026 17:58
We've done a Quiet Riot starter pack for all those (finally!) heading over from the other place.
bsky.app/starter-pack...
12.01.2026 12:00
This research examines how the emotional tone of human-AI interactions shapes ChatGPT and human behavior. In a between-subject experiment, we asked participants to express a specific emotion while working with ChatGPT (GPT-4.0) on two tasks: writing a public response and addressing an ethical dilemma. We found that, compared to interactions where participants maintained a neutral tone, ChatGPT showed greater improvement in its answers when participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also led to an improvement relative to the neutral condition, albeit a smaller one, whereas blaming ChatGPT did not improve its answers. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest. Additionally, we found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised ChatGPT for its responses. Together, our findings demonstrate that the emotional tone people apply in human-AI interactions not only shapes ChatGPT's outputs but also carries over into subsequent human-human communication.
arXiv
How Human is AI? Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness
By Zurich), Zurich), Zurich)
10.01.2026 04:04
Elon Musk I
@elonmusk • 2h
They want any excuse for censorship
X
Basil the Great • @BasilTheGreat • 4h
The UK Labour Government is threatening to block X but won't say a word about ChatGPT and Gemini
Why?
We know why.
X stands for freedom of speech.
They don't care about AI images, they care about people learning the truth.
If I'd built a noncing machine for free public use, I too would pretend that I don't understand why people are so upset
09.01.2026 23:36
🧵 4/5
Despite its name, the EAA has global reach and applies to any company, regardless of location (including the UK, US, etc.), that sells covered products or services to consumers within the European Union. It covers private sector businesses from manufacturers to service providers.
28.06.2025 19:10
🧵 3/5
The EAA covers a wide range of products & services, including:
Computers & OS
Smartphones
TV Equipment
Telephony Services
ATMs & Kiosks
Banking Services
E-books
E-commerce
It also includes transport services (air, bus, rail) and audio-visual media services.
28.06.2025 19:10
🧵 2/5
The primary goal is to improve the lives of persons with disabilities and older people by removing accessibility barriers in key digital products and services, ensuring they can participate fully in society.
28.06.2025 19:10
Colourful patterns
🧵 1/5
The full compliance deadline for the European Accessibility Act 2025 is today (28th June 2025).
#a11y #accessibility #eaa
28.06.2025 19:10
I've been exploring how we trust AI, or rather why we don't trust AI in the same way we trust people. It is also an exercise in AI co-creation.
#ai #llm #articifialintelligence #futureofai #aiethics #machinelearning #deeplearning #ux #userexperience
open.substack.com/pub/steevero...
28.06.2025 09:57
This looks to be a neat feature in the accessibility settings on iOS. The puzzle is much larger than low vision and there is much more to inclusive design. Approaches like these functions can change the conversation for how we approach accessible #ios26 #liquidglass #wwdc25
10.06.2025 22:24
@bradleystacey.bsky.social
02.05.2025 15:36
🚨 MAJOR: Google offers FREE Gemini Advanced AI tools to U.S. college students until June 2026 via Google One AI Premium plan!
🔹 Eligibility: .edu email, 15+ months free
🔹 Strategic move to dominate EdTech & convert students later
#google #geminiadvanced #veo2 #notebooklm
17.04.2025 22:08
🚨 OpenAI's new models "think with images"! • o3 & o4-mini manipulate images during reasoning • All tools: web search, code, image gen • 91.6% on AIME 2024, 20% fewer errors • For ChatGPT Plus, Pro, Teams #openai #o4mini #chatgpt
16.04.2025 21:47
@bradleystacey.bsky.social
11.04.2025 13:23
I'm curious about workplace language.
What's your take on the term 'vibe coding'?
Would love to hear your thoughts!
#UXResearch #WorkplaceCulture #Tech #Coding #SoftwareDevelopment #AI
11.04.2025 11:26
Wonder what this might be…
#AI
10.04.2025 18:52
Screenshot of the press release. Full text in the link.
Here we go - EU Member States approve first batch of EU trade retaliation against Trump tariffs ec.europa.eu/commission/p...
09.04.2025 13:28
How People with Disabilities Use the Web
Introduces how people with disabilities, including people with age-related impairments, use the Web.
People who want to make the web accessible need to understand the many different ways that people with disabilities use the web. This W3C resource offers a good introduction to how disabled people navigate the web, and barriers they commonly encounter.
www.w3.org/WAI/people-u...
25.01.2025 23:11
Santa jumping on a trampoline in winter
Santa jumping on a trampoline in Autumn
A meditating capybara in a Santa hat on a mountain
A capybara in a Santa hat meditating on a city street
It's been a while since I used ImageFX from Google - and it's pretty consistent
#AI #GenAI #ImageGen
20.12.2024 00:09
What's funny is that, with conversational AI, a slight delay might actually aid authenticity by mimicking natural thinking time, especially for complex queries. However, too long a delay could negatively impact experienced users who expect instant replies.
30.11.2024 12:34
💯% was on a site earlier and the delay was excessive to say the least - wasn't helped by having no states on the button to even indicate something was happening.
29.11.2024 18:08
7/7
But! If you can't make it faster, make it feel faster! Progress bars, animations, and feedback can hack user psychology to extend patience thresholds. ⚡
28.11.2024 19:56
6/7
10+ seconds = ABANDONMENT 🚫
The psychological contract is broken. Users feel disrespected and frustrated. Your brain says "system failure" even if it's just slow.
28.11.2024 19:56