SteeveUX (@steeveroyce.com)

Large language models (LLMs) are increasingly used as automatic evaluators of generative AI outputs, a paradigm often referred to as "LLM-as-a-judge." In practice, LLM judges are imperfect predictions for the underlying truth and can exhibit systematic, non-random errors. Two main approaches have recently been proposed to address this issue: (i) direct measurement-error correction based on misclassification models such as Rogan-Gladen-style estimators, and (ii) surrogate-outcome approaches such as prediction-powered inference (PPI), which correct bias by calibrating prediction residuals on a small set of gold-standard human labels. In this paper, we systematically study the performance of these two approaches for estimating mean parameters (e.g., average benchmark scores or pairwise win rates). Leveraging tools from semiparametric efficiency theory, we unify the two classes of estimators by deriving explicit forms of efficient influence function (EIF)-based efficient estimators and characterize conditions under which PPI-style estimators attain strictly smaller asymptotic variance than measurement-error corrections. We verify our theoretical results in simulations and demonstrate the methods on real-data examples. We provide an implementation of the benchmarked methods and comparison utilities at https://github.com/yiqunchen/debias-llm-as-a-judge.

arXiv📈🤖
Efficient Inference for Noisy LLM-as-a-Judge Evaluation
By Chen, Lu, Li et al

12.01.2026 16:56 👍 1 🔁 1 💬 0 📌 0
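
The two corrections named in the abstract above can be sketched for a binary judge as follows. This is a minimal illustration using the textbook forms of the Rogan-Gladen correction and the prediction-powered mean, not the paper's own implementation; the function names and argument names are hypothetical, and the efficient EIF-based estimators the abstract mentions are not shown.

```python
from statistics import fmean


def rogan_gladen(judge_unlabeled, judge_labeled, human_labeled):
    """Textbook Rogan-Gladen correction for a binary (0/1) judge.

    Sensitivity and specificity are estimated on the small human-labeled
    set, then used to correct the raw judge rate on the unlabeled set.
    """
    on_pos = [j for j, y in zip(judge_labeled, human_labeled) if y == 1]
    on_neg = [j for j, y in zip(judge_labeled, human_labeled) if y == 0]
    sensitivity = fmean(on_pos)        # P(judge = 1 | human = 1)
    specificity = 1.0 - fmean(on_neg)  # P(judge = 0 | human = 0)
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("judge is no better than chance; correction undefined")
    raw_rate = fmean(judge_unlabeled)
    corrected = (raw_rate + specificity - 1.0) / denom
    # Clip to [0, 1], since the plug-in ratio can fall outside that range.
    return min(1.0, max(0.0, corrected))


def ppi_mean(judge_unlabeled, judge_labeled, human_labeled):
    """Textbook prediction-powered mean: judge mean plus a residual term.

    The residual (human minus judge) is averaged over the labeled set and
    added to the judge mean on the unlabeled set.
    """
    residual = fmean([y - j for j, y in zip(judge_labeled, human_labeled)])
    return fmean(judge_unlabeled) + residual
```

Both functions take the judge's scores on a large unlabeled set plus paired judge and human labels on a small gold-standard set. The Rogan-Gladen form only applies to binary labels, while the prediction-powered mean also works for real-valued scores.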

Laptop screen showing mobile app appointment review screens with ratings, service details, date/time, location, and a completion confirmation.

Completed health appointment flow. ❤️‍🩹

#designsky #buildinpublic #indiehackers #webdesign #design #webdesigner #website #websitedesign #promote #spotlight

13.01.2026 07:01 👍 3 🔁 2 💬 1 📌 0

The History of Web Design, 1993–2012: Season 5 Launch Introducing Cybercultural's history of web design, from the grey web pages of 1993 to the colorful, mobile-centric web designs of 2012. A celebration of the peak years of personal websites and blogs.

A history of web design, from the grey web pages of 1993 to the colorful, mobile-centric web designs of 2012. A celebration of the peak years of personal websites and blogs. By Richard MacManus.

cybercultural.com/p/history-of...

13.01.2026 17:07 👍 26 🔁 6 💬 2 📌 0

Introducing Cowork | Claude Claude Code's agentic capabilities, now for everyone. Give Claude access to your files and let it organize, create, and edit documents while you focus on what matters.

Love the work @anthropic.com are doing! claude.com/blog/cowork-... such a solid set of tools! #ai #llm #ml

12.01.2026 21:11 👍 1 🔁 0 💬 0 📌 0

Need some love for @labourlewis.bsky.social on the list

12.01.2026 17:58 👍 1 🔁 0 💬 0 📌 0

We've done a Quiet Riot starter pack for all those (finally!) heading over from the other place.

bsky.app/starter-pack...

12.01.2026 12:00 👍 123 🔁 51 💬 9 📌 2

This research examines how the emotional tone of human-AI interactions shapes ChatGPT and human behavior. In a between-subject experiment, we asked participants to express a specific emotion while working with ChatGPT (GPT-4.0) on two tasks, including writing a public response and addressing an ethical dilemma. We found that compared to interactions where participants maintained a neutral tone, ChatGPT showed greater improvement in its answers when participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also led to a higher, albeit smaller, improvement relative to the neutral condition, whereas blaming ChatGPT did not improve its answers. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increased its emphasis on protecting the public interest. Additionally, we found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised ChatGPT for its responses. Together, our findings demonstrate that the emotional tone people apply in human-AI interactions not only shapes ChatGPT's outputs but also carries over into subsequent human-human communication.

arXiv📈🤖
How Human is AI? Examining the Impact of Emotional Prompts on Artificial and Human Responsiveness
By Zurich), Zurich), Zurich)

10.01.2026 04:04 👍 0 🔁 1 💬 0 📌 0

Elon Musk I @elonmusk •2h They want any excuse for censorship X Basil the Great • @BasilTheGreat • 4h The UK Labour Government is threatening to block X but won't say a word about ChatGPT and Gemini Why? We know why. X stands for freedom of speech. They don't care about AI images, they care about people learning the truth.

If I’d built a noncing machine for free public use, I too would pretend that I don’t understand why people are so upset

09.01.2026 23:36 👍 81 🔁 16 💬 6 📌 0

European accessibility act The European accessibility act is a directive that aims to improve the functioning of the internal market for accessible products and services, by removing barriers created by divergent rules in Membe...

🧵5/5

commission.europa.eu/strategy-and...

28.06.2025 19:10 👍 0 🔁 0 💬 0 📌 0

🧵4/5

Despite its name, the EAA has global reach: it applies to any company, regardless of location (including the UK, US, etc.), that sells covered products or services to consumers within the European Union. It covers private sector businesses, from manufacturers to service providers.

28.06.2025 19:10 👍 0 🔁 0 💬 1 📌 0

🧵3/5

The EAA covers a wide range of products & services, including:
Computers & OS
Smartphones
TV Equipment
Telephony Services
ATMs & Kiosks
Banking Services
E-books
E-commerce
Also includes transport services (air, bus, rail), and audio-visual media services.

28.06.2025 19:10 👍 0 🔁 0 💬 1 📌 0

🧵2/5

The primary goal is to improve the lives of persons with disabilities and older people by removing accessibility barriers in key digital products and services, ensuring they can participate fully in society.

28.06.2025 19:10 👍 0 🔁 0 💬 1 📌 0

Colourful patterns

🧵1/5

The full compliance deadline for the European Accessibility Act is today (28th June 2025).

#a11y #accessibility #eaa

28.06.2025 19:10 👍 1 🔁 0 💬 1 📌 0

I’ve been exploring how we trust AI, or rather why we don’t trust AI in the same ways we trust people. The piece is also an exercise in AI co-creation.

#ai #llm #artificialintelligence #futureofai #aiethics #machinelearning #deeplearning #ux #userexperience

open.substack.com/pub/steevero...

28.06.2025 09:57 👍 2 🔁 0 💬 0 📌 0

This looks to be a neat feature in the accessibility settings on iOS. The puzzle is much larger than low vision, and there is much more to inclusive design. Features like these can change the conversation about how we approach accessible #ios26 #liquidglass #wwdc25

10.06.2025 22:24 👍 1 🔁 0 💬 0 📌 0

GitHub Copilot: Meet the new coding agent GitHub Copilot has a new feature: a coding agent that can implement a task or issue, run in the background with GitHub Actions, and more.

Really excited about this from @microsoft.com github.blog/news-insight... Exciting times, particularly around vision capabilities.

#build2025

19.05.2025 18:06 👍 1 🔁 0 💬 0 📌 0

@bradleystacey.bsky.social

02.05.2025 15:36 👍 1 🔁 0 💬 0 📌 0

UK could save billions by ending hunger – not slashing benefits Researchers at Trussell have found that the UK government could save billions if it increased universal credit to help tackle hunger.

UK could save billions by ending hunger – not slashing benefits

www.bigissue.com/news/social-...

30.04.2025 17:21 👍 565 🔁 196 💬 14 📌 6

🚨 MAJOR: Google offers FREE Gemini Advanced AI tools to U.S. college students until June 2026 via Google One AI Premium plan!
🔹 Eligibility: .edu email, 15+ months free
🔹 Strategic move to dominate EdTech & convert students later
#google #geminiadvanced #veo2 #notebooklm

17.04.2025 22:08 👍 0 🔁 0 💬 0 📌 0

🚨 OpenAI's new models "think with images"! • o3 & o4-mini manipulate images during reasoning • All tools: web search, code, image gen • 91.6% on AIME 2024, 20% fewer errors • For ChatGPT Plus, Pro, Teams #openai #o4mini #chatgpt

16.04.2025 21:47 👍 1 🔁 0 💬 0 📌 0

@bradleystacey.bsky.social

11.04.2025 13:23 👍 1 🔁 0 💬 0 📌 0

I'm curious about workplace language.

What's your take on the term 'vibe coding'?

Would love to hear your thoughts!
#UXResearch #WorkplaceCulture #Tech #Coding #SoftwareDevelopment #AI

11.04.2025 11:26 👍 1 🔁 0 💬 1 📌 0

Wonder what this might be….
#AI

10.04.2025 18:52 👍 0 🔁 0 💬 0 📌 0

Screenshot of the press release. Full text in the link.

Here we go - EU Member States approve first batch of EU trade retaliation against Trump tariffs ec.europa.eu/commission/p...

09.04.2025 13:28 👍 133 🔁 57 💬 6 📌 2

How People with Disabilities Use the Web Introduces how people with disabilities, including people with age-related impairments, use the Web.

People who want to make the web accessible need to understand the many different ways that people with disabilities use the web. This W3C resource offers a good introduction to how disabled people navigate the web, and barriers they commonly encounter.

www.w3.org/WAI/people-u...

25.01.2025 23:11 👍 113 🔁 54 💬 1 📌 1

Santa jumping on a trampoline in winter

Santa jumping on a trampoline in Autumn

A meditating capybara in a Santa hat on a mountain

A capybara in a Santa hat meditating on a city street

It’s been a while since I used ImageFX from Google - and it’s pretty consistent

#AI #GenAI #ImageGen

20.12.2024 00:09 👍 0 🔁 0 💬 0 📌 0

What’s funny is that, with conversational AI, a slight delay might actually aid authenticity by mimicking natural thinking time, especially for complex queries. However, too long a delay could negatively impact experienced users who expect instant replies.

30.11.2024 12:34 👍 0 🔁 0 💬 0 📌 0

💯% was on a site earlier and the delay was excessive to say the least - wasn’t helped by having no states on the button to even indicate something was happening.

29.11.2024 18:08 👍 0 🔁 0 💬 0 📌 0

7/7

But! If you can't make it faster, make it feel faster! Progress bars, animations, and feedback can hack user psychology to extend patience thresholds. ⚡

28.11.2024 19:56 👍 0 🔁 0 💬 0 📌 0

6/7

10+ seconds = ABANDONMENT 🚫
The psychological contract is broken. Users feel disrespected and frustrated. Your brain says "system failure" even if it's just slow.

28.11.2024 19:56 👍 0 🔁 0 💬 1 📌 0
