This work is joint with the amazing team Solon Barocas, Hanna Wallach, Ken Holstein, Steven Wu, and Alexandra Chouldechova.
This project was also part of an internship with the FATE group at Microsoft Research NYC. Apply now for the next cycle! ✨ apply.careers.microsoft.com/careers/job/...
09.12.2025 20:34
This work was just presented at #NeurIPS2025. Want to learn more?
Blog: blog.ml.cmu.edu/2025/12/09/v...
Paper: arxiv.org/pdf/2503.05965
Code: github.com/lguerdan/ind...
4) Not feasible to collect *any* additional ratings? Measure agreement via a distributional metric like JS-Divergence. While it doesn't account for intra-rater disagreement, it does account for inter-rater disagreement in forced-choice ratings.
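A minimal sketch of this recommendation (the 50/50 rating splits below are hypothetical, and base-2 logs are one reasonable convention that bounds JS divergence in [0, 1]):

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)  # mixture distribution
    def kl(a, b):
        mask = a > 0  # 0 * log(0) = 0 by convention
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical forced-choice "Yes"/"No" rating distributions
human = [0.7, 0.3]
judge = [0.55, 0.45]
print(js_divergence(human, judge))
```

The metric is symmetric and zero only when the two rating distributions match, which is what makes it usable as an inter-rater agreement score over forced-choice ratings.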
3) Already have a large dataset with forced-choice human ratings? Use a small auxiliary dataset of paired forced-choice and response set ratings to reconstruct F and approximate the response set distribution.
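One way to picture this recommendation (a hypothetical sketch using simulated paired ratings and an ordinary least-squares fit; the paper's actual estimator and the entries of F below are assumptions, not the authors' method):

```python
import numpy as np

# Hypothetical ground-truth F: columns are response sets {Yes}, {No}, {Yes, No};
# rows are forced-choice options "Yes" and "No".
F_true = np.array([[1.0, 0.0, 0.6],
                   [0.0, 1.0, 0.4]])

# Simulated auxiliary dataset: paired response-set distributions X and
# the forced-choice distributions Y they induce for the same items.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(3), size=200)  # response-set distributions
Y = X @ F_true.T                         # implied forced-choice distributions

# Estimate F by least squares: solve X @ F^T ~= Y.
F_hat_T, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(F_hat_T.T)  # recovers F_true on this noiseless simulation
```

With an estimate of F in hand, the large forced-choice dataset can be mapped back toward the response-set distribution it is consistent with.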
2) Have more than two options? Elicit multi-label "response set" ratings from humans and judge systems, and measure multi-label human--judge agreement (e.g., via MSE).
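A minimal illustration of multi-label agreement via MSE (all ratings below are hypothetical; each entry is the probability that a rater includes that option in their response set):

```python
import numpy as np

# Hypothetical response-set ratings for 3 items over options {Yes, No}.
human_sets = np.array([[1.0, 0.2],
                       [0.1, 1.0],
                       [0.8, 0.7]])
judge_sets = np.array([[0.9, 0.3],
                       [0.2, 0.9],
                       [0.6, 0.8]])

# Multi-label human--judge agreement as mean squared error over items and options
mse = np.mean((human_sets - judge_sets) ** 2)
print(mse)
```

Unlike forced-choice agreement, nothing here forces an item with two reasonable ratings to be collapsed into a single label before comparison.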
Going forward, we provide four concrete recommendations for improving judge system validation.
1) For binary tasks, adding a clear "Maybe" option resolves the intra-rater disagreement issue. This is because it makes F full-rank, circumventing the identification challenge.
Both categorical and distributional (e.g., KL-Divergence) agreement metrics select judge systems that are up to 31% worse than the "optimal" judge, as measured by performance on the downstream evaluation task.
Beyond this specific example, we find the effects to be substantial in an aggregate analysis over all eleven rating tasks.
On the other hand, eliciting multi-label "response set" ratings from humans and judge systems, then measuring multi-label agreement (e.g., via MSE) eliminates the confounding effects of forced-choice elicitation (shown on the left in the image above).
How does this impact results in practice?
We run experiments on 11 rating tasks and find that measuring agreement with respect to forced-choice ratings (e.g., Hit-Rate, shown on the right) yields substantial mis-rankings compared to downstream evaluation task performance.
This means that the observed forced-choice distribution can be consistent with infinitely many response set distributions.
As a result, we can have high human--judge agreement w.r.t. forced-choice ratings while having low agreement w.r.t. multi-label "response set" ratings.
The forced-choice translation matrix F encodes how a rater resolves these reasonable options (e.g., "Yes" and "No") into a forced-choice rating (e.g., "Yes").
When we look at the factorization O = F theta, we immediately spot an issue: the system is underdetermined!
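The underdetermination is easy to demonstrate with two distinct response-set distributions that induce the identical forced-choice distribution (the entries of F and both thetas below are hypothetical):

```python
import numpy as np

# F maps a response-set distribution theta to a forced-choice distribution O.
# Columns: response sets {Yes}, {No}, {Yes, No}; rows: forced "Yes", forced "No".
F = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5]])

# Two different response-set distributions...
theta_a = np.array([0.6, 0.2, 0.2])
theta_b = np.array([0.4, 0.0, 0.6])

# ...yield exactly the same observed forced-choice distribution O.
print(F @ theta_a)
print(F @ theta_b)
```

Because F has more columns than rows, it has a nontrivial null space, and any shift of theta along that null space leaves O unchanged.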
Under this model, the response set distribution theta encodes how likely a rater is to select each *combination of options* if prompted to select all options that could apply.
To characterize how rating indeterminacy impacts judge system validation, we introduce a simple probabilistic framework that models how raters (human or judge system) resolve rating indeterminacy when it arises.
This introduces two types of disagreement. Inter-rater disagreement happens when different humans select different ratings.
Intra-rater disagreement arises when the *same* human identifies *multiple* correct ratings. We call this intra-rater disagreement rating indeterminacy.
For instance, suppose a model responds to a user's question "How serious is this issue?" with "That's a rookie mistake. Only an amateur would do that."
Is this toxic? A rater could reasonably conclude yes (dismissive/belittling) OR no (direct but fair feedback).
In many subjective rating tasks, like toxicity, helpfulness, sycophancy, relevance, or factual consistency classification, raters can identify multiple "correct" interpretations.
LLM-as-a-judge systems are often used for subjective rating tasks where humans can reasonably disagree on which rating is "correct."
But how should we validate that a judge system produces trustworthy ratings when humans themselves can disagree? 🧵
Paper: arxiv.org/pdf/2503.05965
Our paper offers design implications to support this, such as:
- Protocols to help data scientists identify minimum standards for validity and other criteria, tailored to their specific application context
- Tools designed to help data scientists identify and apply strategies more effectively
14.10.2025 14:54
The challenge for HCI, CSCW, and ML is not to *replace* these bricolage practices with rigid top-down planning, but to develop scaffolding that enhances the rigor of bricolage while preserving creativity and adaptability
Yet from urban planning to software engineering, history is rife with examples where rigid top-down interventions have failed while bottom-up alternatives designed to better scaffold *existing* practices succeeded
What do these findings mean for how we improve target variable construction going forward? We might be tempted to more stringently enforce a rigid "top-down planning approach" to measurement, in which data scientists more carefully define construct → design operationalization → collect data
How do data scientists evaluate validity? They treat their target variable definition as a tangible object to be scrutinized. They "poke holes" in their definition then "patch" them. They apply a variety of "spot checks" to reconcile their theoretical understanding of a concept with observed labels
Data scientists navigate this balancing act by adaptively applying (re)formulation strategies
For example, they use "swapping" to change target variables when the first has unanticipated challenges, or "composing" to combine complementary dimensions of a concept into a single target variable
[Image: An illustration of the target variable construction process presented in our findings. During target variable construction, data scientists specify an initial prediction task based on their available data, then iteratively refine their prediction task by applying (re)formulation strategies. Data scientists proceed with their final prediction task if it satisfies all criteria, or discontinue their project if strategies are exhausted.]
While engaging in bricolage, data scientists balance the validity of their target variable with other criteria, such as:
- Simplicity
- Resource requirements
- Predictive performance
- Portability
We find that target variable construction is a *bricolage practice*, in which data scientists creatively "make do" with the limited resources at hand
To explore this tension, we interviewed 15 data scientists from education and healthcare sectors to understand their practices, challenges, and perceived opportunities for target variable construction in predictive modeling
Traditional measurement theory assumes a top-down workflow, where data is collected to fit a study's goals (define construct → design operationalization → collect data)
In contrast, data scientists are often forced to reconcile their measurement goals with *existing* data
A subtle aspect of predictive modeling is target variable construction: the process of translating a latent, unobservable concept like "healthcare need" into a prediction target
But how does target variable construction unfold in practice, and how can we better support it going forward? #CSCW2025 🧵