An "Apple Intelligence" email summary stating only, "Something was said".
"Apple Intelligence" email summaries continue to surprise and delight. π
"Yan and colleagues found that pairing one small language model with another small model (that serves as a 'reflective system') allowed small hybrid systems to compete with larger models that had more than ten times as many parameters [86] (Table 2). These hybrid systems instantiate both aspects of reflection: rather than automatically accept output from the default system, a reflective system reasons about it further β the machine version of dual system psychology."
"Fig. 5. The Bounded Reflectivism algorithm [10], according to which reflective thinking can be triggered by novel, high stakes, or imaginative tasks as well as responses that reveal conflict or yield low confidence,..." "...systems may also make decisions about whether to recruit a reflective model based on the confidence assigned to the default modelβs output β lower confidence scores would indicate opportunities for more reflective inference...."
Title: "Synthetic Intuition: A System-1/System-2 Architecture for Fast and Slow Thinking in Large Language Models" Abstract: "...System-1, a lightweight transformer (350M parameters), handles routine token generation with high confidence, while System2 (7B parameters) is selectively activated when System-1 detects uncertainty or complex reasoning requirements. Our architecture achieves a 3.2Γ speedup in inference while maintaining 97.3 % of the original model's performance on diverse benchmarks. On reasoning-intensive tasks (GSM8K, ARC-Challenge), our approach matches or exceeds baseline performance with 68 % reduced computational cost. We demonstrate that cognitive load can be dynamically allocated, opening new pathways for efficient and adaptive language model deployment."
A new #AI paper found what I predicted a while ago.
In #StrategicReflectivism, I argued that having #LLMs reflect on low-confidence or high-uncertainty outputs of small models can increase accuracy and decrease cost: doi.org/10.48550/arX...
The result: ieeexplore.ieee.org/...
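For anyone curious what that confidence-based gating can look like, here is a minimal Python sketch. The model stubs, the confidence measure, and the 0.8 threshold are all placeholder assumptions for illustration, not code from either paper.

```python
# Minimal sketch of confidence-gated routing between a small default model
# (System 1) and a larger reflective model (System 2). All names and numbers
# here are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Output:
    text: str
    confidence: float  # e.g., mean token probability, in [0, 1]

def small_model(prompt: str) -> Output:
    # Stand-in for a lightweight default model.
    return Output(text=f"draft answer to: {prompt}", confidence=0.62)

def reflective_model(prompt: str, draft: Output) -> Output:
    # Stand-in for the larger reflective model, which reasons about
    # the draft rather than automatically accepting it.
    return Output(text=f"revised answer to: {prompt}", confidence=0.95)

CONFIDENCE_THRESHOLD = 0.8  # assumed; tuned empirically in practice

def answer(prompt: str) -> Output:
    draft = small_model(prompt)
    # Recruit the expensive reflective model only when the default
    # model's confidence is low -- the gating idea quoted above.
    if draft.confidence < CONFIDENCE_THRESHOLD:
        return reflective_model(prompt, draft)
    return draft

print(answer("What is 17 * 24?").text)
```

Most outputs never trigger the big model, which is where the speedup and cost savings come from.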
"Figure 1: Percentage changes in the number of alcohol-attributable cancer incident cases and deaths, by alcohol policy scenario and household income quintile, Canada, 2022 Income quintile 1 is the lowest (least income) and income quintile 5 is the highest (most income). Scenario 1 is a cancer warning label on alcohol containers, scenario 2 is a multi-message rotating label on alcohol containers, scenario 3 is a minimum unit price of CAN$1Β·75, scenario 4 is a minimum unit price of $2Β·00, and scenario 5 is a cancer warning label on alcohol containers and a minimum unit price of $2Β·00." "All policy scenarios were estimated to reduce alcohol use and cancer burden, with stronger effects from more stringent interventions. For example, a $2Β·00 MUP with cancer labels was projected to reduce the number of incident cases of alcohol-attributable cancer by 674 (484β911; 7Β·1% [5Β·1β9Β·6]) and deaths by 216 (155β292; 5Β·6% [4Β·0β7Β·5]) when effects were fully realised."
How many cases of #cancer and #death can be prevented by a new #alcohol warning label?
Hundreds, according to "cancer ...and mortality data, representative alcohol use surveys,... #sales data, [and] the International Model of Alcohol Harms and Policies".
doi.org/10.1016/S246...
"Fig. 2 Influence of advice on moral judgment. The figure plots the proportions, along with the 95% confidence intervals, of subjects who find sacrificing one person the right thing to do after receiving advice. The numbers of observations figure above the boxes"
"Fig. 3 Perceived moral authority and plausibility of advice among subjects who follow advice. The figure plots the mean ratings and standard errors of the mean as well as the number of subjects at the bottom of each bar"
"Fig. 1 Advice by ChatGPT against sacrificing one life to save five with an argument (top) and without (bottom). Advice by moral advisor looked identical except that the icons on the left and the like/dislike buttons on the right were cut off"
Is moral advice more compelling if it includes an argument? What if it's from an #AI?
Advice strongly influenced decisions to sacrifice-one-to-save-five, regardless of whether advice
- came from #chatGPT (3.5).
- included an argument.
doi.org/10.1007/s436...
#xPhi #ethics #edu
How can researchers overcome #AcquiescenceBias?
In a #questionnaire, acquiescence is a tendency to agree with statements or answer affirmatively regardless of survey content.
Alvarado-Leiton et al. report simple ways to mitigate it.
doi.org/10.1093/jssa...
#PsychMethods #xPhi
Are scholars incentivized to quash criticism? Quite the opposite, argues Liam Kofi Bright.
Critique yields academic credit: citations, publications (e.g., replies), attention, etc.
So only maximally famous scholars should quash critique.
doi.org/10.1080/0020...
#science #edu
This POTUS's repeated strikes on Iran prove that he never managed to achieve a better "deal" than the JCPOA.
Pro-JCPOA folks often claimed (and JCPOA critics usually denied) that the alternative to the JCPOA is war.
Every military strike on Iran is more proof that JCPOA critics lost that debate.
Bunker busters alone do not "obliterate" a nuclear program.
Removing a leader is not by itself regime change.
Stunning intelligence tradecraft and military ops do not guarantee #peace.
Stable solutions require cooperation that survives news cycles, celebrations, and elections.
Paper title: "Effect of chatGPT-assisted..." (notice that 'effect' is a causal word) So did the paper design the research in a way that enables causal inference? No. "This prospective simulation-based study included 128 scenarios across common interventional radiology indications. Two expert interventional radiologists served as the reference standard. Three early-career radiologists completed all scenarios twice: first independently (pre-ChatGPT) and, after a two-month washout period, with access to ChatGPT-generated reasoning before recording final decisions (post-ChatGPT)." Without a control group of "early-career radiologists [who] completed all scenarios twice" WITHOUT access to chatGPT, we cannot know whether the improvements were cause by chatGPT (as opposed to being caused by something else, such having practiced on the same 128 scenarios two months earlier). π
"Each early-career radiologist evaluated all 128 scenarios in 2 separate sessions. In the first session, participants provided independent responses based entirely on their clinical reasoning, without using ChatGPT or any other external source. After a two-month washout period, implemented to minimize recall bias and to isolate the educational impact of model-assisted reasoning more effectively, a second session was conducted. During this phase, each scenario was first submitted to ChatGPT, whose recommendation and justification were reviewed before the radiologist recorded their final decision. Thus, both pre-ChatGPT and post-ChatGPT evaluations were obtained for each early-career radiologist."
If radiologists made more appropriate decisions after input from #chatGPT than before, did #AI *cause* the improvement?
We can't know without a control group of radiologists who made the same decisions at *both* time points *without* input from the #LLM.
doi.org/10.1016/j.ac...
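To see why the missing control arm matters, here is a toy Python simulation (every number invented) in which a pure practice effect produces the entire pre/post gain even though the simulated ChatGPT contributes nothing.

```python
# Toy simulation: practice effects alone can mimic an "AI effect"
# in a pre/post design with no control group. All values are invented.
import random

random.seed(0)
N_SCENARIOS = 128
BASELINE_ACCURACY = 0.70
PRACTICE_EFFECT = 0.08  # assumed gain from having seen the scenarios before
AI_EFFECT = 0.00        # suppose ChatGPT adds nothing at all

def simulate(accuracy: float) -> float:
    # Fraction of scenarios answered appropriately at a given accuracy.
    hits = sum(random.random() < accuracy for _ in range(N_SCENARIOS))
    return hits / N_SCENARIOS

pre = simulate(BASELINE_ACCURACY)
post = simulate(BASELINE_ACCURACY + PRACTICE_EFFECT + AI_EFFECT)
print(f"pre-ChatGPT: {pre:.1%}, post-ChatGPT: {post:.1%}")
# Without a no-ChatGPT arm repeating the same scenarios, this pre/post
# difference is indistinguishable from a genuine effect of the AI.
```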
"Short scenarios were created using the positive and negative aspects of journal characteristics. Each scenario describes a situation in which one is confronted with an imaginary scientific journal. The journal is described briefly based on the aforementioned characteristics. A total of eight scenarios were created (all available in Table A2 and in Table S3 at https://osf. io/vf3tx/), combining all the positive and negative aspects of the three characteristics (Table 3). An example of a single scenario with two positive and one negative characteristic (Scenario 4) is presented in Table 4."
"all main effects and interactions are s tically significant, with large effect sizes for the main effects and two-way interactions, and a medium effect size for the three-way interaction (Cohen 1988). Each positive journal characteristic significantly increased perceived credibility. When two or three positive features were combined, an interaction effect occurred whereby the shift in the journal's credibility was additionally strengthened, i.e., it was greater than the cumulative effect of the individual characteristics (Figure 1). The post hoc Tukey HSD test shows that most scenario pairs differed significantly in credibility ratings."
"The effect of publishing expectations was statistically significant: scholars with higher publishing expectations (HPE) gave lower credibility ratings than scholars with LPE, suggesting a more critical evaluation. However, the effect size was small (Ξ·β 2 = 0.014). No significant difference was found between [social science] and [humanities] scholars. The number of positive characteristics had a large effect on credibility ratings, with a greater number of positive characteristics leading to notably higher credibility ratings. As illustrated in Figure 2 , this increase follows an exponential trend, consistent with the interaction effects shown in Figure 1, where combinations of positive attributes enhance credibility beyond the sum of the individual attributes."
"The correlation between [reflection test] scores and the average rating across all scenarios was negative and low, but statistically significant (Table 8), suggesting that more reflective participants tended to rate journals more critically. Specifically, a significant negative correlation was obtained for journals with zero or one positive journal characteristics and at the same time with three or two negative attributes. However, the ratings of journals with two and three positive attributes did not correlate with the [reflection test]." "Exploratory Factor Analysis ...yielded a two-factor solution accounting for 59.39% of the total variance. Component 1 (eigenvalue of 3.593) explained 44.91% of the variance, whilst Component 2 (eigenvalue of 1.158) explained 14.48%. The structure matrix (see Table 9) shows that Component 1 included low-quality journals, and and Component 2 included high-quality journals. Journals with mixed characteristics loaded on both components"
How do academics judge journals?
For >1,000 scholars in #socialSciences and #humanities:
- claiming fair/transparent #peerReview, having an accurate title, and disclosing author contact info increased credibility.
- reflective thinking predicted lower credibility.
doi.org/10.1002/leap...
Just a few more housekeeping reminders before we get started:
No food, drinks, or unverifiable thoughts.
Please put all devices in Carnap mode.
Our printer is for empirically meaningful propositions only.
And we've space for a few more pseudo-statements in this weekend's pickup language game.
Can we assume research participants accept the stipulations of vignettes?
This paper reports that moral dilemma decisions varied according to how much people seemed to believe stipulations (e.g., that intervening would actually save five people).
doi.org/10.1017/S193... #xPhi
I'm surprised by the claims of the thread and abstract.
There's lots of classic and ongoing cognitive science on a distinct form of conscious and deliberate "thinking" (a.k.a. reflection), its role in logical inference, its measurement (including think-aloud), etc. The paper cites only some of it.
I have not yet encountered a reason to doubt what you or other clinicians tell me about OE.
And Iβm not disputing that clinMed has demonstrable answers.
So we may fully agree.
My point: if ChatGPT Health is most comparable to OE and if OE is worse, then the top-level "not very well" seems wrong.
Helpful perspective!
You're not the first clinician to tell me that OpenEvidence is …underwhelming.
Perhaps this is just confirmation bias, but that does make me think the upshot of this viral chatGPT-Health result really depends on what it is (or should be) compared to.
Intuition could also be about as bad as (or worse than) a GPT fine-tuned for health.
And search engines are only as good as their users. So I wouldnβt be surprised if they get uninspiring results.
Of course – and to my point – how well one does (compared to status quo) is an empirical question.
Sounds like the writerβs fault. ;)
Interesting thought for Shaw and Nave, who wrote the System 3 paper.
I consider all the data to be compatible with two processes. If there is a third process, I imagine itβs an already theorized process such as conflict detection.
"Abstract. People increasingly consult generative artificial intelligence (AI) while reasoning. As AI becomes embedded in daily thought, what becomes of human judgment? We introduce Tri-System Theory, extending dual-process accounts of reasoning by positing System 3: artificial cognition that operates outside the brain. System 3 can supplement or supplant internal processes, introducing novel cognitive pathways. A key prediction of the theory is βcognitive surrenderββadopting AI outputs with minimal scrutiny, overriding intuition (System 1) and deliberation (System 2). Across three preregistered experiments using an adapted Cognitive Reflection Test (N = 1,372; 9,593 trials), we randomized AI accuracy via hidden seed prompts. Participants chose to consult an AI assistant on a majority of trials (>50%). Relative to baseline (no System 3 access), accuracy significantly rose when AI was accurate and fell when it erred (+25/-15 percentage points; Study 1),..."
"Figure 8. Incentives and feedback increase accuracy, but cognitive surrender persists"
Does #AI nudge us to think more reflectively?
This paper saw reflection test takers accepting more correct than incorrect guidance from a #LLM – a net benefit v. testing alone: doi.org/10.31234/osf...
We found that in human-to-human chat experiments too: www.researchgate.net...
Just curious: how do people normally make such medical decisions?
I'd expect ordinary people to use a search engine, AI overview, family member, …or intuition?
Unless status quo method(s) perform significantly worse on the same test, it seems premature to say a health chatbot did "not very well".
"We identified 41 trials that included 194,035 participants. Many of the studies had limitations. Low-quality evidence suggests that providing CVD risk scores had little or no eο¬ect on the number of people who develop heart disease or stroke. Providing CVD risk scores may reduce CVD risk factor levels (like cholesterol, blood pressure, and multivariable CVD risk) by a small amount and may increase cholesterollowering and blood pressure-lowering medication prescribing in higher risk people. Providing CVD risk scores may reduce harms, but the results were imprecise."
Do people benefit from learning their #risk of a #cardiovascularDisease event (e.g., stroke)?
A 2017 review found small improvements in #medication and #bloodPressure, but effects on other factors were more tenuous: doi.org/10.1002/1465...
What have we learned since?
#medicine
New paper! 🚨
Does studying psychology change how people think about psychology (even at an intuitive level)? 🤔
We tracked students across their degree and found shifts in their beliefs about the bases of psychological phenomena and their scientific explainability.
1/5
Now out in @nataging.nature.com: Exposure to low-credibility online health content is limited and is concentrated among older adults www.nature.com/articles/s43...
Linking discernment data with exposure from web and YouTube, comparing health vs politics exposure, etc. Lots of good stuff.
Top-down view of spare office space.
3D model of office space
My initial college plan was to add to my experience in the trades with a degree in #architecture or #engineering.
After pivoting to #academia, my interest in building and #design resurfaced – as when I helped repurpose furniture storage into usable #workSpace during #gradSchool.
A plane vs. train version would be 🔥
We rarely realize how little time is wasted
- getting to/from train stations, which are often closer to our destination.
- waiting in train stations, where you can board/depart *minutes* after arriving.
Airports waste way more time.
Apps could make it obvious.
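For illustration, a small Python sketch of the door-to-door arithmetic such an app could surface; every duration below is an assumed placeholder, not real schedule data.

```python
# Door-to-door trip time comparison; all durations (minutes) are assumptions.
def door_to_door(to_terminal: float, terminal_buffer: float,
                 in_vehicle: float, to_destination: float) -> float:
    """Total trip time in minutes, door to door."""
    return to_terminal + terminal_buffer + in_vehicle + to_destination

plane = door_to_door(to_terminal=45, terminal_buffer=90,
                     in_vehicle=75, to_destination=40)
train = door_to_door(to_terminal=15, terminal_buffer=10,
                     in_vehicle=150, to_destination=10)
print(f"plane: {plane / 60:.1f} h, train: {train / 60:.1f} h door to door")
# With these assumed values, the 75-minute flight loses to the
# 150-minute train ride once station proximity and boarding buffers count.
```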
The four experimental conditions.
The two learning metrics.
"As shown in Figure 13, there is no significant difference across treatments for both the learning during intervention (πΉ (3, 398) = 1.046, π = 0.372) and learning after intervention (πΉ (3, 398) = 1.193, π = 0.312)."
"Figure 13: Comparisons on participantsβ learning (a) while receiving the AI assistance intervention (b) after receiving the AI assistance intervention. Values for the Human-only treatment are computed based on the normalized change of decision accuracy between corresponding sessions of tasks (learning during intervention: tasks 6β15 vs. tasks 1β5, learning after intervention: tasks 16β20 vs. tasks 1β5); they provide a baseline for organic learning happened due to repetitive task completion without AI assistance interventions. Error bars represent 95% confidence intervals of the mean values."
Do people learn more from #AI decision assistants?
This experiment found no significant difference in learning during or after three forms of AI-assisted decision-making, compared to human-only decision-making.
doi.org/10.48550/arX...
#edu #teaching #cogSci #eduTech #compSci
Overview of the quasi-experiment
How writing assignments were scored: two independent, trained graders.
The appendix shows that the experimental group (EG) had access to AI during the pre-test, but the control group (CG) did not. So the pre-test (baseline) measurement was not identical across groups.
The results show more improvement in the AI group than in the control group, but one has to wonder whether the AI group would have shown even more improvement had access to AI not been able to inflate its pre-test/baseline scores.
#AI #Education experiments can be great, if careful.
A 6-week #writing course gave one group an in-person instructor and the other group a #languageModel.
But the AI group accessed the #LLM *during* the pre-test?
Shouldnβt baseline conditions be equal?
doi.org/10.1007/s442...
Thank you for posting the pre-copy edited version of the book.
#ThisIsTheWay
"Accuracy outcomes reflected this same structural pattern. In True Conflict, where physicians were initially incorrect and the AI correct, decision changes produced a clear net accuracy gain of 13.20%, consistent with movement toward the correct answer. In False Conflict, where physicians began with the correct answer but the AI was incorrect, decision changes resulted in a net accuracy loss of 4.24%, indicating that many revisions replaced a correct answer with an incorrect one."
"where physicians were initially incorrect and the AI provided the correct answer"... "The linear-by-linear association [between clinician accuracy and decision correction] was ...significant, ΟΒ² = 44.80, p < .01, indicating a clear monotonic trend: physicians with higher diagnostic accuracy were increasingly likely to correct their initial error when confronted with accurate AI disagreement. The effect size was medium, CramΓ©rβs V = 0.362 (p < .01)."
"when physicians initially provided the correct diagnosis and the AI gave an incorrect answer" there was "no significant association between [clinician accuracy] and decision change, ΟΒ²(3, N = 146) = 4.53, p = .210. The linear-by-linear association was also non-significant, ΟΒ² = 1.16, p = .281, and the overall association was weak and non-significant (CramΓ©rβs V = 0.176, p = .210)."
Why might #AI-assisted #dermatology decisions improve by about 10%?
Across 4,905 unaided and AI-aided decisions, clinicians were more likely to revise when the AI disagreed than when it agreed – even when the AI was wrong!
http://lup.lub.lu.se...
#medicine #tech #edu #cogSci
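A back-of-the-envelope Python decomposition of how those two quoted net-change figures could combine into a roughly 10% overall gain. The two per-condition figures come from the quoted paper; the share of decisions falling in each conflict type is an invented assumption for illustration only.

```python
# Hypothetical decomposition of the overall accuracy change from AI advice.
true_conflict_gain = 0.1320    # physician wrong, AI right (from the paper)
false_conflict_loss = -0.0424  # physician right, AI wrong (from the paper)

# Assumed (not reported here): how often each conflict type occurs
# among the decisions where the AI disagreed with the physician.
p_true_conflict = 0.75
p_false_conflict = 0.25

net = (p_true_conflict * true_conflict_gain
       + p_false_conflict * false_conflict_loss)
print(f"net accuracy change among conflict cases: {net:+.1%}")
# ~ +8.8% with these assumed shares: gains in true conflicts outweigh
# losses in false conflicts whenever the AI is right more often than wrong.
```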
And many organizations are more or less locked into the Microsoft ecosystem.
Will CIA analysts be able to switch to an OpenAI word processor?
Medical researchers working with sensitive health information?