An "Apple Intelligence" email summary stating only, "Something was said".
"Apple Intelligence" email summaries continue to surprise and delight. π
"Yan and colleagues found that pairing one small language model with another small model (that serves as a 'reflective system') allowed small hybrid systems to compete with larger models that had more than ten times as many parameters [86] (Table 2). These hybrid systems instantiate both aspects of reflection: rather than automatically accept output from the default system, a reflective system reasons about it further β the machine version of dual system psychology."
"Fig. 5. The Bounded Reflectivism algorithm [10], according to which reflective thinking can be triggered by novel, high stakes, or imaginative tasks as well as responses that reveal conflict or yield low confidence,..." "...systems may also make decisions about whether to recruit a reflective model based on the confidence assigned to the default modelβs output β lower confidence scores would indicate opportunities for more reflective inference...."
Title: "Synthetic Intuition: A System-1/System-2 Architecture for Fast and Slow Thinking in Large Language Models" Abstract: "...System-1, a lightweight transformer (350M parameters), handles routine token generation with high confidence, while System2 (7B parameters) is selectively activated when System-1 detects uncertainty or complex reasoning requirements. Our architecture achieves a 3.2Γ speedup in inference while maintaining 97.3 % of the original model's performance on diverse benchmarks. On reasoning-intensive tasks (GSM8K, ARC-Challenge), our approach matches or exceeds baseline performance with 68 % reduced computational cost. We demonstrate that cognitive load can be dynamically allocated, opening new pathways for efficient and adaptive language model deployment."
A new #AI paper found what I predicted a while ago.
In #StrategicReflectivism, I argued that having #LLMs reflect on low-confidence or high-uncertainty outputs of small models can increase accuracy and decrease cost: doi.org/10.48550/arX...
The result: ieeexplore.ieee.org/...
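For anyone curious what that confidence-based gating can look like, here is a minimal Python sketch. The model stubs, the confidence measure, and the 0.8 threshold are all placeholder assumptions for illustration, not code from either paper.

```python
# Minimal sketch of confidence-gated routing between a small default model
# (System 1) and a larger reflective model (System 2). All names and numbers
# here are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Output:
    text: str
    confidence: float  # e.g., mean token probability, in [0, 1]

def small_model(prompt: str) -> Output:
    # Stand-in for a lightweight default model.
    return Output(text=f"draft answer to: {prompt}", confidence=0.62)

def reflective_model(prompt: str, draft: Output) -> Output:
    # Stand-in for the larger reflective model, which reasons about
    # the draft rather than automatically accepting it.
    return Output(text=f"revised answer to: {prompt}", confidence=0.95)

CONFIDENCE_THRESHOLD = 0.8  # assumed; tuned empirically in practice

def answer(prompt: str) -> Output:
    draft = small_model(prompt)
    # Recruit the expensive reflective model only when the default
    # model's confidence is low -- the gating idea quoted above.
    if draft.confidence < CONFIDENCE_THRESHOLD:
        return reflective_model(prompt, draft)
    return draft

print(answer("What is 17 * 24?").text)
```

Most outputs never trigger the big model, which is where the speedup and cost savings come from.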
"Figure 1: Percentage changes in the number of alcohol-attributable cancer incident cases and deaths, by alcohol policy scenario and household income quintile, Canada, 2022 Income quintile 1 is the lowest (least income) and income quintile 5 is the highest (most income). Scenario 1 is a cancer warning label on alcohol containers, scenario 2 is a multi-message rotating label on alcohol containers, scenario 3 is a minimum unit price of CAN$1Β·75, scenario 4 is a minimum unit price of $2Β·00, and scenario 5 is a cancer warning label on alcohol containers and a minimum unit price of $2Β·00." "All policy scenarios were estimated to reduce alcohol use and cancer burden, with stronger effects from more stringent interventions. For example, a $2Β·00 MUP with cancer labels was projected to reduce the number of incident cases of alcohol-attributable cancer by 674 (484β911; 7Β·1% [5Β·1β9Β·6]) and deaths by 216 (155β292; 5Β·6% [4Β·0β7Β·5]) when effects were fully realised."
How many cases of #cancer and #death can be prevented by a new #alcohol warning label?
Hundreds, according to "cancer ...and mortality data, representative alcohol use surveys,... #sales data, [and] the International Model of Alcohol Harms and Policies".
doi.org/10.1016/S246...
"Fig. 2 Influence of advice on moral judgment. The figure plots the proportions, along with the 95% confidence intervals, of subjects who find sacrificing one person the right thing to do after receiving advice. The numbers of observations figure above the boxes"
"Fig. 3 Perceived moral authority and plausibility of advice among subjects who follow advice. The figure plots the mean ratings and standard errors of the mean as well as the number of subjects at the bottom of each bar"
"Fig. 1 Advice by ChatGPT against sacrificing one life to save five with an argument (top) and without (bottom). Advice by moral advisor looked identical except that the icons on the left and the like/dislike buttons on the right were cut off"
Is moral advice more compelling if it includes an argument? What if it's from an #AI?
Advice strongly influenced decisions to sacrifice-one-to-save-five, regardless of whether advice
- came from #chatGPT (3.5).
- included an argument.
doi.org/10.1007/s436...
#xPhi #ethics #edu
How can researchers overcome #AcquiescenceBias?
In a #questionnaire, acquiescence is a tendency to agree with statements or answer affirmatively regardless of survey content.
Alvarado-Leiton et al. report simple ways to mitigate it.
doi.org/10.1093/jssa...
#PsychMethods #xPhi
Are scholars incentivized to quash criticism? Quite the opposite, argues Liam Kofi Bright.
Critique yields academic credit: citations, publications (e.g., replies), attention, etc.
So only maximally famous scholars should quash critique.
doi.org/10.1080/0020...
#science #edu
This POTUS's repeated strikes on Iran prove that he never managed to achieve a better "deal" than the JCPOA.
Pro-JCPOA folks often claimed (and JCPOA critics usually denied) that the alternative to the JCPOA is war.
Every military strike on Iran is more proof that JCPOA critics lost that debate.
Bunker busters alone do not "obliterate" a nuclear program.
Removing a leader is not by itself regime change.
Stunning intelligence tradecraft and military ops do not guarantee #peace.
Stable solutions require cooperation that survives news cycles, celebrations, and elections.
Paper title: "Effect of chatGPT-assisted..." (notice that 'effect' is a causal word) So did the paper design the research in a way that enables causal inference? No. "This prospective simulation-based study included 128 scenarios across common interventional radiology indications. Two expert interventional radiologists served as the reference standard. Three early-career radiologists completed all scenarios twice: first independently (pre-ChatGPT) and, after a two-month washout period, with access to ChatGPT-generated reasoning before recording final decisions (post-ChatGPT)." Without a control group of "early-career radiologists [who] completed all scenarios twice" WITHOUT access to chatGPT, we cannot know whether the improvements were cause by chatGPT (as opposed to being caused by something else, such having practiced on the same 128 scenarios two months earlier). π
"Each early-career radiologist evaluated all 128 scenarios in 2 separate sessions. In the first session, participants provided independent responses based entirely on their clinical reasoning, without using ChatGPT or any other external source. After a two-month washout period, implemented to minimize recall bias and to isolate the educational impact of model-assisted reasoning more effectively, a second session was conducted. During this phase, each scenario was first submitted to ChatGPT, whose recommendation and justification were reviewed before the radiologist recorded their final decision. Thus, both pre-ChatGPT and post-ChatGPT evaluations were obtained for each early-career radiologist."
If radiologists made more appropriate decisions after input from #chatGPT than before, did #AI *cause* the improvement?
We can't know without a control group of radiologists who made the same decisions at *both* time points *without* input from the #LLM.
doi.org/10.1016/j.ac...
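To see why the missing control arm matters, here is a toy Python simulation (every number invented) in which a pure practice effect produces the entire pre/post gain even though the simulated ChatGPT contributes nothing.

```python
# Toy simulation: practice effects alone can mimic an "AI effect"
# in a pre/post design with no control group. All values are invented.
import random

random.seed(0)
N_SCENARIOS = 128
BASELINE_ACCURACY = 0.70
PRACTICE_EFFECT = 0.08  # assumed gain from having seen the scenarios before
AI_EFFECT = 0.00        # suppose ChatGPT adds nothing at all

def simulate(accuracy: float) -> float:
    # Fraction of scenarios answered appropriately at a given accuracy.
    hits = sum(random.random() < accuracy for _ in range(N_SCENARIOS))
    return hits / N_SCENARIOS

pre = simulate(BASELINE_ACCURACY)
post = simulate(BASELINE_ACCURACY + PRACTICE_EFFECT + AI_EFFECT)
print(f"pre-ChatGPT: {pre:.1%}, post-ChatGPT: {post:.1%}")
# Without a no-ChatGPT arm repeating the same scenarios, this pre/post
# difference is indistinguishable from a genuine effect of the AI.
```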
"Short scenarios were created using the positive and negative aspects of journal characteristics. Each scenario describes a situation in which one is confronted with an imaginary scientific journal. The journal is described briefly based on the aforementioned characteristics. A total of eight scenarios were created (all available in Table A2 and in Table S3 at https://osf. io/vf3tx/), combining all the positive and negative aspects of the three characteristics (Table 3). An example of a single scenario with two positive and one negative characteristic (Scenario 4) is presented in Table 4."
"all main effects and interactions are s tically significant, with large effect sizes for the main effects and two-way interactions, and a medium effect size for the three-way interaction (Cohen 1988). Each positive journal characteristic significantly increased perceived credibility. When two or three positive features were combined, an interaction effect occurred whereby the shift in the journal's credibility was additionally strengthened, i.e., it was greater than the cumulative effect of the individual characteristics (Figure 1). The post hoc Tukey HSD test shows that most scenario pairs differed significantly in credibility ratings."
"The effect of publishing expectations was statistically significant: scholars with higher publishing expectations (HPE) gave lower credibility ratings than scholars with LPE, suggesting a more critical evaluation. However, the effect size was small (Ξ·β 2 = 0.014). No significant difference was found between [social science] and [humanities] scholars. The number of positive characteristics had a large effect on credibility ratings, with a greater number of positive characteristics leading to notably higher credibility ratings. As illustrated in Figure 2 , this increase follows an exponential trend, consistent with the interaction effects shown in Figure 1, where combinations of positive attributes enhance credibility beyond the sum of the individual attributes."
"The correlation between [reflection test] scores and the average rating across all scenarios was negative and low, but statistically significant (Table 8), suggesting that more reflective participants tended to rate journals more critically. Specifically, a significant negative correlation was obtained for journals with zero or one positive journal characteristics and at the same time with three or two negative attributes. However, the ratings of journals with two and three positive attributes did not correlate with the [reflection test]." "Exploratory Factor Analysis ...yielded a two-factor solution accounting for 59.39% of the total variance. Component 1 (eigenvalue of 3.593) explained 44.91% of the variance, whilst Component 2 (eigenvalue of 1.158) explained 14.48%. The structure matrix (see Table 9) shows that Component 1 included low-quality journals, and and Component 2 included high-quality journals. Journals with mixed characteristics loaded on both components"
How do academics judge journals?
For >1,000 scholars in #socialSciences and #humanities:
- claiming fair/transparent #peerReview, having an accurate title, and disclosing author contact info increased credibility.
- reflective thinking predicted lower credibility.
doi.org/10.1002/leap...
Just a few more housekeeping reminders before we get started:
No food, drinks, or unverifiable thoughts.
Please put all devices in Carnap mode.
Our printer is for empirically meaningful propositions only.
And we've space for a few more pseudo-statements in this weekend's pickup language game.
Can we assume research participants accept the stipulations of vignettes?
This paper reports that moral dilemma decisions varied according to how much people seemed to believe stipulations (e.g., that intervening would actually save five people).
doi.org/10.1017/S193... #xPhi
I'm surprised by the claims of the thread and abstract.
There's lots of classic and ongoing cognitive science on a distinct form of conscious and deliberate "thinking" (a.k.a. reflection), its role in logical inference, its measurement (including think-aloud), etc. The paper cites only some of it.
I have not yet encountered a reason to doubt what you or other clinicians tell me about OE.
And Iβm not disputing that clinMed has demonstrable answers.
So we may fully agree.
My point: if ChatGPT Health is most comparable to OE and if OE is worse, then the top-level "not very well" seems wrong.
Helpful perspective!
You're not the first clinician to tell me that OpenEvidence is …underwhelming.
Perhaps this is just confirmation bias, but that does make me think the upshot of this viral chatGPT-Health result really depends on what it is (or should be) compared to.
Intuition could also be about as bad as (or worse than) a GPT fine-tuned for health.
And search engines are only as good as their users. So I wouldnβt be surprised if they get uninspiring results.
Of course – and to my point – how well one does (compared to status quo) is an empirical question.
Sounds like the writerβs fault. ;)
Interesting thought for Shaw and Nave, who wrote the System 3 paper.
I consider all the data to be compatible with two processes. If there is a third process, I imagine itβs an already theorized process such as conflict detection.
"Abstract. People increasingly consult generative artificial intelligence (AI) while reasoning. As AI becomes embedded in daily thought, what becomes of human judgment? We introduce Tri-System Theory, extending dual-process accounts of reasoning by positing System 3: artificial cognition that operates outside the brain. System 3 can supplement or supplant internal processes, introducing novel cognitive pathways. A key prediction of the theory is βcognitive surrenderββadopting AI outputs with minimal scrutiny, overriding intuition (System 1) and deliberation (System 2). Across three preregistered experiments using an adapted Cognitive Reflection Test (N = 1,372; 9,593 trials), we randomized AI accuracy via hidden seed prompts. Participants chose to consult an AI assistant on a majority of trials (>50%). Relative to baseline (no System 3 access), accuracy significantly rose when AI was accurate and fell when it erred (+25/-15 percentage points; Study 1),..."
"Figure 8. Incentives and feedback increase accuracy, but cognitive surrender persists"
Does #AI nudge us to think more reflectively?
This paper saw reflection test takers accepting more correct than incorrect guidance from a #LLM – a net benefit v. testing alone: doi.org/10.31234/osf...
We found that in human-to-human chat experiments too: www.researchgate.net...
Just curious: how do people normally make such medical decisions?
I'd expect ordinary people to use a search engine, AI overview, family member, …or intuition?
Unless status quo method(s) perform significantly worse on the same test, it seems premature to say a health chatbot did "not very well".
"We identified 41 trials that included 194,035 participants. Many of the studies had limitations. Low-quality evidence suggests that providing CVD risk scores had little or no eο¬ect on the number of people who develop heart disease or stroke. Providing CVD risk scores may reduce CVD risk factor levels (like cholesterol, blood pressure, and multivariable CVD risk) by a small amount and may increase cholesterollowering and blood pressure-lowering medication prescribing in higher risk people. Providing CVD risk scores may reduce harms, but the results were imprecise."
Do people benefit from learning their #risk of a #cardiovascularDisease event (e.g., stroke)?
A 2017 review found small improvements in #medication and #bloodPressure, but effects on other factors were more tenuous: doi.org/10.1002/1465...
What have we learned since?
#medicine
New paper! 🚨
Does studying psychology change how people think about psychology (even at an intuitive level)? 🤔
We tracked students across their degree and found shifts in their beliefs about the bases of psychological phenomena and their scientific explainability.
1/5
Now out in @nataging.nature.com: Exposure to low-credibility online health content is limited and is concentrated among older adults www.nature.com/articles/s43...
Linking discernment data with exposure from web and YouTube, comparing health vs politics exposure, etc. Lots of good stuff.
Top-down view of spare office space.
3D model of office space
My initial college plan was to add to my experience in the trades with a degree in #architecture or #engineering.
After pivoting to #academia, my interest in building and #design resurfaced – as when I helped repurpose furniture storage into usable #workSpace during #gradSchool.
A plane vs. train version would be 🔥
We rarely realize how little time is wasted
- getting to/from train stations, which are often closer to our destination.
- waiting in train stations, where you can board/depart *minutes* after arriving.
Airports waste way more time.
Apps could make it obvious.
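For illustration, a small Python sketch of the door-to-door arithmetic such an app could surface; every duration below is an assumed placeholder, not real schedule data.

```python
# Door-to-door trip time comparison; all durations (minutes) are assumptions.
def door_to_door(to_terminal: float, terminal_buffer: float,
                 in_vehicle: float, to_destination: float) -> float:
    """Total trip time in minutes, door to door."""
    return to_terminal + terminal_buffer + in_vehicle + to_destination

plane = door_to_door(to_terminal=45, terminal_buffer=90,
                     in_vehicle=75, to_destination=40)
train = door_to_door(to_terminal=15, terminal_buffer=10,
                     in_vehicle=150, to_destination=10)
print(f"plane: {plane / 60:.1f} h, train: {train / 60:.1f} h door to door")
# With these assumed values, the 75-minute flight loses to the
# 150-minute train ride once station proximity and boarding buffers count.
```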
The four experimental conditions.
The two learning metrics.
"As shown in Figure 13, there is no significant difference across treatments for both the learning during intervention (πΉ (3, 398) = 1.046, π = 0.372) and learning after intervention (πΉ (3, 398) = 1.193, π = 0.312)."
"Figure 13: Comparisons on participantsβ learning (a) while receiving the AI assistance intervention (b) after receiving the AI assistance intervention. Values for the Human-only treatment are computed based on the normalized change of decision accuracy between corresponding sessions of tasks (learning during intervention: tasks 6β15 vs. tasks 1β5, learning after intervention: tasks 16β20 vs. tasks 1β5); they provide a baseline for organic learning happened due to repetitive task completion without AI assistance interventions. Error bars represent 95% confidence intervals of the mean values."
Do people learn more from #AI decision assistants?
This experiment found no significant difference in learning during or after three forms of AI-assisted decision-making, compared to human-only decision-making.
doi.org/10.48550/arX...
#edu #teaching #cogSci #eduTech #compSci
Overview of the quasi-experiment
How writing assignments were scored: two independent, trained graders.
The appendix shows that the experimental group (EG) had access to AI during the pre-test, but the control group (CG) did not. So the pre-test (baseline) measurement was not identical across groups.
The results show more improvement in the AI group than in the control group, but one has to wonder whether the AI group would have shown even more improvement had access to AI not been able to inflate its pre-test/baseline scores.
#AI #Education experiments can be great, if careful.
A 6-week #writing course gave one group an in-person instructor and the other group a #languageModel.
But the AI group accessed the #LLM *during* the pre-test?
Shouldnβt baseline conditions be equal?
doi.org/10.1007/s442...
Thank you for posting the pre-copy edited version of the book.
#ThisIsTheWay
"Accuracy outcomes reflected this same structural pattern. In True Conflict, where physicians were initially incorrect and the AI correct, decision changes produced a clear net accuracy gain of 13.20%, consistent with movement toward the correct answer. In False Conflict, where physicians began with the correct answer but the AI was incorrect, decision changes resulted in a net accuracy loss of 4.24%, indicating that many revisions replaced a correct answer with an incorrect one."
"where physicians were initially incorrect and the AI provided the correct answer"... "The linear-by-linear association [between clinician accuracy and decision correction] was ...significant, ΟΒ² = 44.80, p < .01, indicating a clear monotonic trend: physicians with higher diagnostic accuracy were increasingly likely to correct their initial error when confronted with accurate AI disagreement. The effect size was medium, CramΓ©rβs V = 0.362 (p < .01)."
"when physicians initially provided the correct diagnosis and the AI gave an incorrect answer" there was "no significant association between [clinician accuracy] and decision change, ΟΒ²(3, N = 146) = 4.53, p = .210. The linear-by-linear association was also non-significant, ΟΒ² = 1.16, p = .281, and the overall association was weak and non-significant (CramΓ©rβs V = 0.176, p = .210)."
Why might #AI-assisted #dermatology decisions improve by about 10%?
Across 4,905 unaided and AI-aided decisions, clinicians were more likely to revise when the AI disagreed than when it agreed – even when the AI was wrong!
http://lup.lub.lu.se...
#medicine #tech #edu #cogSci
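A back-of-the-envelope Python decomposition of how those two quoted net-change figures could combine into a roughly 10% overall gain. The two per-condition figures come from the quoted paper; the share of decisions falling in each conflict type is an invented assumption for illustration only.

```python
# Hypothetical decomposition of the overall accuracy change from AI advice.
true_conflict_gain = 0.1320    # physician wrong, AI right (from the paper)
false_conflict_loss = -0.0424  # physician right, AI wrong (from the paper)

# Assumed (not reported here): how often each conflict type occurs
# among the decisions where the AI disagreed with the physician.
p_true_conflict = 0.75
p_false_conflict = 0.25

net = (p_true_conflict * true_conflict_gain
       + p_false_conflict * false_conflict_loss)
print(f"net accuracy change among conflict cases: {net:+.1%}")
# ~ +8.8% with these assumed shares: gains in true conflicts outweigh
# losses in false conflicts whenever the AI is right more often than wrong.
```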
And many organizations are more or less locked into the Microsoft ecosystem.
Will CIA analysts be able to switch to an OpenAI word processor?
Medical researchers working with sensitive health information?