#LanguageModels
Posts tagged #LanguageModels on Bluesky
Preview
Search robot thinks for itself
A robot that can locate lost items on command – this is the latest development at the Technical University of Munich (TUM).

Researchers, including Benjamin Bogenberger, developed a robot that combines #LanguageModels with #3Dvision to locate misplaced objects by building a spatial map and estimating likely locations: go.tum.de/730486

#Robotics #AI

📷A. Schmitz

3 1 0 0
Preview
TurboSparse: Democratizing AI via Efficient dReLU Sparsification

TurboSparse democratizes access to LLMs for researchers and smaller organizations. #languagemodels

2 1 1 0

👏 Congratulations on this achievement, and all the best to Cecilia in her new role as a postdoctoral researcher at @cam.ac.uk!

#NLP #PhDDefense #MultilingualAI #CulturalAI #LanguageModels #UKPLab #TUDarmstadt @cs-tudarmstadt.bsky.social

1 0 0 0
Preview
TurboSparse Limitations: The Impact of 150B Token Recovery Training

While achieving 90% sparsity, TurboSparse models have so far been recovery-trained on only about 1% of the tokens used to train Llama, with further training expected. #languagemodels

1 0 0 0
Preview
dReLU Sparsification: High-Performance 90% Sparsity for Next-Gen LLMs

Learn how this breakthrough makes large language models (LLMs) more accessible and environmentally friendly. #languagemodels

0 0 0 0
Preview
TurboSparse Mobile: 22x Faster Mixtral Inference on PowerInfer-2

Learn how PowerInfer-2 leverages extreme sparsity for a 22.2x speedup over llama.cpp. #languagemodels

0 0 0 0
Preview
TurboSparse Inference: 4.6x Faster LLM Decoding via Hybrid GPU-CPU Computing

Achieve up to 2.28x speedup on pure CPU and 4.64x in hybrid GPU-CPU environments compared to llama.cpp baselines. #languagemodels

0 0 0 0
Preview
TurboSparse: Elite Inference Speed via dReLU Sparsity

Achieve 2-5x faster LLM decoding on RTX 4090 and mobile devices using TurboSparse. Experience 97% parameter sparsity without performance loss. #languagemodels

0 0 0 0
Preview
TurboSparse Efficiency: Achieving 97% Parameter Sparsity in Mixtral-47B

Discover how TurboSparse-Mistral-7B and Mixtral-47B leverage ReLUfication to reach up to 90% neuron inactivity, reducing active parameters to just 3%. #languagemodels
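The mechanism behind these numbers is simple to sketch: a ReLU-style gate zeroes every negative pre-activation, so the fraction of zeros is exactly the "neuron inactivity" the post cites. A minimal NumPy illustration with made-up activation statistics (not TurboSparse's actual layers or numbers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical FFN hidden activations; the negative mean mimics a layer
# whose pre-activations mostly fall below zero, as ReLUfication encourages.
pre_act = rng.normal(loc=-1.0, scale=1.0, size=(8, 4096))

post_act = np.maximum(pre_act, 0.0)           # ReLU zeroes negative entries
inactivity = float((post_act == 0.0).mean())  # fraction of inactive neurons

print(f"neuron inactivity: {inactivity:.0%}")
```

Shifting the pre-activation distribution further below zero raises the measured inactivity, which is the intuition behind ReLUfication-style training.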

0 0 0 0
Preview
Fine-Tuned Large Language Models for Generating Multiple-Choice Questions in Anesthesiology: Psychometric Comparison With Faculty-Written Items Background: Multiple-choice questions (MCQs) are widely used in medical education to ensure standardized and objective assessment. Developing high-quality items requires both subject expertise and methodological rigor. Large language models (LLMs) offer new opportunities for automated item generation. However, most evaluations rely on general-purpose prompting, and psychometric comparisons with faculty-written items remain scarce. Objective: This study aimed to evaluate whether a fine-tuned LLM can generate MCQs (Type A) in anesthesiology with psychometric properties comparable to those written by expert faculty. Methods: The study was embedded in the regular written anesthesiology examination of the eighth-semester medical curriculum with 157 students. The examination comprised 30 single best-answer MCQs, of which 15 were generated by senior faculty and 15 by a fine-tuned GPT-based model. A custom GPT-based (GPT-4) model was adapted with anesthesiology lecture slides, the National Competence-Based Learning Objectives Catalogue (NKLM 2.0), past examination questions, and faculty publications using supervised instruction-tuning with standardized prompt–response pairs. Item analysis followed established psychometric standards. Results: In total, 29 items (14 expert, 15 LLM-generated) were analyzed. Expert-generated questions had a mean difficulty of 0.81 (SD 0.19), point-biserial correlation of 0.19 (SD 0.07), and discrimination index of 0.09 (SD 0.08). LLM-generated items had a mean difficulty of 0.79 (SD 0.18), point-biserial correlation of 0.17 (SD 0.04), and discrimination index of 0.08 (SD 0.11). Mann-Whitney tests revealed no significant differences between expert- and LLM-generated items for difficulty (P=.38), point-biserial correlation coefficient (P=.96), or discrimination index (P=.59).
Categorical analyses confirmed no significant group differences. Both sets, however, showed only modest psychometric quality. Conclusions: Supervised fine-tuned LLMs are capable of generating MCQs with psychometric properties comparable to those written by experienced faculty. Given the limitations and cohort-dependency of psychometric indices, automated item generation should be considered a complement rather than a replacement for manual item writing. Further research with larger item sets and multi-institutional validation is needed to confirm generalizability and optimize integration of LLM-based tools into assessment development.

JMIR Formative Res: Fine-Tuned Large Language Models for Generating Multiple-Choice Questions in Anesthesiology: Psychometric Comparison With Faculty-Written Items #Anesthesiology #MedicalEducation #MultipleChoiceQuestions #LearningAssessment #LanguageModels

0 0 0 0

Big congratulations to all authors! 🚀

#ICLR2026 #MachineLearning #AIResearch #RepresentationLearning #InformationRetrieval #DenseRetrieval #SelfSupervisedLearning #LanguageModels #NLP #UKPLab #ICLR2026

@cmu.edu @tencent.bsky.social @tuda.bsky.social @cs-tudarmstadt.bsky.social @microsoft.com

1 0 0 0
Preview
Uptake of Large Language Models by London Medical Students: Exploratory Qualitative Interview Study Background: The popularity of large language models (LLMs) has grown exponentially across health care. Despite the wealth of literature on proposed applications in medical education, there remains a critical gap regarding their real-world use, benefits, and challenges as experienced by medical students themselves. Objective: We aimed to explore qualitatively and characterize the perceived benefits, facilitators, and barriers associated with the use of LLMs among a cohort of London-based medical students. Methods: Semistructured interviews were conducted with 15 medical students from preclinical and clinical stages at London-based medical schools. Guided by the technology acceptance model, interview transcripts underwent an inductive thematic analysis to identify themes on actual system use, perceived usefulness, ease of use, and attitudes toward LLMs. Results: All participants reported frequent use of ChatGPT for concise topic summarization, clarification of complex concepts, generation of examination-style questions, and summarization of research. Students described LLMs as a complementary tool to traditional materials, valuing their immediacy (“Instead of getting a textbook, I can ask ChatGPT to summarise something in X words and read it in under a minute”) and ease of use. Peer demonstration and device-agnostic accessibility emerged as key facilitators. Of note, wider applications such as simulating clinical interviews were discovered through peers rather than through formal teaching. Significant barriers were reported. Hallucinations, fabricated references, and outdated information led to loss of trust, with more junior students finding inaccurate outputs difficult to detect (“I stopped using it because I found it to be inaccurate, and I don’t want to be learning the wrong things”). 
Half of the participants interviewed reported a sense of overreliance, defaulting to its use for answers with a perceived loss of critical thinking ability. Students noted inequalities in access to advanced features and voiced concerns about privacy when using LLMs in clinical scenarios. Conclusions: LLMs have been widely adopted by medical students. While students perceived the efficiency, flexibility, and conversational interface of LLMs as beneficial, substantial reservations remain regarding their reliability, potential de-skilling, and the loss of academic integrity. These findings underscore the urgent need for curricula to both support safe LLM use and adapt assessment and teaching strategies for artificial intelligence (#AI)–augmented student practice. Future research should broaden geographical representation, investigate applications in low-resource settings, and integrate educators’ perspectives to establish future curricular guidance in an artificial intelligence (#AI) era.

JMIR Formative Res: Uptake of Large Language Models by London Medical Students: Exploratory Qualitative Interview Study #MedicalEducation #LanguageModels #HealthCareInnovation #DigitalHealth #MedicalStudents

0 0 0 0
Preview
DeepSeek vs. ChatGPT: A Battle of AI Language Models Artificial intelligence (AI) has rapidly evolved, transforming industries and redefining how we interact with technology. Among the most significant advancements in AI are large language models (LL...

DeepSeek vs. ChatGPT: A Battle of AI Language Models
www.ekascloud.com/our-blog/dee...
#DeepSeek
#ChatGPT
#DeepSeekVsChatGPT
#AIBattle
#AIComparison
#LanguageModels
#LargeLanguageModels
#GenerativeAI
#ArtificialIntelligence
#AITrends
#TechDebate

0 0 0 0
Preview
Evaluating the Efficacy of AI-Based Interactive Assessments Using Large Language Models for Depression Screening: Development and #usability Study Background: The evolution of language models, particularly large language models, has introduced transformative potential for psychological assessment, challenging traditional rating scale methods that have dominated clinical practice for over a century. Objective: This study aimed to develop and validate an automated assessment paradigm that integrates natural language processing with conventional measurement tools to assess depressive symptoms, exploring its #feasibility as a novel approach in psychological evaluation. Methods: A cohort of 115 participants, including 28 (24.3%) individuals diagnosed with depression, completed the Beck Depression Inventory Fast Screen via a custom ChatGPT interface (BDI-FS-GPT) and the Chinese version of the Patient Health Questionnaire–9 (PHQ-9). Statistical analyses included the Spearman correlation (PHQ-9 vs BDI-FS-GPT scores), Cohen κ (diagnostic agreement), and area under the curve (AUC) evaluation. Results: Spearman analysis revealed a moderate correlation between PHQ-9 and BDI-FS-GPT scores. The Cohen κ indicated moderate diagnostic agreement between the PHQ-9 and the BDI-FS-GPT (κ=0.43; 76.5% agreement), substantial agreement between the BDI-FS-GPT and the clinical diagnosis (κ=0.72; 88.7% agreement), and moderate agreement between the PHQ-9 and the clinical diagnosis (κ=0.55; 71.4% agreement). The BDI-FS-GPT demonstrated excellent diagnostic accuracy (AUC=0.953) at a cutoff of 3, detecting 89.3% of participants with depression with an 11.5% false-positive rate compared to the PHQ-9 (AUC=0.859) at a cutoff of 5 (sensitivity=71.4%; false-positive rate=13.8%). Participants also reported significantly higher satisfaction with the automated assessment compared to the traditional scale (P=.02). 
Conclusions: The automated assessment paradigm framework combines the interactivity and personalization of natural language processing–powered tools with the psychometric rigor of traditional scales, suggesting a preliminary #feasibility paradigm for future psychological assessment. Its ability to enhance engagement while maintaining reliability and validity provides encouraging evidence, warranting validation in larger and more diverse studies as large language model technology advances.
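Cohen's κ, used above to quantify diagnostic agreement, corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch for two binary raters with invented labels (not the study's data):

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two binary (0/1) raters of the same cases."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each rater's marginal positive rate.
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)

# Invented screening decisions (1 = screens positive for depression).
scale_a = [1, 1, 0, 0, 1, 0, 0, 0]
scale_b = [1, 0, 0, 0, 1, 0, 0, 1]
kappa = cohen_kappa(scale_a, scale_b)
print(round(kappa, 2))  # → 0.47
```

Here the raters agree on 6 of 8 cases (75%), but chance alone would produce about 53% agreement, so κ lands in the "moderate" band, much like the PHQ-9 vs BDI-FS-GPT comparison reported above.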

JMIR Formative Res: Evaluating the Efficacy of AI-Based Interactive Assessments Using Large Language Models for Depression Screening: Development and #usability Study #AI #MentalHealth #DepressionScreening #LanguageModels #PsychologicalAssessment

0 0 0 0

Overview: Hacker News debated Recursive Language Models (RLMs). Are they truly novel, or just a repackaging of RAG/sub-agents? Discussion focused on how the LLM interacts with its context, on recursion, and on the current absence of RLM-specific training in existing implementations. #LanguageModels 1/6

0 0 1 0
Preview
Context Window Expansion: Transform Your AI Performance in 2025

What is Context Window Expansion?

Context window expansion is one of the most significant recent advances in artificial intelligence. Simply put, a context window is the amount of information a large language model (LLM) can process and "remember" at any given time. Think of it as the AI's working memory: the larger the window, the more data the model can consider when generating responses. When ChatGPT first launched in late 2022, it could process only about 2,048 tokens (roughly 1,500 words). Today's advanced models, such as Google's Gemini, can handle up to 2 million tokens, equivalent to processing over 3,000 pages of text at once.

The Evolution of Context Windows in AI Models

In 2018-2019, maximum context windows were limited to just 512-1,024 tokens. The original GPT-3.5 started with 4,096 tokens, later expanded to 8,192 tokens with GPT-3.5-Turbo. Major milestones in context length:

* 2022-2023: GPT-4 launched with 8,192 tokens, later expanded to 128,000 tokens
* 2023: Anthropic's Claude introduced 100,000-token context windows
* 2024: Meta's Llama 3.1 reached 128,000 tokens, while Google Gemini 1.5 achieved 2 million tokens
* 2025: Meta's Llama 4 announced a 10 million token context window

This rapid expansion has let AI systems move from handling simple conversations to processing entire libraries of information in a single session.

Key Benefits of Expanded Context Windows

1. Enhanced document processing: organizations can process comprehensive documents, from technical manuals to financial reports, in their entirety, eliminating the need to split documents into chunks and preserving context during analysis.
2. Extended conversation memory: chatbots and assistants can maintain coherent conversations spanning hours or even days, remembering earlier discussion points without losing critical context.
3. Cache-augmented generation (CAG): models can reference substantial caches of information directly within context, improving generation latency over retrieval-augmented generation (RAG) by eliminating extra retrieval steps.
4. Improved code analysis: developers can debug entire codebases in a single session; models can follow interdependencies across multiple files and spot issues that span a whole project.
5. Multimodal data integration: extended contexts support processing video, audio, images, and text together, useful for applications such as insurance claims processing.

Challenges and Limitations of Long Context Windows

* Performance degradation: LLMs do not process information uniformly across the window. Models perform best when relevant information appears at the beginning or end of the input, with accuracy dropping for content in the middle, the "lost in the middle" problem.
* Computational cost: attention requirements scale quadratically with sequence length, so doubling input tokens roughly quadruples processing work, which raises operational costs.
* Slower responses: each new token requires computing relationships with all preceding tokens, so generation slows as context grows, a problem for real-time applications.
* Signal-to-noise ratio: more context is not always better; longer prompts can score lower than shorter, focused ones because unnecessary information dilutes the signal.
* Security vulnerabilities: larger windows widen the attack surface. Research from Anthropic shows that increasing context length also increases vulnerability to jailbreaking and harmful content generation.

Best Practices for Implementing Context Window Expansion

* Be strategically selective: include only information essential to the task; quality trumps quantity.
* Structure information intelligently: given the "lost in the middle" phenomenon, place the most critical information early in the window.
* Monitor performance metrics: track generation speed, output quality, and operational cost to find the sweet spot between comprehensive context and efficient processing.
* Adopt hybrid approaches: combine CAG for frequently used information with RAG for broader knowledge bases.
* Tokenize efficiently: tokenization varies by language and model; one token is roughly 0.75 words in English, so optimize prompts for information density.
* Test before deploying: the ideal window size depends on application requirements, content type, and performance priorities.

Frequently Asked Questions

What is the largest context window available in 2025? Meta's Llama 4 offers the largest publicly announced window at 10 million tokens; Google's Gemini 1.5 Pro provides 2 million, while most commercial models such as GPT-4 and Claude offer 128,000-500,000 tokens. The optimal size depends on the use case, not on picking the largest available.

How does context window size affect AI accuracy? Larger windows admit more information but can reduce precision due to the "lost in the middle" problem. Strategic placement and focused context often outperform simply maximizing window usage.

What's the difference between context window and training data? The context window is the model's working memory for a session (immediate inputs and conversation history); training data is the corpus that provides its foundational knowledge.

Do larger context windows always cost more? Most providers charge by token, so larger windows directly increase cost per query, though prompt caching can cut expenses for frequently reused content.

Will context windows continue expanding indefinitely? Practical limits remain around computational cost, processing speed, and diminishing returns; current trends suggest future progress will emphasize efficiency and intelligent use over raw size.

Key Takeaways

Context windows have grown from 2,048 tokens in 2022 to 10 million in 2025, enabling whole-document processing, extended conversations, and multimodal analysis. The benefits come with tradeoffs: higher cost, slower responses, and accuracy issues with unnecessarily long contexts. The most effective implementations balance context length with performance needs, position critical information strategically, and monitor metrics continuously. Success lies not in maximizing the context window but in using it intelligently.
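The quadratic-cost claim in the article is easy to make concrete: full self-attention scores every token against every other token, so the score matrix grows with the square of the context length. A back-of-envelope sketch counting score entries (not any real model's FLOP budget):

```python
def attention_score_entries(context_len: int) -> int:
    # Full self-attention compares each token with every token, itself included.
    return context_len * context_len

base = attention_score_entries(2_048)     # early ChatGPT-sized window
doubled = attention_score_entries(4_096)  # doubling the window...
print(doubled // base)                    # → 4  (...quadruples the work)
```

Production systems blunt this with tricks such as KV caching and sparse or windowed attention, but the raw pairwise count is why long contexts get expensive.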

Context Window Expansion: Transform Your AI Performance in 2025 #AI #ArtificialIntelligence #MachineLearning #ContextWindow #LanguageModels

1 0 0 0
Video thumbnail

AI Needs Better Thinking Steps - Demis Hassabis and Hannah Fry

#languagemodels #ai

0 0 0 0
Post image

Norway becomes first country to establish state-funded AI training framework using newspaper content. Landmark agreement funds open Norwegian/Sami language models for public & private use. Major step for accessible multilingual AI. #OpenAI #LanguageModels

0 0 1 0

"🤖💬 Are AI models like ChatGPT closer to human reasoning? A groundbreaking study reveals surprising language analysis skills that challenge our uniqueness! 🤯 What do you think? #AI #Linguistics #LanguageModels LINK"

0 0 0 0
Post image

New research shows that layering complex AI personas during fine‑tuning actually erodes meaning in benchmark prompts, and human judges are struggling to spot artificial origins. Curious? Dive into the details. #AIPersona #FineTuning #LanguageModels

🔗 aidailypost.com/news/researc...

1 0 0 0
Preview
Respectful or Toxic? Using Zero-Shot Learning with Language Models to Detect Hate Speech Flor Miriam Plaza-del-Arco, Debora Nozza, Dirk Hovy. The 7th Workshop on Online Abuse and Harms (WOAH). 2023.

#TBT #NLProc 'Respectful or Toxic?' by Plaza-del-Arco, @debora & @dirkhovy.bsky.social (2023) explores zero-shot learning for multilingual hate speech detection. Highlights prompt & model choice for accuracy. #AI #LanguageModels #HateSpeechDetection

2 2 0 0

Diffusion Language Models are Super Data Learners
Chao Du, Hang Yan et al.
Paper
Details
#DiffusionModels #DataEfficient #LanguageModels

0 0 0 0
Preview
AI Across Borders Podcast · Dr. Ayesha Khanna · AI Across Borders is a podcast that dives into the real stories behind global tech journeys. From Asia to Latin America, Europe to emerging markets, we uncover how individuals found their path in technology, what inspired them, and the impact they’re making. At the heart of every episode is the human - the innovator, policymaker, athlete, entrepreneur, and researcher - each navigating change, creativity, and uncertainty in their own way. AI is not the hero of our story, but a facilitator between who we are and the change we want to make.

Subscribe to the YouTube channel
 
Visit the website: https://f.mtr.cool/vzrmnryjtn
Listen on Spotify: https://f.mtr.cool/uriyamidqg
 
#AI #LanguageModels #UAE

0 0 0 0
Preview
AI Across Borders Podcast · Dr. Ayesha Khanna · AI Across Borders is a podcast that dives into the real stories behind global tech journeys. From Asia to Latin America, Europe to emerging markets, we uncover how individuals found their path in technology, what inspired them, and the impact they’re making. At the heart of every episode is the human - the innovator, policymaker, athlete, entrepreneur, and researcher - each navigating change, creativity, and uncertainty in their own way. AI is not the hero of our story, but a facilitator between who we are and the change we want to make.

Visit the website: https://f.mtr.cool/iwiojwrtkd
Listen on Spotify: https://f.mtr.cool/scecsykolo
 
#AI #LanguageModels #UAE

0 0 0 0
Preview
AI Across Borders Podcast · Dr. Ayesha Khanna · AI Across Borders is a podcast that dives into the real stories behind global tech journeys. From Asia to Latin America, Europe to emerging markets, we uncover how individuals found their path in technology, what inspired them, and the impact they’re making. At the heart of every episode is the human - the innovator, policymaker, athlete, entrepreneur, and researcher - each navigating change, creativity, and uncertainty in their own way. AI is not the hero of our story, but a facilitator between who we are and the change we want to make.

👉Subscribe to the YouTube channel

Visit: https://f.mtr.cool/okjvoggbbn
Listen on Spotify: https://f.mtr.cool/hgtnqexmuc

#AI #LanguageModels #UAE

0 0 0 0
Preview
Part 1: Tokenization, Building an LLM From Scratch in Rust Learn how to build a language model from scratch in Rust, starting with part 1 of 6: tokenization, BPE, and vocabulary trade-offs.

Part 1 of our 6 part series on building a language model is now live. Read Part 1: www.tag1.com/white-paper/part1-tokeni...
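The series itself is in Rust; as a language-agnostic taste of what Part 1 covers, here is one BPE merge step in Python (a toy sketch, not Tag1's implementation): find the most frequent adjacent pair of symbols and fuse it into a single token.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Most common adjacent pair of symbols in the token sequence."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge(tokens, pair):
    """Replace every occurrence of `pair` with the concatenated symbol."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")  # start from individual characters
pair = most_frequent_pair(tokens)  # ('l', 'o') occurs three times
tokens = merge(tokens, pair)
print(tokens[:3])                  # → ['lo', 'w', ' ']
```

Real BPE training repeats this loop until a target vocabulary size is reached, recording the merge order so the tokenizer can replay the same merges at inference time.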

#TechCommunity #MachineLearning #LanguageModels #DeepLearning #OpenSource

2 0 0 0

📈 Overall AI CAGR: 33.8% (2025-2033)
🚗 Automotive AI CAGR: 15.3%-37.4%
💻 AI Semiconductors CAGR: ~18%
🤖 SLMs CAGR: 28.7%

#AIGrowth #AI #AutomotiveAI #Semiconductors #LanguageModels
View in Timelines

0 0 0 0
Preview
AI Across Borders Podcast · Dr. Ayesha Khanna · AI Across Borders is a podcast that dives into the real stories behind global tech journeys. From Asia to Latin America, Europe to emerging markets, we uncover how individuals found their path in technology, what inspired them, and the impact they’re making. At the heart of every episode is the human - the innovator, policymaker, athlete, entrepreneur, and researcher - each navigating change, creativity, and uncertainty in their own way. AI is not the hero of our story, but a facilitator between who we are and the change we want to make.

Visit the website: https://f.mtr.cool/mehbodenfv
Listen on Spotify: https://f.mtr.cool/cjsiojmydw

#AI #LanguageModels #womenempowerment #UAE

0 0 0 0
Preview
#feasibility of a Specialized Large Language Model for Postgraduate Medical Examination Preparation: Single-Center Proof-Of-Concept Study Background: Large language models (LLMs) are increasingly used in medical education for feedback and grading; yet their role in postgraduate examination preparation remains uncertain due to inconsistent grading, hallucinations, and user acceptance. Objective: This study evaluates the Personalized Anesthesia Study Support (PASS), a specialized GPT-4 model developed to assist candidates preparing for Singapore’s postgraduate specialist anesthesiology examination. We assessed user acceptance, grading interrater reliability, and hallucination detection rates to determine the #feasibility of integrating specialized LLMs into high-stakes examination preparation. Methods: PASS was built on OpenAI’s GPT-4 and adapted with domain-specific prompts and references. Twenty-one senior anesthesiology residents completed a mock short answer question examination, which was independently graded by 3 human examiners and 3 PASS iterations. Participants reviewed feedback from both PASS and standard GPT-4 and completed a technology acceptance model (TAM) survey. Grading reliability was evaluated using Cohen and Fleiss κ. Hallucination rates were assessed by participants and examiners. Results: Of the 21 participants, 17 (81%) completed the TAM survey, generating 136 responses. PASS scored significantly higher than standard GPT-4 in usefulness (mean 4.25, SD 0.50 vs mean 3.44, SD 0.82; P

JMIR Formative Res: #feasibility of a Specialized Large Language Model for Postgraduate Medical Examination Preparation: Single-Center Proof-Of-Concept Study #MedicalEducation #LanguageModels #Anesthesiology #PostgraduateExams #GPT4

1 0 0 0
Preview
Accuracy of Large Language Model Responses Versus Internet Searches for Common Questions About Glucagon-Like Peptide-1 Receptor Agonist Therapy: Exploratory Simulation Study Background: Novel glucagon-like peptide-1 receptor agonists (GLP1RAs) for obesity treatment have generated much dialogue on digital media platforms. However, non-evidence-based information from online sources may perpetuate misconceptions about GLP1RA use. A promising new digital avenue for patient education is large language models (LLMs), which could potentially be used as an alternative to clarify questions about GLP1RA therapy. Objective: This study compared LLM (ChatGPT 4o) and internet (Google) search responses to simulated questions about GLP1RA therapy. Methods: Responses were graded by 2 independent evaluators on safety, consensus with guidelines, objectivity, reproducibility, relevance, and explainability using a 5-point Likert scale. Mean scores were compared using independent t tests. Qualitative observations were recorded. Results: LLM responses had significantly higher mean scores than internet responses in the "objectivity" (3.91 ± 0.63 vs 3.36 ± 0.80, p=0.038) and "reproducibility" (3.85 ± 0.49 vs 3.00 ± 0.97, p=0.007) categories. There was no significant difference in mean scores for "safety," "consensus," "relevance," and "explainability." However, LLM responses lacked updated information on more contemporary concerns surrounding GLP1RA use, such as the impact on fertility and mental health. Conclusions: The study highlights the importance of healthcare provider communication, as both LLM and internet searches have limitations and may perpetuate misconceptions about GLP1RAs.

JMIR Formative Res: Accuracy of Large Language Model Responses Versus Internet Searches for Common Questions About Glucagon-Like Peptide-1 Receptor Agonist Therapy: Exploratory Simulation Study #GLP1RA #ObesityTreatment #PatientEducation #DigitalHealth #LanguageModels

0 0 0 0