
@joelniklaus

28 Followers · 59 Following · 253 Posts · Joined 27.11.2024

Latest posts by @joelniklaus

📣 Call for Contributions: LEXam-v2 – A Benchmark for Legal Reasoning in AI

How well do today’s AI systems really reason about law?

We’re building a global benchmark based on real law school & bar exams.

🧵 Full details, scope, and how to contribute in the thread 👇

28.01.2026 15:11 👍 5 🔁 4 💬 1 📌 0
How to Use ChatGPT to Change Your Life
Prompt PDF here: https://markmanson.net/aiprompts In this video, I put AI to the test. Not as a productivity hack, but as a personal growth tool. Everyone's u...

You know LLMs have become mainstream when Mark Manson teaches you prompt engineering 😉

youtu.be/AUAHkhOldx8

13.11.2025 16:01 👍 0 🔁 0 💬 0 📌 0
Training on the Test Task Confounds Evaluation and Emergence
We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data...

Paper: arxiv.org/abs/2407.07890

12.11.2025 15:56 👍 0 🔁 0 💬 0 📌 0

Moreover, we anticipate that the ways to effectively train on the test task will only grow in scope and adoption."

By Ricardo Dominguez-Olmedo, Florian E. Dorner, and Moritz Hardt

12.11.2025 15:56 👍 0 🔁 0 💬 1 📌 0

Detecting what training data a model has seen is a notoriously difficult problem: existing heuristics achieve partial success at best. Researchers routinely acknowledge the futility of fighting data contamination.
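
To make concrete why the heuristics fall short: the workhorse check is exact n-gram overlap between benchmark items and the training corpus, which paraphrased or translated contamination slips right past. A minimal sketch of that idea (the 13-gram convention and the file name are illustrative choices, not from the paper):

```python
# Toy contamination check: flag a benchmark example as "possibly seen" if any
# long n-gram from it appears verbatim in the training corpus. Real pipelines
# add normalization and fuzzier matching, and still miss paraphrases.

def ngrams(text: str, n: int = 13) -> set[str]:
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def possibly_contaminated(example: str, corpus_ngrams: set[str], n: int = 13) -> bool:
    # Any exact n-gram match counts as evidence of overlap.
    return bool(ngrams(example, n) & corpus_ngrams)

corpus_ngrams = ngrams(open("train_corpus_sample.txt").read())  # build once
print(possibly_contaminated("Which article of the constitution ...", corpus_ngrams))
```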

12.11.2025 15:56 👍 0 🔁 0 💬 1 📌 0

Instead, we propose to adjust for it by giving every model the same task-specific preparation before evaluation. We work from the assumption that training on the test task, in general, cannot be effectively detected, disallowed, or disincentivized.
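
As a protocol, the proposal boils down to the sketch below; `finetune` and `evaluate` here are placeholders for whatever training and eval stack you already use, not the authors' code:

```python
# Sketch of "adjusting for the test task": every model gets the SAME
# task-specific preparation (e.g., a small fine-tune on task examples that
# are disjoint from the test set) before being compared.

def adjusted_comparison(models: dict, prep_set, test_set, finetune, evaluate) -> dict:
    scores = {}
    for name, model in models.items():
        prepared = finetune(model, prep_set)      # identical preparation for all
        scores[name] = evaluate(prepared, test_set)
    return scores

# Toy stand-ins so the sketch runs end to end: "skill" is just a float.
models = {"model_a": 0.2, "model_b": 0.5}
finetune = lambda skill, prep: min(1.0, skill + 0.3)  # everyone gains the same boost
evaluate = lambda skill, test: round(skill, 2)

print(adjusted_comparison(models, prep_set=None, test_set=None,
                          finetune=finetune, evaluate=evaluate))
# {'model_a': 0.5, 'model_b': 0.8}: advantages from test-task prep are equalized.
```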

12.11.2025 15:56 👍 0 🔁 0 💬 1 📌 0

The anecdote holds a lesson for the evaluation of large language models half a century later. Knowledge about the evaluation conditions necessarily influences training practices under competitive pressure. It may be a fool’s errand to prohibit the practice.

12.11.2025 15:56 👍 0 🔁 0 💬 1 📌 0

But the hotly debated results of the Games did not lead the organizers to prohibit training at natural altitude. Instead, they let everyone do it, and athletes came to consider altitude training an excellent way to train.

12.11.2025 15:56 👍 0 🔁 0 💬 1 📌 0

"The 1968 Olympics took place in Mexico City at the significant altitude of 2340 meters, higher than Australia’s tallest peak. Runners who had trained at altitude in their home countries were better prepared to compete in Mexico City’s conditions, as it turned out.

12.11.2025 15:56 👍 0 🔁 0 💬 1 📌 0

Cool analogy regarding training on the test task

12.11.2025 15:56 👍 0 🔁 0 💬 1 📌 0
PleIAs/Monad · Hugging Face

Blog Post: pleias.fr/blog/blogsy...

Dataset: huggingface.co/datasets/Pl...

Large model: huggingface.co/PleIAs/Bagu...

Small model: huggingface.co/PleIAs/Monad

11.11.2025 15:59 👍 1 🔁 0 💬 0 📌 0

- Cool to see this being done on the French supercomputer Jean Zay

11.11.2025 15:59 👍 0 🔁 0 💬 1 📌 0

- They don't release any code, and the method description is quite high-level: for example, I'm curious how they finetuned their models and would love to learn more about how they set up their synthetic data pipeline. Looking forward to the full report.

11.11.2025 15:59 👍 0 🔁 0 💬 1 📌 0

- They only evaluate on MMLU, GSM8K, and HotPotQA. This seems cherry-picked; I wonder how their dataset performs on other standard benchmarks. They say they basically skip pre-training and go straight to post-training.

11.11.2025 15:59 👍 0 🔁 0 💬 1 📌 0

- Seems like a cool case study pushing really small models to their limits (30% on MMLU for a 56M-parameter model)

11.11.2025 15:59 👍 0 🔁 0 💬 1 📌 0

pleias just released 75B tokens of synthetic data upsampled from 50K vital Wikipedia articles!

Some thoughts below:
- Interesting that they use such a deep architecture for such small models (64 layers for 56M and 80 layers for 321M parameters)
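
Some quick arithmetic on what that depth implies, using the usual ~12·L·d² rule of thumb for decoder blocks (embeddings ignored, so these are rough estimates, not Pleias' published configs):

```python
# For a fixed parameter budget, a standard decoder block costs ~12 * d_model^2
# parameters, so depth L forces d_model ≈ sqrt(params / (12 * L)).
from math import sqrt

for params, layers in [(56e6, 64), (321e6, 80)]:
    d_model = sqrt(params / (12 * layers))
    print(f"{params / 1e6:.0f}M params over {layers} layers -> d_model ~ {d_model:.0f}")
# 56M over 64 layers  -> d_model ~ 270
# 321M over 80 layers -> d_model ~ 578
```

In other words, these are unusually skinny stacks for their depth.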

11.11.2025 15:59 👍 0 🔁 0 💬 1 📌 0
Is AGI the right goal for AI? And also, what the heck is AGI anyway?

Paper: arxiv.org/abs/2510.18212

Gary Marcus' comment: garymarcus.substack.com/p/is-agi-th...

10.11.2025 15:56 👍 0 🔁 0 💬 0 📌 0

- Co-author Gary Marcus notes he doesn't agree with every detail but signed on to support better articulation of what AGI means. The equal 10% weighting across domains is one choice among many reasonable configurations, though the paper argues for prioritizing breadth over depth.

10.11.2025 15:56 👍 0 🔁 0 💬 1 📌 0

For instance, GPT-5 reaches 70.8% on visual reasoning tasks where humans average 88.9%, yet scores 0% on adaptation tasks that test flexible rule inference.

10.11.2025 15:56 👍 0 🔁 0 💬 1 📌 0

- The framework reveals a "jagged" cognitive profile where models excel in knowledge-intensive domains but have critical deficits in foundational machinery.

10.11.2025 15:56 👍 0 🔁 0 💬 1 📌 0

Models compensate by expanding context windows, but the paper calls this a "capability contortion" that masks the absence of genuine experiential memory.

10.11.2025 15:56 👍 0 🔁 0 💬 1 📌 0

- Both GPT-4 and GPT-5 score exactly 0% on long-term memory storage. This isn't a bug but an architectural constraint of transformer models, where attention mechanisms scale quadratically with context length.
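
To make the quadratic claim concrete, here is the memory for a single attention score matrix at a few context lengths (fp16, one head, one layer; purely illustrative, and fused kernels avoid materializing this matrix, though compute still grows quadratically):

```python
# Size of ONE (ctx x ctx) attention score matrix at 2 bytes (fp16) per entry.
for ctx in (4_096, 32_768, 262_144):
    gib = ctx * ctx * 2 / 2**30
    print(f"{ctx:>7} tokens -> {gib:8.2f} GiB")
# 4096 -> 0.03 GiB, 32768 -> 2.00 GiB, 262144 -> 128.00 GiB
```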

10.11.2025 15:56 👍 0 🔁 0 💬 1 📌 0

The framework tests ten core domains: general knowledge, reading and writing, math, reasoning, working memory, long-term memory storage, memory retrieval, visual processing, auditory processing, and speed. Applying the framework to current models yields scores of 27% for GPT-4 and 58% for GPT-5.
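
For concreteness, the headline number is just an equal-weighted mean over the ten domain scores (each domain carries 10%). A minimal sketch; the per-domain values below are placeholders, not the paper's reported numbers:

```python
# Overall score = equal-weighted mean of ten domain scores (10% each).
DOMAINS = ["knowledge", "reading_writing", "math", "reasoning",
           "working_memory", "ltm_storage", "ltm_retrieval",
           "visual", "auditory", "speed"]

def agi_score(domain_scores: dict[str, float]) -> float:
    return sum(domain_scores[d] for d in DOMAINS) / len(DOMAINS)

example = dict.fromkeys(DOMAINS, 70.0)   # placeholder values, not the paper's
example["ltm_storage"] = 0.0             # one cratered domain (the "jagged" profile)
print(f"{agi_score(example):.1f}%")      # 63.0% despite 70% everywhere else
```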

My take:

10.11.2025 15:56 👍 0 🔁 0 💬 1 📌 0

A who's who of AI, 33 researchers from institutions including Berkeley, MIT, Stanford, and Oxford, among them Yoshua Bengio, Eric Schmidt, Gary Marcus, and Max Tegmark, developed a quantifiable framework grounded in Cattell-Horn-Carroll theory, the most empirically validated model of human cognition.

10.11.2025 15:56 👍 0 🔁 0 💬 1 📌 0

The term AGI acts as a constantly moving goalpost, with criteria shifting as AI systems master tasks once thought to require human intellect. This ambiguity obscures how far we actually are from human-level cognition.

10.11.2025 15:56 👍 0 🔁 0 💬 1 📌 0

What does AGI actually mean? A who's who in AI spent 57 pages answering that.

TLDR: AGI is defined through ten measurable cognitive domains using psychometric theory.

10.11.2025 15:56 👍 0 🔁 0 💬 1 📌 0

Paper: arxiv.org/pdf/2504.07854

Collections: huggingface.co/alea-instit...

Datasets: huggingface.co/alea-instit...

Tokenizers: huggingface.co/collections...

Code: github.com/alea-instit...

Website: aleainstitute.ai/

Data Gallery: gallery.kl3m.ai/document/ra...

08.11.2025 15:01 👍 0 🔁 0 💬 0 📌 0

All openly available via Hugging Face and S3 under CC-BY terms.

Interested in improving the data landscape for legal AI?
Join the HuggingLegal community on Discord: discord.gg/Mnn28ak8

08.11.2025 15:01 👍 0 🔁 0 💬 1 📌 0

- Mid/post-training resources: QA pairs, summarization tasks, classification examples, drafting templates
- Multi-turn conversations from Congressional hearings and rulemaking
- kl3m-004-128k-cased tokenizer (30-40% more efficient than standard tokenizers)
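
If you want to sanity-check the efficiency claim on your own documents, a quick comparison (the Hugging Face repo id is my guess from the tokenizer name, so adjust it as needed; gpt2 stands in for a "standard" baseline):

```python
# Compare token counts on a legal document of your choosing.
from transformers import AutoTokenizer

text = open("sample_opinion.txt").read()  # any local legal text

kl3m = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-128k-cased")  # assumed repo id
base = AutoTokenizer.from_pretrained("gpt2")

n_kl3m, n_base = len(kl3m.encode(text)), len(base.encode(text))
print(f"kl3m: {n_kl3m} tokens, gpt2: {n_base} tokens ({1 - n_kl3m / n_base:.0%} fewer)")
```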

08.11.2025 15:01 👍 0 🔁 0 💬 1 📌 0

- 1.35 trillion tokens across SEC EDGAR, USPTO patents, court opinions, federal regulations, EU materials
- Mean document length of 6,237 tokens; 200K+ documents exceeding 100K tokens
- Diverse domains: legal, regulatory, financial, technical (USDA protocols to NIST standards)

08.11.2025 15:01 👍 0 🔁 0 💬 1 📌 0