#Benchmark — Bluesky Posts

@roxsross.bsky.social

43 minutes ago

⚡ PNNL and OpenAI partner to streamline federal permitting

They introduce DraftNEPABench, a benchmark for speeding up infrastructure reviews with AI.

openai.com/index/pacific-northwest-...

#AIcoding #NEPA #Benchmark #RoxsRoss


Diario El Mundo

@elmundo.hn

1 day ago

BCIE cuts 95 basis points over three years after raising 2 billion in a record Benchmark issue

The Central American Bank for Economic Integration (BCIE) will apply a third consecutive cut of 15 basis points to its interest rates starting June 1, 2026, bringing the cumulative reduction to between 80 and 95 basis points over three years. The cuts benefit national budgets through efficiencies achieved after the bank raised 2 billion dollars in the largest Benchmark issue in its history.

#Economía #Presupuestos #Benchmark


KillBait News

@kill-bait.bsky.social

2 days ago

Early Benchmarks Show Apple's MacBook Neo Outperforming Top x86 CPUs in Single-Core Tests

Benchmark results from Notebookcheck reveal that the new Apple MacBook Neo, powered by the A18 Pro chip, delivers record-breaking single-core performance that surpasses all current x86 processors from Intel and AMD. In Cinebench 2024 testing, the A18 Pro achieved 147 points while consuming only 3.5 to 4 watts. This efficiency is noteworthy, as the test itself lasts roughly ten minutes and taxes a CPU core consistently during the process. The performance figure places Apple's chip ahead of even high-end desktop CPUs such as Intel's Core Ultra 9 285K and AMD's Ryzen 9 9950X3D, not to mention every modern mobile chip from AMD, Intel, and Qualcomm. The A18 Pro also tops Apple's previous M3 generation, cementing the company's continued lead in single-core efficiency.

Despite these impressive results, the article notes that Apple's architectural design includes specialized accelerators that favor workload types optimized for its ecosystem, meaning the raw benchmark may not represent typical real-world usage outside macOS or Apple-optimized software. Notebookcheck suggests that Apple's tight integration between hardware and software provides a unique advantage versus general-purpose processors. Industry reactions are mixed; some applaud the innovation, while others label the coverage as overly promotional. Regardless, the results signal a new level of competition between Apple's ARM-based systems and the traditional x86 giants, Intel and AMD.


🤖 AI: It's clickbait ⚠️
👥 Users: It's clickbait ⚠️

#apple #benchmark #cpu


KillBait News

@kill-bait.bsky.social

2 days ago

Researchers Develop a Comprehensive Benchmark to Evaluate AI Expertise

As AI systems increasingly excelled at traditional academic benchmarks, researchers recognized the need for more challenging tests. In response, an international team of nearly 1,000 experts developed Humanity's Last Exam (HLE), a 2,500-question assessment covering mathematics, humanities, natural sciences, ancient languages, and other highly specialized fields. Each question was carefully crafted so that current AI models could not solve it, with any solvable questions removed from the final exam. Early testing revealed that even the most advanced AI models struggle significantly, with scores ranging from roughly 2.7% to around 50% for the most capable systems.

Dr. Tung Nguyen from Texas A&M University emphasized that the goal is not to defeat AI but to identify gaps in AI knowledge and provide a durable benchmark for measuring AI progress. The exam demonstrates that high performance on traditional human-focused tests does not equate to genuine intelligence, as AI systems still lack deep, contextual understanding and specialized expertise. Humanity's Last Exam also highlights the importance of human expertise and the value of global, interdisciplinary collaboration in evaluating AI capabilities.


🤖 AI: It's clickbait ⚠️
👥 Users: It's clickbait ⚠️

#ai #benchmark #research


TMLR Published Papers

@tmlr-pub.bsky.social

2 days ago

mSOP-765k: A Benchmark For Multi-Modal Structured Output Predictions

Bianca Lamm, Janis Keuper

Action editor: Mohammad Ghavamzadeh

https://openreview.net/forum?id=H7eYL4yFZS

#benchmark #advertisements #modal


KillBait Noticias

@kill-bait-es.bsky.social

3 days ago

Evaluating AI models against nonsensical questions

BullshitBench is a benchmark designed to evaluate how AI models respond to questions that are nonsensical or based on false premises. The test checks whether models detect these flawed premises, whether they call out the nonsense directly, and whether they avoid confidently continuing with invalid assumptions. The platform lets you filter results by various criteria, such as model visibility and the reasoning technique used. It also ranks models by their ability to clearly reject nonsensical questions, showing each version's improvement in terms of the percentage of correct responses and of errors detected. The data is color-coded by response type: green for clear responses, amber for partial responses, red for accepting the nonsense, plus errors that indicate failures.

The tool is useful for developers and researchers who want to understand the limitations of current language models and improve their critical reasoning, so that models do not give incorrect answers with confidence. BullshitBench also makes it possible to compare models with one another and track their progress over time, providing valuable insight into how AI is evolving in complex reasoning and the detection of invalid information.


🤖 AI: Not clickbait ✅
👥 Users: Not clickbait ✅

#ia #modelosdelenguaje #benchmark


Fierce Mind

@fiercemind.bsky.social

3 days ago

#Google: #AI agents learn to cooperate on their own - no hardcoded #orchestration needed. Train them against a diverse pool of #opponents and #cooperation emerges as a property of #training.

#Benchmark:
Iterated Prisoner's Dilemma.

Result: stable collaboration

#AI #MultiAgent #MachineLearning
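For readers unfamiliar with the benchmark named in this post, here is a minimal sketch of how an Iterated Prisoner's Dilemma is scored. It is not the training setup the post describes: the payoff values are the common textbook ones (an assumption, since the post gives no matrix), and the two hand-written strategies and the `play` helper are hypothetical illustrations.

```python
# Textbook payoffs (assumed, not from the post):
# both cooperate -> 3 each, both defect -> 1 each,
# a lone defector gets 5 and the exploited cooperator gets 0.
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(own_history, opp_history):
    """Cooperate first, then copy the opponent's previous move."""
    return opp_history[-1] if opp_history else "C"


def always_defect(own_history, opp_history):
    """Defect on every round, regardless of history."""
    return "D"


def play(strategy_a, strategy_b, rounds=10):
    """Play a fixed number of rounds and return the total scores (a, b)."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        gain_a, gain_b = PAYOFF[(move_a, move_b)]
        score_a += gain_a
        score_b += gain_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


print(play(tit_for_tat, tit_for_tat))      # (30, 30): stable cooperation
print(play(tit_for_tat, always_defect))    # (9, 14): exploited once, then mutual defection
print(play(always_defect, always_defect))  # (10, 10)
```

Under these assumed payoffs, two cooperating players each end up with more than either player in a mutual-defection match, which is why stable cooperation is the interesting outcome in this game.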


DW Innovation

@dw-innovation.mastodon.social.ap.brid.gy

4 days ago

LLMs hallucinate – but not at the same rate. AA-Omniscience data reveals major differences between models and domains.

Well structured and worth checking out: https://artificialanalysis.ai/evaluations/omniscience

#AI #LLM #benchmark


Ahmandonk

@ahmandonk.bsky.social

6 days ago

📰 Intel Core Ultra 5 250K Plus benchmark leaks, offering a picture of Arrow Lake Refresh performance

👉 Read the full article here: ahmandonk.com/2026/03/09/intel-core-ul...

#arrowLake #benchmark #cpu #intel


Melamorsicata.it

@melanews.bsky.social

1 week ago

Geekbench 6 benchmark results showing iPhone 17e with A19 chip performance compared to iPhone 17.

The first Geekbench 6 benchmarks show that the iPhone 17e with the A19 chip is on par with the iPhone 17 in CPU performance. The 17e's 4-core GPU shows a slight drop in graphics performance compared with the 17's 5-core GPU. 📱📊
#iphone17e #benchmark #chipa19


TMLR Published Papers

@tmlr-pub.bsky.social

1 week ago

There are no Champions in Supervised Long-Term Time Series Forecasting

Lorenzo Brigato, Rafael Morand, Knut Joar Strømmen et al.

Action editor: Devendra Dhami

https://openreview.net/forum?id=yO1JuBpTBB

#benchmarking #forecasting #benchmark


TMLR Published Papers

@tmlr-pub.bsky.social

1 week ago

New #J2C Certification:

Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing ...

Siwei Yang, Mude Hui, Bingchen Zhao, Yuyin Zhou, Nataniel Ruiz, Cihang Xie

https://openreview.net/forum?id=lL1JR6dxG8

#editing #instruction #benchmark


Melamorsicata.it

@melanews.bsky.social

1 week ago

MacBook Neo benchmark:
CPU close to the iPhone 16 Pro, A18 Pro chip with a cut-down GPU.

Data:
Neo: 3461/8668/31286
iPhone 16 Pro: 3445/8624/32575
M4 Air: 3696/14730/54630

Hardware performance analysis 💻📊

#apple #macbookneo #benchmark
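To make the comparison in this post easier to read, the sketch below expresses the MacBook Neo's scores as a percentage of the other two devices. The three numbers per device are copied from the post; reading them as single-core / multi-core / GPU is an assumption, since the post does not label the columns, and the `relative_to` helper is a hypothetical name.

```python
# Scores as posted. The column order (single-core, multi-core, GPU)
# is assumed; the post does not label the three values.
scores = {
    "MacBook Neo": (3461, 8668, 31286),
    "iPhone 16 Pro": (3445, 8624, 32575),
    "M4 Air": (3696, 14730, 54630),
}
LABELS = ("single-core", "multi-core", "GPU")


def relative_to(baseline, device):
    """Return the device's scores as a percentage of the baseline's scores."""
    pairs = zip(LABELS, scores[device], scores[baseline])
    return {label: round(100 * dev / base, 1) for label, dev, base in pairs}


neo_vs_iphone = relative_to("iPhone 16 Pro", "MacBook Neo")
neo_vs_m4 = relative_to("M4 Air", "MacBook Neo")

print(neo_vs_iphone)  # {'single-core': 100.5, 'multi-core': 100.5, 'GPU': 96.0}
print(neo_vs_m4)      # {'single-core': 93.6, 'multi-core': 58.8, 'GPU': 57.3}
```

Read this way, the Neo matches the iPhone 16 Pro on both CPU figures and trails it slightly on the third, which fits the post's "cut-down GPU" remark.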


Gui17aume

@gui17aume.bsky.social

1 week ago

[Images: MacBook Neo Geekbench results, single-core and multi-core]

The MacBook Neo is the big new product of this #AppleLaunch
Performance-wise, it lands somewhere between the M1 and M4 chips depending on the workload. Can't wait to see how it does in real-world conditions! 🤩
#MacBookNeo #Geekbench #benchmark


Shenzhen Pages

@shenzhenpages.bsky.social

1 week ago

#BYD has unveiled its second-gen blade battery, setting a new #benchmark in fast‑charging technology.

At a launch event in Shenzhen, the company demonstrated charging speeds from 10% to 70% in just five minutes, and up to 97% in nine minutes, comparable to refueling a car.


LLMs

@llms.activitypub.awakari.com.ap.brid.gy

1 week ago


The Price Per Million Tokens Is Lying to You

About 9 months ago, I was building a RAG system; for those who don't know, it's a kind of enhanced memory system for AI agents. One of the…

#benchmark #ai #developer-tools #llm #machine-learning


deepseek

@deepseek.activitypub.awakari.com.ap.brid.gy

1 week ago


The Price Per Million Tokens Is Lying to You

About 9 months ago, I was building a RAG system; for those who don't know, it's a kind of enhanced memory system for AI agents. One of the agentic flo...

#ai #llm #benchmark #devtools


Emil Lazzaroni

@globalcompounders.bsky.social

1 week ago

48. The cartographers of the financial world

For our forty-eighth episode, we're exploring a company that is the ultimate "tollbooth" business. Imagine you're a massive pension fund or an asset manager. How do you measure your performance? How do you decide how to invest in international stocks? You need a benchmark, a universal standard, a map of the financial world. Our company today creates those maps. And for the privilege of using them, they collect a small, recurring fee on trillions of dollars of global assets. We are talking about the financial data and index powerhouse... MSCI.

When you hear the name MSCI ($MSCI), you probably think of their world-renowned stock market indices, like the MSCI World or MSCI Emerging Markets benchmarks. They are the creators of the yardsticks that a huge portion of the global investment industry, including countless ETFs and mutual funds, measure themselves against. But the real story behind MSCI is its evolution into a deeply embedded financial data and analytics powerhouse. The indices are just the beginning. The company operates a powerful, recurring-revenue "toll road" model, collecting fees based on the assets tied to its benchmarks. Furthermore, its suite of mission-critical risk analytics and ESG data tools is woven into the daily operations of the world's largest asset managers, creating incredibly high switching costs.

But with the stock perpetually trading at a premium valuation, is the price of admission too high? We're running the numbers to determine if MSCI's formidable competitive moat makes it a must-own compounder or if its high valuation presents too much risk in a cyclical market.

📣 New Podcast! "48. The cartographers of the financial world" on @Spreaker #analytics #assetmanagement #benchmark #blackrock #compounder #data #esg #etf #finance #financial #index #investing #moat #msci #portfolio #recurring #risk #royalty #stock #valuation


Shantha Mohan, Ph.D., DTM

@shlead.bsky.social

1 week ago

How Well Does Agent Development Reflect Real-World Work?

AI agents are increasingly developed and evaluated on benchmarks relevant to human work, yet it remains unclear how representative these benchmarking efforts are of the labor market as a whole. In thi...

Current AI agent benchmarks are poorly aligned with real-world human work. They are heavily skewed toward programming-centric tasks. Domains where most people work and contribute value are underrepresented in how we measure AI progress.

arxiv.org/abs/2603.01203
#ai #benchmark


Raspberry-Pi

@raspberry-pi.activitypub.awakari.com.ap.brid.gy

1 week ago

I Benchmarked Java on Single-Board Computers: Orange Pi 5 Ultra and Raspberry Pi 5 Lead the Pack

Table of Contents: Benchmark Tool; BenchmarkRunner.java - The User Tool; SummarizeReports.java - The Auto...

#Embedded #Java #Java #Core #JBang #Performance #Raspberry #Pi […]

[Original post on foojay.io]


TMLR Published Papers

@tmlr-pub.bsky.social

3 weeks ago

Leveraging the True Depth of LLMs

Ramón Calvo González, Daniele Paliotta, Matteo Pagliardini, Martin Jaggi, François Fleuret

Action editor: Changyou Chen

https://openreview.net/forum?id=JccJ6YfWd4

#llms #llm #benchmark


Akzess

@akzess.bsky.social

3 weeks ago

Coolest Stores: Boss Flagship on Passeig de Gràcia | invidis Barcelona | Boss brings a striking blend of digital signage and Catalan design heritage to its new flagship on Barcelona’s Passeig de Gràcia. The expansive store combines architectural craftsmanship, ...

Boss brings a striking blend of #digitalsignage to its new flagship. The expansive store combines architectural craftsmanship, natural light, and immersive brand experiences to set a new #benchmark for modern #Retail.

invidis.com/news/2026/02...


Cognix Dev

@cognix-dev.bsky.social

3 weeks ago

Cognix v0.2.5 released.

Benchmark vs Claude Code & Aider (3 runs, same LLM):
- Exec: 100% (= Claude Code, > Aider 87.5%)
- Lint: 0.00 (best in class)
- claude-opus-4.6 support added

Report on Zenn/Dev.to soon.

pipx install cognix
cognix-dev.github.io/cognix/

#Claude #Aider #Benchmark


ToxSec

@toxsec.bsky.social

3 weeks ago

#Gemini 3.1 is here.

another day another #benchmark drop.


stats look pretty good honestly.

look at that #ARC-AGI-2 jump!

#BrowseComp also through the roof, so it should have a really good agentic search function.


Reverse-Engineering

@reverse-engineering.activitypub.awakari.com.ap.brid.gy

3 weeks ago


We hid backdoors in binaries - Opus 4.6 found 49% of them

This blog post was authored by Piotr Grabowski, Rafał Strzaliński, Michał Kowalczyk, Piotr Migdał, and Jacek Migdal. Claude can ...

#ai #benchmark #security


The Daily Tech Feed

@thedailytechfeed.com

3 weeks ago

Jack Altman joins Benchmark as General Partner, bringing his Alt Capital team along. A significant shift in the VC landscape! #VentureCapital #Benchmark #JackAltman Link: thedailytechfeed.com/jack-altman-...


Yves Zieba

@yveszieba.bsky.social

3 weeks ago

The strategic lever that is almost always underestimated

Why is the choice of cleaning provider crucial for CSR? Environmental impact, social well-being, and brand image: learn how to choose a responsible partner for your offices…

Cleaning and maintenance of professional or industrial premises is a sometimes-neglected purchase... running a new #benchmark to reconsider your options can help you score some fairly easy points. #greentech #impact #achatresponsable #respect #stratégieRSE #nettoyage yveszieba.me/2026/02/18/l...