Repo (Try it!): github.com/Arize-ai/tw...
Technically, the agent optimized the metric perfectly; it just took a human looking at the result to say: this is awful.
At this stage, we're still in an era where agents optimize, and humans decide what's worth optimizing.
At one point the agent found a clever way to get a "link completeness" evaluator to pass: it added a giant "Tweet Sources" section at the bottom of the newsletter listing every URL.
In short, the coding agent tasked with improving the app was excellent at the mechanical loop: read eval results, diagnose the failure, write a fix, run the evals again. It went from 1/5 to 5/5 on hallucinated links in two iterations, methodically fixing the data pipeline and then the prompt.
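The mechanical loop described above can be sketched in a few lines. This is a hypothetical simulation, not the actual agent or any Arize API: `run_evals`, `diagnose`, and `apply_fix` are stand-ins, and the state is just a list of pass/fail flags that gets fixed one check per iteration.

```python
# Minimal sketch of the eval-driven loop: read eval results,
# diagnose the failure, apply a fix, run the evals again.
# All helpers are hypothetical stand-ins, not an Arize API.

def run_evals(state):
    """Return (passed, total) for a set of eval checks."""
    return sum(state), len(state)

def diagnose(state):
    """Find the first failing check."""
    return state.index(False)

def apply_fix(state, idx):
    """'Fix' the failing check and return the new state."""
    return state[:idx] + [True] + state[idx + 1:]

def improve(state, max_iters=10):
    for i in range(max_iters):
        passed, total = run_evals(state)
        print(f"iteration {i}: {passed}/{total} passing")
        if passed == total:
            break
        state = apply_fix(state, diagnose(state))
    return state

# Start at 1/5 checks passing, as in the hallucinated-links example.
final = improve([True, False, False, False, False])
```

The point is the shape of the loop, not the fixes themselves: the agent only ever sees eval output, so the evals define what "better" means.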
We just open-sourced a tool that turns recent tweets into an email newsletter (try it out!). Here's how he used evals and an agent to iteratively improve the app: arize.com/blog/how-we...
Calling all AI practitioners! Observe 26 wants YOU on stage.
If you're working on LLM evaluation, AI agents, observability, or shipping AI to production, we want to hear your story.
Observe 2026 | June 4 | Shack15, San Francisco
Apply to speak:
docs.google.com/forms/d/e/1...
#Observe26
One command gives Cursor, Claude Code, Codex, Windsurf and others native knowledge of Arize workflows. Instrument, debug, evaluate. Without leaving your editor.
npx skills add Arize-ai/arize-skills --skill '*' --yes
arize.com/blog/arize-...
Introducing Arize Skills.
Every new session, engineers were writing the same wall of context before their coding agent could do anything with Arize. So we packaged it.
New York: we're hosting a workshop at Betaworks covering a proven way to boost Claude Code performance. RSVP: luma.com/ajy0fdyf
In our next "How It Was Built" workshop, we're peeling back the curtain on the planning architecture, context management challenges, and testing strategies behind Alyx. RSVP: luma.com/alyx2.0
London: we're hosting an AI Builders night on March 17th with food, drinks, demos, learning, and fun. RSVP: luma.com/gwd1hbzo
Come see how we improved Claude Code's performance on SWE-Bench Lite by up to 11% purely by optimizing the system prompt instructions.
Arize is crashing Microsoft Azure Model Mondays! RSVP: developer.microsoft.com/en-us/react...
Rich Young will explore how organizations can build a continuous responsible AI lifecycle, combining Microsoft Foundry with Arize AX's observability and experimentation workflows.
- Plan pinned after the system prompt on every loop iteration
- 4 task statuses: pending, in_progress, completed, blocked
- A hard gate that prevents finishing with incomplete tasks
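The bullets above can be sketched as a small plan store. This is a hypothetical illustration with invented names, not the actual Alyx implementation: the agent mutates the plan via structured tool calls, the rendered plan is re-pinned each loop iteration, and a hard gate refuses to finish while tasks are still open.

```python
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    BLOCKED = "blocked"

class Plan:
    """Hypothetical plan store, mutated via structured tool calls."""

    def __init__(self, tasks):
        self.tasks = {t: Status.PENDING for t in tasks}

    def update(self, task, status: Status):
        # The agent calls this as a tool, rather than being asked
        # to "keep a plan" in free-form prompt instructions.
        self.tasks[task] = status

    def render(self):
        # Pinned after the system prompt on every loop iteration.
        return "\n".join(f"[{s.value}] {t}" for t, s in self.tasks.items())

    def can_finish(self):
        # Hard gate: no pending or in-progress tasks may remain.
        return all(s in (Status.COMPLETED, Status.BLOCKED)
                   for s in self.tasks.values())

plan = Plan(["fetch tweets", "draft newsletter", "run evals"])
plan.update("fetch tweets", Status.COMPLETED)
assert not plan.can_finish()  # gate blocks an early exit
```

Because the plan is state the runtime owns, not text the model remembers, the gate can be enforced in code instead of hoped for in the prompt.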
Part 1 of our "How We Built Alyx" deep dive series: arize.com/blog/how-to...
Learn how to build a production agent from our own experience!
Structured planning is what turns an agent from a tool executor into a workflow orchestrator.
Here's what worked for us:
- Planning as structured tool calls, not prompt instructions
Last week we launched Alyx 2.0, the in-app AI engineering agent for Arize AX. Today we're taking it further.
The AX CLI makes your Arize data machine readable so your coding agent can work with it directly.
Blog: arize.com/blog/ax-cli...
pip install arize-ax-cli
climbing > sitting in a conference room
Join us next Friday at Benchmark Climbing in SF for a free rock climbing night with the Arize AI community.
First 20 guests get an Owala water bottle. Space is limited, so grab your spot:
luma.com/arize-ai-cl...
Your agents are getting smarter, but can you prove they are reliable?
This DataCamp virtual workshop with Laurie Voss will cover:
- How to build, evaluate, and analyze a simple AI agent end-to-end
- Core evals principles
RSVP: www.datacamp.com/webinars/ev...
Save the date!
Observe 2026 is back!
Join 700+ AI builders at Shack15 in San Francisco on June 4th for the 5th annual Observe conference: a full day of talks, demos, and deep dives into AI observability, evaluation, and agents.
Learn more + save your spot: arize.com/observe/
Arize AI was just named to the Agentic List 2026!
Presented by the AI Agent Conference and curated by Simon Chan at FirsthandVC, in partnership with NYSE Wired and SiliconANGLE & theCUBE, the award recognizes the top 120 agentic AI companies shaping the future through autonomous, intelligent systems.
Alyx 2.0 is live.
An AI engineering agent built into Arize AX that can reason across multi-step workflows and execute autonomously.
- Error analysis
- Prompt experimentation
- Trace debugging
No more stitching everything together by hand.
Learn more on the blog: arize.com/blog/alyx-2...
New tutorial just dropped: how Google ADK works with Arize AX to power complex RAG flows, with visibility into hallucination detection, retrieval quality, and answer quality. arize.com/blog/master...
Join us next week in Seattle!
Hands-on workshop with AWS: build, evaluate & monitor AI agents in production using Strands SDK, Bedrock Agentcore & Arize AX.
Feb 26 · 5–7 PM · Food provided
Limited spots: lu.ma/n34vo5el
"This provides a way for business users to come in and interrogate a decision, like they'd pop into somebody's office." – Austin Facer, America First Credit Union
How AFCU built a GenAI Decision Explainer w/ parallel LLM workers + end-to-end tracing in Arize AX: arize.com/blog/how-am...
- The experiments table now shows prompt name, version, and a hover preview of system/user messages, with one-click navigation back to the playground with the original prompt loaded
Plus many more! Check it all out: app.arize.com
- Full RBAC lineage support now covers prompts, evaluators, and annotation configs, with hierarchical enforcement across spaces and accounts
- Text annotations can now be updated programmatically via the SDK, enabling bulk updates and automated annotation pipelines at scale
- Eval traces are now linked directly from playground experiment results, so you can jump straight to span-level trace data for debugging without leaving your workflow
We shipped some great stuff last week! Here are the highlights:
- Claude Opus 4.6 is now available on AWS Bedrock: 1M token context, built for complex enterprise tasks, coding, and agentic workflows
If you're building agents or LLM applications and want a disciplined way to test improvements, prevent regressions, and track quality over time, this tutorial walks through the full Arize AX workflow end to end!
Get started below:
- Running experiments with LLM-as-a-Judge: score outputs on more subjective criteria like helpfulness, actionability, or safety
- Building an iteration workflow: compare experiment runs, analyze results, and systematically validate changes before pushing to production.
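The LLM-as-a-Judge step above amounts to scoring each output against a rubric prompt. A minimal sketch, with hypothetical names and a stubbed judge: `JUDGE_PROMPT` and `judge` are invented for illustration (not an Arize API), and a real implementation would send the formatted prompt to an LLM and parse the integer it returns.

```python
# Sketch of LLM-as-a-Judge scoring. judge() is a stub standing in
# for a real LLM call; the rubric-prompt shape is the point.

JUDGE_PROMPT = """You are grading an assistant's answer for helpfulness.
Question: {question}
Answer: {answer}
Reply with a single integer score from 1 (unhelpful) to 5 (very helpful)."""

def judge(question, answer):
    # Stub: a real implementation would call an LLM with
    # JUDGE_PROMPT.format(question=question, answer=answer)
    # and parse the returned integer. Here we fake a score.
    return 5 if len(answer) > 20 else 2

def score_outputs(examples):
    """Score a list of (question, answer) pairs with the judge."""
    return [judge(q, a) for q, a in examples]

scores = score_outputs([
    ("How do I reset my password?",
     "Go to Settings > Security and click 'Reset password'."),
    ("How do I reset my password?", "Try stuff."),
])
```

Keeping the rubric in one prompt template makes the judge itself versionable, so judge changes can be compared across experiment runs like any other prompt change.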