I think I accidentally stumbled upon engagement baiting from first principles
I'll stay on Bluesky as long as the 10 accounts I like to see still post here
You should make one of those GitHub repos called
Awesome-Multi-agent-Papers
Because this looks like a solid list
I do appreciate you making this contribution, but it is hard to compete with other platforms that do it in a centralized way
I already use it, and it doesn't solve the issues: bugs, lack of discoverability, lack of useful recommendations, and an otherwise worse experience than X or even LinkedIn
I think I might leave bluesky tbh
Blanket use of LLMs should not decrease the significance of results. I distrust any researcher who would not use their own product
Vote 'em out
Source: x.com/nxthompson/s...
Personally I am worried about this effect in disclosure
Best account to aggregate MAS research
The assumption is not that bad. Additionally, it is not a hard threshold, so the methods will scale as models get better
EC is partially solved with foundation models. The social settings aren't, and the LLM Economist takeaways are going to be very practical moving forward. If you have aligned agents, many multi-agent problems become simple optimization problems. You just need to train with a scaffold like Claude Code
These are pretty cool... but I guess nothing ever happened with it? I like the Jersey City Uber Eats robots a lot too
But we should still be building and deploying things here 100x faster
Philly is a good place to deploy. My issue is that general anti-AI sentiment is stronger in the Northeast (at least that's my perception as a lifelong Northeasterner). Many view the world as zero-sum instead of general-sum. It is much easier to build something new when you can abundantly find like-minded people
Between setbacks in Boston from taxi unions and now this, I have pretty much given up on the Northeast long term. At this rate the Northeast will become a 20th-century museum like Europe
Trains should be autonomous
I'll be in San Diego at NeurIPS Dec 3-7!
DM or email if you want to chat about
- building foundation agents through games
- PokeAgent Challenge & PokéChamp
- LLM Economist & autonomous business agents
Flyer for The PokeAgent Challenge at NeurIPS 2025. Sunday, Dec 7, 8–10:45 AM PST, Mezzanine Room 15AB, San Diego Convention Center. Two tracks: Track 1 (Battling) features competitive Pokémon battle bots; Track 2 (Speedrunning) features long-horizon RPG gameplay. Tagline: "How do we close the gap between specialist RL models and generalist LLM agents?" Speakers: Seth Karten (Princeton), Aaron Traylor, Minmin Chen (Google DeepMind), Jake Grigsby (UT Austin), Stephanie Milani (NYU/Johns Hopkins), Kiran Vodrahalli (Google DeepMind), Fei Fang (CMU), Yuke Zhu (UT Austin), Chi Jin (Princeton). Sponsored by Google DeepMind.
How do we close the gap between specialist RL and generalist LLM agents?
We're benchmarking it in Pokémon. Join us at the PokeAgent Challenge competition workshop @ NeurIPS 2025.
📅 Dec 7, 8AM
🎮 Track 1: Competitive Pokémon (game-theoretic reasoning)
🗺️ Track 2: Speedrunning (long-horizon planning)
Yes, please bring on the supply
We need:
- cheap energy
- cheap housing
- cheap food
Only possible by increasing supply
Gen 1 OU Pokemon Qualifiers end tonight and I'm not even competing, yet I'm nervously watching error bars converge.
(5/5)
Most LLM arenas use Bradley-Terry (batch MLE), which is accurate but requires full recomputation over the match history. Glicko-1 offers the best of both worlds: online updates and convergence to the batch optimum, with uncertainty estimates included.
(4/5)
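For reference, here is a minimal sketch of the Glicko-1 rating-period update described above; it follows Glickman's published Glicko-1 formulas, but it is my own illustration, not the challenge's scoring code, and `games` is a hypothetical list of (opponent rating, opponent RD, score) tuples:

```python
import math

Q = math.log(10) / 400  # Glicko-1 scale constant

def g(rd):
    """Down-weight opponents whose own ratings are uncertain."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)

def expected_score(r, r_j, rd_j):
    """Expected score of a player rated r against opponent (r_j, rd_j)."""
    return 1.0 / (1.0 + 10.0 ** (-g(rd_j) * (r - r_j) / 400.0))

def glicko1_update(r, rd, games):
    """One Glicko-1 rating period (RD inflation between periods omitted).
    games: list of (opponent_rating, opponent_rd, score), score in {0, 0.5, 1}.
    Returns (new_rating, new_rd); RD shrinks as evidence accumulates."""
    d2_inv = 0.0   # 1/d^2, the information gained this period
    delta = 0.0    # weighted sum of (score - expected)
    for r_j, rd_j, s in games:
        e = expected_score(r, r_j, rd_j)
        gj = g(rd_j)
        d2_inv += Q**2 * gj**2 * e * (1.0 - e)
        delta += gj * (s - e)
    denom = 1.0 / rd**2 + d2_inv
    new_r = r + (Q / denom) * delta
    new_rd = math.sqrt(1.0 / denom)
    return new_r, new_rd

# e.g. a fresh 1500 +/- 350 agent beats a well-established 1400 +/- 80 agent:
print(glicko1_update(1500, 350, [(1400, 80, 1)]))
```

Because each update only touches the two players involved, the ratings stay current after every game, which is what "online" means here.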
Top-3 agents converge across all methods (250+ games each). But ranks 4+ show systematic disagreement:
- Elo diverges from HR even when HR's error bars don't overlap
- Glicko-1 agrees with HR despite being online
(3/5)
Leaderboard of the Pokémon Gen 1 OU Top 100 NeurIPS competition for the PokeAgent Challenge. The leaderboard shows username, Elo, Glicko-1, Glicko-1 deviation, wins, losses, and ties for the results of the head-to-head battles for each agent methodology. Top user submissions are highlighted. PAC-MM-* usernames are organizer-hosted baselines.
Leaderboard of the Pokémon Gen 1 OU Top 100 NeurIPS competition for the PokeAgent Challenge on the pokeagent.github.io website. The leaderboard shows username, history rating, GXE, wins, and losses for the results of the head-to-head battles for each agent methodology, including the currently qualifying methods.
In the NeurIPS PokeAgent Challenge, we stress-test 4 ranking systems across 100k+ agent matches:
- Bradley-Terry (batch MLE, our ground truth)
- Elo (online, chess-standard)
- Glicko-1 (online, uncertainty-aware)
- GXE (Glicko-derived win %)
(2/5)
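As a reference point for what "batch MLE" means here, this is a minimal sketch of Bradley-Terry fitting via the classic MM (Zermelo) iteration; it is my own illustration under the assumption of a simple pairwise wins matrix, not the challenge's actual pipeline:

```python
import numpy as np

def bradley_terry(wins, iters=500, tol=1e-8):
    """Batch Bradley-Terry MLE via the MM (Zermelo) iteration.
    wins[i, j] = number of times agent i beat agent j.
    Assumes every agent has at least one win and one loss, or the MLE
    does not exist. Returns normalized strengths p, where
    P(i beats j) = p[i] / (p[i] + p[j])."""
    n = wins.shape[0]
    games = wins + wins.T          # total games played between each pair
    p = np.ones(n)
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p_new = wins.sum(axis=1) / denom.sum(axis=1)
        p_new /= p_new.sum()
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p

# e.g. 3 agents: A beats B 7/10, A beats C 8/10, B beats C 6/10
wins = np.array([[0, 7, 8],
                 [3, 0, 6],
                 [2, 4, 0]], dtype=float)
p = bradley_terry(wins)
print(400 * np.log10(p / p.mean()))  # map strengths onto an Elo-like scale
```

Note that every update touches the whole wins matrix, which is why adding one new game means recomputing the full fit, unlike the online methods above.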
Every LLM eval uses Bradley-Terry Elo rankings. Almost none report uncertainty. Should we trust them? Maybe there is something better...
(1/5)
Two pokeagents in the replay archives
A benchmark environment is nothing without data, so you can pretrain before you RL.
Announcing our replay archive preview: we are releasing an additional 25k games to help you train a metagame exploiter (5 million more released after the qualifier)
replays.pokeagentshowdown.com:8443/
(3/3)
- Gen 1 OU battles require 100+ turns of long-context planning in partially observable, stochastic environments
Check out the PokeAgent Challenge Gen 1 OU Qualifier live this week
youtube.com/live/N6JmD5XKf4g
(2/3)
Pokémon is truly the Pareto frontier of agent research
- The RPG requires an autonomous, embodied agent with perception, planning, memory, and control
- VGC and Gen 9 OU penalize erroneous actions and demand fast-paced opponent modeling in short games
(1/3)
Apparently I need to fullscreen my browser for the new post button to show up
Trying to get a post ready, but Bluesky won't let me post on desktop!!! If you want users here, you need a user experience!!!