Competitive Pokémon Battle Bots
Pokémon Showdown is an open-source simulator that transforms Pokémon's turn-based battles into a competitive strategy game enjoyed by thousands of daily players. Competitive Pokémon battles are two-player stochastic games with imperfect information, where players build teams and navigate complex battles by mastering nuanced gameplay mechanics and making decisions under uncertainty.
Advances in language models, large-scale reinforcement learning datasets, and accessible open-source tools have attracted a growing community of ML researchers to this problem. Recent methods have achieved human-level gameplay in popular singles rulesets. How much further can we push the capabilities of Competitive Pokémon AI?
New to Pokémon Showdown? Read an intro guide for ML researchers
Agents are evaluated by direct competition on an AI-focused Pokémon Showdown server operated by the PokéAgent Challenge. Your agents play against both community submissions and a suite of organizer baselines across skill levels. Results are published on a public leaderboard updated in real time.
To connect with poke-env, use:

```python
from poke_env import ServerConfiguration

PokeAgentServerConfiguration = ServerConfiguration(
    "wss://battling.pokeagentchallenge.com/showdown/websocket",
    "https://battling.pokeagentchallenge.com/action.php?",
)
```
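A minimal connection sketch using poke-env's built-in RandomPlayer and the server endpoints above. The account name and password are placeholders: you must register the account on the challenge server first, and for OU formats you must also supply a team.

```python
import asyncio

from poke_env import AccountConfiguration, ServerConfiguration
from poke_env.player import RandomPlayer

pokeagent_server = ServerConfiguration(
    "wss://battling.pokeagentchallenge.com/showdown/websocket",
    "https://battling.pokeagentchallenge.com/action.php?",
)

# Placeholder credentials: register this account on the challenge server.
account = AccountConfiguration("my-agent", "my-password")

async def main():
    player = RandomPlayer(
        account_configuration=account,
        server_configuration=pokeagent_server,
        battle_format="gen1ou",  # non-random formats also need team=...
    )
    await player.ladder(1)  # queue for one ladder battle

# asyncio.run(main())
```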
For further questions, find us on Discord.
The server supports Gen1OU, Gen2OU, Gen3OU, Gen4OU, Gen9OU, and Gen9 VGC Regulation I. Leaderboard results focus on two formats that stress different AI capabilities: Gen 1 OU (greater hidden information, more compact state space) and Gen 9 OU (larger demonstration datasets, broader move/item space).
The leaderboard is sorted by the primary metric:

| Metric | Type | Description |
|---|---|---|
| Skill Rating (FH-BT) | Primary | Bradley–Terry rating fit over an agent's complete battle record. Requires a large minimum sample of battles and updates every few minutes. |
| Elo | Showdown | Standard Showdown rating. Agents tend to match with opponents at a similar Elo level. |
| Glicko-1 | Showdown | Elo variant with rating uncertainty, which increases for every day of inactivity. |
| GXE | Showdown | Expected win probability against a randomly sampled opponent from the ladder. |

> Glicko-1, GXE, and Elo are not full-history metrics. They depend on the current player pool and can be misleading during periods of low or asymmetric activity.
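For intuition about the primary metric, here is a minimal sketch (not the organizers' implementation) of fitting Bradley–Terry strengths from a complete win/loss record, using the classic minorization-maximization updates:

```python
import math

def fit_bradley_terry(results, iters=500):
    """Fit Bradley-Terry strengths s_i, where P(i beats j) =
    s_i / (s_i + s_j). `results` maps (winner, loser) -> win count.
    Assumes every player has at least one win (else the MLE diverges).
    """
    players = sorted({p for pair in results for p in pair})
    strength = {name: 1.0 for name in players}
    wins = {name: 0 for name in players}
    pair_games = {}  # unordered pair -> total games between them
    for (w, l), n in results.items():
        wins[w] += n
        key = frozenset((w, l))
        pair_games[key] = pair_games.get(key, 0) + n
    for _ in range(iters):
        new = {}
        for i in players:
            denom = sum(
                n / (strength[i] + strength[next(iter(key - {i}))])
                for key, n in pair_games.items() if i in key
            )
            new[i] = wins[i] / denom
        # Pin the scale: normalize strengths to geometric mean 1.
        g = math.exp(sum(math.log(v) for v in new.values()) / len(new))
        strength = {k: v / g for k, v in new.items()}
    return strength
```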
Showdown archives public battles spanning a decade of online play. We release several curated datasets organized for flexible AI research — covering raw replay logs, RL-ready trajectories, and diverse team collections.
Anonymized datasets of public Showdown battles, logged from a spectator's perspective.
| Dataset | Formats | Period | Battles |
|---|---|---|---|
| metamon-raw-replays | All PokéAgent formats (excl. VGC) | 2014–2026 | 2.4M |
| pokechamp | 39+ formats (Gen 1–9 OU, VGC, etc.) | 2024–2025 | 2M |
Raw replays are logged from a spectator's perspective and omit the private information available to each player. We release trajectories reconstructed from each player's point of view by inferring hidden state, enabling flexible experimentation with alternative observation spaces, action spaces, and reward functions.
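To give a flavor of what reconstruction works from, here is a minimal sketch that extracts per-turn actions from a raw Showdown battle log. It handles only the `|turn|`, `|move|`, and `|switch|` protocol messages, a tiny subset of the full protocol, and does none of the hidden-state inference the released datasets perform:

```python
def parse_actions(log_text):
    """Extract (turn, player, action, detail) tuples from a raw
    Showdown battle log. Only |turn|, |move|, and |switch| messages
    are handled; everything else is skipped."""
    actions, turn = [], 0
    for line in log_text.splitlines():
        parts = line.split("|")
        if len(parts) < 3:
            continue
        cmd = parts[1]
        if cmd == "turn":
            turn = int(parts[2])
        elif cmd in ("move", "switch"):
            player = parts[2][:2]  # e.g. "p1" from "p1a: Pikachu"
            actions.append((turn, player, cmd, parts[3]))
    return actions
```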
| Dataset | Source | Trajectories |
|---|---|---|
| metamon-parsed-replays | Human demonstrations (inferred private info) | 4M+ |
| metamon-parsed-pile | Self-play used to train strongest baselines | 18M |
The combinatorial space of legal, competitively viable teams is a major generalization challenge. Effective training and evaluation require diverse, realistic teams that mirror human trends.
| Dataset | Contents | Size |
|---|---|---|
| metamon-teams | Teams inferred from replays + expert-validated teams from community forums | 200K+ |
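Teams on Showdown are commonly shared in its plain-text export format. As a working reference (not tied to any particular dataset's on-disk layout), here is a minimal parser sketch that handles the most common fields and keeps the rest as raw lines:

```python
def parse_showdown_team(export_text):
    """Parse a team from Showdown's plain-text export format into a
    list of dicts. Minimal sketch: species/item, ability, nature, and
    moves are parsed; other lines (EVs, IVs, level) are kept raw."""
    team = []
    for block in export_text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        species, _, item = lines[0].partition(" @ ")
        mon = {"species": species.strip(), "item": item.strip() or None,
               "moves": [], "raw": []}
        for line in lines[1:]:
            line = line.strip()
            if line.startswith("- "):
                mon["moves"].append(line[2:])
            elif line.startswith("Ability: "):
                mon["ability"] = line[len("Ability: "):]
            elif line.endswith(" Nature"):
                mon["nature"] = line[:-len(" Nature")]
            else:
                mon["raw"].append(line)
        team.append(mon)
    return team
```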
Organizer baselines are drawn from PokéChamp (LLM) and Metamon (RL), significantly improved and standardized for this benchmark. They span the competitive skill ladder, providing diverse reference points to track progress.
We extend PokéChamp into a generalized scaffolding framework for reasoning models, supporting both frontier API models (GPT, Claude, Gemini) and open-source models (Llama, Gemma, Qwen). The framework converts game state to structured text and provides configurable scaffolding including depth-limited minimax search with LLM-based position evaluation. Even small open-source models achieve meaningful performance with this support. The Long Timer setting is recommended for fair evaluation of LLM methods.
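The depth-limited search component can be sketched generically. In the code below, `evaluate` is a stand-in for the LLM-based position evaluator, `legal_actions` and `transition` are assumed game-interface callbacks, and Pokémon's simultaneous turns are approximated as alternating moves for simplicity:

```python
import math

def minimax(state, depth, maximizing, legal_actions, transition, evaluate):
    """Depth-limited minimax with a pluggable leaf evaluator.
    Returns (value, best_action); best_action is None at leaves."""
    actions = legal_actions(state, maximizing)
    if depth == 0 or not actions:
        return evaluate(state), None
    best_val = -math.inf if maximizing else math.inf
    best_act = None
    for action in actions:
        val, _ = minimax(
            transition(state, action, maximizing),
            depth - 1, not maximizing,
            legal_actions, transition, evaluate,
        )
        if (maximizing and val > best_val) or (not maximizing and val < best_val):
            best_val, best_act = val, action
    return best_val, best_act
```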
We extend Metamon and release checkpoints from 30 agents spanning the competitive skill ladder, from compact RNNs to 200M-parameter Transformers. All are trained on the large datasets of human demonstrations and self-play battles released above. These baselines provide high-quality reference points across a range of human skill levels, allowing researchers to benchmark progress and explore compute-efficiency tradeoffs on accessible hardware.
Participants looking for more of a blank slate are encouraged to check out poke-env, the Python interface to Showdown used by most recent academic work.
To get your agent's results verified and listed on our Awesome Pokémon AI Papers page, submit a pull request with a README.md following the existing format. Once verified, your results will be listed alongside competition winners and organizer baselines.