Speedrunning

Long-Horizon RPG Gameplay

RPG Speedrunning

This track challenges agents to complete a full Pokémon role-playing game (Pokémon Emerald) as quickly and efficiently as possible, navigating a massive, partially observable world with hundreds of NPCs and thousands of possible actions.

Long-horizon planning, efficient exploration, and strategic resource management are critical. Agents must balance immediate objectives with long-term strategic goals, making decisions that span thousands of timesteps while adapting to the unpredictable nature of RPG gameplay.

The speedrunning challenge stresses sequential decision-making in a complex, open-world environment: fast completion times depend on planning over long horizons and on managing resources efficiently.

Starter Kit

The starter kit provides a real-time agent loop with modular components for perception (game frame recognition), planning and memory (long-term vs. short-term goals, knowledge storage), and control (emulator action execution).
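As a rough illustration of the perception, planning and memory, and control split, here is a minimal sketch of such a loop. It is not the starter kit's actual code: `Memory`, `perceive`, `plan`, `run_agent`, and the `get_frame` / `send_button` callables are hypothetical names chosen for this example, and the planner is a placeholder where a VLM call or search procedure would go.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# All names below are hypothetical; they do not mirror the starter kit's API.
BUTTONS = ("UP", "A")


@dataclass
class Memory:
    """Long-term goal plus a running log of observations (knowledge storage)."""
    long_term_goal: str = "earn the first gym badge"
    short_term_goals: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


def perceive(frame: bytes) -> str:
    """Perception: turn a raw frame into a text observation (placeholder)."""
    return f"frame of {len(frame)} bytes"


def plan(observation: str, memory: Memory) -> str:
    """Planning: pick a button, then store the observation in memory.

    Placeholder policy: alternate between two buttons. A real planner
    would reason over the observation and the stored goals.
    """
    button = BUTTONS[len(memory.notes) % len(BUTTONS)]
    memory.notes.append(observation)
    return button


def run_agent(
    get_frame: Callable[[], bytes],
    send_button: Callable[[str], None],
    max_steps: int,
    memory: Optional[Memory] = None,
) -> Memory:
    """Control loop: perceive, plan, act, repeated for max_steps steps."""
    if memory is None:
        memory = Memory()
    for _ in range(max_steps):
        observation = perceive(get_frame())
        button = plan(observation, memory)
        send_button(button)  # control: forward the action to the emulator
    return memory
```

Passing the emulator in as two callables keeps the loop testable without a running game: a stub such as `run_agent(lambda: b"\x00", print, max_steps=3)` exercises all three stages.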

What's Included

  • Agent Scaffolding: Modular framework for building Pokémon Emerald speedrunning agents
  • Pokémon Emerald Wrapper: Custom emulator API for real-time game interaction
  • Baseline Implementation: Reference agent with VLM setup and basic planning
  • Evaluation Tools: Automated testing and performance measurement

Submission Guidelines

We maintain a curated list of research papers on Pokémon AI, covering competitive battling, RPG gameplay, reinforcement learning, and LLM agents.

Submit Your Paper

We welcome submissions from the research community. To add your paper:

  1. Fork the awesome-pokemon-papers repository.
  2. Add your paper to the appropriate section in README.md following the existing format.
  3. Submit a pull request with a link to your paper (arXiv, conference proceedings, etc.).

Speedrun Leaderboard Submission

To appear on the speedrun leaderboard, include a video recording of your agent playing Pokémon Emerald in your PR. We accept runs through any portion of the game, from the first gym all the way to full completion.

The benchmark is designed to scale with agent capability. Our NeurIPS 2025 competition scoped evaluation to the first gym (Roxanne), but we encourage submissions that go further. If your agent can reach the second gym, the third, or complete the entire game, submit it.

Ranking Criteria

Rankings are determined by raw performance metrics: the number of actions and the time an agent takes to reach each milestone.

Primary Ranking Components

  • Milestone Completion: Percentage of game milestones accomplished (gym badges, story progression)
  • Completion Efficiency: Time and action count to achieve milestones
  • Reproducibility: Clear documentation and verifiable results
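The components above could be combined in several ways, and this page does not state an exact formula. The sketch below assumes one plausible ordering, purely for illustration: higher milestone completion first, then lower wall-clock time, then fewer actions. The `Run` fields and the `rank` function are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Run:
    """One submitted run. Field names are assumptions for this sketch."""
    team: str
    milestones_done: int
    milestones_total: int
    wall_time_s: float
    actions: int

    @property
    def completion(self) -> float:
        """Fraction of milestones accomplished, between 0.0 and 1.0."""
        return self.milestones_done / self.milestones_total


def rank(runs: List[Run]) -> List[Run]:
    """Sort runs best-first under the assumed lexicographic ordering.

    Assumption: completion dominates, ties break on time, then on actions.
    """
    return sorted(
        runs,
        key=lambda r: (-r.completion, r.wall_time_s, r.actions),
    )
```

Under this assumed ordering, a run that reaches more milestones always ranks above a faster run that reaches fewer. A weighted score would behave differently, so treat the tie-breaking order here as a placeholder.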

Novel Methods Welcome

While we provide a starter kit with an LLM-scaffolded approach, we encourage submissions using diverse methods: tool-augmented systems, reinforcement learning, purely text-based reasoning, hybrid architectures, and other innovative techniques.

Methodology Documentation

Teams document their methodology across five dimensions:

  • State Information (S): Raw pixels vs. parsed game state vs. privileged information
  • Tools (T): External tools during gameplay (web search, calculators, planning utilities)
  • Memory (M): Memory mechanisms beyond immediate context (vector DBs, knowledge graphs)
  • Feedback (F): Human or automated feedback during runs
  • Fine-tuning (Φ): Specialized training on Pokémon data
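One way to keep these five dimensions consistent across submissions is a small structured record. The following is a sketch under stated assumptions, not an official submission schema: the class names, field names, and allowed values are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class StateInfo(Enum):
    """State Information (S): what the agent is allowed to observe."""
    RAW_PIXELS = "raw_pixels"
    PARSED_STATE = "parsed_state"
    PRIVILEGED = "privileged"


@dataclass
class Methodology:
    """Hypothetical record covering the S, T, M, F, and Φ dimensions."""
    state: StateInfo                                   # S
    tools: List[str] = field(default_factory=list)     # T
    memory: List[str] = field(default_factory=list)    # M
    feedback: str = "none"                             # F: e.g. none/human/automated
    fine_tuned: bool = False                           # Φ

    def to_dict(self) -> Dict[str, object]:
        """Return a plain dict keyed by dimension letter, ready to serialize."""
        return {
            "S": self.state.value,
            "T": list(self.tools),
            "M": list(self.memory),
            "F": self.feedback,
            "Phi": self.fine_tuned,
        }
```

For example, `Methodology(StateInfo.PARSED_STATE, tools=["web search"], memory=["vector DB"]).to_dict()` yields a dictionary that can be written into a PR description or serialized with `json.dumps`.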