LMArena launches Code Arena for full-cycle AI model evaluation

What's new? Code Arena evaluates AI coding models across real-world development cycles with agentic tool calls, logging file operations in live coding sessions under traceable IDs.

LMArena · 1 min read

LMArena has announced Code Arena, a comprehensive evaluation platform designed to assess AI coding models under real-world conditions. The platform is publicly available and targets developers, researchers, model builders, and knowledge workers interested in AI-assisted software development. Unlike traditional benchmarks that focus solely on code correctness, Code Arena records and evaluates the entire development cycle, including planning, iterative building, debugging, and refinement.
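
To make that scope concrete, the sketch below shows one way such a full-cycle trace could be represented. The schema is purely illustrative, a minimal sketch of the announced idea rather than LMArena's actual data model; every field name here is an assumption.

```python
# Hypothetical sketch of a full-cycle evaluation record. Field names are
# illustrative only and are not LMArena's actual schema.
from dataclasses import dataclass, field
from typing import Literal

# The phases named in the announcement: planning, iterative building,
# debugging, and refinement.
Phase = Literal["plan", "build", "debug", "refine"]

@dataclass
class Step:
    phase: Phase   # which part of the development cycle this step belongs to
    action: str    # e.g. a model message, a file edit, or a test run
    output: str    # captured result (diff, stdout, error trace, ...)

@dataclass
class Session:
    session_id: str  # unique, traceable ID for reproducibility
    model: str       # model under evaluation
    task: str        # the original coding task
    steps: list[Step] = field(default_factory=list)  # ordered full-cycle trace
```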

The system introduces agentic behaviors, allowing models to perform structured tool calls such as file creation, editing, and execution. Each action is tracked and logged in persistent, restorable sessions, and outputs can be shared for peer review. A secure frontend streams live coding sessions, and every evaluation links to a unique traceable ID, ensuring reproducibility and transparency. Human evaluators compare model outputs on functionality, usability, and fidelity, with their judgments aggregated using rigorous statistical methods.
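
The announcement does not name the statistical methods used. LMArena's earlier leaderboards have relied on Bradley-Terry-style models for pairwise human votes, and the sketch below illustrates that general approach with made-up battle data; it is an assumption about the method, not a description of Code Arena's pipeline.

```python
# Minimal Bradley-Terry-style aggregation of pairwise human judgments.
# The model names and battle records below are invented for illustration.
import numpy as np

# Each record: (winner, loser) from one human comparison.
battles = [("model_a", "model_b"), ("model_a", "model_c"),
           ("model_b", "model_c"), ("model_a", "model_b")]

models = sorted({m for pair in battles for m in pair})
idx = {m: i for i, m in enumerate(models)}
scores = np.zeros(len(models))  # log-strengths, initialised to zero

# Gradient ascent on the Bradley-Terry log-likelihood:
# P(winner beats loser) = sigmoid(score_winner - score_loser).
for _ in range(2000):
    grad = np.zeros(len(models))
    for winner, loser in battles:
        w, l = idx[winner], idx[loser]
        p_win = 1.0 / (1.0 + np.exp(scores[l] - scores[w]))
        grad[w] += 1.0 - p_win
        grad[l] -= 1.0 - p_win
    scores += 0.01 * grad

# Rank models by fitted strength.
for m in sorted(models, key=lambda m: -scores[idx[m]]):
    print(f"{m}: {scores[idx[m]]:.3f}")
```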

LMArena, the company behind Code Arena, previously ran WebDev Arena, one of the first large-scale human-in-the-loop coding benchmarks. Code Arena replaces that earlier effort with a platform rebuilt for methodological rigor, transparent data handling, and scientific measurement. The new leaderboard starts fresh, avoiding legacy data to guarantee clean comparisons. Industry experts and early users have highlighted Code Arena as a step forward in transparent, reproducible evaluation, setting a new standard for testing AI coding agents.
