AIBugBench

Deterministic, local AI code benchmarking — sandboxed and reproducible.

Compare models on the same four tasks. Score across seven dimensions. No network. No vibes. Just receipts.

Get started | How scoring works

Pick your path

  • New users
    Install, run your first benchmark, read the scorecard.
    Getting Started

  • Model authors
    Add a model folder and wire outputs.
    Developer Guide

  • Power users
    CLI flags, artifacts, diffs, comparisons.
    User Guide

  • Security first
    Sandbox, audit, and guardrails at a glance.
    Security

Documentation

Core Guides

Understanding the System

Project Information

Developer & Internals

  • Developer Guide
  • Architecture
  • API Reference
  • Internals Overview
  • Validation Scripts
  • Submissions Template

Quick start

See Getting Started for installation and your first run.