# Getting Started
Get AIBugBench running and test a model fast.
## Before You Begin

Prerequisites:
- Python 3.13+
- Git
You'll run 4 fixed challenges (refactoring, config repair, data transformation, API simulation); scoring covers 7 quality dimensions.
## Step 1: Clone and Set Up Environment
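A minimal sketch of this step; the clone URL below is a placeholder (substitute the actual repository URL), and the venv commands mirror the cheat sheet at the end of this page:

```bash
# Clone the repository (replace the placeholder URL with the real one)
git clone https://github.com/<org>/AIBugBench.git
cd AIBugBench

# Create and activate a virtual environment (Python 3.13+)
python -m venv venv
source venv/bin/activate  # Windows PowerShell: venv\Scripts\Activate.ps1
```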
## Step 2: Create Test Data and Install Dependencies
All Platforms:
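For example (assumes the virtual environment from Step 1 is active; the commands mirror the cheat sheet at the end of this page):

```bash
# Install pinned dependencies
pip install -r requirements.lock

# Generate sabotage fixtures and ai_prompt.md
python scripts/bootstrap_repo.py
```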
This creates deliberately broken test data (sabotage fixtures) and ai_prompt.md for AI context. See Sabotage Notes for details.
## Step 3: Verify Installation
Test with the built-in example model:
All Platforms:
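For example (same command as in the cheat sheet below):

```bash
# Score the bundled reference submission
python run_benchmark.py --model example_model
```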
Expected Output: Scores around 90/100 (A grade). If you see "FAILED" results or errors, check that Python 3.13+ is active and all dependencies are installed.
## Directory Overview

Core structure:

- `run_benchmark.py` - Orchestrates scoring
- `scripts/bootstrap_repo.py` - Generates sabotage fixtures and prompt file
- `benchmark/` - Validation and scoring engine
- `prompts/` - Challenge definitions
- `test_data/` - Deliberately broken inputs
- `submissions/` - Your model solutions
- `results/` - Saved JSON and text reports
## Step 4: Create Your AI Model Submission
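A minimal sketch, assuming each submission lives in its own folder under `submissions/` named after the model (see the User Guide for the exact expected layout):

```bash
# Create a folder to hold your model's solution files
mkdir -p submissions/your_model_name
```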
## Step 5: Get AI Responses and Save Code

- Prime your AI with the contents of `ai_prompt.md` for optimal results
- Give each prompt from the `prompts/` folder to your AI model:
    - `prompt_1_refactoring.md` → Save Python code as `prompt_1_solution.py`
    - `prompt_2_yaml_json.md` → Save YAML as `prompt_2_config_fixed.yaml` and JSON as `prompt_2_config.json`
    - `prompt_3_transformation.md` → Save Python code as `prompt_3_transform.py`
    - `prompt_4_api_simulation.md` → Save Python code as `prompt_4_api_sync.py`
Save only the Python/YAML/JSON code (no explanations or markdown).
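Once all four prompts are done, the submission folder should contain only the solution files; a sketch, assuming the per-model folder from Step 4:

```bash
ls submissions/your_model_name/
# prompt_1_solution.py
# prompt_2_config.json
# prompt_2_config_fixed.yaml
# prompt_3_transform.py
# prompt_4_api_sync.py
```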
## Step 6: Run Benchmark and Review Results
All Platforms:
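For example (use the folder name you chose in Step 4):

```bash
# Score your submission
python run_benchmark.py --model your_model_name
```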
Results use timestamped directories. See the User Guide for the full layout.
Quick access:
- `results/latest_results.json` – Pointer to most recent run
- `results/<RUN_TS>/latest_results.json` – Full run JSON
- `results/<RUN_TS>/detailed/summary_report_<RUN_TS>.txt` – Human-readable analysis
- `results/<RUN_TS>/comparison_charts/comparison_chart.txt` – Visual progress bars
Historical runs accumulate; each benchmark invocation creates a new <RUN_TS> directory.
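For a quick look from the shell (the `.summary` key matches the cheat sheet below; `jq` is optional, and `<RUN_TS>` is the timestamp of the run you want):

```bash
# Summary of the most recent run
cat results/latest_results.json | jq '.summary'

# Human-readable report for a specific run
cat results/<RUN_TS>/detailed/summary_report_<RUN_TS>.txt
```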
See Scoring Methodology – Grade Scale for letter-grade thresholds.
## Troubleshooting
Common Issues:
- "No module named 'yaml'":
pip install pyyaml requests - Permission denied (PowerShell):
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser - FileNotFoundError: test_data:
python scripts/bootstrap_repo.py - All scores are 0.00: files must contain code only (no markdown)
- Venv issues: try
python3orpyon Windows
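If several of these apply, a quick environment check (assumes the venv is activated) can narrow things down:

```bash
python --version          # should report 3.13 or newer
pip show pyyaml requests  # both packages should be listed
```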
🔧 Comprehensive Troubleshooting Guide - Includes Tiered Structure Errors taxonomy and detailed solutions.
## Next Steps
- Try other models
- Read your results summary and dig into details
- See Scoring Methodology
## Quick Reference (Cheat Sheet)

```bash
# Create venv and install (pinned)
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\Activate.ps1
pip install -r requirements.lock
python scripts/bootstrap_repo.py

# Run an example model
python run_benchmark.py --model example_model

# Run your model
python run_benchmark.py --model your_model_name

# Results (quick look)
cat results/latest_results.json | jq '.summary'
```