
AIBugBench Troubleshooting Guide

If AI/LLM-generated code produces errors when tested, also consult the "Troubleshooting Guide" section inside the Sabotage Notes for hazard-specific remediation patterns.

This guide covers common issues encountered when working with AIBugBench and provides step-by-step solutions.

Quick FAQ

  • What Python version do I need? — Python 3.13+. Check with python --version.
  • How do I install and run? — See Getting Started for platform steps; then python run_benchmark.py --model example_model.
  • Where are results written? — See User Guide (start with results/latest_results.json).
  • How are grades calculated? — See Scoring Methodology (Grade Scale).
  • Can I run offline? — Yes. Offline and sandboxed by default (see Security).
  • All scores are 0.00 — Ensure files contain only code (no markdown) and exact filenames.
  • YAML/JSON parsing fails — Use safe loaders and preserve structure; see Troubleshooting + Prompt 2 notes, and the sketch after this list.
  • How do I add a model? — Copy template to submissions/user_submissions/<your_model>/; see Developer Guide.
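
For the YAML/JSON parsing item above, a minimal sketch of a safe-loading check. It assumes only the file paths named elsewhere in this guide (test_data/config.yaml and results/latest_results.json); adapt the paths to whatever files your prompt handles.

# Sketch: parse YAML and JSON with safe loaders and report any parse error.
import json
import yaml  # PyYAML, already used by the project's test data

try:
    with open("test_data/config.yaml", encoding="utf-8") as fh:
        config = yaml.safe_load(fh)  # safe_load never constructs arbitrary objects
    print("YAML OK:", type(config).__name__)
except yaml.YAMLError as exc:
    print("YAML parse error:", exc)

try:
    with open("results/latest_results.json", encoding="utf-8") as fh:
        print("JSON OK:", type(json.load(fh)).__name__)
except (OSError, json.JSONDecodeError) as exc:
    print("JSON check skipped:", exc)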

Common Issues and Solutions

Tiered Structure Errors (New Layout)

This section covers the primary error/failure modes introduced with the enforced tiered submissions layout.

Scenario: Missing submissions dir
Symptom / Message: ERROR: Submissions directory 'submissions' not found! (summary line absent)
Meaning: The root submissions/ directory isn't present
Action: Run python scripts/bootstrap_repo.py or create submissions/, then re-run

Scenario: Empty tiered structure
Symptom / Message: Discovered models: reference=0 user=0 templates=OK, followed later by No models found in submissions directory
Meaning: The layout exists but no models are present
Action: Copy the template to a new model (cp -r submissions/templates/template submissions/user_submissions/my_model), then implement the files

Scenario: Template missing
Symptom / Message: Discovered models: reference=X user=Y templates=MISSING
Meaning: The required submissions/templates/template/ directory was not found
Action: Recreate the template: python -c "from benchmark.utils import create_submission_template; import pathlib; create_submission_template(pathlib.Path('submissions'))"

Scenario: Legacy layout detected
Symptom / Message: Process aborts with SystemExit; the message starts with Legacy submissions layout detected (e.g. submissions/example_model)
Meaning: The old flat layout is present without the new tiers (no fallback)
Action: Migrate: create reference_implementations/, templates/template/, user_submissions/; move the old example_model/ into reference_implementations/

Quick Migration Commands (Unix-like shells):

mkdir -p submissions/{reference_implementations,templates,user_submissions}
mv submissions/example_model submissions/reference_implementations/
mkdir -p submissions/templates/template
python - <<'PY'
from benchmark.utils import create_submission_template; from pathlib import Path; create_submission_template(Path('submissions'))
PY

PowerShell equivalent:

New-Item -ItemType Directory -Force submissions/reference_implementations,submissions/templates/template,submissions/user_submissions | Out-Null
Move-Item submissions/example_model submissions/reference_implementations/ -Force
python -c "from benchmark.utils import create_submission_template; from pathlib import Path; create_submission_template(Path('submissions'))"

Verification:

python run_benchmark.py --submissions-dir submissions --model example_model || true

Expected discovery line after migration (with at least example_model in reference implementations):

Discovered models: reference=1 user=0 templates=OK
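
To check the tiered layout programmatically before re-running the benchmark, a minimal sketch that assumes only the directory names used above (the discovery summary itself is printed by the benchmark, not by this snippet):

# Sketch: reproduce the tier counts locally to confirm the migrated layout.
from pathlib import Path

root = Path("submissions")
template = root / "templates" / "template"

def count_models(tier: Path) -> int:
    return sum(1 for d in tier.iterdir() if d.is_dir()) if tier.is_dir() else 0

print(f"reference={count_models(root / 'reference_implementations')} "
      f"user={count_models(root / 'user_submissions')} "
      f"templates={'OK' if template.is_dir() else 'MISSING'}")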

Benchmark Execution Failures

"No module named 'benchmark'"

Symptoms:

ModuleNotFoundError: No module named 'benchmark'

Fix:

# Ensure you're in the project root
cd AIBugBench
# Verify the benchmark directory exists
ls -la benchmark/
# Re-run with proper Python path
python run_benchmark.py --model example_model

Verify:

python -c "import benchmark; print('OK')"

"FileNotFoundError: test_data directory not found"

Symptoms:

FileNotFoundError: [Errno 2] No such file or directory: 'test_data/config.yaml'

Fix:

# Run setup to create test data
python scripts/bootstrap_repo.py
# Verify test data exists
ls -la test_data/

Verify:

python -c "import os; print('OK' if os.path.exists('test_data/config.yaml') else 'FAIL')"

Validation Tool Failures

"scripts/validate_docs.py fails with platform mismatch"

Symptoms:

Platform mismatch: macos_linux vs windows_cmd

Fix:

# Windows users should run with platform override
python scripts/validate_docs.py --platform windows_cmd --docs-only
# Or use PowerShell variant
python scripts/validate_docs.py --platform windows_powershell --docs-only

Verify:

python scripts/validate_docs.py --docs-only --verbose
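
If you are unsure which --platform value applies, a small sketch that suggests one based on the running interpreter. The flag values (windows_cmd, windows_powershell, macos_linux) come from the messages above; the shell-detection heuristic itself is an assumption:

# Sketch: suggest a --platform value for validate_docs.py from the current OS/shell.
import os
import sys

if sys.platform.startswith("win"):
    # PSModulePath is usually present in both Windows shells, so this is approximate.
    shell = "windows_powershell" if os.environ.get("PSModulePath") else "windows_cmd"
else:
    shell = "macos_linux"
print(f"Suggested: python scripts/validate_docs.py --platform {shell} --docs-only")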

"validate_security.py fails with exit code 127"

Symptoms:

Command failed with return code 127 bandit: command not found

Fix:

# Install dev dependencies (pinned)
pip install -r requirements-dev.lock
# Verify bandit is installed
bandit --version
# Re-run security validation
python scripts/validate_security.py

Verify:

python scripts/validate_security.py --dry-run
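
Exit code 127 means the shell could not find the executable. A quick sketch to confirm whether bandit (and the other dev tools used in this guide) resolve on PATH in the active environment:

# Sketch: check that required dev tools resolve to executables on PATH.
import shutil
import sys

print("Python:", sys.executable)
for tool in ("bandit", "ruff", "pytest"):
    print(f"{tool}: {shutil.which(tool) or 'NOT FOUND - pip install -r requirements-dev.lock'}")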

Testing Issues

"pytest command not found"

Symptoms:

'pytest' is not recognized as an internal or external command

Fix:

# Install pytest via pip
pip install pytest pytest-cov
# Or install dev dependencies (pinned)
pip install -r requirements-dev.lock
# Re-run tests
pytest tests/ -v

Verify:

pytest --version

"Tests fail with import errors"

Symptoms:

ImportError: cannot import name 'validators' from 'benchmark'

Fix:

# Ensure you're in project root
cd AIBugBench
# Add project to Python path and run
PYTHONPATH=. pytest tests/ -v
# Or use the Python module flag
python -m pytest tests/ -v

Verify:

python -c "from benchmark import validators; print('OK')"

Submission Issues

"Model submission not recognized"

Symptoms:

Error: Model directory 'my_model' not found (reference_implementations/ or user_submissions/)

Fix:

# List available models by tier
ls -la submissions/reference_implementations/ || true
ls -la submissions/user_submissions/ || true
# Create new model from template (user tier)
cp -r submissions/templates/template submissions/user_submissions/my_model
# Verify structure
ls -la submissions/user_submissions/my_model/

Verify:

python run_benchmark.py --model my_model --dry-run

"Missing required solution files"

Symptoms:

Missing required file: prompt_1_solution.py

Fix:

# Check what files are missing (user tier shown)
ls -la submissions/user_submissions/my_model/
# Copy missing files from the canonical template (the legacy template path was removed)
cp submissions/templates/template/prompt_1_solution.py submissions/user_submissions/my_model/
# Verify all required files exist
python scripts/validate_submission.py my_model
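
To copy every template file that is missing from a submission in one go, a sketch; it assumes only the canonical template path used above, and my_model is the example model name from this section:

# Sketch: copy any file present in the canonical template but missing from a submission.
import shutil
from pathlib import Path

template = Path("submissions/templates/template")
model = Path("submissions/user_submissions/my_model")

for src in template.iterdir():
    dest = model / src.name
    if src.is_file() and not dest.exists():
        shutil.copy2(src, dest)
        print("copied:", dest)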

Environment Setup Issues

"Virtual environment activation fails"

Symptoms:

# Windows
'venv\Scripts\activate' is not recognized...

# macOS/Linux  
bash: venv/bin/activate: No such file or directory

Fix:

# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python -m venv venv
source venv/bin/activate

# Verify activation
pip list

Verify:

python -c "import sys; print('Virtual env:', hasattr(sys, 'real_prefix') or (hasattr(sys, 'base_prefix') and sys.base_prefix != sys.prefix))"

"pip install fails with permission errors"

Symptoms:

ERROR: Could not install packages due to an EnvironmentError: [Errno 13] Permission denied

Fix:

# Use user installation (pinned)
pip install --user -r requirements.lock
# Or fix virtual environment
python -m venv venv --clear
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.lock

Log File Analysis

Common Log Locations

  • Test logs: reports/session/YYYYMMDD_HHMMSS/pytest.log
  • Validation logs: reports/session/YYYYMMDD_HHMMSS/ruff_check.log
  • Security logs: reports/session/YYYYMMDD_HHMMSS/bandit.log
  • Benchmark results: results/latest_results.json

Analyzing Failed Runs

# Check latest test results
cat reports/session/$(ls -t reports/session/ | head -1)/pytest.log | grep ERROR

# Check validation issues
cat reports/session/$(ls -t reports/session/ | head -1)/ruff_check.log | tail -20

# Check security issues
cat reports/session/$(ls -t reports/session/ | head -1)/bandit.log | grep -A5 -B5 "HIGH\|MEDIUM"
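
The pipelines above assume a Unix-like shell. A cross-platform sketch that does the same thing, assuming only the reports/session/<timestamp>/ layout listed under Common Log Locations:

# Sketch: locate the most recent session directory and print flagged log lines.
from pathlib import Path

session_root = Path("reports/session")
sessions = sorted(session_root.iterdir(), reverse=True) if session_root.is_dir() else []
if not sessions:
    print("No session directories found under reports/session/")
else:
    latest = sessions[0]
    for name in ("pytest.log", "ruff_check.log", "bandit.log"):
        log = latest / name
        if not log.exists():
            continue
        flagged = [line for line in log.read_text(encoding="utf-8", errors="replace").splitlines()
                   if any(tag in line for tag in ("ERROR", "HIGH", "MEDIUM"))]
        print(f"--- {name}: {len(flagged)} flagged lines ---")
        print("\n".join(flagged[:20]))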

Performance Issues

"Validation takes too long"

Symptoms:

  • validate_docs.py hangs for minutes
  • High CPU usage during validation

Fix:

# Skip sandbox creation for faster validation
python scripts/validate_docs.py --no-sandbox-safe --docs-only
# Skip network commands
python scripts/validate_docs.py --skip-network
# Use verbose mode to see progress
python scripts/validate_docs.py --verbose

"Benchmark runs out of memory"

Symptoms:

MemoryError: Unable to allocate array

Fix:

# Limit per-run cost with a shorter timeout
python run_benchmark.py --model example_model --timeout 30
# Clear any cached results
rm -rf results/cache/
# Monitor memory usage
python run_benchmark.py --model example_model --verbose

Validation Commands Reference

Quick Health Check

# Run all validators in safe mode
python scripts/validate_docs.py --docs-only
python scripts/validate_security.py --dry-run
python -m pytest tests/ --tb=short

Full Validation Suite

# Documentation validation
python scripts/validate_docs.py --verbose --no-sandbox-safe

# Security validation  
python scripts/validate_security.py

# Code quality validation
python -m ruff check .
python -m mypy benchmark/

# Test validation with coverage
python -m pytest tests/ --cov=benchmark --cov-report=html

Debug Mode Commands

# Enable maximum verbosity
python run_benchmark.py --model example_model --verbose --debug

# Validate specific documentation file
python scripts/validate_docs.py --docs-only --verbose --project-root .

# Run single validator test (parametric suite) with full output
python -m pytest tests/test_validators_parametric.py::test_prompt1_parametric -v -s

Emergency Recovery

Complete Environment Reset

# Remove virtual environment
rm -rf venv/

# Clean cached files
find . -name "__pycache__" -type d -exec rm -rf {} +
find . -name "*.pyc" -delete

# Recreate environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.lock
python scripts/bootstrap_repo.py

Restore Default Test Data

# Backup current test data
mv test_data test_data_backup_$(date +%Y%m%d_%H%M%S)

# Regenerate clean test data
python scripts/bootstrap_repo.py

# Verify generation
ls -la test_data/
python -c "import yaml; print(yaml.safe_load(open('test_data/config.yaml')))"

Getting Additional Help

Enable Debug Logging

export PYTHONPATH=.
export DEBUG=1
python run_benchmark.py --model example_model --verbose 2>&1 | tee debug.log

Generate Detailed Reports

# Full system report (enhanced audit)
python validation/repo_audit_enhanced.py --path . --json audit_report.json > system_report.txt

# Validation report with output file
python scripts/validate_docs.py --output validation_report.txt

# Test report with coverage
python -m pytest tests/ --cov=benchmark --cov-report=html --html=test_report.html

Contact and Resources

  • Check recent commits in git log for related fixes
  • Review CHANGELOG.md for known issues and their resolutions
  • Examine docs/logging/ for session-specific debugging info

Platform-Specific Notes

Windows

  • Use PowerShell for better command compatibility
  • Path separators: use \ in shell paths; escape as \\ inside Python string literals (or use raw strings / pathlib)
  • Virtual env activation: venv\Scripts\activate.bat (cmd) or venv\Scripts\Activate.ps1 (PowerShell)

macOS/Linux

  • Virtual env activation: source venv/bin/activate
  • May need to install python3-dev for some packages
  • Use python3 explicitly if python points to Python 2

Cross-Platform

  • Use Path() objects in Python for path handling (see the sketch after this list)
  • Always use forward slashes in documentation examples
  • Test commands on target platform before documenting
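
A short sketch of the Path() recommendation above; the joined path is just the tiered submissions layout from earlier in this guide:

# Sketch: build paths with pathlib so the same code works on Windows and POSIX.
from pathlib import Path

model_dir = Path("submissions") / "user_submissions" / "my_model"
print(model_dir)             # submissions\user_submissions\my_model on Windows
print(model_dir.as_posix())  # forward slashes, handy for documentation examples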