API Reference

Live, importable modules from the repo. The build step sets PYTHONPATH, so these imports work without installing a wheel.

Reading order

Start with the Module map to see what's where, then skim the snippets and the CLI reference. The Full reference at the bottom is collapsed by default.

Module map

  • benchmark.scoring — Scoring engine and helpers; BenchmarkScorer, calculate_grade, comparison helpers.
  • benchmark.validators — Prompt validators and analyzers (security, performance, maintainability).
  • benchmark.secure_runner — Sandboxed execution utilities. SecureRunner provides sandbox context and guarded execution.
  • benchmark.utils — Utilities for test data loading, directory setup, comparisons, and model stats.
  • benchmark.types — TypedDict schemas describing prompt results and overall output structures.
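
A quick smoke test (a minimal sketch; it assumes the build step mentioned above has put the repo on PYTHONPATH):

# Confirm the five modules resolve before diving in.
import benchmark.scoring
import benchmark.secure_runner
import benchmark.types
import benchmark.utils
import benchmark.validators

print("benchmark modules importable")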

Selected snippets

Strict sandbox environment (minimal, readable extract):

def _prepare_environment(self, sandbox_dir: Path) -> None:
    os.environ.clear()
    home_dir = sandbox_dir / "home"
    tmp_dir = sandbox_dir / "temp"
    home_dir.mkdir(exist_ok=True)
    tmp_dir.mkdir(exist_ok=True)

    base_env = {
        "HOME": str(home_dir),
        "USERPROFILE": str(home_dir),  # Windows
        "TEMP": str(tmp_dir),
        "TMP": str(tmp_dir),
        "TMPDIR": str(tmp_dir),
        "PYTHONDONTWRITEBYTECODE": "1",
        "AIBUGBENCH_SANDBOX_ROOT": str(sandbox_dir.resolve()),
        "AIBUGBENCH_ALLOW_NETWORK": "1" if self.allow_network else "0",
    }
    for key in ["PATH","SystemRoot","WINDIR","COMSPEC",
                "NUMBER_OF_PROCESSORS","PROCESSOR_ARCHITECTURE","LANG","LC_ALL"]:
        val = self._original_env.get(key)
        if val:
            base_env[key] = val
    os.environ.update(base_env)

Run a Python entry point inside the sandbox (guards are enforced via sitecustomize.py):

def run_python_sandboxed(self, args: list[str], *, timeout: int = 10,
                         cwd: Path | None = None, memory_mb: int = 512):
    cmd = [sys.executable, "-B", *args]    # -B keeps .pyc off; still loads sitecustomize
    env = os.environ.copy()                 # inherit sandbox env
    if cwd:
        # ensure sandbox folder (with sitecustomize.py) is on import path
        env["PYTHONPATH"] = str(cwd) + (os.pathsep + env["PYTHONPATH"] if env.get("PYTHONPATH") else "")
    # platform-specific resource limits applied here...
    return subprocess.run(cmd, cwd=str(cwd) if cwd else None, env=env,
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          text=True, timeout=timeout, check=False)


Snippets from the runner

Robust Unicode-safe printing

def safe_print(self, message: str) -> None:
    try:
        print(message)
    except UnicodeEncodeError:
        ascii_message = message.encode("ascii", "ignore").decode("ascii")
        print(ascii_message)
    except Exception as e:
        with contextlib.suppress(Exception):
            print(f"Print error: {e!s}")

Detailed scoring formatting (compact, two-line display)

def format_detailed_score(self, detailed_scoring: dict[str, Any]) -> str:
    lines, categories = [], []
    order = ["syntax","structure","execution","quality","security","performance","maintainability"]
    for cat in order:
        if cat in detailed_scoring:
            s = detailed_scoring[cat]
            categories.append(f"{cat.title()}: {s.get('earned',0):.1f}/{s.get('max',0):.1f}")
    mid = len(categories) // 2
    if categories:
        lines.append(f"     └─ {', '.join(categories[:mid])}")
        if len(categories) > mid:
            lines.append(f"        {', '.join(categories[mid:])}")
    return "\n".join(lines)

Atomic result write (tmp file swap)

def _atomic_write_json(self, path: Path, data: Any) -> None:
    tmp = path.with_suffix(path.suffix + ".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    os.replace(tmp, path)
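
os.replace performs the swap as a single rename on the same volume, so readers of results/latest_results.json never observe a half-written file. A standalone sketch of the same pattern (assumes a results/ directory exists):

import os
from pathlib import Path

out = Path("results/latest_results.json")
tmp = out.with_suffix(out.suffix + ".tmp")
tmp.write_text('{"ok": true}', encoding="utf-8")
os.replace(tmp, out)  # single rename; concurrent readers see old or new, never partial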

Recipes

Execute an existing script inside a sandbox

from benchmark.secure_runner import SecureRunner
from pathlib import Path

runner = SecureRunner(model_name="example_model", allow_network=False)
with runner.sandbox() as root:
    result = runner.run_python_sandboxed(
        ["-m", "module_to_run", "--flag"],
        cwd=Path(root),
        timeout=10,
        memory_mb=512,
    )
    print(result.stdout)

Parse CLI args without running the benchmark

from run_benchmark import parse_args
args = parse_args(["--model","example_model","--mem","768","--quiet"])
assert args.model == "example_model" and args.mem == 768 and args.quiet

CLI Reference

Usage

python run_benchmark.py [--model NAME | --all-models] [--workers N]
                        [--submissions-dir DIR] [--results-dir DIR]
                        [--mem {256,384,512,768,1024}]
                        [--unsafe] [--allow-network] [--trusted-model]
                        [--no-metadata] [-q|--quiet]

Arguments

Flag               Type / Values                    Default      Description
--model            string                           -            Test a single model by name.
--all-models       flag                             false        Test all discovered models (if supported in your runner).
--workers          int                              1            Number of concurrent workers when testing multiple models.
--submissions-dir  path                             submissions  Root directory containing model submissions.
--results-dir      path                             results      Directory where results, summaries and charts are written.
--mem              one of 256, 384, 512, 768, 1024  512          Memory limit (MB) for sandboxed execution.
--unsafe           flag                             false        Disable sandbox/resource isolation. Dangerous; for trusted runs only.
--allow-network    flag                             false        Allow network access during execution.
--trusted-model    flag                             false        Suppress unsafe-mode confirmation (use in CI for trusted submissions).
--no-metadata      flag                             false        Skip environment/git/dependency metadata collection.
-q, --quiet        flag                             false        Suppress non-essential output.

Examples

# Single model, default sandbox & limits
python run_benchmark.py --model example_model

# All models with 4 workers, custom results dir
python run_benchmark.py --all-models --workers 4 --results-dir out/results

# CI-like trusted run with network allowed and larger RAM cap
python run_benchmark.py --model gpt4 --unsafe --trusted-model --allow-network --mem 1024 -q

Environment Variables

Variable                Default   Description
AIBUGBENCH_RESULTS_DIR  results/  Override the default results directory.
AIBUGBENCH_TIMEOUT      30        Default operation timeout (seconds).
AIBUGBENCH_DEBUG        false     Enable debug logging.
PYTHONPATH              -         Include benchmark modules on the import path.
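
These are plain os.environ lookups; a minimal sketch, assuming the defaults listed above:

import os

results_dir = os.environ.get("AIBUGBENCH_RESULTS_DIR", "results/")
timeout = int(os.environ.get("AIBUGBENCH_TIMEOUT", "30"))
debug = os.environ.get("AIBUGBENCH_DEBUG", "false").lower() == "true"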

Programmatic usage (high level)

from run_benchmark import AICodeBenchmark

bench = AICodeBenchmark(submissions_dir="submissions", results_dir="results")
result = bench.run_single_model("example_model")
print(result["overall_score"], result["percentage"])

Representative output (terminal, JSON, and results files)

Terminal output + security banner

Example terminal output (single-model run)
> python run_benchmark.py
╔══════════════════════════════════════╗
║      AIBugBench Security Status      ║
╠══════════════════════════════════════╣
║Sandboxing:     ENABLED               ║
║Network:        BLOCKED               ║
║Subprocess:     BLOCKED               ║
║Filesystem:     CONFINED              ║
║Env Clean:      CLEANED               ║
║ResourceLimits: ENFORCED              ║
║Trusted Model:  YES                   ║
╚══════════════════════════════════════╝
Discovered models: reference=1 user=0 templates=OK
🔍 Discovered 1 model(s): example_model

Testing model: example_model
==================================================

📝 Testing Refactoring & Analysis...
   ✅ PASSED - Score: 23.17/25
     └─ Syntax: 5.0/5.0, Structure: 2.4/3.0, Execution: 6.0/6.0
        Quality: 3.0/3.0, Security: 4.0/4.0, Performance: 1.9/2.0, Maintainability: 0.9/2.0

📝 Testing YAML/JSON Correction...
   ✅ PASSED - Score: 25.00/25
     └─ Syntax: 4.0/4.0, Structure: 6.0/6.0, Execution: 8.0/8.0
        Quality: 6.0/6.0, Security: 1.0/1.0, Performance: 0.0/0.0, Maintainability: 0.0/0.0

📝 Testing Data Transformation...
2025-09-30 02:19:47,031 - transform_module - WARNING - User 103: Email is null, cannot extract provider
2025-09-30 02:19:47,031 - transform_module - WARNING - User 999: Missing or invalid 'contact' field
2025-09-30 02:19:47,032 - transform_module - WARNING - User 999: Missing or invalid 'stats' field
2025-09-30 02:19:47,032 - transform_module - INFO - Successfully transformed 6 users
   ✅ PASSED - Score: 22.00/25
     └─ Syntax: 3.0/3.0, Structure: 3.0/3.0, Execution: 12.0/12.0
        Quality: 3.0/3.0, Security: 0.0/1.0, Performance: 1.0/1.0, Maintainability: 0.0/2.0

📝 Testing API Simulation...
2025-09-30 02:19:47,072 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,072 - api_module - INFO - Successfully synced users. Job ID: abc123
✅ Sync successful! Job ID: abc123
2025-09-30 02:19:47,072 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,072 - api_module - WARNING - Unexpected success status code: 400
⚠️  Warning: Unexpected response status 400
2025-09-30 02:19:47,072 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,073 - api_module - WARNING - Unexpected success status code: 401
⚠️  Warning: Unexpected response status 401
2025-09-30 02:19:47,073 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,073 - api_module - WARNING - Unexpected success status code: 503
⚠️  Warning: Unexpected response status 503
2025-09-30 02:19:47,073 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,073 - api_module - ERROR - Network connection error: Network error
❌ Network Error: Unable to connect to CRM system
   Please check your internet connection and try again
2025-09-30 02:19:47,073 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,073 - api_module - INFO - Successfully synced users. Job ID: test123
✅ Sync successful! Job ID: test123
   ✅ PASSED - Score: 22.00/25
     └─ Syntax: 2.0/2.0, Structure: 3.0/3.0, Execution: 7.0/7.0
        Quality: 3.0/3.0, Security: 6.0/7.0, Performance: 0.0/2.0, Maintainability: 1.0/1.0

🎯 Final Score: 92.17/100 (92.2%)
📄 Summary report: results\20250930_021946\detailed\summary_report_20250930_021946.txt
📊 Comparison chart: results\20250930_021946\comparison_charts\comparison_chart.txt

🎉 Benchmark completed! Tested 1 model(s)

🏆 Top Performers:
  1. example_model: 92.2%
  2. (n/a)

📁 Detailed results have been saved to:
  • results/latest_results.json - Complete data with detailed scoring
  • results/detailed/summary_report_*.txt - Summary with enhanced feedback
  • results/comparison_charts/comparison_chart_*.txt - Visual comparison with progress bars

For complete scoring breakdowns and analysis, check these files in the /results directory.

JSON results file

Detailed JSON results file (results/latest_results.json)
{
  "benchmark_run": {
    "timestamp": "2025-10-11T22:48:46.444397",
    "total_models": 4
  },
  "models": {
    "GPT5_Thinking": {
      "model_name": "GPT5_Thinking",
      "timestamp": "2025-10-11T22:48:46.444621",
      "prompts": {
        "prompt_1": {
          "passed": true,
          "score": 23.166666666666664,
          "max_score": 25,
          "feedback": [
            "✅ Python Syntax (5.0/5.0):  ✓valid_syntax",
            "⚠️ Code Structure (2.4/3.0):  ✓yaml_import, json_import, error_handling, type_hints ✗logging",
            "✅ Execution (6.0/6.0):  ✓runs_without_error, json_output_validation",
            "✅ Code Quality (3.0/3.0):  ✓no_global_vars, uses_pathlib, has_main_guard, proper_file_handling",
            "✅ Security Analysis (4.0/4.0):  ✓no_security_issues",
            "⚠️ Performance Analysis (1.9/2.0):  ✓no_nested_loops",
            "⚠️ Maintainability Analysis (0.9/2.0):  ✓no_duplication, good_naming ✗no_long_functions"
          ],
          "tests_passed": {
            "valid_python": true,
            "runs_successfully": true,
            "good_quality": true,
            "secure_code": true
          },
          "detailed_scoring": {
            "syntax": {
              "earned": 5.0,
              "max": 5.0
            },
            "structure": {
              "earned": 2.4,
              "max": 3.0
            },
            "execution": {
              "earned": 6.0,
              "max": 6.0
            },
            "quality": {
              "earned": 3.0,
              "max": 3.0
            },
            "security": {
              "earned": 4.0,
              "max": 4.0
            },
            "performance": {
              "earned": 1.8666666666666665,
              "max": 2.0
            },
            "maintainability": {
              "earned": 0.9,
              "max": 2.0
            }
          }
        },
        "prompt_2": {
          "passed": true,
          "score": 25.0,
          "max_score": 25,
          "feedback": [
            "✅ Syntax (4.0/4.0):  ✓yaml_parses, json_parses",
            "✅ Structure (6.0/6.0):  ✓required_keys, nested_shapes, arrays_scalars",
            "✅ Execution (8.0/8.0):  ✓deep_equivalence, partial_matches",
            "✅ Quality (6.0/6.0):  ✓yaml_indentation, json_literals, formatting_style, no_duplication",
            "✅ Security (1.0/1.0):  ✓yaml_safety"
          ],
          "tests_passed": {
            "valid_yaml": true,
            "valid_json": true,
            "structure_preserved": true,
            "equivalence_test": true,
            "correct_types": true
          },
          "detailed_scoring": {
            "syntax": {
              "earned": 4.0,
              "max": 4
            },
            "structure": {
              "earned": 6.0,
              "max": 6
            },
            "execution": {
              "earned": 8.0,
              "max": 8
            },
            "quality": {
              "earned": 6.0,
              "max": 6
            },
            "security": {
              "earned": 1.0,
              "max": 1
            },
            "performance": {
              "earned": 0,
              "max": 0
            },
            "maintainability": {
              "earned": 0,
              "max": 0
            }
          }
        },
        "prompt_3": {
          "passed": true,
          "score": 21.0,
          "max_score": 25,
          "feedback": [
            "✅ Syntax (3.0/3.0):  ✓file_compiles, function_exists",
            "✅ Structure (3.0/3.0):  ✓correct_signature, basic_organization",
            "✅ Execution (12.0/12.0):  ✓function_runs, id_standardization, email_provider, age_normalization, account_tiers, error_handling",
            "⚠️ Quality (2.0/3.0):  ✓try_except, type_conversions ✗readable_loops",
            "❌ Security (0.0/1.0):  ✗no_unsafe_constructs",
            "✅ Performance (1.0/1.0):  ✓single_pass",
            "❌ Maintainability (0.0/2.0):  ✗code_organization"
          ],
          "tests_passed": {
            "function_exists": true,
            "no_crash": true,
            "id_standardization": true,
            "email_provider": true,
            "account_tiers": true
          },
          "detailed_scoring": {
            "syntax": {
              "earned": 3.0,
              "max": 3
            },
            "structure": {
              "earned": 3.0,
              "max": 3
            },
            "execution": {
              "earned": 12.0,
              "max": 12
            },
            "quality": {
              "earned": 2.0,
              "max": 3
            },
            "security": {
              "earned": 0.0,
              "max": 1
            },
            "performance": {
              "earned": 1.0,
              "max": 1
            },
            "maintainability": {
              "earned": 0.0,
              "max": 2
            }
          }
        },
        "prompt_4": {
          "passed": true,
          "score": 15.0,
          "max_score": 25,
          "feedback": [
            "✅ Syntax (2.0/2.0):  ✓file_compiles, function_exists",
            "✅ Structure (3.0/3.0):  ✓correct_signature, request_structure",
            "⚠️ Execution (4.0/7.0):  ✓handle_400, handle_401, handle_503, handle_connection_error ✗success_handling, json_parsing",
            "✅ Quality (3.0/3.0):  ✓informative_errors, documentation",
            "❌ Security (0.0/7.0):  ✗security_analysis",
            "✅ Performance (2.0/2.0):  ✓retry_resilience",
            "✅ Maintainability (1.0/1.0):  ✓code_organization"
          ],
          "tests_passed": {
            "function_signature": true,
            "uses_requests": true,
            "error_handling": true,
            "api_structure": true
          },
          "detailed_scoring": {
            "syntax": {
              "earned": 2.0,
              "max": 2
            },
            "structure": {
              "earned": 3.0,
              "max": 3
            },
            "execution": {
              "earned": 4.0,
              "max": 7
            },
            "quality": {
              "earned": 3.0,
              "max": 3
            },
            "security": {
              "earned": 0.0,
              "max": 7
            },
            "performance": {
              "earned": 2.0,
              "max": 2
            },
            "maintainability": {
              "earned": 1.0,
              "max": 1
            }
          }
        }
      },
      "overall_score": 84.16666666666666,
      "total_possible": 100,
      "percentage": 84.2
    },
    "Mistral_LeChat": {
      "model_name": "Mistral_LeChat",
      "timestamp": "2025-10-11T22:48:52.651905",
      "prompts": {
        "prompt_1": {
          "passed": true,
          "score": 16.316666666666666,
          "max_score": 25,
          "feedback": [
            "✅ Python Syntax (5.0/5.0):  ✓valid_syntax",
            "⚠️ Code Structure (1.8/3.0):  ✓yaml_import, json_import, error_handling ✗logging, type_hints",
            "❌ Execution (0.0/6.0):  ✗correct_filtering",
            "⚠️ Code Quality (2.2/3.0):  ✓no_global_vars, has_main_guard, proper_file_handling ✗uses_pathlib",
            "✅ Security Analysis (4.0/4.0):  ✓no_security_issues",
            "⚠️ Performance Analysis (1.9/2.0):  ✓no_nested_loops",
            "⚠️ Maintainability Analysis (1.4/2.0):  ✓no_long_functions, no_duplication, good_naming"
          ],
          "tests_passed": {
            "valid_python": true,
            "good_quality": true,
            "secure_code": true
          },
          "detailed_scoring": {
            "syntax": {
              "earned": 5.0,
              "max": 5.0
            },
            "structure": {
              "earned": 1.7999999999999998,
              "max": 3.0
            },
            "execution": {
              "earned": 0.0,
              "max": 6.0
            },
            "quality": {
              "earned": 2.25,
              "max": 3.0
            },
            "security": {
              "earned": 4.0,
              "max": 4.0
            },
            "performance": {
              "earned": 1.8666666666666665,
              "max": 2.0
            },
            "maintainability": {
              "earned": 1.4,
              "max": 2.0
            }
          }
        },
        "prompt_2": {
          "passed": true,
          "score": 25.0,
          "max_score": 25,
          "feedback": [
            "✅ Syntax (4.0/4.0):  ✓yaml_parses, json_parses",
            "✅ Structure (6.0/6.0):  ✓required_keys, nested_shapes, arrays_scalars",
            "✅ Execution (8.0/8.0):  ✓deep_equivalence, partial_matches",
            "✅ Quality (6.0/6.0):  ✓yaml_indentation, json_literals, formatting_style, no_duplication",
            "✅ Security (1.0/1.0):  ✓yaml_safety"
          ],
          "tests_passed": {
            "valid_yaml": true,
            "valid_json": true,
            "structure_preserved": true,
            "equivalence_test": true,
            "correct_types": true
          },
          "detailed_scoring": {
            "syntax": {
              "earned": 4.0,
              "max": 4
            },
            "structure": {
              "earned": 6.0,
              "max": 6
            },
            "execution": {
              "earned": 8.0,
              "max": 8
            },
            "quality": {
              "earned": 6.0,
              "max": 6
            },
            "security": {
              "earned": 1.0,
              "max": 1
            },
            "performance": {
              "earned": 0,
              "max": 0
            },
            "maintainability": {
              "earned": 0,
              "max": 0
            }
          }
        },
        "prompt_3": {
          "passed": true,
          "score": 22.0,
          "max_score": 25,
          "feedback": [
            "✅ Syntax (3.0/3.0):  ✓file_compiles, function_exists",
            "✅ Structure (3.0/3.0):  ✓correct_signature, basic_organization",
            "✅ Execution (12.0/12.0):  ✓function_runs, id_standardization, email_provider, age_normalization, account_tiers, error_handling",
            "✅ Quality (3.0/3.0):  ✓try_except, type_conversions, readable_loops",
            "❌ Security (0.0/1.0):  ✗no_unsafe_constructs",
            "✅ Performance (1.0/1.0):  ✓single_pass",
            "❌ Maintainability (0.0/2.0):  ✗code_organization"
          ],
          "tests_passed": {
            "function_exists": true,
            "no_crash": true,
            "id_standardization": true,
            "email_provider": true,
            "account_tiers": true
          },
          "detailed_scoring": {
            "syntax": {
              "earned": 3.0,
              "max": 3
            },
            "structure": {
              "earned": 3.0,
              "max": 3
            },
            "execution": {
              "earned": 12.0,
              "max": 12
            },
            "quality": {
              "earned": 3.0,
              "max": 3
            },
            "security": {
              "earned": 0.0,
              "max": 1
            },
            "performance": {
              "earned": 1.0,
              "max": 1
            },
            "maintainability": {
              "earned": 0.0,
              "max": 2
            }
          }
        },
        "prompt_4": {
          "passed": true,
          "score": 21.0,
          "max_score": 25,
          "feedback": [
            "✅ Syntax (2.0/2.0):  ✓file_compiles, function_exists",
            "✅ Structure (3.0/3.0):  ✓correct_signature, request_structure",
            "✅ Execution (7.0/7.0):  ✓success_handling, handle_400, handle_401, handle_503, handle_connection_error, json_parsing",
            "✅ Quality (3.0/3.0):  ✓informative_errors, documentation",
            "⚠️ Security (5.0/7.0):  ✓bearer_auth, no_token_leak ✗explicit_timeout",
            "❌ Performance (0.0/2.0):  ✗retry_resilience",
            "✅ Maintainability (1.0/1.0):  ✓code_organization"
          ],
          "tests_passed": {
            "function_signature": true,
            "uses_requests": true,
            "error_handling": true,
            "api_structure": true
          },
          "detailed_scoring": {
            "syntax": {
              "earned": 2.0,
              "max": 2
            },
            "structure": {
              "earned": 3.0,
              "max": 3
            },
            "execution": {
              "earned": 7.0,
              "max": 7
            },
            "quality": {
              "earned": 3.0,
              "max": 3
            },
            "security": {
              "earned": 5.0,
              "max": 7
            },
            "performance": {
              "earned": 0.0,
              "max": 2
            },
            "maintainability": {
              "earned": 1.0,
              "max": 1
            }
          }
        }
      },
      "overall_score": 84.31666666666666,
      "total_possible": 100,
      "percentage": 84.3
    },
    "Sonnet4.5_Thinking": {
      "model_name": "Sonnet4.5_Thinking",
      "timestamp": "2025-10-11T22:48:52.739921",
      "prompts": {
        "prompt_1": {
          "passed": true,
          "score": 17.166666666666664,
          "max_score": 25,
          "feedback": [
            "✅ Python Syntax (5.0/5.0):  ✓valid_syntax",
            "⚠️ Code Structure (2.4/3.0):  ✓yaml_import, json_import, error_handling, type_hints ✗logging",
            "❌ Execution (0.0/6.0):  ✗runs_without_error, correct_filtering",
            "✅ Code Quality (3.0/3.0):  ✓no_global_vars, uses_pathlib, has_main_guard, proper_file_handling",
            "✅ Security Analysis (4.0/4.0):  ✓no_security_issues",
            "⚠️ Performance Analysis (1.9/2.0):  ✓no_nested_loops",
            "⚠️ Maintainability Analysis (0.9/2.0):  ✓no_duplication, good_naming ✗no_long_functions"
          ],
          "tests_passed": {
            "valid_python": true,
            "good_quality": true,
            "secure_code": true
          },
          "detailed_scoring": {
            "syntax": {
              "earned": 5.0,
              "max": 5.0
            },
            "structure": {
              "earned": 2.4,
              "max": 3.0
            },
            "execution": {
              "earned": 0.0,
              "max": 6.0
            },
            "quality": {
              "earned": 3.0,
              "max": 3.0
            },
            "security": {
              "earned": 4.0,
              "max": 4.0
            },
            "performance": {
              "earned": 1.8666666666666665,
              "max": 2.0
            },
            "maintainability": {
              "earned": 0.9,
              "max": 2.0
            }
          }
        },
        "prompt_2": {
          "passed": true,
          "score": 23.0,
          "max_score": 25,
          "feedback": [
            "✅ Syntax (4.0/4.0):  ✓yaml_parses, json_parses",
            "✅ Structure (6.0/6.0):  ✓required_keys, nested_shapes, arrays_scalars",
            "✅ Execution (8.0/8.0):  ✓deep_equivalence, partial_matches",
            "⚠️ Quality (4.0/6.0):  ✓json_literals, formatting_style, no_duplication ✗yaml_indentation",
            "✅ Security (1.0/1.0):  ✓yaml_safety"
          ],
          "tests_passed": {
            "valid_yaml": true,
            "valid_json": true,
            "structure_preserved": true,
            "equivalence_test": true,
            "correct_types": true
          },
          "detailed_scoring": {
            "syntax": {
              "earned": 4.0,
              "max": 4
            },
            "structure": {
              "earned": 6.0,
              "max": 6
            },
            "execution": {
              "earned": 8.0,
              "max": 8
            },
            "quality": {
              "earned": 4.0,
              "max": 6
            },
            "security": {
              "earned": 1.0,
              "max": 1
            },
            "performance": {
              "earned": 0,
              "max": 0
            },
            "maintainability": {
              "earned": 0,
              "max": 0
            }
          }
        },
        "prompt_3": {
          "passed": true,
          "score": 22.0,
          "max_score": 25,
          "feedback": [
            "✅ Syntax (3.0/3.0):  ✓file_compiles, function_exists",
            "✅ Structure (3.0/3.0):  ✓correct_signature, basic_organization",
            "✅ Execution (12.0/12.0):  ✓function_runs, id_standardization, email_provider, age_normalization, account_tiers, error_handling",
            "✅ Quality (3.0/3.0):  ✓try_except, type_conversions, readable_loops",
            "❌ Security (0.0/1.0):  ✗no_unsafe_constructs",
            "✅ Performance (1.0/1.0):  ✓single_pass",
            "❌ Maintainability (0.0/2.0):  ✗code_organization"
          ],
          "tests_passed": {
            "function_exists": true,
            "no_crash": true,
            "id_standardization": true,
            "email_provider": true,
            "account_tiers": true
          },
          "detailed_scoring": {
            "syntax": {
              "earned": 3.0,
              "max": 3
            },
            "structure": {
              "earned": 3.0,
              "max": 3
            },
            "execution": {
              "earned": 12.0,
              "max": 12
            },
            "quality": {
              "earned": 3.0,
              "max": 3
            },
            "security": {
              "earned": 0.0,
              "max": 1
            },
            "performance": {
              "earned": 1.0,
              "max": 1
            },
            "maintainability": {
              "earned": 0.0,
              "max": 2
            }
          }
        },
        "prompt_4": {
          "passed": true,
          "score": 22.0,
          "max_score": 25,
          "feedback": [
            "✅ Syntax (2.0/2.0):  ✓file_compiles, function_exists",
            "✅ Structure (3.0/3.0):  ✓correct_signature, request_structure",
            "✅ Execution (7.0/7.0):  ✓success_handling, handle_400, handle_401, handle_503, handle_connection_error, json_parsing",
            "✅ Quality (3.0/3.0):  ✓informative_errors, documentation",
            "⚠️ Security (6.0/7.0):  ✓bearer_auth, no_token_leak, explicit_timeout ✗no_hardcoded_creds",
            "❌ Performance (0.0/2.0):  ✗retry_resilience",
            "✅ Maintainability (1.0/1.0):  ✓code_organization"
          ],
          "tests_passed": {
            "function_signature": true,
            "uses_requests": true,
            "error_handling": true,
            "api_structure": true
          },
          "detailed_scoring": {
            "syntax": {
              "earned": 2.0,
              "max": 2
            },
            "structure": {
              "earned": 3.0,
              "max": 3
            },
            "execution": {
              "earned": 7.0,
              "max": 7
            },
            "quality": {
              "earned": 3.0,
              "max": 3
            },
            "security": {
              "earned": 6.0,
              "max": 7
            },
            "performance": {
              "earned": 0.0,
              "max": 2
            },
            "maintainability": {
              "earned": 1.0,
              "max": 1
            }
          }
        }
      },
      "overall_score": 84.16666666666666,
      "total_possible": 100,
      "percentage": 84.2
    },
    "example_model": {
      "model_name": "example_model",
      "timestamp": "2025-10-11T22:48:52.852209",
      "prompts": {
        "prompt_1": {
          "passed": true,
          "score": 23.166666666666664,
          "max_score": 25,
          "feedback": [
            "✅ Python Syntax (5.0/5.0):  ✓valid_syntax",
            "⚠️ Code Structure (2.4/3.0):  ✓yaml_import, json_import, error_handling, type_hints ✗logging",
            "✅ Execution (6.0/6.0):  ✓runs_without_error, json_output_validation",
            "✅ Code Quality (3.0/3.0):  ✓no_global_vars, uses_pathlib, has_main_guard, proper_file_handling",
            "✅ Security Analysis (4.0/4.0):  ✓no_security_issues",
            "⚠️ Performance Analysis (1.9/2.0):  ✓no_nested_loops",
            "⚠️ Maintainability Analysis (0.9/2.0):  ✓no_duplication, good_naming ✗no_long_functions"
          ],
          "tests_passed": {
            "valid_python": true,
            "runs_successfully": true,
            "good_quality": true,
            "secure_code": true
          },
          "detailed_scoring": {
            "syntax": {
              "earned": 5.0,
              "max": 5.0
            },
            "structure": {
              "earned": 2.4,
              "max": 3.0
            },
            "execution": {
              "earned": 6.0,
              "max": 6.0
            },
            "quality": {
              "earned": 3.0,
              "max": 3.0
            },
            "security": {
              "earned": 4.0,
              "max": 4.0
            },
            "performance": {
              "earned": 1.8666666666666665,
              "max": 2.0
            },
            "maintainability": {
              "earned": 0.9,
              "max": 2.0
            }
          }
        },
        "prompt_2": {
          "passed": true,
          "score": 25.0,
          "max_score": 25,
          "feedback": [
            "✅ Syntax (4.0/4.0):  ✓yaml_parses, json_parses",
            "✅ Structure (6.0/6.0):  ✓required_keys, nested_shapes, arrays_scalars",
            "✅ Execution (8.0/8.0):  ✓deep_equivalence, partial_matches",
            "✅ Quality (6.0/6.0):  ✓yaml_indentation, json_literals, formatting_style, no_duplication",
            "✅ Security (1.0/1.0):  ✓yaml_safety"
          ],
          "tests_passed": {
            "valid_yaml": true,
            "valid_json": true,
            "structure_preserved": true,
            "equivalence_test": true,
            "correct_types": true
          },
          "detailed_scoring": {
            "syntax": {
              "earned": 4.0,
              "max": 4
            },
            "structure": {
              "earned": 6.0,
              "max": 6
            },
            "execution": {
              "earned": 8.0,
              "max": 8
            },
            "quality": {
              "earned": 6.0,
              "max": 6
            },
            "security": {
              "earned": 1.0,
              "max": 1
            },
            "performance": {
              "earned": 0,
              "max": 0
            },
            "maintainability": {
              "earned": 0,
              "max": 0
            }
          }
        },
        "prompt_3": {
          "passed": true,
          "score": 22.0,
          "max_score": 25,
          "feedback": [
            "✅ Syntax (3.0/3.0):  ✓file_compiles, function_exists",
            "✅ Structure (3.0/3.0):  ✓correct_signature, basic_organization",
            "✅ Execution (12.0/12.0):  ✓function_runs, id_standardization, email_provider, age_normalization, account_tiers, error_handling",
            "✅ Quality (3.0/3.0):  ✓try_except, type_conversions, readable_loops",
            "❌ Security (0.0/1.0):  ✗no_unsafe_constructs",
            "✅ Performance (1.0/1.0):  ✓single_pass",
            "❌ Maintainability (0.0/2.0):  ✗code_organization"
          ],
          "tests_passed": {
            "function_exists": true,
            "no_crash": true,
            "id_standardization": true,
            "email_provider": true,
            "account_tiers": true
          },
          "detailed_scoring": {
            "syntax": {
              "earned": 3.0,
              "max": 3
            },
            "structure": {
              "earned": 3.0,
              "max": 3
            },
            "execution": {
              "earned": 12.0,
              "max": 12
            },
            "quality": {
              "earned": 3.0,
              "max": 3
            },
            "security": {
              "earned": 0.0,
              "max": 1
            },
            "performance": {
              "earned": 1.0,
              "max": 1
            },
            "maintainability": {
              "earned": 0.0,
              "max": 2
            }
          }
        },
        "prompt_4": {
          "passed": true,
          "score": 22.0,
          "max_score": 25,
          "feedback": [
            "✅ Syntax (2.0/2.0):  ✓file_compiles, function_exists",
            "✅ Structure (3.0/3.0):  ✓correct_signature, request_structure",
            "✅ Execution (7.0/7.0):  ✓success_handling, handle_400, handle_401, handle_503, handle_connection_error, json_parsing",
            "✅ Quality (3.0/3.0):  ✓informative_errors, documentation",
            "⚠️ Security (6.0/7.0):  ✓bearer_auth, no_token_leak, explicit_timeout ✗no_hardcoded_creds",
            "❌ Performance (0.0/2.0):  ✗retry_resilience",
            "✅ Maintainability (1.0/1.0):  ✓code_organization"
          ],
          "tests_passed": {
            "function_signature": true,
            "uses_requests": true,
            "error_handling": true,
            "api_structure": true
          },
          "detailed_scoring": {
            "syntax": {
              "earned": 2.0,
              "max": 2
            },
            "structure": {
              "earned": 3.0,
              "max": 3
            },
            "execution": {
              "earned": 7.0,
              "max": 7
            },
            "quality": {
              "earned": 3.0,
              "max": 3
            },
            "security": {
              "earned": 6.0,
              "max": 7
            },
            "performance": {
              "earned": 0.0,
              "max": 2
            },
            "maintainability": {
              "earned": 1.0,
              "max": 1
            }
          }
        }
      },
      "overall_score": 92.16666666666666,
      "total_possible": 100,
      "percentage": 92.2
    }
  },
  "comparison": {
    "ranking": [
      {
        "model": "example_model",
        "score": 92.16666666666666,
        "percentage": 92.2
      },
      {
        "model": "Mistral_LeChat",
        "score": 84.31666666666666,
        "percentage": 84.3
      },
      {
        "model": "GPT5_Thinking",
        "score": 84.16666666666666,
        "percentage": 84.2
      },
      {
        "model": "Sonnet4.5_Thinking",
        "score": 84.16666666666666,
        "percentage": 84.2
      }
    ],
    "prompt_performance": {
      "prompt_1": {
        "best_score": 23.166666666666664,
        "avg_score": 20.0,
        "pass_rate": 100.0,
        "ranking": [
          {
            "model": "GPT5_Thinking",
            "score": 23.166666666666664,
            "passed": true
          },
          {
            "model": "example_model",
            "score": 23.166666666666664,
            "passed": true
          },
          {
            "model": "Sonnet4.5_Thinking",
            "score": 17.166666666666664,
            "passed": true
          },
          {
            "model": "Mistral_LeChat",
            "score": 16.316666666666666,
            "passed": true
          }
        ]
      },
      "prompt_2": {
        "best_score": 25.0,
        "avg_score": 24.5,
        "pass_rate": 100.0,
        "ranking": [
          {
            "model": "GPT5_Thinking",
            "score": 25.0,
            "passed": true
          },
          {
            "model": "Mistral_LeChat",
            "score": 25.0,
            "passed": true
          },
          {
            "model": "example_model",
            "score": 25.0,
            "passed": true
          },
          {
            "model": "Sonnet4.5_Thinking",
            "score": 23.0,
            "passed": true
          }
        ]
      },
      "prompt_3": {
        "best_score": 22.0,
        "avg_score": 21.8,
        "pass_rate": 100.0,
        "ranking": [
          {
            "model": "Mistral_LeChat",
            "score": 22.0,
            "passed": true
          },
          {
            "model": "Sonnet4.5_Thinking",
            "score": 22.0,
            "passed": true
          },
          {
            "model": "example_model",
            "score": 22.0,
            "passed": true
          },
          {
            "model": "GPT5_Thinking",
            "score": 21.0,
            "passed": true
          }
        ]
      },
      "prompt_4": {
        "best_score": 22.0,
        "avg_score": 20.0,
        "pass_rate": 100.0,
        "ranking": [
          {
            "model": "Sonnet4.5_Thinking",
            "score": 22.0,
            "passed": true
          },
          {
            "model": "example_model",
            "score": 22.0,
            "passed": true
          },
          {
            "model": "Mistral_LeChat",
            "score": 21.0,
            "passed": true
          },
          {
            "model": "GPT5_Thinking",
            "score": 15.0,
            "passed": true
          }
        ]
      }
    },
    "summary_stats": {}
  },
  "_metadata": {
    "spec_version": "0.8.0",
    "git_commit": "6fbb2b4",
    "python_version": "3.13.7",
    "platform": "Windows-11-10.0.26100-SP0",
    "timestamp_utc": "2025-10-11T20:48:53.047029Z",
    "dependency_fingerprint": "efa462512888b811"
  }
}
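
Because the file is plain JSON, downstream tooling can consume it directly; a minimal sketch that prints the ranking from the structure above:

import json
from pathlib import Path

data = json.loads(Path("results/latest_results.json").read_text(encoding="utf-8"))
for entry in data["comparison"]["ranking"]:
    print(f"{entry['model']}: {entry['percentage']}%")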

Comparison Chart Format

Model Comparison:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

model_1         [████████████████████████████████████████████░░░░] 94.3%
model_2         [███████████████████████████████████████████░░░░░] 91.8%
model_3         [████████████████████████████████████░░░░░░░░░░░░] 76.5%
example_model   [██████████████████████████████████████████████░░] 92.2%

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Configuration Files

Test Data Configuration

Configuration files are generated by scripts/bootstrap_repo.py and located in test_data/:

  • config.yaml - Deliberately broken YAML configuration (multi-document)
  • user_data.json - Sample user data for transformation
  • process_records.py - Python script requiring refactoring

Loading configuration snippets (collapsed)

Safe YAML loading
import yaml

# Load multi-document YAML safely
with open('test_data/config.yaml', encoding='utf-8') as f:
    docs = list(yaml.safe_load_all(f))
    # Merge documents (last document wins)
    config = {}
    for doc in docs:
        if doc:
            config.update(doc)
Configuration structure
use_legacy_paths: true
paths:
  data_source: /srv/data/production/users.json
  legacy_data_source: ./user_data.json
  log_file: /var/log/processor.log
validation_rules:
  min_age: 18
  max_age: 120
  required_fields:
    - name
    - email
    - country
processing_options:
  batch_size: 100
  timeout_seconds: 30
  retry_attempts: 3
api_keys:
  - primary_key
  - secondary_key
  - backup_key
feature_flags:
  enable_logging: true
  strict_validation: false
  debug_mode: false

Validation functions (collapsed)

Prompt 1 (Code Refactoring), 2 (YAML/JSON Correction), 3 (Data Transformation), 4 (API Integration)

Prompt 1: Code Refactoring

def validate_prompt1_refactoring(solution_path: str) -> dict:
    """
    Validate refactored Python code.

    Returns:
        dict: {
            'score': float,
            'max_score': 25,
            'passed': bool,
            'details': {
                'syntax': bool,
                'execution': bool,
                'security': list,
                'performance': list,
                'maintainability': list
            }
        }
    """

Prompt 2: YAML/JSON Correction

def validate_prompt2_yaml_json(yaml_path: str, json_path: str) -> dict:
    """
    Validate corrected YAML and JSON files.

    Returns:
        dict: {
            'score': float,
            'max_score': 25,
            'passed': bool,
            'details': {
                'yaml_valid': bool,
                'json_valid': bool,
                'equivalence': bool,
                'structure': dict
            }
        }
    """

Prompt 3: Data Transformation

def validate_prompt3_transformation(transform_path: str) -> dict:
    """
    Validate data transformation function.

    Returns:
        dict: {
            'score': float,
            'max_score': 25,
            'passed': bool,
            'details': {
                'function_exists': bool,
                'signature_correct': bool,
                'transformations': dict,
                'business_rules': bool
            }
        }
    """

Prompt 4: API Integration

def validate_prompt4_api_integration(api_path: str) -> dict:
    """
    Validate API integration function.

    Returns:
        dict: {
            'score': float,
            'max_score': 25,
            'passed': bool,
            'details': {
                'function_exists': bool,
                'authentication': bool,
                'error_handling': dict,
                'security': dict
            }
        }
    """

Error Handling

Common Exit Codes

Code  Meaning             Resolution
0     Success             -
1     General error       Check error message
2     Missing model       Verify model name and directory
3     Validation failure  Review submission files
4     Timeout exceeded    Increase timeout or optimize code
5     File not found      Ensure all required files exist
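
A wrapper can branch on these codes; a minimal sketch (the mapping mirrors the table above):

import subprocess
import sys

MEANINGS = {0: "Success", 1: "General error", 2: "Missing model",
            3: "Validation failure", 4: "Timeout exceeded", 5: "File not found"}

proc = subprocess.run([sys.executable, "run_benchmark.py", "--model", "example_model"])
print(f"exit {proc.returncode}: {MEANINGS.get(proc.returncode, 'unknown')}")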

Platform-Specific Notes

Windows

  • Use python or py command
  • Paths use backslashes or forward slashes
  • PowerShell may require execution policy adjustment

macOS/Linux

  • Use python3 command
  • Ensure proper file permissions
  • May need to use sudo for certain operations

Docker

  • Mount submissions directory as volume
  • Set environment variables in container
  • Use non-root user for security

Python API

Core Modules

benchmark.runner

Main benchmark execution module.

from benchmark.runner import BenchmarkRunner

# Initialize runner
runner = BenchmarkRunner(
    model_name="gpt4",
    results_dir="results/",
    timeout=30
)

# Run all tests
results = runner.run_all_tests()

# Run specific test
prompt1_result = runner.run_test("prompt_1_refactoring")

benchmark.validators

Validation logic for each prompt.

from benchmark.validators import (
    validate_prompt1_refactoring,
    validate_prompt2_yaml_json,
    validate_prompt3_transformation,
    validate_prompt4_api_integration
)

# Validate refactored code
result = validate_prompt1_refactoring(
    solution_path="submissions/model/prompt_1_solution.py"
)
print(f"Score: {result['score']}/{result['max_score']}")

benchmark.scoring

Scoring engine with 7-category assessment.

from benchmark.scoring import calculate_score, get_grade

# Calculate total score
results = {
    'prompt_1': {'score': 23.5, 'max_score': 25},
    'prompt_2': {'score': 25.0, 'max_score': 25},
    'prompt_3': {'score': 24.5, 'max_score': 25},
    'prompt_4': {'score': 21.0, 'max_score': 25}
}

total_score = calculate_score(results)
grade = get_grade(total_score)
print(f"Total: {total_score}/100 - Grade: {grade}")

benchmark.utils

Utility functions for file operations and formatting.

from benchmark.utils import (
    safe_load_json,
    safe_load_yaml,
    format_results,
    create_comparison_chart
)

# Safe file loading with error handling
data = safe_load_json("user_data.json")
config = safe_load_yaml("config.yaml")

# Format results for display
formatted = format_results(results)
print(formatted)

# Create comparison chart
chart = create_comparison_chart(results)
print(chart)

Full API reference (collapsed)

benchmark.scoring

Scoring system for AI Code Benchmark.

BenchmarkScorer

BenchmarkScorer()

Handles scoring and grading for the benchmark.

calculate_grade

calculate_grade(percentage)

Convert percentage score to letter grade.

generate_feedback_summary

generate_feedback_summary(results)

Generate high-level feedback based on results.

generate_improvement_suggestions

generate_improvement_suggestions(results)

Generate specific improvement suggestions based on test results.

compare_models

compare_models(all_results)

Generate detailed model comparison analysis.

generate_badge

generate_badge(percentage)

Generate a badge/achievement based on performance.

benchmark.validators

Validators for AI Code Benchmark prompts.

ScoringDetail

ScoringDetail(max_points)

Helper class to track detailed scoring with rationale.

add_check

add_check(name, passed, points, rationale='')

Add a scoring check with detailed rationale.

get_feedback_line

get_feedback_line(category_name)

Generate detailed feedback line with breakdown.

SecurityAnalyzer

Analyzes code for common security vulnerabilities.

check_sql_injection_patterns staticmethod

check_sql_injection_patterns(code)

Check for potential SQL injection vulnerabilities.

check_hardcoded_secrets staticmethod

check_hardcoded_secrets(code)

Check for hardcoded secrets and API keys.

check_path_traversal staticmethod

check_path_traversal(code)

Check for path traversal vulnerabilities.

analyze_code_security classmethod

analyze_code_security(code)

Perform comprehensive security analysis on code.
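
Hedged usage sketch (the exact return shape is not documented here, so the result is printed as-is):

from benchmark.validators import SecurityAnalyzer

snippet = 'query = "SELECT * FROM users WHERE id = " + user_id'
findings = SecurityAnalyzer.analyze_code_security(snippet)
print(findings)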

PerformanceAnalyzer

Analyzes code for performance issues and inefficient patterns.

check_nested_loops staticmethod

check_nested_loops(code)

Check for O(n²) and nested loop patterns that may be inefficient.

check_inefficient_patterns staticmethod

check_inefficient_patterns(code)

Check for common inefficient programming patterns.

check_memory_patterns staticmethod

check_memory_patterns(code)

Check for potential memory inefficiencies.

check_algorithm_efficiency staticmethod

check_algorithm_efficiency(code)

Check for algorithmically inefficient approaches.

analyze_code_performance classmethod

analyze_code_performance(code)

Perform comprehensive performance analysis on code.

MaintainabilityAnalyzer

Analyzes code for maintainability issues and code quality metrics.

check_function_length staticmethod

check_function_length(code)

Check for overly long functions (>20 lines).

check_code_duplication staticmethod

check_code_duplication(code)

Check for obvious code duplication patterns.

check_variable_naming staticmethod

check_variable_naming(code)

Check for poor variable naming practices.

check_complexity_indicators staticmethod

check_complexity_indicators(code)

Check for high complexity indicators.

analyze_code_maintainability classmethod

analyze_code_maintainability(code)

Perform comprehensive maintainability analysis on code.

PromptValidators

PromptValidators(test_data_dir, model_name='default')

Validates solutions for each benchmark prompt.

validate_prompt_1_refactoring

validate_prompt_1_refactoring(solution_file)

Validate Prompt 1: Code Refactoring & Analysis.

validate_prompt_3_transformation

validate_prompt_3_transformation(transform_file)

Validate Prompt 3: Data Transformation.

validate_prompt_4_api

validate_prompt_4_api(api_file)

Validate Prompt 4: API Integration with behavioral testing.

run_in_sandbox

run_in_sandbox(fn, self, *args, **kwargs)

Internal helper executing a validator method inside a SecureRunner sandbox.

Mirrors the prior sandbox logic while remaining reusable and type-safe.

sandbox_validator

sandbox_validator(fn)

Decorator for PromptValidators instance methods preserving signature & return type.

benchmark.secure_runner

benchmark.secure_runner.SecureRunner

SecureRunner(model_name, allow_network=False)

Execute untrusted code in an isolated temporary environment.

sandbox

sandbox()

Context manager establishing the sandbox directory and environment.

run_with_limits

run_with_limits(func, *args, timeout=30, memory_mb=512)

Execute func(*args) under CPU & memory limits where supported.

memory_mb: default memory cap (MB); override per run with the --mem flag (see CLI Reference).
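
Sketch of a limited call (assumes an active sandbox context, as required for run_python_sandboxed below; some_task is a hypothetical callable):

from benchmark.secure_runner import SecureRunner

def some_task() -> str:
    return "done"  # stand-in for real work executed under limits

runner = SecureRunner(model_name="example_model")
with runner.sandbox():
    outcome = runner.run_with_limits(some_task, timeout=10, memory_mb=512)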

run_python_sandboxed

run_python_sandboxed(args, *, timeout=10, cwd=None, memory_mb=512)

Execute a python module or script with -B inside the sandbox.

The caller MUST be inside a with self.sandbox(): context so that the sitecustomize guard and strict environment are active. Applies resource limits: rlimits on POSIX, Job Objects on Windows.

benchmark.utils

Utility functions for AI Code Benchmark.

load_test_data

load_test_data(test_data_dir)

Load all test data files.

ensure_directories

ensure_directories(dirs)

Ensure all specified directories exist.

create_submission_template

create_submission_template(submissions_dir)

Create template directory in tiered structure.

Legacy fallback removed: always uses submissions/templates/template.

generate_comparison_chart

generate_comparison_chart(results, output_file)

Generate a simple text-based comparison chart.

validate_submission_structure

validate_submission_structure(model_dir)

Validate that a model submission has the correct file structure.

get_model_statistics

get_model_statistics(results)

Extract key statistics from benchmark results.

Always returns a structured stats object (never an empty dict) so that callers can rely on keys without defensive existence checks.
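
Illustrative call (a sketch; that it accepts the results dict from run_single_model is an assumption here):

from run_benchmark import AICodeBenchmark
from benchmark.utils import get_model_statistics

bench = AICodeBenchmark(submissions_dir="submissions", results_dir="results")
results = bench.run_single_model("example_model")  # as in "Programmatic usage" above
stats = get_model_statistics(results)
print(sorted(stats))  # the full key set is always present, per the note above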

benchmark.types

Shared TypedDict result shapes for benchmark validators and runner.

Centralizing these shapes ensures a single source of truth for prompt-level and model-level result contracts across the core pipeline (validators, runner, comparison utilities). Runtime behavior remains unchanged; this is purely a typing/structure consolidation.

PromptResult

Bases: TypedDict

Result for a single prompt validation.

total=False so optional fields (like error/traceback) don't require union types everywhere; absent keys are naturally treated as optional.
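
The concrete keys can be read off the JSON sample above. A sketch of the shape (field names taken from that sample, not from the source):

from typing import TypedDict

class PromptResultSketch(TypedDict, total=False):
    passed: bool
    score: float
    max_score: int
    feedback: list[str]
    tests_passed: dict[str, bool]
    detailed_scoring: dict[str, dict[str, float]]
    error: str       # optional; only present on failures (hence total=False)
    traceback: str   # likewise optional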

ModelResults

Bases: TypedDict

Aggregated results for a single model over all prompts.

See Also