API Reference¶
Live, importable modules from the repo. The build step sets `PYTHONPATH`, so the imports below work without installing a wheel.
Reading order
Start with the Module map to see what's where, then skim the snippets and the CLI reference. The Full reference at the bottom is collapsed by default.
Module map¶
- `benchmark.scoring` - Scoring engine and helpers: `BenchmarkScorer`, `calculate_grade`, comparison helpers.
- `benchmark.validators` - Prompt validators and analyzers (security, performance, maintainability).
- `benchmark.secure_runner` - Sandboxed execution utilities; `SecureRunner` provides the sandbox context and guarded execution.
- `benchmark.utils` - Utilities for test data loading, directory setup, comparisons, and model stats.
- `benchmark.types` - TypedDict schemas describing prompt results and overall output structures.
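A quick way to confirm the path setup (a minimal sketch, run from the repo root with the build's `PYTHONPATH` in effect):

```python
# Smoke test: every documented module should import from the repo checkout.
import benchmark.scoring
import benchmark.secure_runner
import benchmark.types
import benchmark.utils
import benchmark.validators

print("benchmark modules import cleanly")
```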
Selected snippets¶
Strict sandbox environment (minimal, readable extract):
```python
def _prepare_environment(self, sandbox_dir: Path) -> None:
    os.environ.clear()
    home_dir = sandbox_dir / "home"
    tmp_dir = sandbox_dir / "temp"
    home_dir.mkdir(exist_ok=True)
    tmp_dir.mkdir(exist_ok=True)
    base_env = {
        "HOME": str(home_dir),
        "USERPROFILE": str(home_dir),  # Windows
        "TEMP": str(tmp_dir),
        "TMP": str(tmp_dir),
        "TMPDIR": str(tmp_dir),
        "PYTHONDONTWRITEBYTECODE": "1",
        "AIBUGBENCH_SANDBOX_ROOT": str(sandbox_dir.resolve()),
        "AIBUGBENCH_ALLOW_NETWORK": "1" if self.allow_network else "0",
    }
    # Pass through only an allowlist of keys from the original environment.
    for key in ["PATH", "SystemRoot", "WINDIR", "COMSPEC",
                "NUMBER_OF_PROCESSORS", "PROCESSOR_ARCHITECTURE", "LANG", "LC_ALL"]:
        val = self._original_env.get(key)
        if val:
            base_env[key] = val
    os.environ.update(base_env)
```
Run a Python entry inside the sandbox (ensures guards via `sitecustomize.py`):

```python
def run_python_sandboxed(self, args: list[str], *, timeout: int = 10,
                         cwd: Path | None = None, memory_mb: int = 512):
    cmd = [sys.executable, "-B", *args]  # -B keeps .pyc off; still loads sitecustomize
    env = os.environ.copy()  # inherit sandbox env
    if cwd:
        # ensure sandbox folder (with sitecustomize.py) is on import path
        env["PYTHONPATH"] = str(cwd) + (os.pathsep + env["PYTHONPATH"] if env.get("PYTHONPATH") else "")
    # platform-specific resource limits applied here...
    return subprocess.run(cmd, cwd=str(cwd) if cwd else None, env=env,
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          text=True, timeout=timeout, check=False)
```
Snippets from the runner¶
Robust Unicode-safe printing¶
```python
def safe_print(self, message: str) -> None:
    try:
        print(message)
    except UnicodeEncodeError:
        # Fall back to ASCII on consoles that cannot encode the message.
        ascii_message = message.encode("ascii", "ignore").decode("ascii")
        print(ascii_message)
    except Exception as e:
        with contextlib.suppress(Exception):
            print(f"Print error: {e!s}")
```
Detailed scoring formatting (compact, two-line display)¶
```python
def format_detailed_score(self, detailed_scoring: dict[str, Any]) -> str:
    lines, categories = [], []
    order = ["syntax", "structure", "execution", "quality",
             "security", "performance", "maintainability"]
    for cat in order:
        if cat in detailed_scoring:
            s = detailed_scoring[cat]
            categories.append(f"{cat.title()}: {s.get('earned', 0):.1f}/{s.get('max', 0):.1f}")
    mid = len(categories) // 2
    if categories:
        lines.append(f"   └─ {', '.join(categories[:mid])}")
    if len(categories) > mid:
        lines.append(f"      {', '.join(categories[mid:])}")
    return "\n".join(lines)
```
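For illustration, feeding it a partial `detailed_scoring` dict produces the two-line layout seen in the terminal output below (the `runner` receiver is hypothetical; any object exposing the method works):

```python
# Illustrative input; `runner` is a hypothetical object exposing the method.
detailed = {
    "syntax": {"earned": 5.0, "max": 5.0},
    "structure": {"earned": 2.4, "max": 3.0},
    "execution": {"earned": 6.0, "max": 6.0},
    "quality": {"earned": 3.0, "max": 3.0},
}
print(runner.format_detailed_score(detailed))
#    └─ Syntax: 5.0/5.0, Structure: 2.4/3.0
#       Execution: 6.0/6.0, Quality: 3.0/3.0
```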
Atomic result write (tmp file swap)¶
```python
def _atomic_write_json(self, path: Path, data: Any) -> None:
    tmp = path.with_suffix(path.suffix + ".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    os.replace(tmp, path)
```
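Because `os.replace` is a single atomic rename (on POSIX and Windows alike, for paths on the same filesystem), readers of `path` always see either the previous complete file or the new one, never a partial write.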
Recipes¶
Execute an existing script inside a sandbox¶
```python
from pathlib import Path

from benchmark.secure_runner import SecureRunner

runner = SecureRunner(model_name="example_model", allow_network=False)

with runner.sandbox() as root:
    result = runner.run_python_sandboxed(
        ["-m", "module_to_run", "--flag"],
        cwd=Path(root),
        timeout=10,
        memory_mb=512,
    )

print(result.stdout)
```
Parse CLI args without running the benchmark¶
```python
from run_benchmark import parse_args

args = parse_args(["--model", "example_model", "--mem", "768", "--quiet"])
assert args.model == "example_model" and args.mem == 768 and args.quiet
```
CLI Reference¶
Usage¶
```text
python run_benchmark.py [--model NAME | --all-models] [--workers N]
                        [--submissions-dir DIR] [--results-dir DIR]
                        [--mem {256,384,512,768,1024}]
                        [--unsafe] [--allow-network] [--trusted-model]
                        [--no-metadata] [-q|--quiet]
```
Arguments¶
| Flag | Type / Values | Default | Description |
|---|---|---|---|
| `--model` | string | - | Test a single model by name. |
| `--all-models` | flag | false | Test all discovered models (if supported by your runner). |
| `--workers` | int | 1 | Number of concurrent workers when testing multiple models. |
| `--submissions-dir` | path | `submissions` | Root directory containing model submissions. |
| `--results-dir` | path | `results` | Directory where results, summaries, and charts are written. |
| `--mem` | one of 256, 384, 512, 768, 1024 | 512 | Memory limit (MB) for sandboxed execution. |
| `--unsafe` | flag | false | Disable sandbox/resource isolation. Dangerous; for trusted runs only. |
| `--allow-network` | flag | false | Allow network access during execution. |
| `--trusted-model` | flag | false | Suppress the unsafe-mode confirmation (use in CI for trusted submissions). |
| `--no-metadata` | flag | false | Skip environment/git/dependency metadata collection. |
| `-q`, `--quiet` | flag | false | Suppress non-essential output. |
Examples¶
```bash
# Single model, default sandbox & limits
python run_benchmark.py --model example_model

# All models with 4 workers, custom results dir
python run_benchmark.py --all-models --workers 4 --results-dir out/results

# CI-like trusted run with network allowed and a larger RAM cap
python run_benchmark.py --model gpt4 --unsafe --trusted-model --allow-network --mem 1024 -q
```
Environment Variables¶
| Variable | Default | Description |
|---|---|---|
| `AIBUGBENCH_RESULTS_DIR` | `results/` | Override the default results directory |
| `AIBUGBENCH_TIMEOUT` | 30 | Default operation timeout (seconds) |
| `AIBUGBENCH_DEBUG` | false | Enable debug logging |
| `PYTHONPATH` | - | Include the benchmark modules on the import path |
Programmatic usage (high level)¶
```python
from run_benchmark import AICodeBenchmark

bench = AICodeBenchmark(submissions_dir="submissions", results_dir="results")
result = bench.run_single_model("example_model")
print(result["overall_score"], result["percentage"])
```
Representative output (terminal, JSON, results files)¶
Terminal output + security banner¶
Example terminal output (single-model run):

```text
> python run_benchmark.py

╔══════════════════════════════════════╗
║      AIBugBench Security Status      ║
╠══════════════════════════════════════╣
║ Sandboxing:      ENABLED             ║
║ Network:         BLOCKED             ║
║ Subprocess:      BLOCKED             ║
║ Filesystem:      CONFINED            ║
║ Env Clean:       CLEANED             ║
║ Resource Limits: ENFORCED            ║
║ Trusted Model:   YES                 ║
╚══════════════════════════════════════╝

Discovered models: reference=1 user=0 templates=OK
🔍 Discovered 1 model(s): example_model

Testing model: example_model
==================================================

📝 Testing Refactoring & Analysis...
✅ PASSED - Score: 23.17/25
   └─ Syntax: 5.0/5.0, Structure: 2.4/3.0, Execution: 6.0/6.0
      Quality: 3.0/3.0, Security: 4.0/4.0, Performance: 1.9/2.0, Maintainability: 0.9/2.0

📝 Testing YAML/JSON Correction...
✅ PASSED - Score: 25.00/25
   └─ Syntax: 4.0/4.0, Structure: 6.0/6.0, Execution: 8.0/8.0
      Quality: 6.0/6.0, Security: 1.0/1.0, Performance: 0.0/0.0, Maintainability: 0.0/0.0

📝 Testing Data Transformation...
2025-09-30 02:19:47,031 - transform_module - WARNING - User 103: Email is null, cannot extract provider
2025-09-30 02:19:47,031 - transform_module - WARNING - User 999: Missing or invalid 'contact' field
2025-09-30 02:19:47,032 - transform_module - WARNING - User 999: Missing or invalid 'stats' field
2025-09-30 02:19:47,032 - transform_module - INFO - Successfully transformed 6 users
✅ PASSED - Score: 22.00/25
   └─ Syntax: 3.0/3.0, Structure: 3.0/3.0, Execution: 12.0/12.0
      Quality: 3.0/3.0, Security: 0.0/1.0, Performance: 1.0/1.0, Maintainability: 0.0/2.0

📝 Testing API Simulation...
2025-09-30 02:19:47,072 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,072 - api_module - INFO - Successfully synced users. Job ID: abc123
✅ Sync successful! Job ID: abc123
2025-09-30 02:19:47,072 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,072 - api_module - WARNING - Unexpected success status code: 400
⚠️ Warning: Unexpected response status 400
2025-09-30 02:19:47,072 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,073 - api_module - WARNING - Unexpected success status code: 401
⚠️ Warning: Unexpected response status 401
2025-09-30 02:19:47,073 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,073 - api_module - WARNING - Unexpected success status code: 503
⚠️ Warning: Unexpected response status 503
2025-09-30 02:19:47,073 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,073 - api_module - ERROR - Network connection error: Network error
❌ Network Error: Unable to connect to CRM system
   Please check your internet connection and try again
2025-09-30 02:19:47,073 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,073 - api_module - INFO - Successfully synced users. Job ID: test123
✅ Sync successful! Job ID: test123
✅ PASSED - Score: 22.00/25
   └─ Syntax: 2.0/2.0, Structure: 3.0/3.0, Execution: 7.0/7.0
      Quality: 3.0/3.0, Security: 6.0/7.0, Performance: 0.0/2.0, Maintainability: 1.0/1.0

🎯 Final Score: 92.17/100 (92.2%)
📄 Summary report: results\20250930_021946\detailed\summary_report_20250930_021946.txt
📊 Comparison chart: results\20250930_021946\comparison_charts\comparison_chart.txt

🎉 Benchmark completed! Tested 1 model(s)

🏆 Top Performers:
   1. example_model: 92.2%
   2. (n/a)

📁 Detailed results have been saved to:
   • results/latest_results.json - Complete data with detailed scoring
   • results/detailed/summary_report_*.txt - Summary with enhanced feedback
   • results/comparison_charts/comparison_chart_*.txt - Visual comparison with progress bars

For complete scoring breakdowns and analysis, check these files in the /results directory.
```
JSON results file¶
Detailed JSON results file (`results/latest_results.json`):

```json
{
"benchmark_run": {
"timestamp": "2025-10-11T22:48:46.444397",
"total_models": 4
},
"models": {
"GPT5_Thinking": {
"model_name": "GPT5_Thinking",
"timestamp": "2025-10-11T22:48:46.444621",
"prompts": {
"prompt_1": {
"passed": true,
"score": 23.166666666666664,
"max_score": 25,
"feedback": [
"✅ Python Syntax (5.0/5.0): ✓valid_syntax",
"⚠️ Code Structure (2.4/3.0): ✓yaml_import, json_import, error_handling, type_hints ✗logging",
"✅ Execution (6.0/6.0): ✓runs_without_error, json_output_validation",
"✅ Code Quality (3.0/3.0): ✓no_global_vars, uses_pathlib, has_main_guard, proper_file_handling",
"✅ Security Analysis (4.0/4.0): ✓no_security_issues",
"⚠️ Performance Analysis (1.9/2.0): ✓no_nested_loops",
"⚠️ Maintainability Analysis (0.9/2.0): ✓no_duplication, good_naming ✗no_long_functions"
],
"tests_passed": {
"valid_python": true,
"runs_successfully": true,
"good_quality": true,
"secure_code": true
},
"detailed_scoring": {
"syntax": {
"earned": 5.0,
"max": 5.0
},
"structure": {
"earned": 2.4,
"max": 3.0
},
"execution": {
"earned": 6.0,
"max": 6.0
},
"quality": {
"earned": 3.0,
"max": 3.0
},
"security": {
"earned": 4.0,
"max": 4.0
},
"performance": {
"earned": 1.8666666666666665,
"max": 2.0
},
"maintainability": {
"earned": 0.9,
"max": 2.0
}
}
},
"prompt_2": {
"passed": true,
"score": 25.0,
"max_score": 25,
"feedback": [
"✅ Syntax (4.0/4.0): ✓yaml_parses, json_parses",
"✅ Structure (6.0/6.0): ✓required_keys, nested_shapes, arrays_scalars",
"✅ Execution (8.0/8.0): ✓deep_equivalence, partial_matches",
"✅ Quality (6.0/6.0): ✓yaml_indentation, json_literals, formatting_style, no_duplication",
"✅ Security (1.0/1.0): ✓yaml_safety"
],
"tests_passed": {
"valid_yaml": true,
"valid_json": true,
"structure_preserved": true,
"equivalence_test": true,
"correct_types": true
},
"detailed_scoring": {
"syntax": {
"earned": 4.0,
"max": 4
},
"structure": {
"earned": 6.0,
"max": 6
},
"execution": {
"earned": 8.0,
"max": 8
},
"quality": {
"earned": 6.0,
"max": 6
},
"security": {
"earned": 1.0,
"max": 1
},
"performance": {
"earned": 0,
"max": 0
},
"maintainability": {
"earned": 0,
"max": 0
}
}
},
"prompt_3": {
"passed": true,
"score": 21.0,
"max_score": 25,
"feedback": [
"✅ Syntax (3.0/3.0): ✓file_compiles, function_exists",
"✅ Structure (3.0/3.0): ✓correct_signature, basic_organization",
"✅ Execution (12.0/12.0): ✓function_runs, id_standardization, email_provider, age_normalization, account_tiers, error_handling",
"⚠️ Quality (2.0/3.0): ✓try_except, type_conversions ✗readable_loops",
"❌ Security (0.0/1.0): ✗no_unsafe_constructs",
"✅ Performance (1.0/1.0): ✓single_pass",
"❌ Maintainability (0.0/2.0): ✗code_organization"
],
"tests_passed": {
"function_exists": true,
"no_crash": true,
"id_standardization": true,
"email_provider": true,
"account_tiers": true
},
"detailed_scoring": {
"syntax": {
"earned": 3.0,
"max": 3
},
"structure": {
"earned": 3.0,
"max": 3
},
"execution": {
"earned": 12.0,
"max": 12
},
"quality": {
"earned": 2.0,
"max": 3
},
"security": {
"earned": 0.0,
"max": 1
},
"performance": {
"earned": 1.0,
"max": 1
},
"maintainability": {
"earned": 0.0,
"max": 2
}
}
},
"prompt_4": {
"passed": true,
"score": 15.0,
"max_score": 25,
"feedback": [
"✅ Syntax (2.0/2.0): ✓file_compiles, function_exists",
"✅ Structure (3.0/3.0): ✓correct_signature, request_structure",
"⚠️ Execution (4.0/7.0): ✓handle_400, handle_401, handle_503, handle_connection_error ✗success_handling, json_parsing",
"✅ Quality (3.0/3.0): ✓informative_errors, documentation",
"❌ Security (0.0/7.0): ✗security_analysis",
"✅ Performance (2.0/2.0): ✓retry_resilience",
"✅ Maintainability (1.0/1.0): ✓code_organization"
],
"tests_passed": {
"function_signature": true,
"uses_requests": true,
"error_handling": true,
"api_structure": true
},
"detailed_scoring": {
"syntax": {
"earned": 2.0,
"max": 2
},
"structure": {
"earned": 3.0,
"max": 3
},
"execution": {
"earned": 4.0,
"max": 7
},
"quality": {
"earned": 3.0,
"max": 3
},
"security": {
"earned": 0.0,
"max": 7
},
"performance": {
"earned": 2.0,
"max": 2
},
"maintainability": {
"earned": 1.0,
"max": 1
}
}
}
},
"overall_score": 84.16666666666666,
"total_possible": 100,
"percentage": 84.2
},
"Mistral_LeChat": {
"model_name": "Mistral_LeChat",
"timestamp": "2025-10-11T22:48:52.651905",
"prompts": {
"prompt_1": {
"passed": true,
"score": 16.316666666666666,
"max_score": 25,
"feedback": [
"✅ Python Syntax (5.0/5.0): ✓valid_syntax",
"⚠️ Code Structure (1.8/3.0): ✓yaml_import, json_import, error_handling ✗logging, type_hints",
"❌ Execution (0.0/6.0): ✗correct_filtering",
"⚠️ Code Quality (2.2/3.0): ✓no_global_vars, has_main_guard, proper_file_handling ✗uses_pathlib",
"✅ Security Analysis (4.0/4.0): ✓no_security_issues",
"⚠️ Performance Analysis (1.9/2.0): ✓no_nested_loops",
"⚠️ Maintainability Analysis (1.4/2.0): ✓no_long_functions, no_duplication, good_naming"
],
"tests_passed": {
"valid_python": true,
"good_quality": true,
"secure_code": true
},
"detailed_scoring": {
"syntax": {
"earned": 5.0,
"max": 5.0
},
"structure": {
"earned": 1.7999999999999998,
"max": 3.0
},
"execution": {
"earned": 0.0,
"max": 6.0
},
"quality": {
"earned": 2.25,
"max": 3.0
},
"security": {
"earned": 4.0,
"max": 4.0
},
"performance": {
"earned": 1.8666666666666665,
"max": 2.0
},
"maintainability": {
"earned": 1.4,
"max": 2.0
}
}
},
"prompt_2": {
"passed": true,
"score": 25.0,
"max_score": 25,
"feedback": [
"✅ Syntax (4.0/4.0): ✓yaml_parses, json_parses",
"✅ Structure (6.0/6.0): ✓required_keys, nested_shapes, arrays_scalars",
"✅ Execution (8.0/8.0): ✓deep_equivalence, partial_matches",
"✅ Quality (6.0/6.0): ✓yaml_indentation, json_literals, formatting_style, no_duplication",
"✅ Security (1.0/1.0): ✓yaml_safety"
],
"tests_passed": {
"valid_yaml": true,
"valid_json": true,
"structure_preserved": true,
"equivalence_test": true,
"correct_types": true
},
"detailed_scoring": {
"syntax": {
"earned": 4.0,
"max": 4
},
"structure": {
"earned": 6.0,
"max": 6
},
"execution": {
"earned": 8.0,
"max": 8
},
"quality": {
"earned": 6.0,
"max": 6
},
"security": {
"earned": 1.0,
"max": 1
},
"performance": {
"earned": 0,
"max": 0
},
"maintainability": {
"earned": 0,
"max": 0
}
}
},
"prompt_3": {
"passed": true,
"score": 22.0,
"max_score": 25,
"feedback": [
"✅ Syntax (3.0/3.0): ✓file_compiles, function_exists",
"✅ Structure (3.0/3.0): ✓correct_signature, basic_organization",
"✅ Execution (12.0/12.0): ✓function_runs, id_standardization, email_provider, age_normalization, account_tiers, error_handling",
"✅ Quality (3.0/3.0): ✓try_except, type_conversions, readable_loops",
"❌ Security (0.0/1.0): ✗no_unsafe_constructs",
"✅ Performance (1.0/1.0): ✓single_pass",
"❌ Maintainability (0.0/2.0): ✗code_organization"
],
"tests_passed": {
"function_exists": true,
"no_crash": true,
"id_standardization": true,
"email_provider": true,
"account_tiers": true
},
"detailed_scoring": {
"syntax": {
"earned": 3.0,
"max": 3
},
"structure": {
"earned": 3.0,
"max": 3
},
"execution": {
"earned": 12.0,
"max": 12
},
"quality": {
"earned": 3.0,
"max": 3
},
"security": {
"earned": 0.0,
"max": 1
},
"performance": {
"earned": 1.0,
"max": 1
},
"maintainability": {
"earned": 0.0,
"max": 2
}
}
},
"prompt_4": {
"passed": true,
"score": 21.0,
"max_score": 25,
"feedback": [
"✅ Syntax (2.0/2.0): ✓file_compiles, function_exists",
"✅ Structure (3.0/3.0): ✓correct_signature, request_structure",
"✅ Execution (7.0/7.0): ✓success_handling, handle_400, handle_401, handle_503, handle_connection_error, json_parsing",
"✅ Quality (3.0/3.0): ✓informative_errors, documentation",
"⚠️ Security (5.0/7.0): ✓bearer_auth, no_token_leak ✗explicit_timeout",
"❌ Performance (0.0/2.0): ✗retry_resilience",
"✅ Maintainability (1.0/1.0): ✓code_organization"
],
"tests_passed": {
"function_signature": true,
"uses_requests": true,
"error_handling": true,
"api_structure": true
},
"detailed_scoring": {
"syntax": {
"earned": 2.0,
"max": 2
},
"structure": {
"earned": 3.0,
"max": 3
},
"execution": {
"earned": 7.0,
"max": 7
},
"quality": {
"earned": 3.0,
"max": 3
},
"security": {
"earned": 5.0,
"max": 7
},
"performance": {
"earned": 0.0,
"max": 2
},
"maintainability": {
"earned": 1.0,
"max": 1
}
}
}
},
"overall_score": 84.31666666666666,
"total_possible": 100,
"percentage": 84.3
},
"Sonnet4.5_Thinking": {
"model_name": "Sonnet4.5_Thinking",
"timestamp": "2025-10-11T22:48:52.739921",
"prompts": {
"prompt_1": {
"passed": true,
"score": 17.166666666666664,
"max_score": 25,
"feedback": [
"✅ Python Syntax (5.0/5.0): ✓valid_syntax",
"⚠️ Code Structure (2.4/3.0): ✓yaml_import, json_import, error_handling, type_hints ✗logging",
"❌ Execution (0.0/6.0): ✗runs_without_error, correct_filtering",
"✅ Code Quality (3.0/3.0): ✓no_global_vars, uses_pathlib, has_main_guard, proper_file_handling",
"✅ Security Analysis (4.0/4.0): ✓no_security_issues",
"⚠️ Performance Analysis (1.9/2.0): ✓no_nested_loops",
"⚠️ Maintainability Analysis (0.9/2.0): ✓no_duplication, good_naming ✗no_long_functions"
],
"tests_passed": {
"valid_python": true,
"good_quality": true,
"secure_code": true
},
"detailed_scoring": {
"syntax": {
"earned": 5.0,
"max": 5.0
},
"structure": {
"earned": 2.4,
"max": 3.0
},
"execution": {
"earned": 0.0,
"max": 6.0
},
"quality": {
"earned": 3.0,
"max": 3.0
},
"security": {
"earned": 4.0,
"max": 4.0
},
"performance": {
"earned": 1.8666666666666665,
"max": 2.0
},
"maintainability": {
"earned": 0.9,
"max": 2.0
}
}
},
"prompt_2": {
"passed": true,
"score": 23.0,
"max_score": 25,
"feedback": [
"✅ Syntax (4.0/4.0): ✓yaml_parses, json_parses",
"✅ Structure (6.0/6.0): ✓required_keys, nested_shapes, arrays_scalars",
"✅ Execution (8.0/8.0): ✓deep_equivalence, partial_matches",
"⚠️ Quality (4.0/6.0): ✓json_literals, formatting_style, no_duplication ✗yaml_indentation",
"✅ Security (1.0/1.0): ✓yaml_safety"
],
"tests_passed": {
"valid_yaml": true,
"valid_json": true,
"structure_preserved": true,
"equivalence_test": true,
"correct_types": true
},
"detailed_scoring": {
"syntax": {
"earned": 4.0,
"max": 4
},
"structure": {
"earned": 6.0,
"max": 6
},
"execution": {
"earned": 8.0,
"max": 8
},
"quality": {
"earned": 4.0,
"max": 6
},
"security": {
"earned": 1.0,
"max": 1
},
"performance": {
"earned": 0,
"max": 0
},
"maintainability": {
"earned": 0,
"max": 0
}
}
},
"prompt_3": {
"passed": true,
"score": 22.0,
"max_score": 25,
"feedback": [
"✅ Syntax (3.0/3.0): ✓file_compiles, function_exists",
"✅ Structure (3.0/3.0): ✓correct_signature, basic_organization",
"✅ Execution (12.0/12.0): ✓function_runs, id_standardization, email_provider, age_normalization, account_tiers, error_handling",
"✅ Quality (3.0/3.0): ✓try_except, type_conversions, readable_loops",
"❌ Security (0.0/1.0): ✗no_unsafe_constructs",
"✅ Performance (1.0/1.0): ✓single_pass",
"❌ Maintainability (0.0/2.0): ✗code_organization"
],
"tests_passed": {
"function_exists": true,
"no_crash": true,
"id_standardization": true,
"email_provider": true,
"account_tiers": true
},
"detailed_scoring": {
"syntax": {
"earned": 3.0,
"max": 3
},
"structure": {
"earned": 3.0,
"max": 3
},
"execution": {
"earned": 12.0,
"max": 12
},
"quality": {
"earned": 3.0,
"max": 3
},
"security": {
"earned": 0.0,
"max": 1
},
"performance": {
"earned": 1.0,
"max": 1
},
"maintainability": {
"earned": 0.0,
"max": 2
}
}
},
"prompt_4": {
"passed": true,
"score": 22.0,
"max_score": 25,
"feedback": [
"✅ Syntax (2.0/2.0): ✓file_compiles, function_exists",
"✅ Structure (3.0/3.0): ✓correct_signature, request_structure",
"✅ Execution (7.0/7.0): ✓success_handling, handle_400, handle_401, handle_503, handle_connection_error, json_parsing",
"✅ Quality (3.0/3.0): ✓informative_errors, documentation",
"⚠️ Security (6.0/7.0): ✓bearer_auth, no_token_leak, explicit_timeout ✗no_hardcoded_creds",
"❌ Performance (0.0/2.0): ✗retry_resilience",
"✅ Maintainability (1.0/1.0): ✓code_organization"
],
"tests_passed": {
"function_signature": true,
"uses_requests": true,
"error_handling": true,
"api_structure": true
},
"detailed_scoring": {
"syntax": {
"earned": 2.0,
"max": 2
},
"structure": {
"earned": 3.0,
"max": 3
},
"execution": {
"earned": 7.0,
"max": 7
},
"quality": {
"earned": 3.0,
"max": 3
},
"security": {
"earned": 6.0,
"max": 7
},
"performance": {
"earned": 0.0,
"max": 2
},
"maintainability": {
"earned": 1.0,
"max": 1
}
}
}
},
"overall_score": 84.16666666666666,
"total_possible": 100,
"percentage": 84.2
},
"example_model": {
"model_name": "example_model",
"timestamp": "2025-10-11T22:48:52.852209",
"prompts": {
"prompt_1": {
"passed": true,
"score": 23.166666666666664,
"max_score": 25,
"feedback": [
"✅ Python Syntax (5.0/5.0): ✓valid_syntax",
"⚠️ Code Structure (2.4/3.0): ✓yaml_import, json_import, error_handling, type_hints ✗logging",
"✅ Execution (6.0/6.0): ✓runs_without_error, json_output_validation",
"✅ Code Quality (3.0/3.0): ✓no_global_vars, uses_pathlib, has_main_guard, proper_file_handling",
"✅ Security Analysis (4.0/4.0): ✓no_security_issues",
"⚠️ Performance Analysis (1.9/2.0): ✓no_nested_loops",
"⚠️ Maintainability Analysis (0.9/2.0): ✓no_duplication, good_naming ✗no_long_functions"
],
"tests_passed": {
"valid_python": true,
"runs_successfully": true,
"good_quality": true,
"secure_code": true
},
"detailed_scoring": {
"syntax": {
"earned": 5.0,
"max": 5.0
},
"structure": {
"earned": 2.4,
"max": 3.0
},
"execution": {
"earned": 6.0,
"max": 6.0
},
"quality": {
"earned": 3.0,
"max": 3.0
},
"security": {
"earned": 4.0,
"max": 4.0
},
"performance": {
"earned": 1.8666666666666665,
"max": 2.0
},
"maintainability": {
"earned": 0.9,
"max": 2.0
}
}
},
"prompt_2": {
"passed": true,
"score": 25.0,
"max_score": 25,
"feedback": [
"✅ Syntax (4.0/4.0): ✓yaml_parses, json_parses",
"✅ Structure (6.0/6.0): ✓required_keys, nested_shapes, arrays_scalars",
"✅ Execution (8.0/8.0): ✓deep_equivalence, partial_matches",
"✅ Quality (6.0/6.0): ✓yaml_indentation, json_literals, formatting_style, no_duplication",
"✅ Security (1.0/1.0): ✓yaml_safety"
],
"tests_passed": {
"valid_yaml": true,
"valid_json": true,
"structure_preserved": true,
"equivalence_test": true,
"correct_types": true
},
"detailed_scoring": {
"syntax": {
"earned": 4.0,
"max": 4
},
"structure": {
"earned": 6.0,
"max": 6
},
"execution": {
"earned": 8.0,
"max": 8
},
"quality": {
"earned": 6.0,
"max": 6
},
"security": {
"earned": 1.0,
"max": 1
},
"performance": {
"earned": 0,
"max": 0
},
"maintainability": {
"earned": 0,
"max": 0
}
}
},
"prompt_3": {
"passed": true,
"score": 22.0,
"max_score": 25,
"feedback": [
"✅ Syntax (3.0/3.0): ✓file_compiles, function_exists",
"✅ Structure (3.0/3.0): ✓correct_signature, basic_organization",
"✅ Execution (12.0/12.0): ✓function_runs, id_standardization, email_provider, age_normalization, account_tiers, error_handling",
"✅ Quality (3.0/3.0): ✓try_except, type_conversions, readable_loops",
"❌ Security (0.0/1.0): ✗no_unsafe_constructs",
"✅ Performance (1.0/1.0): ✓single_pass",
"❌ Maintainability (0.0/2.0): ✗code_organization"
],
"tests_passed": {
"function_exists": true,
"no_crash": true,
"id_standardization": true,
"email_provider": true,
"account_tiers": true
},
"detailed_scoring": {
"syntax": {
"earned": 3.0,
"max": 3
},
"structure": {
"earned": 3.0,
"max": 3
},
"execution": {
"earned": 12.0,
"max": 12
},
"quality": {
"earned": 3.0,
"max": 3
},
"security": {
"earned": 0.0,
"max": 1
},
"performance": {
"earned": 1.0,
"max": 1
},
"maintainability": {
"earned": 0.0,
"max": 2
}
}
},
"prompt_4": {
"passed": true,
"score": 22.0,
"max_score": 25,
"feedback": [
"✅ Syntax (2.0/2.0): ✓file_compiles, function_exists",
"✅ Structure (3.0/3.0): ✓correct_signature, request_structure",
"✅ Execution (7.0/7.0): ✓success_handling, handle_400, handle_401, handle_503, handle_connection_error, json_parsing",
"✅ Quality (3.0/3.0): ✓informative_errors, documentation",
"⚠️ Security (6.0/7.0): ✓bearer_auth, no_token_leak, explicit_timeout ✗no_hardcoded_creds",
"❌ Performance (0.0/2.0): ✗retry_resilience",
"✅ Maintainability (1.0/1.0): ✓code_organization"
],
"tests_passed": {
"function_signature": true,
"uses_requests": true,
"error_handling": true,
"api_structure": true
},
"detailed_scoring": {
"syntax": {
"earned": 2.0,
"max": 2
},
"structure": {
"earned": 3.0,
"max": 3
},
"execution": {
"earned": 7.0,
"max": 7
},
"quality": {
"earned": 3.0,
"max": 3
},
"security": {
"earned": 6.0,
"max": 7
},
"performance": {
"earned": 0.0,
"max": 2
},
"maintainability": {
"earned": 1.0,
"max": 1
}
}
}
},
"overall_score": 92.16666666666666,
"total_possible": 100,
"percentage": 92.2
}
},
"comparison": {
"ranking": [
{
"model": "example_model",
"score": 92.16666666666666,
"percentage": 92.2
},
{
"model": "Mistral_LeChat",
"score": 84.31666666666666,
"percentage": 84.3
},
{
"model": "GPT5_Thinking",
"score": 84.16666666666666,
"percentage": 84.2
},
{
"model": "Sonnet4.5_Thinking",
"score": 84.16666666666666,
"percentage": 84.2
}
],
"prompt_performance": {
"prompt_1": {
"best_score": 23.166666666666664,
"avg_score": 20.0,
"pass_rate": 100.0,
"ranking": [
{
"model": "GPT5_Thinking",
"score": 23.166666666666664,
"passed": true
},
{
"model": "example_model",
"score": 23.166666666666664,
"passed": true
},
{
"model": "Sonnet4.5_Thinking",
"score": 17.166666666666664,
"passed": true
},
{
"model": "Mistral_LeChat",
"score": 16.316666666666666,
"passed": true
}
]
},
"prompt_2": {
"best_score": 25.0,
"avg_score": 24.5,
"pass_rate": 100.0,
"ranking": [
{
"model": "GPT5_Thinking",
"score": 25.0,
"passed": true
},
{
"model": "Mistral_LeChat",
"score": 25.0,
"passed": true
},
{
"model": "example_model",
"score": 25.0,
"passed": true
},
{
"model": "Sonnet4.5_Thinking",
"score": 23.0,
"passed": true
}
]
},
"prompt_3": {
"best_score": 22.0,
"avg_score": 21.8,
"pass_rate": 100.0,
"ranking": [
{
"model": "Mistral_LeChat",
"score": 22.0,
"passed": true
},
{
"model": "Sonnet4.5_Thinking",
"score": 22.0,
"passed": true
},
{
"model": "example_model",
"score": 22.0,
"passed": true
},
{
"model": "GPT5_Thinking",
"score": 21.0,
"passed": true
}
]
},
"prompt_4": {
"best_score": 22.0,
"avg_score": 20.0,
"pass_rate": 100.0,
"ranking": [
{
"model": "Sonnet4.5_Thinking",
"score": 22.0,
"passed": true
},
{
"model": "example_model",
"score": 22.0,
"passed": true
},
{
"model": "Mistral_LeChat",
"score": 21.0,
"passed": true
},
{
"model": "GPT5_Thinking",
"score": 15.0,
"passed": true
}
]
}
},
"summary_stats": {}
},
"_metadata": {
"spec_version": "0.8.0",
"git_commit": "6fbb2b4",
"python_version": "3.13.7",
"platform": "Windows-11-10.0.26100-SP0",
"timestamp_utc": "2025-10-11T20:48:53.047029Z",
"dependency_fingerprint": "efa462512888b811"
}
}
```
Comparison Chart Format¶
```text
Model Comparison:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
model_1        [████████████████████████████████████████████░░░░] 94.3%
model_2        [███████████████████████████████████████████░░░░░] 91.8%
model_3        [████████████████████████████████████░░░░░░░░░░░░] 76.5%
example_model  [██████████████████████████████████████████████░░] 92.2%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
Configuration Files¶
Test Data Configuration¶
Configuration files are generated by `scripts/bootstrap_repo.py` and located in `test_data/`:

- `config.yaml` - Deliberately broken YAML configuration (multi-document)
- `user_data.json` - Sample user data for transformation
- `process_records.py` - Python script requiring refactoring
Loading configuration snippets (collapsed)¶
Safe YAML loading
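A minimal sketch of the pattern (assuming PyYAML; `config_fixed.yaml` stands in for a corrected copy of the deliberately broken `test_data/config.yaml`):

```python
from pathlib import Path

import yaml  # PyYAML

def load_config(path: str | Path) -> dict:
    """Parse YAML without executing arbitrary tags: safe_load, never load."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)

# Hypothetical corrected file; the shipped test_data/config.yaml is broken on purpose.
config = load_config("config_fixed.yaml")
print(config["processing_options"]["batch_size"])
```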
Configuration structure:

```yaml
use_legacy_paths: true
paths:
  data_source: /srv/data/production/users.json
  legacy_data_source: ./user_data.json
  log_file: /var/log/processor.log
validation_rules:
  min_age: 18
  max_age: 120
  required_fields:
    - name
    - email
    - country
processing_options:
  batch_size: 100
  timeout_seconds: 30
  retry_attempts: 3
api_keys:
  - primary_key
  - secondary_key
  - backup_key
feature_flags:
  enable_logging: true
  strict_validation: false
  debug_mode: false
```
Validation functions (collapsed)¶
Validator signatures for Prompt 1 (Code Refactoring), Prompt 2 (YAML/JSON Correction), Prompt 3 (Data Transformation), and Prompt 4 (API Integration):
Prompt 1: Code Refactoring¶
```python
def validate_prompt1_refactoring(solution_path: str) -> dict:
    """
    Validate refactored Python code.

    Returns:
        dict: {
            'score': float,
            'max_score': 25,
            'passed': bool,
            'details': {
                'syntax': bool,
                'execution': bool,
                'security': list,
                'performance': list,
                'maintainability': list
            }
        }
    """
```
Prompt 2: YAML/JSON Correction¶
```python
def validate_prompt2_yaml_json(yaml_path: str, json_path: str) -> dict:
    """
    Validate corrected YAML and JSON files.

    Returns:
        dict: {
            'score': float,
            'max_score': 25,
            'passed': bool,
            'details': {
                'yaml_valid': bool,
                'json_valid': bool,
                'equivalence': bool,
                'structure': dict
            }
        }
    """
```
Prompt 3: Data Transformation¶
```python
def validate_prompt3_transformation(transform_path: str) -> dict:
    """
    Validate data transformation function.

    Returns:
        dict: {
            'score': float,
            'max_score': 25,
            'passed': bool,
            'details': {
                'function_exists': bool,
                'signature_correct': bool,
                'transformations': dict,
                'business_rules': bool
            }
        }
    """
```
Prompt 4: API Integration¶
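By analogy with the other three validators (the function name matches the `validate_prompt4_api_integration` import shown under Python API below), a plausible sketch, with `details` keys drawn from the `tests_passed` fields in the sample results:

```python
def validate_prompt4_api_integration(api_path: str) -> dict:
    """
    Validate the API integration function (illustrative sketch; field names
    beyond score/max_score/passed are assumed, not confirmed by the source).

    Returns:
        dict: {
            'score': float,
            'max_score': 25,
            'passed': bool,
            'details': {
                'function_signature': bool,
                'uses_requests': bool,
                'error_handling': bool,
                'api_structure': bool
            }
        }
    """
```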
Error Handling¶
Common Exit Codes¶
| Code | Meaning | Resolution |
|---|---|---|
| 0 | Success | - |
| 1 | General error | Check error message |
| 2 | Missing model | Verify model name and directory |
| 3 | Validation failure | Review submission files |
| 4 | Timeout exceeded | Increase timeout or optimize code |
| 5 | File not found | Ensure all required files exist |
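When scripting around the CLI, you can branch on these codes; a minimal sketch (the invocation flags match the usage above, while the handling logic is illustrative):

```python
import subprocess
import sys

# Run the benchmark and branch on the documented exit codes.
proc = subprocess.run(
    [sys.executable, "run_benchmark.py", "--model", "example_model", "--quiet"]
)
if proc.returncode == 0:
    print("benchmark succeeded")
elif proc.returncode == 2:
    sys.exit("missing model: verify the model name and submissions directory")
elif proc.returncode == 4:
    sys.exit("timeout exceeded: increase the timeout or optimize the submission")
else:
    sys.exit(proc.returncode)
```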
Platform-Specific Notes¶
Windows¶
- Use the `python` or `py` command
- Paths may use backslashes or forward slashes
- PowerShell may require an execution policy adjustment
macOS/Linux¶
- Use the `python3` command
- Ensure proper file permissions
- May need `sudo` for certain operations
Docker¶
- Mount submissions directory as volume
- Set environment variables in container
- Use non-root user for security
Python API¶
Core Modules¶
benchmark.runner¶
Main benchmark execution module.
```python
from benchmark.runner import BenchmarkRunner

# Initialize the runner
runner = BenchmarkRunner(
    model_name="gpt4",
    results_dir="results/",
    timeout=30,
)

# Run all tests
results = runner.run_all_tests()

# Run a specific test
prompt1_result = runner.run_test("prompt_1_refactoring")
```
benchmark.validators¶
Validation logic for each prompt.
```python
from benchmark.validators import (
    validate_prompt1_refactoring,
    validate_prompt2_yaml_json,
    validate_prompt3_transformation,
    validate_prompt4_api_integration,
)

# Validate refactored code
result = validate_prompt1_refactoring(
    solution_path="submissions/model/prompt_1_solution.py"
)
print(f"Score: {result['score']}/{result['max_score']}")
```
benchmark.scoring¶
Scoring engine with 7-category assessment.
```python
from benchmark.scoring import calculate_score, get_grade

# Calculate the total score across prompts
results = {
    'prompt_1': {'score': 23.5, 'max_score': 25},
    'prompt_2': {'score': 25.0, 'max_score': 25},
    'prompt_3': {'score': 24.5, 'max_score': 25},
    'prompt_4': {'score': 21.0, 'max_score': 25},
}
total_score = calculate_score(results)
grade = get_grade(total_score)
print(f"Total: {total_score}/100 - Grade: {grade}")
```
benchmark.utils¶
Utility functions for file operations and formatting.
```python
from benchmark.utils import (
    safe_load_json,
    safe_load_yaml,
    format_results,
    create_comparison_chart,
)

# Safe file loading with error handling
data = safe_load_json("user_data.json")
config = safe_load_yaml("config.yaml")

# Format results for display
formatted = format_results(results)
print(formatted)

# Create a comparison chart
chart = create_comparison_chart(results)
print(chart)
```
Full API reference (collapsed)¶
benchmark.scoring
benchmark.validators ¶
Validators for AI Code Benchmark prompts.
ScoringDetail ¶
SecurityAnalyzer ¶
Analyzes code for common security vulnerabilities.
check_sql_injection_patterns staticmethod ¶
Check for potential SQL injection vulnerabilities.
check_hardcoded_secrets staticmethod ¶
Check for hardcoded secrets and API keys.
check_path_traversal staticmethod ¶
Check for path traversal vulnerabilities.
analyze_code_security classmethod ¶
Perform comprehensive security analysis on code.
PerformanceAnalyzer ¶
Analyzes code for performance issues and inefficient patterns.
check_nested_loops staticmethod ¶
Check for O(n²) and nested loop patterns that may be inefficient.
check_inefficient_patterns staticmethod ¶
Check for common inefficient programming patterns.
check_memory_patterns staticmethod ¶
Check for potential memory inefficiencies.
check_algorithm_efficiency staticmethod ¶
Check for algorithmically inefficient approaches.
analyze_code_performance classmethod ¶
Perform comprehensive performance analysis on code.
MaintainabilityAnalyzer ¶
Analyzes code for maintainability issues and code quality metrics.
check_function_length staticmethod ¶
Check for overly long functions (>20 lines).
check_code_duplication staticmethod ¶
Check for obvious code duplication patterns.
check_variable_naming staticmethod ¶
Check for poor variable naming practices.
check_complexity_indicators staticmethod ¶
Check for high complexity indicators.
analyze_code_maintainability classmethod ¶
Perform comprehensive maintainability analysis on code.
PromptValidators ¶
run_in_sandbox ¶
Internal helper executing a validator method inside a SecureRunner sandbox.
Mirrors the prior sandbox logic while remaining reusable and type-safe.
sandbox_validator ¶
Decorator for PromptValidators instance methods preserving signature & return type.
benchmark.secure_runner
benchmark.secure_runner.SecureRunner ¶
Execute untrusted code in an isolated temporary environment.
run_with_limits ¶
Execute func(*args) under CPU & memory limits where supported.
memory_mb: default 512; override via the --mem flag (choices 256/384/512/768/1024).
run_python_sandboxed ¶
Execute a python module or script with -B inside the sandbox.
The caller MUST be inside a `with self.sandbox():` context so that the sitecustomize guard and strict environment are active. Applies resource limits: rlimits on POSIX, Job Objects on Windows.
benchmark.utils ¶
Utility functions for AI Code Benchmark
create_submission_template ¶
Create template directory in tiered structure.
Legacy fallback removed: always uses submissions/templates/template.
generate_comparison_chart ¶
Generate a simple text-based comparison chart.
validate_submission_structure ¶
Validate that a model submission has the correct file structure.
get_model_statistics ¶
Extract key statistics from benchmark results.
Always returns a structured stats object (never an empty dict) so that callers can rely on keys without defensive existence checks.
benchmark.types ¶
Shared TypedDict result shapes for benchmark validators and runner.
Centralizing these shapes ensures a single source of truth for prompt-level and model-level result contracts across the core pipeline (validators, runner, comparison utilities). Runtime behavior remains unchanged; this is purely a typing/structure consolidation.
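The concrete definitions are collapsed here; an illustrative sketch of the prompt-level shape, with field names taken from the JSON results above (the actual class names in `benchmark.types` may differ):

```python
from typing import TypedDict

class CategoryScore(TypedDict):
    """One scoring category, e.g. detailed_scoring['syntax']."""
    earned: float
    max: float

class PromptResult(TypedDict):
    """Shape of models.<name>.prompts.<prompt_id> in latest_results.json."""
    passed: bool
    score: float
    max_score: int
    feedback: list[str]
    tests_passed: dict[str, bool]
    detailed_scoring: dict[str, CategoryScore]
```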
See Also¶
- Getting Started - Initial setup and quick start
- Developer Guide - Adding and testing models
- Scoring Methodology - Understanding scores
- Troubleshooting - Common issues and solutions