API Reference¶
Live, importable modules from the repo. The build step sets `PYTHONPATH`, so the imports below work without installing a wheel.
Reading order
Start with the Module map to see what's where, then skim the snippets and the CLI reference. The Full reference at the bottom is collapsed by default.
Module map¶
- `benchmark.scoring` - Scoring engine and helpers: `BenchmarkScorer`, `calculate_grade`, comparison helpers.
- `benchmark.validators` - Prompt validators and analyzers (security, performance, maintainability).
- `benchmark.secure_runner` - Sandboxed execution utilities; `SecureRunner` provides the sandbox context and guarded execution.
- `benchmark.utils` - Utilities for test data loading, directory setup, comparisons, and model stats.
- `benchmark.types` - TypedDict schemas describing prompt results and overall output structures.
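A quick way to confirm the path setup (a minimal sketch, run from the repo root with the build's `PYTHONPATH` in effect):

```python
# Smoke test: every documented module should import from the repo checkout.
import benchmark.scoring
import benchmark.secure_runner
import benchmark.types
import benchmark.utils
import benchmark.validators

print("benchmark modules import cleanly")
```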
Selected snippets¶
Strict sandbox environment (minimal, readable extract):
```python
def _prepare_environment(self, sandbox_dir: Path) -> None:
    os.environ.clear()
    home_dir = sandbox_dir / "home"
    tmp_dir = sandbox_dir / "temp"
    home_dir.mkdir(exist_ok=True)
    tmp_dir.mkdir(exist_ok=True)
    base_env = {
        "HOME": str(home_dir),
        "USERPROFILE": str(home_dir),  # Windows
        "TEMP": str(tmp_dir),
        "TMP": str(tmp_dir),
        "TMPDIR": str(tmp_dir),
        "PYTHONDONTWRITEBYTECODE": "1",
        "AIBUGBENCH_SANDBOX_ROOT": str(sandbox_dir.resolve()),
        "AIBUGBENCH_ALLOW_NETWORK": "1" if self.allow_network else "0",
    }
    # Pass through only an allowlist of keys from the original environment.
    for key in ["PATH", "SystemRoot", "WINDIR", "COMSPEC",
                "NUMBER_OF_PROCESSORS", "PROCESSOR_ARCHITECTURE", "LANG", "LC_ALL"]:
        val = self._original_env.get(key)
        if val:
            base_env[key] = val
    os.environ.update(base_env)
```
Run a Python entry inside the sandbox (ensures guards via `sitecustomize.py`):

```python
def run_python_sandboxed(self, args: list[str], *, timeout: int = 10,
                         cwd: Path | None = None, memory_mb: int = 512):
    cmd = [sys.executable, "-B", *args]  # -B keeps .pyc off; still loads sitecustomize
    env = os.environ.copy()  # inherit sandbox env
    if cwd:
        # ensure sandbox folder (with sitecustomize.py) is on import path
        env["PYTHONPATH"] = str(cwd) + (os.pathsep + env["PYTHONPATH"] if env.get("PYTHONPATH") else "")
    # platform-specific resource limits applied here...
    return subprocess.run(cmd, cwd=str(cwd) if cwd else None, env=env,
                          stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          text=True, timeout=timeout, check=False)
```
Snippets from the runner¶
Robust Unicode-safe printing¶
```python
def safe_print(self, message: str) -> None:
    try:
        print(message)
    except UnicodeEncodeError:
        # Fall back to ASCII on consoles that cannot encode the message.
        ascii_message = message.encode("ascii", "ignore").decode("ascii")
        print(ascii_message)
    except Exception as e:
        with contextlib.suppress(Exception):
            print(f"Print error: {e!s}")
```
Detailed scoring formatting (compact, two-line display)¶
```python
def format_detailed_score(self, detailed_scoring: dict[str, Any]) -> str:
    lines, categories = [], []
    order = ["syntax", "structure", "execution", "quality",
             "security", "performance", "maintainability"]
    for cat in order:
        if cat in detailed_scoring:
            s = detailed_scoring[cat]
            categories.append(f"{cat.title()}: {s.get('earned', 0):.1f}/{s.get('max', 0):.1f}")
    mid = len(categories) // 2
    if categories:
        lines.append(f"   └─ {', '.join(categories[:mid])}")
    if len(categories) > mid:
        lines.append(f"      {', '.join(categories[mid:])}")
    return "\n".join(lines)
```
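For illustration, feeding it a partial `detailed_scoring` dict produces the two-line layout seen in the terminal output below (the `runner` receiver is hypothetical; any object exposing the method works):

```python
# Illustrative input; `runner` is a hypothetical object exposing the method.
detailed = {
    "syntax": {"earned": 5.0, "max": 5.0},
    "structure": {"earned": 2.4, "max": 3.0},
    "execution": {"earned": 6.0, "max": 6.0},
    "quality": {"earned": 3.0, "max": 3.0},
}
print(runner.format_detailed_score(detailed))
#    └─ Syntax: 5.0/5.0, Structure: 2.4/3.0
#       Execution: 6.0/6.0, Quality: 3.0/3.0
```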
Atomic result write (tmp file swap)¶
```python
def _atomic_write_json(self, path: Path, data: Any) -> None:
    tmp = path.with_suffix(path.suffix + ".tmp")
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    os.replace(tmp, path)
```
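Because `os.replace` is a single atomic rename (on POSIX and Windows alike, for paths on the same filesystem), readers of `path` always see either the previous complete file or the new one, never a partial write.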
Recipes¶
Execute an existing script inside a sandbox¶
```python
from pathlib import Path

from benchmark.secure_runner import SecureRunner

runner = SecureRunner(model_name="example_model", allow_network=False)

with runner.sandbox() as root:
    result = runner.run_python_sandboxed(
        ["-m", "module_to_run", "--flag"],
        cwd=Path(root),
        timeout=10,
        memory_mb=512,
    )

print(result.stdout)
```
Parse CLI args without running the benchmark¶
```python
from run_benchmark import parse_args

args = parse_args(["--model", "example_model", "--mem", "768", "--quiet"])
assert args.model == "example_model" and args.mem == 768 and args.quiet
```
CLI Reference¶
Usage¶
```text
python run_benchmark.py [--model NAME | --all-models] [--workers N]
                        [--submissions-dir DIR] [--results-dir DIR]
                        [--mem {256,384,512,768,1024}]
                        [--unsafe] [--allow-network] [--trusted-model]
                        [--no-metadata] [-q|--quiet]
```
Arguments¶
| Flag | Type / Values | Default | Description |
|---|---|---|---|
| `--model` | string | - | Test a single model by name. |
| `--all-models` | flag | false | Test all discovered models (if supported by your runner). |
| `--workers` | int | 1 | Number of concurrent workers when testing multiple models. |
| `--submissions-dir` | path | `submissions` | Root directory containing model submissions. |
| `--results-dir` | path | `results` | Directory where results, summaries, and charts are written. |
| `--mem` | one of 256, 384, 512, 768, 1024 | 512 | Memory limit (MB) for sandboxed execution. |
| `--unsafe` | flag | false | Disable sandbox/resource isolation. Dangerous; for trusted runs only. |
| `--allow-network` | flag | false | Allow network access during execution. |
| `--trusted-model` | flag | false | Suppress the unsafe-mode confirmation (use in CI for trusted submissions). |
| `--no-metadata` | flag | false | Skip environment/git/dependency metadata collection. |
| `-q`, `--quiet` | flag | false | Suppress non-essential output. |
Examples¶
```bash
# Single model, default sandbox & limits
python run_benchmark.py --model example_model

# All models with 4 workers, custom results dir
python run_benchmark.py --all-models --workers 4 --results-dir out/results

# CI-like trusted run with network allowed and a larger RAM cap
python run_benchmark.py --model gpt4 --unsafe --trusted-model --allow-network --mem 1024 -q
```
Environment Variables¶
| Variable | Default | Description |
|---|---|---|
| `AIBUGBENCH_RESULTS_DIR` | `results/` | Override the default results directory |
| `AIBUGBENCH_TIMEOUT` | 30 | Default operation timeout (seconds) |
| `AIBUGBENCH_DEBUG` | false | Enable debug logging |
| `PYTHONPATH` | - | Include the benchmark modules on the import path |
Programmatic usage (high level)¶
```python
from run_benchmark import AICodeBenchmark

bench = AICodeBenchmark(submissions_dir="submissions", results_dir="results")
result = bench.run_single_model("example_model")
print(result["overall_score"], result["percentage"])
```
Representative output (terminal, JSON, results files)¶
Terminal output + security banner¶
Example terminal output (single-model run):

```text
> python run_benchmark.py

╔══════════════════════════════════════╗
║      AIBugBench Security Status      ║
╠══════════════════════════════════════╣
║ Sandboxing:      ENABLED             ║
║ Network:         BLOCKED             ║
║ Subprocess:      BLOCKED             ║
║ Filesystem:      CONFINED            ║
║ Env Clean:       CLEANED             ║
║ Resource Limits: ENFORCED            ║
║ Trusted Model:   YES                 ║
╚══════════════════════════════════════╝

Discovered models: reference=1 user=0 templates=OK
🔍 Discovered 1 model(s): example_model

Testing model: example_model
==================================================

📝 Testing Refactoring & Analysis...
✅ PASSED - Score: 23.17/25
   └─ Syntax: 5.0/5.0, Structure: 2.4/3.0, Execution: 6.0/6.0
      Quality: 3.0/3.0, Security: 4.0/4.0, Performance: 1.9/2.0, Maintainability: 0.9/2.0

📝 Testing YAML/JSON Correction...
✅ PASSED - Score: 25.00/25
   └─ Syntax: 4.0/4.0, Structure: 6.0/6.0, Execution: 8.0/8.0
      Quality: 6.0/6.0, Security: 1.0/1.0, Performance: 0.0/0.0, Maintainability: 0.0/0.0

📝 Testing Data Transformation...
2025-09-30 02:19:47,031 - transform_module - WARNING - User 103: Email is null, cannot extract provider
2025-09-30 02:19:47,031 - transform_module - WARNING - User 999: Missing or invalid 'contact' field
2025-09-30 02:19:47,032 - transform_module - WARNING - User 999: Missing or invalid 'stats' field
2025-09-30 02:19:47,032 - transform_module - INFO - Successfully transformed 6 users
✅ PASSED - Score: 22.00/25
   └─ Syntax: 3.0/3.0, Structure: 3.0/3.0, Execution: 12.0/12.0
      Quality: 3.0/3.0, Security: 0.0/1.0, Performance: 1.0/1.0, Maintainability: 0.0/2.0

📝 Testing API Simulation...
2025-09-30 02:19:47,072 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,072 - api_module - INFO - Successfully synced users. Job ID: abc123
✅ Sync successful! Job ID: abc123
2025-09-30 02:19:47,072 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,072 - api_module - WARNING - Unexpected success status code: 400
⚠️ Warning: Unexpected response status 400
2025-09-30 02:19:47,072 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,073 - api_module - WARNING - Unexpected success status code: 401
⚠️ Warning: Unexpected response status 401
2025-09-30 02:19:47,073 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,073 - api_module - WARNING - Unexpected success status code: 503
⚠️ Warning: Unexpected response status 503
2025-09-30 02:19:47,073 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,073 - api_module - ERROR - Network connection error: Network error
❌ Network Error: Unable to connect to CRM system
   Please check your internet connection and try again
2025-09-30 02:19:47,073 - api_module - INFO - Attempting to sync 1 users to CRM system
2025-09-30 02:19:47,073 - api_module - INFO - Successfully synced users. Job ID: test123
✅ Sync successful! Job ID: test123
✅ PASSED - Score: 22.00/25
   └─ Syntax: 2.0/2.0, Structure: 3.0/3.0, Execution: 7.0/7.0
      Quality: 3.0/3.0, Security: 6.0/7.0, Performance: 0.0/2.0, Maintainability: 1.0/1.0

🎯 Final Score: 92.17/100 (92.2%)
📄 Summary report: results\20250930_021946\detailed\summary_report_20250930_021946.txt
📊 Comparison chart: results\20250930_021946\comparison_charts\comparison_chart.txt

🎉 Benchmark completed! Tested 1 model(s)

🏆 Top Performers:
   1. example_model: 92.2%
   2. (n/a)

📁 Detailed results have been saved to:
   • results/latest_results.json - Complete data with detailed scoring
   • results/detailed/summary_report_*.txt - Summary with enhanced feedback
   • results/comparison_charts/comparison_chart_*.txt - Visual comparison with progress bars

For complete scoring breakdowns and analysis, check these files in the /results directory.
```
JSON results file¶
Detailed JSON results file (`results/latest_results.json`):

```json
{
"benchmark_run": {
"timestamp": "2025-10-11T22:48:46.444397",
"total_models": 4
},
"models": {
"GPT5_Thinking": {
"model_name": "GPT5_Thinking",
"timestamp": "2025-10-11T22:48:46.444621",
"prompts": {
"prompt_1": {
"passed": true,
"score": 23.166666666666664,
"max_score": 25,
"feedback": [
"✅ Python Syntax (5.0/5.0): ✓valid_syntax",
"⚠️ Code Structure (2.4/3.0): ✓yaml_import, json_import, error_handling, type_hints ✗logging",
"✅ Execution (6.0/6.0): ✓runs_without_error, json_output_validation",
"✅ Code Quality (3.0/3.0): ✓no_global_vars, uses_pathlib, has_main_guard, proper_file_handling",
"✅ Security Analysis (4.0/4.0): ✓no_security_issues",
"⚠️ Performance Analysis (1.9/2.0): ✓no_nested_loops",
"⚠️ Maintainability Analysis (0.9/2.0): ✓no_duplication, good_naming ✗no_long_functions"
],
"tests_passed": {
"valid_python": true,
"runs_successfully": true,
"good_quality": true,
"secure_code": true
},
"detailed_scoring": {
"syntax": {
"earned": 5.0,
"max": 5.0
},
"structure": {
"earned": 2.4,
"max": 3.0
},
"execution": {
"earned": 6.0,
"max": 6.0
},
"quality": {
"earned": 3.0,
"max": 3.0
},
"security": {
"earned": 4.0,
"max": 4.0
},
"performance": {
"earned": 1.8666666666666665,
"max": 2.0
},
"maintainability": {
"earned": 0.9,
"max": 2.0
}
}
},
"prompt_2": {
"passed": true,
"score": 25.0,
"max_score": 25,
"feedback": [
"✅ Syntax (4.0/4.0): ✓yaml_parses, json_parses",
"✅ Structure (6.0/6.0): ✓required_keys, nested_shapes, arrays_scalars",
"✅ Execution (8.0/8.0): ✓deep_equivalence, partial_matches",
"✅ Quality (6.0/6.0): ✓yaml_indentation, json_literals, formatting_style, no_duplication",
"✅ Security (1.0/1.0): ✓yaml_safety"
],
"tests_passed": {
"valid_yaml": true,
"valid_json": true,
"structure_preserved": true,
"equivalence_test": true,
"correct_types": true
},
"detailed_scoring": {
"syntax": {
"earned": 4.0,
"max": 4
},
"structure": {
"earned": 6.0,
"max": 6
},
"execution": {
"earned": 8.0,
"max": 8
},
"quality": {
"earned": 6.0,
"max": 6
},
"security": {
"earned": 1.0,
"max": 1
},
"performance": {
"earned": 0,
"max": 0
},
"maintainability": {
"earned": 0,
"max": 0
}
}
},
"prompt_3": {
"passed": true,
"score": 21.0,
"max_score": 25,
"feedback": [
"✅ Syntax (3.0/3.0): ✓file_compiles, function_exists",
"✅ Structure (3.0/3.0): ✓correct_signature, basic_organization",
"✅ Execution (12.0/12.0): ✓function_runs, id_standardization, email_provider, age_normalization, account_tiers, error_handling",
"⚠️ Quality (2.0/3.0): ✓try_except, type_conversions ✗readable_loops",
"❌ Security (0.0/1.0): ✗no_unsafe_constructs",
"✅ Performance (1.0/1.0): ✓single_pass",
"❌ Maintainability (0.0/2.0): ✗code_organization"
],
"tests_passed": {
"function_exists": true,
"no_crash": true,
"id_standardization": true,
"email_provider": true,
"account_tiers": true
},
"detailed_scoring": {
"syntax": {
"earned": 3.0,
"max": 3
},
"structure": {
"earned": 3.0,
"max": 3
},
"execution": {
"earned": 12.0,
"max": 12
},
"quality": {
"earned": 2.0,
"max": 3
},
"security": {
"earned": 0.0,
"max": 1
},
"performance": {
"earned": 1.0,
"max": 1
},
"maintainability": {
"earned": 0.0,
"max": 2
}
}
},
"prompt_4": {
"passed": true,
"score": 15.0,
"max_score": 25,
"feedback": [
"✅ Syntax (2.0/2.0): ✓file_compiles, function_exists",
"✅ Structure (3.0/3.0): ✓correct_signature, request_structure",
"⚠️ Execution (4.0/7.0): ✓handle_400, handle_401, handle_503, handle_connection_error ✗success_handling, json_parsing",
"✅ Quality (3.0/3.0): ✓informative_errors, documentation",
"❌ Security (0.0/7.0): ✗security_analysis",
"✅ Performance (2.0/2.0): ✓retry_resilience",
"✅ Maintainability (1.0/1.0): ✓code_organization"
],
"tests_passed": {
"function_signature": true,
"uses_requests": true,
"error_handling": true,
"api_structure": true
},
"detailed_scoring": {
"syntax": {
"earned": 2.0,
"max": 2
},
"structure": {
"earned": 3.0,
"max": 3
},
"execution": {
"earned": 4.0,
"max": 7
},
"quality": {
"earned": 3.0,
"max": 3
},
"security": {
"earned": 0.0,
"max": 7
},
"performance": {
"earned": 2.0,
"max": 2
},
"maintainability": {
"earned": 1.0,
"max": 1
}
}
}
},
"overall_score": 84.16666666666666,
"total_possible": 100,
"percentage": 84.2
},
"Mistral_LeChat": {
"model_name": "Mistral_LeChat",
"timestamp": "2025-10-11T22:48:52.651905",
"prompts": {
"prompt_1": {
"passed": true,
"score": 16.316666666666666,
"max_score": 25,
"feedback": [
"✅ Python Syntax (5.0/5.0): ✓valid_syntax",
"⚠️ Code Structure (1.8/3.0): ✓yaml_import, json_import, error_handling ✗logging, type_hints",
"❌ Execution (0.0/6.0): ✗correct_filtering",
"⚠️ Code Quality (2.2/3.0): ✓no_global_vars, has_main_guard, proper_file_handling ✗uses_pathlib",
"✅ Security Analysis (4.0/4.0): ✓no_security_issues",
"⚠️ Performance Analysis (1.9/2.0): ✓no_nested_loops",
"⚠️ Maintainability Analysis (1.4/2.0): ✓no_long_functions, no_duplication, good_naming"
],
"tests_passed": {
"valid_python": true,
"good_quality": true,
"secure_code": true
},
"detailed_scoring": {
"syntax": {
"earned": 5.0,
"max": 5.0
},
"structure": {
"earned": 1.7999999999999998,
"max": 3.0
},
"execution": {
"earned": 0.0,
"max": 6.0
},
"quality": {
"earned": 2.25,
"max": 3.0
},
"security": {
"earned": 4.0,
"max": 4.0
},
"performance": {
"earned": 1.8666666666666665,
"max": 2.0
},
"maintainability": {
"earned": 1.4,
"max": 2.0
}
}
},
"prompt_2": {
"passed": true,
"score": 25.0,
"max_score": 25,
"feedback": [
"✅ Syntax (4.0/4.0): ✓yaml_parses, json_parses",
"✅ Structure (6.0/6.0): ✓required_keys, nested_shapes, arrays_scalars",
"✅ Execution (8.0/8.0): ✓deep_equivalence, partial_matches",
"✅ Quality (6.0/6.0): ✓yaml_indentation, json_literals, formatting_style, no_duplication",
"✅ Security (1.0/1.0): ✓yaml_safety"
],
"tests_passed": {
"valid_yaml": true,
"valid_json": true,
"structure_preserved": true,
"equivalence_test": true,
"correct_types": true
},
"detailed_scoring": {
"syntax": {
"earned": 4.0,
"max": 4
},
"structure": {
"earned": 6.0,
"max": 6
},
"execution": {
"earned": 8.0,
"max": 8
},
"quality": {
"earned": 6.0,
"max": 6
},
"security": {
"earned": 1.0,
"max": 1
},
"performance": {
"earned": 0,
"max": 0
},
"maintainability": {
"earned": 0,
"max": 0
}
}
},
"prompt_3": {
"passed": true,
"score": 22.0,
"max_score": 25,
"feedback": [
"✅ Syntax (3.0/3.0): ✓file_compiles, function_exists",
"✅ Structure (3.0/3.0): ✓correct_signature, basic_organization",
"✅ Execution (12.0/12.0): ✓function_runs, id_standardization, email_provider, age_normalization, account_tiers, error_handling",
"✅ Quality (3.0/3.0): ✓try_except, type_conversions, readable_loops",
"❌ Security (0.0/1.0): ✗no_unsafe_constructs",
"✅ Performance (1.0/1.0): ✓single_pass",
"❌ Maintainability (0.0/2.0): ✗code_organization"
],
"tests_passed": {
"function_exists": true,
"no_crash": true,
"id_standardization": true,
"email_provider": true,
"account_tiers": true
},
"detailed_scoring": {
"syntax": {
"earned": 3.0,
"max": 3
},
"structure": {
"earned": 3.0,
"max": 3
},
"execution": {
"earned": 12.0,
"max": 12
},
"quality": {
"earned": 3.0,
"max": 3
},
"security": {
"earned": 0.0,
"max": 1
},
"performance": {
"earned": 1.0,
"max": 1
},
"maintainability": {
"earned": 0.0,
"max": 2
}
}
},
"prompt_4": {
"passed": true,
"score": 21.0,
"max_score": 25,
"feedback": [
"✅ Syntax (2.0/2.0): ✓file_compiles, function_exists",
"✅ Structure (3.0/3.0): ✓correct_signature, request_structure",
"✅ Execution (7.0/7.0): ✓success_handling, handle_400, handle_401, handle_503, handle_connection_error, json_parsing",
"✅ Quality (3.0/3.0): ✓informative_errors, documentation",
"⚠️ Security (5.0/7.0): ✓bearer_auth, no_token_leak ✗explicit_timeout",
"❌ Performance (0.0/2.0): ✗retry_resilience",
"✅ Maintainability (1.0/1.0): ✓code_organization"
],
"tests_passed": {
"function_signature": true,
"uses_requests": true,
"error_handling": true,
"api_structure": true
},
"detailed_scoring": {
"syntax": {
"earned": 2.0,
"max": 2
},
"structure": {
"earned": 3.0,
"max": 3
},
"execution": {
"earned": 7.0,
"max": 7
},
"quality": {
"earned": 3.0,
"max": 3
},
"security": {
"earned": 5.0,
"max": 7
},
"performance": {
"earned": 0.0,
"max": 2
},
"maintainability": {
"earned": 1.0,
"max": 1
}
}
}
},
"overall_score": 84.31666666666666,
"total_possible": 100,
"percentage": 84.3
},
"Sonnet4.5_Thinking": {
"model_name": "Sonnet4.5_Thinking",
"timestamp": "2025-10-11T22:48:52.739921",
"prompts": {
"prompt_1": {
"passed": true,
"score": 17.166666666666664,
"max_score": 25,
"feedback": [
"✅ Python Syntax (5.0/5.0): ✓valid_syntax",
"⚠️ Code Structure (2.4/3.0): ✓yaml_import, json_import, error_handling, type_hints ✗logging",
"❌ Execution (0.0/6.0): ✗runs_without_error, correct_filtering",
"✅ Code Quality (3.0/3.0): ✓no_global_vars, uses_pathlib, has_main_guard, proper_file_handling",
"✅ Security Analysis (4.0/4.0): ✓no_security_issues",
"⚠️ Performance Analysis (1.9/2.0): ✓no_nested_loops",
"⚠️ Maintainability Analysis (0.9/2.0): ✓no_duplication, good_naming ✗no_long_functions"
],
"tests_passed": {
"valid_python": true,
"good_quality": true,
"secure_code": true
},
"detailed_scoring": {
"syntax": {
"earned": 5.0,
"max": 5.0
},
"structure": {
"earned": 2.4,
"max": 3.0
},
"execution": {
"earned": 0.0,
"max": 6.0
},
"quality": {
"earned": 3.0,
"max": 3.0
},
"security": {
"earned": 4.0,
"max": 4.0
},
"performance": {
"earned": 1.8666666666666665,
"max": 2.0
},
"maintainability": {
"earned": 0.9,
"max": 2.0
}
}
},
"prompt_2": {
"passed": true,
"score": 23.0,
"max_score": 25,
"feedback": [
"✅ Syntax (4.0/4.0): ✓yaml_parses, json_parses",
"✅ Structure (6.0/6.0): ✓required_keys, nested_shapes, arrays_scalars",
"✅ Execution (8.0/8.0): ✓deep_equivalence, partial_matches",
"⚠️ Quality (4.0/6.0): ✓json_literals, formatting_style, no_duplication ✗yaml_indentation",
"✅ Security (1.0/1.0): ✓yaml_safety"
],
"tests_passed": {
"valid_yaml": true,
"valid_json": true,
"structure_preserved": true,
"equivalence_test": true,
"correct_types": true
},
"detailed_scoring": {
"syntax": {
"earned": 4.0,
"max": 4
},
"structure": {
"earned": 6.0,
"max": 6
},
"execution": {
"earned": 8.0,
"max": 8
},
"quality": {
"earned": 4.0,
"max": 6
},
"security": {
"earned": 1.0,
"max": 1
},
"performance": {
"earned": 0,
"max": 0
},
"maintainability": {
"earned": 0,
"max": 0
}
}
},
"prompt_3": {
"passed": true,
"score": 22.0,
"max_score": 25,
"feedback": [
"✅ Syntax (3.0/3.0): ✓file_compiles, function_exists",
"✅ Structure (3.0/3.0): ✓correct_signature, basic_organization",
"✅ Execution (12.0/12.0): ✓function_runs, id_standardization, email_provider, age_normalization, account_tiers, error_handling",
"✅ Quality (3.0/3.0): ✓try_except, type_conversions, readable_loops",
"❌ Security (0.0/1.0): ✗no_unsafe_constructs",
"✅ Performance (1.0/1.0): ✓single_pass",
"❌ Maintainability (0.0/2.0): ✗code_organization"
],
"tests_passed": {
"function_exists": true,
"no_crash": true,
"id_standardization": true,
"email_provider": true,
"account_tiers": true
},
"detailed_scoring": {
"syntax": {
"earned": 3.0,
"max": 3
},
"structure": {
"earned": 3.0,
"max": 3
},
"execution": {
"earned": 12.0,
"max": 12
},
"quality": {
"earned": 3.0,
"max": 3
},
"security": {
"earned": 0.0,
"max": 1
},
"performance": {
"earned": 1.0,
"max": 1
},
"maintainability": {
"earned": 0.0,
"max": 2
}
}
},
"prompt_4": {
"passed": true,
"score": 22.0,
"max_score": 25,
"feedback": [
"✅ Syntax (2.0/2.0): ✓file_compiles, function_exists",
"✅ Structure (3.0/3.0): ✓correct_signature, request_structure",
"✅ Execution (7.0/7.0): ✓success_handling, handle_400, handle_401, handle_503, handle_connection_error, json_parsing",
"✅ Quality (3.0/3.0): ✓informative_errors, documentation",
"⚠️ Security (6.0/7.0): ✓bearer_auth, no_token_leak, explicit_timeout ✗no_hardcoded_creds",
"❌ Performance (0.0/2.0): ✗retry_resilience",
"✅ Maintainability (1.0/1.0): ✓code_organization"
],
"tests_passed": {
"function_signature": true,
"uses_requests": true,
"error_handling": true,
"api_structure": true
},
"detailed_scoring": {
"syntax": {
"earned": 2.0,
"max": 2
},
"structure": {
"earned": 3.0,
"max": 3
},
"execution": {
"earned": 7.0,
"max": 7
},
"quality": {
"earned": 3.0,
"max": 3
},
"security": {
"earned": 6.0,
"max": 7
},
"performance": {
"earned": 0.0,
"max": 2
},
"maintainability": {
"earned": 1.0,
"max": 1
}
}
}
},
"overall_score": 84.16666666666666,
"total_possible": 100,
"percentage": 84.2
},
"example_model": {
"model_name": "example_model",
"timestamp": "2025-10-11T22:48:52.852209",
"prompts": {
"prompt_1": {
"passed": true,
"score": 23.166666666666664,
"max_score": 25,
"feedback": [
"✅ Python Syntax (5.0/5.0): ✓valid_syntax",
"⚠️ Code Structure (2.4/3.0): ✓yaml_import, json_import, error_handling, type_hints ✗logging",
"✅ Execution (6.0/6.0): ✓runs_without_error, json_output_validation",
"✅ Code Quality (3.0/3.0): ✓no_global_vars, uses_pathlib, has_main_guard, proper_file_handling",
"✅ Security Analysis (4.0/4.0): ✓no_security_issues",
"⚠️ Performance Analysis (1.9/2.0): ✓no_nested_loops",
"⚠️ Maintainability Analysis (0.9/2.0): ✓no_duplication, good_naming ✗no_long_functions"
],
"tests_passed": {
"valid_python": true,
"runs_successfully": true,
"good_quality": true,
"secure_code": true
},
"detailed_scoring": {
"syntax": {
"earned": 5.0,
"max": 5.0
},
"structure": {
"earned": 2.4,
"max": 3.0
},
"execution": {
"earned": 6.0,
"max": 6.0
},
"quality": {
"earned": 3.0,
"max": 3.0
},
"security": {
"earned": 4.0,
"max": 4.0
},
"performance": {
"earned": 1.8666666666666665,
"max": 2.0
},
"maintainability": {
"earned": 0.9,
"max": 2.0
}
}
},
"prompt_2": {
"passed": true,
"score": 25.0,
"max_score": 25,
"feedback": [
"✅ Syntax (4.0/4.0): ✓yaml_parses, json_parses",
"✅ Structure (6.0/6.0): ✓required_keys, nested_shapes, arrays_scalars",
"✅ Execution (8.0/8.0): ✓deep_equivalence, partial_matches",
"✅ Quality (6.0/6.0): ✓yaml_indentation, json_literals, formatting_style, no_duplication",
"✅ Security (1.0/1.0): ✓yaml_safety"
],
"tests_passed": {
"valid_yaml": true,
"valid_json": true,
"structure_preserved": true,
"equivalence_test": true,
"correct_types": true
},
"detailed_scoring": {
"syntax": {
"earned": 4.0,
"max": 4
},
"structure": {
"earned": 6.0,
"max": 6
},
"execution": {
"earned": 8.0,
"max": 8
},
"quality": {
"earned": 6.0,
"max": 6
},
"security": {
"earned": 1.0,
"max": 1
},
"performance": {
"earned": 0,
"max": 0
},
"maintainability": {
"earned": 0,
"max": 0
}
}
},
"prompt_3": {
"passed": true,
"score": 22.0,
"max_score": 25,
"feedback": [
"✅ Syntax (3.0/3.0): ✓file_compiles, function_exists",
"✅ Structure (3.0/3.0): ✓correct_signature, basic_organization",
"✅ Execution (12.0/12.0): ✓function_runs, id_standardization, email_provider, age_normalization, account_tiers, error_handling",
"✅ Quality (3.0/3.0): ✓try_except, type_conversions, readable_loops",
"❌ Security (0.0/1.0): ✗no_unsafe_constructs",
"✅ Performance (1.0/1.0): ✓single_pass",
"❌ Maintainability (0.0/2.0): ✗code_organization"
],
"tests_passed": {
"function_exists": true,
"no_crash": true,
"id_standardization": true,
"email_provider": true,
"account_tiers": true
},
"detailed_scoring": {
"syntax": {
"earned": 3.0,
"max": 3
},
"structure": {
"earned": 3.0,
"max": 3
},
"execution": {
"earned": 12.0,
"max": 12
},
"quality": {
"earned": 3.0,
"max": 3
},
"security": {
"earned": 0.0,
"max": 1
},
"performance": {
"earned": 1.0,
"max": 1
},
"maintainability": {
"earned": 0.0,
"max": 2
}
}
},
"prompt_4": {
"passed": true,
"score": 22.0,
"max_score": 25,
"feedback": [
"✅ Syntax (2.0/2.0): ✓file_compiles, function_exists",
"✅ Structure (3.0/3.0): ✓correct_signature, request_structure",
"✅ Execution (7.0/7.0): ✓success_handling, handle_400, handle_401, handle_503, handle_connection_error, json_parsing",
"✅ Quality (3.0/3.0): ✓informative_errors, documentation",
"⚠️ Security (6.0/7.0): ✓bearer_auth, no_token_leak, explicit_timeout ✗no_hardcoded_creds",
"❌ Performance (0.0/2.0): ✗retry_resilience",
"✅ Maintainability (1.0/1.0): ✓code_organization"
],
"tests_passed": {
"function_signature": true,
"uses_requests": true,
"error_handling": true,
"api_structure": true
},
"detailed_scoring": {
"syntax": {
"earned": 2.0,
"max": 2
},
"structure": {
"earned": 3.0,
"max": 3
},
"execution": {
"earned": 7.0,
"max": 7
},
"quality": {
"earned": 3.0,
"max": 3
},
"security": {
"earned": 6.0,
"max": 7
},
"performance": {
"earned": 0.0,
"max": 2
},
"maintainability": {
"earned": 1.0,
"max": 1
}
}
}
},
"overall_score": 92.16666666666666,
"total_possible": 100,
"percentage": 92.2
}
},
"comparison": {
"ranking": [
{
"model": "example_model",
"score": 92.16666666666666,
"percentage": 92.2
},
{
"model": "Mistral_LeChat",
"score": 84.31666666666666,
"percentage": 84.3
},
{
"model": "GPT5_Thinking",
"score": 84.16666666666666,
"percentage": 84.2
},
{
"model": "Sonnet4.5_Thinking",
"score": 84.16666666666666,
"percentage": 84.2
}
],
"prompt_performance": {
"prompt_1": {
"best_score": 23.166666666666664,
"avg_score": 20.0,
"pass_rate": 100.0,
"ranking": [
{
"model": "GPT5_Thinking",
"score": 23.166666666666664,
"passed": true
},
{
"model": "example_model",
"score": 23.166666666666664,
"passed": true
},
{
"model": "Sonnet4.5_Thinking",
"score": 17.166666666666664,
"passed": true
},
{
"model": "Mistral_LeChat",
"score": 16.316666666666666,
"passed": true
}
]
},
"prompt_2": {
"best_score": 25.0,
"avg_score": 24.5,
"pass_rate": 100.0,
"ranking": [
{
"model": "GPT5_Thinking",
"score": 25.0,
"passed": true
},
{
"model": "Mistral_LeChat",
"score": 25.0,
"passed": true
},
{
"model": "example_model",
"score": 25.0,
"passed": true
},
{
"model": "Sonnet4.5_Thinking",
"score": 23.0,
"passed": true
}
]
},
"prompt_3": {
"best_score": 22.0,
"avg_score": 21.8,
"pass_rate": 100.0,
"ranking": [
{
"model": "Mistral_LeChat",
"score": 22.0,
"passed": true
},
{
"model": "Sonnet4.5_Thinking",
"score": 22.0,
"passed": true
},
{
"model": "example_model",
"score": 22.0,
"passed": true
},
{
"model": "GPT5_Thinking",
"score": 21.0,
"passed": true
}
]
},
"prompt_4": {
"best_score": 22.0,
"avg_score": 20.0,
"pass_rate": 100.0,
"ranking": [
{
"model": "Sonnet4.5_Thinking",
"score": 22.0,
"passed": true
},
{
"model": "example_model",
"score": 22.0,
"passed": true
},
{
"model": "Mistral_LeChat",
"score": 21.0,
"passed": true
},
{
"model": "GPT5_Thinking",
"score": 15.0,
"passed": true
}
]
}
},
"summary_stats": {}
},
"_metadata": {
"spec_version": "0.8.0",
"git_commit": "6fbb2b4",
"python_version": "3.13.7",
"platform": "Windows-11-10.0.26100-SP0",
"timestamp_utc": "2025-10-11T20:48:53.047029Z",
"dependency_fingerprint": "efa462512888b811"
}
}
```
Comparison Chart Format¶
```text
Model Comparison:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
model_1        [████████████████████████████████████████████░░░░] 94.3%
model_2        [███████████████████████████████████████████░░░░░] 91.8%
model_3        [████████████████████████████████████░░░░░░░░░░░░] 76.5%
example_model  [██████████████████████████████████████████████░░] 92.2%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
Configuration Files¶
Test Data Configuration¶
Configuration files are generated by `scripts/bootstrap_repo.py` and located in `test_data/`:

- `config.yaml` - Deliberately broken YAML configuration (multi-document)
- `user_data.json` - Sample user data for transformation
- `process_records.py` - Python script requiring refactoring
Loading configuration snippets (collapsed)¶
Safe YAML loading
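A minimal sketch of the pattern (assuming PyYAML; `config_fixed.yaml` stands in for a corrected copy of the deliberately broken `test_data/config.yaml`):

```python
from pathlib import Path

import yaml  # PyYAML

def load_config(path: str | Path) -> dict:
    """Parse YAML without executing arbitrary tags: safe_load, never load."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)

# Hypothetical corrected file; the shipped test_data/config.yaml is broken on purpose.
config = load_config("config_fixed.yaml")
print(config["processing_options"]["batch_size"])
```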
Configuration structure:

```yaml
use_legacy_paths: true
paths:
  data_source: /srv/data/production/users.json
  legacy_data_source: ./user_data.json
  log_file: /var/log/processor.log
validation_rules:
  min_age: 18
  max_age: 120
  required_fields:
    - name
    - email
    - country
processing_options:
  batch_size: 100
  timeout_seconds: 30
  retry_attempts: 3
api_keys:
  - primary_key
  - secondary_key
  - backup_key
feature_flags:
  enable_logging: true
  strict_validation: false
  debug_mode: false
```
Validation functions (collapsed)¶
Validator signatures for Prompt 1 (Code Refactoring), Prompt 2 (YAML/JSON Correction), Prompt 3 (Data Transformation), and Prompt 4 (API Integration):
Prompt 1: Code Refactoring¶
```python
def validate_prompt1_refactoring(solution_path: str) -> dict:
    """
    Validate refactored Python code.

    Returns:
        dict: {
            'score': float,
            'max_score': 25,
            'passed': bool,
            'details': {
                'syntax': bool,
                'execution': bool,
                'security': list,
                'performance': list,
                'maintainability': list
            }
        }
    """
```
Prompt 2: YAML/JSON Correction¶
```python
def validate_prompt2_yaml_json(yaml_path: str, json_path: str) -> dict:
    """
    Validate corrected YAML and JSON files.

    Returns:
        dict: {
            'score': float,
            'max_score': 25,
            'passed': bool,
            'details': {
                'yaml_valid': bool,
                'json_valid': bool,
                'equivalence': bool,
                'structure': dict
            }
        }
    """
```
Prompt 3: Data Transformation¶
```python
def validate_prompt3_transformation(transform_path: str) -> dict:
    """
    Validate data transformation function.

    Returns:
        dict: {
            'score': float,
            'max_score': 25,
            'passed': bool,
            'details': {
                'function_exists': bool,
                'signature_correct': bool,
                'transformations': dict,
                'business_rules': bool
            }
        }
    """
```
Prompt 4: API Integration¶
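By analogy with the other three validators (the function name matches the `validate_prompt4_api_integration` import shown under Python API below), a plausible sketch, with `details` keys drawn from the `tests_passed` fields in the sample results:

```python
def validate_prompt4_api_integration(api_path: str) -> dict:
    """
    Validate the API integration function (illustrative sketch; field names
    beyond score/max_score/passed are assumed, not confirmed by the source).

    Returns:
        dict: {
            'score': float,
            'max_score': 25,
            'passed': bool,
            'details': {
                'function_signature': bool,
                'uses_requests': bool,
                'error_handling': bool,
                'api_structure': bool
            }
        }
    """
```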
Error Handling¶
Common Exit Codes¶
| Code | Meaning | Resolution |
|---|---|---|
| 0 | Success | - |
| 1 | General error | Check error message |
| 2 | Missing model | Verify model name and directory |
| 3 | Validation failure | Review submission files |
| 4 | Timeout exceeded | Increase timeout or optimize code |
| 5 | File not found | Ensure all required files exist |
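When scripting around the CLI, you can branch on these codes; a minimal sketch (the invocation flags match the usage above, while the handling logic is illustrative):

```python
import subprocess
import sys

# Run the benchmark and branch on the documented exit codes.
proc = subprocess.run(
    [sys.executable, "run_benchmark.py", "--model", "example_model", "--quiet"]
)
if proc.returncode == 0:
    print("benchmark succeeded")
elif proc.returncode == 2:
    sys.exit("missing model: verify the model name and submissions directory")
elif proc.returncode == 4:
    sys.exit("timeout exceeded: increase the timeout or optimize the submission")
else:
    sys.exit(proc.returncode)
```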
Platform-Specific Notes¶
Windows¶
- Use the `python` or `py` command
- Paths may use backslashes or forward slashes
- PowerShell may require an execution policy adjustment
macOS/Linux¶
- Use the `python3` command
- Ensure proper file permissions
- May need `sudo` for certain operations
Docker¶
- Mount submissions directory as volume
- Set environment variables in container
- Use non-root user for security
Python API¶
Core Modules¶
benchmark.runner¶
Main benchmark execution module.
```python
from benchmark.runner import BenchmarkRunner

# Initialize the runner
runner = BenchmarkRunner(
    model_name="gpt4",
    results_dir="results/",
    timeout=30,
)

# Run all tests
results = runner.run_all_tests()

# Run a specific test
prompt1_result = runner.run_test("prompt_1_refactoring")
```
benchmark.validators¶
Validation logic for each prompt.
```python
from benchmark.validators import (
    validate_prompt1_refactoring,
    validate_prompt2_yaml_json,
    validate_prompt3_transformation,
    validate_prompt4_api_integration,
)

# Validate refactored code
result = validate_prompt1_refactoring(
    solution_path="submissions/model/prompt_1_solution.py"
)
print(f"Score: {result['score']}/{result['max_score']}")
```
benchmark.scoring¶
Scoring engine with 7-category assessment.
```python
from benchmark.scoring import calculate_score, get_grade

# Calculate the total score across prompts
results = {
    'prompt_1': {'score': 23.5, 'max_score': 25},
    'prompt_2': {'score': 25.0, 'max_score': 25},
    'prompt_3': {'score': 24.5, 'max_score': 25},
    'prompt_4': {'score': 21.0, 'max_score': 25},
}
total_score = calculate_score(results)
grade = get_grade(total_score)
print(f"Total: {total_score}/100 - Grade: {grade}")
```
benchmark.utils¶
Utility functions for file operations and formatting.
```python
from benchmark.utils import (
    safe_load_json,
    safe_load_yaml,
    format_results,
    create_comparison_chart,
)

# Safe file loading with error handling
data = safe_load_json("user_data.json")
config = safe_load_yaml("config.yaml")

# Format results for display
formatted = format_results(results)
print(formatted)

# Create a comparison chart
chart = create_comparison_chart(results)
print(chart)
```
Full API reference (collapsed)¶
benchmark.scoring
benchmark.validators ¶
Validators for AI Code Benchmark prompts.
ScoringDetail ¶
SecurityAnalyzer ¶
Analyzes code for common security vulnerabilities.
check_sql_injection_patterns staticmethod ¶
Check for potential SQL injection vulnerabilities.
check_hardcoded_secrets staticmethod ¶
Check for hardcoded secrets and API keys.
check_path_traversal staticmethod ¶
Check for path traversal vulnerabilities.
analyze_code_security classmethod ¶
Perform comprehensive security analysis on code.
PerformanceAnalyzer ¶
Analyzes code for performance issues and inefficient patterns.
check_nested_loops staticmethod ¶
Check for O(n²) and nested loop patterns that may be inefficient.
check_inefficient_patterns staticmethod ¶
Check for common inefficient programming patterns.
check_memory_patterns staticmethod ¶
Check for potential memory inefficiencies.
check_algorithm_efficiency staticmethod ¶
Check for algorithmically inefficient approaches.
analyze_code_performance classmethod ¶
Perform comprehensive performance analysis on code.
MaintainabilityAnalyzer ¶
Analyzes code for maintainability issues and code quality metrics.
check_function_length staticmethod ¶
Check for overly long functions (>20 lines).
check_code_duplication staticmethod ¶
Check for obvious code duplication patterns.
check_variable_naming staticmethod ¶
Check for poor variable naming practices.
check_complexity_indicators staticmethod ¶
Check for high complexity indicators.
analyze_code_maintainability classmethod ¶
Perform comprehensive maintainability analysis on code.
PromptValidators ¶
run_in_sandbox ¶
Internal helper executing a validator method inside a SecureRunner sandbox.
Mirrors the prior sandbox logic while remaining reusable and type-safe.
sandbox_validator ¶
Decorator for PromptValidators instance methods preserving signature & return type.
benchmark.secure_runner
benchmark.secure_runner.SecureRunner ¶
Execute untrusted code in an isolated temporary environment.
run_with_limits ¶
Execute func(*args) under CPU & memory limits where supported.
memory_mb: default 512; override via the --mem flag (choices 256/384/512/768/1024).
run_python_sandboxed ¶
Execute a python module or script with -B inside the sandbox.
The caller MUST be inside a `with self.sandbox():` context so that the sitecustomize guard and strict environment are active. Applies resource limits: rlimits on POSIX, Job Objects on Windows.
benchmark.utils ¶
Utility functions for AI Code Benchmark
create_submission_template ¶
Create template directory in tiered structure.
Legacy fallback removed: always uses submissions/templates/template.
generate_comparison_chart ¶
Generate a simple text-based comparison chart.
validate_submission_structure ¶
Validate that a model submission has the correct file structure.
get_model_statistics ¶
Extract key statistics from benchmark results.
Always returns a structured stats object (never an empty dict) so that callers can rely on keys without defensive existence checks.
benchmark.types ¶
Shared TypedDict result shapes for benchmark validators and runner.
Centralizing these shapes ensures a single source of truth for prompt-level and model-level result contracts across the core pipeline (validators, runner, comparison utilities). Runtime behavior remains unchanged; this is purely a typing/structure consolidation.
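The concrete definitions are collapsed here; an illustrative sketch of the prompt-level shape, with field names taken from the JSON results above (the actual class names in `benchmark.types` may differ):

```python
from typing import TypedDict

class CategoryScore(TypedDict):
    """One scoring category, e.g. detailed_scoring['syntax']."""
    earned: float
    max: float

class PromptResult(TypedDict):
    """Shape of models.<name>.prompts.<prompt_id> in latest_results.json."""
    passed: bool
    score: float
    max_score: int
    feedback: list[str]
    tests_passed: dict[str, bool]
    detailed_scoring: dict[str, CategoryScore]
```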
See Also¶
- Getting Started - Initial setup and quick start
- Developer Guide - Adding and testing models
- Scoring Methodology - Understanding scores
- Troubleshooting - Common issues and solutions