---
title: Model Submission Template
description: Instructions for replacing template prompt files with full AI-generated solutions for benchmarking.
search:
  boost: 0.5
---


Model Submission Template

For broader context on submission workflows, see docs/developer-guide.md; for scoring, see docs/scoring-methodology.md. This template README describes only how to replace each prompt file.

Copy this template directory and rename it to your model name (e.g., gpt4, claude_sonnet_4, copilot).

How to Use This Template

IMPORTANT: Each template file should be completely replaced with your AI's full solution. Do NOT try to fill in the template structure; replace everything with your AI's complete code/data.

Files to Replace with AI Solutions

  1. prompt_1_solution.py - Replace entire file with your AI's complete refactored version of process_records.py
  2. prompt_2_config_fixed.yaml - Replace entire file with your AI's corrected YAML version of config.yaml
  3. prompt_2_config.json - Replace entire file with your AI's complete JSON conversion of the corrected config (see the conversion sketch after this list)
  4. prompt_3_transform.py - Replace entire file with your AI's complete implementation of transform_and_enrich_users function
  5. prompt_4_api_sync.py - Replace entire file with your AI's complete implementation of sync_users_to_crm function
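
If you produce the corrected YAML first, the JSON copy can be generated mechanically rather than by hand. A minimal sketch, assuming PyYAML is installed and the template's file names are unchanged; this is illustrative, not part of the benchmark:

```python
# Illustrative helper: read the corrected YAML and write the JSON copy.
# Assumes PyYAML is installed (pip install pyyaml); file names match the template.
import json

import yaml

with open("prompt_2_config_fixed.yaml") as src:
    config = yaml.safe_load(src)

with open("prompt_2_config.json", "w") as dst:
    json.dump(config, dst, indent=2)
```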

Workflow

  1. Copy template: Create your model directory (a short sketch follows this list)
  2. Get AI solutions: Ask your AI to solve each prompt completely
  3. Copy-paste replace: Replace each template file entirely with AI's solution
  4. Test: Run the benchmark to see your scores
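
Steps 1 and 3 boil down to copying the template directory and then overwriting each file's contents. A minimal Python sketch, where "template" and "claude_sonnet_4" are placeholder directory names (adjust to your layout):

```python
# Illustrative: step 1 - copy the template directory under your model's name.
# "template" and "claude_sonnet_4" are placeholders; use your actual names.
import shutil

shutil.copytree("template", "claude_sonnet_4")

# Step 3 is then plain copy-paste: overwrite each prompt_* file inside
# claude_sonnet_4/ with your AI's complete solution.
```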

Testing Your Submission

After completing your files, run:

All Platforms:

```
python run_benchmark.py --model your_model_name
```
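
For example, a submission directory named claude_sonnet_4 would be scored with python run_benchmark.py --model claude_sonnet_4; the --model value is presumably the directory name you chose in step 1.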

Scoring

Each prompt uses a comprehensive 7-category scoring system (25 points total):

  • Syntax: Code compilation and basic structure
  • Structure: Organization, imports, and function design
  • Execution: Runtime behavior and correctness
  • Quality: Code style, patterns, and best practices
  • Security: Vulnerability detection and safe coding practices
  • Performance: Algorithm efficiency and optimization analysis
  • Maintainability: Code complexity and long-term maintenance

Grading threshold:

  • 60% or higher (15+ points per prompt) = PASS
  • Below 60% = FAIL
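
To make the arithmetic concrete: 60% of the 25-point maximum is 15 points, so a prompt passes at 15 or above. An illustrative check in Python (not the benchmark's actual grading code):

```python
# Illustrative only; not the benchmark's grading code.
MAX_POINTS = 25
PASS_RATIO = 0.60  # 60% threshold

def passes(score: float) -> bool:
    """A prompt passes at 60% of 25 points, i.e. 15 or more."""
    return score >= MAX_POINTS * PASS_RATIO

assert passes(15)      # 15/25 = 60% -> PASS
assert not passes(14)  # 14/25 = 56% -> FAIL
```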

Enhanced feedback: The benchmark provides detailed category-specific feedback showing exactly which aspects passed or failed, along with specific guidance for improvement.

📖 For detailed scoring rubrics, see /docs/scoring_rubric.md

Good luck!