---
title: Model Submission Template
description: Instructions for replacing template prompt files with full AI-generated solutions for benchmarking.
search:
  boost: 0.5
---
# Model Submission Template
For broader context on submission workflows, see `docs/developer-guide.md`; scoring is covered in `docs/scoring-methodology.md`. This template README describes only how to replace each prompt file.
Copy this template directory and rename it to your model name (e.g., `gpt4`, `claude_sonnet_4`, `copilot`).
## How to Use This Template
**IMPORTANT:** Replace each template file completely with your AI's full solution. Do NOT fill in the template structure piece by piece; replace the entire contents of each file with your AI's complete code or data.
## Files to Replace with AI Solutions
- `prompt_1_solution.py`: Replace the entire file with your AI's complete refactored version of `process_records.py`.
- `prompt_2_config_fixed.yaml`: Replace the entire file with your AI's corrected YAML version of `config.yaml`.
- `prompt_2_config.json`: Replace the entire file with your AI's complete JSON conversion of the corrected config.
- `prompt_3_transform.py`: Replace the entire file with your AI's complete implementation of the `transform_and_enrich_users` function (see the sketch after this list).
- `prompt_4_api_sync.py`: Replace the entire file with your AI's complete implementation of the `sync_users_to_crm` function.
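To make "replace everything" concrete, the sketch below shows the expected shape of a finished `prompt_3_transform.py`: one complete, self-contained file rather than a filled-in skeleton. The function signature and record fields here are hypothetical; the actual prompt defines the real contract.

```python
# prompt_3_transform.py -- illustrates the expected *shape* of a replacement
# file: a complete, self-contained solution, not a filled-in template.
# NOTE: the signature and record fields below are hypothetical; follow
# whatever the actual prompt specifies.

from typing import Any


def transform_and_enrich_users(users: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Normalize raw user records and add derived fields."""
    enriched = []
    for user in users:
        record = dict(user)  # copy so the caller's data is not mutated
        record["email"] = record.get("email", "").strip().lower()
        name = f"{record.get('first_name', '')} {record.get('last_name', '')}"
        record["full_name"] = name.strip()
        enriched.append(record)
    return enriched
```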
## Workflow
1. **Copy template**: Create your model directory (steps 1 and 3 are sketched after this list).
2. **Get AI solutions**: Ask your AI to solve each prompt completely.
3. **Copy-paste replace**: Replace each template file entirely with the AI's solution.
4. **Test**: Run the benchmark to see your scores.
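For steps 1 and 3, a typical shell sequence might look like the following. All directory and path names are illustrative:

```bash
# Step 1: copy the template directory under your model's name
# (directory names are illustrative).
cp -r template/ gpt4/

# Step 3: overwrite each template file with the AI's complete solution.
cp ~/ai_solutions/prompt_1_solution.py gpt4/prompt_1_solution.py
```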
## Testing Your Submission
After completing your files, run:
All Platforms:
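As a minimal sketch, assuming a `run_benchmark.py` entry point at the repository root (a hypothetical name; substitute the actual command the benchmark documents):

```bash
# Hypothetical entry point and flag -- substitute the actual command
# documented in the benchmark repository.
python run_benchmark.py --model your_model_name
```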
## Scoring
Each prompt uses a comprehensive 7-category scoring system (25 points total):
- **Syntax**: Code compilation and basic structure
- **Structure**: Organization, imports, and function design
- **Execution**: Runtime behavior and correctness
- **Quality**: Code style, patterns, and best practices
- **Security**: Vulnerability detection and safe coding practices
- **Performance**: Algorithm efficiency and optimization analysis
- **Maintainability**: Code complexity and long-term maintenance
**Grading threshold:**
- 60% or higher (15+ points per prompt) = PASS
- Below 60% = FAIL
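To make the threshold concrete, here is a small sketch of the pass/fail arithmetic. The per-category point values are hypothetical, since this README does not specify how the 25 points are split across the seven categories:

```python
# Hypothetical per-category scores for one prompt. The real split of the
# 25 points across the seven categories is defined by the benchmark.
scores = {
    "syntax": 3, "structure": 2, "execution": 4, "quality": 2,
    "security": 2, "performance": 2, "maintainability": 2,
}

total = sum(scores.values())              # 17 of 25 points
percentage = total / 25 * 100             # 68.0%
print("PASS" if total >= 15 else "FAIL")  # PASS, since 17 >= 15 (60% of 25)
```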
**Enhanced feedback**: The benchmark reports category-specific feedback showing which aspects passed or failed, with a rationale for each, so you know exactly what to improve.
📖 For detailed scoring rubrics, see `/docs/scoring_rubric.md`.
Good luck!