User Guide¶
How to run benchmarks and read results.
Running Your First Benchmark¶
Basic Workflow¶
- Setup (one-time): See Getting Started for platform-specific installation and environment setup.
- Test example model: see the command sketch after the results layout below.
- Review results (v0.8.1+ timestamped layout):
Each benchmark run writes to a timestamped directory preserving history:
```
results/
  latest_results.json                    # Pointer to most recent run (backward-compatible)
  20250827_143215/                       # Run-specific directory (YYYYMMDD_HHMMSS format)
    latest_results.json                  # Full JSON (models + comparison + _metadata)
    detailed/
      detailed_results.json              # Stable path for tooling
      summary_report_20250827_143215.txt
    comparison_charts/
      comparison_chart.txt               # ASCII comparison bars
```
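For the "Test example model" step, a likely invocation is shown below; it assumes the `--model` flag covered under Comparing Multiple Models and a model name matching the bundled `submissions/example_model/` directory referenced under Getting Help.

```bash
# Run the benchmark against the bundled example submission
# (model name assumed to match the submissions/example_model/ directory)
python run_benchmark.py --model example_model
```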
Key advantages:
- Atomic writes (no partial files)
- Historical retention (track model performance over time)
- Dynamic prompt support (new prompts automatically appear in comparison data)
Quick access:
- Check console output for immediate feedback
- Open `results/latest_results.json` (most recent run)
- Review `results/<RUN_TS>/detailed/summary_report_<RUN_TS>.txt` for human-readable analysis
Understanding the Benchmark Process¶
What Happens During a Benchmark Run¶
- Initialization: Loads your model's submissions from `submissions/your_model/`
- Validation: Each solution file is validated for syntax and structure
- Execution: Solutions are run against test data to verify functionality
- Analysis: Code is analyzed for security, performance, and maintainability
- Scoring: Points awarded based on 7-category assessment
- Reporting: Results saved in multiple formats for review
The Four Challenges¶
Challenge 1: Code Refactoring (25 points)¶
- Goal: Modernize legacy Python code
- Focus: Clean code, error handling, security
Challenge 2: Configuration Repair (25 points)¶
- Goal: Fix broken YAML/JSON files
- Focus: Format validation, cross-format consistency
Challenge 3: Data Transformation (25 points)¶
- Goal: Implement business logic for data processing
- Focus: Correct transformations, edge case handling
Challenge 4: API Integration (25 points)¶
- Goal: Create robust API client with error handling
- Focus: Security, authentication, resilience
Interpreting Your Results¶
Console Output¶
During benchmark execution, you'll see:
```
Testing model: your_model_name
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Testing Refactoring & Analysis...
  ✅ Syntax Validation     (5.0/5.0)
  ✅ Code Structure        (3.0/3.0)
  ✅ Execution Success     (6.0/6.0)
  ⚠️ Code Quality          (2.5/3.0)
  ✅ Security Analysis     (4.0/4.0)
  ⚠️ Performance Analysis  (1.5/2.0)
  ✅ Maintainability       (2.0/2.0)

PASSED - Score: 24.00/25 (96.0%)
```
Symbol Meanings:
- ✅ Full points earned
- ⚠️ Partial points earned
- ❌ No points earned
Results Files¶
latest_results.json (root pointer)¶
Complete scoring data with detailed breakdowns:
- Individual category scores
- Specific issues detected
- Performance metrics
- Security vulnerabilities
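If you prefer to inspect these fields programmatically, the sketch below loads the root pointer file; the key names (`models`, `score`, `categories`) are assumptions made for illustration, so check them against your actual JSON.

```python
import json
from pathlib import Path

# Load the root pointer to the most recent run (layout shown earlier in this guide).
results = json.loads(Path("results/latest_results.json").read_text())

# "models", "score", and "categories" are assumed key names for illustration;
# open the file once to confirm the real structure before relying on them.
for model_name, data in results.get("models", {}).items():
    print(f"{model_name}: {data.get('score')}")
    for category, value in data.get("categories", {}).items():
        print(f"  {category}: {value}")
```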
detailed/summary_report_<RUN_TS>.txt¶
Human-readable analysis including:
- Score breakdown by category
- Specific feedback for improvements
- Comparison to benchmark standards
- Recommendations for fixes
comparison_charts/comparison_chart.txt¶
Visual representation of scores:
- Progress bars for each category
- Overall grade and percentage
- Quick visual assessment
Understanding Grades¶
See Scoring Methodology – Grade Scale for the canonical letter-grade interpretation and thresholds.
Result Metadata & Privacy¶
What Information is Captured¶
Benchmark results embed minimal provenance metadata to aid reproducibility and debugging:
- spec_version: Benchmark scoring specification revision
- git_commit: Short commit hash of the repository state (local only, never transmitted)
- python_version: Interpreter version used to run the benchmark
- platform: Operating system, release, and architecture string
- timestamp_utc: UTC run time (RFC 3339 format with Z suffix)
- dependency_fingerprint: First 16 hex characters of the SHA256 hash of `requirements.txt` (non-reversible drift indicator)
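As a rough illustration of how such a fingerprint can be computed (a sketch of the described behavior, not necessarily the project's exact code):

```python
import hashlib
from pathlib import Path

def dependency_fingerprint(path: str = "requirements.txt") -> str:
    """Return the first 16 hex characters of the SHA256 hash of a file."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return digest[:16]

# Changes whenever requirements.txt drifts, but cannot be reversed to the file contents.
print(dependency_fingerprint())
```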
Privacy Assurance¶
No personal data is collected or transmitted. All data is written only to local results/ JSON and text files. If you publish results publicly, you may reveal commit hashes or platform details, but no identifying information is included by default.
Opting Out¶
If you need to share results from private repositories or sensitive environments, you can suppress metadata collection:
CLI Flag:
Environment Variable:
Either mechanism retains only spec_version and suppresses git/platform/timestamp/dependency fingerprint fields.
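The exact flag and variable names are omitted here; the sketch below uses hypothetical names (`--no-metadata` and `BENCHMARK_NO_METADATA`) purely to show the two mechanisms, so substitute the real names from the API Reference – CLI Reference.

```bash
# Hypothetical names for illustration only - see the CLI Reference for the real ones.
python run_benchmark.py --model your_model --no-metadata            # CLI flag
BENCHMARK_NO_METADATA=1 python run_benchmark.py --model your_model  # environment variable
```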
Best Practices for High Scores¶
Security First¶
- Never use `eval()` or `exec()`
- Avoid hardcoded credentials
- Validate all inputs
- Use safe parsing methods (`yaml.safe_load`)
- Include proper authentication headers
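For example, a minimal contrast between unsafe and safe parsing, assuming a local `config.yaml` (illustrative only, not a benchmark solution):

```python
import ast
import yaml

# Unsafe: yaml.load with an unsafe loader can construct arbitrary Python objects,
# and eval() executes arbitrary expressions - both are flagged by security analysis.
# data = yaml.load(text, Loader=yaml.Loader)
# value = eval(user_input)

# Safer alternatives: plain data types only, no code execution.
with open("config.yaml", "r", encoding="utf-8") as fh:
    data = yaml.safe_load(fh)

value = ast.literal_eval("[1, 2, 3]")  # parses literal structures without executing code
```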
Performance Matters¶
- Avoid nested loops (O(n²) complexity)
- Use efficient data structures (sets for membership)
- Minimize file I/O operations
- Process data in single passes when possible
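For example, replacing a repeated linear scan with a set lookup (a generic illustration, not drawn from the benchmark's test data):

```python
# O(n*m): re-scans the allowed list for every record.
def filter_slow(records: list[dict], allowed_ids: list[int]) -> list[dict]:
    return [r for r in records if r["id"] in allowed_ids]

# O(n+m): a set gives constant-time membership checks and the records
# are processed in a single pass.
def filter_fast(records: list[dict], allowed_ids: list[int]) -> list[dict]:
    allowed = set(allowed_ids)
    return [r for r in records if r["id"] in allowed]
```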
Maintainable Code¶
- Keep functions under 20 lines
- Use descriptive variable names
- Add error handling for edge cases
- Include type hints and docstrings
- Avoid code duplication
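A short generic example of these conventions (descriptive name, type hints, docstring, edge-case handling):

```python
from typing import Iterable

def average_order_value(order_totals: Iterable[float]) -> float:
    """Return the mean order total, or 0.0 for an empty input."""
    totals = list(order_totals)
    if not totals:  # edge case: avoid ZeroDivisionError
        return 0.0
    return sum(totals) / len(totals)
```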
Proper Error Handling¶
- Use specific exception types
- Provide informative error messages
- Handle network failures gracefully
- Include retry logic for transient errors
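A minimal retry sketch using only the standard library; the URL, timeout, and backoff values are placeholders, not benchmark requirements:

```python
import time
import urllib.error
import urllib.request

def fetch_with_retry(url: str, retries: int = 3, backoff: float = 1.0) -> bytes:
    """Fetch a URL, retrying transient network failures with incremental backoff."""
    for attempt in range(1, retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                return response.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            if attempt == retries:
                raise RuntimeError(f"Giving up on {url} after {retries} attempts") from exc
            time.sleep(backoff * attempt)  # wait a little longer after each failure
    raise ValueError("retries must be at least 1")
```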
Comparing Multiple Models¶
Run all discovered models or a specific model. For concurrency and all CLI flags, see the API Reference – CLI Reference.
```bash
# All models (sequential)
python run_benchmark.py

# Single model
python run_benchmark.py --model your_model
```
Analyzing Differences¶
Use the comparison script:
This generates:
- Side-by-side score comparisons
- Strength/weakness analysis per model
- Statistical summaries
- Recommendations for model selection
Troubleshooting Low Scores¶
Syntax/Structure Issues¶
Problem: Files don't parse or load
Solution: Verify proper Python/YAML/JSON syntax
Execution Failures¶
Problem: Code crashes during execution
Solution: Add error handling for edge cases
Security Penalties¶
Problem: Dangerous patterns detected
Solution: Review security best practices, avoid unsafe functions
Performance Deductions¶
Problem: Inefficient algorithms flagged
Solution: Optimize loops, use appropriate data structures
Maintainability Concerns¶
Problem: Complex or duplicated code
Solution: Refactor for clarity, extract common functions
Advanced Usage¶
For full CLI options (timeouts, verbosity, concurrency, output directories) and CI examples, see the API Reference – CLI Reference.
Next Steps¶
After running benchmarks:
- Analyze weak areas: Focus on lowest-scoring categories
- Review specific feedback: Each issue includes improvement suggestions
- Iterate on solutions: Address issues and re-test
- Compare approaches: Try different prompting strategies
- Share insights: Contribute findings to the community
Getting Help¶
- Documentation: Review guides in the `docs/` directory
- Examples: Study `submissions/example_model/` for reference
- Troubleshooting: See Troubleshooting Guide
- Contributing: Check Contributing Guidelines
See Also¶
- Getting Started - Initial setup
- Scoring Methodology - Score details
Glossary¶
- Benchmark Runner: Orchestrates submissions, validators, scoring, and writes outputs (entry: `run_benchmark.py`).
- Validator: Checks syntax, structure, execution; analyzes quality, security, performance, maintainability.
- SecureRunner (Sandbox): Confines filesystem, cleans env vars, enforces resource/time limits.
- Sabotage Fixtures: Intentionally broken inputs in `test_data/` representing real-world failure modes.
- Tiered Submissions: `submissions/` layout with `reference_implementations/`, `templates/template/`, `user_submissions/`.
- RUN_TS: Timestamped directory `YYYYMMDD_HHMMSS` under `results/` for detailed artifacts.
- latest_results.json: Root pointer to most recent run under `results/`.
- Grade Scale: Percentage→letter mapping in Scoring Methodology.
- Determinism: Reproducible, offline-by-default runs with no import-time side effects.
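As a rough illustration of the limits a sandbox such as SecureRunner enforces, the sketch below uses the standard `resource` and `subprocess` modules; it is a conceptual POSIX-only example, not the project's actual implementation.

```python
import resource
import subprocess

def run_limited(cmd: list[str], cpu_seconds: int = 10, memory_mb: int = 256) -> int:
    """Run a command with CPU-time and memory limits and a cleaned environment (POSIX only)."""
    def set_limits() -> None:
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        resource.setrlimit(resource.RLIMIT_AS, (memory_mb * 1024 * 1024,) * 2)

    # preexec_fn applies the limits in the child process; env={} drops inherited
    # environment variables, mirroring the "cleans env vars" behaviour described above.
    return subprocess.run(cmd, preexec_fn=set_limits, env={}, timeout=30).returncode
```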