# Developer Guide

Complete guide for adding and testing AI models in AIBugBench.
## Tiered Submission System

AIBugBench uses a three-tier system for code submissions:

### Tier 1: Reference Implementations

Location: `submissions/reference_implementations/`

- Full quality checks and security scanning
- Must pass all linting rules
- Used as the baseline for benchmarking

### Tier 2: Templates

Location: `submissions/templates/`

- Basic safety checks only
- Starting points for new implementations
- Relaxed linting rules

### Tier 3: User Submissions

Location: `submissions/user_submissions/`

- AI-generated and user code
- Excluded from automated checks
- Not tracked in git

### Usage

- Copy a template into `submissions/user_submissions/`
- Implement your solution
- Run the benchmark:

```bash
python run_benchmark.py --model your_model
```

Reference implementations demonstrate best practices and serve as the comparison baseline.
## Quick Start

- Copy the template (tiered layout) into `submissions/user_submissions/<your_model_name>/` — Windows PowerShell is recommended; Windows CMD and macOS/Linux Bash work as well (see the sketch after this list)
- Present prompts to your AI model
- Save the AI's code responses in the appropriate files
- Run the benchmark (see the sketch after this list)
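A minimal Bash sketch of the copy-and-run flow, assuming the template files live directly under `submissions/templates/` (PowerShell users can use `Copy-Item -Recurse`; CMD users can use `xcopy /E`):

```bash
# Create your submission directory and copy the template into it
mkdir -p submissions/user_submissions/your_model_name
cp -r submissions/templates/* submissions/user_submissions/your_model_name/

# Run the benchmark against your submission
python run_benchmark.py --model your_model_name
```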
## Detailed Process

### Step 1: Prepare Your Model Environment

Ensure you have access to the AI model you want to test:

- ChatGPT (GPT-4, GPT-3.5, etc.)
- Claude (Opus, Sonnet, Haiku)
- GitHub Copilot
- Local models (Llama, Mistral, etc.)
- Custom fine-tuned models

Example model names: `gpt4_turbo`, `claude_opus_3`, `copilot_2024`, `llama_70b`
### Step 2: Prime Your AI Model

For best results, provide context to your AI model using the `ai_prompt.md` file generated during setup. This comprehensive context file includes:

- The broken test data files that need fixing
- Clear instructions about the benchmark requirements
- Examples of expected output formats

Copy and paste the entire contents of `ai_prompt.md` into your AI conversation before presenting any prompts; one way to print the file is sketched below.
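For example, printing the file so its contents can be copied (any method of copying works):

```bash
# Display the generated context file for copy-pasting into the AI conversation
cat ai_prompt.md
```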
### Step 3: Present Each Prompt

#### Prompt 1: Code Refactoring

- Open `prompts/prompt_1_refactoring.md`
- Copy the entire prompt text to your AI model
- Save the AI's Python code response as `submissions/user_submissions/your_model_name/prompt_1_solution.py`

Expected Output Structure:
```python
#!/usr/bin/env python3
"""
Refactored process_records.py with modern Python practices.
"""
import json
import yaml
import logging
from pathlib import Path
from typing import Dict, List, Any


def setup_logging() -> logging.Logger:
    """Set up logging configuration."""
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s'
    )
    return logging.getLogger(__name__)


def load_user_data(file_path: Path) -> List[Dict[str, Any]]:
    """Load user data from JSON file with error handling."""
    try:
        with file_path.open('r', encoding='utf-8') as f:
            data = json.load(f)
        return data if isinstance(data, list) else []
    except (FileNotFoundError, json.JSONDecodeError, PermissionError) as e:
        logging.error(f"Error loading {file_path}: {e}")
        return []


def filter_usa_users(users: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Filter users by USA country."""
    return [
        user for user in users
        if isinstance(user, dict) and user.get('country', '').upper() == 'USA'
    ]


def main() -> None:
    """Main execution function."""
    logger = setup_logging()
    data_file = Path('user_data.json')

    if not data_file.exists():
        logger.error(f"Data file {data_file} not found")
        return

    users = load_user_data(data_file)
    usa_users = filter_usa_users(users)

    logger.info(f"Found {len(usa_users)} USA users out of {len(users)} total")

    for user in usa_users:
        print(f"USA User: {user.get('name', 'Unknown')} - {user.get('email', 'No email')}")


if __name__ == "__main__":
    main()
```
#### Prompt 2: YAML/JSON Correction

- Open `prompts/prompt_2_yaml_json.md`
- Present the prompt to your AI model
- Save the corrected YAML as `prompt_2_config_fixed.yaml`
- Save the JSON conversion as `prompt_2_config.json`

Expected YAML Structure (`prompt_2_config_fixed.yaml`):
```yaml
use_legacy_paths: true

paths:
  data_source: /srv/data/production/users.json
  legacy_data_source: ./user_data.json
  log_file: /var/log/processor.log

validation_rules:
  min_age: 18
  max_age: 120
  required_fields:
    - name
    - email
    - country

processing_options:
  batch_size: 100
  timeout_seconds: 30
  retry_attempts: 3

api_keys:
  - primary_key
  - secondary_key
  - backup_key

feature_flags:
  enable_logging: true
  strict_validation: false
  debug_mode: false
```
Expected JSON Structure (`prompt_2_config.json`):

```json
{
  "use_legacy_paths": true,
  "paths": {
    "data_source": "/srv/data/production/users.json",
    "legacy_data_source": "./user_data.json",
    "log_file": "/var/log/processor.log"
  },
  "validation_rules": {
    "min_age": 18,
    "max_age": 120,
    "required_fields": ["name", "email", "country"]
  },
  "processing_options": {
    "batch_size": 100,
    "timeout_seconds": 30,
    "retry_attempts": 3
  },
  "api_keys": ["primary_key", "secondary_key", "backup_key"],
  "feature_flags": {
    "enable_logging": true,
    "strict_validation": false,
    "debug_mode": false
  }
}
```
#### Prompt 3: Data Transformation (Deterministic & Import-Safe)

- Open `prompts/prompt_3_transformation.md`
- Present the prompt to your AI model
- Save the response as `prompt_3_transform.py`

Key Requirements (now explicit for fairness):

- Provide exactly one Python file (no packages) that defines `transform_and_enrich_users(user_list)`
- No side effects at import time (the benchmark imports the module)
- Deterministic: no randomness, no network, no external writes
- Transformations per user:
  - Coerce `id` to int when possible
  - Add `contact.email_provider` = domain after `@` if an email is present
  - Normalize `stats.age` to int when possible
  - Add `account_tier` using rules: Gold (>100 posts AND >300 comments), else Silver (>50 posts), else Bronze
  - Gracefully skip only the failing sub-step for malformed/missing data (never raise)
- Return a list the same length as the input
- Optional main guard demo allowed; ignored by scoring (a minimal sketch follows this list)
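Below is a minimal sketch of a conforming `prompt_3_transform.py`, assuming user records are dicts with optional `id`, `contact.email`, `stats.age`, `stats.posts`, and `stats.comments` fields (the real test-data schema is defined by the prompt; treat this as an illustration, not the reference implementation):

```python
import copy
from typing import Any, Dict, List


def transform_and_enrich_users(user_list: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Deterministically enrich each user record, skipping only failing sub-steps."""
    enriched = []
    for original in user_list:
        if not isinstance(original, dict):
            enriched.append(original)  # keep the output the same length as the input
            continue
        user = copy.deepcopy(original)  # never mutate the caller's data

        # 1. Coerce id to int when possible
        try:
            user["id"] = int(user["id"])
        except (KeyError, TypeError, ValueError):
            pass

        # 2. Add contact.email_provider (domain after '@') if an email is present
        contact = user.get("contact")
        if isinstance(contact, dict):
            email = contact.get("email")
            if isinstance(email, str) and "@" in email:
                contact["email_provider"] = email.rsplit("@", 1)[1]

        # 3. Normalize stats.age to int when possible
        stats = user.get("stats")
        if isinstance(stats, dict):
            try:
                stats["age"] = int(stats["age"])
            except (KeyError, TypeError, ValueError):
                pass

        # 4. Add account_tier: Gold (>100 posts AND >300 comments), else Silver (>50 posts), else Bronze
        try:
            posts = int(stats.get("posts", 0)) if isinstance(stats, dict) else 0
            comments = int(stats.get("comments", 0)) if isinstance(stats, dict) else 0
        except (TypeError, ValueError):
            posts, comments = 0, 0
        if posts > 100 and comments > 300:
            user["account_tier"] = "Gold"
        elif posts > 50:
            user["account_tier"] = "Silver"
        else:
            user["account_tier"] = "Bronze"

        enriched.append(user)
    return enriched
```

Because the file contains only imports and a function definition, importing it has no side effects, and the logic is fully deterministic.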
#### Prompt 4: API Integration

- Open `prompts/prompt_4_api_simulation.md`
- Present the prompt to your AI model
- Save the API function as `prompt_4_api_sync.py`

Key Requirements:

- Function must be named exactly `sync_users_to_crm`
- Must handle various HTTP status codes (200, 400, 401, 503)
- Must include Bearer authentication and timeout settings (see the sketch below)
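A minimal sketch using the `requests` library; the function name comes from the prompt, but the parameter list, endpoint URL, and return shape here are assumptions for illustration:

```python
import requests

CRM_URL = "https://crm.example.com/api/v1/users/sync"  # hypothetical endpoint


def sync_users_to_crm(users, api_token, timeout=10):
    """POST users to the CRM and report the outcome per status code."""
    headers = {
        "Authorization": f"Bearer {api_token}",  # Bearer authentication
        "Content-Type": "application/json",
    }
    try:
        response = requests.post(CRM_URL, json=users, headers=headers, timeout=timeout)
    except requests.RequestException as exc:
        return {"success": False, "error": f"Request failed: {exc}"}

    if response.status_code == 200:
        return {"success": True, "synced": len(users)}
    if response.status_code == 400:
        return {"success": False, "error": "Bad request: payload rejected by CRM"}
    if response.status_code == 401:
        return {"success": False, "error": "Unauthorized: check the API token"}
    if response.status_code == 503:
        return {"success": False, "error": "CRM unavailable: retry later"}
    return {"success": False, "error": f"Unexpected status {response.status_code}"}
```

The prompt defines the exact inputs and expected behaviour; the point here is the Bearer header, the explicit `timeout`, and branch handling for 200/400/401/503.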
### Step 4: Validate Your Submission

Check that all files exist and have content:

```bash
# List all files
ls -la submissions/user_submissions/your_model_name/

# Quick syntax validation
python -m py_compile submissions/user_submissions/your_model_name/prompt_1_solution.py
python -c "import yaml; yaml.safe_load(open('submissions/user_submissions/your_model_name/prompt_2_config_fixed.yaml'))"
python -c "import json; json.load(open('submissions/user_submissions/your_model_name/prompt_2_config.json'))"
```
### Step 5: Run the Benchmark
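Run the full benchmark against your submission (same invocation as in the Quick Start):

```bash
python run_benchmark.py --model your_model_name
```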
#### Expected Results

```text
Testing model: your_model_name
Testing Refactoring & Analysis... PASSED - Score: 23.50/25 (94.0%)
Testing YAML/JSON Correction... PASSED - Score: 25.00/25 (100.0%)
Testing Data Transformation... PASSED - Score: 24.75/25 (99.0%)
Testing API Simulation... PASSED - Score: 21.00/25 (84.0%)

Final Score: 94.25/100 (94.3%) - Grade: A
Results saved to: results/latest_results.json
```
## Tips for Best Results

### For AI Model Interaction

- **Be Specific:** Ask for complete, working code
- **Mention Requirements:** Remind the AI about function names and error handling requirements
- **Request Testing:** Ask the AI to consider edge cases and provide robust error handling
- **Complete Code Only:** Save exactly what the AI provides, including all imports and functions
- **Proper Indentation:** Maintain Python's 4-space indentation standard
### Understanding Sabotage Patterns
The test data includes intentional bugs and issues. See Sabotage Documentation for details on:
- Syntax errors in YAML/JSON
- Logic bugs in Python code
- Security vulnerabilities
- Performance anti-patterns
## Common Issues and Solutions

### Missing Function Names

- **Issue:** Function not named exactly as required
- **Solution:** Double-check that function names match the prompt requirements exactly
  - Prompt 3: `transform_and_enrich_users`
  - Prompt 4: `sync_users_to_crm`
### Import Errors

- **Issue:** Missing required imports
- **Solution:** Ensure all necessary imports are included at the top of each file
### File Format Issues

- **Issue:** YAML syntax errors, invalid JSON
- **Solution:**
  - YAML: Use 2-space indentation, no tabs
  - JSON: Use proper boolean values (`true`, not `"true"`)
### Incomplete Error Handling

- **Issue:** Code crashes on edge cases
- **Solution:** Emphasize robust error handling in your prompts
### Low Security Scores

- **Issue:** Security vulnerabilities detected
- **Solution:** Avoid the following (see the short example below):
  - `eval()` or `exec()`
  - Hardcoded API keys
  - Missing input validation
  - `shell=True` in subprocess
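For illustration, a safer pattern for each of these points (the environment variable name and the command are hypothetical):

```python
import os
import subprocess

# Read secrets from the environment instead of hardcoding them in source
api_key = os.environ.get("CRM_API_KEY", "")  # hypothetical variable name


# Validate untrusted input instead of passing it to eval()/exec()
def parse_batch_size(raw: str) -> int:
    if not raw.isdigit():
        raise ValueError(f"batch_size must be a positive integer, got {raw!r}")
    return int(raw)


# Invoke subprocess with an argument list rather than shell=True
subprocess.run(["ls", "-la", "submissions/"], check=True)
```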
### Poor Performance Scores

- **Issue:** Inefficient algorithms detected
- **Solution:** Avoid the following (a before/after sketch follows this list):
  - Nested loops over the same data (O(n²))
  - String concatenation in loops
  - Multiple sorting operations
  - Loading entire files unnecessarily
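An illustrative before/after pair for two of these patterns (not taken from the benchmark code):

```python
# Before: repeated list rebuilds and string concatenation in a loop
def summarize_slow(users, allowed_ids):
    report = ""
    for user in users:
        if user["id"] in list(allowed_ids):  # rebuilds the list on every iteration
            report += f"{user['name']}\n"    # quadratic string building
    return report


# After: O(n) with a set lookup and a single join
def summarize_fast(users, allowed_ids):
    allowed = set(allowed_ids)
    lines = [user["name"] for user in users if user["id"] in allowed]
    return "\n".join(lines)
```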
## Advanced Usage

### Repository Audit & Quality Gate

Run the consolidated repository audit before submitting significant changes. It supports a strict mode with a minimum score threshold (the repo health check fails below 85) and a JSON-only output mode that suppresses the console summary.

Recommended pre-PR checklist additions:

- Lint: `ruff check .`
- Tests: `pytest -q`
- Audit score ≥ 85 and no Critical findings

Audit scoring factors: structure completeness, security config presence, placeholder leakage, duplicated legacy files, ignored artifact hygiene, and documentation alignment.
### Testing Variations

Test different versions or prompting strategies by keeping each variant in its own submission directory:

```text
submissions/gpt4_version1/
submissions/gpt4_version2/
submissions/gpt4_with_examples/
submissions/gpt4_with_context/
```
### Batch Testing

Test all models at once; this runs the benchmark for every model found in the `submissions/` directory.
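Running the benchmark without a `--model` argument is the likely invocation for this (an assumption; check `python run_benchmark.py --help` for the actual flags):

```bash
# Assumed invocation: omitting --model benchmarks every submission
python run_benchmark.py
```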
### Comparing Models

Compare results between models using the comparison charts and the per-run reports under `results/`.
## Contributing

### Sharing Results

Consider contributing your results:

- Fork the repository
- Add your model results to `results/community/`
- Update documentation with insights
- Submit a pull request
### Adding New Prompts

To add custom challenges:

- Create a new prompt file in `prompts/`
- Add validation logic in `benchmark/validators.py`
- Update the scoring rubric in `benchmark/scoring.py`
- Document it in the scoring methodology
## Next Steps

- **Review detailed scores:** Check `results/<RUN_TS>/detailed/summary_report_<RUN_TS>.txt` (the root `latest_results.json` points to the most recent run)
- **Analyze weak areas:** Focus on the categories with the lowest scores
- **Iterate and improve:** Refine prompting strategies
- **Compare models:** Use comparison charts to evaluate different approaches
📊 Scoring Methodology | 🔧 Troubleshooting | 🐛 Sabotage Notes