Sabotage Notes: Deliberate Issues Added to AIBugBench

Overview

Catalog of deliberate, solvable edge cases reflecting real production pitfalls. These tests probe how models handle inconsistent, legacy, or problematic inputs without changing core requirements.

Design Principles:

  • Every hazard has a realistic origin story from production systems
  • All issues remain solvable through defensive programming practices
  • Success paths are preserved to ensure the benchmark remains achievable
  • Edge cases increase difficulty without changing core requirements

File-by-File Hazard Catalog

test_data/process_records.py

Hazards Introduced (compact):

  1. Builtin shadowing (list = []): confusing and unsafe → remove/rename.
  2. Mixed imports (dt + datetime): inconsistent references → standardize.
  3. Exception masking (except Exception: pass): silent failures/NameError risk → catch specific errors.
  4. Unsafe YAML loading (yaml.load): security risk → use yaml.safe_load.
  5. Naive datetime (mixed formats): TZ/parse issues → use aware datetimes and validate formats.
  6. Misleading params (hardcoded values): false flexibility → rename or make configurable.
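
A minimal sketch of the defensive rewrites these hazards call for, assuming dict records with illustrative field names ('timestamp', 'id') that may not match the actual process_records.py code:

import datetime
import logging
import yaml

def load_config(path):
    # Hazard 4: use yaml.safe_load, never yaml.load, on untrusted files
    with open(path, encoding='utf-8') as f:
        return yaml.safe_load(f)

def parse_timestamp(value):
    # Hazard 5: accept the mixed formats but always return an aware datetime
    formats = ('%Y-%m-%dT%H:%M:%S%z', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d')
    for fmt in formats:
        try:
            parsed = datetime.datetime.strptime(value, fmt)
            if parsed.tzinfo is None:
                parsed = parsed.replace(tzinfo=datetime.timezone.utc)
            return parsed
        except ValueError:
            continue  # Try the next known format
    raise ValueError(f"Unrecognized timestamp format: {value!r}")

def process_records(records):
    # Hazard 1: use a descriptive name instead of shadowing the builtin 'list'
    processed = []
    for record in records:
        try:
            record['parsed_at'] = parse_timestamp(record['timestamp'])
            processed.append(record)
        except (KeyError, ValueError) as exc:
            # Hazard 3: catch specific exceptions and log instead of 'except Exception: pass'
            logging.warning("Skipping record %r: %s", record.get('id'), exc)
    return processed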

test_data/user_data.json

Hazards Introduced (compact):

  1. JSON syntax: trailing commas, JS comments, duplicate keys → preprocess then parse.
  2. Types: "NaN" strings, leading zeros, mixed numeric types → normalize/coerce.
  3. Unicode: zero‑width, combining diacritics, BOM → normalize Unicode and strip BOM.
  4. Structure: contact can be string/object/array; missing fields; variable depth → defensive access and schema checks.
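
One hazard the pattern library below does not cover is duplicate keys: json.loads silently keeps the last value for a repeated key. A sketch of making that policy explicit with object_pairs_hook (here "first value wins and warn", which is just one reasonable choice):

import json
import logging

def detect_duplicates(pairs):
    # Called once per JSON object with its (key, value) pairs in document order
    result = {}
    for key, value in pairs:
        if key in result:
            logging.warning("Duplicate key %r: keeping the first value", key)
            continue
        result[key] = value
    return result

data = json.loads('{"id": 1, "id": 2, "name": "Jane"}',
                  object_pairs_hook=detect_duplicates)
# data == {'id': 1, 'name': 'Jane'}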

Success Path Preservation: Users 101, 105, and one additional clean record maintain perfect structure for successful processing:

  • Consistent data types
  • All required fields present
  • Standard JSON formatting
  • No Unicode complications

test_data/config.yaml

Hazards Introduced:

  1. Multi-Document YAML Structure
     • What: Two YAML documents with conflicting values
     • Real-world Context: Configuration file evolution, different environment configs
     • Impact: Parser confusion, value precedence issues
     • Solution: Multi-document aware parsing with merge strategies

  2. Mixed Boolean Representations
     • Document 1: Native YAML booleans (yes, true, false)
     • Document 2: String booleans ("true", "false", "1", "off")
     • Real-world Context: Different systems writing to the same config file
     • Impact: Boolean evaluation inconsistencies
     • Solution: Normalize boolean values during parsing

  3. Type System Chaos
     • Numbers as Strings: "021", "05", "60.00"
     • Leading Zeros: String numbers with leading zeros
     • Decimal Strings: Integers represented as decimal strings
     • Real-world Context: Form inputs, environment variable substitution
     • Impact: Numeric comparison failures, type conversion issues
     • Solution: Type coercion with validation

  4. Path Format Mixing
     • Environment Variables: $HOME/data, ~/logs
     • Platform Mixing: Windows (C:\) and Unix (/tmp) paths
     • Relative vs Absolute: Mixed path resolution strategies
     • Real-world Context: Cross-platform deployment, different developers
     • Impact: Path resolution failures, file-not-found errors
     • Solution: Platform-aware path handling, environment variable expansion

  5. Indentation Chaos
     • Mixed Tabs/Spaces: Within a single document
     • Inconsistent Levels: Varying indentation amounts
     • Real-world Context: Different editors, copy-paste operations
     • Impact: YAML parsing errors, structure misinterpretation
     • Solution: Indentation normalization, YAML validation

  6. YAML Advanced Features
     • Anchors and Aliases: &default_paths and *default_paths
     • Multi-line Strings: Various YAML string representations
     • Real-world Context: DRY principle application, complex configurations
     • Impact: Reference resolution issues, parser compatibility
     • Solution: YAML feature-aware parsing

Success Path Preservation: Document 2 contains a working path ("./user_data.json") that points to an actual file, and essential configuration values remain accessible through proper multi-document parsing.
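
Boolean normalization (hazard 2 above) is not covered by the pattern library below; a sketch of one approach, where the accepted truthy/falsy strings and the 'strict_mode' key in the usage line are assumptions to adjust against the actual config:

def normalize_bool(value, default=False):
    # The two YAML documents mix native booleans with string forms like "true" and "off"
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return bool(value)
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in ('true', 'yes', 'on', '1'):
            return True
        if lowered in ('false', 'no', 'off', '0'):
            return False
    return default

# Usage: strict_mode = normalize_bool(config.get('strict_mode', False))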

Solution Pattern Library (Comments added to explain 'expected' behavior)

Pattern 1: Safe File Reading with Encoding Handling

import json
import re

def safe_json_load(file_path):
    with open(file_path, encoding='utf-8-sig') as f:
        content = f.read()
        # utf-8-sig already drops a leading BOM; strip any stray one just in case
        content = content.lstrip('\ufeff')
        # Remove JavaScript-style comments (simple approach; may touch '//' inside strings)
        content = re.sub(r'//.*$', '', content, flags=re.MULTILINE)
        content = re.sub(r'/\*.*?\*/', '', content, flags=re.DOTALL)
        # Remove trailing commas (simplified)
        content = re.sub(r',(\s*[}\]])', r'\1', content)
        return json.loads(content)

Pattern 2: Type Normalization

def normalize_age(age_value):
    if age_value is None:
        return 0
    if isinstance(age_value, (int, float)):
        return int(age_value)
    if isinstance(age_value, str):
        if age_value.lower() in ('nan', 'null', ''):
            return 0
        # Handle leading zeros
        try:
            return int(float(age_value))  # Handle "21.0" -> 21
        except ValueError:
            return 0
    return 0

Pattern 3: Multi-Document YAML Handling

import yaml

def load_config_safe(config_path):
    with open(config_path, encoding='utf-8') as f:
        documents = list(yaml.safe_load_all(f))

    # Merge documents with precedence (last wins)
    config = {}
    for doc in documents:
        if doc:  # Skip None documents
            config.update(doc)

    return config

Pattern 4: Unicode Normalization

import unicodedata

def normalize_text(text):
    if not isinstance(text, str):
        return text

    # Remove zero-width characters
    text = ''.join(char for char in text if unicodedata.category(char) != 'Cf')

    # Normalize Unicode (NFC form)
    text = unicodedata.normalize('NFC', text)

    return text.strip()

Pattern 5: Defensive Field Access

def safe_get_nested(data, path, default=None):
    """Safely get nested dictionary values with dot notation"""
    keys = path.split('.')
    current = data

    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]

    return current

# Usage: email = safe_get_nested(user, 'contact.email', '')

Testing and Validation

Manual Verification Steps

  1. File Parsing Tests:
# Test JSON with comments and trailing commas
import json
with open('user_data.json', encoding='utf-8-sig') as f:
    raw = f.read()
    # Should fail with standard json.loads()
    # Should succeed with preprocessing

# Test multi-document YAML
import yaml
with open('config.yaml') as f:
    docs = list(yaml.safe_load_all(f))
    assert len(docs) == 2

  2. Unicode Handling Tests:
# Test zero-width space detection
name = "Jane​Doe"  # Contains U+200B
assert len(name) > len("JaneDoe")  # Should detect extra character

# Test combining diacritics  
muller1 = "Müller"        # Precomposed
muller2 = "Mu\u0308ller" # Combining diacritic
assert muller1 != muller2  # Different representations

  3. Type Coercion Tests:
# Test string number handling
assert int("0004") == 4      # Leading zeros
assert float("21.00") == 21.0 # Decimal strings

# Test NaN handling
try:
    int("NaN")  # Should raise ValueError
except ValueError:
    pass  # Expected

Automated Testing Framework

def test_enhanced_benchmark():
    """Test that enhanced files maintain solvability"""

    # Test 1: JSON parsing with preprocessing succeeds
    users = load_users_safe('test_data/user_data.json')
    assert len(users) >= 3  # At least 3 clean users

    # Test 2: YAML multi-document parsing succeeds  
    config = load_config_safe('test_data/config.yaml')
    assert 'validation_rules' in config

    # Test 3: Core business logic executable
    processor = SafeProcessor(config)
    results = processor.process_all_records()
    assert len(results) > 0  # Some records processed successfully

    # Test 4: Score achievability
    score = validate_solution_quality(results)
    assert score >= 18  # Minimum achievable with good solution

Troubleshooting Guide

Common Issues and Solutions

Issue: JSON parsing fails with syntax error
Cause: JavaScript comments or trailing commas
Solution: Preprocess JSON to remove comments and trailing commas

Issue: YAML parsing returns unexpected values
Cause: Multi-document structure; only the first document is loaded
Solution: Use yaml.safe_load_all() and merge documents appropriately

Issue: Unicode characters display incorrectly
Cause: Zero-width spaces, combining diacritics, encoding issues
Solution: Unicode normalization and BOM stripping

Issue: Type conversion errors on numeric strings
Cause: Leading zeros, decimal representations, "NaN" values
Solution: Robust type coercion with fallback values

Issue: File path resolution failures
Cause: Environment variables, platform-specific paths
Solution: Path expansion and platform-aware handling (see the sketch below)
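
A sketch of platform-aware path resolution for the last item (the 'data_path' key and default in the usage line are illustrative):

import os
from pathlib import Path

def resolve_path(raw_path, base_dir='.'):
    # Expand $VARS and ~ first, then anchor relative paths to base_dir
    expanded = os.path.expandvars(os.path.expanduser(raw_path))
    path = Path(expanded)
    if not path.is_absolute():
        path = Path(base_dir) / path
    return path.resolve()

# Usage: data_file = resolve_path(config.get('data_path', './user_data.json'))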

Performance Considerations

Impact: Enhanced files may require additional processing.

Mitigation:

  • Cache preprocessing results
  • Use efficient regex patterns
  • Minimize file I/O operations
  • Consider lazy loading for large datasets
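
One way to implement the first mitigation, assuming files do not change during a benchmark run (safe_json_load is Pattern 1 above):

from functools import lru_cache

@lru_cache(maxsize=None)
def load_users_cached(file_path):
    # Preprocess and parse each file once; repeat calls return the cached object,
    # so callers should treat the result as read-only
    return safe_json_load(file_path)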

Benchmarking: The enhanced benchmark runs roughly 10-15% slower due to:

  • JSON preprocessing overhead
  • Multi-document YAML parsing
  • Unicode normalization
  • Type coercion operations

Real-World Context and Educational Value

Why These Issues Exist

  • Legacy System Integration: Different systems export data in incompatible formats
  • Multi-Developer Codebases: Inconsistent coding standards accumulate over time
  • Copy-Paste Programming: Code fragments copied without understanding context
  • Evolving Requirements: Systems modified without updating related components
  • Platform Differences: Windows/Unix path handling, encoding variations
  • Internationalization: Unicode complexity, character encoding issues

Educational Outcomes

  • Defensive Programming: Students learn to validate and sanitize inputs
  • Error Handling: Proper exception handling becomes critical for success
  • Type Safety: Dynamic typing pitfalls become apparent
  • Configuration Management: Complex config handling strategies required
  • Unicode Awareness: International character set considerations
  • Security Consciousness: Unsafe YAML loading demonstrates security implications

Professional Relevance

These enhancements mirror real-world scenarios QA engineers encounter:

  • Data migration projects with inconsistent source formats
  • Legacy code maintenance with accumulated technical debt
  • Integration testing across different systems and platforms
  • International deployment with Unicode complexity
  • Security auditing of configuration and data loading practices

Conclusion

The AIBugBench difficulty enhancement introduces realistic complexity that mirrors production QA scenarios while maintaining clear solution paths for thorough implementations. Each hazard serves educational purposes and tests practical skills essential for professional software quality assurance.

Success requires:

  • Careful input validation and sanitization
  • Robust error handling and recovery
  • Type safety awareness and normalization
  • Unicode and encoding best practices
  • Configuration complexity management
  • Security-conscious programming practices

These skills directly transfer to real-world QA engineering challenges, making the enhanced benchmark a more valuable assessment tool for AI model capabilities in practical software quality scenarios.