Best Practices and Performance Tips

Schema Design Best Practices

1. Start Simple, Iterate

Do:

// Start with basic structure
{
  "type": "object",
  "properties": {
    "id": {"type": "string"},
    "name": {"type": "string"}
  }
}

// Then add constraints
{
  "type": "object",
  "properties": {
    "id": {
      "type": "string",
      "pattern": "^[A-Z]{3}-[0-9]{6}$"
    },
    "name": {
      "type": "string",
      "minLength": 2,
      "maxLength": 50
    }
  },
  "required": ["id", "name"]
}

// Finally, add statistical models (see Statistical Model Best Practices below)

Don't:

  • Try to model everything at once
  • Over-complicate initial schemas
  • Add models before validating structure

2. Use Meaningful Property Names

Good Naming:

{
  "customerId": "string",
  "orderDate": "date",
  "totalAmount": "number",
  "isActive": "boolean"
}

Poor Naming:

{
  "cid": "string",
  "dt": "date", 
  "amt": "number",
  "flag": "boolean"
}

3. Organize Nested Structures

Well-Organized:

{
  "customer": {
    "personal": {
      "firstName": "string",
      "lastName": "string"
    },
    "contact": {
      "email": "string",
      "phone": "string"
    }
  }
}

Flat Structure (when appropriate):

{
  "customerFirstName": "string",
  "customerLastName": "string",
  "customerEmail": "string",
  "customerPhone": "string"
}

4. Constraint Guidelines

Strings

{
  "type": "string",
  "minLength": 1,        // Prevent empty strings
  "maxLength": 255,      // Set reasonable limits
  "pattern": "^[A-Z]",   // Use anchored patterns
  "format": "email"      // Use standard formats
}

Numbers

{
  "type": "number",
  "minimum": 0,          // Set logical bounds
  "maximum": 1000000,    // Prevent overflow
  "multipleOf": 0.01     // For currency/precision
}

Arrays

{
  "type": "array",
  "minItems": 0,         // Allow empty if valid
  "maxItems": 100,       // Prevent memory issues
  "uniqueItems": true    // When appropriate
}

Statistical Model Best Practices

1. Sample Data Requirements

Minimum Samples for Accuracy (see the check after this list):

  • String patterns: 20+ examples
  • Histograms: 50+ values
  • Vocabularies: 10+ unique values
  • Format detection: 5+ examples
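
If model training is automated, a pre-flight check against these minimums catches thin samples early. A minimal sketch in Python (the thresholds mirror the list above; the helper name is illustrative):

# Minimum sample counts from the guidelines above
MIN_SAMPLES = {
    "pattern": 20,     # string patterns
    "histogram": 50,   # histogram values
    "vocabulary": 10,  # unique vocabulary values
    "format": 5,       # format detection
}

def check_sample_size(kind, samples):
    # For vocabularies, count unique values; otherwise use the raw sample count
    count = len(set(samples)) if kind == "vocabulary" else len(samples)
    if count < MIN_SAMPLES[kind]:
        raise ValueError(f"{kind}: {count} samples, need {MIN_SAMPLES[kind]}+")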

2. String Model Configuration

Entropy Guidelines:

Low (0-2):    Use for limited vocabularies (status, category)
Medium (2-4): Use for structured data (names, addresses)
High (4+):    Use for unique identifiers (IDs, passwords)
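
One way to pick a band is to estimate Shannon entropy from sample values. A minimal sketch, assuming entropy is measured in bits over whole values (DataFlood's internal measure may be defined differently):

import math
from collections import Counter

def value_entropy(samples):
    # Shannon entropy, in bits, of the distribution of whole values
    counts = Counter(samples)
    total = len(samples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

statuses = ["active", "active", "pending", "inactive", "active", "pending"]
print(value_entropy(statuses))  # ~1.46 bits -> Low band: limited vocabulary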

Pattern Design:

Good Patterns:
- "Llll Llll" for names
- "ddd-ddd-dddd" for phone numbers
- "PROD-dddddd" for product IDs

Avoid:
- Overly specific patterns
- Patterns without variation
- Conflicting patterns

3. Histogram Design

Effective Bins:

{
  "histogram": {
    "bins": [
      // Natural groupings
      {"rangeStart": 0, "rangeEnd": 100, "frequency": 0.7},
      {"rangeStart": 100, "rangeEnd": 500, "frequency": 0.25},
      {"rangeStart": 500, "rangeEnd": 1000, "frequency": 0.05}
    ]
  }
}

Avoid:

  • Too many bins (>20)
  • Overlapping ranges
  • Gaps in coverage
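
A quick structural check catches these problems before generation. A minimal sketch, assuming bins use the rangeStart/rangeEnd/frequency shape shown above and that frequencies are proportions that should sum to 1:

def validate_bins(bins, tolerance=1e-9):
    # Bins must be contiguous (no gaps or overlaps) when sorted by start
    ordered = sorted(bins, key=lambda b: b["rangeStart"])
    for prev, curr in zip(ordered, ordered[1:]):
        if curr["rangeStart"] != prev["rangeEnd"]:
            raise ValueError(f"gap or overlap at {prev['rangeEnd']}")
    total = sum(b["frequency"] for b in ordered)
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"frequencies sum to {total}, expected 1.0")

validate_bins([
    {"rangeStart": 0, "rangeEnd": 100, "frequency": 0.7},
    {"rangeStart": 100, "rangeEnd": 500, "frequency": 0.25},
    {"rangeStart": 500, "rangeEnd": 1000, "frequency": 0.05},
])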

Performance Optimization

1. Generation Performance

Batch Sizes

Optimal Batch Sizes:

## Development/Testing: 10-100 documents
DataFlood generate schema.json --count 100

## Integration Testing: 1,000-5,000 documents
DataFlood generate schema.json --count 5000

## Performance Testing: 10,000-50,000 documents
DataFlood generate schema.json --count 50000

## Large Scale: 100,000+ documents (use the CLI, not the GUI)
DataFlood generate schema.json --count 100000

Memory Management

For Large Generations:

## Use separate files to reduce memory
DataFlood generate schema.json --count 10000 --separate

## Use CSV for better memory efficiency
DataFlood generate schema.json --count 100000 --format csv

## Stream to file (API)
curl -X POST ".../generate/Model/download" --output data.jsonl
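
When streaming from the API, consume the response incrementally instead of buffering it. A minimal Python sketch; the base URL is deployment-specific (the curl example elides it), so an assumed FLOODGATE_BASE_URL environment variable stands in for it here:

import os
import requests

# Assumed env var for the elided base URL; falls back to a local default
base_url = os.environ.get("FLOODGATE_BASE_URL", "http://localhost:5000")
url = f"{base_url}/generate/Model/download"

# Stream the response to disk in chunks rather than holding it in memory
with requests.post(url, stream=True) as response:
    response.raise_for_status()
    with open("data.jsonl", "wb") as out:
        for chunk in response.iter_content(chunk_size=65536):
            out.write(chunk)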

2. Schema Complexity

Reduce Nesting Depth

// Avoid deep nesting (>5 levels)
// Bad: a.b.c.d.e.f.g.value

// Better: flatten where logical
{
  "customerName": "string",
  "customerAddressStreet": "string",
  "customerAddressCity": "string"
}

Optimize Property Count

  • Keep under 50 properties per object
  • Split large schemas into components
  • Use references for repeated structures
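
A small audit script can flag schemas that drift past these limits. A minimal sketch (thresholds taken from the guidance above; it walks only the properties keyword, not arrays or references):

def audit_schema(schema, depth=1, max_depth=5, max_props=50):
    # Recursively check nesting depth and per-object property count
    if depth > max_depth:
        raise ValueError(f"nesting depth {depth} exceeds {max_depth}")
    properties = schema.get("properties", {})
    if len(properties) > max_props:
        raise ValueError(f"{len(properties)} properties exceeds {max_props}")
    for child in properties.values():
        if child.get("type") == "object":
            audit_schema(child, depth + 1, max_depth, max_props)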

3. API Performance

Connection Pooling

## Python example
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
adapter = HTTPAdapter(
    pool_connections=10,                          # connection pools to cache
    pool_maxsize=10,                              # max connections per pool
    max_retries=Retry(total=3, backoff_factor=0.5)
)
session.mount('http://', adapter)
session.mount('https://', adapter)

Parallel Requests

import concurrent.futures

import requests

# Endpoint from the load-testing example below; adjust to your deployment
url = 'http://localhost:5000/api/documentgenerator/generate-simple'

def generate_batch(batch_id):
    # A distinct seed per batch keeps batches reproducible but non-identical
    return requests.post(url, json={
        "count": 1000,
        "seed": batch_id
    })

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(generate_batch, i) for i in range(10)]
    results = [f.result() for f in futures]

Rate Limiting

## FloodGate configuration
limits:
  maxDocumentsPerRequest: 10000
  maxConcurrentRequests: 100
  requestsPerMinute: 1000
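
On the client side, treat these limits as a contract: back off when a request is rejected rather than retrying immediately. A sketch that assumes the server answers over-limit requests with HTTP 429 and an optional Retry-After header (not confirmed for FloodGate):

import time
import requests

def post_with_backoff(url, payload, attempts=5):
    # Retry on 429, honoring Retry-After when present, else exponential backoff
    for attempt in range(attempts):
        response = requests.post(url, json=payload)
        if response.status_code != 429:
            return response
        wait = float(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError("rate limit retries exhausted")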

4. Sequence Optimization

Interval Configuration

{
  "intervalMs": 1000,  // Not too frequent
  "documentsPerInterval": 10,  // Reasonable count
  "addJitter": true    // Realistic timing
}

Step Scheduling

{
  "steps": [
    {
      // Stagger start times
      "startOffset": 0,
      "generationProbability": 0.5
    },
    {
      "startOffset": 5000,
      "generationProbability": 0.8
    }
  ]
}

Data Quality Best Practices

1. Validation Strategy

Multi-Level Validation (levels 1 and 2 are sketched after this list):

  1. Schema structure validation
  2. Constraint satisfaction checking
  3. Statistical distribution verification
  4. Business logic validation
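
Levels 1 and 2 can be automated with a standard validator. A minimal sketch using the third-party jsonschema package (statistical and business-logic checks remain application-specific):

import json

from jsonschema import Draft7Validator

with open("schema.json") as f:
    schema = json.load(f)

# Level 1: the schema itself must be well-formed
Draft7Validator.check_schema(schema)

# Level 2: every generated document must satisfy the schema's constraints
# (assumes the generated file holds a JSON array of documents)
validator = Draft7Validator(schema)
with open("test-data.json") as f:
    for document in json.load(f):
        for error in validator.iter_errors(document):
            print(f"violation at {list(error.path)}: {error.message}")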

2. Testing Approach

Progressive Testing:

## 1. Validate schema
DataFlood generate schema.json --count 1

## 2. Small sample
DataFlood generate schema.json --count 10 --seed 42

## 3. Statistical verification
DataFlood generate schema.json --count 1000 --seed 42

## 4. Full scale test
DataFlood generate schema.json --count 10000

3. Quality Metrics

Track These Metrics (see the sketch after this list):

  • Uniqueness rate for IDs
  • Distribution accuracy
  • Format compliance
  • Constraint violations
  • Generation time
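
Most of these reduce to a few lines over a generated batch. A minimal sketch for uniqueness rate and format compliance (the field name and pattern are borrowed from the earlier schema example; adapt to your schema):

import json
import re

# Assumes the generated file holds a JSON array of documents
with open("test-data.json") as f:
    documents = json.load(f)

ids = [doc["id"] for doc in documents]
uniqueness_rate = len(set(ids)) / len(ids)

id_format = re.compile(r"^[A-Z]{3}-[0-9]{6}$")  # pattern from the schema example
format_compliance = sum(bool(id_format.match(i)) for i in ids) / len(ids)

print(f"uniqueness: {uniqueness_rate:.2%}, format compliance: {format_compliance:.2%}")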

Security Best Practices

1. Sensitive Data

Never Include:

  • Real passwords
  • Actual credit card numbers
  • Real SSNs or personal IDs
  • Production API keys
  • Private encryption keys

Use Instead (a Luhn check is sketched after this list):

  • Format-compliant fake data
  • Test credit card numbers
  • Generated IDs with valid format
  • Dummy API keys
  • Test encryption keys
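
For card numbers specifically, publicly documented test numbers pass the Luhn checksum that most format validators apply, so they exercise validation paths without being real. A minimal Luhn check for verifying generated test values:

def luhn_valid(number):
    # Standard Luhn checksum: double every second digit from the right
    digits = [int(d) for d in str(number)]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

print(luhn_valid("4242424242424242"))  # True: a widely published test number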

2. API Security

Secure Configuration:

api:
  enableApiKey: true
  rateLimiting: true
  maxRequestSize: 10485760  # 10MB
  allowedOrigins:
    - "https://trusted-domain.com"

Client Security:

// Store API keys securely
const apiKey = process.env.FLOODGATE_API_KEY;

// Use HTTPS in production
const baseUrl = 'https://api.example.com';

3. File Security

Safe File Handling:

## Set appropriate permissions
chmod 600 sensitive-schema.json

## Use environment variables for paths
export SCHEMA_DIR=/secure/location/schemas

Workflow Best Practices

1. Development Workflow

1. Import sample data
2. Review generated schema
3. Add constraints
4. Configure statistical models
5. Test with small batch
6. Refine based on results
7. Export for production

2. Version Control

Git Best Practices:

## Schema files
/schemas/
  ├── v1/
  │   └── customer-v1.json
  ├── v2/
  │   └── customer-v2.json
  └── current/
      └── customer.json -> ../v2/customer-v2.json

## Commit messages
git commit -m "feat: Add email format validation to customer schema"
git commit -m "fix: Correct age histogram distribution"
git commit -m "perf: Optimize string model for faster generation"

3. Documentation

Document These Items:

### Customer Schema

**Version:** 2.1.0
**Last Updated:** 2024-01-15

#### Purpose
Generates realistic customer profiles for testing

#### Constraints
- Age: 18-120 (weighted toward 25-65)
- Email: Valid format, unique
- ID: Format CUST-XXXXXXXX

#### Statistical Models
- Name: Based on US census data
- Address: Major US cities weighted by population

#### Usage
DataFlood generate customer.json --count 1000

Integration Best Practices

1. CI/CD Integration

Automated Testing:

## GitHub Actions example
- name: Validate Schemas
  run: |
    for schema in schemas/*.json; do
      DataFlood generate "$schema" --count 1
    done

- name: Generate Test Data
  run: |
    DataFlood generate schemas/main.json --count 1000 --output test-data.json

2. Database Integration

Bulk Loading:

-- PostgreSQL
COPY customers FROM '/data/generated.csv' CSV HEADER;

-- MySQL
LOAD DATA INFILE '/data/generated.csv'
INTO TABLE customers
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;

3. API Testing

Load Testing:

// K6 script
import http from 'k6/http';

// Minimal inline schema for the test; substitute your real schema object
const schemaObject = {
  type: 'object',
  properties: {
    id: { type: 'string' },
    name: { type: 'string' }
  }
};

export let options = {
  vus: 10,
  duration: '30s'
};

export default function () {
  const payload = {
    schema: schemaObject,
    count: 10
  };

  http.post('http://localhost:5000/api/documentgenerator/generate-simple',
    JSON.stringify(payload),
    { headers: { 'Content-Type': 'application/json' } }
  );
}

Monitoring and Maintenance

1. Performance Monitoring

Key Metrics to Track (a timing sketch follows this list):

  • Generation time per 1000 documents
  • Memory usage during generation
  • API response times
  • Error rates
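
Generation time is easy to capture around the CLI. A minimal sketch, assuming the DataFlood CLI is on PATH:

import subprocess
import time

# Time a fixed-size batch so runs are comparable across schema changes
start = time.perf_counter()
subprocess.run(
    ["DataFlood", "generate", "schema.json", "--count", "1000"],
    check=True,
)
elapsed = time.perf_counter() - start
print(f"{elapsed:.2f}s per 1000 documents")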

2. Schema Maintenance

Regular Reviews:

  • Monthly: Validate against requirements
  • Quarterly: Update statistical models
  • Yearly: Major version review

3. Troubleshooting Checklist

When Issues Occur:

  1. Check schema validity
  2. Verify constraints are satisfiable
  3. Test with reduced complexity
  4. Review error logs
  5. Test with different seeds
  6. Validate sample data

Common Pitfalls to Avoid

1. Schema Design

  • Avoid: Circular references
  • Avoid: Impossible constraints
  • Avoid: Missing required fields
  • Avoid: Incompatible types

2. Performance

  • Avoid: Generating millions in GUI
  • Avoid: Deeply nested structures (>10 levels)
  • Avoid: Too many unique constraints
  • Avoid: Excessive regex complexity

3. Statistical Models

  • Avoid: Insufficient sample data
  • Avoid: Conflicting patterns
  • Avoid: Mismatched entropy settings
  • Avoid: Overlapping histogram bins

4. Integration

  • Avoid: Hardcoded paths
  • Avoid: Missing error handling
  • Avoid: No retry logic
  • Avoid: Ignoring rate limits

Summary Checklist

Before Production

  • Schema validated and tested
  • Statistical models verified
  • Performance tested at scale
  • Security review completed
  • Documentation updated
  • Version control in place
  • Monitoring configured
  • Backup strategy defined

Optimization Priority

  1. First: Get it working correctly
  2. Second: Make it maintainable
  3. Third: Optimize for performance
  4. Fourth: Scale as needed

Remember

  • Start simple, iterate based on needs
  • Test with realistic volumes
  • Document decisions and constraints
  • Monitor and maintain regularly
  • Share knowledge with team

Additional Resources