Best Practices and Performance Tips

Schema Design Best Practices

1. Start Simple, Iterate

Do:

// Start with basic structure
{
  "type": "object",
  "properties": {
    "id": {"type": "string"},
    "name": {"type": "string"}
  }
}

// Then add constraints
{
  "type": "object",
  "properties": {
    "id": {
      "type": "string",
      "pattern": "^[A-Z]{3}-[0-9]{6}$"
    },
    "name": {
      "type": "string",
      "minLength": 2,
      "maxLength": 50
    }
  },
  "required": ["id", "name"]
}

// Finally, add statistical models (see Statistical Model Best Practices below)

Don't:

  • Try to model everything at once
  • Over-complicate initial schemas
  • Add models before validating structure

2. Use Meaningful Property Names

Good Naming:

{
  "customerId": "string",
  "orderDate": "date",
  "totalAmount": "number",
  "isActive": "boolean"
}

Poor Naming:

{
  "cid": "string",
  "dt": "date", 
  "amt": "number",
  "flag": "boolean"
}

3. Organize Nested Structures

Well-Organized:

{
  "customer": {
    "personal": {
      "firstName": "string",
      "lastName": "string"
    },
    "contact": {
      "email": "string",
      "phone": "string"
    }
  }
}

Flat Structure (when appropriate):

{
  "customerFirstName": "string",
  "customerLastName": "string",
  "customerEmail": "string",
  "customerPhone": "string"
}

4. Constraint Guidelines

Strings

{
  "type": "string",
  "minLength": 1,        // Prevent empty strings
  "maxLength": 255,      // Set reasonable limits
  "pattern": "^[A-Z]",   // Use anchored patterns
  "format": "email"      // Use standard formats
}

Numbers

{
  "type": "number",
  "minimum": 0,          // Set logical bounds
  "maximum": 1000000,    // Prevent overflow
  "multipleOf": 0.01     // For currency/precision
}

Arrays

{
  "type": "array",
  "minItems": 0,         // Allow empty if valid
  "maxItems": 100,       // Prevent memory issues
  "uniqueItems": true    // When appropriate
}

Statistical Model Best Practices

1. Sample Data Requirements

Minimum Samples for Accuracy (see the check after this list):

  • String patterns: 20+ examples
  • Histograms: 50+ values
  • Vocabularies: 10+ unique values
  • Format detection: 5+ examples
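
If model training is automated, a pre-flight check against these minimums catches thin samples early. A minimal sketch in Python (the thresholds mirror the list above; the helper name is illustrative):

# Minimum sample counts from the guidelines above
MIN_SAMPLES = {
    "pattern": 20,     # string patterns
    "histogram": 50,   # histogram values
    "vocabulary": 10,  # unique vocabulary values
    "format": 5,       # format detection
}

def check_sample_size(kind, samples):
    # For vocabularies, count unique values; otherwise use the raw sample count
    count = len(set(samples)) if kind == "vocabulary" else len(samples)
    if count < MIN_SAMPLES[kind]:
        raise ValueError(f"{kind}: {count} samples, need {MIN_SAMPLES[kind]}+")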

2. String Model Configuration

Entropy Guidelines:

Low (0-2):    Use for limited vocabularies (status, category)
Medium (2-4): Use for structured data (names, addresses)
High (4+):    Use for unique identifiers (IDs, passwords)
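
One way to pick a band is to estimate Shannon entropy from sample values. A minimal sketch, assuming entropy is measured in bits over whole values (DataFlood's internal measure may be defined differently):

import math
from collections import Counter

def value_entropy(samples):
    # Shannon entropy, in bits, of the distribution of whole values
    counts = Counter(samples)
    total = len(samples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

statuses = ["active", "active", "pending", "inactive", "active", "pending"]
print(value_entropy(statuses))  # ~1.46 bits -> Low band: limited vocabulary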

Pattern Design:

Good Patterns:
- "Llll Llll" for names
- "ddd-ddd-dddd" for phone numbers
- "PROD-dddddd" for product IDs

Avoid:
- Overly specific patterns
- Patterns without variation
- Conflicting patterns

3. Histogram Design

Effective Bins:

{
  "histogram": {
    "bins": [
      // Natural groupings
      {"rangeStart": 0, "rangeEnd": 100, "frequency": 0.7},
      {"rangeStart": 100, "rangeEnd": 500, "frequency": 0.25},
      {"rangeStart": 500, "rangeEnd": 1000, "frequency": 0.05}
    ]
  }
}

Avoid:

  • Too many bins (>20)
  • Overlapping ranges
  • Gaps in coverage
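
A quick structural check catches these problems before generation. A minimal sketch, assuming bins use the rangeStart/rangeEnd/frequency shape shown above and that frequencies are proportions that should sum to 1:

def validate_bins(bins, tolerance=1e-9):
    # Bins must be contiguous (no gaps or overlaps) when sorted by start
    ordered = sorted(bins, key=lambda b: b["rangeStart"])
    for prev, curr in zip(ordered, ordered[1:]):
        if curr["rangeStart"] != prev["rangeEnd"]:
            raise ValueError(f"gap or overlap at {prev['rangeEnd']}")
    total = sum(b["frequency"] for b in ordered)
    if abs(total - 1.0) > tolerance:
        raise ValueError(f"frequencies sum to {total}, expected 1.0")

validate_bins([
    {"rangeStart": 0, "rangeEnd": 100, "frequency": 0.7},
    {"rangeStart": 100, "rangeEnd": 500, "frequency": 0.25},
    {"rangeStart": 500, "rangeEnd": 1000, "frequency": 0.05},
])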

Performance Optimization

1. Generation Performance

Batch Sizes

Optimal Batch Sizes:

## Development/Testing: 10-100 documents
DataFlood generate schema.json --count 100

## Integration Testing: 1,000-5,000 documents
DataFlood generate schema.json --count 5000

## Performance Testing: 10,000-50,000 documents
DataFlood generate schema.json --count 50000

## Large Scale: 100,000+ documents (use the CLI, not the GUI)
DataFlood generate schema.json --count 100000

Memory Management

For Large Generations:

## Use separate files to reduce memory
DataFlood generate schema.json --count 10000 --separate

## Use CSV for better memory efficiency
DataFlood generate schema.json --count 100000 --format csv

## Stream to file (API)
curl -X POST ".../generate/Model/download" --output data.jsonl
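
When streaming from the API, consume the response incrementally instead of buffering it. A minimal Python sketch; the base URL is deployment-specific (the curl example elides it), so an assumed FLOODGATE_BASE_URL environment variable stands in for it here:

import os
import requests

# Assumed env var for the elided base URL; falls back to a local default
base_url = os.environ.get("FLOODGATE_BASE_URL", "http://localhost:5000")
url = f"{base_url}/generate/Model/download"

# Stream the response to disk in chunks rather than holding it in memory
with requests.post(url, stream=True) as response:
    response.raise_for_status()
    with open("data.jsonl", "wb") as out:
        for chunk in response.iter_content(chunk_size=65536):
            out.write(chunk)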

2. Schema Complexity

Reduce Nesting Depth

// Avoid deep nesting (>5 levels)
// Bad: a.b.c.d.e.f.g.value

// Better: flatten where logical
{
  "customerName": "string",
  "customerAddressStreet": "string",
  "customerAddressCity": "string"
}

Optimize Property Count

  • Keep under 50 properties per object
  • Split large schemas into components
  • Use references for repeated structures
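
A small audit script can flag schemas that drift past these limits. A minimal sketch (thresholds taken from the guidance above; it walks only the properties keyword, not arrays or references):

def audit_schema(schema, depth=1, max_depth=5, max_props=50):
    # Recursively check nesting depth and per-object property count
    if depth > max_depth:
        raise ValueError(f"nesting depth {depth} exceeds {max_depth}")
    properties = schema.get("properties", {})
    if len(properties) > max_props:
        raise ValueError(f"{len(properties)} properties exceeds {max_props}")
    for child in properties.values():
        if child.get("type") == "object":
            audit_schema(child, depth + 1, max_depth, max_props)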

3. API Performance

Connection Pooling

## Python example
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
adapter = HTTPAdapter(
    pool_connections=10,                          # connection pools to cache
    pool_maxsize=10,                              # max connections per pool
    max_retries=Retry(total=3, backoff_factor=0.5)
)
session.mount('http://', adapter)
session.mount('https://', adapter)

Parallel Requests

import concurrent.futures

import requests

# Endpoint from the load-testing example below; adjust to your deployment
url = 'http://localhost:5000/api/documentgenerator/generate-simple'

def generate_batch(batch_id):
    # A distinct seed per batch keeps batches reproducible but non-identical
    return requests.post(url, json={
        "count": 1000,
        "seed": batch_id
    })

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(generate_batch, i) for i in range(10)]
    results = [f.result() for f in futures]

Rate Limiting

## FloodGate configuration
limits:
  maxDocumentsPerRequest: 10000
  maxConcurrentRequests: 100
  requestsPerMinute: 1000
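
On the client side, treat these limits as a contract: back off when a request is rejected rather than retrying immediately. A sketch that assumes the server answers over-limit requests with HTTP 429 and an optional Retry-After header (not confirmed for FloodGate):

import time
import requests

def post_with_backoff(url, payload, attempts=5):
    # Retry on 429, honoring Retry-After when present, else exponential backoff
    for attempt in range(attempts):
        response = requests.post(url, json=payload)
        if response.status_code != 429:
            return response
        wait = float(response.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError("rate limit retries exhausted")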

4. Sequence Optimization

Interval Configuration

{
  "intervalMs": 1000,  // Not too frequent
  "documentsPerInterval": 10,  // Reasonable count
  "addJitter": true    // Realistic timing
}

Step Scheduling

{
  "steps": [
    {
      // Stagger start times
      "startOffset": 0,
      "generationProbability": 0.5
    },
    {
      "startOffset": 5000,
      "generationProbability": 0.8
    }
  ]
}

Data Quality Best Practices

1. Validation Strategy

Multi-Level Validation (levels 1 and 2 are sketched after this list):

  1. Schema structure validation
  2. Constraint satisfaction checking
  3. Statistical distribution verification
  4. Business logic validation
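
Levels 1 and 2 can be automated with a standard validator. A minimal sketch using the third-party jsonschema package (statistical and business-logic checks remain application-specific):

import json

from jsonschema import Draft7Validator

with open("schema.json") as f:
    schema = json.load(f)

# Level 1: the schema itself must be well-formed
Draft7Validator.check_schema(schema)

# Level 2: every generated document must satisfy the schema's constraints
# (assumes the generated file holds a JSON array of documents)
validator = Draft7Validator(schema)
with open("test-data.json") as f:
    for document in json.load(f):
        for error in validator.iter_errors(document):
            print(f"violation at {list(error.path)}: {error.message}")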

2. Testing Approach

Progressive Testing:

## 1. Validate schema
DataFlood generate schema.json --count 1

## 2. Small sample
DataFlood generate schema.json --count 10 --seed 42

## 3. Statistical verification
DataFlood generate schema.json --count 1000 --seed 42

## 4. Full scale test
DataFlood generate schema.json --count 10000

3. Quality Metrics

Track These Metrics (see the sketch after this list):

  • Uniqueness rate for IDs
  • Distribution accuracy
  • Format compliance
  • Constraint violations
  • Generation time
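
Most of these reduce to a few lines over a generated batch. A minimal sketch for uniqueness rate and format compliance (the field name and pattern are borrowed from the earlier schema example; adapt to your schema):

import json
import re

# Assumes the generated file holds a JSON array of documents
with open("test-data.json") as f:
    documents = json.load(f)

ids = [doc["id"] for doc in documents]
uniqueness_rate = len(set(ids)) / len(ids)

id_format = re.compile(r"^[A-Z]{3}-[0-9]{6}$")  # pattern from the schema example
format_compliance = sum(bool(id_format.match(i)) for i in ids) / len(ids)

print(f"uniqueness: {uniqueness_rate:.2%}, format compliance: {format_compliance:.2%}")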

Security Best Practices

1. Sensitive Data

Never Include:

  • Real passwords
  • Actual credit card numbers
  • Real SSNs or personal IDs
  • Production API keys
  • Private encryption keys

Use Instead (a Luhn check is sketched after this list):

  • Format-compliant fake data
  • Test credit card numbers
  • Generated IDs with valid format
  • Dummy API keys
  • Test encryption keys
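
For card numbers specifically, publicly documented test numbers pass the Luhn checksum that most format validators apply, so they exercise validation paths without being real. A minimal Luhn check for verifying generated test values:

def luhn_valid(number):
    # Standard Luhn checksum: double every second digit from the right
    digits = [int(d) for d in str(number)]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

print(luhn_valid("4242424242424242"))  # True: a widely published test number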

2. API Security

Secure Configuration:

api:
  enableApiKey: true
  rateLimiting: true
  maxRequestSize: 10485760  # 10MB
  allowedOrigins:
    - "https://trusted-domain.com"

Client Security:

// Store API keys securely
const apiKey = process.env.FLOODGATE_API_KEY;

// Use HTTPS in production
const baseUrl = 'https://api.example.com';

3. File Security

Safe File Handling:

## Set appropriate permissions
chmod 600 sensitive-schema.json

## Use environment variables for paths
export SCHEMA_DIR=/secure/location/schemas

Workflow Best Practices

1. Development Workflow

1. Import sample data
2. Review generated schema
3. Add constraints
4. Configure statistical models
5. Test with small batch
6. Refine based on results
7. Export for production

2. Version Control

Git Best Practices:

## Schema files
/schemas/
  ├── v1/
  │   └── customer-v1.json
  ├── v2/
  │   └── customer-v2.json
  └── current/
      └── customer.json -> ../v2/customer-v2.json

## Commit messages
git commit -m "feat: Add email format validation to customer schema"
git commit -m "fix: Correct age histogram distribution"
git commit -m "perf: Optimize string model for faster generation"

3. Documentation

Document These Items:

### Customer Schema

**Version:** 2.1.0
**Last Updated:** 2024-01-15

#### Purpose
Generates realistic customer profiles for testing

#### Constraints
- Age: 18-120 (weighted toward 25-65)
- Email: Valid format, unique
- ID: Format CUST-XXXXXXXX

#### Statistical Models
- Name: Based on US census data
- Address: Major US cities weighted by population

#### Usage
DataFlood generate customer.json --count 1000

Integration Best Practices

1. CI/CD Integration

Automated Testing:

## GitHub Actions example
- name: Validate Schemas
  run: |
    for schema in schemas/*.json; do
      DataFlood generate "$schema" --count 1
    done

- name: Generate Test Data
  run: |
    DataFlood generate schemas/main.json --count 1000 --output test-data.json

2. Database Integration

Bulk Loading:

-- PostgreSQL
COPY customers FROM '/data/generated.csv' CSV HEADER;

-- MySQL
LOAD DATA INFILE '/data/generated.csv'
INTO TABLE customers
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;

3. API Testing

Load Testing:

// K6 script
import http from 'k6/http';

// Minimal inline schema for the test; substitute your real schema object
const schemaObject = {
  type: 'object',
  properties: {
    id: { type: 'string' },
    name: { type: 'string' }
  }
};

export let options = {
  vus: 10,
  duration: '30s'
};

export default function () {
  const payload = {
    schema: schemaObject,
    count: 10
  };

  http.post('http://localhost:5000/api/documentgenerator/generate-simple',
    JSON.stringify(payload),
    { headers: { 'Content-Type': 'application/json' } }
  );
}

Monitoring and Maintenance

1. Performance Monitoring

Key Metrics to Track (a timing sketch follows this list):

  • Generation time per 1000 documents
  • Memory usage during generation
  • API response times
  • Error rates
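
Generation time is easy to capture around the CLI. A minimal sketch, assuming the DataFlood CLI is on PATH:

import subprocess
import time

# Time a fixed-size batch so runs are comparable across schema changes
start = time.perf_counter()
subprocess.run(
    ["DataFlood", "generate", "schema.json", "--count", "1000"],
    check=True,
)
elapsed = time.perf_counter() - start
print(f"{elapsed:.2f}s per 1000 documents")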

2. Schema Maintenance

Regular Reviews:

  • Monthly: Validate against requirements
  • Quarterly: Update statistical models
  • Yearly: Major version review

3. Troubleshooting Checklist

When Issues Occur:

  1. Check schema validity
  2. Verify constraints are satisfiable
  3. Test with reduced complexity
  4. Review error logs
  5. Test with different seeds
  6. Validate sample data

Common Pitfalls to Avoid

1. Schema Design

  • Avoid: Circular references
  • Avoid: Impossible constraints
  • Avoid: Missing required fields
  • Avoid: Incompatible types

2. Performance

  • Avoid: Generating millions in GUI
  • Avoid: Deeply nested structures (>10 levels)
  • Avoid: Too many unique constraints
  • Avoid: Excessive regex complexity

3. Statistical Models

  • Avoid: Insufficient sample data
  • Avoid: Conflicting patterns
  • Avoid: Mismatched entropy settings
  • Avoid: Overlapping histogram bins

4. Integration

  • Avoid: Hardcoded paths
  • Avoid: Missing error handling
  • Avoid: No retry logic
  • Avoid: Ignoring rate limits

Summary Checklist

Before Production

  • Schema validated and tested
  • Statistical models verified
  • Performance tested at scale
  • Security review completed
  • Documentation updated
  • Version control in place
  • Monitoring configured
  • Backup strategy defined

Optimization Priority

  1. First: Get it working correctly
  2. Second: Make it maintainable
  3. Third: Optimize for performance
  4. Fourth: Scale as needed

Remember

  • Start simple, iterate based on needs
  • Test with realistic volumes
  • Document decisions and constraints
  • Monitor and maintain regularly
  • Share knowledge with team

Additional Resources