DataFlood Core Concepts
Overview
DataFlood models use statistical analysis to generate realistic synthetic data. Instead of relying on simple random generation, a DataFlood model "learns" patterns, distributions, and relationships from your existing data so it can generate new data that preserves those characteristics. DataFlood uses "just enough" ML to make real-time synthetic data generation practical.
Key Concepts
1. DataFlood Model
A DataFlood model is an enhanced structure that includes statistical models for data generation. It contains:
- Standard schema properties (type, format, constraints)
- Statistical models for strings (character patterns, vocabularies)
- Histograms for numeric distributions
- Pattern detection for formatted data
Example DataFlood model structure:
{
  "type": "object",
  "properties": {
    "email": {
      "type": "string",
      "format": "email",
      "stringModel": {
        "entropyScore": 3.2,
        "patterns": ["llll.llll@llll.lll"],
        "valueFrequency": {
          "john.doe@example.com": 5,
          "jane.smith@test.org": 3
        }
      }
    },
    "age": {
      "type": "integer",
      "minimum": 18,
      "maximum": 65,
      "histogram": {
        "bins": [
          {"rangeStart": 18, "rangeEnd": 30, "frequency": 0.3},
          {"rangeStart": 30, "rangeEnd": 50, "frequency": 0.5},
          {"rangeStart": 50, "rangeEnd": 65, "frequency": 0.2}
        ]
      }
    }
  }
}
2. String Models
String models capture patterns and characteristics of text data:
Components:
- Value Frequency: Most common values and their occurrence counts
- Character Probability: Distribution of characters in the data
- N-grams: Common character sequences (bigrams, trigrams)
- Patterns: Structural templates (e.g., "Llll Llll" for names)
- Entropy Score: Measure of randomness/predictability
Pattern Notation:
- L: Uppercase letter
- l: Lowercase letter
- d: Digit
- Space: Kept as a literal space
- . @ -: Literal characters
Example patterns:
- Email: llll.llll@llll.lll
- Phone: (ddd) ddd-dddd
- Product ID: PROD-dddddd
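A pattern can be derived mechanically from a sample value. The sketch below (a `to_pattern` helper; the name is illustrative, not DataFlood's API) maps every character to its class. Note that a real analyzer may keep stable substrings literal, as in the PROD-dddddd example above, whereas this sketch classifies every letter.

```python
def to_pattern(value: str) -> str:
    """Map a string to DataFlood-style pattern notation:
    uppercase -> 'L', lowercase -> 'l', digit -> 'd',
    anything else (space, '.', '@', '-', ...) kept literally."""
    classes = []
    for ch in value:
        if ch.isupper():
            classes.append("L")
        elif ch.islower():
            classes.append("l")
        elif ch.isdigit():
            classes.append("d")
        else:
            classes.append(ch)
    return "".join(classes)
```

For example, to_pattern("(123) 456-7890") yields "(ddd) ddd-dddd".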
3. Histograms
Histograms model numeric value distributions:
{
  "histogram": {
    "bins": [
      {
        "rangeStart": 0,
        "rangeEnd": 100,
        "freqStart": 0,
        "freqEnd": 40,
        "frequency": 0.4
      },
      {
        "rangeStart": 100,
        "rangeEnd": 500,
        "freqStart": 40,
        "freqEnd": 85,
        "frequency": 0.45
      },
      {
        "rangeStart": 500,
        "rangeEnd": 1000,
        "freqStart": 85,
        "freqEnd": 100,
        "frequency": 0.15
      }
    ],
    "totalSamples": 1000
  }
}
How Histograms Work:
- Values are grouped into bins based on ranges
- Each bin tracks its frequency (percentage of values)
- Generation samples from bins based on their frequency
- Values within bins are interpolated for variety
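These steps can be sketched in Python (a simplified illustration assuming the bin layout shown above; DataFlood's actual sampler may differ):

```python
import random

def sample_histogram(bins, rng=random):
    """Pick a bin weighted by its frequency, then interpolate
    a value uniformly within that bin's range."""
    r = rng.random()
    cumulative = 0.0
    for b in bins:
        cumulative += b["frequency"]
        if r <= cumulative:
            return rng.uniform(b["rangeStart"], b["rangeEnd"])
    # Guard against floating-point rounding: fall back to the last bin.
    last = bins[-1]
    return rng.uniform(last["rangeStart"], last["rangeEnd"])
```

With the example histogram above, roughly 40% of samples land in 0-100, 45% in 100-500, and 15% in 500-1000.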
4. Format Detection
DataFlood automatically detects common data formats:
Format | Detection Pattern | Example |
---|---|---|
Email | Contains @ and domain | user@example.com |
URL | Starts with http/https | https://example.com |
UUID | 8-4-4-4-12 hex pattern | 123e4567-e89b-12d3-a456-426614174000 |
Date | ISO 8601 format | 2024-01-15 |
DateTime | ISO 8601 with time | 2024-01-15T10:30:00Z |
IPv4 | Four octets | 192.168.1.1 |
IPv6 | Colon-separated hex | 2001:0db8:85a3::8a2e:0370:7334 |
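Detection like this typically amounts to a series of anchored regular-expression checks. A minimal sketch covering a few of the formats above (the rules are illustrative approximations, looser than production validators):

```python
import re

# Illustrative, deliberately loose versions of the detection rules.
# Order matters: more specific formats are checked first.
FORMAT_PATTERNS = {
    "UUID": re.compile(r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
                       r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"),
    "DateTime": re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"
                           r"(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?$"),
    "Date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "IPv4": re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$"),
    "URL": re.compile(r"^https?://\S+$"),
    "Email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

def detect_format(value: str):
    """Return the first matching format name, or None."""
    for name, pattern in FORMAT_PATTERNS.items():
        if pattern.match(value):
            return name
    return None
```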
5. Entropy and Generation Methods
Entropy controls how DataFlood generates strings:
Low Entropy (0.0 - 2.0)
- Method: Sample from existing vocabulary
- Use Case: Predictable values, limited variation
- Example: Product categories, status codes
Medium Entropy (2.0 - 4.0)
- Method: Pattern-based generation
- Use Case: Structured data with variation
- Example: Names, addresses, product names
High Entropy (> 4.0)
- Method: Character probability distribution
- Use Case: High randomness, unique values
- Example: Random identifiers, passwords
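Both the score and the method selection can be illustrated in a few lines (a sketch: entropy_score computes Shannon entropy over the pooled character distribution, and the thresholds mirror the bands above):

```python
import math
from collections import Counter

def entropy_score(samples):
    """Shannon entropy (in bits) of the pooled character distribution
    across all sample strings: low for repetitive vocabularies,
    high for near-random text."""
    counts = Counter(ch for s in samples for ch in s)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def choose_method(score):
    """Pick a generation method using the entropy bands above."""
    if score <= 2.0:
        return "vocabulary"        # sample recorded values
    if score <= 4.0:
        return "pattern"           # fill structural templates
    return "char-probability"      # draw from the character distribution
```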
6. Model Generation Process
When analyzing sample data, DataFlood:
Structural Analysis
- Identifies object properties and types
- Detects arrays and nesting levels
- Determines required vs optional fields
Statistical Analysis
- Calculates value frequencies
- Builds character probability distributions
- Creates n-gram models
- Generates histograms for numbers
Pattern Recognition
- Detects common formats (email, URL, date)
- Extracts structural patterns
- Identifies value constraints
Model Creation
- Combines analyses into comprehensive DataFlood model
- Adds statistical models to properties
- Sets appropriate constraints
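The numeric side of the statistical analysis, building a histogram from sample values, can be sketched as follows (illustrative only; the bin_count default and equal-width strategy are assumptions, not DataFlood's exact algorithm):

```python
def build_histogram(values, bin_count=3):
    """Build an equal-width histogram in the shape of the examples above."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_count or 1  # avoid zero-width bins
    bins = []
    for i in range(bin_count):
        start, end = lo + i * width, lo + (i + 1) * width
        if i == bin_count - 1:
            # Close the final bin on the right so the maximum is counted.
            n = sum(1 for v in values if start <= v <= hi)
        else:
            n = sum(1 for v in values if start <= v < end)
        bins.append({"rangeStart": start, "rangeEnd": end,
                     "frequency": n / len(values)})
    return {"bins": bins, "totalSamples": len(values)}
```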
7. Document Generation Process
When generating documents, DataFlood:
Model Loading
- Parses the DataFlood model
- Initializes random number generator (with seed if provided)
Property Generation
- For each property, selects generation method based on:
- Property type (string, number, boolean, etc.)
- Available models (stringModel, histogram)
- Entropy settings
Value Creation
- Strings: Uses appropriate method based on entropy
- Numbers: Samples from histogram or uniform distribution
- Booleans: Random true/false
- Objects: Recursively generates nested properties
- Arrays: Generates multiple items based on constraints
Validation
- Ensures generated values meet DataFlood model constraints
- Applies format requirements
- Maintains referential integrity
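The walk above can be sketched as a small recursive generator (greatly simplified: it handles only the low-entropy vocabulary path for strings and uniform sampling for integers, and is illustrative rather than DataFlood's implementation):

```python
import random

def generate(model, rng):
    """Recursive sketch of the generation walk over a DataFlood model."""
    t = model.get("type")
    if t == "object":
        return {name: generate(prop, rng)
                for name, prop in model.get("properties", {}).items()}
    if t == "string":
        vocab = model.get("stringModel", {}).get("valueFrequency")
        if vocab:  # low-entropy path: sample the recorded vocabulary
            values, weights = zip(*vocab.items())
            return rng.choices(values, weights=weights)[0]
        return ""
    if t == "integer":
        return rng.randint(model.get("minimum", 0), model.get("maximum", 100))
    if t == "boolean":
        return rng.random() < 0.5
    return None
```

Passing a seeded random.Random makes generation reproducible, matching the seed behavior described in Model Loading.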
Advanced Concepts
Parent-Child Relationships (Tides)
In Tides generation, child documents can inherit fields from parents:
{
  "transactions": [
    {
      "parentStepId": "customers",
      "childSteps": [
        {
          "stepId": "orders",
          "linkingStrategy": "InjectParentId",
          "parentIdField": "customerId"
        }
      ]
    }
  ]
}
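The InjectParentId strategy amounts to copying the parent's identifier into each child under the configured parentIdField. A sketch (assuming the parent record exposes an "id" field; that field name is an assumption):

```python
def inject_parent_id(parent, child, parent_id_field):
    """Copy the parent's id into the child under parent_id_field,
    without mutating the original child record."""
    linked = dict(child)
    linked[parent_id_field] = parent["id"]
    return linked
```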
Custom Properties
Add dynamic properties during generation:
{
  "customProperties": {
    "timestamp": "{{now}}",
    "environment": "production",
    "correlationId": "{{uuid}}"
  }
}
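Resolution of the two placeholders shown ({{now}} and {{uuid}}) might look like this (a sketch; the real set of supported placeholders may be larger):

```python
import uuid
from datetime import datetime, timezone

def resolve_placeholders(props):
    """Replace {{now}} with an ISO 8601 UTC timestamp and {{uuid}}
    with a fresh UUID; literal values pass through unchanged."""
    resolved = {}
    for key, value in props.items():
        if value == "{{now}}":
            resolved[key] = datetime.now(timezone.utc).isoformat()
        elif value == "{{uuid}}":
            resolved[key] = str(uuid.uuid4())
        else:
            resolved[key] = value
    return resolved
```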
Weighted Generation
Control relative frequency of different models in Tides:
{
  "steps": [
    {
      "modelPath": "premium-customer.json",
      "weight": 1.0,
      "generationProbability": 0.2
    },
    {
      "modelPath": "standard-customer.json",
      "weight": 3.0,
      "generationProbability": 0.8
    }
  ]
}
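One plausible reading of the two knobs, sketched in Python (an assumption about the semantics: weight drives the relative choice between steps, while generationProbability gates whether the chosen step emits anything this round):

```python
import random

def pick_step(steps, rng):
    """Choose a step weighted by 'weight', then apply its
    generationProbability as an emit/skip gate."""
    weights = [s["weight"] for s in steps]
    step = rng.choices(steps, weights=weights)[0]
    if rng.random() <= step.get("generationProbability", 1.0):
        return step
    return None  # step skipped this round
```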
Best Practices
1. Sample Data Quality
- Provide diverse, representative samples
- Include edge cases and variations
- Use at least 100 samples for statistical significance
2. Model Refinement
- Review generated DataFlood models for accuracy
- Adjust constraints if too restrictive
- Add custom patterns for domain-specific data
3. Entropy Tuning
- Start with default (no override)
- Decrease for more predictable data
- Increase for unique identifiers
4. Performance Optimization
- Use appropriate batch sizes
- Consider memory limits for large generations
- Use streaming for very large datasets
Next Steps
- Explore the CLI Reference for all commands
- Learn about String Models in detail
- Understand Histogram Configuration
- See Examples for practical applications