DataFlood Core Concepts

Overview

DataFlood models use statistical analysis to generate realistic synthetic data. Rather than relying on simple random generation, a DataFlood model "learns" patterns, distributions, and relationships from your existing data so it can generate new data with similar characteristics. DataFlood uses "just enough" ML to get the job done for real-time synthetic data generation.

Key Concepts

1. DataFlood Model

A DataFlood model is a schema structure enhanced with statistical models for data generation. It contains:

  • Standard schema properties (type, format, constraints)
  • Statistical models for strings (character patterns, vocabularies)
  • Histograms for numeric distributions
  • Pattern detection for formatted data

Example DataFlood model structure:

{
  "type": "object",
  "properties": {
    "email": {
      "type": "string",
      "format": "email",
      "stringModel": {
        "entropyScore": 3.2,
        "patterns": ["llll.llll@llll.lll"],
        "valueFrequency": {
          "john.doe@example.com": 5,
          "jane.smith@test.org": 3
        }
      }
    },
    "age": {
      "type": "integer",
      "minimum": 18,
      "maximum": 65,
      "histogram": {
        "bins": [
          {"rangeStart": 18, "rangeEnd": 30, "frequency": 0.3},
          {"rangeStart": 30, "rangeEnd": 50, "frequency": 0.5},
          {"rangeStart": 50, "rangeEnd": 65, "frequency": 0.2}
        ]
      }
    }
  }
}

2. String Models

String models capture patterns and characteristics of text data:

Components:

  • Value Frequency: Most common values and their occurrence counts
  • Character Probability: Distribution of characters in the data
  • N-grams: Common character sequences (bigrams, trigrams)
  • Patterns: Structural templates (e.g., "Llll Llll" for names)
  • Entropy Score: Measure of randomness/predictability

Pattern Notation:

  • L - Uppercase letter
  • l - Lowercase letter
  • d - Digit
  • s - Space
  • Other characters (e.g., ., @, -) - Literals, copied as-is

Example patterns:

  • Email: llll.llll@llll.lll
  • Phone: (ddd) ddd-dddd
  • Product ID: PROD-dddddd
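
Because the notation maps each symbol to a single character class, pattern-based generation reduces to one pass over the template. The sketch below illustrates the idea in Python; generate_from_pattern is a hypothetical name, not DataFlood's API:

import random
import string

# Walk the pattern and substitute each placeholder symbol; anything that
# is not a placeholder (e.g., '.', '@', '-', '(') is copied as a literal.
def generate_from_pattern(pattern: str) -> str:
    out = []
    for ch in pattern:
        if ch == 'L':
            out.append(random.choice(string.ascii_uppercase))
        elif ch == 'l':
            out.append(random.choice(string.ascii_lowercase))
        elif ch == 'd':
            out.append(random.choice(string.digits))
        elif ch == 's':
            out.append(' ')
        else:
            out.append(ch)
    return ''.join(out)

print(generate_from_pattern("(ddd) ddd-dddd"))  # e.g. "(407) 381-2956"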

3. Histograms

Histograms model numeric value distributions:

{
  "histogram": {
    "bins": [
      {
        "rangeStart": 0,
        "rangeEnd": 100,
        "freqStart": 0,
        "freqEnd": 40,
        "frequency": 0.4
      },
      {
        "rangeStart": 100,
        "rangeEnd": 500,
        "freqStart": 40,
        "freqEnd": 85,
        "frequency": 0.45
      },
      {
        "rangeStart": 500,
        "rangeEnd": 1000,
        "freqStart": 85,
        "freqEnd": 100,
        "frequency": 0.15
      }
    ],
    "totalSamples": 1000
  }
}

How Histograms Work:

  1. Values are grouped into bins based on ranges
  2. Each bin tracks its frequency (percentage of values)
  3. Generation samples from bins based on their frequency
  4. Values within bins are interpolated for variety
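
Steps 3 and 4 boil down to weighted bin selection followed by interpolation inside the chosen bin. Here is a minimal sketch using the bins from the example above (sample_from_histogram is a hypothetical helper, not DataFlood's API):

import random

bins = [
    {"rangeStart": 0,   "rangeEnd": 100,  "frequency": 0.40},
    {"rangeStart": 100, "rangeEnd": 500,  "frequency": 0.45},
    {"rangeStart": 500, "rangeEnd": 1000, "frequency": 0.15},
]

def sample_from_histogram(bins, rng=random):
    # Step 3: choose a bin with probability proportional to its frequency.
    chosen = rng.choices(bins, weights=[b["frequency"] for b in bins])[0]
    # Step 4: interpolate uniformly within the bin for variety.
    return rng.uniform(chosen["rangeStart"], chosen["rangeEnd"])

print(round(sample_from_histogram(bins), 2))  # e.g. 312.47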

4. Format Detection

DataFlood automatically detects common data formats:

Format      Detection Pattern         Example
Email       Contains @ and domain     user@example.com
URL         Starts with http/https    https://example.com
UUID        8-4-4-4-12 hex pattern    123e4567-e89b-12d3-a456-426614174000
Date        ISO 8601 format           2024-01-15
DateTime    ISO 8601 with time        2024-01-15T10:30:00Z
IPv4        Four octets               192.168.1.1
IPv6        Colon-separated hex       2001:0db8:85a3::8a2e:0370:7334
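
Detection of this kind can be approximated with a handful of regular expressions. The patterns below are simplified stand-ins for illustration, not DataFlood's exact rules:

import re

FORMAT_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "uuid":  re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$", re.I),
    "date":  re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "ipv4":  re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),  # loose: allows 999
}

def detect_format(value: str):
    # Return the first format whose pattern matches, or None.
    for name, pattern in FORMAT_PATTERNS.items():
        if pattern.match(value):
            return name
    return None

print(detect_format("user@example.com"))  # email
print(detect_format("2024-01-15"))        # date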

5. Entropy and Generation Methods

Entropy controls how DataFlood generates strings:

Low Entropy (0.0 - 2.0)

  • Method: Sample from existing vocabulary
  • Use Case: Predictable values, limited variation
  • Example: Product categories, status codes

Medium Entropy (2.0 - 4.0)

  • Method: Pattern-based generation
  • Use Case: Structured data with variation
  • Example: Names, addresses, product names

High Entropy (> 4.0)

  • Method: Character probability distribution
  • Use Case: High randomness, unique values
  • Example: Random identifiers, passwords
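
Taken together, the tiers amount to a dispatch on the entropy score. The sketch below follows those thresholds; the three generator bodies are deliberately simplified, and the characterProbability field name is an assumption based on the component list in section 2:

import random
import string

def low_entropy(model, rng):
    # Sample an existing value, weighted by its observed frequency.
    values, weights = zip(*model["valueFrequency"].items())
    return rng.choices(values, weights=weights)[0]

def medium_entropy(model, rng):
    # Fill a structural pattern; only 'l' is handled here for brevity.
    pattern = rng.choice(model["patterns"])
    return ''.join(rng.choice(string.ascii_lowercase) if c == 'l' else c
                   for c in pattern)

def high_entropy(model, rng, length=8):
    # Draw characters independently from the learned distribution.
    chars, weights = zip(*model["characterProbability"].items())
    return ''.join(rng.choices(chars, weights=weights, k=length))

def generate_string(model, rng=random):
    e = model["entropyScore"]
    if e <= 2.0:
        return low_entropy(model, rng)
    if e <= 4.0:
        return medium_entropy(model, rng)
    return high_entropy(model, rng)

model = {"entropyScore": 1.2, "valueFrequency": {"active": 7, "inactive": 3}}
print(generate_string(model))  # usually "active"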

6. Model Generation Process

When analyzing sample data, DataFlood:

  1. Structural Analysis

    • Identifies object properties and types
    • Detects arrays and nesting levels
    • Determines required vs optional fields
  2. Statistical Analysis

    • Calculates value frequencies
    • Builds character probability distributions
    • Creates n-gram models
    • Generates histograms for numbers
  3. Pattern Recognition

    • Detects common formats (email, URL, date)
    • Extracts structural patterns
    • Identifies value constraints
  4. Model Creation

    • Combines analyses into comprehensive DataFlood model
    • Adds statistical models to properties
    • Sets appropriate constraints
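
As a concrete illustration of step 2, the sketch below derives value frequencies, a character distribution, and an entropy score from a handful of samples. Reading entropyScore as Shannon entropy of the character distribution is an assumption; the exact formula DataFlood uses is not specified here:

import math
from collections import Counter

samples = ["john.doe@example.com", "jane.smith@test.org",
           "john.doe@example.com"]

# Value frequencies: most common values and their counts.
value_frequency = dict(Counter(samples))

# Character probability distribution across all samples.
char_counts = Counter(ch for s in samples for ch in s)
total = sum(char_counts.values())
char_probability = {c: n / total for c, n in char_counts.items()}

# Shannon entropy of the character distribution, in bits.
entropy_score = -sum(p * math.log2(p) for p in char_probability.values())
print(value_frequency, round(entropy_score, 2))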

7. Document Generation Process

When generating documents, DataFlood:

  1. Model Loading

    • Parses the DataFlood model
    • Initializes random number generator (with seed if provided)
  2. Property Generation

    • For each property, selects generation method based on:
      • Property type (string, number, boolean, etc.)
      • Available models (stringModel, histogram)
      • Entropy settings
  3. Value Creation

    • Strings: Uses appropriate method based on entropy
    • Numbers: Samples from histogram or uniform distribution
    • Booleans: Random true/false
    • Objects: Recursively generates nested properties
    • Arrays: Generates multiple items based on constraints
  4. Validation

    • Ensures generated values meet DataFlood model constraints
    • Applies format requirements
    • Maintains referential integrity
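
At a high level, the generation loop is a seeded, type-driven dispatch over the model's properties. A compact sketch follows (the function names are hypothetical and arrays/formats are omitted; the model layout matches the example in section 1):

import random

def generate_document(model, seed=None):
    rng = random.Random(seed)  # step 1: seeded RNG for reproducibility
    return {name: generate_value(prop, rng)
            for name, prop in model["properties"].items()}

def generate_value(prop, rng):
    t = prop["type"]
    if t == "integer":  # step 3: sample within declared constraints
        return rng.randint(prop["minimum"], prop["maximum"])
    if t == "boolean":
        return rng.random() < 0.5
    if t == "object":   # recurse into nested properties
        return {k: generate_value(v, rng)
                for k, v in prop["properties"].items()}
    if t == "string":   # would defer to the entropy logic in section 5
        vocab = prop.get("stringModel", {}).get("valueFrequency", {})
        return rng.choice(list(vocab)) if vocab else "example"
    return None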

Advanced Concepts

Parent-Child Relationships (Tides)

In Tides generation, child documents can inherit fields from parents:

{
  "transactions": [
    {
      "parentStepId": "customers",
      "childSteps": [
        {
          "stepId": "orders",
          "linkingStrategy": "InjectParentId",
          "parentIdField": "customerId"
        }
      ]
    }
  ]
}
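
With this configuration, every generated order carries its parent's customerId. Conceptually, InjectParentId behaves like the sketch below (field values are made up for illustration):

# A parent document from the "customers" step and a child from "orders".
parent = {"customerId": "CUST-001", "name": "Acme Corp"}
order = {"orderId": "ORD-042", "total": 99.95}

def inject_parent_id(parent, child, parent_id_field):
    child = dict(child)  # copy so the original child is not mutated
    child[parent_id_field] = parent[parent_id_field]
    return child

print(inject_parent_id(parent, order, "customerId"))
# {'orderId': 'ORD-042', 'total': 99.95, 'customerId': 'CUST-001'}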

Custom Properties

Add dynamic properties during generation:

{
  "customProperties": {
    "timestamp": "{{now}}",
    "environment": "production",
    "correlationId": "{{uuid}}"
  }
}
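
Literal values pass through unchanged, while tokens such as {{now}} and {{uuid}} are resolved at generation time. A minimal sketch of that substitution (the token set comes from the example above; the resolution code is illustrative, not DataFlood's):

import uuid
from datetime import datetime, timezone

TOKENS = {
    "{{now}}":  lambda: datetime.now(timezone.utc).isoformat(),
    "{{uuid}}": lambda: str(uuid.uuid4()),
}

def resolve_custom_properties(props):
    # Replace known tokens; leave literal values (e.g., "production") alone.
    return {k: TOKENS.get(v, lambda v=v: v)() for k, v in props.items()}

print(resolve_custom_properties({
    "timestamp": "{{now}}",
    "environment": "production",
    "correlationId": "{{uuid}}",
}))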

Weighted Generation

Control relative frequency of different models in Tides:

{
  "steps": [
    {
      "modelPath": "premium-customer.json",
      "weight": 1.0,
      "generationProbability": 0.2
    },
    {
      "modelPath": "standard-customer.json",
      "weight": 3.0,
      "generationProbability": 0.8
    }
  ]
}
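
With these weights, the standard model is selected roughly three times as often as the premium one. The sketch below shows weight-proportional selection only; how weight composes with generationProbability is not specified here, so it is left out:

import random

steps = [
    {"modelPath": "premium-customer.json",  "weight": 1.0},
    {"modelPath": "standard-customer.json", "weight": 3.0},
]

def pick_step(steps, rng=random):
    # Select a step with probability proportional to its weight.
    return rng.choices(steps, weights=[s["weight"] for s in steps])[0]

counts = {s["modelPath"]: 0 for s in steps}
for _ in range(1000):
    counts[pick_step(steps)["modelPath"]] += 1
print(counts)  # roughly 1:3 premium to standard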

Best Practices

1. Sample Data Quality

  • Provide diverse, representative samples
  • Include edge cases and variations
  • Use at least 100 samples for statistical significance

2. Model Refinement

  • Review generated DataFlood models for accuracy
  • Adjust constraints if too restrictive
  • Add custom patterns for domain-specific data

3. Entropy Tuning

  • Start with default (no override)
  • Decrease for more predictable data
  • Increase for unique identifiers

4. Performance Optimization

  • Use appropriate batch sizes
  • Consider memory limits for large generations
  • Use streaming for very large datasets

Next Steps