DataFlood Core Concepts

Overview

DataFlood models use statistical analysis to generate realistic synthetic data. Rather than relying on simple random generation, a DataFlood model "learns" patterns, distributions, and relationships from your existing data so it can generate new data with similar characteristics. DataFlood uses "just enough" ML to get the job done for real-time synthetic data generation.

Key Concepts

1. DataFlood Model

A DataFlood model is a schema structure enhanced with statistical models for data generation. It contains:

  • Standard schema properties (type, format, constraints)
  • Statistical models for strings (character patterns, vocabularies)
  • Histograms for numeric distributions
  • Pattern detection for formatted data

Example DataFlood model structure:

{
  "type": "object",
  "properties": {
    "email": {
      "type": "string",
      "format": "email",
      "stringModel": {
        "entropyScore": 3.2,
        "patterns": ["llll.llll@llll.lll"],
        "valueFrequency": {
          "john.doe@example.com": 5,
          "jane.smith@test.org": 3
        }
      }
    },
    "age": {
      "type": "integer",
      "minimum": 18,
      "maximum": 65,
      "histogram": {
        "bins": [
          {"rangeStart": 18, "rangeEnd": 30, "frequency": 0.3},
          {"rangeStart": 30, "rangeEnd": 50, "frequency": 0.5},
          {"rangeStart": 50, "rangeEnd": 65, "frequency": 0.2}
        ]
      }
    }
  }
}

2. String Models

String models capture patterns and characteristics of text data:

Components:

  • Value Frequency: Most common values and their occurrence counts
  • Character Probability: Distribution of characters in the data
  • N-grams: Common character sequences (bigrams, trigrams)
  • Patterns: Structural templates (e.g., "Llll Llll" for names)
  • Entropy Score: Measure of randomness/predictability

Pattern Notation:

  • L - Uppercase letter
  • l - Lowercase letter
  • d - Digit
  • s - Space
  • Other characters (e.g., ., @, -) - Literals, copied as-is

Example patterns:

  • Email: llll.llll@llll.lll
  • Phone: (ddd) ddd-dddd
  • Product ID: PROD-dddddd
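
Because the notation maps each symbol to a single character class, pattern-based generation reduces to one pass over the template. The sketch below illustrates the idea in Python; generate_from_pattern is a hypothetical name, not DataFlood's API:

import random
import string

# Walk the pattern and substitute each placeholder symbol; anything that
# is not a placeholder (e.g., '.', '@', '-', '(') is copied as a literal.
def generate_from_pattern(pattern: str) -> str:
    out = []
    for ch in pattern:
        if ch == 'L':
            out.append(random.choice(string.ascii_uppercase))
        elif ch == 'l':
            out.append(random.choice(string.ascii_lowercase))
        elif ch == 'd':
            out.append(random.choice(string.digits))
        elif ch == 's':
            out.append(' ')
        else:
            out.append(ch)
    return ''.join(out)

print(generate_from_pattern("(ddd) ddd-dddd"))  # e.g. "(407) 381-2956"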

3. Histograms

Histograms model numeric value distributions:

{
  "histogram": {
    "bins": [
      {
        "rangeStart": 0,
        "rangeEnd": 100,
        "freqStart": 0,
        "freqEnd": 40,
        "frequency": 0.4
      },
      {
        "rangeStart": 100,
        "rangeEnd": 500,
        "freqStart": 40,
        "freqEnd": 85,
        "frequency": 0.45
      },
      {
        "rangeStart": 500,
        "rangeEnd": 1000,
        "freqStart": 85,
        "freqEnd": 100,
        "frequency": 0.15
      }
    ],
    "totalSamples": 1000
  }
}

How Histograms Work:

  1. Values are grouped into bins based on ranges
  2. Each bin tracks its frequency (percentage of values)
  3. Generation samples from bins based on their frequency
  4. Values within bins are interpolated for variety
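
Steps 3 and 4 boil down to weighted bin selection followed by interpolation inside the chosen bin. Here is a minimal sketch using the bins from the example above (sample_from_histogram is a hypothetical helper, not DataFlood's API):

import random

bins = [
    {"rangeStart": 0,   "rangeEnd": 100,  "frequency": 0.40},
    {"rangeStart": 100, "rangeEnd": 500,  "frequency": 0.45},
    {"rangeStart": 500, "rangeEnd": 1000, "frequency": 0.15},
]

def sample_from_histogram(bins, rng=random):
    # Step 3: choose a bin with probability proportional to its frequency.
    chosen = rng.choices(bins, weights=[b["frequency"] for b in bins])[0]
    # Step 4: interpolate uniformly within the bin for variety.
    return rng.uniform(chosen["rangeStart"], chosen["rangeEnd"])

print(round(sample_from_histogram(bins), 2))  # e.g. 312.47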

4. Format Detection

DataFlood automatically detects common data formats:

Format      Detection Pattern         Example
Email       Contains @ and domain     user@example.com
URL         Starts with http/https    https://example.com
UUID        8-4-4-4-12 hex pattern    123e4567-e89b-12d3-a456-426614174000
Date        ISO 8601 format           2024-01-15
DateTime    ISO 8601 with time        2024-01-15T10:30:00Z
IPv4        Four octets               192.168.1.1
IPv6        Colon-separated hex       2001:0db8:85a3::8a2e:0370:7334
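
Detection of this kind can be approximated with a handful of regular expressions. The patterns below are simplified stand-ins for illustration, not DataFlood's exact rules:

import re

FORMAT_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "uuid":  re.compile(r"^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$", re.I),
    "date":  re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "ipv4":  re.compile(r"^(\d{1,3}\.){3}\d{1,3}$"),  # loose: allows 999
}

def detect_format(value: str):
    # Return the first format whose pattern matches, or None.
    for name, pattern in FORMAT_PATTERNS.items():
        if pattern.match(value):
            return name
    return None

print(detect_format("user@example.com"))  # email
print(detect_format("2024-01-15"))        # date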

5. Entropy and Generation Methods

Entropy controls how DataFlood generates strings:

Low Entropy (0.0 - 2.0)

  • Method: Sample from existing vocabulary
  • Use Case: Predictable values, limited variation
  • Example: Product categories, status codes

Medium Entropy (2.0 - 4.0)

  • Method: Pattern-based generation
  • Use Case: Structured data with variation
  • Example: Names, addresses, product names

High Entropy (> 4.0)

  • Method: Character probability distribution
  • Use Case: High randomness, unique values
  • Example: Random identifiers, passwords
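
Taken together, the tiers amount to a dispatch on the entropy score. The sketch below follows those thresholds; the three generator bodies are deliberately simplified, and the characterProbability field name is an assumption based on the component list in section 2:

import random
import string

def low_entropy(model, rng):
    # Sample an existing value, weighted by its observed frequency.
    values, weights = zip(*model["valueFrequency"].items())
    return rng.choices(values, weights=weights)[0]

def medium_entropy(model, rng):
    # Fill a structural pattern; only 'l' is handled here for brevity.
    pattern = rng.choice(model["patterns"])
    return ''.join(rng.choice(string.ascii_lowercase) if c == 'l' else c
                   for c in pattern)

def high_entropy(model, rng, length=8):
    # Draw characters independently from the learned distribution.
    chars, weights = zip(*model["characterProbability"].items())
    return ''.join(rng.choices(chars, weights=weights, k=length))

def generate_string(model, rng=random):
    e = model["entropyScore"]
    if e <= 2.0:
        return low_entropy(model, rng)
    if e <= 4.0:
        return medium_entropy(model, rng)
    return high_entropy(model, rng)

model = {"entropyScore": 1.2, "valueFrequency": {"active": 7, "inactive": 3}}
print(generate_string(model))  # usually "active"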

6. Model Generation Process

When analyzing sample data, DataFlood:

  1. Structural Analysis

    • Identifies object properties and types
    • Detects arrays and nesting levels
    • Determines required vs optional fields
  2. Statistical Analysis

    • Calculates value frequencies
    • Builds character probability distributions
    • Creates n-gram models
    • Generates histograms for numbers
  3. Pattern Recognition

    • Detects common formats (email, URL, date)
    • Extracts structural patterns
    • Identifies value constraints
  4. Model Creation

    • Combines analyses into comprehensive DataFlood model
    • Adds statistical models to properties
    • Sets appropriate constraints
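
As a concrete illustration of step 2, the sketch below derives value frequencies, a character distribution, and an entropy score from a handful of samples. Reading entropyScore as Shannon entropy of the character distribution is an assumption; the exact formula DataFlood uses is not specified here:

import math
from collections import Counter

samples = ["john.doe@example.com", "jane.smith@test.org",
           "john.doe@example.com"]

# Value frequencies: most common values and their counts.
value_frequency = dict(Counter(samples))

# Character probability distribution across all samples.
char_counts = Counter(ch for s in samples for ch in s)
total = sum(char_counts.values())
char_probability = {c: n / total for c, n in char_counts.items()}

# Shannon entropy of the character distribution, in bits.
entropy_score = -sum(p * math.log2(p) for p in char_probability.values())
print(value_frequency, round(entropy_score, 2))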

7. Document Generation Process

When generating documents, DataFlood:

  1. Model Loading

    • Parses the DataFlood model
    • Initializes random number generator (with seed if provided)
  2. Property Generation

    • For each property, selects generation method based on:
      • Property type (string, number, boolean, etc.)
      • Available models (stringModel, histogram)
      • Entropy settings
  3. Value Creation

    • Strings: Uses appropriate method based on entropy
    • Numbers: Samples from histogram or uniform distribution
    • Booleans: Random true/false
    • Objects: Recursively generates nested properties
    • Arrays: Generates multiple items based on constraints
  4. Validation

    • Ensures generated values meet DataFlood model constraints
    • Applies format requirements
    • Maintains referential integrity
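
At a high level, the generation loop is a seeded, type-driven dispatch over the model's properties. A compact sketch follows (the function names are hypothetical and arrays/formats are omitted; the model layout matches the example in section 1):

import random

def generate_document(model, seed=None):
    rng = random.Random(seed)  # step 1: seeded RNG for reproducibility
    return {name: generate_value(prop, rng)
            for name, prop in model["properties"].items()}

def generate_value(prop, rng):
    t = prop["type"]
    if t == "integer":  # step 3: sample within declared constraints
        return rng.randint(prop["minimum"], prop["maximum"])
    if t == "boolean":
        return rng.random() < 0.5
    if t == "object":   # recurse into nested properties
        return {k: generate_value(v, rng)
                for k, v in prop["properties"].items()}
    if t == "string":   # would defer to the entropy logic in section 5
        vocab = prop.get("stringModel", {}).get("valueFrequency", {})
        return rng.choice(list(vocab)) if vocab else "example"
    return None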

Advanced Concepts

Parent-Child Relationships (Tides)

In Tides generation, child documents can inherit fields from parents:

{
  "transactions": [
    {
      "parentStepId": "customers",
      "childSteps": [
        {
          "stepId": "orders",
          "linkingStrategy": "InjectParentId",
          "parentIdField": "customerId"
        }
      ]
    }
  ]
}
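
With this configuration, every generated order carries its parent's customerId. Conceptually, InjectParentId behaves like the sketch below (field values are made up for illustration):

# A parent document from the "customers" step and a child from "orders".
parent = {"customerId": "CUST-001", "name": "Acme Corp"}
order = {"orderId": "ORD-042", "total": 99.95}

def inject_parent_id(parent, child, parent_id_field):
    child = dict(child)  # copy so the original child is not mutated
    child[parent_id_field] = parent[parent_id_field]
    return child

print(inject_parent_id(parent, order, "customerId"))
# {'orderId': 'ORD-042', 'total': 99.95, 'customerId': 'CUST-001'}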

Custom Properties

Add dynamic properties during generation:

{
  "customProperties": {
    "timestamp": "{{now}}",
    "environment": "production",
    "correlationId": "{{uuid}}"
  }
}
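
Literal values pass through unchanged, while tokens such as {{now}} and {{uuid}} are resolved at generation time. A minimal sketch of that substitution (the token set comes from the example above; the resolution code is illustrative, not DataFlood's):

import uuid
from datetime import datetime, timezone

TOKENS = {
    "{{now}}":  lambda: datetime.now(timezone.utc).isoformat(),
    "{{uuid}}": lambda: str(uuid.uuid4()),
}

def resolve_custom_properties(props):
    # Replace known tokens; leave literal values (e.g., "production") alone.
    return {k: TOKENS.get(v, lambda v=v: v)() for k, v in props.items()}

print(resolve_custom_properties({
    "timestamp": "{{now}}",
    "environment": "production",
    "correlationId": "{{uuid}}",
}))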

Weighted Generation

Control relative frequency of different models in Tides:

{
  "steps": [
    {
      "modelPath": "premium-customer.json",
      "weight": 1.0,
      "generationProbability": 0.2
    },
    {
      "modelPath": "standard-customer.json",
      "weight": 3.0,
      "generationProbability": 0.8
    }
  ]
}
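
With these weights, the standard model is selected roughly three times as often as the premium one. The sketch below shows weight-proportional selection only; how weight composes with generationProbability is not specified here, so it is left out:

import random

steps = [
    {"modelPath": "premium-customer.json",  "weight": 1.0},
    {"modelPath": "standard-customer.json", "weight": 3.0},
]

def pick_step(steps, rng=random):
    # Select a step with probability proportional to its weight.
    return rng.choices(steps, weights=[s["weight"] for s in steps])[0]

counts = {s["modelPath"]: 0 for s in steps}
for _ in range(1000):
    counts[pick_step(steps)["modelPath"]] += 1
print(counts)  # roughly 1:3 premium to standard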

Best Practices

1. Sample Data Quality

  • Provide diverse, representative samples
  • Include edge cases and variations
  • Use at least 100 samples for statistical significance

2. Model Refinement

  • Review generated DataFlood models for accuracy
  • Adjust constraints if too restrictive
  • Add custom patterns for domain-specific data

3. Entropy Tuning

  • Start with default (no override)
  • Decrease for more predictable data
  • Increase for unique identifiers

4. Performance Optimization

  • Use appropriate batch sizes
  • Consider memory limits for large generations
  • Use streaming for very large datasets

Next Steps