DataFlood Model Format Specification

Overview

DataFlood models extend traditional JSON data schemas with statistical modeling for realistic data generation. This document provides a complete specification of the DataFlood model format.

Base Structure

A DataFlood model is stored as a JSON document with the following root structure:

{
  "$schema": "https://smallminds.co/DataFlood/schema#",
  "title": "Model Title",
  "description": "Model Description",
  "type": "object|array|string|number|integer|boolean|null",
  "properties": {},
  "required": [],
  
  // DataFlood Extensions
  "dataFloodVersion": "1.0.0",
  "metadata": {},
  "globalSettings": {}
}

Type Definitions

Object Type

{
  "type": "object",
  "properties": {
    "propertyName": {
      // Property schema
    }
  },
  "required": ["propertyName"],
  "additionalProperties": false,
  "minProperties": 0,
  "maxProperties": 100,
  "propertyNames": {
    "pattern": "^[a-zA-Z][a-zA-Z0-9_]*$"
  },
  "dependencies": {
    "property1": ["property2", "property3"]
  }
}

Array Type

{
  "type": "array",
  "items": {
    // Schema for array items
  },
  "minItems": 0,
  "maxItems": 100,
  "uniqueItems": false,
  "contains": {
    // Schema that at least one item must validate against
  },
  "additionalItems": false
}

String Type

{
  "type": "string",
  "minLength": 0,
  "maxLength": 255,
  "pattern": "^[A-Z][a-z]+$",
  "format": "email|uri|date|date-time|uuid|ipv4|ipv6",
  "enum": ["value1", "value2"],
  
  // DataFlood Extensions
  "stringModel": {
    // Statistical model for string generation
  }
}

Number/Integer Types

{
  "type": "number|integer",
  "minimum": 0,
  "maximum": 100,
  "exclusiveMinimum": 0,
  "exclusiveMaximum": 100,
  "multipleOf": 0.01,
  
  // DataFlood Extensions
  "histogram": {
    // Distribution model for number generation
  }
}

Boolean Type

{
  "type": "boolean",
  "default": true,
  
  // DataFlood Extension
  "probability": 0.7  // Probability of true value
}
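
A generator can honor the probability extension with a single weighted draw. A minimal sketch, assuming the schema has already been parsed into a Python dictionary:

import random

def generate_boolean(schema: dict) -> bool:
    # "probability" is the DataFlood extension: the chance of emitting true.
    # Falling back to a fair coin when the extension is absent is an assumption.
    return random.random() < schema.get("probability", 0.5)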

Null Type

{
  "type": "null"
}

DataFlood Modeling Elements

String Model

The stringModel property contains statistical information for string generation:

{
  "stringModel": {
    "entropyScore": 3.5,
    "patterns": [
      "Llll Llll",
      "Llll-Llll",
      "LLLL"
    ],
    "valueFrequency": {
      "common_value": 10,
      "another_value": 5,
      "rare_value": 1
    },
    "characterProbability": {
      "a": 0.08,
      "b": 0.02,
      "c": 0.03
    },
    "nGrams": {
      "bigrams": {
        "th": 0.05,
        "he": 0.04,
        "in": 0.03
      },
      "trigrams": {
        "the": 0.03,
        "ing": 0.02,
        "and": 0.02
      }
    },
    "lengthDistribution": {
      "5": 0.1,
      "10": 0.3,
      "15": 0.4,
      "20": 0.2
    }
  }
}
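
The exact generation algorithm is implementation-defined. A minimal sketch of one plausible strategy, which reuses observed values in proportion to valueFrequency and otherwise samples a length from lengthDistribution and characters from characterProbability (the 50/50 split between the two strategies is an assumption, not part of the spec):

import random

def generate_string(string_model: dict) -> str:
    # Reuse an observed value, weighted by how often it was seen in the source data.
    freq = string_model.get("valueFrequency", {})
    if freq and random.random() < 0.5:
        values, weights = zip(*freq.items())
        return random.choices(values, weights=weights, k=1)[0]

    # Otherwise sample a length (keys of lengthDistribution are string lengths),
    # then draw each character from characterProbability.
    lengths = string_model.get("lengthDistribution", {"8": 1.0})
    length = int(random.choices(list(lengths), weights=lengths.values(), k=1)[0])
    chars = string_model.get("characterProbability", {"a": 1.0})
    return "".join(random.choices(list(chars), weights=chars.values(), k=length))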

Pattern Notation

Symbol     Meaning              Example
L          Uppercase letter     A-Z
l          Lowercase letter     a-z
d          Digit                0-9
s          Space
w          Word character       a-zA-Z0-9_
x          Hexadecimal          0-9a-f
X          Uppercase hex        0-9A-F
.          Any character
Literal    Exact character      -, @, etc.
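
Patterns expand mechanically, one symbol at a time. A minimal sketch (treating any unrecognized symbol as a literal, and restricting "." to printable ASCII, are assumptions):

import random
import string

# Candidate character sets for each pattern symbol; anything else is a literal.
SYMBOL_SETS = {
    "L": string.ascii_uppercase,
    "l": string.ascii_lowercase,
    "d": string.digits,
    "s": " ",
    "w": string.ascii_letters + string.digits + "_",
    "x": string.digits + "abcdef",
    "X": string.digits + "ABCDEF",
    ".": string.printable.strip(),
}

def expand_pattern(pattern: str) -> str:
    return "".join(random.choice(SYMBOL_SETS.get(sym, sym)) for sym in pattern)

# expand_pattern("Llll-dddd") -> e.g. "Kump-4821"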

Histogram

The histogram property defines numeric value distributions:

{
  "histogram": {
    "bins": [
      {
        "rangeStart": 0,
        "rangeEnd": 100,
        "frequency": 0.6,
        "freqStart": 0,
        "freqEnd": 60
      },
      {
        "rangeStart": 100,
        "rangeEnd": 500,
        "frequency": 0.3,
        "freqStart": 60,
        "freqEnd": 90
      },
      {
        "rangeStart": 500,
        "rangeEnd": 1000,
        "frequency": 0.1,
        "freqStart": 90,
        "freqEnd": 100
      }
    ],
    "totalSamples": 1000,
    "mean": 250.5,
    "median": 150,
    "mode": 75,
    "standardDeviation": 125.3,
    "distribution": "normal|uniform|exponential|custom"
  }
}
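
One way a generator can draw from this structure is to pick a bin in proportion to its frequency (equivalently, via the cumulative freqStart/freqEnd positions) and then choose a value inside the bin's range. A sketch under those assumptions; uniform interpolation within the bin is also an assumption:

import random

def sample_from_histogram(histogram: dict) -> float:
    bins = histogram["bins"]
    weights = [b["frequency"] for b in bins]
    chosen = random.choices(bins, weights=weights, k=1)[0]
    # An implementation could instead bias toward the recorded mean or mode.
    return random.uniform(chosen["rangeStart"], chosen["rangeEnd"])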

Enum with Probabilities

For weighted enum selection:

{
  "type": "string",
  "enum": ["Bronze", "Silver", "Gold", "Platinum"],
  "enumProbabilities": [0.5, 0.3, 0.15, 0.05]
}
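
Selection is a straightforward weighted draw. The sketch below assumes enumProbabilities aligns positionally with enum and sums to 1.0; the uniform fallback when probabilities are omitted is an assumption:

import random

def pick_enum(schema: dict) -> str:
    values = schema["enum"]
    weights = schema.get("enumProbabilities", [1.0] * len(values))
    return random.choices(values, weights=weights, k=1)[0]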

Format Extensions

DataFlood recognizes additional formats:

{
  "format": "email|uri|date|date-time|time|duration|uuid|
            ipv4|ipv6|hostname|json-pointer|regex|
            credit-card|ssn|phone|postal-code"
}

Conditional Generation

Support for conditional schemas:

{
  "if": {
    "properties": {
      "country": {"const": "US"}
    }
  },
  "then": {
    "properties": {
      "postalCode": {
        "type": "string",
        "pattern": "^[0-9]{5}(-[0-9]{4})?$"
      }
    }
  },
  "else": {
    "properties": {
      "postalCode": {
        "type": "string",
        "pattern": "^[A-Z][0-9][A-Z] [0-9][A-Z][0-9]$"
      }
    }
  }
}
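
During generation, the if block can be evaluated against property values already produced for the instance, and the matching branch merged into the effective schema. A simplified sketch that only checks const conditions (full JSON Schema evaluation is out of scope for this sketch):

def resolve_conditional(schema: dict, generated: dict) -> dict:
    # Return the branch ("then" or "else") selected by the "if" condition.
    condition = schema.get("if", {}).get("properties", {})
    matches = all(
        generated.get(name) == rules.get("const")
        for name, rules in condition.items()
        if "const" in rules
    )
    return schema.get("then", {}) if matches else schema.get("else", {})

# resolve_conditional(schema, {"country": "US"}) -> the "then" branch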

Complex Schema Examples

Nested Object with References

{
  "type": "object",
  "properties": {
    "customer": {
      "$ref": "#/definitions/customer"
    },
    "orders": {
      "type": "array",
      "items": {
        "$ref": "#/definitions/order"
      }
    }
  },
  "definitions": {
    "customer": {
      "type": "object",
      "properties": {
        "id": {"type": "string"},
        "name": {"type": "string"}
      }
    },
    "order": {
      "type": "object",
      "properties": {
        "orderId": {"type": "string"},
        "amount": {"type": "number"}
      }
    }
  }
}

OneOf/AnyOf/AllOf

{
  "type": "object",
  "properties": {
    "payment": {
      "oneOf": [
        {
          "type": "object",
          "properties": {
            "type": {"const": "credit_card"},
            "cardNumber": {"type": "string"}
          }
        },
        {
          "type": "object",
          "properties": {
            "type": {"const": "bank_transfer"},
            "accountNumber": {"type": "string"}
          }
        }
      ]
    }
  }
}
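
For generation, a reasonable strategy (an assumption, not mandated by the spec) is to pick one subschema at random for oneOf or anyOf and to merge the subschemas for allOf. A minimal sketch:

import random

def choose_subschema(schema: dict) -> dict:
    # oneOf / anyOf: generate against a single randomly chosen alternative.
    for key in ("oneOf", "anyOf"):
        if key in schema:
            return random.choice(schema[key])
    # allOf: a shallow merge of all alternatives (deep merging is implementation-defined).
    if "allOf" in schema:
        merged: dict = {}
        for sub in schema["allOf"]:
            merged.update(sub)
        return merged
    return schema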

Recursive Structures

{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "children": {
      "type": "array",
      "items": {
        "$ref": "#"
      }
    }
  },
  "dataFlood": {
    "maxRecursionDepth": 3
  }
}
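
A generator has to cut self-referencing structures off explicitly. A sketch that honors maxRecursionDepth; the placeholder name generation and the empty children array at the depth limit are assumptions:

import random

def generate_node(schema: dict, depth: int = 0) -> dict:
    max_depth = schema.get("dataFlood", {}).get("maxRecursionDepth", 3)
    node = {"name": f"node-{depth}"}  # placeholder for real string generation
    if depth < max_depth:
        node["children"] = [
            generate_node(schema, depth + 1) for _ in range(random.randint(0, 2))
        ]
    else:
        node["children"] = []  # stop recursing once the configured depth is reached
    return node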

Global Settings

Configure generation behavior globally:

{
  "globalSettings": {
    "defaultEntropy": 2.5,
    "stringDefaults": {
      "minLength": 1,
      "maxLength": 100
    },
    "numberDefaults": {
      "minimum": 0,
      "maximum": 1000000
    },
    "arrayDefaults": {
      "minItems": 0,
      "maxItems": 100
    },
    "nullProbability": 0.1,
    "generationSeed": 12345
  }
}
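
Global settings are applied before per-property rules. The sketch below shows how a generator might seed its random source for reproducible output and decide whether an optional value becomes null; how per-property settings override these defaults is an assumption left to the implementation:

import random

def make_rng(global_settings: dict) -> random.Random:
    # A fixed generationSeed makes generated datasets reproducible across runs.
    return random.Random(global_settings.get("generationSeed"))

def maybe_null(rng: random.Random, global_settings: dict) -> bool:
    # nullProbability is the chance that any nullable value is emitted as null.
    return rng.random() < global_settings.get("nullProbability", 0.0)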

Metadata

Include documentation and tracking information:

{
  "metadata": {
    "version": "2.1.0",
    "author": "Data Team",
    "created": "2024-01-15",
    "modified": "2024-01-20",
    "tags": ["customer", "production"],
    "dataSource": "production_sample_2024",
    "sampleSize": 10000,
    "confidenceLevel": 0.95,
    "notes": "Updated based on Q1 2024 data"
  }
}

Validation Rules

Schema Validation

DataFlood schemas must:

  1. Be valid JSON
  2. Conform to DataFlood model structure
  3. Have consistent type definitions
  4. Include required DataFlood extensions for generation

Constraint Validation

Constraints must be logically consistent:

  • minimum ≤ maximum
  • minLength ≤ maxLength
  • minItems ≤ maxItems
  • minProperties ≤ maxProperties
  • Enum values must match declared type
  • Patterns must be valid regular expressions

Statistical Model Validation

Statistical models must (a validation sketch follows this list):

  • Have valid probability distributions (sum to 1.0)
  • Include non-empty patterns for string models
  • Have non-overlapping histogram bins
  • Contain valid entropy scores (0.0+)
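
A minimal sketch of these checks, assuming the model has been parsed into a dictionary and using a small tolerance for floating-point sums:

def validate_statistical_models(schema: dict, tolerance: float = 1e-6) -> list[str]:
    # Collect violations of the rules above; an empty list means the model passes.
    errors = []

    model = schema.get("stringModel")
    if model is not None:
        if not model.get("patterns"):
            errors.append("stringModel.patterns must be non-empty")
        if model.get("entropyScore", 0.0) < 0.0:
            errors.append("entropyScore must be >= 0.0")
        lengths = model.get("lengthDistribution", {})
        if lengths and abs(sum(lengths.values()) - 1.0) > tolerance:
            errors.append("lengthDistribution probabilities must sum to 1.0")

    histogram = schema.get("histogram")
    if histogram is not None:
        bins = sorted(histogram.get("bins", []), key=lambda b: b["rangeStart"])
        if abs(sum(b["frequency"] for b in bins) - 1.0) > tolerance:
            errors.append("histogram bin frequencies must sum to 1.0")
        for prev, cur in zip(bins, bins[1:]):
            if cur["rangeStart"] < prev["rangeEnd"]:
                errors.append("histogram bins must not overlap")

    return errors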

Format Examples

Complete E-commerce Product Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Product",
  "type": "object",
  "properties": {
    "productId": {
      "type": "string",
      "pattern": "^PROD-[0-9]{6}$",
      "stringModel": {
        "patterns": ["PROD-dddddd"],
        "entropyScore": 1.0
      }
    },
    "name": {
      "type": "string",
      "minLength": 5,
      "maxLength": 100,
      "stringModel": {
        "patterns": ["Llll Llll", "Llll Llll Llll"],
        "valueFrequency": {
          "Wireless Mouse": 10,
          "Gaming Keyboard": 8,
          "USB Hub": 5
        },
        "entropyScore": 3.0
      }
    },
    "price": {
      "type": "number",
      "minimum": 0.99,
      "maximum": 9999.99,
      "multipleOf": 0.01,
      "histogram": {
        "bins": [
          {"rangeStart": 0.99, "rangeEnd": 49.99, "frequency": 0.4},
          {"rangeStart": 50, "rangeEnd": 199.99, "frequency": 0.35},
          {"rangeStart": 200, "rangeEnd": 999.99, "frequency": 0.2},
          {"rangeStart": 1000, "rangeEnd": 9999.99, "frequency": 0.05}
        ]
      }
    },
    "category": {
      "type": "string",
      "enum": ["Electronics", "Computers", "Accessories", "Gaming"],
      "enumProbabilities": [0.3, 0.3, 0.25, 0.15]
    },
    "inStock": {
      "type": "boolean",
      "probability": 0.85
    },
    "tags": {
      "type": "array",
      "items": {
        "type": "string",
        "minLength": 3,
        "maxLength": 20
      },
      "minItems": 0,
      "maxItems": 5,
      "uniqueItems": true
    },
    "specifications": {
      "type": "object",
      "properties": {
        "weight": {
          "type": "number",
          "minimum": 0.1,
          "maximum": 50,
          "unit": "kg"
        },
        "dimensions": {
          "type": "object",
          "properties": {
            "length": {"type": "number"},
            "width": {"type": "number"},
            "height": {"type": "number"}
          }
        }
      }
    }
  },
  "required": ["productId", "name", "price", "category"],
  "metadata": {
    "version": "1.0.0",
    "dataSource": "product_catalog_2024"
  }
}

Compatibility

DataFlood Model Compatibility

DataFlood models are designed with compatibility in mind:

  • Models remain valid schemas and can be used for standard data validation
  • DataFlood extensions are ignored by standard validators
  • Standard data structures can be enhanced with DataFlood extensions

Version Compatibility

{
  "dataFloodVersion": "1.0.0",
  "compatibility": {
    "minVersion": "1.0.0",
    "maxVersion": "2.0.0"
  }
}

Best Practices

  1. Always include type definitions for all properties
  2. Use references to avoid duplication
  3. Add descriptions for complex properties
  4. Include examples in model documentation
  5. Version your schemas for tracking changes
  6. Validate schemas before deployment
  7. Use appropriate constraints to ensure realistic data
  8. Document statistical models and their sources

Migration to DataFlood Models

To convert standard data structures to DataFlood models (the analysis step is sketched after this list):

  1. Analyze existing data to build statistical models
  2. Add string models to string properties
  3. Add histograms to numeric properties
  4. Add probabilities to boolean properties
  5. Add enum probabilities for weighted selection
  6. Test generation with small batches
  7. Refine models based on results
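
Step 1 usually means profiling real samples. A minimal sketch that derives histogram bins from a list of observed numbers; the field names match the histogram structure defined earlier, while the equal-width binning strategy and bin count are assumptions:

from collections import Counter

def build_histogram(samples: list[float], bin_count: int = 10) -> dict:
    # Summarize observed numeric values into DataFlood histogram bins.
    low, high = min(samples), max(samples)
    width = (high - low) / bin_count or 1.0  # avoid zero width when all values are equal
    counts = Counter(min(int((v - low) / width), bin_count - 1) for v in samples)
    total = len(samples)
    bins = []
    for i in range(bin_count):
        bins.append({
            "rangeStart": low + i * width,
            "rangeEnd": low + (i + 1) * width,
            "frequency": counts.get(i, 0) / total,
        })
    return {"bins": bins, "totalSamples": total}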

See Also