DataFlood Model Format Specification
Overview
DataFlood models extend traditional JSON data schemas with statistical modeling for realistic data generation. This document provides a complete specification of the DataFlood model format.
Base Structure
A DataFlood model is stored as a JSON document with the following root structure:
{
"$schema": "https://smallminds.co/DataFlood/schema#",
"title": "Model Title",
"description": "Model Description",
"type": "object|array|string|number|integer|boolean|null",
"properties": {},
"required": [],
// DataFlood Extensions
"dataFloodVersion": "1.0.0",
"metadata": {},
"globalSettings": {}
}
Type Definitions
Object Type
{
"type": "object",
"properties": {
"propertyName": {
// Property schema
}
},
"required": ["propertyName"],
"additionalProperties": false,
"minProperties": 0,
"maxProperties": 100,
"propertyNames": {
"pattern": "^[a-zA-Z][a-zA-Z0-9_]*$"
},
"dependencies": {
"property1": ["property2", "property3"]
}
}
Array Type
{
"type": "array",
"items": {
// Schema for array items
},
"minItems": 0,
"maxItems": 100,
"uniqueItems": false,
"contains": {
// Schema that at least one item must validate against
},
"additionalItems": false
}
String Type
{
"type": "string",
"minLength": 0,
"maxLength": 255,
"pattern": "^[A-Z][a-z]+$",
"format": "email|uri|date|date-time|uuid|ipv4|ipv6",
"enum": ["value1", "value2"],
// DataFlood Extensions
"stringModel": {
// Statistical model for string generation
}
}
Number/Integer Types
{
"type": "number|integer",
"minimum": 0,
"maximum": 100,
"exclusiveMinimum": 0,
"exclusiveMaximum": 100,
"multipleOf": 0.01,
// DataFlood Extensions
"histogram": {
// Distribution model for number generation
}
}
Boolean Type
{
"type": "boolean",
"default": true,
// DataFlood Extension
"probability": 0.7 // Probability of true value
}
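One way to read the `probability` extension is as the chance that a generated value is `true`. The sketch below illustrates that reading; the function name `generate_boolean` and the 0.5 fallback when `probability` is absent are assumptions for this example, not behavior defined by the specification.

```python
import random


def generate_boolean(schema: dict, rng: random.Random) -> bool:
    """Return True with the probability given in the schema."""
    # Fall back to an even split when no probability is given (an assumption).
    p_true = schema.get("probability", 0.5)
    return rng.random() < p_true


rng = random.Random(0)
schema = {"type": "boolean", "default": True, "probability": 0.7}
samples = [generate_boolean(schema, rng) for _ in range(10_000)]
print(sum(samples) / len(samples))  # roughly 0.7 for a large sample
```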
Null Type
{
"type": "null"
}
DataFlood Modeling Elements
String Model
The stringModel property contains statistical information for string generation:
{
"stringModel": {
"entropyScore": 3.5,
"patterns": [
"Llll Llll",
"Llll-Llll",
"LLLL"
],
"valueFrequency": {
"common_value": 10,
"another_value": 5,
"rare_value": 1
},
"characterProbability": {
"a": 0.08,
"b": 0.02,
"c": 0.03
},
"nGrams": {
"bigrams": {
"th": 0.05,
"he": 0.04,
"in": 0.03
},
"trigrams": {
"the": 0.03,
"ing": 0.02,
"and": 0.02
}
},
"lengthDistribution": {
"5": 0.1,
"10": 0.3,
"15": 0.4,
"20": 0.2
}
}
}
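As a hedged illustration of how these tables could drive generation, the sketch below draws a known value from `valueFrequency` and a target length from `lengthDistribution`. The helper `pick_weighted` is hypothetical. It relies on `random.choices` accepting relative weights, so raw counts and probabilities can be handled the same way.

```python
import random

# Subset of the stringModel example above.
STRING_MODEL = {
    "valueFrequency": {"common_value": 10, "another_value": 5, "rare_value": 1},
    "lengthDistribution": {"5": 0.1, "10": 0.3, "15": 0.4, "20": 0.2},
}


def pick_weighted(table: dict, rng: random.Random) -> str:
    """Pick one key of `table`, weighted by its value (count or probability)."""
    keys = list(table)
    return rng.choices(keys, weights=[table[k] for k in keys], k=1)[0]


rng = random.Random(12345)
value = pick_weighted(STRING_MODEL["valueFrequency"], rng)
# JSON object keys are strings, so the chosen length is converted to int.
length = int(pick_weighted(STRING_MODEL["lengthDistribution"], rng))
print(value, length)
```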
Pattern Notation
Symbol | Meaning | Example |
---|---|---|
L | Uppercase letter | A-Z |
l | Lowercase letter | a-z |
d | Digit | 0-9 |
s | Space | |
w | Word character | a-zA-Z0-9_ |
x | Hexadecimal | 0-9a-f |
X | Uppercase hex | 0-9A-F |
. | Any character | |
Literal | Exact character | - , @ , etc. |
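To make the notation concrete, the sketch below expands a pattern into a random string by mapping each symbol to a character pool and copying any other character as a literal. The function name `expand_pattern` and the pool chosen for `.` (letters, digits, and punctuation) are assumptions for illustration only.

```python
import random
import string

# Character pools for each pattern symbol, following the table above.
# The pool for "." is an assumption made for this sketch.
SYMBOL_POOLS = {
    "L": string.ascii_uppercase,
    "l": string.ascii_lowercase,
    "d": string.digits,
    "s": " ",
    "w": string.ascii_letters + string.digits + "_",
    "x": "0123456789abcdef",
    "X": "0123456789ABCDEF",
    ".": string.ascii_letters + string.digits + string.punctuation,
}


def expand_pattern(pattern: str, rng: random.Random) -> str:
    """Replace each pattern symbol with a random character; keep literals."""
    return "".join(
        rng.choice(SYMBOL_POOLS[ch]) if ch in SYMBOL_POOLS else ch
        for ch in pattern
    )


rng = random.Random(12345)
print(expand_pattern("Llll-Llll", rng))    # two capitalized words joined by "-"
print(expand_pattern("PROD-dddddd", rng))  # "PROD-" followed by six digits
```

Because symbols and literals share one alphabet in this sketch, a literal `L`, `d`, or `x` cannot be expressed; how DataFlood escapes such characters is not covered here.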
Histogram
The histogram property defines numeric value distributions:
{
"histogram": {
"bins": [
{
"rangeStart": 0,
"rangeEnd": 100,
"frequency": 0.6,
"freqStart": 0,
"freqEnd": 60
},
{
"rangeStart": 100,
"rangeEnd": 500,
"frequency": 0.3,
"freqStart": 60,
"freqEnd": 90
},
{
"rangeStart": 500,
"rangeEnd": 1000,
"frequency": 0.1,
"freqStart": 90,
"freqEnd": 100
}
],
"totalSamples": 1000,
"mean": 250.5,
"median": 150,
"mode": 75,
"standardDeviation": 125.3,
"distribution": "normal|uniform|exponential|custom"
}
}
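A simple way to sample from such a histogram is to pick a bin in proportion to its `frequency` and then draw a value inside that bin. The sketch below does exactly that; drawing uniformly within the bin is an assumption, and the role of `freqStart`/`freqEnd` in the actual generator is not modeled here. The name `sample_from_histogram` is hypothetical.

```python
import random

# Bins copied from the example above (only the fields this sketch uses).
BINS = [
    {"rangeStart": 0, "rangeEnd": 100, "frequency": 0.6},
    {"rangeStart": 100, "rangeEnd": 500, "frequency": 0.3},
    {"rangeStart": 500, "rangeEnd": 1000, "frequency": 0.1},
]


def sample_from_histogram(bins: list, rng: random.Random) -> float:
    """Choose a bin by frequency, then a uniform value inside that bin."""
    weights = [b["frequency"] for b in bins]
    chosen = rng.choices(bins, weights=weights, k=1)[0]
    return rng.uniform(chosen["rangeStart"], chosen["rangeEnd"])


rng = random.Random(12345)
values = [sample_from_histogram(BINS, rng) for _ in range(10_000)]
share_low = sum(v < 100 for v in values) / len(values)
print(round(share_low, 2))  # roughly 0.6, matching the first bin's frequency
```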
Enum with Probabilities
For weighted enum selection:
{
"type": "string",
"enum": ["Bronze", "Silver", "Gold", "Platinum"],
"enumProbabilities": [0.5, 0.3, 0.15, 0.05]
}
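The pairing between `enum` and `enumProbabilities` is positional, so a generator can pass both lists straight to a weighted choice. The following is a minimal sketch under that reading; `pick_enum` is a hypothetical helper, and falling back to a uniform choice when no probabilities are given is an assumption.

```python
import random


def pick_enum(schema: dict, rng: random.Random):
    """Pick one enum value, weighted by enumProbabilities when present."""
    values = schema["enum"]
    probs = schema.get("enumProbabilities")
    if probs is None:
        return rng.choice(values)  # assumed fallback: uniform choice
    if len(probs) != len(values):
        raise ValueError("enumProbabilities must have one entry per enum value")
    return rng.choices(values, weights=probs, k=1)[0]


tier_schema = {
    "type": "string",
    "enum": ["Bronze", "Silver", "Gold", "Platinum"],
    "enumProbabilities": [0.5, 0.3, 0.15, 0.05],
}
rng = random.Random(12345)
print([pick_enum(tier_schema, rng) for _ in range(5)])
```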
Format Extensions
DataFlood recognizes additional formats:
{
"format": "email|uri|date|date-time|time|duration|uuid|ipv4|ipv6|hostname|json-pointer|regex|credit-card|ssn|phone|postal-code"
}
Conditional Generation
Support for conditional schemas:
{
"if": {
"properties": {
"country": {"const": "US"}
}
},
"then": {
"properties": {
"postalCode": {
"type": "string",
"pattern": "^[0-9]{5}(-[0-9]{4})?$"
}
}
},
"else": {
"properties": {
"postalCode": {
"type": "string",
"pattern": "^[A-Z][0-9][A-Z] [0-9][A-Z][0-9]$"
}
}
}
}
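During generation, the conditional has to be resolved against values that were already produced, such as `country`. The sketch below shows one possible way to choose between `then` and `else`; it only understands `const` checks inside `if.properties`, and the function name `select_branch` is hypothetical.

```python
def select_branch(schema: dict, instance: dict) -> dict:
    """Return the `then` sub-schema if `instance` satisfies `if`, else `else`."""
    conditions = schema.get("if", {}).get("properties", {})
    # As in JSON Schema, `properties` only constrains properties that are
    # present, so a missing property does not make the condition fail.
    matches = all(
        name not in instance or instance[name] == rule["const"]
        for name, rule in conditions.items()
        if "const" in rule
    )
    return schema.get("then" if matches else "else", {})


conditional = {
    "if": {"properties": {"country": {"const": "US"}}},
    "then": {"properties": {"postalCode": {"type": "string",
                                           "pattern": "^[0-9]{5}(-[0-9]{4})?$"}}},
    "else": {"properties": {"postalCode": {"type": "string",
                                           "pattern": "^[A-Z][0-9][A-Z] [0-9][A-Z][0-9]$"}}},
}
branch = select_branch(conditional, {"country": "CA"})
print(branch["properties"]["postalCode"]["pattern"])
```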
Complex Schema Examples
Nested Object with References
{
"type": "object",
"properties": {
"customer": {
"$ref": "#/definitions/customer"
},
"orders": {
"type": "array",
"items": {
"$ref": "#/definitions/order"
}
}
},
"definitions": {
"customer": {
"type": "object",
"properties": {
"id": {"type": "string"},
"name": {"type": "string"}
}
},
"order": {
"type": "object",
"properties": {
"orderId": {"type": "string"},
"amount": {"type": "number"}
}
}
}
}
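References such as `#/definitions/customer` are JSON Pointers into the same document. The sketch below resolves local references only; `resolve_ref` is a hypothetical helper, and remote references and array indices are deliberately left out.

```python
def resolve_ref(root: dict, ref: str) -> dict:
    """Follow a local "$ref" (for example "#/definitions/customer") from root."""
    if not ref.startswith("#"):
        raise ValueError("only local references are handled in this sketch")
    node = root
    for part in ref[1:].split("/"):
        if part:
            # Undo JSON Pointer escaping: "~1" is "/", "~0" is "~".
            node = node[part.replace("~1", "/").replace("~0", "~")]
    return node


model = {
    "type": "object",
    "properties": {"customer": {"$ref": "#/definitions/customer"}},
    "definitions": {
        "customer": {
            "type": "object",
            "properties": {"id": {"type": "string"}, "name": {"type": "string"}},
        }
    },
}
target = resolve_ref(model, model["properties"]["customer"]["$ref"])
print(sorted(target["properties"]))  # ['id', 'name']
```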
OneOf/AnyOf/AllOf
{
"type": "object",
"properties": {
"payment": {
"oneOf": [
{
"type": "object",
"properties": {
"type": {"const": "credit_card"},
"cardNumber": {"type": "string"}
}
},
{
"type": "object",
"properties": {
"type": {"const": "bank_transfer"},
"accountNumber": {"type": "string"}
}
}
]
}
}
}
Recursive Structures
{
"type": "object",
"properties": {
"name": {"type": "string"},
"children": {
"type": "array",
"items": {
"$ref": "#"
}
}
},
"dataFlood": {
"maxRecursionDepth": 3
}
}
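Without a depth limit, a self-referencing schema like this one would generate forever. The sketch below shows one interpretation of `maxRecursionDepth`: the root sits at depth 0, and nodes at the maximum depth get an empty `children` array. That interpretation, the single child per level, and the helper names are all assumptions.

```python
def generate_node(schema: dict, depth: int = 0) -> dict:
    """Build a nested node, stopping once maxRecursionDepth is reached."""
    max_depth = schema.get("dataFlood", {}).get("maxRecursionDepth", 0)
    children = []
    if depth < max_depth:
        # One child per level keeps the sketch small.
        children.append(generate_node(schema, depth + 1))
    return {"name": f"node-{depth}", "children": children}


def tree_depth(node: dict) -> int:
    """Count the levels in a generated tree."""
    return 1 + max((tree_depth(c) for c in node["children"]), default=0)


recursive_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "children": {"type": "array", "items": {"$ref": "#"}},
    },
    "dataFlood": {"maxRecursionDepth": 3},
}
tree = generate_node(recursive_schema)
print(tree_depth(tree))  # 4 levels: depths 0 through 3
```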
Global Settings
Configure generation behavior globally:
{
"globalSettings": {
"defaultEntropy": 2.5,
"stringDefaults": {
"minLength": 1,
"maxLength": 100
},
"numberDefaults": {
"minimum": 0,
"maximum": 1000000
},
"arrayDefaults": {
"minItems": 0,
"maxItems": 100
},
"nullProbability": 0.1,
"generationSeed": 12345
}
}
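Two of these settings are easy to illustrate: `generationSeed` makes output reproducible, and `nullProbability` decides how often a value is replaced by null. The sketch below assumes the null decision is made independently for each value, which the specification does not state; `make_rng`, `maybe_null`, and `run` are hypothetical helpers.

```python
import random


def make_rng(global_settings: dict) -> random.Random:
    """Create a random generator seeded from generationSeed, if present."""
    return random.Random(global_settings.get("generationSeed"))


def maybe_null(value, global_settings: dict, rng: random.Random):
    """Replace `value` with None at the configured nullProbability."""
    if rng.random() < global_settings.get("nullProbability", 0.0):
        return None
    return value


def run(global_settings: dict) -> list:
    """Generate a short sequence with a fresh, seeded generator."""
    rng = make_rng(global_settings)
    return [maybe_null(i, global_settings, rng) for i in range(20)]


settings = {"nullProbability": 0.1, "generationSeed": 12345}
print(run(settings) == run(settings))  # True: same seed, same sequence
```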
Metadata
Include documentation and tracking information:
{
"metadata": {
"version": "2.1.0",
"author": "Data Team",
"created": "2024-01-15",
"modified": "2024-01-20",
"tags": ["customer", "production"],
"dataSource": "production_sample_2024",
"sampleSize": 10000,
"confidenceLevel": 0.95,
"notes": "Updated based on Q1 2024 data"
}
}
Validation Rules
Schema Validation
DataFlood schemas must:
- Be valid JSON
- Conform to DataFlood model structure
- Have consistent type definitions
- Include required DataFlood extensions for generation
Constraint Validation
Constraints must be logically consistent:
- minimum ≤ maximum
- minLength ≤ maxLength
- minItems ≤ maxItems
- minProperties ≤ maxProperties
- Enum values must match declared type
- Patterns must be valid regular expressions
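The bound and pattern rules above can be checked mechanically. The following sketch covers those two rules for a single schema node and leaves the enum type check out; `check_constraints` is a hypothetical helper. It compiles patterns with Python's `re` module, whose dialect differs in places from the ECMA 262 regular expressions that JSON-based schemas usually assume.

```python
import re

# Each pair (lower, upper) must satisfy lower <= upper when both are present.
BOUND_PAIRS = [
    ("minimum", "maximum"),
    ("minLength", "maxLength"),
    ("minItems", "maxItems"),
    ("minProperties", "maxProperties"),
]


def check_constraints(schema: dict) -> list:
    """Return a list of human-readable constraint errors (empty if none)."""
    errors = []
    for lower, upper in BOUND_PAIRS:
        if lower in schema and upper in schema and schema[lower] > schema[upper]:
            errors.append(f"{lower} ({schema[lower]}) exceeds {upper} ({schema[upper]})")
    if "pattern" in schema:
        try:
            re.compile(schema["pattern"])
        except re.error as exc:
            errors.append(f"invalid pattern: {exc}")
    return errors


print(check_constraints({"type": "string", "minLength": 10, "maxLength": 5}))
print(check_constraints({"type": "string", "pattern": "^[A-Z][a-z]+$"}))  # []
```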
Statistical Model Validation
Statistical models must:
- Have valid probability distributions (sum to 1.0)
- Include non-empty patterns for string models
- Have non-overlapping histogram bins
- Contain valid entropy scores (0.0 or greater)
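Two of these checks lend themselves to a short sketch: verifying that a probability list sums to 1.0 within a floating-point tolerance, and detecting overlapping histogram bins. The helper names and the tolerance are assumptions, as is the choice to treat bins that merely share a boundary (one ends at 100, the next starts at 100) as non-overlapping.

```python
import math


def sums_to_one(probabilities: list, tolerance: float = 1e-9) -> bool:
    """True if the probabilities add up to 1.0 within the given tolerance."""
    return math.isclose(sum(probabilities), 1.0, abs_tol=tolerance)


def bins_overlap(bins: list) -> bool:
    """True if any bin starts before the previous bin (by rangeStart) ends."""
    ordered = sorted(bins, key=lambda b: b["rangeStart"])
    return any(
        prev["rangeEnd"] > cur["rangeStart"]
        for prev, cur in zip(ordered, ordered[1:])
    )


print(sums_to_one([0.5, 0.3, 0.15, 0.05]))  # True
print(bins_overlap([
    {"rangeStart": 0, "rangeEnd": 100},
    {"rangeStart": 100, "rangeEnd": 500},
]))  # False: the bins only touch at 100
```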
Format Examples
Complete E-commerce Product Schema
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "Product",
"type": "object",
"properties": {
"productId": {
"type": "string",
"pattern": "^PROD-[0-9]{6}$",
"stringModel": {
"patterns": ["PROD-dddddd"],
"entropyScore": 1.0
}
},
"name": {
"type": "string",
"minLength": 5,
"maxLength": 100,
"stringModel": {
"patterns": ["Llll Llll", "Llll Llll Llll"],
"valueFrequency": {
"Wireless Mouse": 10,
"Gaming Keyboard": 8,
"USB Hub": 5
},
"entropyScore": 3.0
}
},
"price": {
"type": "number",
"minimum": 0.99,
"maximum": 9999.99,
"multipleOf": 0.01,
"histogram": {
"bins": [
{"rangeStart": 0.99, "rangeEnd": 49.99, "frequency": 0.4},
{"rangeStart": 50, "rangeEnd": 199.99, "frequency": 0.35},
{"rangeStart": 200, "rangeEnd": 999.99, "frequency": 0.2},
{"rangeStart": 1000, "rangeEnd": 9999.99, "frequency": 0.05}
]
}
},
"category": {
"type": "string",
"enum": ["Electronics", "Computers", "Accessories", "Gaming"],
"enumProbabilities": [0.3, 0.3, 0.25, 0.15]
},
"inStock": {
"type": "boolean",
"probability": 0.85
},
"tags": {
"type": "array",
"items": {
"type": "string",
"minLength": 3,
"maxLength": 20
},
"minItems": 0,
"maxItems": 5,
"uniqueItems": true
},
"specifications": {
"type": "object",
"properties": {
"weight": {
"type": "number",
"minimum": 0.1,
"maximum": 50,
"unit": "kg"
},
"dimensions": {
"type": "object",
"properties": {
"length": {"type": "number"},
"width": {"type": "number"},
"height": {"type": "number"}
}
}
}
}
},
"required": ["productId", "name", "price", "category"],
"metadata": {
"version": "1.0.0",
"dataSource": "product_catalog_2024"
}
}
Compatibility
DataFlood Model Compatibility
DataFlood models are designed with compatibility in mind:
- Can be used as standard data validators
- DataFlood extensions are ignored by standard validators
- Standard data structures can be enhanced with DataFlood extensions
Version Compatibility
{
"dataFloodVersion": "1.0.0",
"compatibility": {
"minVersion": "1.0.0",
"maxVersion": "2.0.0"
}
}
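How the `compatibility` range is evaluated is not spelled out above. One plausible reading is that a generator checks its own version against `minVersion` and `maxVersion`, both inclusive. The sketch below follows that reading and assumes purely numeric, dot-separated version strings; `parse_version` and `is_compatible` are hypothetical helpers.

```python
def parse_version(text: str) -> tuple:
    """Turn "1.2.3" into (1, 2, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def is_compatible(model: dict, engine_version: str) -> bool:
    """Check engine_version against the model's compatibility range."""
    compat = model.get("compatibility", {})
    version = parse_version(engine_version)
    lowest = parse_version(compat.get("minVersion", "0.0.0"))
    highest = compat.get("maxVersion")
    # Both bounds are treated as inclusive (an assumption).
    return lowest <= version and (highest is None or version <= parse_version(highest))


versioned_model = {
    "dataFloodVersion": "1.0.0",
    "compatibility": {"minVersion": "1.0.0", "maxVersion": "2.0.0"},
}
print(is_compatible(versioned_model, "1.5.0"))  # True
print(is_compatible(versioned_model, "2.0.1"))  # False
```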
Best Practices
- Always include type definitions for all properties
- Use references to avoid duplication
- Add descriptions for complex properties
- Include examples in model documentation
- Version your schemas for tracking changes
- Validate schemas before deployment
- Use appropriate constraints to ensure realistic data
- Document statistical models and their sources
Migration to DataFlood Models
To convert standard data structures to DataFlood models:
- Analyze existing data to build statistical models
- Add string models to string properties
- Add histograms to numeric properties
- Add probabilities to boolean properties
- Add enum probabilities for weighted selection
- Test generation with small batches
- Refine models based on results
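The first and third steps above can be sketched as a function that turns observed numeric values into a `histogram` fragment. Equal-width bins are an assumption made for simplicity, and `build_histogram` is a hypothetical helper rather than a DataFlood tool.

```python
def build_histogram(values: list, bin_count: int) -> dict:
    """Summarize sample values as equal-width bins with relative frequencies."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("values must span a non-zero range")
    width = (high - low) / bin_count
    counts = [0] * bin_count
    for value in values:
        # The maximum value would land one past the end, so clamp it.
        index = min(int((value - low) / width), bin_count - 1)
        counts[index] += 1
    total = len(values)
    return {
        "bins": [
            {
                "rangeStart": low + i * width,
                "rangeEnd": low + (i + 1) * width,
                "frequency": counts[i] / total,
            }
            for i in range(bin_count)
        ],
        "totalSamples": total,
    }


histogram = build_histogram([0, 1, 2, 3, 10], bin_count=2)
print([b["frequency"] for b in histogram["bins"]])  # [0.8, 0.2]
```

The resulting dictionary can be attached to a numeric property under the `histogram` key and then refined after test generation.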