DataFlood CLI - Getting Started Guide

Overview

DataFlood is a command-line tool that generates synthetic JSON and CSV data from DataFlood models. It analyzes existing data files to learn their patterns and distributions, then generates new data with similar statistical properties.

Prerequisites

.NET 9.0 Runtime or SDK installed
Access to sample data files (JSON or CSV) for DataFlood model generation

Quick Start

1. Generate a DataFlood Model from Sample Data

First, point DataFlood at a folder of existing data files to build a model:

## Analyze all JSON/CSV files in a folder
DataFlood /path/to/sample/data output-schema.json

This creates a DataFlood model file containing:

Statistical string models (character patterns, n-grams)
Numeric histograms (value distributions)
Format detection (emails, URLs, dates)
Structural patterns and constraints
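As a rough illustration of the histogram idea only, the sketch below bins sample values into equal-width buckets and then draws new values in proportion to the bucket counts. The function names, the equal-width binning, and the dictionary layout are assumptions made for this example; this guide does not describe DataFlood's actual model format or sampling method.

```python
import random


def build_histogram(values, bins=4):
    """Count how many sample values fall into each equal-width bucket."""
    low, high = min(values), max(values)
    # Fall back to a width of 1.0 when every sample has the same value.
    width = (high - low) / bins or 1.0
    counts = [0] * bins
    for value in values:
        # Clamp so the maximum value lands in the last bucket.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return {"min": low, "width": width, "counts": counts}


def sample(histogram, rng):
    """Pick a bucket weighted by its count, then a uniform value inside it."""
    buckets = range(len(histogram["counts"]))
    index = rng.choices(buckets, weights=histogram["counts"])[0]
    start = histogram["min"] + index * histogram["width"]
    return rng.uniform(start, start + histogram["width"])


prices = [10, 12, 14, 50]  # hypothetical sample values
histogram = build_histogram(prices)
rng = random.Random(42)  # a fixed seed makes the draws repeatable
generated = [sample(histogram, rng) for _ in range(5)]
```

Empty buckets have zero weight, so generated values cluster where the sample values did, which is the sense in which output can keep "similar statistical properties".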

2. Generate Synthetic Documents

Use the model file (output-schema.json in the command above) to generate new data:

## Generate 10 documents
DataFlood generate output-schema.json --count 10

## Generate with specific seed for reproducibility
DataFlood generate output-schema.json --count 100 --seed 42

## Generate as CSV format
DataFlood generate output-schema.json --count 50 --format csv

3. Run Tests (Optional)

Verify DataFlood is working correctly:

## Run the built-in test suite (no arguments)
DataFlood

Basic Commands

Model Generation

DataFlood <folder-path> [output-file]

folder-path: Directory containing sample JSON/CSV files
output-file: Output model file (default: generated-model.json)

Document Generation

DataFlood generate <model-file> [options]

Options:

--count, -c <number>: Number of documents (1-10000, default: 1)
--seed, -s <number>: Random seed for reproducibility
--output, -o <file>: Output filename
--format, -f <json|csv>: Output format (default: json)
--separate: Generate individual files for each document
--metadata: Include generation metadata
--entropy, -e <number>: Override string generation entropy

Sequence Generation (Tides)

DataFlood sequence <config-file> [options]

Generate time-based document sequences with parent-child relationships.

Options:

--output, -o <file>: Output filename
--max-docs, -n <number>: Maximum documents to generate
--format, -f <json|csv>: Output format
--validate-only: Validate configuration without generating
--metadata: Include generation metadata
--seed, -s <number>: Override random seed

Examples

Example 1: E-commerce Data Generation

Create sample data files in a folder, for example ./sample-data/sample-product.json:

{
  "productId": "PROD-123456",
  "name": "Wireless Mouse",
  "price": 29.99,
  "category": "Electronics",
  "inStock": true
}

Generate model:

DataFlood ./sample-data products-model.json

Generate 100 products:

DataFlood generate products-model.json --count 100 --output products.json

Example 2: CSV Output with Metadata

## Generate CSV with metadata comments
DataFlood generate model.json \
  --count 1000 \
  --format csv \
  --metadata \
  --output data.csv

Example 3: Reproducible Test Data

## The same seed always generates the same data
DataFlood generate model.json \
  --count 50 \
  --seed 12345 \
  --output test-data.json
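
One way to confirm reproducibility from a script is to run the same command twice and compare the files. The sketch below assumes DataFlood is on the PATH and that --output accepts a path inside a temporary directory; the helper names are made up for this example. It leaves out --metadata, because the generation timestamp would likely differ between runs.

```python
import filecmp
import subprocess
import tempfile
from pathlib import Path


def generate_args(model: str, count: int, seed: int, output: str) -> list[str]:
    """Build a `DataFlood generate` command line from the documented options."""
    return [
        "DataFlood", "generate", model,
        "--count", str(count),
        "--seed", str(seed),
        "--output", output,
    ]


def is_reproducible(model: str, count: int = 50, seed: int = 12345,
                    run=subprocess.run) -> bool:
    """Generate twice with the same seed and compare the files byte for byte."""
    with tempfile.TemporaryDirectory() as tmp:
        first = str(Path(tmp) / "first.json")
        second = str(Path(tmp) / "second.json")
        for output in (first, second):
            # check=True raises if the command exits with a non-zero status.
            run(generate_args(model, count, seed, output), check=True)
        return filecmp.cmp(first, second, shallow=False)
```

Calling is_reproducible("model.json") runs the real command; the run parameter exists only so the check can be exercised without DataFlood installed.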

Example 4: Individual Files

## Generate separate files: doc-001.json, doc-002.json, etc.
DataFlood generate model.json \
  --count 10 \
  --separate \
  --output doc.json
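
To check the result of a --separate run from a script, the expected file names can be derived from the --output value. The sketch below follows the doc-001.json, doc-002.json pattern shown in the comment above; the three-digit, 1-based counter is inferred from that comment and may not hold for larger counts, and both helper names are hypothetical.

```python
from pathlib import Path


def expected_files(output: str, count: int) -> list[str]:
    """List the file names implied by the doc-001.json, doc-002.json pattern."""
    path = Path(output)
    # Assumption: a three-digit, 1-based counter sits between stem and suffix.
    return [
        f"{path.stem}-{number:03d}{path.suffix}"
        for number in range(1, count + 1)
    ]


def missing_files(output: str, count: int, folder: str = ".") -> list[str]:
    """Return the expected files that are not present in the folder."""
    return [
        name for name in expected_files(output, count)
        if not (Path(folder) / name).exists()
    ]
```

An empty list from missing_files("doc.json", 10) would mean all ten files were found in the current folder.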

Entropy Control

The --entropy parameter controls string generation randomness:

Entropy Range   Generation Method             Use Case
0.0 - 2.0       Sample from existing values   Predictable, vocabulary-based
2.0 - 4.0       Pattern-based generation      Moderate variation
> 4.0           Character distribution        High randomness

Examples:

## Low entropy - uses existing vocabulary
DataFlood generate model.json --entropy 1.0 --count 10

## High entropy - more random generation
DataFlood generate model.json --entropy 5.0 --count 10
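
The entropy ranges can be read as a simple lookup. The sketch below encodes the table as a function; it is an illustration, not DataFlood's code. The table lists 2.0 in two rows, so assigning exactly 2.0 to the lower band is an assumption, as is rejecting negative values.

```python
def generation_method(entropy: float) -> str:
    """Map an --entropy value to the generation method listed in the table."""
    if entropy < 0.0:
        # Assumption: the table starts at 0.0, so negative values are rejected.
        raise ValueError("entropy must be 0.0 or greater")
    if entropy <= 2.0:
        # Assumption: exactly 2.0 counts as the lower band.
        return "Sample from existing values"
    if entropy <= 4.0:
        # 4.0 is not "> 4.0", so it stays in the middle band.
        return "Pattern-based generation"
    return "Character distribution"
```

With this reading, the two commands above (--entropy 1.0 and --entropy 5.0) fall into the first and last bands respectively.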

Output Formats

JSON Format

Default format with nested structure support:

[
  {
    "id": "generated-value",
    "name": "generated-name",
    "details": { ... }
  }
]

CSV Format

Flattened structure with dot notation for nested fields:

id,name,details.field1,details.field2
value1,name1,field1_value,field2_value
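
The dot-notation flattening can be sketched as follows. This is a minimal illustration of the idea rather than DataFlood's implementation: the function names are made up, and arrays are not handled because this guide does not say how DataFlood writes them to CSV.

```python
import csv
import io


def flatten(doc: dict, prefix: str = "") -> dict:
    """Turn nested objects into a single level with dot-separated keys."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def to_csv(docs: list[dict]) -> str:
    """Write flattened documents as CSV text with a header row."""
    rows = [flatten(doc) for doc in docs]
    # Columns appear in the order they are first seen across all rows.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # fields missing from a row are left empty
    return buffer.getvalue()


docs = [{
    "id": "value1",
    "name": "name1",
    "details": {"field1": "field1_value", "field2": "field2_value"},
}]
csv_text = to_csv(docs)
```

Here csv_text holds the same two lines as the CSV sample above.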

Metadata Options

Without --metadata (default):

Output contains only the generated documents

With --metadata, the output also includes:

Generation timestamp
Seed value for reproducibility
Model file reference
Document count and numbering

Next Steps

Learn about Core Concepts - models and histograms
Explore the CLI Reference for all commands
See Examples for real-world scenarios
Configure Time-Based Tides for complex workflows

Troubleshooting

Common Issues

"Directory does not exist" error:

Verify the path to your sample data folder
Use absolute paths if relative paths aren't working

Generated data doesn't match expectations:

Check that your sample data has enough variety
Adjust --entropy to change how random the generated strings are
Use more sample files for better pattern detection

Out of memory with large generation counts:

Reduce the --count value
Generate in batches, using a different seed for each batch so the batches do not repeat the same documents
Use --separate to write individual files
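
The batching suggestion can be scripted. The sketch below splits a large total into DataFlood generate commands that stay within the documented 1-10000 range for --count, giving each batch its own seed and output file. The batch-NNN.json naming and the consecutive seeds are choices made for this example, not DataFlood conventions.

```python
MAX_COUNT = 10000  # documented upper bound for --count


def batch_commands(model, total, batch_size=MAX_COUNT, base_seed=1):
    """Build one `DataFlood generate` argument list per batch."""
    if not 1 <= batch_size <= MAX_COUNT:
        raise ValueError("batch_size must be between 1 and 10000")
    commands = []
    for index, start in enumerate(range(0, total, batch_size)):
        # The last batch may be smaller than batch_size.
        count = min(batch_size, total - start)
        commands.append([
            "DataFlood", "generate", model,
            "--count", str(count),
            "--seed", str(base_seed + index),  # a different seed per batch
            "--output", f"batch-{index + 1:03d}.json",  # hypothetical naming
        ])
    return commands


commands = batch_commands("model.json", total=25000)
```

Each entry can then be run one at a time, for example with subprocess.run(command, check=True), so only one batch is generated per process.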

For more help, see the Troubleshooting Guide.