DataFlood CLI - Getting Started Guide

Overview

DataFlood is a command-line tool for generating synthetic JSON and CSV data based on DataFlood models. It analyzes existing data files to learn patterns and distributions, then generates new data that maintains similar statistical properties.

Prerequisites

  • .NET 9.0 Runtime or SDK installed
  • Access to sample data files (JSON or CSV) for DataFlood model generation

Quick Start

1. Generate a DataFlood model from Sample Data

First, point DataFlood at a folder of existing data files. It analyzes them and writes a DataFlood model:

## Analyze all JSON/CSV files in a folder
DataFlood /path/to/sample/data output-schema.json

This creates a DataFlood model (here written to output-schema.json) that captures:

  • Statistical string models (character patterns, n-grams)
  • Numeric histograms (value distributions)
  • Format detection (emails, URLs, dates)
  • Structural patterns and constraints

2. Generate Synthetic Documents

Use the generated model file to produce new data:

## Generate 10 documents
DataFlood generate output-schema.json --count 10

## Generate with specific seed for reproducibility
DataFlood generate output-schema.json --count 100 --seed 42

## Generate as CSV format
DataFlood generate output-schema.json --count 50 --format csv

3. Run Tests (Optional)

Verify DataFlood is working correctly:

## Run built-in test suite
DataFlood

Basic Commands

Model Generation

DataFlood <folder-path> [output-file]
  • folder-path: Directory containing sample JSON/CSV files
  • output-file: Output model file (default: generated-model.json)

Document Generation

DataFlood generate <model-file> [options]

Options:

  • --count, -c <number>: Number of documents (1-10000, default: 1)
  • --seed, -s <number>: Random seed for reproducibility
  • --output, -o <file>: Output filename
  • --format, -f <json|csv>: Output format (default: json)
  • --separate: Generate individual files for each document
  • --metadata: Include generation metadata
  • --entropy, -e <number>: Override string generation entropy

Sequence Generation (Tides)

DataFlood sequence <config-file> [options]

Generate time-based document sequences with parent-child relationships.

Options:

  • --output, -o <file>: Output filename
  • --max-docs, -n <number>: Maximum documents to generate
  • --format, -f <json|csv>: Output format
  • --validate-only: Validate configuration without generating
  • --metadata: Include generation metadata
  • --seed, -s <number>: Override random seed

Examples

Example 1: E-commerce Data Generation

  1. Create sample data files in a folder:
// sample-product.json
{
  "productId": "PROD-123456",
  "name": "Wireless Mouse",
  "price": 29.99,
  "category": "Electronics",
  "inStock": true
}
  2. Generate a model:
DataFlood ./sample-data products-model.json
  3. Generate 100 products:
DataFlood generate products-model.json --count 100 --output products.json

Example 2: CSV Output with Metadata

## Generate CSV with metadata comments
DataFlood generate model.json \
  --count 1000 \
  --format csv \
  --metadata \
  --output data.csv

Example 3: Reproducible Test Data

## Always generates the same data with seed
DataFlood generate model.json \
  --count 50 \
  --seed 12345 \
  --output test-data.json
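
The reproducibility claim can be checked by generating twice with the same seed and comparing the files. This is a rough sketch under two assumptions: that --output writes a single file at the given path, and that output without --metadata contains nothing time-dependent. The function name check_reproducible and the DATAFLOOD_BIN override are hypothetical.

```shell
# Hypothetical check: same model + same seed should give byte-identical output.
check_reproducible() {
  model=$1
  seed=$2
  bin=${DATAFLOOD_BIN:-DataFlood}   # assumed override for the executable name
  tmpdir=$(mktemp -d) || return 1

  if "$bin" generate "$model" --count 50 --seed "$seed" --output "$tmpdir/a.json" &&
     "$bin" generate "$model" --count 50 --seed "$seed" --output "$tmpdir/b.json" &&
     cmp -s "$tmpdir/a.json" "$tmpdir/b.json"; then
    status=0   # both runs succeeded and the files match
  else
    status=1   # a run failed or the files differ
  fi

  rm -rf "$tmpdir"
  return "$status"
}
```

For example, check_reproducible model.json 12345 && echo "seeded output is stable" prints the message only when both runs match.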

Example 4: Individual Files

## Generate separate files: doc-001.json, doc-002.json, etc.
DataFlood generate model.json \
  --count 10 \
  --separate \
  --output doc.json

Entropy Control

The --entropy parameter controls string generation randomness:

Entropy Range   Generation Method             Use Case
0.0 - 2.0       Sample from existing values   Predictable, vocabulary-based
2.0 - 4.0       Pattern-based generation      Moderate variation
> 4.0           Character distribution        High randomness

Examples:

## Low entropy - uses existing vocabulary
DataFlood generate model.json --entropy 1.0 --count 10

## High entropy - more random generation
DataFlood generate model.json --entropy 5.0 --count 10
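
The table can be read as a simple threshold lookup. The helper below is only an illustration of that reading, not DataFlood's actual logic: the name entropy_method is hypothetical, and because the table does not say which band the boundary values 2.0 and 4.0 belong to, the sketch assumes each boundary falls in the lower band.

```shell
# Hypothetical helper: name the generation method the table lists for an entropy value.
# Assumption: 2.0 counts as the lowest band and 4.0 as the middle band.
entropy_method() {
  awk -v e="$1" 'BEGIN {
    # Reject anything that is not a plain non-negative decimal number.
    if (e !~ /^[0-9]+(\.[0-9]+)?$/) { print "invalid"; exit 1 }
    e += 0
    if (e <= 2.0)      print "sample from existing values"
    else if (e <= 4.0) print "pattern-based generation"
    else               print "character distribution"
  }'
}
```

For instance, entropy_method 1.0 reports the vocabulary-based method used in the low-entropy example, and entropy_method 5.0 reports character distribution.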

Output Formats

JSON Format

Default format with nested structure support:

[
  {
    "id": "generated-value",
    "name": "generated-name",
    "details": { ... }
  }
]

CSV Format

Flattened structure with dot notation for nested fields:

id,name,details.field1,details.field2
value1,name1,field1_value,field2_value

Metadata Options

Without --metadata (default):

  • Clean data files with just generated documents

With --metadata, the output also includes:

  • Generation timestamp
  • Seed value for reproducibility
  • Model file reference
  • Document count and numbering

Troubleshooting

Common Issues

"Directory does not exist" error:

  • Verify the path to your sample data folder
  • Use absolute paths if relative paths aren't working

Generated data doesn't match expectations:

  • Check that your sample data has enough variety
  • Adjust entropy settings for different randomness levels
  • Use more sample files for better pattern detection

Out of memory with large generation counts:

  • Reduce the count parameter
  • Generate in batches using different seeds
  • Use --separate to write individual files
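
Batching with different seeds could look like the sketch below, which prints one generate command per batch instead of running it. The function name print_batches, the batch-N.json file names, and the choice to increment the seed by one per batch are all assumptions made for illustration. Only the generate command and its --count, --seed, and --output flags come from this guide.

```shell
# Hypothetical helper: print one "DataFlood generate" command per batch.
# Arguments: model file, total documents, batch size, first seed.
print_batches() {
  model=$1
  remaining=$2
  size=$3
  seed=$4
  n=1

  # Guard against a zero or negative batch size, which would loop forever.
  if [ "$size" -le 0 ]; then
    echo "batch size must be positive" >&2
    return 1
  fi

  while [ "$remaining" -gt 0 ]; do
    count=$size
    if [ "$remaining" -lt "$size" ]; then
      count=$remaining   # final, smaller batch
    fi
    echo "DataFlood generate $model --count $count --seed $seed --output batch-$n.json"
    remaining=$((remaining - count))
    seed=$((seed + 1))   # a different seed per batch so batches do not repeat
    n=$((n + 1))
  done
}
```

Running print_batches model.json 25000 10000 1 prints three commands (two batches of 10000 and one of 5000). Review them first, then pipe the output to sh to execute them.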

For more help, see the Troubleshooting Guide.