DataFlood CLI - Getting Started Guide
Overview
DataFlood is a command-line tool for generating synthetic JSON and CSV data based on DataFlood models. It analyzes existing data files to learn patterns and distributions, then generates new data that maintains similar statistical properties.
Prerequisites
- .NET 9.0 Runtime or SDK installed
- Access to sample data files (JSON or CSV) for DataFlood model generation
Quick Start
1. Generate a DataFlood model from Sample Data
First, point DataFlood at your existing data files so it can analyze them and create a model:
## Analyze all JSON/CSV files in a folder
DataFlood /path/to/sample/data output-schema.json
This creates a DataFlood schema with:
- Statistical string models (character patterns, n-grams)
- Numeric histograms (value distributions)
- Format detection (emails, URLs, dates)
- Structural patterns and constraints
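To make the first two bullets concrete, here is an illustrative sketch of the kind of statistics such a model can capture: a character bigram count and a numeric histogram built from sample values. The function names and bin logic are hypothetical, not DataFlood's actual internals.

```python
import collections

def char_bigrams(samples):
    """Count character-pair (bigram) frequencies across sample strings."""
    counts = collections.Counter()
    for s in samples:
        for a, b in zip(s, s[1:]):
            counts[(a, b)] += 1
    return counts

def histogram(values, bins=4):
    """Bucket numeric values into equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1
    buckets = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        buckets[idx] += 1
    return buckets

bigrams = char_bigrams(["PROD-123", "PROD-456"])
print(bigrams[("P", "R")])             # 2: the "PR" pair appears in both samples
print(histogram([10, 20, 25, 90]))     # [3, 0, 0, 1]
```

A generator can then sample from these counts to produce new strings and numbers whose distributions resemble the originals.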
2. Generate Synthetic Documents
Use the schema to generate new data:
## Generate 10 documents
DataFlood generate output-schema.json --count 10
## Generate with specific seed for reproducibility
DataFlood generate output-schema.json --count 100 --seed 42
## Generate as CSV format
DataFlood generate output-schema.json --count 50 --format csv
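The `--seed` option works on the usual pseudo-random principle: the same seed replays the same stream of random choices. A minimal sketch of that idea (illustrative only, not DataFlood's actual generator):

```python
import random

def generate_ids(count, seed):
    """Deterministically generate synthetic IDs from a seed."""
    rng = random.Random(seed)  # instance-local RNG so runs are reproducible
    return [f"PROD-{rng.randint(100000, 999999)}" for _ in range(count)]

run1 = generate_ids(3, seed=42)
run2 = generate_ids(3, seed=42)
print(run1 == run2)  # True: the same seed yields the same documents
```

This is why a seeded run is safe to use in automated tests: re-running it produces byte-identical data.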
3. Run Tests (Optional)
Verify DataFlood is working correctly:
## Run built-in test suite
DataFlood
Basic Commands
Model Generation
DataFlood <folder-path> [output-file]
folder-path
: Directory containing sample JSON/CSV files
output-file
: Output model file (default: generated-model.json)
Document Generation
DataFlood generate <model-file> [options]
Options:
--count, -c <number>
: Number of documents (1-10000, default: 1)
--seed, -s <number>
: Random seed for reproducibility
--output, -o <file>
: Output filename
--format, -f <json|csv>
: Output format (default: json)
--separate
: Generate individual files for each document
--metadata
: Include generation metadata
--entropy, -e <number>
: Override string generation entropy
Sequence Generation (Tides)
DataFlood sequence <config-file> [options]
Generate time-based document sequences with parent-child relationships.
Options:
--output, -o <file>
: Output filename
--max-docs, -n <number>
: Maximum documents to generate
--format, -f <json|csv>
: Output format
--validate-only
: Validate configuration without generating
--metadata
: Include generation metadata
--seed, -s <number>
: Override random seed
Examples
Example 1: E-commerce Data Generation
- Create sample data files in a folder:
// sample-product.json
{
"productId": "PROD-123456",
"name": "Wireless Mouse",
"price": 29.99,
"category": "Electronics",
"inStock": true
}
- Generate model:
DataFlood ./sample-data products-model.json
- Generate 100 products:
DataFlood generate products-model.json --count 100 --output products.json
Example 2: CSV Output with Metadata
## Generate CSV with metadata comments
DataFlood generate model.json \
--count 1000 \
--format csv \
--metadata \
--output data.csv
Example 3: Reproducible Test Data
## Always generates the same data with seed
DataFlood generate model.json \
--count 50 \
--seed 12345 \
--output test-data.json
Example 4: Individual Files
## Generate separate files: doc-001.json, doc-002.json, etc.
DataFlood generate model.json \
--count 10 \
--separate \
--output doc.json
Entropy Control
The --entropy parameter controls string generation randomness:
| Entropy Range | Generation Method | Use Case |
|---|---|---|
| 0.0 - 2.0 | Sample from existing values | Predictable, vocabulary-based |
| 2.0 - 4.0 | Pattern-based generation | Moderate variation |
| > 4.0 | Character distribution | High randomness |
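The table can be pictured as a simple dispatch on the requested entropy value. This sketch is illustrative of the three tiers, not DataFlood's internal switching logic:

```python
import random

def generate_string(vocab, entropy, rng):
    """Pick a generation strategy based on the entropy setting (hypothetical)."""
    if entropy <= 2.0:
        # low entropy: sample existing values verbatim
        return rng.choice(vocab)
    if entropy <= 4.0:
        # mid entropy: pattern-based, recombine pieces of existing values
        a, b = rng.choice(vocab), rng.choice(vocab)
        return a[: len(a) // 2] + b[len(b) // 2 :]
    # high entropy: draw characters from the observed alphabet
    alphabet = sorted(set("".join(vocab)))
    length = len(rng.choice(vocab))
    return "".join(rng.choice(alphabet) for _ in range(length))

rng = random.Random(42)
vocab = ["Electronics", "Clothing", "Garden"]
print(generate_string(vocab, 1.0, rng) in vocab)  # True at low entropy
```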
Examples:
## Low entropy - uses existing vocabulary
DataFlood generate model.json --entropy 1.0 --count 10
## High entropy - more random generation
DataFlood generate model.json --entropy 5.0 --count 10
Output Formats
JSON Format
Default format with nested structure support:
[
{
"id": "generated-value",
"name": "generated-name",
"details": { ... }
}
]
CSV Format
Flattened structure with dot notation for nested fields:
id,name,details.field1,details.field2
value1,name1,field1_value,field2_value
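Dot-notation flattening of nested objects can be sketched like this. The sketch shows the general technique, under the assumption that DataFlood's CSV writer behaves similarly; the function name is hypothetical.

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts into dot-notation keys for CSV columns."""
    row = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # recurse, extending the prefix with a dot
            row.update(flatten(value, name + "."))
        else:
            row[name] = value
    return row

doc = {"id": "value1", "name": "name1",
       "details": {"field1": "field1_value", "field2": "field2_value"}}
print(flatten(doc))  # keys: id, name, details.field1, details.field2
```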
Metadata Options
Without --metadata (default):
- Clean data files with just generated documents
With --metadata:
- Includes generation timestamp
- Seed value for reproducibility
- Model file reference
- Document count and numbering
Next Steps
- Learn about Core Concepts - models and histograms
- Explore the CLI Reference for all commands
- See Examples for real-world scenarios
- Configure Time-Based Tides for complex workflows
Troubleshooting
Common Issues
"Directory does not exist" error:
- Verify the path to your sample data folder
- Use absolute paths if relative paths aren't working
Generated data doesn't match expectations:
- Check your sample data has enough variety
- Adjust entropy settings for different randomness levels
- Use more sample files for better pattern detection
Out of memory with large generation counts:
- Reduce the count parameter
- Generate in batches using different seeds
- Use --separate to write individual files
For more help, see the Troubleshooting Guide.