What this tool generates
Synthetic datasets with the columns you choose, ready to import:
- id: sequential integer.
- firstName, lastName: realistic combinations.
- email: derived from the name, on the `example.com` domain.
- phone: E.164 format.
- city, country: plausible cities and countries.
- age: integer between 18 and 75.
- signupDate: ISO 8601, within the last 3 years.
- isActive: boolean (~80% true).
- balance: decimal with two decimals, between -500 and 5000.
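The column spec above can be sketched as a small generator. This is a minimal illustration using only the standard library; the name and city pools are made-up placeholders, not what the tool actually uses.

```python
import datetime
import random

# Hypothetical sample pools for illustration only.
FIRST_NAMES = ["Alice", "Bob", "Carol", "David"]
LAST_NAMES = ["Smith", "Jones", "Lee", "Garcia"]
CITIES = [("Madrid", "Spain"), ("Lyon", "France"), ("Austin", "USA")]

def make_row(i, rng):
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    city, country = rng.choice(CITIES)
    # signupDate: ISO 8601, within the last 3 years
    signup = datetime.date.today() - datetime.timedelta(days=rng.randint(0, 3 * 365))
    return {
        "id": i,  # sequential integer
        "firstName": first,
        "lastName": last,
        "email": f"{first.lower()}.{last.lower()}@example.com",  # derived from name
        "city": city,
        "country": country,
        "age": rng.randint(18, 75),            # integer between 18 and 75
        "signupDate": signup.isoformat(),
        "isActive": rng.random() < 0.8,        # ~80% true
        "balance": round(rng.uniform(-500, 5000), 2),  # two decimals, -500..5000
    }

rng = random.Random(42)  # fixed seed for repeatable output
rows = [make_row(i, rng) for i in range(1, 11)]
```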
Output formats
- CSV. Compatible with Excel, Google Sheets and almost any database. UTF-8, comma separator, double quotes per RFC 4180.
- JSON. Array of objects. Use it for JS/Python test fixtures, seeds, or consumption from scripts.
- JSON Lines. One JSON object per line. Efficient for stream processing (jq, Spark, BigQuery import).
- SQL INSERT. Statements ready to paste into a SQL client. Generates a commented CREATE TABLE so you can adjust the schema.
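The CSV and JSON Lines formats above are easy to produce or round-trip yourself; here is a minimal sketch with Python's standard `csv` and `json` modules (the sample rows are invented for the example):

```python
import csv
import io
import json

rows = [
    {"id": 1, "firstName": "Ana", "lastName": "Ruiz", "isActive": True},
    {"id": 2, "firstName": "Bob", "lastName": 'O"Neil', "isActive": False},
]

# CSV: UTF-8, comma separator, double quotes only where needed (RFC 4180).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]), quoting=csv.QUOTE_MINIMAL)
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# JSON Lines: one JSON object per line, convenient for jq/Spark/BigQuery.
jsonl_text = "\n".join(json.dumps(r) for r in rows)
```

Note how the embedded quote in `O"Neil` is escaped by doubling it inside a quoted field, which is exactly the RFC 4180 behavior the tool targets.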
How to import into your database
Each engine has its preferred CSV-load command:
- PostgreSQL: `COPY users FROM '/path/users.csv' DELIMITER ',' CSV HEADER;`
- MySQL: `LOAD DATA INFILE '/path/users.csv' INTO TABLE users FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 ROWS;`
- SQLite: `.mode csv` followed by `.import users.csv users`.
- BigQuery: web console or `bq load --source_format=CSV ...`.
- MongoDB: `mongoimport --type=json --file=users.json --collection=users`.
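If you are importing from a script rather than a shell (SQLite's `.import` only exists in the CLI), the same load can be done with Python's built-in `sqlite3` module. A minimal sketch with an invented three-column schema:

```python
import csv
import io
import sqlite3

# Stand-in for the generated users.csv file.
csv_text = "id,firstName,age\n1,Ana,30\n2,Bob,41\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, firstName TEXT, age INTEGER)")

# DictReader yields one dict per CSV row (all values as strings);
# SQLite's column affinity coerces numeric-looking text into INTEGER columns.
reader = csv.DictReader(io.StringIO(csv_text))
conn.executemany(
    "INSERT INTO users (id, firstName, age) VALUES (:id, :firstName, :age)",
    reader,
)
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```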
Recommended dataset sizes
The right size depends on what you're testing:
- 10-50 rows: functional tests, visual table validation.
- 100-500 rows: pagination, search and sorting tests.
- 1,000-10,000 rows: basic query performance tests.
- 10,000+ rows: load tests, indexes, query plans. For those, a Faker-based script is better.
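For the 10,000+ row tier, the key is to stream rows to disk instead of building them all in memory. A sketch of that pattern, using the stdlib `random` module in place of Faker so it runs with no dependencies (swap in Faker for realistic names and locales):

```python
import csv
import io
import random

def gen_rows(n, seed=0):
    """Yield rows one at a time so large datasets never sit in memory."""
    rng = random.Random(seed)
    for i in range(1, n + 1):
        yield {
            "id": i,
            "age": rng.randint(18, 75),
            "balance": round(rng.uniform(-500, 5000), 2),
        }

out = io.StringIO()  # in real use: open("users.csv", "w", newline="")
writer = csv.DictWriter(out, fieldnames=["id", "age", "balance"])
writer.writeheader()
writer.writerows(gen_rows(10_000))  # writerows consumes the generator lazily
line_count = len(out.getvalue().splitlines())
```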
Best practices with synthetic data
- Tag the source. A `source = 'synthetic'` column lets you filter generated data when cleaning the database.
- Reproducibility. If you need to regenerate the same dataset for CI, use a fixed seed (this tool doesn't support that; use Faker with a seed).
- Don't mix with production. Keep synthetic datasets in separate databases or tables prefixed `test_`.
- Version your fixtures. If a dataset works for a test, commit it to the repo.
- Watch out for PII. Even synthetic, some data can look personal. Document clearly that it's fake.
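The "tag the source" practice pays off at cleanup time: with a `source` column, purging synthetic rows is a single statement. A minimal sketch with invented sample rows, shown in SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, source TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    (1, "Ana", "synthetic"),
    (2, "Bob", "production"),  # stand-in for a real row
    (3, "Eve", "synthetic"),
])

# The source tag makes cleanup a one-liner instead of guesswork.
conn.execute("DELETE FROM users WHERE source = 'synthetic'")
remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```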
When to use Faker instead
Faker (in Node, Python, Ruby) is better for automated cases: you generate inside your code, with a seed for reproducibility, a specific locale, and unlimited volume.
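The reproducibility point is the core of it: seeding makes two runs produce byte-identical data, so CI fixtures never drift. A stdlib demonstration of the pattern (Faker exposes the same idea via its `Faker.seed()` class method):

```python
import random

def sample_names(seed):
    """Draw five names deterministically from a fixed seed."""
    rng = random.Random(seed)
    names = ["Ana", "Bob", "Carol", "David", "Eve"]  # illustrative pool
    return [rng.choice(names) for _ in range(5)]

run1 = sample_names(123)
run2 = sample_names(123)  # same seed -> identical sequence, run after run
```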
This generator wins when you need a quick dataset without touching code: populate a staging table fast, ship a demo, build a mockup. It's the difference between "5 minutes without writing code" and "20 minutes integrating Faker".