Scaling

There are two ways to scale Datagen:

Horizontally: increase concurrency by running more Datagen instances
Vertically: increase speed and parallelism by giving each Datagen instance more resources

Horizontally

All Datagen servers are independent, so you can scale by adding more servers. Note that, as of now, additional servers do not provide fault tolerance.

If multiple servers are deployed, a load balancer can be placed in front of them, using round-robin distribution with persistence on source IP.

A good practice is to dedicate Datagen instances to either a usage or a group of users (for example, one instance that only generates HDFS data, or one instance per group of users), and then scale with the number of usages or user groups.
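The balancing policy described above can be sketched in a few lines of Python. This is only an illustration of round-robin with source-IP persistence, not a replacement for a real load balancer; the class name `StickyRoundRobin` and the server hostnames are hypothetical.

```python
from itertools import cycle


class StickyRoundRobin:
    """Give each new source IP the next server in turn, then keep it there."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._next_server = cycle(servers)
        self._assigned = {}  # source IP -> server

    def pick(self, source_ip):
        # A client that was already seen always returns to the same server.
        if source_ip not in self._assigned:
            self._assigned[source_ip] = next(self._next_server)
        return self._assigned[source_ip]


# Hypothetical hostnames, for illustration only.
balancer = StickyRoundRobin(["datagen-1.example.internal", "datagen-2.example.internal"])
first = balancer.pick("10.0.0.5")
second = balancer.pick("10.0.0.6")
again = balancer.pick("10.0.0.5")  # same server as `first`
```

Persistence matters here because the servers are independent: a client that is moved to another server would not find anything it started on the first one.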

Vertically

The more memory and CPUs are available, the more data can be generated, and the faster.

You can usually run as many batches as you want, because batches run sequentially. Raising the number of batches is therefore the way to increase the total number of rows generated.

The limiting factor is the size of each batch, as all of a batch's data must fit into memory before being written or sent. To speed up the generation of each batch, the number of threads can be raised so that data is generated in parallel.

Hence, before scaling vertically, it is important to understand the maximum number of rows that can be generated in one batch and how far generation can be parallelized.
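As a rough illustration of this reasoning, the sketch below splits a target row count into sequential batches, given the largest batch that fits in memory. The function `plan_batches` is a hypothetical helper, not part of Datagen, and the maximum batch size is something you would have to measure on your own instance.

```python
def plan_batches(total_rows, max_rows_per_batch):
    """Return (batches, rows_per_batch) so that no batch exceeds the maximum."""
    if total_rows <= 0 or max_rows_per_batch <= 0:
        raise ValueError("both arguments must be positive")
    # Integer ceiling division: the smallest number of batches that is enough.
    batches = (total_rows + max_rows_per_batch - 1) // max_rows_per_batch
    # Spread rows evenly; batches * rows_per_batch may slightly exceed total_rows.
    rows_per_batch = (total_rows + batches - 1) // batches
    return batches, rows_per_batch


# 10M rows with at most 1M rows in memory at once: 10 batches of 1M rows.
batches, rows_per_batch = plan_batches(10_000_000, 1_000_000)
```

The thread count is a separate setting: it changes how fast each batch is generated, not how many rows a batch can hold.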

To help you choose the size of your instance(s), a benchmark is provided below.

Note: Tests were conducted on EC2 instances, and data was generated locally to Parquet and CSV files.

The model used has 30 columns with a mix of string, integer, timestamp, long, and byte array types, including computed values (making it complex enough to be a representative test): https://datagen-repo.s3.eu-west-3.amazonaws.com/1.0.0/models/use-cases/stores/customer.json

Machine Type	Machine CPU	Machine Memory	-Xmx	Number of Rows	Number of Rows per Batch	Batches	Threads	Data Format	Time taken
t2.medium	2vCPU	4GB	4GB	100K	10K	10	10	Parquet	13s 859ms
t2.medium	2vCPU	4GB	4GB	100K	10K	10	10	CSV	2s 795ms

t3.large	2vCPU	8GB	8GB	1M	100K	10	10	Parquet	1m 2s 548ms
t3.large	2vCPU	8GB	8GB	1M	100K	10	10	CSV	48s 424ms

t3.xlarge	4vCPU	16GB	12GB	1M	100K	10	10	Parquet	1m 444ms
t3.xlarge	4vCPU	16GB	12GB	1M	100K	10	10	CSV	48s 392ms
t3.xlarge	4vCPU	16GB	12GB	10M	100K	100	10	Parquet	10m 57s 408ms

c5a.2xlarge	8vCPU	16GB	12GB	1M	100K	10	10	Parquet	28s 852ms
c5a.2xlarge	8vCPU	16GB	12GB	1M	100K	10	10	CSV	43s 399ms
c5a.2xlarge	8vCPU	16GB	12GB	10M	100K	100	10	Parquet	4m 19s 752ms
c5a.2xlarge	8vCPU	16GB	12GB	10M	100K	100	10	CSV	3m 40s 229ms

t3.2xlarge	8vCPU	32GB	24GB	10M	1M	10	24	Parquet	7m 26s 722ms
t3.2xlarge	8vCPU	32GB	24GB	10M	1M	10	24	CSV	6m 43s 315ms

c5.4xlarge	16vCPU	32GB	24GB	10M	1M	10	24	Parquet	3m 23s 81ms
c5.4xlarge	16vCPU	32GB	24GB	10M	1M	10	24	CSV	2m 42s 616ms
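To compare rows of the benchmark more easily, the reported times can be converted to a throughput in rows per second. The sketch below assumes the duration format used in the table ("3m 23s 81ms"); `parse_duration` and `rows_per_second` are hypothetical helpers written for this example.

```python
import re

# Milliseconds per unit, for the units that appear in the table.
_UNIT_MS = {"m": 60_000, "s": 1_000, "ms": 1}


def parse_duration(text):
    """Convert a duration such as '1m 2s 548ms' to seconds."""
    total_ms = 0
    # "ms" is listed first so that it is not read as minutes.
    for value, unit in re.findall(r"(\d+)(ms|m|s)\b", text):
        total_ms += int(value) * _UNIT_MS[unit]
    return total_ms / 1000


def rows_per_second(rows, duration_text):
    """Average generation throughput for one benchmark run."""
    return rows / parse_duration(duration_text)


# c5.4xlarge vs t3.2xlarge, 10M rows to Parquet, same batch and thread settings.
c5_rate = rows_per_second(10_000_000, "3m 23s 81ms")
t3_rate = rows_per_second(10_000_000, "7m 26s 722ms")
```

With these two rows, the 16 vCPU c5.4xlarge comes out at roughly twice the throughput of the 8 vCPU t3.2xlarge for the same settings.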

Recommended Machine Sizes

Number of Rows to generate	CPU	Memory
0-10K	0.5-1	2GB
10K-100K	1-2	4GB
100K-1M	2	8GB
1M-10M	4-8	16GB
10M-100M	8	32GB
100M+	8-16	32GB
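If you want to apply the table above in a script, it can be written as a simple lookup. This is a minimal sketch: the values are copied from the table, while the function name `recommend_size` and the choice to treat each upper bound as inclusive (the table lists 10K, 100K, and so on in two rows) are assumptions.

```python
# (upper bound on rows, CPUs, memory), copied from the table above.
# Assumption: each upper bound is treated as inclusive.
RECOMMENDED_SIZES = [
    (10_000, "0.5-1", "2GB"),
    (100_000, "1-2", "4GB"),
    (1_000_000, "2", "8GB"),
    (10_000_000, "4-8", "16GB"),
    (100_000_000, "8", "32GB"),
]
LARGEST_SIZE = ("8-16", "32GB")  # the 100M+ row of the table


def recommend_size(rows):
    """Return the (CPUs, memory) recommendation for a number of rows."""
    if rows < 0:
        raise ValueError("rows must not be negative")
    for upper_bound, cpus, memory in RECOMMENDED_SIZES:
        if rows <= upper_bound:
            return cpus, memory
    return LARGEST_SIZE


medium = recommend_size(500_000)
huge = recommend_size(250_000_000)
```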