Scaling

There are two ways to scale Datagen:

  • Horizontally: raise concurrency by running more Datagen instances
  • Vertically: raise speed and parallelism by giving each Datagen instance more resources

Horizontally

All Datagen servers are independent, so you can scale by adding more servers. Note that adding servers does not currently provide fault tolerance.

If you run multiple servers, you can put a load balancer in front of them, configured for round-robin distribution with persistence on source IP.
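
To illustrate what round-robin with source IP persistence means, here is a minimal Python sketch of the routing decision. It is not a real load balancer and not part of Datagen. The class name and the backend host names are hypothetical, chosen only for this example.

```python
from itertools import cycle


class StickyRoundRobin:
    """Assign new client IPs to backends in round-robin order and keep them there."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._next_backend = cycle(backends)  # endless round-robin iterator
        self._assigned = {}                   # source IP -> backend

    def route(self, source_ip):
        # A known IP keeps its backend (persistence); a new IP takes the next one.
        if source_ip not in self._assigned:
            self._assigned[source_ip] = next(self._next_backend)
        return self._assigned[source_ip]


# Hypothetical backend host names, for illustration only.
balancer = StickyRoundRobin(["datagen-1.example.com", "datagen-2.example.com"])
first = balancer.route("10.0.0.5")   # new IP: first backend
second = balancer.route("10.0.0.6")  # new IP: second backend
again = balancer.route("10.0.0.5")   # known IP: same backend as before
```

In practice this behaviour would come from the load balancer's own configuration; the sketch only shows why a given client keeps reaching the same Datagen server.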

A good practice is to dedicate Datagen instances to a specific usage or to a group of users. For example, dedicate one instance to generating HDFS data only, or to one group of users, and then scale with the number of usages or user groups.

Vertically

The more memory and CPUs are available, the more data can be generated and the faster it is generated.

You can usually run as many batches as you want because they run sequentially. Raising the number of batches is therefore the way to raise the total number of rows generated.

The limiting factor is the size of each batch, as all of a batch's data must fit into memory before being written or sent. To speed up the generation of each batch, raise the number of threads so that data is generated in parallel.

Hence, before scaling vertically, it is important to understand the maximum number of rows that can be generated in one batch and how far generation can be parallelized.
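
The relationship between total rows, batch size, and threads can be sketched as simple arithmetic. The Python below is only an illustration: the function names are hypothetical, and the even split of a batch across threads is an assumption made for the example, not a description of how Datagen distributes work internally.

```python
def plan_batches(total_rows, rows_per_batch):
    """Return how many sequential batches are needed to reach total_rows."""
    if total_rows <= 0 or rows_per_batch <= 0:
        raise ValueError("row counts must be positive")
    # Integer ceiling division: a partial last batch still counts as a batch.
    return (total_rows + rows_per_batch - 1) // rows_per_batch


def rows_per_thread(rows_per_batch, threads):
    """Split one batch across threads as evenly as possible (assumed policy)."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    base, extra = divmod(rows_per_batch, threads)
    # The first `extra` threads take one additional row each.
    return [base + 1 if i < extra else base for i in range(threads)]


# 10M rows in batches of 1M rows, each batch generated by 24 threads.
batches = plan_batches(10_000_000, 1_000_000)
split = rows_per_thread(1_000_000, 24)
```

The batch size is bounded by memory, so it is the number of batches, not the batch size, that grows with the total number of rows.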

To help you choose the size of your instance(s), a benchmark is given below.

Note: Tests were conducted on EC2 instances, and data was generated locally to Parquet and CSV files.

The model used has 30 columns with a mix of string, integer, timestamp, long, and byte array types, along with computed values, so it is a reasonably complex model to generate: https://datagen-repo.s3.eu-west-3.amazonaws.com/1.0.0/models/use-cases/stores/customer.json

Machine Type  Machine CPU  Machine Memory  -Xmx  Number of Rows  Number of Rows per Batch  Batches  Threads  Data Format  Time taken
t2.medium     2vCPU        4GB             4GB   100K            10K                       10       10       Parquet      13s 859ms
t2.medium     2vCPU        4GB             4GB   100K            10K                       10       10       CSV          2s 795ms
t3.large      2vCPU        8GB             8GB   1M              100K                      10       10       Parquet      1m 2s 548ms
t3.large      2vCPU        8GB             8GB   1M              100K                      10       10       CSV          48s 424ms
t3.xlarge     4vCPU        16GB            12GB  1M              100K                      10       10       Parquet      1m 444ms
t3.xlarge     4vCPU        16GB            12GB  1M              100K                      10       10       CSV          48s 392ms
t3.xlarge     4vCPU        16GB            12GB  10M             100K                      100      10       Parquet      10m 57s 408ms
c5a.2xlarge   8vCPU        16GB            12GB  1M              100K                      10       10       Parquet      28s 852ms
c5a.2xlarge   8vCPU        16GB            12GB  1M              100K                      10       10       CSV          43s 399ms
c5a.2xlarge   8vCPU        16GB            12GB  10M             100K                      100      10       Parquet      4m 19s 752ms
c5a.2xlarge   8vCPU        16GB            12GB  10M             100K                      100      10       CSV          3m 40s 229ms
t3.2xlarge    8vCPU        32GB            24GB  10M             1M                        10       24       Parquet      7m 26s 722ms
t3.2xlarge    8vCPU        32GB            24GB  10M             1M                        10       24       CSV          6m 43s 315ms
c5.4xlarge    16vCPU       32GB            24GB  10M             1M                        10       24       Parquet      3m 23s 81ms
c5.4xlarge    16vCPU       32GB            24GB  10M             1M                        10       24       CSV          2m 42s 616ms
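
To compare benchmark rows, it can help to convert each result into rows per second. The Python sketch below parses the "Time taken" format used in the benchmark and derives a throughput figure. The function names are hypothetical and the code is only a convenience for reading the table.

```python
import re

# Seconds per unit for the duration format used in the benchmark.
UNIT_SECONDS = {"m": 60.0, "s": 1.0, "ms": 0.001}


def parse_duration(text):
    """Convert a duration such as '1m 2s 548ms' to seconds."""
    # 'ms' is listed first so it is not misread as minutes followed by 's'.
    parts = re.findall(r"(\d+)(ms|m|s)", text)
    if not parts:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(int(value) * UNIT_SECONDS[unit] for value, unit in parts)


def rows_per_second(rows, duration):
    """Average generation throughput for one benchmark row."""
    return rows / parse_duration(duration)


# Two Parquet results taken from the benchmark.
small = rows_per_second(100_000, "13s 859ms")          # t2.medium
large = rows_per_second(10_000_000, "3m 23s 81ms")     # c5.4xlarge
```

With these two inputs, the t2.medium run works out to roughly 7,200 rows per second and the c5.4xlarge run to roughly 49,000 rows per second.
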

Number of Rows to generate  CPU    Memory
0-10K                       0.5-1  2GB
10K-100K                    1-2    4GB
100K-1M                     2      8GB
1M-10M                      4-8    16GB
10M-100M                    8      32GB
100M+                       8-16   32GB
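
The sizing table above can also be expressed as a small lookup, for example in a provisioning script. The Python sketch below is hypothetical: the function name is invented, and because the table's ranges share their boundaries (10K appears in two rows), the sketch assumes a boundary value falls into the smaller tier.

```python
# (upper bound on rows, CPU range, memory), copied from the sizing table.
SIZING_TIERS = [
    (10_000, "0.5-1", "2GB"),
    (100_000, "1-2", "4GB"),
    (1_000_000, "2", "8GB"),
    (10_000_000, "4-8", "16GB"),
    (100_000_000, "8", "32GB"),
]
# Tier for anything above the last upper bound (the "100M+" row).
LARGEST_TIER = ("8-16", "32GB")


def suggest_resources(rows):
    """Return the (CPU, memory) pair the table lists for a row count."""
    if rows < 0:
        raise ValueError("rows must not be negative")
    for upper_bound, cpu, memory in SIZING_TIERS:
        # Assumption: a value on a shared boundary belongs to the smaller tier.
        if rows <= upper_bound:
            return cpu, memory
    return LARGEST_TIER


small_job = suggest_resources(5_000)
medium_job = suggest_resources(5_000_000)
huge_job = suggest_resources(500_000_000)
```

Here small_job is ("0.5-1", "2GB"), medium_job is ("4-8", "16GB"), and huge_job is ("8-16", "32GB").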