Scaling
There are two ways to scale Datagen:
- Horizontally: increase concurrency by raising the number of Datagen instances
- Vertically: increase speed and parallelism by giving each Datagen instance more resources
Horizontally
All Datagen servers are independent, so adding more servers scales out generation; however, as of now, it does not add fault tolerance.
If multiple servers are deployed, a load balancer can be placed in front of all of them, balancing in a round-robin fashion with persistence on the source IP, as sketched below.
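As an illustration, a minimal HAProxy sketch of such a setup could look like the following; the port, addresses, and instance names are assumptions to adapt to your own deployment:

```
frontend datagen_frontend
    bind *:8080                              # assumed Datagen port
    default_backend datagen_servers

backend datagen_servers
    balance roundrobin
    # Persistence on source IP: a given client keeps hitting the same instance
    stick-table type ip size 100k expire 30m
    stick on src
    server datagen1 10.0.0.1:8080 check
    server datagen2 10.0.0.2:8080 check
```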
A good practice is to dedicate Datagen instances to a particular usage or to a group of users (for example, one instance that only generates a particular HDFS dataset, or one instance per team), and to scale out according to the number of usages or users.
Vertically
Naturally, the more memory and CPUs available, the more data can be generated, and the faster.
Batches run sequentially, so you can run as many of them as you want: raising the number of batches is the way to scale up the total number of rows generated.
The limiting factor is the size of each batch, as all of its data must fit into memory before being written or sent. To speed up the generation of each batch, the number of threads can be raised so data is generated in parallel.
Hence, before scaling vertically, it is important to understand the maximum number of rows that can be generated in one batch and how far generation can be parallelized; a rough way to estimate that limit is sketched below.
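To get an order-of-magnitude estimate of that limit, here is a small back-of-the-envelope sketch (not part of Datagen itself; the bytes-per-row figure is an assumption you should measure against your own model, e.g. by generating a small batch and inspecting memory usage):

```python
# Rough sizing heuristic: all rows of one batch must fit in the JVM heap.
def max_rows_per_batch(heap_gb: float,
                       bytes_per_row: int = 1_000,
                       safety_factor: float = 0.5) -> int:
    """Upper bound on rows per batch, keeping half the heap as headroom
    for the writer/sender and the rest of the JVM."""
    heap_bytes = heap_gb * 1024 ** 3
    return int(heap_bytes * safety_factor // bytes_per_row)

# Example: -Xmx12g with ~1 KB per generated row
print(max_rows_per_batch(12))  # ~6.4M rows as an absolute upper bound
```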
To help you choose the size of your instance(s), below is a benchmark.
Note: tests were conducted on EC2 instances, and data was generated locally to Parquet and CSV files.
The model used has 30 columns mixing strings, integers, timestamps, longs, and byte arrays, including computed values (hence a model complex enough to be representative): https://datagen-repo.s3.eu-west-3.amazonaws.com/1.0.0/models/use-cases/stores/customer.json
| Machine Type | Machine CPU | Machine Memory | -Xmx | Number of Rows | Rows per Batch | Batches | Threads | Data Format | Time Taken |
|--------------|-------------|----------------|------|----------------|----------------|---------|---------|-------------|------------|
| t2.medium    | 2 vCPU      | 4GB            | 4GB  | 100K           | 10K            | 10      | 10      | Parquet     | 13s 859ms |
| t2.medium    | 2 vCPU      | 4GB            | 4GB  | 100K           | 10K            | 10      | 10      | CSV         | 2s 795ms |
| t3.large     | 2 vCPU      | 8GB            | 8GB  | 1M             | 100K           | 10      | 10      | Parquet     | 1m 2s 548ms |
| t3.large     | 2 vCPU      | 8GB            | 8GB  | 1M             | 100K           | 10      | 10      | CSV         | 48s 424ms |
| t3.xlarge    | 4 vCPU      | 16GB           | 12GB | 1M             | 100K           | 10      | 10      | Parquet     | 1m 444ms |
| t3.xlarge    | 4 vCPU      | 16GB           | 12GB | 1M             | 100K           | 10      | 10      | CSV         | 48s 392ms |
| t3.xlarge    | 4 vCPU      | 16GB           | 12GB | 10M            | 100K           | 100     | 10      | Parquet     | 10m 57s 408ms |
| c5a.2xlarge  | 8 vCPU      | 16GB           | 12GB | 1M             | 100K           | 10      | 10      | Parquet     | 28s 852ms |
| c5a.2xlarge  | 8 vCPU      | 16GB           | 12GB | 1M             | 100K           | 10      | 10      | CSV         | 43s 399ms |
| c5a.2xlarge  | 8 vCPU      | 16GB           | 12GB | 10M            | 100K           | 100     | 10      | Parquet     | 4m 19s 752ms |
| c5a.2xlarge  | 8 vCPU      | 16GB           | 12GB | 10M            | 100K           | 100     | 10      | CSV         | 3m 40s 229ms |
| t3.2xlarge   | 8 vCPU      | 32GB           | 24GB | 10M            | 1M             | 10      | 24      | Parquet     | 7m 26s 722ms |
| t3.2xlarge   | 8 vCPU      | 32GB           | 24GB | 10M            | 1M             | 10      | 24      | CSV         | 6m 43s 315ms |
| c5.4xlarge   | 16 vCPU     | 32GB           | 24GB | 10M            | 1M             | 10      | 24      | Parquet     | 3m 23s 81ms |
| c5.4xlarge   | 16 vCPU     | 32GB           | 24GB | 10M            | 1M             | 10      | 24      | CSV         | 2m 42s 616ms |
Recommended Machine Sizes
| Number of Rows to Generate | CPU   | Memory |
|----------------------------|-------|--------|
| 0-10K                      | 0.5-1 | 2GB |
| 10K-100K                   | 1-2   | 4GB |
| 100K-1M                    | 2     | 8GB |
| 1M-10M                     | 4-8   | 16GB |
| 10M-100M                   | 8     | 32GB |
| 100M+                      | 8-16  | 32GB |
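When setting the JVM heap itself, the benchmark above leaves part of the machine memory to the OS (e.g. an -Xmx of 12GB on 16GB machines). Below is a minimal launch sketch under that assumption; the jar name is a placeholder for your actual Datagen artifact:

```bash
# 16GB machine from the 1M-10M tier: ~12GB heap, ~4GB left for the OS.
# "datagen.jar" is hypothetical; adapt to your Datagen distribution.
java -Xmx12g -jar datagen.jar
```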