Example with CDP
This section provides basic examples of Data Generation within Cloudera Data Platform (CDP).
Basic Generation
HDFS
In Cloudera Manager:
Datagen > Actions > Generate 1 Million Customers to HDFS
It launches a Cloudera Manager command making different API calls to Datagen Web server to generate data representing customers from different countries.
Output should be:
Let’s Verify
In a shell with a logged in user (optionally use datagen ones):
hdfs dfs -ls /user/datagen/hdfs/customer/
Found 90 items
-rw-r--r-- 3 datagen datagen 256024 2022-10-13 09:06 /user/datagen/hdfs/customer/customer-cn-0000000000.parquet
-rw-r--r-- 3 datagen datagen 255393 2022-10-13 09:06 /user/datagen/hdfs/customer/customer-cn-0000000001.parquet
-rw-r--r-- 3 datagen datagen 255618 2022-10-13 09:06 /user/datagen/hdfs/customer/customer-cn-0000000002.parquet
Hive
In Cloudera Manager:
Datagen > Actions > Generate 10 Million Sensors Data to Hive
It launches a Cloudera Manager command making different API calls to Datagen Web server to generate data representing sensors data.
Output should be:
Let’s Verify
In a shell with a logged in user (optionally use datagen ones):
0: jdbc:hive2://ccycloud-2.lisbon.root.hwx.si> show databases;
...
INFO : OK
+---------------------+
| database_name |
+---------------------+
| datagen_industry |
| default |
| information_schema |
| sys |
+---------------------+
0: jdbc:hive2://ccycloud-2.lisbon.root.hwx.si> use datagen_industry;
...
0: jdbc:hive2://ccycloud-2.lisbon.root.hwx.si> show tables;
...
INFO : OK
+------------------+
| tab_name |
+------------------+
| plant |
| plant_tmp |
| sensor |
| sensor_data |
| sensor_data_tmp |
| sensor_tmp |
+------------------+
6 rows selected (0.059 seconds)
0: jdbc:hive2://ccycloud-2.lisbon.root.hwx.si> select * from plant limit 2;
...
INFO : OK
+-----------------+--------------------+------------+-------------+----------------+
| plant.plant_id | plant.city | plant.lat | plant.long | plant.country |
+-----------------+--------------------+------------+-------------+----------------+
| 1 | Chotebor | 49,7208 | 15,6702 | Czechia |
| 2 | Tecpan de Galeana | 17,25 | -100,6833 | Mexico |
+-----------------+--------------------+------------+-------------+----------------+
2 rows selected (0.361 seconds)
0: jdbc:hive2://ccycloud-2.lisbon.root.hwx.si> select * from sensor limit 2;
...
INFO : OK
+-------------------+---------------------+------------------+
| sensor.sensor_id | sensor.sensor_type | sensor.plant_id |
+-------------------+---------------------+------------------+
| 70001 | motion | 186 |
| 70002 | temperature | 535 |
+-------------------+---------------------+------------------+
2 rows selected (0.173 seconds)
0: jdbc:hive2://ccycloud-2.lisbon.root.hwx.si> select * from sensor_data limit 2;
...
INFO : OK
+------------------------+--------------------------------------+----------------------+
| sensor_data.sensor_id | sensor_data.timestamp_of_production | sensor_data.value |
+------------------------+--------------------------------------+----------------------+
| 88411 | 1665678228258 | 1895793134684555135 |
| 52084 | 1665678228259 | -621460457255314082 |
+------------------------+--------------------------------------+----------------------+
2 rows selected (0.189 seconds)
Ozone
In Cloudera Manager:
Datagen > Actions > Generate 1 Million Customers to Ozone
It launches a Cloudera Manager command making different API calls to Datagen Web server to generate data representing customers from different countries.
Output should be:
Let’s Verify
In a shell with a logged in user (optionally use datagen ones):
ozone sh key list datagen/customer
{
"volumeName" : "datagen",
"bucketName" : "customer",
"name" : "customer-cn-0000000000.parquet",
"dataSize" : 255631,
"creationTime" : "2022-10-13T16:10:02.286Z",
"modificationTime" : "2022-10-13T16:10:07.866Z",
"replicationType" : "RATIS",
"replicationFactor" : 3
}
{
"volumeName" : "datagen",
"bucketName" : "customer",
"name" : "customer-cn-0000000001.parquet",
"dataSize" : 255633,
"creationTime" : "2022-10-13T16:10:08.187Z",
"modificationTime" : "2022-10-13T16:10:08.314Z",
"replicationType" : "RATIS",
"replicationFactor" : 3
}
HBase
In Cloudera Manager:
Datagen > Actions > Generate 1 Million Transaction to HBase
It launches a Cloudera Manager command making different API calls to Datagen Web server to generate data representing transactions.
Output should be:
Let’s Verify
In a shell with a logged in user (optionally use datagen ones):
hbase:001:0> list
TABLE
datagenfinance:transaction
1 row(s)
Took 0.9031 seconds
=> ["datagenfinance:transaction"]
hbase:002:0> count 'datagenfinance:transaction'
Current count: 1000, row: 10223641061665677647491
Current count: 2000, row: 10450220651665677774524
Current count: 3000, row: 10680209721665677628857
Current count: 4000, row: 10909219011665677828439
Current count: 5000, row: 1114021121665677841475
Current count: 6000, row: 11370585341665677806053
Detailed Generation
This section explains what kind of data each data generation button creates.
HDFS & Ozone
HDFS & Ozone buttons created 1 million customers from different countries (using the different customer models under /opt/cloudera/parcels/DATAGEN/models/customer/) and pushed them in Parquet file.
Sample of data in JSON format:
{ "name" : "Loris", "id" : "790001", "birthdate" : "1987-01-11", "city" : "Stevensville", "country" : "USA", "email" : "Loris@company.us", "phone_number" : "+1 7225688066", "membership" : "SILVER" }
{ "name" : "Marcell", "id" : "490001", "birthdate" : "1950-06-22", "city" : "Pontecorvo", "country" : "Italy", "email" : "Marcell@company.it", "phone_number" : "+39 995887416", "membership" : "BRONZE" }
{ "name" : "Ryong", "id" : "520001", "birthdate" : "1941-02-05", "city" : "Yachiyo", "country" : "Japan", "email" : "Ryong@company.jp", "phone_number" : "+81 809127101", "membership" : "PLATINUM" }
HBase
HBase button created a 1 million transactions (using the transaction model under /opt/cloudera/parcels/DATAGEN/models/finance/transaction-model.json).
Sample of data in JSON format:
{ "sender_id" : "50902", "receiver_id" : "10391", "amount" : "0.8084345", "execution_date" : "1665728236778", "currency" : "EUR" }
{ "sender_id" : "21403", "receiver_id" : "68104", "amount" : "0.65117764", "execution_date" : "1665728285129", "currency" : "USD" }
Hive
Hive button created a 1 million sensors data (using different models under /opt/cloudera/parcels/DATAGEN/models/industry/).
It will generate 100 plants data like this:
{ "plant_id" : "1", "city" : "Bollene", "lat" : "44,2803", "long" : "4,7489", "country" : "France" }
It will generate 100 000 sensors like this (each can be linked to a plant):
{ "sensor_id" : "1", "sensor_type" : "humidity", "plant_id" : "690" }
It will generate 1 000 000 sensors data like this (each can be linked to a sensor):
{ "sensor_id" : "58764", "timestamp_of_production" : "1665728724586", "value" : "-3000244563995128335" }
Local files
In Cloudera Manager:
Datagen > Actions > Generate Local data as CSV, JSON, AVRO, ORC, PARQUET
It launches a Cloudera Manager command making different API calls to Datagen Web server to generate multiple data using almost all possible models.
Output should be:
Let’s Verify
In a shell with a logged in user (optionally use datagen ones):
cat /home/datagen/customer/customer-fr-0000000000.json
{ "name" : "Josse", "id" : "120001", "birthdate" : "2001-08-03", "city" : "Meylan", "country" : "France", "email" : "Josse@company.fr", "phone_number" : "+33 444585074", "membership" : "BRONZE" }
{ "name" : "Piet", "id" : "120002", "birthdate" : "1970-06-17", "city" : "Bures-sur-Yvette", "country" : "France", "email" : "Piet@company.fr", "phone_number" : "+33 851063627", "membership" : "BRONZE" }
{ "name" : "Armand", "id" : "120003", "birthdate" : "1990-10-04", "city" : "Notre-Dame-de-Gravenchon", "country" : "France", "email" : "Armand@company.fr", "phone_number" : "+33 575158362", "membership" : "BRONZE" }
{ "name" : "Marvin", "id" : "120004", "birthdate" : "1960-10-04", "city" : "Saint-Pryve-Saint-Mesmin", "country" : "France", "email" : "Marvin@company.fr", "phone_number" : "+33 588241506", "membership" : "BRONZE" }
{ "name" : "Vivian", "id" : "120005", "birthdate" : "1994-04-28", "city" : "La Cadiere-d'Azur", "country" : "France", "email" : "Vivian@company.fr", "phone_number" : "+33 553370858", "membership" : "BRONZE" }
{ "name" : "Jakob", "id" : "120006", "birthdate" : "1976-08-02", "city" : "Chaville", "country" : "France", "email" : "Jakob@company.fr", "phone_number" : "+33 208782811", "membership" : "BRONZE" }
{ "name" : "Bo", "id" : "120007", "birthdate" : "1966-10-14", "city" : "Brignoles", "country" : "France", "email" : "Bo@company.fr", "phone_number" : "+33 068739422", "membership" : "PLATINUM" }
{ "name" : "Emilienne", "id" : "120008", "birthdate" : "1976-02-23", "city" : "Orange", "country" : "France", "email" : "Emilienne@company.fr", "phone_number" : "+33 303877991", "membership" : "BRONZE" }
{ "name" : "Elise", "id" : "120009", "birthdate" : "1965-11-28", "city" : "Cosne sur Loire", "country" : "France", "email" : "Elise@company.fr", "phone_number" : "+33 540812701", "membership" : "SILVER" }
{ "name" : "Roelof", "id" : "120010", "birthdate" : "1982-06-01", "city" : "Magny-en-Vexin", "country" : "France", "email" : "Roelof@company.fr", "phone_number" : "+33 252194443", "membership" : "BRONZE" }
cat /home/datagen/finance/transaction/transaction-0000000000.csv
sender_id,receiver_id,amount,execution_date,currency
"11292","27627","0.7721951","1665729006111","USD"
"49294","95851","0.4893235","1665729006111","EUR"
"68670","8844","0.009439588","1665729006111","USD"
"61487","46071","0.22023022","1665729006111","EUR"
"14383","57358","0.07566887","1665729006111","YEN"
"89570","96238","0.35353237","1665729006111","USD"
"66066","69065","0.87496656","1665729006111","USD"
"43894","87454","0.11435127","1665729006111","USD"
"76777","19367","0.06878656","1665729006111","EUR"
"53649","14975","0.9570634","1665729006111","EUR"
ls -R /home/datagen/industry/
/home/datagen/industry/:
plant sensor sensor_data
/home/datagen/industry/plant:
plant-0000000000.avro
/home/datagen/industry/sensor:
sensor-0000000000.parquet
/home/datagen/industry/sensor_data:
sensor_data-0000000000.orc
SolR
In Cloudera Manager:
Datagen > Actions > Generate 1 Million Weather Data to SolR
It launches a Cloudera Manager command making different API calls to Datagen Web server to generate multiple data using almost all possible models.
Output should be:
It will generate 1 million weather data like this (using the weather model under /opt/cloudera/parcels/DATAGEN/models/public_service/weather-model.json)
{ "city" : "Seysses", "date" : "2021-03-25", "lat" : "43,4981", "long" : "1,3125", "wind_provenance_9_am" : "NORTH", "wind_force_9_am" : "3", "wind_provenance_9_pm" : "WEST", "wind_force_9_pm" : "12", "pressure_9_am" : "1004", "pressure_9_pm" : "1008", "humidity_9_am" : "46", "humidity_9_pm" : "52", "temperature_9_am" : "22", "temperature_9_pm" : "-8", "rain" : "false" }
Let’s Verify
Access SolR UI, (login as a user with enough rights):
Kudu
In Cloudera Manager:
Datagen > Actions > Generate 1 Million Public Service Data to Kudu
It launches a Cloudera Manager command making different API calls to Datagen Web server to generate multiple data using almost all possible models.
Output should be:
It will generate 1 million weather data like this (using the weather model under /opt/cloudera/parcels/DATAGEN/models/public_service/incident-model.json )
{ "city" : "Le Rove", "lat" : "43,3692", "long" : "5,2503", "reporting_timestamp" : "1665732947892", "emergency" : "URGENT", "type" : "WATER" }
Let’s Verify
Go to Hue or an Impala shell and make an INVALIDATE METADATA command to refresh the cache, then you will be able to see in database: datagen a new table publicservice_incident :
Kafka
Datagen > Actions > Generate 1 million weather data to Kafka in JSON OR Public Service Data to Kafka in Avro
It launches a Cloudera Manager command making different API calls to Datagen Web server to generate multiple data using almost all possible models.
Output should be:
It will generate 1 million weather data like this (using the weather model under /opt/cloudera/parcels/DATAGEN/models/public_service/weather-model.json)
{ "city" : "Seysses", "date" : "2021-03-25", "lat" : "43,4981", "long" : "1,3125", "wind_provenance_9_am" : "NORTH", "wind_force_9_am" : "3", "wind_provenance_9_pm" : "WEST", "wind_force_9_pm" : "12", "pressure_9_am" : "1004", "pressure_9_pm" : "1008", "humidity_9_am" : "46", "humidity_9_pm" : "52", "temperature_9_am" : "22", "temperature_9_pm" : "-8", "rain" : "false" }
Let’s Verify
You can make a kafka-console-consumer with enough rights and consume the topic from the beginning to verify production of messages.
But we will instead login to Streams Messaging Manager with a user’s with enough rights and see data:
If you picked th data generation with AVRO format, in Streams Messaging Manager:
If you picked th data generation with AVRO format, you can go to Schema Registry URL (login with a user’s with enough rights) and see the newly added schema:
Finally, if you have SQL Stream Builder installed in your cluster, make sure that user’s ssb & flink have access rights to generated topic, logged to the web console, upload your keytab if necessary and create the table on kafka topic (in JSON):
Then do a sample query to visualize data:
TROUBLESHOOT
In case of any error, please check the logs through Cloudera Manager or directly on the machine, they are located at /var/log/datagen/ .