How to create a model?

Let’s create a simple model that generates data into a Hive table.

Each generated row should represent an employee, with:

  • A name
  • Their location city
  • Their birthdate
  • Their phone number
  • Years of experience in the company
  • Their employee ID (6 digits)
  • Their department (among HR, CONSULTING, FINANCE, SALES, ENGINEERING, ADMINISTRATION, MARKETING)

The company is based in Germany, and so are all of its employees.

Here is the resulting JSON model:

{
    "Fields": [
      {
        "name": "name",
        "type": "NAME",
        "filters": ["Germany"]
      },
      {
        "name": "city",
        "type": "CITY",
        "filters": ["Germany"]
      },
      {
        "name": "phone_number",
        "type": "PHONE",
        "filters": ["Germany"]
      },
      {
        "name": "years_of_experience",
        "type": "INTEGER",
        "min": 0,
        "max": 10
      },
      {
        "name": "employee_id",
        "type": "INCREMENT_INTEGER",
        "min": 123456
      },
      {
        "name": "department",
        "type": "STRING",
        "possible_values": ["HR", "CONSULTING", "FINANCE", "SALES", "ENGINEERING", "ADMINISTRATION", "MARKETING"]
      }
    ],
    "Table_Names": {
        "HIVE_HDFS_FILE_PATH": "/user/datagen/hive/employee_model/",
        "HIVE_DATABASE": "datagen_test",
        "HIVE_TABLE_NAME": "employee_model",
        "HIVE_TEMPORARY_TABLE_NAME": "employee_model_tmp"
    },
    "Options": {}
}
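Before sending a model to Datagen, it can help to catch obvious mistakes locally. The sketch below is a generic sanity check written for this walkthrough, not Datagen's own validation: the function name check_model and the rules it applies (every field has a name and a type, names are unique, min does not exceed max) are assumptions chosen for illustration.

```python
import json


def check_model(model: dict) -> list[str]:
    """Return a list of human-readable problems found in a model dict.

    Hypothetical helper: these rules are illustrative assumptions,
    not the checks Datagen itself performs.
    """
    fields = model.get("Fields")
    if not isinstance(fields, list) or not fields:
        return ["'Fields' must be a non-empty list"]

    problems = []
    seen = set()
    for i, field in enumerate(fields):
        name = field.get("name")
        if not name:
            problems.append(f"field #{i} has no 'name'")
        elif name in seen:
            problems.append(f"duplicate field name: {name}")
        else:
            seen.add(name)
        if not field.get("type"):
            problems.append(f"field #{i} has no 'type'")
        # Only compare bounds when both are present (employee_id has no max).
        if "min" in field and "max" in field and field["min"] > field["max"]:
            problems.append(f"field {name!r}: min is greater than max")
    return problems


# A two-field excerpt of the model above, kept short for illustration.
excerpt = json.loads("""
{
  "Fields": [
    {"name": "years_of_experience", "type": "INTEGER", "min": 0, "max": 10},
    {"name": "employee_id", "type": "INCREMENT_INTEGER", "min": 123456}
  ]
}
""")
print(check_model(excerpt))  # prints []
```

An empty list means none of these checks failed; it does not guarantee that Datagen will accept the model.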

Test a Model

Before launching a data generation, you can test the model through the API.

Under model-tester-controller, the /model/test endpoint takes either a path to a model file or an uploaded model, and returns one row generated with that model.

For the employee model, the output looks like this:

{ "name" : "Gerhilt", "city" : "Beelen", "phone_number" : "+49 299776078", "years_of_experience" : "2", "employee_id" : "123457", "department" : "FINANCE" }
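A test row like this one can be checked against the constraints declared in the model. The following is a minimal sketch under stated assumptions: row_matches_model is a hypothetical helper, and the bounds and department list are copied by hand from the model rather than read from it. Note that the test endpoint returns numeric values as strings, so they are converted before comparison.

```python
import json

# Allowed values copied from the "department" field of the model.
DEPARTMENTS = {
    "HR", "CONSULTING", "FINANCE", "SALES",
    "ENGINEERING", "ADMINISTRATION", "MARKETING",
}


def row_matches_model(row: dict) -> bool:
    """Check one generated row against the employee model's constraints.

    Hypothetical helper for illustration; the limits are hard-coded
    from the model (years 0..10, employee_id starting at 123456).
    """
    years = int(row["years_of_experience"])   # returned as a string, e.g. "2"
    employee_id = int(row["employee_id"])     # returned as a string, e.g. "123457"
    return (
        0 <= years <= 10
        and employee_id >= 123456
        and row["department"] in DEPARTMENTS
    )


sample = json.loads(
    '{ "name" : "Gerhilt", "city" : "Beelen", "phone_number" : "+49 299776078", '
    '"years_of_experience" : "2", "employee_id" : "123457", "department" : "FINANCE" }'
)
print(row_matches_model(sample))  # prints True
```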

Launch Data Generation

Now we are ready. Using Swagger or a direct API call (with curl, Postman, or any other client), we launch a data generation like this:

Command shown in Swagger:

curl -X POST "https://ccycloud-1.lisbon.root.hwx.site:4242/datagen/hive" -H  "accept: */*" -H  "Content-Type: multipart/form-data" -F "batches=10" -F "model_file=@model-test.json;type=application/json" -F "rows=10000" -F "threads=10"
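If the same call has to be repeated with different sizes, the command can be assembled from parameters. The sketch below only builds the command string and sends nothing; build_curl_command is a hypothetical helper, and it reuses exactly the form fields from the command above (batches, model_file, rows, threads) without assuming any others.

```python
import shlex


def build_curl_command(base_url: str, model_path: str,
                       batches: int, rows: int, threads: int) -> str:
    """Assemble the curl command for the /datagen/hive endpoint.

    Hypothetical helper: it mirrors the form fields used in the
    command above and quotes each argument for a POSIX shell.
    """
    args = [
        "curl", "-X", "POST", f"{base_url}/datagen/hive",
        "-H", "accept: */*",
        "-H", "Content-Type: multipart/form-data",
        "-F", f"batches={batches}",
        "-F", f"model_file=@{model_path};type=application/json",
        "-F", f"rows={rows}",
        "-F", f"threads={threads}",
    ]
    return shlex.join(args)


command = build_curl_command(
    "https://ccycloud-1.lisbon.root.hwx.site:4242",
    "model-test.json",
    batches=10, rows=10000, threads=10,
)
print(command)
```

Because shlex.join quotes each argument, the result can be pasted into a shell even when the model path contains spaces.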

The API call returns the following command UUID:

{ "commandUuid": "1567dfba-a8f9-4da9-b389-9bc30f4ec1d5" , "error": "" }
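A script that launches generations will usually want to check the error field and keep the UUID. Here is a minimal sketch, assuming the response always has the two keys shown above and that an empty error string means the command was accepted; extract_command_uuid is a hypothetical name.

```python
import json
import uuid


def extract_command_uuid(response_text: str) -> uuid.UUID:
    """Parse the API response and return the command UUID.

    Assumption: a non-empty "error" value means the request failed.
    """
    payload = json.loads(response_text)
    if payload.get("error"):
        raise RuntimeError(f"data generation was rejected: {payload['error']}")
    # uuid.UUID raises ValueError if the value is not a well-formed UUID.
    return uuid.UUID(payload["commandUuid"])


response = '{ "commandUuid": "1567dfba-a8f9-4da9-b389-9bc30f4ec1d5" , "error": "" }'
print(extract_command_uuid(response))  # prints 1567dfba-a8f9-4da9-b389-9bc30f4ec1d5
```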

At the end of the Datagen Webserver logs, we can see:

Let’s Verify

If you log into Hue (or Beeline) with sufficient privileges, you will see a new database, datagen_test, with a table employee_model and some data in it: