
Complete example

In this section you will learn how to load and transform data using the sample files created when bootstrapping a new project. You will learn how to:

  • import: recognize newly arrived files that need to be loaded into your warehouse
  • load: validate file records and load the data into your warehouse
  • transform: apply transformations on your previously loaded data

Sample scenario

Say we have to ingest customers, orders, locations and sellers into the warehouse. The customers and orders are provided by the "sales" department while the sellers and locations datasets are provided by the HR department.

The following rules need to be applied at load time:

The locations dataset should replace any previously imported locations dataset.

The customers and sellers datasets are just updates of the previously imported ones.

The orders dataset contains new, updated and deleted orders. Once imported, we want the deleted orders to be removed from the warehouse and to keep only the last update of each order.
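
To make the orders rule concrete, here is a minimal sketch in plain Scala (not Starlake code) of the expected merge behaviour: keep only the most recent version of each order and drop the orders flagged as deleted. The updatedAt and deleted fields are illustration-only assumptions; the real dataset may carry this information differently.

// Sketch only: the merge rule applied to the orders dataset.
// `updatedAt` and `deleted` are hypothetical fields used for illustration.
case class Order(orderId: String, customerId: String, amount: Double,
                 sellerId: String, updatedAt: Long, deleted: Boolean)

def mergeOrders(existing: Seq[Order], incoming: Seq[Order]): Seq[Order] =
  (existing ++ incoming)
    .groupBy(_.orderId)            // gather every version of the same order
    .values
    .map(_.maxBy(_.updatedAt))     // keep only the most recent version
    .filterNot(_.deleted)          // drop orders marked as deleted
    .toSeq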

The customers and orders datasets are sent by the "sales" department as delimiter-separated files. Below is an extract of these files.

customers dataset

File customers-2018-05-10.psv from "sales" department

id|signup|contact|birthdate|name1|name2
A009701|2010-01-31 23:04:15|me@home.com|1980-10-14|Donald|Obama
B308629|2016-12-01 09:56:02|you@land.com|1980-10-14|Barack|Trump
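
If you want to preview such a pipe-separated file outside of Starlake, a plain Spark read is enough. This is only a sketch: the local path is a hypothetical unzipped copy of the file, and Starlake itself derives the separator and header handling from its own schema metadata.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("customers-preview").getOrCreate()

// Read the .psv extract: pipe-separated values with a header line.
val customers = spark.read
  .option("header", "true")
  .option("sep", "|")
  .csv("customers-2018-05-10.psv")   // hypothetical local, unzipped copy of the file

customers.show()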

orders dataset

File orders-2018-05-10.csv from the "sales" department

order_id,customer_id,amount,seller_id
12345,A009701,123.65,AQZERD
56432,B308629,23.8,AQZERD

sellers dataset

The sellers dataset below is sent as a JSON array by the HR department.

File sellers-2018-05-10.json from the HR department

[
  {
    "id": "AQZERD",
    "seller_email": "me@acme.com",
    "location_id": 1
  },
  {
    "id": "TYUEZG",
    "seller_email": "acme.com",
    "location_id": 2
  }
]
caution

We intentionally set an invalid email for the second seller above to illustrate how errors are handled and monitored.
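
As a rough illustration of the kind of check that rejects this record, a simple pattern test is enough to tell the two values apart. The regex below is an assumption made for this sketch; Starlake validates each attribute against the pattern declared for its type in the schema metadata, which may differ.

// Sketch only: a naive email check that accepts the first seller and rejects the second.
def isValidEmail(value: String): Boolean =
  value.matches("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$")

isValidEmail("me@acme.com")  // true  -> record lands in the accepted area
isValidEmail("acme.com")     // false -> record lands in the rejected area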

locations dataset

The locations dataset below is sent as a JSON file by the HR department, with one JSON record per line.

File locations-2018-05-10.json from the HR department

{ "id":1, "address": { "city":"Paris", "stores": ["Store 1", "Store 2", "Store 3"], "country":"France" }}
{ "id":2, "address": { "city":"Berlin", "country":"Germany" }}
note

Before sending the files, the "sales" department zips all its files into a single compressed file and puts it in the folder /mnt/incoming/sales of the landing area.

note

The HR department does not zip its files. It simply copies them into the folder /mnt/incoming/hr of the landing area.

Recap

We will load the following files:

From the sales department, customers and orders in delimiter-separated files:

  • customers and orders are appended to the previously imported data
  • new orders are added
  • updated orders replace existing ones
  • and some orders may even be deleted when marked as such in the input dataset

From the HR department, sellers and locations in JSON files:

  • sellers are imported in a cumulative way, while locations are imported as full content and overwrite the existing locations dataset
  • sellers are received as an array of JSON objects
  • locations are received as a JSON file with one JSON record per line

In Starlake terms, we will end up with:

  • two domains: sales and hr. A domain is equivalent to a database schema or a BigQuery Dataset.
  • four tables: the customers and orders tables in the sales domain and the locations and sellers tables in the hr domain.
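
Once loaded, the two domains behave like database schemas, so the four tables can be joined directly. The queries below are a sketch assuming the tables are registered in a Spark/Hive-compatible catalog, as mentioned at the end of this page for HDFS and cloud warehouses; the column names come from the extracts above.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Join orders with their customers inside the sales domain.
spark.sql(
  """SELECT o.order_id, o.amount, c.name1, c.name2
    |FROM sales.orders o
    |JOIN sales.customers c ON o.customer_id = c.id""".stripMargin).show()

// Join sellers with their location inside the hr domain (address is a nested structure).
spark.sql(
  """SELECT s.id, s.seller_email, l.address.city
    |FROM hr.sellers s
    |JOIN hr.locations l ON s.location_id = l.id""".stripMargin).show()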

Import data

Build it

Clone the project, install sbt 1.5+ and run sbt clean assembly. This will create the assembly in the target/scala-2.12 directory. Alternatively, simply download the assembly artefact from Maven Central.

Run it

To run the quickstart on a local filesystem, simply copy the content of the quickstart directory to your /tmp directory. This will create the /tmp/metadata and the /tmp/incoming folders.

Import the datasets into the cluster:

/app/starlake/starlake.sh import

This will put the datasets in the /tmp/datasets/pending/ folder. In real life, this will be an HDFS or Cloud Storage folder.

Run the ingestion process as follows:

/app/starlake/starlake.sh watch

This will ingest the four datasets of the two domains (hr & sales) and store them as Parquet files in the following folders:

  • /tmp/datasets/accepted for valid records
  • /tmp/datasets/rejected for invalid records
  • /tmp/datasets/unresolved for unrecognized files

When run on top of HDFS or any cloud warehouse, these datasets are also available as tables.
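
To check the outcome of the local run, you can read the produced Parquet files back with Spark. This is only a sketch: the accepted/rejected root folders come from the list above, but the domain/table sub-folder layout shown here is an assumption.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("check-ingestion").getOrCreate()

// Valid seller records should land in the accepted area ...
spark.read.parquet("/tmp/datasets/accepted/hr/sellers").show()   // hypothetical domain/table layout

// ... while the seller with the malformed email should show up in the rejected area.
spark.read.parquet("/tmp/datasets/rejected/hr/sellers").show()   // hypothetical domain/table layout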