
Connections

Connections are defined in the connections section under the root application attribute.
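
For example, a minimal sketch (assembled from the Local File System and Postgres examples below) declaring two named connections and selecting the one used by default through connectionRef:

application:
  connectionRef: "postgresql" # name of the connection used by default
  connections:
    local:
      type: local
    postgresql:
      type: jdbc
      options:
        url: "jdbc:postgresql://{{POSTGRES_HOST}}:{{POSTGRES_PORT}}/{{POSTGRES_DATABASE}}"
        driver: "org.postgresql.Driver"
        user: "{{DATABASE_USER}}"
        password: "{{DATABASE_PASSWORD}}"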

The following types of connections are supported:

Local File System

The local file system connection is used to read and write files to the local file system.

application:
  connections:
    local:
      type: local

Files will be stored in the area directory under the datasets directory. Default values for area and datasets can be set in the application section.

application:
  datasets: "{{root}}/datasets" # or set it through the SL_DATASETS environment variable.
  area:
    pending: "pending" # Location where files of pending loads are stored. May be overridden by the ${SL_AREA_PENDING} environment variable.
    unresolved: "unresolved" # Location where files that do not match any pattern are moved. May be overridden by the ${SL_AREA_UNRESOLVED} environment variable.
    archive: "archive" # Location where files are moved after they have been processed. May be overridden by the ${SL_AREA_ARCHIVE} environment variable.
    ingesting: "ingesting" # Location where files are moved while they are being processed. May be overridden by the ${SL_AREA_INGESTING} environment variable.
    accepted: "accepted" # Location where files are moved after they have been processed and accepted. May be overridden by the ${SL_AREA_ACCEPTED} environment variable.
    rejected: "rejected" # Location where files are moved after they have been processed and rejected. May be overridden by the ${SL_AREA_REJECTED} environment variable.
    business: "business" # Location where transform tasks store their results. May be overridden by the ${SL_AREA_BUSINESS} environment variable.
    replay: "replay" # Location where rejected records are stored in their original format. May be overridden by the ${SL_AREA_REPLAY} environment variable.
    hiveDatabase: "${domain}_${area}" # Hive database name. May be overridden by the ${SL_AREA_HIVE_DATABASE} environment variable.
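
With the defaults above, each area becomes a subdirectory of the datasets directory. An illustrative layout (directory names only, taken from the defaults above):

{{root}}/datasets/
  pending/
  unresolved/
  ingesting/
  accepted/
  rejected/
  archive/
  business/
  replay/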

Google BigQuery

Starlake supports both native and Spark / Dataproc BigQuery connections.

application:
  connections:
    bigquery:
      type: "bigquery"
      options:
        location: "us-central1" # EU or US or ...
        authType: "APPLICATION_DEFAULT"
        authScopes: "https://www.googleapis.com/auth/cloud-platform" # comma separated list of scopes
        #authType: SERVICE_ACCOUNT_JSON_KEYFILE
        #jsonKeyfile: "/Users/me/.gcloud/keys/starlake-me.json"
        #authType: "ACCESS_TOKEN"
        #gcpAccessToken: "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
  accessPolicies: # Required when applying access policies to table columns (Column Level Security)
    apply: true
    location: EU
    taxonomy: RGPD
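
The commented lines above show the alternative authentication modes. As a sketch derived from those comments, the same connection authenticating with a service account key file instead of application-default credentials (the key file path is illustrative):

application:
  connections:
    bigquery:
      type: "bigquery"
      options:
        location: "us-central1"
        authType: "SERVICE_ACCOUNT_JSON_KEYFILE"
        jsonKeyfile: "/Users/me/.gcloud/keys/starlake-me.json" # illustrative path to the key file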

Apache Spark / Databricks

Spark connections are used to read and write data with Apache Spark (including Databricks).

application:
  connections:
    spark:
      type: "spark"
      options:
        # any Spark configuration can be set here
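
As an illustration only (the property names are standard Spark settings and the values are arbitrary), Spark properties can be passed directly as options:

application:
  connections:
    spark:
      type: "spark"
      options:
        spark.sql.shuffle.partitions: "200" # illustrative value
        spark.serializer: "org.apache.spark.serializer.KryoSerializer"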

Snowflake


application:
  connectionRef: {{connection}}
  connections:
    snowflake:
      type: jdbc
      options:
        url: "jdbc:snowflake://{{SNOWFLAKE_ACCOUNT}}.snowflakecomputing.com"
        driver: "net.snowflake.client.jdbc.SnowflakeDriver"
        user: {{SNOWFLAKE_USER}}
        password: {{SNOWFLAKE_PASSWORD}}
        warehouse: {{SNOWFLAKE_WAREHOUSE}}
        db: {{SNOWFLAKE_DB}}
        keep_column_case: "off"
        preActions: "alter session set TIMESTAMP_TYPE_MAPPING = 'TIMESTAMP_LTZ';ALTER SESSION SET QUOTED_IDENTIFIERS_IGNORE_CASE = true"
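
Here the active connection name is resolved from a {{connection}} variable. To always use this connection, connectionRef can instead be set to the connection name, as in the Postgres example below; a minimal sketch:

application:
  connectionRef: "snowflake"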

Amazon Redshift


application:
  connections:
    redshift:
      type: jdbc
      options:
        url: "jdbc:redshift://account.region.redshift.amazonaws.com:5439/database"
        driver: com.amazon.redshift.Driver
        password: "{{REDSHIFT_PASSWORD}}"
        tempdir: "s3a://bucketName/data"
        tempdir_region: "eu-central-1" # required only if running from outside AWS (your laptop ...)
        aws_iam_role: "arn:aws:iam::aws_account_id:role/role_name"

Postgres

application:
  connectionRef: "postgresql"
  connections:
    postgresql:
      type: jdbc
      options:
        url: "jdbc:postgresql://{{POSTGRES_HOST}}:{{POSTGRES_PORT}}/{{POSTGRES_DATABASE}}"
        driver: "org.postgresql.Driver"
        user: "{{DATABASE_USER}}"
        password: "{{DATABASE_PASSWORD}}"
        quoteIdentifiers: false