2 posts tagged with "Data Engineering"

Polars versus Spark

May 28, 2024 · 6 min read

Starlake Core Team

Introduction

Polars is often compared to Spark. In this post, I will highlight the main differences and the best use cases for each in my data engineering activities.

As a Data Engineer, I primarily focus on the following goals:

Parsing files, validating their input, and loading the data into the target data warehouse.
Once the data is loaded, applying transformations by joining and aggregating the data to build KPIs.

However, on a daily basis, I also need to develop on my laptop and test my work locally before delivering it to the CI pipeline and then to production.

What about my fellow data scientist colleagues? They need to run their workload on production data through their favorite notebook environment.

This post addresses the following points:

How suitable each tool is for loading files into your data warehouse.
How easy and powerful each tool is for performing transformations.
How easy it is to test your code locally before deploying to the cloud

Load tasks

Data loading seems easy at first glance, as almost all databases and APIs offer some sort of single-line command to load a few lines or millions of records quickly. However, this simplicity disappears when you encounter real-world cases such as:

Fixed-width fields: These files are typically exported from mainframe databases.
XML files: I sometimes work in the finance industry where SWIFT is a common XML file format that will be around for some time.
Multi-character CSV: For example, where the separator consists of two characters like ||.
File validation: You cannot trust the files you receive and need to check their content thoroughly.

Loading correct CSV or JSON files using Spark or Polars is straightforward. However, it is also straightforward using your favorite database command line utility, making this capabilities somewhat redundant and even slower than the native data warehouse load feature.

However, in real-world scenarios, you want to ensure the incoming file adheres to the expected schema and load a variety of file formats not supported natively. This is where Spark excels compared to Polars, as it allows for the parallel loading of your JSONL or CSV files and offers through map operations local and distributed validation of your incoming file.

As XML and multichar and multiline CSV are only supported by Spark, dealing with file parsing for data loading, Spark is definitely the most suitable solution.

Transform tasks

Three interfaces are provided to transform data: SQL, Dafarames and Datasets.

SQL

SQL is the preferred language for most data analysts when it comes to computing aggregations. One of its key benefits is its widespread understanding, which eliminates the learning curve.

You can use SQL with both Spark and Polars, but Polars has significant limitations. It does not support SQL Window functions nor does it support the SQL statements for update, insert, delete, or merge operations.

Dataframes

Both Polars and Spark offer excellent support for DataFrames. The main added value of using DataFrames is the ability to reuse portions of code and access features not available in standard SQL, such as lambda functions, JSON handling and array manipulation.

Datasets

As a software engineer, I particularly appreciate Datasets. Datasets are typed DataFrames, meaning that syntax, column names, and types are checked at compile time. If you believe that statically typed languages greatly enhance code quality, then you understand the added value of datasets.

Datasets are only supported by Spark, allowing you to write clean, reusable, and statically typed transformations. They are available exclusively to statically typed languages such as Java or Scala.

Spark stands out as the only tool with complete support for SQL, DataFrames, and Datasets. Polars’ limited support for SQL makes it less suitable for my data engineering tasks.

Cloud Era

At least 60% of the projects I have worked on are cloud-based, primarily targeting Amazon Redshift, Google BigQuery, Databricks, Snowflake, or Azure Synapse. All of these platforms offer serverless support for Spark, making it incredibly easy to run workloads by simply providing your PySpark script and letting the cloud provider handle the rest.

In the cloud environment, Spark is definitely the tool of choice as I see it.

Single node

There has been much discussion about Spark being slow or too heavy for single-node computers. I was particularly interested in running this test since I currently execute most of my workloads on a single-node Google Cloud Run job with Spark embedded in my Docker image.

I decided to conduct this test on my almost 3-year-old MacBook Pro M1 Max with 64GB of memory. The test involved loading 27GB of CSV data, selecting a few attributes, computing metrics on those selected attributes, and then saving the results to a Parquet file.

I ran Spark with default settings and without any fine-tuning. This means it utilized all 10 cores of my MacBook Pro M1 Max but only 1 gigabyte of memory for execution.

::: note I could have optimized my Spark workload, but given the small size of the dataset (27GB of CSV), it didn't make sense. The default settings were sufficient. :::

Here are the results after a cold restart of my laptop before each test to ensure the test did not benefit from any operating system cache.

Apache Spark pipeline took: 29 seconds
Polars pipeline took: 56 seconds

::: note Rerunning Polars gave me 23 seconds instead of 56 seconds. This discrepancy is probably due to filesystem caching by the operating system. :::

Load and Save Test: Load a 27GB CSV file with 193 columns per record and save the result as parquet.

Apache Spark pipeline took: 2mn 18s
Polars pipeline took: 2mn 32s

Load parquet and filter on column value then return count : Load a 74 millions records parquet file with 193 columns, filter on 'model' column and return count.

Apache Spark pipeline took: 3 seconds
Polars pipeline took: 28 seconds

The table below summarises the results

Task	Spark 1GB memory	Polars All the available memory
Load CSV, aggregate and save the aggregation as parquet	29s	56s
Load CSV and Save parquet	2mn 18s	2mn 32s
Load Parquet, filter and count	3s	28s

Conclusion

I don’t think it is time to switch from Spark to Polars, at least for those of us accustomed to the JVM, running workloads in the cloud, or even working on small datasets. However, Polars may be a perfect fit for those familiar with pandas.

As of today, Spark is the only framework I see that can handle both local and distributed workloads, adapt to on-premise and cloud serverless jobs, and provide the complete SQL support required for most of our transformations.

Starlake OSS - Bringing Declarative Programming to Data Engineering and Analytics

May 14, 2024 · 6 min read

Hayssam Saleh

Starlake Core Team

Introduction

The advent of declarative programming through tools like Ansible and Terraform, has revolutionized infrastructure deployment by allowing developers to achieve intended goals without specifying the order of code execution.

This paradigm shift brings forth benefits such as reduced error rates, significantly shortened development cycles, enhanced code readability, and increased accessibility for developers of all levels.

This is the story of how a small team of developers crafted a platform that goes beyond the boundaries of conventional data engineering by applying a declarative approach to data extraction, loading, transformation and orchestration.

Starlake

The Genesis

Back in 2015, at the helm of ebiznext, a boutique data engineering company, we faced a daunting challenge. Our client, a prominent entity in need of a robust big data solution, sought to harness the power of Hadoop and Spark. Despite our modest size (20 people), we dared to compete against industry giants with tenfold resources (100.000+ headcount).

Our only chance to succeed was to innovate: we needed a data platform that could exponentially outperform the traditional ETL solutions pushed by our competitors. To build data pipelines these GUI based ETLs require an effort that is proportional to the number and complexity of the sources.

Determined to disrupt this norm, we embarked on a quest to devise a DevOps friendly platform capable of lightning-fast data ingestion from any source, without the drawbacks of ETLs or specialized engineering skills.

The day of the tender, our ability to deliver a solution that could load data in a few weeks instead of many months allowed us to stand out from the competition and win the project.

Expisode 1: Smartlake Emerges

note

The basic idea behind Smartlake was that no datawarehouse would stay clean if data quality is checked after the data has been loaded and this pre-load quality checks needed to be handled by data owners.

This left us with little choice but to embrace the declarative approach. Empowering business users, we devised a system where data formats and transformations could be described in simple JSON files. Smartlake wasn’t merely a code generator; it was a versatile engine, seamlessly ingesting diverse data formats, executing transformations, and orchestrating operations with unparalleled efficiency.

To streamline user interaction, we devised an intuitive Excel-to-JSON converter, enabling effortless specification of input formats. Thanks to Smartlake and its declarative approach, the business users were able to define load and transformation operations in a matter of minutes.

Smartlake Standout features

Load almost any file format at Spark speed (CSV, JSON, XML, FIXED WITH, Multi-record types …) or Kafka topic
Validate fields using user-defined schemas with user defined semantic types
Apply transformations on the fly to data being loaded (GDPR, normalisation, computed fields ...) with and without schema evolution
Sink to almost any target including Spark, Kafka, Elasticsearch.

Episode 2: Evolution to Starlake

note

The basic idea behind Starlake was to bring in all Smartlake benefits to the cloud by leveraging serverless services and Cloud Datawarehouses capabilities while minimising development and execution costs.

As the data landscape evolved, so did our vision. Cloud data warehouses emerged as formidable competitors to Spark for query execution. Recognizing this shift, we evolved Smartlake into Starlake, preserving its declarative essence while embracing YAML for enhanced readability. We maintained Spark’s prowess to run inside single or multiple container(s) for data ingestion, leveraging cloud data warehouses for query execution.

This strategic blend allowed us to optimize performance and cost-effectiveness based on specific workload requirements. The result was a reimagined platform, tailored for the cloud era, yet grounded in the principles of efficiency and simplicity that defined its inception.

The result is the Starlake OSS project that you can find on Github.

The capabilities of Starlake are extensively described here.

The people behind Starlake

Smartlake, the precursor to Starlake, owes its existence to the collective efforts of numerous individuals, but a select few stand out for their exceptional contributions:

Sam Bessalah With Sam’s presence, rallying others became effortless. His visionary outlook and knack for simplifying complexities proved transformative, setting a new standard for implementation.
Olivier Girardot Every team has its coding wizard, and Olivier filled that role impeccably. From leveraging Spark codegen to exploring mathematical frameworks like matryoshka, he pushed the boundaries, mentoring the team with his expertise spanning Docker, Ansible, Python, Scala and Spark internals.
Valentin Kasas Valentin championed functional programming in Scala. Introducing concepts like recursion schemes, he empowered the team to craft code that was not just functional but also elegant and maintainable.

As the journey progressed towards the cloud, long time data experts joined and made Starlake what it is today:

Bounkong Khamphousone The speed and efficiency of Starlake’s extraction and load engines owe much to his contributions.
Mohamad Kassir His direct involvement with customer projects and his in-depth knowledge of cloud platforms and business needs have been major assets in the evolution of Starlake.
Abdelhamide EL ARIB An early contributor to the load engine, this foresight and execution prowess played a significant role in shaping the platform’s today capabilities.
Stephane Manciot The developer behind Starlake’s declarative workflows on top of Airflow and Dagster, was pivotal in shaping its operational backbone.
Cyrille Chépélov A master of codebase optimisation, Cyrille’s rewrite efforts were instrumental in ensuring the reentrant nature of Starlake’s API.

The next journey

Today with hundreds of gigabytes of data loaded and transformed daily into thousands of tables in various data warehouses, we can confidently say that Starlake is battle tested and ready for the most demanding data engineering & analytics challenges.

As Starlake is, and will always be open source, join us in building a supportive community. Your insights and feature requests aren’t just welcome, they guide our roadmap.

Get Started with Starlake:

Check-out the other features
Explore our documentation

Join our community on GitHub

P.S. Please star the repository: https://github.com/starlake-ai/starlake. Also, any issue or enhancement with Starlake, please just report it. It will fall under the scope of gracious care taking of course.

Introduction​

Load tasks​

Transform tasks​

Cloud Era​

Single node​

Conclusion​

Introduction​

The Genesis​

Expisode 1: Smartlake Emerges​

Episode 2: Evolution to Starlake​

The people behind Starlake​

The next journey​

Introduction

Load tasks

Transform tasks

Cloud Era

Single node

Conclusion

Introduction

The Genesis

Expisode 1: Smartlake Emerges

Episode 2: Evolution to Starlake

The people behind Starlake

The next journey