Write-Audit-Publish for Data Lakes in Pure Python (no JVM)

Look Ma: no JVM!

Introduction

In this blog post we provide a no-nonsense, reference implementation for Write-Audit-Publish (WAP) patterns on a Data Lake, using Apache Iceberg as an open table format, and Project Nessie as a data catalog supporting git-like semantics.

We chose Nessie because its branching capabilities provide a good abstraction for implementing a WAP design. Most importantly, we chose to build on PyIceberg to eliminate the need for the JVM from the developer experience. In fact, to run the entire project, including the integrated applications, we will only need Python and AWS.

While Nessie is technically built in Java, in this project the data catalog runs as a container on AWS Lightsail, and we interact with it only through its endpoint. Consequently, we can express the entire WAP logic, including the queries downstream, in Python only!

Because PyIceberg is fairly new, a number of things are not supported out of the box. In particular, write support is still in its early days, and branching Iceberg tables is not supported at all. So what you'll find here is the result of some original work we did ourselves to make branching Iceberg tables in Nessie possible directly from Python.
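To ground that, here is a minimal sketch, under stated assumptions, of what talking to a Nessie-backed Iceberg catalog from Python can look like through PyIceberg's REST catalog support. The endpoint URI, warehouse path, and table identifier are hypothetical placeholders for your own deployment; the branch-aware write path is precisely the part that required the custom work we describe later.

```python
# A minimal sketch (not the exact code from this project): pointing PyIceberg
# at a Nessie catalog through its Iceberg REST endpoint. The URI, warehouse
# and table identifier below are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "nessie",
    **{
        "type": "rest",
        "uri": "https://<your-lightsail-endpoint>/iceberg/",  # hypothetical Nessie endpoint
        "warehouse": "s3://<your-bucket>/warehouse",          # hypothetical S3 location
    },
)

# Reads are well supported out of the box; writes and branch manipulation
# are where the extra work described in this post comes in.
table = catalog.load_table("raw.events")  # hypothetical namespace.table
df = table.scan().to_pandas()
print(df.head())
```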

So all this happened, more or less.

What on earth is WAP?

Back in 2017, Michelle Winters from Netflix talked about a design pattern for data pipelines called Write-Audit-Publish (WAP). Essentially, WAP is a functional design aimed at making data quality checks easy to implement before the data becomes available to downstream consumers.

For instance, a typical use case is data quality at ingestion. The flow involves creating a staging environment and running quality tests on freshly ingested data, before making that data available to any downstream application.

As the name suggests, there are essentially three phases:

  1. Write. Put the data in a location that is not accessible to consumers downstream (e.g. a staging environment or a branch).
  2. Audit. Transform and test the data to make sure it meets the specifications (e.g. check whether the schema abruptly changed, or whether there are unexpected values, such as NULLs).
  3. Publish. Put the data in the place where consumers can read it from (e.g. the production data lake).
Image from the authors
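To make the three phases concrete, here is a tiny, self-contained Python sketch of the flow. The in-memory `store` dictionary and the check functions are hypothetical stand-ins for a branch-capable catalog and real data quality tests, not the implementation used later in this post.

```python
# A minimal, self-contained sketch of the three WAP phases, assuming a
# branch-capable storage layer. The write/audit/publish helpers below are
# hypothetical stand-ins for what your catalog (Nessie, LakeFS, ...) exposes.
from typing import Callable

def write(rows: list[dict], branch: str, store: dict) -> None:
    # Write: land the new data on an isolated branch, invisible to consumers.
    store[branch] = rows

def audit(rows: list[dict], checks: list[Callable[[list[dict]], bool]]) -> bool:
    # Audit: run every quality check against the staged data.
    return all(check(rows) for check in checks)

def publish(branch: str, store: dict) -> None:
    # Publish: promote the audited branch to the production reference.
    store["main"] = store.pop(branch)

# Usage: ingest a batch, fail fast if a check does not pass.
store: dict = {"main": []}
batch = [{"user_id": 1, "amount": 9.99}, {"user_id": 2, "amount": 5.00}]
checks = [lambda rows: all(r["user_id"] is not None for r in rows)]

write(batch, branch="ingest-2024-01-01", store=store)
if audit(store["ingest-2024-01-01"], checks):
    publish("ingest-2024-01-01", store=store)
else:
    raise ValueError("Audit failed: data not published")
```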

This is only one example of the possible applications of WAP patterns. It is easy to see how it can be applied at different stages of the data life-cycle, from ETL and data ingestion, to complex data pipelines supporting analytics and ML applications.

Despite being so useful, WAP is still not very widespread, and only recently have companies started thinking about it more systematically. The rise of open table formats and projects like Nessie and LakeFS is accelerating the process, but it is still a bit avant-garde.

In any case, it is a very good way of thinking about data and it is extremely useful in taming some of the most widespread problems keeping engineers up at night. So let's see how we can implement it.

WAP on a data lake in Python

We are not going to have a theoretical discussion about WAP, nor will we provide an exhaustive survey of the different ways to implement it (Alex Merced from Dremio and Einat Orr from LakeFS are already doing a phenomenal job at that). Instead, we will provide a reference implementation for WAP on a data lake.

Tags: Apache Iceberg Data Lake Data Lakehouse JVM Python
