
Smart Data Lake Builder

Declarative data pipeline framework built on Apache Spark

Metrics

  • Learning UX: 5/5
  • Potential: 5/5
  • Impact: 5/5
  • Ecosystem: 4/5
  • Market Standard: 2/5
  • Maintainability: 4/5

What is it

Smart Data Lake Builder (SDLB) is an open-source data processing framework that allows you to build and manage data pipelines using a declarative configuration approach. It supports various data sources and targets, transforms data using Apache Spark, and provides intelligent features like schema evolution, data quality checks, and automatic incremental processing. Think of it as “infrastructure as code” for data pipelines.

My Opinion

SDLB is what happens when someone actually understands the pain of building data lakes. Instead of writing 5,000 lines of Spark code, you write HOCON configuration. SDLB handles the rest. It’s the closest thing to declarative data engineering that actually works at scale.

The Declarative Advantage

The power of SDLB is in its declarative nature. You define what you want your pipeline to do, not how to do it. Want to read from S3, apply a transformation, and write to Snowflake? That’s 20 lines of configuration, not 200 lines of Spark code.
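
To make that concrete, here is a minimal sketch of what such a pipeline configuration looks like. The object names, paths, and connection details are illustrative, and the exact type names and options should be checked against the SDLB documentation:

```hocon
# Hypothetical SDLB pipeline: read CSV from S3, apply a SQL transformation,
# write to Snowflake. Names and paths are made up for illustration.
dataObjects {
  ext-orders {
    type = CsvFileDataObject
    path = "s3a://my-bucket/raw/orders"
  }
  snowflake-orders {
    type = SnowflakeTableDataObject
    connectionId = snowflake-con
    table = { db = "ANALYTICS", name = "ORDERS" }
  }
}

actions {
  load-orders {
    type = CopyAction
    inputId = ext-orders
    outputId = snowflake-orders
    transformers = [{
      type = SQLDfTransformer
      code = "SELECT order_id, amount FROM ext_orders WHERE amount > 0"
    }]
    metadata.feed = orders
  }
}
```

Everything here is declarative: the data objects describe endpoints, the action describes the flow between them, and SDLB works out the Spark execution underneath.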

The abstraction layer means SDLB can optimize execution, handle schema evolution, and manage parallelism automatically. You focus on business logic; the framework handles the infrastructure complexity.

The Scala Dependency

SDLB is built on Scala and Spark, which means you’re buying into that ecosystem. If your organization already runs Scala on Databricks, this is a natural fit. But if you’re a Python shop, the learning curve is real. The DSL is powerful, but Scala’s type system and functional paradigms can be daunting for teams coming from PySpark.

The “Smart” Features

The “Smart” in SDLB isn’t marketing fluff. The framework genuinely handles complex data engineering problems:

  • Schema evolution: When source schemas change, SDLB can automatically adapt
  • Incremental processing: It tracks state and only processes new data
  • Data quality: Built-in validation and anomaly detection
  • Orchestration: DAG-based execution with automatic dependency resolution

These are problems every data team solves manually, usually with hundreds of lines of boilerplate code. SDLB solves them once, in the framework.
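
The incremental-processing point is worth making concrete. In SDLB, incremental behavior is declared on the action as an execution mode rather than written as hand-rolled state management. A hedged sketch (object names illustrative, exact mode names and options per the SDLB docs):

```hocon
# Hypothetical action: SDLB persists state between runs and only
# processes data that arrived since the last execution.
actions {
  incremental-load {
    type = CopyAction
    inputId = stg-events
    outputId = btl-events
    executionMode = { type = DataObjectStateIncrementalMode }
  }
}
```

One line of configuration replaces the watermark tables, checkpoint files, and restart logic that teams otherwise build and maintain themselves.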

The Adoption Barrier

The biggest issue is that SDLB is not a market standard. When you hire data engineers, they know Airflow, dbt, and Dagster. They don’t know SDLB. This means training overhead and a smaller talent pool.

But for teams willing to invest, the productivity gains are substantial. Once you’ve built pipelines in SDLB, going back to raw Spark feels like going back to assembly language.

The Tooling Gap

One area where SDLB has historically lagged is IDE support. HOCON configuration files are powerful but lack the discoverability of code, so I built an LSP (Language Server Protocol) implementation to bring autocomplete, validation, and inline documentation to SDLB development.

Conclusion

SDLB is a powerful framework that abstracts away the complexity of Spark pipelines. If you’re building a modern data lake on Spark and want a declarative, opinionated approach, SDLB is worth the investment. The Scala dependency and market adoption are real concerns, but the productivity gains for committed teams are substantial. See my epic series on building an LSP for SDLB for a deep dive into extending the ecosystem.
