Smart Data Lake Builder
Declarative data pipeline framework built on Apache Spark
What is it
Smart Data Lake Builder (SDLB) is an open-source data processing framework that allows you to build and manage data pipelines using a declarative configuration approach. It supports various data sources and targets, transforms data using Apache Spark, and provides intelligent features like schema evolution, data quality checks, and automatic incremental processing. Think of it as “infrastructure as code” for data pipelines.
My Opinion
SDLB is what happens when someone actually understands the pain of building data lakes. Instead of writing 5,000 lines of Spark code, you write HOCON configuration. SDLB handles the rest. It’s the closest thing to declarative data engineering that actually works at scale.
The Declarative Advantage
The power of SDLB is in its declarative nature. You define what you want your pipeline to do, not how to do it. Want to read from S3, apply a transformation, and write to Snowflake? That’s 20 lines of configuration, not 200 lines of Spark code.
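To make that concrete, here is a rough sketch of what such a pipeline could look like in SDLB's HOCON configuration. The data object and action type names (`CsvFileDataObject`, `SnowflakeTableDataObject`, `CopyAction`, `SQLDfTransformer`) are written from memory and the exact names and options may differ; treat this as an illustration of the shape, not a copy-paste recipe:

```hocon
# Hypothetical sketch: read CSV from S3, transform, write to Snowflake.
# Type names and option keys are illustrative; check the SDLB reference docs.
dataObjects {
  raw-events {
    type = CsvFileDataObject
    path = "s3a://my-bucket/raw/events"
  }
  events-curated {
    type = SnowflakeTableDataObject
    table = { db = "analytics", name = "events" }
  }
}

actions {
  curate-events {
    type = CopyAction
    inputId = raw-events
    outputId = events-curated
    transformers = [{
      type = SQLDfTransformer
      code = "SELECT event_id, user_id, cast(ts AS timestamp) AS ts FROM raw_events WHERE event_id IS NOT NULL"
    }]
  }
}
```

Everything imperative (reading, writing, Spark session handling) is implied by the declarations; only the SQL transformation expresses business logic.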
The abstraction layer means SDLB can optimize execution, handle schema evolution, and manage parallelism automatically. You focus on business logic; the framework handles the infrastructure complexity.
The Scala Dependency
SDLB is built on Scala and Spark, which means you’re buying into that ecosystem. If your organization already runs Scala on Databricks, this is a natural fit. But if you’re a Python shop, the learning curve is real. The DSL is powerful, but Scala’s type system and functional paradigms can be daunting for teams coming from PySpark.
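Where Scala actually enters is in custom transformers: when the built-in SQL and standard transformers are not enough, the configuration references a Scala class on the classpath. A hedged sketch of that wiring (the transformer type name is written from memory and the class is hypothetical):

```hocon
# Hypothetical: delegating a transformation step to custom Scala code.
# "ScalaClassSparkDfTransformer" is the transformer type as I recall it;
# com.example.MyDeduplicationTransformer is a made-up class implementing
# SDLB's custom DataFrame transformer interface.
transformers = [{
  type = ScalaClassSparkDfTransformer
  className = "com.example.MyDeduplicationTransformer"
}]
```

This is the boundary where a PySpark-first team feels the friction: the configuration stays declarative, but the escape hatch is Scala.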
The “Smart” Features
The “Smart” in SDLB isn’t marketing fluff. The framework genuinely handles complex data engineering problems:
- Schema evolution: When source schemas change, SDLB can automatically adapt
- Incremental processing: It tracks state and only processes new data
- Data quality: Built-in validation and anomaly detection
- Orchestration: DAG-based execution with automatic dependency resolution
These are problems every data team solves manually, usually with hundreds of lines of boilerplate code. SDLB solves them once, in the framework.
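In configuration terms, these features are switched on declaratively rather than coded by hand. The sketch below shows roughly how that looks; the mode, constraint, and schema-evolution keys (`PartitionDiffMode`, `constraints`, `allowSchemaEvolution`) are approximations from memory of SDLB's options, so verify the exact names against the documentation:

```hocon
actions {
  load-orders {
    type = CopyAction
    inputId = orders-staging
    outputId = orders-table
    # Incremental processing: only process partitions changed since the last run
    executionMode = { type = PartitionDiffMode }
  }
}

dataObjects {
  orders-table {
    type = DeltaLakeTableDataObject
    table = { db = "sales", name = "orders" }
    # Schema evolution: tolerate new columns appearing in the source
    allowSchemaEvolution = true
    # Data quality: row-level rules evaluated as part of the write
    constraints = [{
      name = "amount-positive"
      expression = "amount > 0"
    }]
  }
}
```

Each of these lines replaces what would otherwise be hand-rolled state tracking, merge logic, or validation boilerplate in raw Spark code.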
The Adoption Barrier
The biggest issue is that SDLB is not a market standard. When you hire data engineers, they know Airflow, dbt, and Dagster. They don’t know SDLB. This means training overhead and a smaller talent pool.
But for teams willing to invest, the productivity gains are substantial. Once you’ve built pipelines in SDLB, going back to raw Spark feels like going back to assembly language.
The Tooling Gap
One area where SDLB needed improvement was IDE support. HOCON configuration files are powerful but lack the discoverability of code. That’s why I built an LSP (Language Server Protocol) implementation to bring autocomplete, validation, and documentation to SDLB development.
Conclusion
SDLB is a powerful framework that abstracts away the complexity of Spark pipelines. If you’re building a modern data lake on Spark and want a declarative, opinionated approach, SDLB is worth the investment. The Scala dependency and market adoption are real concerns, but the productivity gains for committed teams are substantial. See my epic series on building an LSP for SDLB for a deep dive into extending the ecosystem.