The data team at Blinkit has grown tremendously in the past 12 months. With increasing demands from the business for optimisation, growth and efficiency, our analytics practice has also matured from a nascent stage. There is still much more to achieve, but here are a few quick learnings from our journey so far.
Each business vertical POD has a corresponding data team that enables it to build, implement and evaluate initiatives from a data perspective relevant to that specific business function.
Two core teams - data engineering and data warehousing - provide central services to all analytics teams, building and maintaining the underlying infra and data warehouse table sets. The data warehousing team recently switched to building ETL via dbt. This post is about our journey with dbt so far.
The Challenges
Our legacy data pipeline builds involved quite a few daunting day-to-day tasks that needed a solve for future scalability and quality outcomes:
- Build time was on the higher side
- Maintenance and RCAs (root cause analyses) were time-consuming
- Time to release was high due to built-in complexity
- Code redundancy was high due to the monolithic SQL code base
- Compute infrastructure requirements were high
- Change impact analysis was hard to do before the actual change went live
Foundation
All of the above challenges pushed us to explore different ways of working. The recent move of our analytics data engine from Redshift to Trino gave us the grounds to build things from scratch and use dbt to solve these day-to-day issues.
data tech landscape
dbt
dbt (data build tool) is a workflow ecosystem that helps build transformation data pipelines quickly and reliably. It keeps things simple by abstracting connection management, threading and several other aspects needed to build future-ready data pipelines.
dbt comes in two flavours: dbt Cloud, a managed service provided by dbt Labs, and dbt-core, the community-supported open-source version. We use dbt-core in our day-to-day operations, along with dbt-trino as the connector to our backend engine.
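For context, a minimal sketch of what such a dbt-core + dbt-trino connection looks like in profiles.yml; the host, catalog and schema names below are placeholders, not our actual setup.

```yaml
# profiles.yml (sketch) - points dbt-core at a Trino cluster via the dbt-trino adapter.
# All values are illustrative placeholders.
blinkit_analytics:
  target: prod
  outputs:
    prod:
      type: trino
      host: trino.example.internal   # Trino coordinator
      port: 8080
      user: dbt_service_user
      database: hive                 # Trino catalog
      schema: analytics              # default schema for dbt models
      threads: 8                     # parallel model builds
```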
dbt project structure
- Currently, we have 30+ data marts and 900+ data models built using dbt, ranging from facts, dimensions and hourly/daily/weekly aggregates to finance reconciliations, reporting and compliance model sets.
- We kept our project structure simple; the screenshot below gives a glimpse of it.
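In simplified form, the layout looks roughly like this (folder, mart and model names here are illustrative, not our actual marts):

```
blinkit_dbt/                  # project root (illustrative)
├── models/
│   ├── staging/              # common lake tables / source-like references
│   │   ├── stg_orders.sql
│   │   └── stg_stores.sql
│   └── marts/
│       ├── supply_chain/     # one folder per business function / data mart
│       │   ├── intermediate/
│       │   │   └── int_orders_enriched.sql
│       │   └── core/
│       │       └── fct_orders.sql
│       └── finance/
├── macros/                   # macros grouped per function / mart
├── snapshots/                # SCD type-2 dimension builds
└── dbt_project.yml
```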
This is pretty much the standard structure recommended by dbt Labs, but a few additional measures we took care of while working are below.
1. Models specific to a single function are grouped under that function's data mart
2. All common lake tables and source-like references live in the staging folder; this helped reduce the number of queries fired per run
3. Similarly, macros are managed per function/mart, which keeps them easier to maintain
4. We ensured each mart has a core model set and an intermediate/interim (work) model set
5. Snapshots were used to build SCD type-2 dimensions, which gave us a standardized way to build dimensions for the related facts (a small sketch follows this list)
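As an illustration of point 5, a minimal snapshot sketch for an SCD type-2 dimension; the table and column names are hypothetical.

```sql
-- snapshots/dim_stores_snapshot.sql (sketch)
-- dbt tracks changes on updated_at and maintains dbt_valid_from / dbt_valid_to
-- columns, giving an SCD type-2 history out of the box.
{% snapshot dim_stores_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='store_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

select * from {{ ref('stg_stores') }}

{% endsnapshot %}
```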
Advantages
- Build time gains - the overall time to build a pipeline reduced significantly with model reusability and macro implementations, from a few days to hours.
- Faster RCAs - with the staging -> intermediate -> core model layering it became super quick to identify data lapses or issues whenever anything broke in the final transformed data sets (a layering sketch follows this list).
- Source change impact analysis - at times we need to backfill data for previous dates or completely redo a pipeline for newly implemented changes at source. With dbt we were able to implement such changes faster, run versions in parallel, compare results and make GO/NO-GO decisions quickly.
- Quality outcomes - using built-in dbt tests gave us more confidence in the final consumable table sets, which saved a lot of operational time on data validation over time (an example test configuration also follows this list).
- Materialisation - dbt's built-in materialization methods gave us multiple options to explore and design our pipelines better; deciding when to use a view, a table or an incremental model makes us, as a team, more design-focused rather than just writing a chain of SQLs.
- Run time improvements - due to code modularization and effective usage of intermediate models, we were able to reduce the runtime of core jobs from 40-60 minutes to 12-15 minutes per run, a gain of almost 60-70%.
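To make the layering and materialisation points concrete, here is a minimal sketch of a core model built on top of an intermediate model; the model and column names are hypothetical.

```sql
-- models/marts/supply_chain/core/fct_orders.sql (sketch)
-- Core fact built from an intermediate model and materialised incrementally,
-- so each run only processes rows newer than what is already in the table.
{{
    config(
        materialized='incremental'
    )
}}

select
    order_id,
    store_id,
    order_placed_at,
    order_amount
from {{ ref('int_orders_enriched') }}

{% if is_incremental() %}
  where order_placed_at > (select max(order_placed_at) from {{ this }})
{% endif %}
```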
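And a sketch of the built-in tests declared alongside such a model, again with hypothetical names:

```yaml
# models/marts/supply_chain/core/schema.yml (sketch)
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: store_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_stores')
              field: store_id
```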
Learnings
- Mindset change - before dbt, most of our pipelines were monolithic in nature. Working with dbt modularized them into more logical, granular pieces.
- Code refactoring - when we migrated from Redshift to Trino, lift and shift did not work as-is; a lot of refactoring was needed, but once we got used to it things moved quickly in a good way, modularizing the big SQLs and reusing the base models in dbt.
- dbt-trino - the adapter that lets dbt work with Trino. It made our life a lot simpler. With a few quick walkthroughs of its internals we were able to adapt and build on it, specifically its set of incremental strategies, which is quite interesting; for example, delete+insert is much faster than the usual update/merge (a sketch follows this list).
- Jinja magic - Jinja gave us a good number of ways to reduce the number of lines in the actual SQL builds. Initially, building macros was a bit time-consuming, but over time it became quicker and we started trying out different things (a small macro sketch also follows this list).
- Observability - this was a big roadblock: we had no good metrics on how many queries dbt fires at the underlying database, or whether all of them are needed. With a few innovative ideas we built a simple frame to capture all fired queries into a log table at runtime, which gave us a good handle on what was going on. Eventually a Superset dashboard was built on top of it, which helped a lot; more on this in a dedicated post later (one small building block is sketched after this list).
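A sketch of the delete+insert incremental strategy mentioned above: instead of updating or merging matched rows, dbt-trino deletes the affected keys and reinserts them, which is much cheaper on Trino. The model and column names are hypothetical.

```sql
-- models/marts/finance/core/fct_payments.sql (sketch)
{{
    config(
        materialized='incremental',
        incremental_strategy='delete+insert',
        unique_key='payment_date'
    )
}}

select
    payment_date,
    order_id,
    payment_amount
from {{ ref('int_payments') }}

{% if is_incremental() %}
  -- reprocess only the last few days; rows with matching payment_date
  -- are deleted from the target and inserted fresh
  where payment_date >= date_add('day', -3, current_date)
{% endif %}
```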
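And a small example of the kind of Jinja that trims repeated SQL: a hypothetical macro that expands into one aggregate per payment method.

```sql
-- macros/sum_by_payment_method.sql (sketch)
{#- Usage in a model:
    select order_date, {{ sum_by_payment_method('order_amount') }}
    from {{ ref('stg_orders') }}
    group by 1
-#}
{% macro sum_by_payment_method(amount_col, methods=['upi', 'card', 'cod']) %}
    {% for method in methods %}
    sum(case when payment_method = '{{ method }}' then {{ amount_col }} else 0 end)
        as {{ method }}_amount{{ "," if not loop.last }}
    {% endfor %}
{% endmacro %}
```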
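Our query-logging frame deserves its own post, but one low-effort building block worth mentioning is dbt's query-comment config, which stamps every query dbt sends to the engine with the node that issued it, so it can be traced back from Trino's query history. A minimal sketch, not our full setup:

```yaml
# dbt_project.yml (sketch): tag every query dbt fires with the issuing node
query-comment:
  comment: "dbt_node: {{ node.unique_id if node else 'none' }}"
  append: true
```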
Next
We think we have only scratched the surface with dbt; there is a lot more to learn and explore. A few things planned for the future: Trino table maintenance via dbt, common profiling for better mart management across different projects, exploring ways to make ourselves future-focused and AI-ready, and contributing back to the community with the small learnings we have had so far.
References
dbt, dbt-trino, trino, airflow