Lifeattails- Enforcing DBT Dependencies in Snowflake

tl;dr: If you use dbt and Snowflake, we made a dbt package you can use to make sure that you keep your dependencies tidy and understandable. Check it out here: https://github.com/tailsdotcom/dbt-snowflake-dependency-tests

We use dbt to version control our analytical logic, it's a great tool and has really step changed the way we keep track of how that logic changes over time and enabled an ever growing team to contribute to it. By their own description:

Analytics engineering is the data transformation work that happens between loading data into your warehouse and analysing it. dbt allows anyone comfortable with SQL to own that workflow.

In short it allows us to define all the transformations we use for day to day analytics in one place as SQL files and deploy that to our data warehouse, Snowflake. The logic can then be edited by anyone on the team without having to worry about the underlying implementation or about things like connections, database clients or scheduling.

We have just over 400 models in our dbt deployment, and that means we have started to put in place some automated ways of keeping our repository clean.

Snowflake is a newer tool for us, and it's been a great replacement for Redshift which we used to use. One downside however is that in Redshift we had an easy way of keeping the dependencies of models in check (using the pg_depend table which it inherits from its postgres past). In Snowflake that has been much harder.

Recently however Snowflake released the GET_OBJECT_REFERENCES command which looked like it might provide a way of doing this. The challenge with this is that it has to be called individually for each view that you want to check dependencies for, and there's no way of calling it in bulk. We've been scratching our heads at tails for a way of getting around this problem for a few months, until we realised that the answer was already in front of us...

dbt Magic

dbt uses the Jinja2 templating library under the hood, and allows us to do all kinds of fun with loops and variables. Using this library we can get the name of a given view, use that to provide the arguments to GET_OBJECT_REFERENCES , use the results in another query which has also been supplied with a list of regular expressions, and use the Snowflake query engine to do all the hard work.

We can use the dbt post-hook functionality to install this function to be run after every single model in a schema, and once we've got a list of the offending dependencies (the ones which don't match our list of regular expressions), then we can use some of the dbt macros to raise exceptions to alert them to the user:

...
Completed with 1 errors:

Database Error in model my_problem_model (models/base/my_problem_model.sql)
  Invalid Dependencides for view. Invalid dependencies: PROBLEM_DB.PROBLEM_SCHEMA. Allowed patterns are: ['foo_db\\..*', 'bar_db\\..*']
  compiled SQL at target/compiled/my_project/base/my_problem_model.sql

This gives us back some of the functionality we lost in transitioning over from Redshift to Snowflake and helps us keep our dependencies in line.

So plain sailing then?

One difference in how GET_OBJECT_REFERENCES is implemented to how pg_depend was implemented is how far the dependencies go.

For pg_depend the only dependencies listed are the ones directly referenced in the view itself.
For GET_OBJECT_REFERENCES, all the dependencies are listed, even ones not directly referenced. This means if view_a is defined as select * from source_table and view_b is defined as select * from view_a then if we call GET_OBJECT_REFERENCES on view_b we will get view_a and source_table even though it's not referenced directly in the definition of view_b.

This means that we have to be a bit more clever with how we express our dependency rules. In the past we used to say that our base schema was the only one that could depend on non-dbt models and that our core schema could only depend on base. But now that dependencies are transparent, we will see non-dbt models in the GET_OBJECT_REFERENCES for our core schema, if only because core depends on base and base depends on non-dbt models.

Our way around this problem for now, is just to put the rules on the base schema and say that it cannot depend on the core schema, rather than mandating the dependency in reverse. It doesn't achieve quite the same end, but it gets us closer and allows us to keep our repository clean with less mental effort.

Got ideas?

If you've got ideas on how we can improve this tool, we're really keen to have contributions and suggestions in the GitHub repo directly. We use this tool every day, and so improvements to it help us and every other team using it. https://github.com/tailsdotcom/dbt-snowflake-dependency-tests

If you like dbt, and like dogs and live in the London area, then maybe you'll also like working at tails, and we'd love to hear from you. We're always looking for smart data people to join the team, so head to tails.com/careers and get in touch.

Enforcing DBT Dependencies in Snowflake

dbt Magic

So plain sailing then?

Got ideas?

Alan Cruickshank

NumPy for Data Scientists

Hackdays at tails.com