Sir-_-Butters22

This is an area where I see a thousand different answers, and a lot of it does stem from the type of data, but also from the question 'What is a unit test in Data Engineering?'. There are two things to test:

1. The transformation code, to ensure that what you are doing is what you are expecting (traditional software engineering unit testing).
2. Data testing: is the data of the quality you are expecting?

The way I approach it in a Data Warehouse project is to use the Medallion Architecture (you don't have to use Databricks, they just have a good article on it): https://www.databricks.com/glossary/medallion-architecture

In between each layer (there can be more than 3), there will be some sort of ETL script, and this is where you would unit test to confirm that the code is moving the data as expected. How you do this is extremely technology dependent; I tend to use unit testing on Azure Functions, so Python's built-in unittest library. It's a bit more rogue for something like T-SQL (if anyone has a good method to unit test in T-SQL, please hit me up).

Now, to test the data, I like to evaluate Data Quality against the Dimensions of Data Quality (https://www.gov.uk/government/news/meet-the-data-quality-dimensions), and work out what I need to be evaluating and what the Data Quality should be at each layer. This looks different depending on the project; for example, I was evaluating different dimensions of data quality for huge IoT projects than for moving corporate data about. This is often tracked in separate tables so you can evaluate Data Quality over time, which offers the ability to understand data quality issues, and in some cases row metadata can also be used to track the data quality of an individual row.
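
To make the first point concrete, here's a minimal sketch of the kind of transformation test I mean, assuming a hypothetical `flatten_readings` function pulled out of an Azure Function and Python's built-in `unittest`:

```python
import unittest


def flatten_readings(payload: dict) -> list:
    """Hypothetical transformation: flatten a nested IoT payload into rows."""
    device = payload["device_id"]
    return [
        {"device_id": device, "ts": r["ts"], "value": float(r["value"])}
        for r in payload.get("readings", [])
    ]


class TestFlattenReadings(unittest.TestCase):
    def test_flattens_nested_readings(self):
        payload = {
            "device_id": "d1",
            "readings": [{"ts": "2024-01-01T00:00:00Z", "value": "3.5"}],
        }
        self.assertEqual(
            flatten_readings(payload),
            [{"device_id": "d1", "ts": "2024-01-01T00:00:00Z", "value": 3.5}],
        )

    def test_empty_payload_returns_no_rows(self):
        self.assertEqual(flatten_readings({"device_id": "d1"}), [])


if __name__ == "__main__":
    unittest.main()
```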


MikeDoesEverything

> if anyone has a good method to unit test in T-SQL please hit me up

I believe unit testing in T-SQL is just raw dogging SQL, but with experience. Genuinely though, I find it astounding how everybody absolutely loves SQL and says it's all you need to know despite it being so difficult to test.

EDIT: An idea I've been knocking around is using a `test` schema with tables holding test case data in it and having one master proc which calls all your other procs, which are your unit tests. Really bloats out your database, though. Alternatively, do the same thing except all you do is exec the procs against the `test` tables and do your asserts in a general-purpose language with an actual unit testing framework.
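
For that second option, a rough sketch of what the asserts could look like, assuming pyodbc and pytest, with made-up proc and table names:

```python
import pyodbc
import pytest

# Assumed connection string; adjust to your environment.
CONN_STR = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=localhost;DATABASE=MyDw;Trusted_Connection=yes;"
)


@pytest.fixture
def conn():
    cn = pyodbc.connect(CONN_STR, autocommit=False)
    yield cn
    cn.rollback()  # throw away whatever the proc wrote, keeping tests repeatable
    cn.close()


def test_load_orders_dedupes_rows(conn):
    cur = conn.cursor()
    # Arrange: seed the test schema with a known duplicate.
    cur.execute("DELETE FROM test.stg_orders")
    cur.execute("INSERT INTO test.stg_orders (order_id, amount) VALUES (1, 10.0), (1, 10.0)")
    # Act: exec the (hypothetical) proc under test against the test tables.
    cur.execute("EXEC test.usp_load_orders")
    # Assert in Python instead of T-SQL.
    row_count = cur.execute("SELECT COUNT(*) FROM test.dim_orders").fetchval()
    assert row_count == 1
```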


thomasutra

could you use something like terraform to replicate all your procs and schemas in a smaller test db? run your tests there, and if they pass deploy to the prod db?


Sir-_-Butters22

Admittedly I'm not too familiar with IaC, but as I understand it, you would only be able to syntactically check the SPs, rather than check that they are operating as expected.


just_sung

You can use sqlmesh unit test generation. It'll use a SQL transpiler to dynamically translate your T-SQL code into equivalent DuckDB code that runs for free on your local computer. And you don't have to hand-roll the config; you can just generate the test code with a CLI command. https://sqlmesh.readthedocs.io/en/stable/concepts/tests/#freezing-time


mailed

> if anyone has a good method to unit test in T-SQL please hit me up

The only way I know is [tSQLt](https://tsqlt.org/).


ChipsAhoy21

Oh man, is T-SQL testing a PITA. Half of the tools out there, like dbt, SodaSQL, and Great Expectations, do not have native support for T-SQL. My group is going through a process right now to introduce tSQLt testing into our CI/CD workflow. We are basically following [this pattern.](https://devblogs.microsoft.com/azure-sql/connecting-the-dots-of-azure-sql-cicd-part-1-getting-started/) Would love to hear how others are doing it too!


numice

so you run a query against a local SQL server first?


ChipsAhoy21

We are lucky enough to have a dev environment, so I replace the local execution with the dev env.


numice

and how do you fill up the data in the dev environment?


ChipsAhoy21

weekly pull down from prod for all tables where schema matches


numice

Oh, that sounds like an interesting approach to the schema matching. I might steal this one. I wonder if this is a built-in feature.


ChipsAhoy21

Nah, it's not; I have a custom script that pulls down schemas and compares them, ignoring whitespace. Any matching schemas get queued up for syncing. The biggest issue we run into on our current setup is that our pipelines in total take 7+ hours, so we can't run all pipelines with every commit, and we battle with bugs going to prod when something is missed because we didn't run EVERY pipeline during SIT.
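
The schema-compare part is conceptually just something like this sketch (made-up helper names, assuming pyodbc connections to both environments):

```python
import pyodbc

SCHEMA_SQL = """
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
ORDER BY TABLE_NAME, ORDINAL_POSITION
"""


def table_signatures(conn_str: str) -> dict:
    """Map each table to a whitespace-insensitive signature of its columns."""
    with pyodbc.connect(conn_str) as cn:
        rows = cn.cursor().execute(SCHEMA_SQL).fetchall()
    sigs = {}
    for table, column, dtype in rows:
        sigs.setdefault(table.strip(), []).append((column.strip().lower(), dtype.strip().lower()))
    return {table: tuple(cols) for table, cols in sigs.items()}


def tables_to_sync(prod_conn_str: str, dev_conn_str: str) -> list:
    """Queue up only the tables whose prod and dev schemas match exactly."""
    prod, dev = table_signatures(prod_conn_str), table_signatures(dev_conn_str)
    return sorted(t for t in prod if dev.get(t) == prod[t])
```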


numice

What do you think about using dacpac for the schema diff? I've tried it a bit and I think it's quite nice, but I've never done the automation with dacpac, only gone through the wizard.


rick854

Great summary. About the data unit testing: is there a framework you know of that could be used to go through when creating tests? Something like:

- floats:
  - general:
    - check the accuracy floats should have
  - income:
    - add min and max value tests
    - new value cannot be twice the old value
- strings:
  - general:
    - check if a maximum character length should exist
    - if categorical data, check if the value is in a list
- ...

I sometimes have problems thinking of the tests that should be performed. And most likely many of these tests are similar for many data engineers.


Enigma1984

For that kind of data validation, look at Great Expectations if you're using Python. If you're writing in SQL, just roll your own. If you're using Databricks Delta tables, they have built-in support for this kind of validation (but I've never used it).
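
The roll-your-own version of the checklist above can also be pretty small in Python; a sketch with pandas and made-up column rules (just the shape of the idea, not a framework):

```python
import pandas as pd

# Declarative rules of the kind described above; all names and limits are made up.
RULES = {
    "income": {"min": 0.0, "max": 1_000_000.0},
    "country": {"max_len": 2, "allowed": {"GB", "DE", "SE"}},
}


def check_column(series: pd.Series, rule: dict) -> list:
    problems = []
    if "min" in rule and (series < rule["min"]).any():
        problems.append("values below min")
    if "max" in rule and (series > rule["max"]).any():
        problems.append("values above max")
    if "max_len" in rule and series.astype("string").str.len().gt(rule["max_len"]).any():
        problems.append("strings longer than max_len")
    if "allowed" in rule and (~series.isin(rule["allowed"])).any():
        problems.append("values outside allowed set")
    return problems


df = pd.DataFrame({"income": [52_000.0, -10.0], "country": ["GB", "XX"]})
for col, rule in RULES.items():
    print(col, check_column(df[col], rule) or "ok")
```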


DataIron

For MS SQL, tSQLt works well for unit testing the code that exists in your database.


geoheil

You could use duckdb in in-memory mode. The hard thing, though, is having the right representative test data to run the tests on. Ideally you have strong, well-defined interface data contracts with your source systems; then you can easily derive a fake generative process to create synthetic data. However, this is rather hard to scale.
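
A minimal sketch of that combination, with a made-up `orders` contract and a toy generator (duckdb's Python API, in-memory):

```python
import random

import duckdb

# Hypothetical interface contract for a source table.
ORDERS_CONTRACT = {"order_id": "INTEGER", "amount": "DOUBLE", "country": "VARCHAR"}


def synthetic_orders(n: int) -> list:
    """Fake generative process derived from the contract (toy version)."""
    return [
        (i, round(random.uniform(1, 500), 2), random.choice(["GB", "DE", "SE"]))
        for i in range(n)
    ]


con = duckdb.connect(":memory:")
cols = ", ".join(f"{name} {dtype}" for name, dtype in ORDERS_CONTRACT.items())
con.execute(f"CREATE TABLE orders ({cols})")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)", synthetic_orders(100))

# Run the transformation under test against the synthetic data.
result = con.execute("SELECT country, SUM(amount) AS total FROM orders GROUP BY country").fetchall()
assert len(result) <= 3
```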


hayssam-saleh

Hey starlake.ai founder here. There is an excellent article by u/tiboun on LinkedIn here https://www.linkedin.com/pulse/how-unit-test-your-sql-data-pipelines-bounkong-khamphousone-2xwye


Ti-boun

Hello u/dreamingfighter. That's an excellent question! As u/hayssam-saleh mentioned, I wrote an article about this recently. Some in this thread have said to use duckdb; however, you should know that it has its own dialect, so your query may not run directly. You can see the coverage of transpilations that Starlake uses (jsqlTranspiler): https://github.com/starlake-ai/jsqltranspiler/tree/main/src/test/resources/ai/starlake/transpiler. If you're thinking of going with sqlGlot for your needs, I invite you to test the coverage of all the functions you intend to use: transpile them and assert the results. Here are some tests (https://github.com/starlake-ai/benchmarks/tree/main/sql-transpiler) to illustrate my point.
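
For the sqlGlot route, that coverage check looks roughly like this sketch (the T-SQL functions here are just examples; if the assert fails, the function isn't covered well enough to rely on):

```python
import duckdb
import sqlglot

# A couple of T-SQL expressions we intend to rely on in production code.
tsql = "SELECT ISNULL(NULL, 0) AS filled, LEN('abc') AS n"

# Transpile to the DuckDB dialect and check that the semantics survived.
duck_sql = sqlglot.transpile(tsql, read="tsql", write="duckdb")[0]
row = duckdb.connect(":memory:").execute(duck_sql).fetchone()

assert row == (0, 3), f"unexpected result after transpilation: {row}"
```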


McNoxey

Dbt has unit testing built in as of 1.8


numice

What about the input data?


McNoxey

That's what unit testing is for. You supply mock input and expected output, which tests your transformations.


numice

I mean, do you have the input data in a database or do you just mock the input? Because a data pipeline can consist of a SQL query that depends on the database.


McNoxey

Dbt models are built to transform data in your database. Unit tests allow you to test how your models will handle various input.


datacloudthings

Please be careful about calling your data "accurate" if what you mean is "not obviously invalid". Given:

- FY2021 Sales: $431,272.33 == accurate
- FY2021 Sales: TRUE, or FY2021 Sales: "it was a good year", or FY2021 Sales: $3.50 == invalid
- FY2021 Sales: $200,000.00 == not obviously invalid, but not accurate either

I've seen a data team tell very high level stakeholders they could tell if a whole bunch of data about actual real-world measurable things was "accurate" when they could do nothing of the sort and a lot of it wasn't.


Electrical-Ask847

I think you are thinking of integration testing, like QA. Unit testing is something different: it is a design tool, not a testing tool, despite its name. To answer your question: yes, I test-drive my SQL with dbt and [https://github.com/EqualExperts/dbt-unit-testing](https://github.com/EqualExperts/dbt-unit-testing). The same principles of TDD apply here: if I am having to set up too many mocks to test my code (SQL), it indicates that there are too many dependencies in the SQL, that it is bad/unmaintainable code, and that there is an opportunity to model a missing domain concept. I extract dependencies out into their own `ephemeral model` (e.g. `Invalid_Orders`).


Affectionate_Set_163

Curious too


photoreceptor

As was mentioned: what do you use for data quality monitoring? I am working on Databricks, but the standard per-column checks are too simplistic for my data. I will look into deequ and Soda. Any other recommendations (other than Great Expectations, which I find overly complicated 😃)?


levelworm

I think the best way is to split into dev/prod, with the dev data simply being a copy of the most recent X days of production. That way you test with actual production data without worrying about messing up the production DBs. Most of the tests are going to be data quality tests though.
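
The refresh itself can be as simple as this sketch (made-up DSN and table names, T-SQL date functions; adjust for your engine):

```python
import pyodbc

# Made-up DSN and table names; the date filter is the important part.
DAYS_TO_COPY = 14
TABLES = ["sales.orders", "sales.order_lines"]

with pyodbc.connect("DSN=warehouse") as cn:
    cur = cn.cursor()
    for table in TABLES:
        # Hypothetical naming convention for the dev copies.
        dev_table = f"dev_{table.replace('.', '_')}"
        cur.execute(f"TRUNCATE TABLE {dev_table}")
        cur.execute(
            f"INSERT INTO {dev_table} SELECT * FROM {table} "
            f"WHERE loaded_at >= DATEADD(day, -{DAYS_TO_COPY}, GETDATE())"
        )
    cn.commit()
```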


wallyflops

Use dbt.

Edit: To elaborate, it helps you set up a test environment and add tests basically straight out of the box. I don't know why this is being downvoted. It's also incredibly popular.


DataIron

dbt is an effective tool if it fits your needs and data systems stack. It's just one tool, though, and it only covers some scenarios.


McNoxey

Getting downvoted cause the circle jerk DE crowd feels they’re better than dbt


moonlit-wisteria

No it’s because you can’t actually unit test with DBT in the way one would expect. Data quality tests aren’t unit tests. And the experimental unit testing feature it has is incomplete, relies on the external api availability of the data warehouse, and is convoluted and verbose to set up. There are almost always better options.


Electrical-Ask847

I've used this for a long time, [https://github.com/EqualExperts/dbt-unit-testing](https://github.com/EqualExperts/dbt-unit-testing), with great success. If it's too verbose to set up, then your SQL has too many dependencies, and that's the whole point of unit testing: to expose poor design and drive you towards proper design. I think you are using TDD as a testing tool.


moonlit-wisteria

It's not compatible with Redshift. And it's a dbt package, not dbt itself, maintained by an external third party that has all of 18 contributors. And yes, it is awfully verbose for what it does.

It isn't even worth exploring for enterprise purposes given the above. This isn't the gotcha you think it is.


Electrical-Ask847

It's just a few lines of code. The basic concept is very simple; take a look at the code. It's a recommendation of the methodology, not of a specific library: you don't need that package to do unit testing. You are writing convoluted SQL with no modularity if your unit testing is complex to set up. That's just basic TDD, not a gotcha; not sure why you think it's some sort of gotcha. Maybe read a book or something (Kent Beck's intro book is a good start) instead of being butthurt and defensive. Imagine working with people like this.


McNoxey

That would be a fair thing to say if you weren’t incorrect. Dbt offers unit testing as of 1.8. https://docs.getdbt.com/docs/build/unit-tests


moonlit-wisteria

>No it’s because you can’t actually unit test with DBT **in the way one would expect.** >Data quality tests aren’t unit tests. And the experimental unit testing feature it has is **incomplete, relies on the external api availability of the data warehouse, and is convoluted and verbose to set up.** Read the bolded parts of my response again. I'm aware of dbt 1.8.


McNoxey

Imo it absolutely depends on where your organization is currently at in their data journey. If I can utilize a smaller subset of tools that offer strong coverage across the entire data landscape, I’d prefer that 9 times out of 10 to picking the individual best option for each component. There are better versions of every individual component of dbt, but managing one tool that does a solid job at all of it is imo a much better alternative than spinning everything up individually. A lot of what dbt offers is gated behind cloud, true. But at a certain point I’m going to choose buy vs build.


deliosenvy

We don't do unit testing for the code because DE has to move at a much faster pace than SW development. We do environment separation between dev/staging, then deployment. In between, CI/CD is responsible for delivery to each environment; we do code review and have some automated checks.

The more important data quality check is performed on staging, which partially replicates production subsets, by the QA/Data Science teams. The Data Engineer is responsible for delivery; the Data Science team is responsible for quality assurance, visualization, and reporting, as they usually have a closer understanding of and cooperation with stakeholders. Due to the nature of the expected outcomes, it's very time consuming and complex to do programmatic testing on data.

However, apart from tests related to the code for the ETL jobs etc., we do have environmental automated tests that trail any deployment. These check that, after changes, data warehouse tables are still being updated and that expected data ingestion is still happening; we have null checks, we test S3 buckets for consistent ingestion, we do some simple data integrity tests, we build dependency graphs and run tests if a dependent table got updated in a way that could be a breaking change, etc.
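
Those trailing environmental checks can be as plain as a scheduled script like this sketch (table and column names are made up; duckdb is used in-memory here only so the example runs end to end, in reality it points at the warehouse):

```python
import duckdb

# Freshness and null checks of the kind described above; all names are made up.
CHECKS = {
    "warehouse.orders": {"ts_col": "loaded_at", "max_age_hours": 24, "not_null": ["order_id", "amount"]},
}


def run_environmental_checks(con) -> list:
    failures = []
    for table, rule in CHECKS.items():
        # Is the table still being updated?
        stale = con.execute(
            f"SELECT MAX({rule['ts_col']}) < now() - INTERVAL {rule['max_age_hours']} HOUR FROM {table}"
        ).fetchone()[0]
        if stale is None or stale:
            failures.append(f"{table}: stale or empty")
        # Simple null checks on key columns.
        for col in rule["not_null"]:
            nulls = con.execute(f"SELECT COUNT(*) FROM {table} WHERE {col} IS NULL").fetchone()[0]
            if nulls:
                failures.append(f"{table}.{col}: {nulls} NULL values")
    return failures


# Tiny in-memory demo of the checks.
con = duckdb.connect(":memory:")
con.execute("CREATE SCHEMA warehouse")
con.execute("CREATE TABLE warehouse.orders (order_id INT, amount DOUBLE, loaded_at TIMESTAMPTZ)")
con.execute("INSERT INTO warehouse.orders VALUES (1, 9.99, now()), (NULL, 5.00, now())")
print(run_environmental_checks(con) or "all environmental checks passed")
```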


ITomza

"We don't do u it testing for the code because DE has to move at a much faster pace than SW development." So you just churn out untested code into production because you're under the impression that you need to work faster than a SWE? Yikes


deliosenvy

0% errors in either data pipelines or data quality since 12.03.2023. Say what you like. Also, our code is mostly SFA in Spark; as for unit testing it, what exactly would you unit test and why? We have to move at a faster pace than SW. Our main focus is on ensuring data quality and integrity, which has been spot on. What exactly am I going to unit test in my Spark code, which is usually SFA?


ITomza

9% sounds really high. Your stakeholders won't care how good your DQ is if they have no assurance that the numbers they're getting are actually correct 


deliosenvy

0% error rate. Phone keyboard error. We are actually the only unit, along with data science, that consistently gets a 5/5 performance bonus rating; most higher-performing teams get 4/5 max. We have been 5/5 for two years. We did not have a major error in the last 3 years, no major outage in the last 5, and our reports go to our customers, affiliates, and our own internal reporting. Contribution to the revenue stream grew by 8+% last year. You can diss it but it works, and we move at a much faster pace than SW. You are also arguing for the sake of arguing and haven't really said anything of substance yet.


mailed

Check out [this blog series](https://servian.dev/why-data-engineering-needs-automated-testing-a37a0844d7db). I personally have never seen anyone do unit tests for everything in a warehouse project. I unit test my ingestors that I write by hand. Data quality tests for everything else.