How to overcome the Curse of ETL Regression Testing

Felix Klemm
7 min read · Feb 22, 2019

New data projects often start with high expectations. Why not consolidate all our data from countless source systems into one single DWH or data lake and gain valuable insights through AI, BI, self-service analytics and so on? Many companies completely underestimate the specific complexity of data architecture. But often the real challenge begins after UAT or go-live: patches and changes must be rolled out with extreme care, because the stability of the platform must not be put at risk. I have seen financial institutions spend multiple millions of dollars on a single full end2end regression test (for one single release). It was always some kind of painful, semi-manual shotgun test which involved the whole organization. Many data platform projects therefore tend to accumulate scope into one or two big releases a year in order to avoid the overhead of small releases. It is pretty obvious that these kinds of super-heavy tests are show-stoppers for an agile strategy.

tl;dr Data regression tests should be combined with small releases and follow a risk-based strategy. Carefully aim your tests at the typical issues you have seen in the past. Automate everything and focus on metadata tests instead of slow raw-data tests.

Anti-Pattern Golden Source Test

The basic idea of Golden Source Tests is to simulate a production go-live on a test system in order to find expected data changes and unexpected side effects. Typically a clone of the production system, including software and data, is installed on a test environment. After that a full ETL load is executed in order to define the baseline. Then the new release is installed and another ETL load is executed based on the updated software and parameters. Finally the output of the new release's batch load is compared to the baseline output. The comparison can focus on BI reports, consumer systems or target data tables.
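To make the mechanics concrete, here is a minimal sketch of the final comparison step, assuming both the baseline load and the release load ended up in file-based clones of the target database; the database files and table names are purely hypothetical.

```python
# Minimal sketch of the baseline-vs-release comparison of a Golden Source Test.
# "baseline.db" / "release.db" and the table list are hypothetical stand-ins
# for clones of your target database.
import hashlib
import sqlite3

TARGET_TABLES = ["customer", "account", "transaction_fact"]  # hypothetical

def fingerprint(conn, table):
    """Row count plus an order-independent hash over all rows of the table."""
    rows = conn.execute(f"SELECT * FROM {table}").fetchall()
    digest = hashlib.sha256()
    for row in sorted(map(repr, rows)):
        digest.update(row.encode())
    return len(rows), digest.hexdigest()

def compare(baseline_db, release_db):
    base, rel = sqlite3.connect(baseline_db), sqlite3.connect(release_db)
    for table in TARGET_TABLES:
        b_count, b_hash = fingerprint(base, table)
        r_count, r_hash = fingerprint(rel, table)
        status = "OK" if (b_count, b_hash) == (r_count, r_hash) else "DEVIATION"
        print(f"{table}: baseline={b_count} rows, release={r_count} rows -> {status}")

compare("baseline.db", "release.db")
```

The comparison itself is the easy part; every DEVIATION line still needs someone to decide whether it is expected, a bug or noise, which is exactly where the trouble begins.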

That sounds like a reasonable approach. Nevertheless, the trouble begins when you find deviations. And you will find deviations! A deviation may have several root causes:

  1. It may be an expected deviation caused by software changes
  2. It may be an unexpected side effect caused by software bugs or wrong configuration
  3. It may be a non-deterministic effect of the ETL software, which is basically OK but simply makes the test more complicated

Your most skilled team members will spend hours and days analyzing deviations. Often deviations occur at a very early stage of the processing and spread across several higher data layers. A simple change of a currency code may cause millions of side effects across the whole data model. Often nobody can really judge whether a deviation is acceptable or not.

Another challenge is the extreme logistical effort of cloning all involved systems and data within the end2end data pipelines in order to have a consistent state. Often configuration and timing issues block the whole test.

Long story short: such tests should be used in exceptional cases only!

Anti-Pattern Test Engine

I have seen these “test engines” on some large DWH projects. Some clients maintain a full SQL-based clone of the whole ETL functionality in order to test the whole system end2end. You can imagine that maintaining a duplicate of the entire code base is an extremely expensive way of testing. In some cases they even considered implementing some kind of meta test cases for their test engine, because they struggled with regressions in the test engine itself.

Just to clarify one thing: I value the idea of test-driven development. Nevertheless such an approach should focus on providing appropriate generic input data instead of cloning ETL jobs as super-complex SQLs.

Basic Rules of effective Regression Testing

  1. Use small iterations
  2. Use unit tests and continuous integration
  3. Define a risk-based regression test strategy
  4. Maintain a set of fully-automated lightweight regression test cases

1. Small iterations

Effective testing begins with your development lifecycle. The ability to roll out and test many small iterations will strengthen your operations procedures and minimize the risk per change. The challenge is to optimize and automate the deployment process. Make sure you can package and deploy a single atomic ETL job, including all parameter files and required DML, in isolation. Automation is key to eliminating regressions caused by human error.
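As a rough illustration, here is a minimal sketch of what "package one atomic ETL job" could look like; the repository layout (jobs/, params/, dml/) and the naming convention are assumptions, not a standard.

```python
# Minimal sketch of packaging a single atomic ETL job for isolated deployment.
# The repository layout (jobs/, params/, dml/) and the artifact naming are
# assumptions; adapt them to your own conventions and deployment tooling.
import tarfile
from pathlib import Path

def package_job(job_name: str, version: str, repo_root: Path, out_dir: Path) -> Path:
    """Bundle the job definition, its parameter files and its DML into one artifact."""
    out_dir.mkdir(parents=True, exist_ok=True)
    artifact = out_dir / f"{job_name}-{version}.tar.gz"
    with tarfile.open(artifact, "w:gz") as tar:
        for part in ("jobs", "params", "dml"):
            src = repo_root / part / job_name
            if src.exists():
                tar.add(src, arcname=f"{part}/{job_name}")
    return artifact

# Example: the deployment pipeline builds exactly one artifact per change.
# package_job("load_customer", "1.4.2", Path("etl-repo"), Path("dist"))
```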

2. Use Unit Tests and Continuous Integration

Software and configuration bugs should be detected as early as possible. Well-designed unit tests based on generic data can detect 90% of all bugs before the code hits any important test stage. I recommend the blog post by Slawomir Chodnicki, which covers the topic in depth.
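A minimal sketch of what such a unit test could look like, assuming a hypothetical transformation that normalizes currency codes; the generic input rows are made up and the tests run with any JUnit-style runner (here: pytest).

```python
# Minimal sketch of a unit test for one ETL transformation, driven by generic
# in-memory data instead of production data. normalise_currency() and its
# behaviour are hypothetical examples; run the tests with pytest.

def normalise_currency(rows):
    """Toy transformation under test: uppercase the currency code, default to EUR."""
    return [{**row, "currency": (row.get("currency") or "EUR").upper()} for row in rows]

def test_currency_code_is_uppercased():
    assert normalise_currency([{"id": 1, "currency": "usd"}])[0]["currency"] == "USD"

def test_missing_currency_gets_default():
    assert normalise_currency([{"id": 2, "currency": None}])[0]["currency"] == "EUR"

def test_empty_input_yields_empty_output():
    assert normalise_currency([]) == []
```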

3. Define a risk-based Regression Test Strategy

Before you start implementing any kind of regression test case, think about your test strategy. The idea of testing each and every feature of your system again and again will never work in a data-driven context. You will end up with the struggle of Golden Source Tests. So make sure you aim your tests at the real weak spots of your system. A good start is to analyze the typical bugs you have found in the past. Aim at the 80% of typical issues.

The majority of bugs is often caused by design issues. That means your software is working fine and according to the requirements, but from a functional point of view the result of the ETL processing does not make sense. That is a typical problem when working with complex data: the requirements are often based on weak assumptions which simply fail when they hit some unique data constellation. Such a design issue might threaten your system stability, so you need to take care of that.

Another problem is data migration procedures. Often their effects can only be judged when they are executed against real data. Therefore data migration procedures typically must be covered by the regression tests.

Finally, I often see data projects struggle with unstable test systems, which is usually caused by complex manual deployment and configuration processes. If the problem cannot be fixed by adjusting the processes (it should be), then these aspects must be covered via regression tests.

Important rule: Crash early. Aim your tests to find typical issues where they actually happen instead of hunting phantoms by checking deviations at consumer level.

4. Maintain a set of fully-automated lightweight Regression Tests

For our purpose you should use a custom or packaged test automation solution which can be configured with a versioned rule base. You might also want to schedule the test set via a build server (Jenkins, TeamCity etc.). We have worked with UFT, Tosca BI and JUnit-based solutions. Basically, choose the solution which best suits the technical capabilities of your team.
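Whatever tool you pick, the key property is that the regression suite is data, not code. A minimal sketch of such a versioned rule base; the rule types, table names and thresholds are assumptions:

```python
# Minimal sketch of a declarative, versioned rule base. The rule types, table
# names and thresholds are assumptions; the file lives in version control and
# is executed by the build server after every load.
RULE_BASE = [
    {"rule": "not_empty",   "table": "stg_customer"},
    {"rule": "not_empty",   "table": "stg_transaction"},
    {"rule": "count_ratio", "source": "stg_customer",
     "target": "out_customer_interface", "tolerance": 0.10},
    {"rule": "fill_rate",   "table": "stg_customer",
     "column": "customer_name", "min_ratio": 0.99},
]

def run_rule_base(conn, rule_base, checks):
    """Dispatch every configured rule to the matching check implementation."""
    return [(rule, checks[rule["rule"]](conn, rule)) for rule in rule_base]
```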

Keep in mind that we are searching for regressions. Our tests do not check the correct implementation of the ETL software. Instead we hunt for typical symptoms of failures by applying simple rule patterns to a broad set of data objects.

What does that mean? Correct data loads typically show some static patterns which can easily be found within the metadata. Just imagine you load customer data on a daily basis. One simple rule could be: if the customer load was successful, the customer table must be filled with data. That sounds like nonsense, but trust me, I have seen testers search for hours and days for the root cause of some problem and finally find out that there was just an empty table. That simple rule can be applied to every source object which must contain data after the load.
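A minimal sketch of that rule, applied to a configurable list of tables; the connection, table names and the use of COUNT(*) are assumptions (many warehouses expose row counts in their catalog, which is even cheaper):

```python
# Minimal sketch of the "a successful load must leave data behind" rule.
# Connection and table names are hypothetical; many DWHs let you read row
# counts from catalog/statistics tables instead of running COUNT(*).
import sqlite3

MUST_BE_FILLED = ["stg_customer", "stg_account", "stg_transaction"]

def empty_tables(conn, tables):
    return [t for t in tables
            if conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] == 0]

conn = sqlite3.connect("dwh_test.db")  # stand-in for your DWH connection
assert not empty_tables(conn, MUST_BE_FILLED), "Empty tables after load!"
```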

Another typical pattern is to check meta dependencies between data objects. If referential integrity is not enforced between the data objects, you can check it yourself. You can also easily check record count relations: if a consumer receives all our customers, you can check whether the number of records in our customer staging table equals the number of records in the outbound interface, +/- 10%.
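A minimal sketch of that record count relation, with hypothetical staging and outbound table names and a 10% tolerance:

```python
# Minimal sketch of a record count relation rule: the outbound interface should
# carry roughly as many records as the staging table (+/- 10%). Table names and
# the connection are hypothetical.
import sqlite3

def count_ratio_ok(conn, source_table, target_table, tolerance=0.10):
    src = conn.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    tgt = conn.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()[0]
    return src > 0 and abs(tgt - src) / src <= tolerance

conn = sqlite3.connect("dwh_test.db")  # stand-in for your DWH connection
assert count_ratio_ok(conn, "stg_customer", "out_customer_interface"), \
    "Record count relation between staging and outbound interface violated"
```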

Unusual granularities and empty values are often indicators of errors. You could, for example, check whether all your customer fields are properly filled with values. Sometimes certain distributions indicate errors, e.g. at least 60% of your transactions must be denominated in dollars. These kinds of checks can become additional parameterized rules within your growing regression test rule base.
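Both patterns fit into the same rule shape. A minimal sketch, with hypothetical tables, columns and thresholds:

```python
# Minimal sketch of two symptom checks: mandatory columns must be (almost)
# completely filled, and a value distribution must stay inside an expected
# corridor. Tables, columns and thresholds are hypothetical.
import sqlite3

conn = sqlite3.connect("dwh_test.db")  # stand-in for your DWH connection

def fill_rate(conn, table, column):
    # COUNT(column) only counts non-NULL values
    total, filled = conn.execute(f"SELECT COUNT(*), COUNT({column}) FROM {table}").fetchone()
    return filled / total if total else 0.0

def value_share(conn, table, column, value):
    total, hits = conn.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN {column} = ? THEN 1 ELSE 0 END) FROM {table}",
        (value,),
    ).fetchone()
    return (hits or 0) / total if total else 0.0

assert fill_rate(conn, "stg_customer", "customer_name") >= 0.99, "too many empty names"
assert value_share(conn, "stg_transaction", "currency", "USD") >= 0.60, "unexpected currency mix"
```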

All of these test patterns have some aspects in common:

  • They are simple and very quick during execution
  • They can be parameterized and quickly be automated
  • They boost your test coverage simply by raising the number of data objects they are applied to
  • They do not touch the raw data (slow)

You might also want to save debugging effort by using automated conditional tests. For example: if the record count rule fails, apply a granularity check. If that also fails, apply a row-by-row compare against a historical load or check for parameter and DML changes.
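A minimal sketch of that escalation chain; the three predicates are placeholders for the rules sketched above plus a slow row-by-row compare, and the example wiring at the end is made up:

```python
# Minimal sketch of conditional tests: the expensive checks only run when the
# cheaper ones have already failed. The three predicates are placeholders for
# the rules above plus a slow row-by-row compare against a historical load.
def regression_check(table, record_count_ok, granularity_ok, row_by_row_ok):
    if record_count_ok(table):
        return f"{table}: PASS"
    if granularity_ok(table):
        return f"{table}: record counts off but granularity fine - review expected changes"
    if row_by_row_ok(table):
        return f"{table}: granularity off but rows match history - check parameters and DML"
    return f"{table}: deviations confirmed - escalate to manual analysis"

# Example wiring with trivial stand-ins for the real checks:
print(regression_check("customer",
                       record_count_ok=lambda t: False,
                       granularity_ok=lambda t: True,
                       row_by_row_ok=lambda t: True))
```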

After initial setup you will have a powerful smoke test you can run after each and every full load. Regression testing will be a no-brainer and efforts will shrink dramatically.

Finally

Lightweight regression tests will never find 100% of all possible bugs. Instead they aim to find the 80% common issues and the most critical ones. Combined with a DevOps or CI/CD delivery strategy the approach will minimize effort and risk for each iteration. Make sure to have a Test Automation Engineer who is able to set up the foundation framework. Also make sure that the whole team understands the aspects of risk-based testing.
