DataOps - Accelerating changes for your next data initiative

Data initiatives tend to be long and painful affairs. Let's take a look at how applying DataOps, coupled with cloud technologies, can help accelerate your data projects.

Raihaan Raman

What is DataOps?

DataOps is a methodology that combines Agile ways of working, the breakdown of organisational silos typical to a DevOps transformation, and test strategies adapted from both the software engineering and lean manufacturing domains. The methodology was created to address some of the unique challenges encountered in large scale data endeavours.


Why do we need DataOps?

The data usage of large organisations has sky-rocketed in the last decade. These days, it is not uncommon for data teams to tackle a myriad of data projects: reporting, business intelligence systems, dashboards, analytics and machine learning.

Traditionally, each project would build a pipeline to transfer the required data from source to an intermediary data store. The data in the intermediary store would then be transformed, cleansed, and re-purposed for the exclusive use of the data product being developed. If you mapped a small portion of your data pipelines and products, you would probably end up with a diagram like the one below:


Traditional data flow

Most organisations that scale up their data initiatives end up facing the following problems:

  1. The way data is organised and labelled in the source system may change, and the impact is often only noticed in the final data product
  2. The time it takes to get data from source to a usable and valuable state is prohibitively long
  3. It’s near impossible to create sandbox environments that allow data pipeline and product developers to confidently experiment with production data in order to develop cutting-edge data products


How does DataOps solve the problems?

In order to be truly effective, DataOps warrants changes in how we perceive data and how we use it to build analytical products. Achieving this vision requires separating the concern of creating the data that feeds data products from the concern of building the data products themselves.


Separate the concerns

To better illustrate why this separation is beneficial, let’s use a simple analogy. Suppose that you are a jeweller that makes gold jewellery. You would want to know that the gold bars you buy have the purity level you ordered. To solve this problem, you would buy your gold bars from a trusted gold supplier, knowing that they have taken the appropriate measures to purify and test the gold they are selling to you.

The same applies to data. Where the data is piped from should be tested and governed, so that you have the trust you need to build data products that will potentially drive strategic business decisions. In the value creation process, you should aim for the following outcomes:

  • Ensure the quality and integrity of the data
  • Build repeatable, scalable, resilient and highly-available data pipelines
  • De-personalise and version changes to data sets.
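As a minimal sketch of the de-personalisation outcome above, direct identifiers can be replaced with a salted hash before the data leaves the value creation process. The field names and salt here are illustrative assumptions, not a prescribed scheme:

```python
import hashlib

def pseudonymise(record, pii_fields, salt):
    """Replace direct identifiers with a salted SHA-256 hash,
    leaving the analytical fields untouched."""
    out = dict(record)
    for field in pii_fields:
        if field in out:
            digest = hashlib.sha256((salt + str(out[field])).encode()).hexdigest()
            out[field] = digest
    return out

row = {"customer_email": "jane@example.com", "order_total": 42.0}
safe = pseudonymise(row, pii_fields=["customer_email"], salt="per-dataset-salt")
```

Because the hash is deterministic for a given salt, the same customer still joins consistently across data sets without exposing the raw identifier.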

There are tools and techniques that help achieve the outcomes listed above, e.g. using Statistical Process Control (SPC) techniques to continuously monitor samples of data flowing through the pipelines and ensure they have the desired quality before they reach the aggregation points.
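As a minimal illustration of the SPC idea, a pipeline metric such as daily row count can be checked against control limits derived from its own history. The metric, the sample values and the three-sigma rule are assumptions for the sake of the sketch:

```python
import statistics

def control_limits(samples, sigmas=3):
    """Compute mean +/- N-sigma control limits from historical samples."""
    mean = statistics.mean(samples)
    sd = statistics.stdev(samples)
    return mean - sigmas * sd, mean + sigmas * sd

def in_control(value, limits):
    """True if the new observation falls inside the control limits."""
    lower, upper = limits
    return lower <= value <= upper

# e.g. daily row counts observed for a pipeline over the past week
history = [1000, 1020, 980, 1010, 995, 1005, 990]
limits = control_limits(history)

in_control(1003, limits)  # a typical batch passes
in_control(4000, limits)  # an anomalous batch is flagged before aggregation
```

A batch that falls outside the limits would be held back or quarantined rather than flowing on to the aggregation points.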

In the product creation process, you should aim for the following outcomes:

  • Obtain regular feedback from the product’s users during the build of the data product itself
  • Constantly collaborate with the data engineers from the value creation process to ensure the data you are using is adequate
  • Treat your data product as you would any other piece of software. Use version control, and describe as much in code as possible (e.g. infrastructure as code and configuration as code)
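To sketch the "treat your data product as software" point: a transformation can live in version control with a plain test right beside it, so CI catches regressions before they reach a report. The record fields here are illustrative assumptions:

```python
def clean_order(raw):
    """Normalise a raw order record before it feeds a report."""
    return {
        "order_id": str(raw["order_id"]).strip(),
        "amount": round(float(raw["amount"]), 2),
        "currency": raw.get("currency", "AUD").upper(),
    }

# A test like this is versioned next to the pipeline code, so any change
# to clean_order is exercised by CI before it ships.
def test_clean_order():
    cleaned = clean_order({"order_id": " 42 ", "amount": "19.999"})
    assert cleaned == {"order_id": "42", "amount": 20.0, "currency": "AUD"}

test_clean_order()
```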


Add in a good dose of monitoring and DevOps

As depicted in the diagram above, the data flowing through the pipelines is itself continuously monitored. Changes made to data in source systems that don’t meet your data quality rules can be caught before they make it into a report.
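One way to sketch such a quality gate is a small rule-driven check that quarantines failing rows before they are loaded. The rules and field names below are illustrative assumptions, not a recommended rule set:

```python
def validate(rows, rules):
    """Split a batch into rows that pass every rule and rows that fail any."""
    passed, quarantined = [], []
    for row in rows:
        if all(rule(row) for rule in rules):
            passed.append(row)
        else:
            quarantined.append(row)
    return passed, quarantined

rules = [
    lambda r: r.get("customer_id") is not None,     # mandatory key present
    lambda r: 0 <= r.get("amount", -1) <= 100_000,  # amount in a plausible range
]
batch = [
    {"customer_id": 1, "amount": 25.0},
    {"customer_id": None, "amount": 10.0},  # fails the mandatory-key rule
]
ok, bad = validate(batch, rules)
```

Quarantined rows can then be surfaced to the source-system owners instead of silently corrupting a downstream report.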


Using the Cloud to kickstart your DataOps journey

Serverless data tools, which are fully-managed products, allow data engineering teams to rapidly iterate on new or changed data pipelines. Cloud providers manage the availability, scalability and resilience of these tools. Additionally, the setup and configuration of the tools can be described as code, which makes them repeatable. This in turn increases the rate of innovation and decreases the time to create large volumes of trustworthy data.

Storing large amounts of data in a secure, robust and scalable manner is greatly simplified with cloud storage products. Cloud storage provides seamless access to your data via standardised interfaces and, in some cases, even allows you to query your data sets without loading them into a specialised data store such as a relational database. Cheap and plentiful storage gives teams building data products the freedom to experiment with a private copy of their data and to get early feedback from the users of those products.


Conclusion

DataOps is a methodology that lies at the intersection of Agile, DevOps and data disciplines. It allows your data team to stay lean and keep up with the ever-changing requirements of running a data-driven business, without sacrificing the quality of your data products. Implementing DataOps in your organisation requires a serious shift in mindset when approaching data problems, but it’s one that will ultimately give your business a key competitive advantage.


Note: A shout out and thanks to former Cevon James Strain for his contribution to this piece.