Harnessing Data Observability + DataOps + FinOps for Healthy Data Adoption!

Jul 26, 2024 | Sanjay Agrawal

What do the following have in common?

  • At one company, a service in the cloud data warehouse went rogue and unexpectedly created 4,000 tables.
  • At another company, a test automation glitch caused a sudden usage spike in one of its test Snowflake warehouses, with queries running unexpectedly for hours.
  • At a third, an external vendor that the company relied on made a breaking change to the data, leaving the company's own critical customer-facing table with only 10% of the expected data.

Data Operations – Every Day is a Brand New Challenge

On the surface, these appear to be disconnected: three different issues at three different companies. However, left undetected, each one has the power to single-handedly disrupt the company's data environment.

The first is an unexpected metadata blow-up: the 4,000 tables could easily have multiplied tenfold by the time the team realized it was happening.

The second is a budget leak worth tens of thousands of dollars. It would have derailed the company's data budget had it gone undetected, or had the team waited until the next bill to find out.

The third is a classic data quality crisis that would have put the company's reputation at stake in front of its own customers.

Common thread

These scenarios represent daily challenges faced by data engineers. They build new data pipelines while dealing with unexpected issues across the stack. Often, they encounter problems in parts of the stack beyond their control, requiring reliance on external parties for smooth operations.

The Bigger Challenge

Data engineers do not have the luxury of time to define all the data and resource checks needed to capture the dynamic and complex nature of their data stack. On top of that, they have neither the budget to buy a different solution for each such problem nor the bandwidth to build one.

To put this into perspective, building such a solution in house for production is no small task. It needs a sustained, long-term commitment to making data operations efficient. That in turn requires deep expertise in building distributed systems and large-scale AI models for the space, strong SQL knowledge, and an in-depth understanding of the underlying cloud data warehouse architecture in order to know what to optimize. And that skill set is just the start. Then comes building the consumption layer, staffing an on-call rotation to maintain the system, and gathering requirements so that it keeps adapting to changing data and internal customer needs.

In short, they need a million-dollar team!

Emerging Trend in Cloud Data Operations

As companies advance their cloud data warehouse adoption, we see a critical need for the following with respect to data operations.

1. Merging FinOps and DataOps responsibilities

CxOs and data leaders are rightfully wary of cloud data warehouse spend that is spiraling out of control. That's not a surprise: teams brought the traditional capex mindset with them when they followed the company's mandate to transition to the cloud. Three to four years into their cloud adoption journey, and sometimes sooner, companies are discovering that the capex mindset and old data practices don't work in the cloud.

When was the last time a business asked its data teams to remove a data pipeline because they were no longer using it? It's a rhetorical question. Of course the answer, without exception, is NEVER.

In other words, there is no exit strategy for data and compute. Consumption-based pricing requires a strong data and compute exit strategy for healthy data adoption. Otherwise, it's a one-way road to increasing spend.

The reality is that FinOps teams cannot optimize spend in isolation; they need to deeply understand the usage of the data to make meaningful business decisions. Likewise, data teams need to understand the budget risks when they onboard new data sets, and to identify trends so they can make the necessary trade-offs between the cost of delivering data to the business and the current dollar impact of that data.

Every cloud data warehouse (CDW) has a unique pricing structure (for example, Snowflake compute spend depends not only on query latency but also on how queries are distributed over time), and enterprises typically have more than one CDW. Together, these make it very difficult for teams to maximize the overall ROI of their data investment across such an expansive data footprint.
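To make the Snowflake example concrete, here is a minimal sketch of why the distribution of queries over time matters. It relies on two documented Snowflake behaviors: warehouses are billed per second while running, with a 60-second minimum each time a warehouse starts or resumes. Everything else is an assumption for illustration: the `Query` type, the `billed_seconds` helper, the X-Small rate of 1 credit per hour, and the simplified auto-suspend model (the warehouse suspends exactly `auto_suspend` seconds after its last query ends). Real bills depend on more than this.

```python
from dataclasses import dataclass

CREDITS_PER_HOUR = 1.0   # X-Small warehouse; chosen for illustration
MIN_BILLED_SECONDS = 60  # minimum billed each time the warehouse resumes


@dataclass
class Query:
    start: int     # seconds since an arbitrary reference time
    duration: int  # seconds


def billed_seconds(queries, auto_suspend=60):
    """Billed warehouse seconds under a simplified resume/suspend model.

    Assumption: the warehouse resumes when a query arrives while it is
    suspended, and suspends `auto_suspend` seconds after the last query ends.
    """
    total = 0
    run_start = None  # when the current running period began
    busy_until = 0    # when the last query in the current period ends
    for q in sorted(queries, key=lambda q: q.start):
        q_end = q.start + q.duration
        if run_start is None:
            run_start, busy_until = q.start, q_end
        elif q.start <= busy_until + auto_suspend:
            # Warehouse is still running: the query extends the period.
            busy_until = max(busy_until, q_end)
        else:
            # Warehouse had suspended: close the period and resume.
            total += max(busy_until + auto_suspend - run_start, MIN_BILLED_SECONDS)
            run_start, busy_until = q.start, q_end
    if run_start is not None:
        total += max(busy_until + auto_suspend - run_start, MIN_BILLED_SECONDS)
    return total


def credits(seconds):
    return seconds / 3600 * CREDITS_PER_HOUR


# Twelve 10-second queries: the same 120 seconds of actual work, two schedules.
batched = [Query(start=i * 10, duration=10) for i in range(12)]
spread = [Query(start=i * 600, duration=10) for i in range(12)]

print(billed_seconds(batched), billed_seconds(spread))
```

Under these assumptions the back-to-back schedule bills 180 seconds while the spread-out schedule bills 840 seconds, because every resume pays for idle time before the warehouse suspends again. Query latency alone would not reveal that difference.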

2. Expanding Data Observability to Spend and Actions

Gartner called out the four critical features of data observability that extend all the way from monitoring and detection to resolution and better prevention.

This holistic, end-to-end approach is the right way of looking at the space from the vantage point of data teams.

Today’s state-of-the-art data observability offerings fall remarkably short. Previously, we shared how, instead of taking on the really hard problem of automatically surfacing “unexpected” data quality issues for data engineering teams, current vendors in this space primarily focus on the easiest piece of the pie: leading with manual data quality checks. This fundamentally limited approach has not worked well in practice.

Indeed, it was not a surprise when, during our discussions with data practitioners, they questioned the ROI of such data observability systems repeatedly, asking: How is the system really helping us reduce our own effort if we have to define and manage everything?

There’s an imperative to automatically connect the dots between quality, spend, performance and usage

Spend and data quality: yin and yang of healthy CDW adoption. For one of our Google BigQuery customers, a large public company, business teams were seeing bad data. Why? Because the load failed due to resource usage being above capacity. Why? There was a query running at the time, consuming a large fraction of the company-wide resources. Getting rid of the offending query: Priceless.

Powering the 5-whys. If the focus had been purely on alerting people to the underlying data quality issue, the obvious conclusion would have been to take the problem to the CFO and ask for more capacity (in other words, more dollars). Instead, the data team was empowered to work through the 5 whys: the journey started at data quality and drilled down to query performance and resource utilization. Once the data team was connected with the right business team, the outcome was magical: the user removed the query that was the real root cause of the problem, saving time and money.

Impact: $0.5M worth of slot capacity.

This was all possible because the data engineering team got a holistic picture of the problem that spanned quality, usage and spend. Without the ability for the data teams to look at these together, companies waste dollars and time, overrun budgets and continue to suffer from organically growing bad practices.
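One step in that 5-whys chain, going from a failed load to the query that was consuming capacity at the time, can be sketched as a simple overlap check. This is an illustration only, not how any particular product or BigQuery itself works: the `Job` record, the `overlapping_heavy_queries` helper, the `min_share` cutoff and all sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    start: int              # seconds since an arbitrary reference time
    end: int
    avg_slots: float = 0.0  # average slots consumed while running (hypothetical field)


def overlapping_heavy_queries(failed_load, queries, capacity_slots, min_share=0.5):
    """Return (job_id, share_of_capacity) for queries that ran during the
    failed load and used at least `min_share` of total capacity, largest first."""
    suspects = []
    for q in queries:
        overlaps = q.start < failed_load.end and failed_load.start < q.end
        share = q.avg_slots / capacity_slots
        if overlaps and share >= min_share:
            suspects.append((q.job_id, share))
    return sorted(suspects, key=lambda s: s[1], reverse=True)


# Hypothetical sample data: a load that failed between t=100 and t=160.
load = Job("load_orders", start=100, end=160)
queries = [
    Job("q_adhoc", start=50, end=400, avg_slots=1600),   # heavy and overlapping
    Job("q_small", start=110, end=120, avg_slots=50),    # overlapping but light
    Job("q_earlier", start=0, end=90, avg_slots=1900),   # heavy but finished earlier
]

print(overlapping_heavy_queries(load, queries, capacity_slots=2000))
```

In this sample only `q_adhoc` is flagged. The point is that the answer comes from joining quality signals (the failed load) with usage and spend signals (slot consumption), which a quality-only alert cannot do.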

We strongly believe that healthy data adoption requires two things: out-of-the-box data observability augmented with recommendations and actions, the “end goal” Gartner outlines above, and seamless 5-whys across the pillars of quality, spend, usage and performance. Only then are data teams empowered to deliver the right data at the right time and at the right cost to the business.

3. Leading with Automation

Previously, we talked about my co-founder Shashank’s experience at Meta, where he led the data quality and observability effort with an automation-first approach (1.5M+ tables, 10K+ data practitioners). With automation, data quality coverage went from single digits (~7%) to the entire CDW.

Data teams simply don’t have the time to define and manage checks as the underlying data ecosystem continues to evolve and change. The key empowerment to these teams comes from a maniacal focus on automation. They need a system that automatically selects what, how and when to monitor on their behalf.
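As a rough illustration of what "automatic" can mean here, the sketch below derives its alert threshold from a metric's own recent history instead of from a hand-written rule. It uses a robust z-score based on the median and the median absolute deviation (MAD). This is a generic statistical technique chosen for the example, not a description of any vendor's method; the `is_anomalous` name, the threshold of 5 and the sample numbers are assumptions.

```python
from statistics import median


def is_anomalous(history, latest, threshold=5.0):
    """Flag `latest` if it is far from the median of `history`,
    measured in robust (MAD-based) standard deviations."""
    med = median(history)
    mad = median(abs(x - med) for x in history)
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation for normal data
    if scale == 0:
        # History is flat: any change at all is unusual.
        return latest != med
    return abs(latest - med) / scale > threshold


# Hypothetical daily row counts for a customer-facing table.
row_counts = [1000, 1020, 990, 1010, 1005, 995, 1015]
print(is_anomalous(row_counts, 100))   # load arrives with ~10% of expected rows
print(is_anomalous(row_counts, 1008))  # an ordinary day

# Hypothetical daily table counts in a schema.
table_counts = [400, 402, 401, 403, 405, 404, 406]
print(is_anomalous(table_counts, 4406))  # 4,000 tables appear overnight
```

The same function covers a row-count drop and a table-count explosion without anyone defining a check per table, which is the property that lets coverage scale to an entire CDW. A production system would also need to handle seasonality, trends and sparse history.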

Only through such automation can teams get coverage for the entire CDW – be it for quality, performance, spend or usage. That, in turn, provides data engineering teams with the confidence that the system has their back, which is key to adoption of such a technology.

Conclusion

Be respectful of data teams’ time. They simply don’t have time to define and chase every issue. Understand and address the real problems that these teams are facing, stop the out-of-control spending, and set up your entire business for healthy data adoption.

Healthy data adoption falls within the purview of data engineering, but it's not one problem. Instead of creating more silos and layering more observability on top of them, now is the time to cater to the data engineering persona and not to one isolated problem!

Book a demo to get started!

Article written by Sanjay Agrawal, CEO and Co-founder of Revefi