The Product side of Modern Data Stacks

Olalekan Elesin
6 min read · Mar 18, 2022

This post covers how to measure the success of internal data platforms as products, with a strong emphasis on business outcomes. It is written for IT leaders, specifically CDOs, VPs, or Directors of Data, and is also applicable to technical practitioners tasked with creating, supporting, or using production-grade internal data platforms as products.

Background

More often than not, I see data platform teams focus on metrics like:

  1. “total size of our data lake in terabytes or petabytes”
  2. “X number of queries per timeframe (daily/weekly/monthly)”
  3. “streaming Y billion events per timeframe (daily/weekly/monthly)”
  4. “we have Z data domains (data lakes) in our data mesh”

In fairness, these are good starting points, but not where to stop. If internal data platforms are truly to be considered products, then what is measured should be indicative of the platform’s outcomes: raising the game with data-informed decision making (including AI/ML), not the number of trending open-source data frameworks we can put together on our architecture canvas.

In the remainder of this post, we will look at 3 important metrics to track the impact of your internal data platform on your organization.

3 DATA PLATFORM IMPACT METRICS

1. Data Usage Index

Growing a data lake from 0 to petabytes of data is no different from growing page views without user engagement on a user-generated content (UGC) platform. One could easily argue that a data lake is analogous to a UGC platform, with the content being data generated by humans or machines. As such, the data lake becomes more valuable with more data ingested. However, the data becomes valuable when it is used, not when it is stored. This premise is what many data platform teams and platform leads miss (recruiters do not make this better either).

The mission of data platform teams is not to amass large amounts of data. The mission is to enable organizations to leverage one of their most strategic assets, data, for decision making at petabyte scale. Growing the data lake to petabytes is one output of this; enabling data-informed decision making at petabyte scale is the outcome. Hence, focus on the outcome.

In practice, data platform teams can measure this outcome by dividing the amount of data scanned by the size of the data lake.

data_usage_index = amount_of_data_scanned_gb / data_lake_size_gb
Figure 1: Measuring Data Usage

This will give you an idea of whether or not your organization is making data-informed decisions at petabyte scale. Example: if your data lake size is 50TB and the amount of data scanned is 10GB, then your data usage index is 0.0002. This clearly indicates that your data platform is doing well on the supply side, i.e. data producers, but little on the demand side, data consumers. It may also mean that your data platform only caters to reporting use cases, and advanced analytics workloads are not being built with your ‘modern data stack’.
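As a minimal sketch, the index can be computed from query logs. The record field `data_scanned_gb` and the function name are illustrative assumptions, not the schema of any specific query engine:

```python
def data_usage_index(query_logs, data_lake_size_gb):
    """Ratio of data scanned over a period to the total data lake size.

    `query_logs` is assumed to be a list of per-query records, each with
    a `data_scanned_gb` field (a hypothetical schema for illustration).
    """
    scanned_gb = sum(q["data_scanned_gb"] for q in query_logs)
    return scanned_gb / data_lake_size_gb


# 10 GB scanned against a 50 TB (50,000 GB) data lake -> index of 0.0002
logs = [{"data_scanned_gb": 4.0}, {"data_scanned_gb": 6.0}]
index = data_usage_index(logs, data_lake_size_gb=50_000)
print(round(index, 4))
```

In practice the `query_logs` list would come from your engine's audit logs aggregated over the measurement window, rather than being built by hand.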

The goal is to drive this metric to 1 or slightly above 1. As a data platform leader, one of your key objectives is to increase data-drivenness through the adoption of the technology capabilities in your data platform, not just the shiny architecture called ‘The Modern Data Stack’.

PS: Depending on the structure of your data lake (or data swamp/ocean), you can slice this metric by teams, business units, or cost centers.

2. Query execution time by data size scanned

Monitoring query execution time is important to understand the performance of the query engine(s) in your ‘Modern Data Stack’ that are available to your users. However, this metric only tells you how long queries run, not why, leaving us in the dark about potential under-served needs that the query engines might not be covering. Therefore, rather than measuring X number of queries or query execution time across percentiles, focus on query execution time per data size scanned.

This metric shows the efficiency of the query engines provisioned over your data lake. We see this being used in comparing the performance of data warehouse solutions, e.g. Databricks vs Snowflake.

Source: Databricks Website — Snowflake Claims Similar Price/Performance to Databricks, but Not So Fast!

More than a measure of query engine efficiency, it can serve as a proxy for user experience and inform non-functional improvements to your platform. Example: given 2 SQL queries scanning 5TB each, with query A executing in 200ms and query B in 5ms, it is clear that there is an under-served need to improve the data platform user experience. Depending on the query engine used, the solution might range from automated data format conversion and data compaction mechanisms to SQL trainings, new data lake query engines, etc.
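A hedged sketch of how this normalization might look, again assuming hypothetical log records with runtime and scan-size fields (no real engine's schema is implied):

```python
def execution_time_per_tb(queries):
    """Return (query_id, seconds per TB scanned) for each query record.

    Each record is assumed to carry `id`, `runtime_s`, and `data_scanned_gb`
    fields -- an illustrative schema, not a specific engine's log format.
    """
    results = []
    for q in queries:
        tb_scanned = q["data_scanned_gb"] / 1024  # GB -> TB
        results.append((q["id"], q["runtime_s"] / tb_scanned))
    return results


# Two queries scanning 5 TB each, with very different runtimes (200 ms vs 5 ms)
queries = [
    {"id": "A", "runtime_s": 0.2, "data_scanned_gb": 5 * 1024},
    {"id": "B", "runtime_s": 0.005, "data_scanned_gb": 5 * 1024},
]
for query_id, seconds_per_tb in execution_time_per_tb(queries):
    print(query_id, round(seconds_per_tb, 4))
```

Tracking this ratio over time, and comparing it across tables or file formats, is what surfaces the kind of under-served needs described above.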

3. Data Mesh Domains Collaboration Index

According to Starburst.io, Data mesh is defined as:

Data mesh is a new approach based on a modern, distributed architecture for analytical data management. It enables end users to easily access and query data where it lives without first transporting it to a data lake or data warehouse. The decentralized strategy of data mesh distributes data ownership to domain-specific teams that manage, own, and serve the data as a product.

Data mesh presents a new logical view of the technical architecture and organizational structure. However, if implemented wrongly, its decentralized nature may recreate the problems centralized data lakes were meant to solve: data silos. Hence, it is important that data platform leaders (probably the office of the Chief Data Officer) assess their organizations’ maturity, actively evangelize, and drive collaboration across organizational domains rather than pushing architectural redesigns based on the latest “data tech trends & buzzwords”.

Measuring collaboration across data mesh domains is one of the strongest indicators of organizational collaboration and data usage maturity. Assuming an organization has 3 data domains (A, B, and C), data platform leaders should look at the number of queries from domain A accessing data in domains B and C, and repeat this for all domains.

domain_A_collab_index = number_of_queries_from_domain_A_accessing_data_in_domains(B, C)
Figure 2: Data Access Distribution by Domain to measure collaboration in Data Mesh Architecture

This effectively lets data platform leaders (the CDO’s office) know whether or not the strategic decision to adopt data mesh is actually optimizing for time-to-analytics/insights. The technical implementation requires a bit of instrumentation, but it is doable with the majority of query engines: Amazon Athena query logs, Presto/Trino query event listeners, Apache Spark listeners, etc.
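The counting described above can be sketched as follows. The log record fields (`from_domain`, `table_domains`) are assumptions for illustration, not the schema of any particular engine's query logs:

```python
from collections import defaultdict


def collaboration_index(query_logs):
    """Count, per querying domain, the queries that touch tables owned
    by at least one other domain.

    Each record is assumed to have `from_domain` (the domain issuing the
    query) and `table_domains` (owning domains of the tables it reads) --
    a hypothetical shape you would derive from your engine's query logs.
    """
    cross_domain_queries = defaultdict(int)
    for q in query_logs:
        if any(owner != q["from_domain"] for owner in q["table_domains"]):
            cross_domain_queries[q["from_domain"]] += 1
    return dict(cross_domain_queries)


logs = [
    {"from_domain": "A", "table_domains": ["B"]},        # A reading B: cross-domain
    {"from_domain": "A", "table_domains": ["A"]},        # A reading A: not counted
    {"from_domain": "B", "table_domains": ["A", "C"]},   # B reading A and C
]
print(collaboration_index(logs))
```

Normalizing these counts by each domain's total query volume would give a comparable index across domains of different sizes.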

Rather than measuring “Z data domains (data lakes) in our data mesh”, focus on measuring how data mesh users are collaborating across domains.

Other Metrics

I refer to these as other metrics not because they are unimportant, but because they have been covered extensively in other texts:

  1. Measure your data platform team’s software delivery performance i.e., Accelerate Metrics.
  2. Measure your data platform team’s self-service capabilities.
  3. Measure data quality and pipeline SLAs.

Conclusion

Companies setting up, or already running, internal data platforms need to make sure their teams’ objectives are based on a clear product mindset, measurable and indicative of strategic impact. If this is not the case, start today: not as a feature team, which is usually the case when the team only has engineers, but as an empowered product team.

If you enjoyed this article, please tap the claps 👏 button and share on social media.

Interested in learning more about Olalekan Elesin or want to work together? Reach out through LinkedIn.

Bonus: My thoughts on Data Mesh

As much as I see the data mesh approach gain acceptance, I have some concerns:

  1. Like the monolith vs. micro-services argument, not all organizations MUST start out with data mesh. If you do not have the organization set up for a data mesh architecture, you don’t need to have it yet.
  2. Central data lakes were meant to remove the data silos. If your organization still struggles with siloed data problems, my view is that a data mesh might not be the solution.
  3. Let’s face it: data management is hard, and it requires people and skills. If your organization has tons of technical debt in its core technology or constrained head count, my strong recommendation is to invest in what keeps your business operational and profitable.


Olalekan Elesin

Enterprise technologist with experience across technical leadership, architecture, cloud, machine learning, big-data and other cool stuff.