In 2019, the distributed data mesh was proposed as a route for companies to leverage data at scale. The next-generation enterprise data platform architecture was described as the convergence of distributed domain-driven architecture, self-serve platform design, and product thinking with data. One of the central ideas of this paradigm is treating “domain data” as a product, served through a data mesh architecture.
However, for enterprises with large centralized IT organizations, distributing the data platform into domains can be a long process, or even unthinkable. For these organizations, building data products without the “Mesh” is more practical and doable.
In the paragraphs that follow, the concept of a data product is borrowed and adapted from Zhamak Dehghani while leveraging existing data lake and data warehouse architectures.
Central to building data products is identifying their lifecycle. Taking the software development life cycle (SDLC) as a reference, the data product lifecycle has similar stages, including requirements gathering and feasibility study, data pipeline development, testing, deployment, and continuous improvement.
The data product development process can be further elaborated to support four kinds of products: exposed raw or transformed data, operationalized insights, exposed analytics models, and automated decisions.
To take the mystery out of the composition of a data product, its anatomy is illustrated below: input data (which can come in multiple formats), the code or data pipeline that delivers the output data (which can also come in multiple formats), and the environment where these three components reside.
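As a minimal sketch of this anatomy (not a prescribed implementation), the components can be expressed as a small Python structure; the class, field, and port names below are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical model of the anatomy described above: input ports, the
# code/pipeline that produces the outputs, output ports, and the
# environment in which these components reside.
@dataclass
class DataProduct:
    name: str
    input_ports: List[str]        # e.g. flat files, APIs, database connections
    pipeline: Callable[[Dict[str, str]], Dict[str, object]]  # transformation code
    output_ports: List[str]       # e.g. API, file extract, trigger, visualization
    environment: str = "on-premise"  # or a cloud environment
    metadata: Dict[str, str] = field(default_factory=dict)   # ownership/governance tags
```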
A data product can have different data inputs (some are flat files, some are APIs, some are direct database connections), which are transformed into the needed outputs through business rules or computational algorithms. In mature use cases, outputs are delivered through different “output ports” to maximize their value. Data product outputs, or “output ports,” can take the form of an API, file-based extracts, automated triggers or decisions, or visualizations.
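Continuing the hypothetical sketch above, a single transformation might read from a flat-file input port and publish the same result through more than one output port; the source file, business rule, and port names are illustrative assumptions.

```python
import csv
import json

# Illustrative business rule: keep only high-value records from a flat-file input port.
def high_value_sales(inputs):
    with open(inputs["sales_csv"], newline="") as f:
        return {"high_value": [r for r in csv.DictReader(f) if float(r["amount"]) > 1000]}

sales_product = DataProduct(
    name="high-value-sales",
    input_ports=["sales_csv"],
    pipeline=high_value_sales,
    output_ports=["file_extract", "api"],
)

outputs = sales_product.pipeline({"sales_csv": "sales.csv"})

# Output port 1: a file-based extract for downstream consumers.
with open("high_value_sales.json", "w") as f:
    json.dump(outputs["high_value"], f)

# Output port 2 (not shown): the same payload served via an API,
# an automated trigger, or a visualization.
```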
This anatomy also surfaces the required skills of data product developers. The development team should have members with expertise in governance, infrastructure, data ingestion, data transformation, machine learning, and end-to-end data pipelines. An effective data products team should combine these skills around an infrastructure, whether in the cloud or on-premise. While there is much hype about moving everything to the cloud, data product teams should stay flexible in their development process, deployment, and continuous improvement.
In addition, data product owners who hail from the business side of the enterprise should keep this anatomy in mind. Unlike traditional software development, where feature requests can be broken down into small increments for agile execution, this may not be as straightforward in data product development.
In place of a data mesh infrastructure, the data lake and data warehouse can be organized according to the domains of the data being stored. This makes domain-centric data products easier to build. Cross-domain data products then become a matter of policy and governance.
Without redistributing existing centralized data lakes and data warehouses, domain-centric data products can be delivered through a carefully designed organization of data. Hive can be used as a data virtualization layer to organize data per domain, as sketched below.
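One possible illustration, assuming a Spark-with-Hive setup and invented domain, path, and column names, is to expose existing lake files as per-domain Hive databases and views without physically moving the data.

```python
from pyspark.sql import SparkSession

# Sketch: group existing data lake files into per-domain Hive databases and views.
spark = (SparkSession.builder
         .appName("domain-oriented-views")
         .enableHiveSupport()
         .getOrCreate())

# One Hive database per business domain (names are illustrative).
spark.sql("CREATE DATABASE IF NOT EXISTS sales")
spark.sql("CREATE DATABASE IF NOT EXISTS customer")

# External table over files already sitting in the centralized lake;
# the data itself is not moved or copied.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS sales.orders (
        order_id STRING, customer_id STRING, amount DOUBLE, order_date DATE
    )
    STORED AS PARQUET
    LOCATION '/datalake/raw/orders'
""")

# A domain-centric view serving as a simple data product output port.
spark.sql("""
    CREATE VIEW IF NOT EXISTS sales.high_value_orders AS
    SELECT * FROM sales.orders WHERE amount > 1000
""")
```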
With data becoming ubiquitous in the enterprise, proper definition of a data product, its lifecycle, and its development process should now be part of enterprise processes. In doing this, the real value of data can be measured through the value of data products, governance can be put in place (while still supporting individual experimentation), and opportunities for monetization can be discovered.