The concept of a Data Product, not least due to the interest in the Data Mesh framework, has become a mantra in modern data management. Data professionals cite it, with different meanings and in different contexts, as one of the pillars of a data-driven organization.
Yet how does it differ in practice from a simple dataset? What are its distinctive characteristics and practical applications? How exactly does one create and use a data product?
The generic concept of a product
Let us start with the metaphor the definition suggests. A data product is a product whose object is a unit of data-based knowledge (reports, databases, outputs of an ML/AI model, etc.).
Generally speaking, a product is a good or a service that satisfies consumers’ needs. It exhibits a number of characteristic attributes.
- Satisfaction of a need: in a marketing-oriented vision, the product is not only a useful good but also a means of communication between those who produce it and those who consume it. A necessary but not sufficient prerequisite for the success of an enterprise is that the offer must constantly be in tune with the consumers’ needs. The ability to intercept, if not anticipate, potential consumers’ needs or interests and satisfy them by offering the appropriate products is a critical factor for product/service dissemination.
- Usefulness and usability: the product must be consumable by potential customers. Usually, the product is presented in a catalog that describes its main characteristics, potential uses, and distinctive features. It is also often accompanied by instructions that indicate its main features, use modes, any prerequisites for its use (for example, tools necessary to assemble a piece of furniture), and warnings.
- Guarantee: the vendor’s brand, certification marks (Bio, ESG, ISO, etc.), and coverage against possible defects or malfunctions for a certain period of time. Sometimes even the description of the raw materials and the process of making the product fosters a perception of quality.
All these characteristics, and others related to the general concept of a product, apply to data products as well, taking their specifics into account.
Since the object of a data product is data-based information, the characteristics above, in the specific area of data products, consist of metadata: ownership (the vendor), quality (the warranty), potential intended use (the mode of use), data lineage (the origin), personal data security and protection constraints (warnings) and so on.
What is a Data Product, and what is it for?
In light of the above, to get into the substance of specific characteristics of a data product, we should refer to two of the four founding principles of the Data Mesh framework, even though the concept of data product also applies in other contexts.
A Data Product is a collection of data accompanied by the code necessary for its consumption and metadata describing its characteristics.
- Domain Ownership Principle – In the Data Mesh, data is organized into Domains. A Domain contains data that is homogeneous with respect to certain criteria (origin, aggregation, consumption) and is owned by an interdisciplinary (business and ICT) team of actors from the sphere of jurisdiction of that data. The concept of a Data Domain is inspired by DDD (Domain-Driven Design), a model for software development. Organizing by Data Domains aims to assign data management to those who have the best knowledge of the data, thus reducing the data entropy that appears when data is concentrated in an environment managed by people without detailed knowledge of the individual areas, unlike those who work directly in the field. However, by applying the Domain Ownership principle alone, we risk achieving nothing more than breaking the data down into silos. Left uncounterbalanced, this situation would create issues with integration and overall consistency.
- Data as a Product Principle – This is where the second principle comes up. It indicates how individual domains interact to ensure smooth and efficient management of business processes. As already mentioned, the Data Mesh framework is oriented to managing analytical data resulting from operations of aggregation, integration, and quality checks. However, it is worth specifying that this data may have many different purposes. It can be used for analysis and research (also by applying artificial intelligence and machine learning techniques and technologies), internal or external reporting (for example, for complying with regulatory reporting requirements), as well as support operational processes downstream of those of origin. A Data Product is a collection of data along with the code necessary for its consumption and metadata that describes its characteristics (content, precision and accuracy, freshness, sources, use modes, ownership, etc.). A Data Product is created within a Data Domain from data originating from inner operational processes and/or other Data Products and intended for consumption by this or other Data Domains. A set of Data Products is registered in a central catalog and serves as a communication network that ensures adequate integration and interoperability between the Domains. A Data Product must meet a minimum set of criteria to be considered as such.
Let us briefly consider these criteria.
- Easily searchable: It must be easy to identify a data product. This can only be attained by using metadata (such as meanings, contexts, properties, origin source, data path, etc.). The centralized search service allows data consumers to find a dataset of interest with ease. Therefore, each data product comes along with metadata that facilitates its discovery.
- Usable: Once identified, a data product should have a delivery method according to a global convention that helps users access it systematically. Aiming at ease of use, each data product must include a delivery method to ensure its complete usability in line with the corporate standards for accessibility and compliance rules.
- Accountable and accurate: Since nobody would use data they don’t trust for decisions or processing, data product owners must ensure their artifact is accompanied by metadata describing source reliability and the extent to which it reflects verified events, as well as the probable reliability of the knowledge produced in processing and transformations. To deliver an acceptable level of quality, one should apply techniques such as data cleansing and automated verification of data integrity at the moment of data product creation. Associating each data product with its provenance and data path increases consumers’ confidence in it and in its suitability for the intended use.
- Understandable: Quality products require no help from the manufacturer: they can be discovered, understood, and consumed autonomously. Offering datasets as products that data engineers and data scientists can use with minimal friction requires well-described data semantics and syntax. Ideally, sample datasets should accompany them as examples.
- Interoperable and compliant with global standards: There must be interoperability between the domains using a data product. This feature can only be achieved through centralized standardization methods that allow producing knowledge while respecting common classification and contextualization rules.
- Secure and regulated by global access control: Secure access to data products is a must, whether the architecture is centralized or not. For data products, access control must be applied at the highest level of granularity, i.e., for each product. Similarly to operational domains, it is possible to define access control policies centrally but apply them at the moment of access to each data product.
- Intrinsic value: Like any product, a data product must also carry business metadata that facilitates the perception of its enterprise value (intended use, business processes where it is used, etc.), regardless of the model used for value estimation (cost mark-up, sales/exchange value, etc.).
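The criteria above translate naturally into a machine-readable descriptor attached to each data product. A minimal sketch, assuming illustrative field names (no standard schema is implied by the article):

```python
from dataclasses import dataclass, field

@dataclass
class DataProductDescriptor:
    """Illustrative metadata descriptor covering the criteria above."""
    name: str
    owner: str                                    # accountable domain team
    description: str                              # semantics: what the data means
    output_port: str                              # delivery method, e.g. "sql", "rest"
    quality: dict = field(default_factory=dict)   # e.g. {"completeness": 0.98}
    lineage: list = field(default_factory=list)   # upstream sources / data products
    access_policy: str = "restricted"             # access control applied per product
    intended_use: str = ""                        # business value context

    def is_certifiable(self) -> bool:
        """A product may enter the catalog only if the minimum metadata is present."""
        return all([self.name, self.owner, self.description,
                    self.output_port, self.lineage])

customers = DataProductDescriptor(
    name="customer-360",
    owner="crm-domain",
    description="Consolidated customer profiles, refreshed daily",
    output_port="sql",
    quality={"completeness": 0.98},
    lineage=["crm.operational.accounts", "billing.invoices"],
)
print(customers.is_certifiable())  # True: all mandatory fields are filled
```

Note how each field maps back to a criterion: `owner` to accountability, `description` to understandability, `lineage` to accuracy and trust, `access_policy` to per-product access control.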
Data product application contexts
An artifact with these characteristics finds its application in many contexts.
We have already seen the fundamental role of enabling communication between Domains in a Data Mesh framework. This architectural and social data management paradigm rests, in addition to the two principles above, on those of the Self-serve Data Platform and Federated Computational Governance.
More generally, organizing data assets into data products makes it easy to extract their value for very different people within a company, including those not directly involved in data management. A Data Product Catalog lists and describes these objects. Data products become part of the catalog only after the certification and qualification process verifies they meet the above criteria. This meaning of a data product does not make it necessary to apply the other principles of the Data Mesh framework. However, it meets the need for data asset valuation.
A further step from the Data Product Catalog is the establishment of a Data Marketplace. It is an environment where users in different roles can not only view the characteristics of existing data products but activate their use (extemporaneous or periodic for updated versions) or request the creation of new data products via dedicated processes.
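The catalog and marketplace mechanics described above can be sketched generically: registration is gated on the certification and qualification process, and discovery relies on metadata search. Class and method names below are invented for illustration:

```python
class DataProductCatalog:
    """Minimal sketch of a central catalog: products are registered only
    after passing certification, then discovered via their metadata."""

    def __init__(self):
        self._products = {}

    def register(self, descriptor: dict, certified: bool) -> None:
        # The certification/qualification process verifies the product
        # meets the minimum criteria before it enters the catalog.
        if not certified:
            raise ValueError(f"{descriptor['name']} failed certification")
        self._products[descriptor["name"]] = descriptor

    def search(self, keyword: str) -> list:
        # Discovery relies purely on metadata, never on the data itself.
        return [d for d in self._products.values()
                if keyword.lower() in d["description"].lower()]

catalog = DataProductCatalog()
catalog.register({"name": "customer-360",
                  "description": "Consolidated customer profiles"},
                 certified=True)
hits = catalog.search("customer")
print([d["name"] for d in hits])  # ['customer-360']
```

A marketplace would layer activation and request workflows on top of this same registry; the gate on `certified` is what keeps the catalog trustworthy.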
Data product, data governance, and data sharing
It is easy to see that the processes managing data product creation, maintenance, and consumption must rely on a governance system that ensures their efficiency, viability, and sustainability. Without going into the detail of these aspects, we believe it is essential to highlight the value of a model organized by data products: it strengthens the role of the Chief Data Officer and enables the data sharing system that is now a market priority.
The importance of data sharing is evident from Gartner’s 2022 Chief Data Officer survey, in which 86% of participants said data sharing is essential to the success of their organization.
For potential consumers, therefore, a data product is an object that, in order to be freely exchanged, must carry all the metadata collected by the governance system, making it available in forms and modes that match users’ needs.
Despite the many advantages of data sharing, this principle may encounter significant resistance from both organizations and individuals. The metadata must help dispel concerns related to data quality, potential misuse, privacy, security, and ownership.
Why is it necessary to have adequate tools supporting data product management?
The lack of adequate tool support for efficiently managing descriptive and implementational metadata often becomes an obstacle to a data asset governance model that can unlock the data’s potential and put it at the service of business needs.
It is certainly a complicated path: it requires vision, determination, and the right choices. But it is also a path that cannot be ignored if the analysts’ predictions prove correct:
- By 2022, more than half of data & analytics services will be automated and will no longer require human intervention.
- By 2023, 30% of businesses will achieve a higher ROI on resources engaged in data & analytics governance while governing only the minimal share of their data assets relevant to their strategic goals.
Irion EDM® is a fully metadata-driven Enterprise Data Management system. This flexible tool enables and supports the evolution over time of a Data Governance system. It allows efficient implementation of different styles of governance.
Metadata plays a crucial role in Data Management solutions implemented with Irion EDM®:
- it describes the technical and business characteristics of the data and its relationships with other entities (IT assets, business units, processes, rules, and others) and thus enables the fundamental characteristics of a data product (searchability, usability, understandability, accountability, interoperability, etc.);
- it is flexible, i.e., it can adapt to represent the entities, events and phenomena relevant for a particular company, thus making data products useful in business decisions;
- it is dynamic, i.e., it can vary over time adapting to changes in business and application scopes;
- sometimes it is actuating, i.e., it drives the platform’s Data Management engines (connection to sources, Data Integration, control rule application, data enrichment, data classification, analytics, orchestration, etc.). Therefore, it ensures data product delivery according to the consumer’s needs and architectural choices;
- it is integrable, i.e., business, technical, actuating metadata can be related to each other and represented and explored in one model. Such a model supports the work of all roles engaged in business data management, such as data engineers, business analysts, data owners, data scientists.
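The idea of actuating metadata can be illustrated generically: control rules live as metadata and are interpreted by an engine at run time, so changing the rules requires no code changes. This is a hand-rolled sketch of the pattern, not Irion EDM’s actual interface:

```python
# Control rules expressed as metadata: field, check type, optional parameter.
rules = [
    {"field": "age",   "check": "not_null"},
    {"field": "age",   "check": "min", "value": 0},
    {"field": "email", "check": "contains", "value": "@"},
]

def apply_rules(record: dict, rules: list) -> list:
    """Generic engine: interprets rule metadata against a record and
    returns the list of violated rules."""
    violations = []
    for rule in rules:
        v = record.get(rule["field"])
        if rule["check"] == "not_null" and v is None:
            violations.append(rule)
        elif rule["check"] == "min" and v is not None and v < rule["value"]:
            violations.append(rule)
        elif rule["check"] == "contains" and v is not None and rule["value"] not in v:
            violations.append(rule)
    return violations

bad = {"age": -3, "email": "nobody.example.com"}
print(len(apply_rules(bad, rules)))  # 2: negative age, malformed email
```

Because the engine only interprets metadata, adding a quality check is a metadata edit, not a deployment; the same mechanism generalizes to integration, enrichment, and orchestration rules.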