From Data Lake to Data Mesh
We collect, cleanse, enrich, and index retail data using our patented data collection platform.
ClearDemand’s state-of-the-art ML system creates a retail market map, provides rich insights, and performs autonomous actions. These are exposed using our well-architected SaaS platforms (embracing 6 pillars of architecture and clean code architecture).
We host a wide array of key datasets in the retail vertical:
- Analytics/actions for global pricing and promotion
- Global product catalog and assortment
- Global brand catalog
- Global category catalog
- Inventory and availability
These retail domain datasets are rich in business value and are truly big data in nature.
What is a data lake?
A data lake is a centralized, domain-agnostic data persistence architecture that allows you to store structured and unstructured data at scale. It separates storage from compute so that each can scale to huge volumes and accommodate varied load and access patterns – all at a reduced cost.
What is data mesh?
Data mesh is an industry-leading approach to data management. It defines a clear domain-based design paradigm for grouping datasets and managing their ownership. Data mesh treats datasets as products, powered by a self-serve data platform and governed by a federated governance mechanism, to effectively scale the data operations of an analytics organization.
Data Lake vs. Data Mesh – Key Differences
- Architecture Paradigm
  - Data Lake: Centralized – all data is stored in a single, large repository
  - Data Mesh: Decentralized – data is owned and managed by domain-specific teams
- Data Ownership
  - Data Lake: Typically centralized – managed by a core data engineering team
  - Data Mesh: Typically domain-oriented – each business unit owns its data as a product
- Scalability
  - Data Lake: Scales by adding more storage/computing power to a centralized system
  - Data Mesh: Scales by distributing responsibilities across autonomous domain teams
- Governance Approach
  - Data Lake: Centralized governance with unified policies for data quality, access, compliance
  - Data Mesh: Federated governance with standardized policies, but enforcement and execution are domain-driven
- Flexibility to Evolve
  - Data Lake: Slower to adapt to change
  - Data Mesh: Faster iteration and responsiveness; new data products can be developed independently
Our Data Journey: The Beginning
We started by hosting datasets in a data lake, which provided immediate benefits:
» Flexibility – hosting structured, unstructured, and/or semi-structured datasets in a centralized lake.
» Viability – separating storage and compute to accommodate different usage and load patterns across the organization.
» Availability – executing as a fast-paced startup with incredible cost benefits compared to the previous generation enterprise data warehouse architecture, solutions, and tools.
We started with a highly decentralized execution model that helped us move fast and roll out many advanced capabilities in a short period of time.
But it came with a few problems: data duplication, source proliferation, diverging data quality and integrity between related sources, and a number of domain-agnostic data ownerships. We quickly identified these issues and consciously created focus groups that followed a loose domain ownership model. The split of these focus groups was based on the “data pipeline architecture/organization model”.
As highlighted in the data mesh paper, this pipeline-based architecture and org structure can look like an effective ownership model at first. In practice, however, all the focus groups had to work together to launch even very small new functionality. The result was a siloed, hyper-specialized data platform team with very little understanding of the source domains that generate the data, and without the domain expertise of the analytics consumption teams it serves. This limited our ability to achieve our ideal speed and scale.
Data lakes are no longer the centerpiece of the overall architecture of a mature analytics ecosystem. The data lake architecture fails to gracefully accommodate changes in the data landscape, leads to a proliferation of dataset sources within the organization, and slows the response to change.
The Present & The Future
To provide a truly decentralized architecture that avoids the above-mentioned issues, we concluded that data mesh is the right architectural and organizational pattern for our data.
Data mesh fits our company’s needs in the short and long term.
“We’re happy to have business and tech alignment in our core operating model. We have a data-oriented strategy where we are convinced beyond doubt that quality data, ML, and advanced analytics form our strategic differentiator in the market,” explains the company’s CTO, Venkat PK.
He continues that the company’s executives are “spearheading data maturity models within the organization and have a long-term commitment to invest in advanced architectural/organization transformations like data mesh in the right form and shape.”
4 Pillars of Data Mesh
1. Domain – Domain oriented data decomposition and ownership
The entire data ecosystem is grouped and tagged as source-oriented domain data, consumer-oriented domain data, or shared domain data. In the process, we establish domain-based data ownership. There are clear rules on who should own any new dataset requirement in the organization, which stops inefficient dataset proliferation.
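As a minimal sketch of the ownership rule described above, the snippet below tags each domain with one of the three kinds and refuses to register a dataset under a second owner. The class names (`DomainKind`, `Domain`, `DatasetRegistry`), the dataset names, and the team names are all hypothetical and chosen only for illustration; they do not describe our actual platform.

```python
from dataclasses import dataclass, field
from enum import Enum


class DomainKind(Enum):
    """The three domain groupings named in the text."""
    SOURCE_ORIENTED = "source-oriented"
    CONSUMER_ORIENTED = "consumer-oriented"
    SHARED = "shared"


@dataclass(frozen=True)
class Domain:
    name: str
    kind: DomainKind
    owner_team: str  # hypothetical team identifier


@dataclass
class DatasetRegistry:
    """Hypothetical registry: each dataset has exactly one owning domain."""
    owners: dict = field(default_factory=dict)

    def register(self, dataset: str, domain: Domain) -> None:
        existing = self.owners.get(dataset)
        if existing is not None and existing != domain:
            # A second owner would be the start of dataset proliferation.
            raise ValueError(
                f"{dataset!r} is already owned by domain {existing.name!r}"
            )
        self.owners[dataset] = domain

    def owner_of(self, dataset: str) -> Domain:
        return self.owners[dataset]


# Illustrative domains and datasets (assumed names).
catalog = Domain("product-catalog", DomainKind.SOURCE_ORIENTED, "catalog-team")
pricing = Domain("pricing", DomainKind.CONSUMER_ORIENTED, "pricing-analytics")

registry = DatasetRegistry()
registry.register("global_product_catalog", catalog)
registry.register("price_recommendations", pricing)
```

Registering `global_product_catalog` again under the `pricing` domain raises a `ValueError`, which is one simple way to make the "who owns a new dataset" rule explicit instead of leaving it to convention.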
2. Data as a Product – Data and product thinking convergence
We treat data like a product – with structure, quality, and accountability. Each dataset belongs to a specific domain, following clear rules for ownership and access. We define both the physical and logical structure of the data with discipline. We track how data flows and changes (lineage), and clearly document any transformations.
Every data product includes quality checks, alert thresholds, and rules to ensure accuracy and trust. We monitor data for freshness, unexpected changes, and overall health. We also define retention policies, enable observability, and support DevOps best practices.
With this, the key focus shifts to the data within a domain.
The pipelines become the data product’s internal implementation.
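To make the idea of quality checks, alert thresholds, and freshness monitoring more concrete, here is a small sketch of a data product contract and a health check against it. Everything in it is an assumption for illustration: the `DataProductContract` fields, the `check_health` function, the threshold values, and the sample rows are hypothetical and not taken from our production system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DataProductContract:
    """Hypothetical contract a domain publishes with its data product."""
    name: str
    owner_domain: str
    max_staleness: timedelta   # freshness threshold
    max_null_fraction: float   # quality threshold for required fields
    retention: timedelta       # retention policy (declared, not enforced here)


def check_health(contract, last_updated, rows, required_fields, now=None):
    """Return a list of alert messages; an empty list means healthy."""
    now = now or datetime.now(timezone.utc)
    alerts = []

    # Freshness: alert when the last update is older than the threshold.
    if now - last_updated > contract.max_staleness:
        alerts.append(f"{contract.name}: stale, last updated {last_updated.isoformat()}")

    # Quality: alert when too many rows are missing a required field.
    if not rows:
        alerts.append(f"{contract.name}: no rows to check")
        return alerts
    for field_name in required_fields:
        nulls = sum(1 for row in rows if row.get(field_name) is None)
        fraction = nulls / len(rows)
        if fraction > contract.max_null_fraction:
            alerts.append(
                f"{contract.name}: {field_name} null fraction "
                f"{fraction:.2f} exceeds {contract.max_null_fraction:.2f}"
            )
    return alerts


contract = DataProductContract(
    name="price_recommendations",
    owner_domain="pricing",
    max_staleness=timedelta(hours=1),
    max_null_fraction=0.25,
    retention=timedelta(days=365),
)
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [{"sku": "A", "price": 1.0}, {"sku": "B", "price": None}]
alerts = check_health(contract, now - timedelta(hours=2), rows, ["sku", "price"], now=now)
```

With these sample values the check reports two alerts, one for staleness and one for the `price` field. In this framing the pipeline that produces the rows stays an internal detail, while the contract is what consumers see.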
3. Data Platform – Data and self-serve platform design convergence
At the physical layer, data mesh’s self-serve data platform provides access to scalable polyglot data storage, data product schemas, data pipeline declaration and orchestration, data product lineage, compute and data locality, and more.
At the logical level, the data mesh paper proposes a multi-plane architecture with planes such as the data infrastructure provisioning plane, the data product developer experience plane, and the data mesh supervision plane.
We use our existing cloud service capabilities to drive the platform aspect of this transformation. Based on what we’ve learned so far, we’re planning to invest in purpose-built platform capabilities in the coming months to ensure a smooth, scalable transition.
4. Federated computational governance – Make decentralization work efficiently
Data mesh completely decentralizes the governance aspect of data as a product. It relies on domain owners acting as federated custodians of data governance. The domain owners define how to handle data quality, data security and monitoring, polyseme modeling, reliability, and operational excellence for their data as a product.
Despite such localized decision making and autonomy, domain owners need to comply with the standards defined by the global federated governance team and automated by the platform.
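One way to picture "standards defined globally, automated by the platform" is a check that compares each domain's locally chosen policy against a shared standard. The sketch below is only an illustration under assumed names: `GlobalStandard`, `DomainPolicy`, `violations`, and the specific rules (required metadata, a retention cap, PII masking) are hypothetical examples, not our actual governance rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlobalStandard:
    """Hypothetical rules set by the global federated governance team."""
    required_metadata: frozenset
    max_retention_days: int
    pii_must_be_masked: bool = True


@dataclass(frozen=True)
class DomainPolicy:
    """Hypothetical policy a domain owner defines for a data product."""
    domain: str
    metadata: frozenset
    retention_days: int
    contains_pii: bool
    pii_masked: bool


def violations(policy: DomainPolicy, standard: GlobalStandard) -> list:
    """Return the ways a domain policy departs from the global standard."""
    problems = []
    missing = standard.required_metadata - policy.metadata
    if missing:
        problems.append("missing metadata: " + ", ".join(sorted(missing)))
    if policy.retention_days > standard.max_retention_days:
        problems.append(
            f"retention {policy.retention_days}d exceeds "
            f"{standard.max_retention_days}d"
        )
    if standard.pii_must_be_masked and policy.contains_pii and not policy.pii_masked:
        problems.append("PII must be masked")
    return problems


standard = GlobalStandard(
    required_metadata=frozenset({"owner", "lineage", "schema"}),
    max_retention_days=730,
)
pricing_policy = DomainPolicy(
    domain="pricing",
    metadata=frozenset({"owner", "schema"}),
    retention_days=365,
    contains_pii=False,
    pii_masked=False,
)
problems = violations(pricing_policy, standard)
```

Here the domain keeps its own retention choice (365 days, within the cap) but is flagged for missing `lineage` metadata. The domain decides the details, and the platform checks only the shared minimum.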
Our experienced team has created domain-based ownership and key points of contact in each of these domains to put together a global federated governance team.
We’ll keep maturing and transforming these pillars.
The Data Mesh Pitfall to Address
Data mesh tries to address most of the pitfalls associated with decentralized architectures through the power of a mature data platform. But building and embracing such platform capabilities can take some time. Decentralizing specialized roles (data engineers, data scientists, etc.) by domain limits communication and coordination within those specialized job families.
It also reduces opportunities for collaborative learning and makes it harder to structure a proper growth path for these roles. Without organizational maturity, this could eventually lead to poor data standards and slow the pace at which data-related problems are solved. We know the key issues that arise when data mesh is not backed by a full-fledged data platform, and we are working on an operating model to address them.