The potential of AI has been acknowledged for years, along with the understanding that data is the key driver of this potential. This realization sparked a gold rush to amass data, a pursuit driven more by the allure of future possibilities than by practical use cases. Consequently, the rush produced piles of rubble rather than gold. Now, as AI capabilities become increasingly tangible, many organizations look to leverage their so-called data lakes but instead find costly data puddles: scattered across departments, low in quality, high in complexity, difficult to govern, and inefficient to process.
This article delves into common mistakes in data science strategies, exploring the resultant challenges and proposing basic solution patterns as both preventative and remedial measures. These solution patterns map directly to the concepts of Data as a Product (DaaP) and Data Mesh.
Echoes of the Past
When I started my career, my first mentor shared a perspective with me that has stuck with me ever since: the IT world is essentially a cycle of revisiting old ideas. He pointed out that he was rarely taken aback by “new” trends because, in his view, they were often just repurposed old concepts, wrapped up in the latest jargon for new use cases. At the time, I was not sure I understood, but every new trend, practice, and technology I have learned about since seems to prove him right.
In software design, we often talk about “design patterns” as solutions that can be applied repeatedly to solve common problems. Interestingly, the concept did not emerge within the realm of software development but rather in the field of conventional architecture, credited to Christopher Alexander, who discussed the role of patterns in facilitating the design of cities and buildings in A Pattern Language (1977). It turns out that this approach of identifying and naming recurring solutions isn’t unique to IT or city architecture; it’s something you find in almost every professional field. When a particular solution proves effective in various scenarios and stands the test of time, it’s natural for it to be recognized, named, and reused.
There’s also a tendency for people to want to be seen as innovators. It’s a way to stand out and build both a career and a personal brand. I think that’s totally fine. In fact, it’s necessary. It drives people to explore new ways of doing things, even if they’re just applying well-established patterns in new areas. But this desire for innovation also leads to a lot of new buzzwords and trends fueled by hype. Everyone gets excited about the latest “breakthrough” that promises to fix everything, even though it might just be a new spin on a classic concept.
This ties back to my mentor’s assertion that most “new” trends and concepts in IT are essentially repurposed old patterns. It reminds me of the saying “Insanity is doing the same thing over and over again and expecting different results.” Whenever we confront a new challenge or venture into uncharted territory, we tend to repeat past mistakes. Experts then seize the moment to introduce a “new” concept that purportedly solves everything, only to reinvent and apply the same proven solution patterns that we ought to have recognized from the start.
Solution Patterns
The DevOps movement champions the principle of “you build it, you run it,” emphasizing full lifecycle responsibility. This ethos not only fosters a culture of ownership among team members but also facilitates the dismantling of expertise silos. Such an environment inherently drives quality and innovation, as individuals are more invested in the outcomes of their work. Moreover, this approach significantly reduces the time to market for new features and products by streamlining processes and fostering collaboration.
Transitioning from monolithic architectures to a microservices-based approach represents the concept of decomposing complex systems into smaller, more manageable components. This significantly enhances resiliency, fault tolerance, and scalability. Perhaps the most critical advantage of this movement is the ability to closely align these smaller components with specific business use cases. This alignment allows for a more nuanced and effective response to the unique requirements of business domains.
Conway’s Law posits that the architecture of a system often reflects the communication patterns of the organization that created it. When teams are organized around specific domains of expertise, it tends to result in software systems that are fragmented along these lines of expertise, leading to a misalignment with the business’s value streams. Such fragmentation can introduce considerable inefficiencies, as the resulting organizational silos impede the flow of information and decision-making, thereby slowing down development processes and limiting innovation. This insight has led to the adoption of cross-functional and fusion teams, which prioritize the formation of teams around value stream segments rather than areas of expertise.
So, setting aside the buzzwords, what are the patterns that contribute to success?
- Fostering a culture of ownership and embracing full lifecycle responsibility
- Decomposing complex systems into smaller, manageable components
- Aligning IT components with specific business use cases
- Organizing teams around value stream segments rather than siloing them by expertise
Pitfalls of Data Science Strategies
In many organizations, data science seems to be treated like an actual science: a little bubble that is tasked to “do innovation”. This often comes with the deployment of a centralized data platform that the rest of the organization is expected to utilize, without a cohesive, organization-wide data strategy. This approach is fundamentally flawed from the outset.
The “throw-it-over-the-wall” mentality regarding data involves producers simply dumping data into a centralized platform. This approach fosters a lack of ownership, with no clear responsibility for the data’s maintenance, quality, consistency, and utility.
Data monoliths emerge from the continuous effort to collect all data in one place so that it can later be processed by experts. However, data monoliths also often emerge accidentally, for reasons similar to those in software development: no alignment to a business domain or use case, a platform or infrastructure that forces or incentivizes centralization, lack of structure, and inadequate planning combined with chaotic evolution over time. The consequences are poor scalability, rising costs, complexity, reliability issues, and difficulties in maintenance, development, and data consumption.
The siloing of data science and data engineering from other departments frequently results in a misalignment with the data management needs of specific business use cases. This misalignment introduces significant inefficiencies and complicates the effective use of data within these domains. The resulting organizational silos hinder the flow of information and decision-making, slowing down development processes and constraining innovation.
The segregation of data from business domains often results in a disconnect between the data and the specific use cases it is intended to support. The practice of indiscriminately dumping data into a centralized platform, from where it is then accessed, can lead to the erosion of critical domain-specific details, reducing the data’s quality and reliability. To maximize its value, data must be closely aligned with either the domain from which it originates or the domain for which it is specifically curated and utilized.
Isolating expertise in data engineering and data science from the broader organization limits the effective integration of data insights and capabilities across the company’s value streams, significantly restricting the potential impact and value created. The lack of collaboration between software developers, data engineers, and domain experts often leads to inefficiencies. Without a concerted effort to bridge these disciplines, the development and implementation of data-driven solutions become disjointed, resulting in suboptimal outcomes that fail to meet the nuanced requirements of specific business contexts. This disjointed approach hinders the agile adaptation of data processing to evolving business needs, stifling innovation and growth.
Data as a Product
At its core, Data as a Product (DaaP) is about rethinking our approach to data, treating it with the same care and intention as we would any product in the market. This means looking at data as a valuable offering that needs to be designed, maintained, and improved with the end user in mind. The goal is to bridge the gap between those who create and manage data and those who rely on it to make decisions or derive insights. By adopting this mindset, we ensure that data is not only collected and stored but is also accessible, relevant, and of high quality.
To effectively implement Data as a Product, clear ownership is key. Ideally, the data product should be owned by the team in the business area where the data originates. This team has the closest connection to the data’s context and purpose, making them best suited to ensure its relevance and utility. The leader of the data product, much like a product manager, is tasked with maintaining the data’s quality, keeping it up-to-date, and tailoring it to meet user needs. The customers of the data product include anyone within the organization who uses this data to inform decisions or gain insights. Shifting to a DaaP approach requires organizations to view data as an asset that requires active management and continuous improvement, much like any service or product.
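To make this more tangible, here is a minimal sketch of what a data product contract could look like in code. It is purely illustrative: the DataProductContract class, its field names, and the freshness_slo_hours check are assumptions made for the sake of the example, not an established standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class DataProductContract:
    """Illustrative contract for a data product, owned by the originating domain team."""

    name: str                  # e.g. "orders.daily_revenue"
    owning_domain: str         # business area where the data originates
    owner: str                 # the accountable "data product manager"
    description: str           # what the data means, in business terms
    schema: dict[str, str]     # column name -> type: the published interface
    freshness_slo_hours: int   # how stale the data may become before the SLO is breached
    last_updated: datetime
    consumers: list[str] = field(default_factory=list)  # teams relying on this product

    def is_fresh(self) -> bool:
        """Check the freshness SLO, one of the quality guarantees the owner upholds."""
        age = datetime.now(timezone.utc) - self.last_updated
        return age <= timedelta(hours=self.freshness_slo_hours)


# Hypothetical example: the orders domain publishes revenue data for its consumers.
daily_revenue = DataProductContract(
    name="orders.daily_revenue",
    owning_domain="orders",
    owner="orders-team",
    description="Revenue per day, aggregated from confirmed orders.",
    schema={"date": "DATE", "revenue_eur": "DECIMAL(12,2)"},
    freshness_slo_hours=24,
    last_updated=datetime.now(timezone.utc),
    consumers=["finance-analytics", "demand-forecasting"],
)
assert daily_revenue.is_fresh()
```

The point is not the specific fields but that the contract makes ownership, meaning, and quality guarantees explicit, so consumers can rely on the data the way they would rely on any product.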
Ownership
With DaaP, ownership means teams are fully invested in their data, constantly seeking ways to enhance its value based on user feedback and usage patterns, akin to how product updates are informed by customer reviews.
Modularization
Viewing data as individual products naturally leads to the breakdown of large, complex data sets into more manageable, user-friendly segments, simplifying access and understanding for end-users.
Aligning with Business Needs
By developing data products with specific business objectives in mind, DaaP ensures that data is practical, relevant, and directly contributes to achieving business goals.
Team Collaboration
Both the data product owners and customers are envisioned as cross-functional teams, encompassing a range of roles from data scientists, data engineers, and software developers, to domain experts and business analysts. This mix of skills and perspectives helps to build data products that are not only technically sound but also deeply integrated with business insights, leading to more effective and impactful outcomes. The consuming teams leverage their diverse expertise to generate business value by integrating the data into value stream segments or providing further enriched data products themselves.
Data Mesh
Data Mesh emerges as a pivotal architectural paradigm, steering away from the traditional centralized data management approach toward a decentralized model aligned with value stream segments. This paradigm champions the concept of treating data as a product, akin to the principles outlined in the Data as a Product (DaaP) section, emphasizing the need for curated, domain-oriented data sets that are tailored to specific business requirements. Data Mesh weaves together data, metadata, access patterns, and governance into a cohesive framework that simplifies data management, making it more accessible and actionable. This model facilitates the transition from complex, unwieldy data pipelines to manageable, self-contained data products that encapsulate all necessary components for their function, thereby enabling even non-technical teams to handle data with reduced complexity.
Integrating Data Mesh within an organization, alongside DaaP principles, necessitates a concerted effort across technology, culture, and processes. Technologically, the foundation rests on developing a self-service data platform that embodies the Data Mesh architecture, empowering teams to independently manage their data domains. Culturally, this integration demands a shift towards a product-centric view of data, where continuous improvement and user satisfaction are paramount. Operationally, it’s crucial to define clear responsibilities for data producers and consumers, ensuring that data products are not only developed and maintained effectively but also aligned with the broader organizational goals and user needs.
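To sketch the self-service idea, the example below models a tiny in-memory registry through which domain teams publish and discover data products while retaining ownership of them. The DataProductRegistry class and its governance check are hypothetical; a real Data Mesh platform would provide this through data catalogs, access control, lineage, and automated policy enforcement.

```python
class DataProductRegistry:
    """Hypothetical self-serve registry: a discovery point for domain-owned data products."""

    def __init__(self) -> None:
        self._products: dict[str, dict] = {}

    def publish(self, name: str, owning_domain: str, metadata: dict) -> None:
        """Domain teams publish their own products and keep full ownership of them."""
        # Governance as code: reject products that lack the required metadata.
        required = {"description", "schema", "freshness_slo_hours"}
        missing = required - metadata.keys()
        if missing:
            raise ValueError(f"{name} is missing governance metadata: {missing}")
        self._products[name] = {"owning_domain": owning_domain, **metadata}

    def discover(self, domain: str) -> list[str]:
        """Consumers find products by the domain that owns them."""
        return [n for n, p in self._products.items() if p["owning_domain"] == domain]


registry = DataProductRegistry()
registry.publish(
    "orders.daily_revenue",
    owning_domain="orders",
    metadata={
        "description": "Revenue per day from confirmed orders.",
        "schema": {"date": "DATE", "revenue_eur": "DECIMAL(12,2)"},
        "freshness_slo_hours": 24,
    },
)
print(registry.discover("orders"))  # -> ['orders.daily_revenue']
```

A consuming team could just as well publish an enriched product of its own through the same mechanism, which is how value stream segments chain data products together.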
Culture of Ownership
Data Mesh promotes a culture where teams are empowered to own and manage their data as products aligned to their business domain, mirroring the ethos of “you build it, you run it.” This ownership ensures that data is maintained, updated, and aligned with the evolving needs of its users, fostering a sense of responsibility and accountability.
Modularization
By advocating for a decentralized approach, Data Mesh naturally breaks down complex, monolithic data systems into smaller, domain-specific data products. This simplification makes data more accessible and manageable, paralleling the benefits of decomposing complex systems into microservices in software architecture.
Alignment with Business Domains
Data Mesh ensures that data products are closely aligned with specific business use cases, enhancing their relevance and utility. This facilitates the flow of information and decision-making, accelerating development processes.
Team Collaboration
The Data Mesh approach requires cross-functional teams, including data engineers, data scientists, business analysts, and domain experts, as it moves away from the conventional model of a siloed data engineering team for handling data. This collaborative approach ensures that data products are built, consumed, and integrated with a comprehensive understanding of both technical and business requirements.
Embrace It
The cycle of technology trends often brings us back to familiar challenges, and we should not forget to apply the solution patterns we know lead to success.
The initial rush to accumulate as much data as possible in hopes of future AI breakthroughs has left many organizations stifled by complexity rather than enabled for innovation. Recognizing and addressing the common pitfalls in data strategies and applying known solution patterns is key to moving forward.
Concepts like Data as a Product (DaaP) and Data Mesh offer practical ways to manage data more effectively, focusing on quality, accessibility, and alignment with business goals. These strategies encourage a shift towards treating data with the same attention and care as any other key business asset, promoting ownership and cross-functional collaboration.
… Transform your mountain of data rubble into a network of conveyor belts, each efficiently delivering precious insights directly where they are needed.