Did you know that you use a Data Mesh every day? The internet is one. So, what is it? And how might it help you?
By Jonathan Farina, Chief Technology Officer, WCKD RZR
This is a thorny one! What is a Data Mesh?
If you Google it, you’ll come across various explanations, but if you’re not Computer Science focused, you’ll still end up wondering: “What IS a data mesh?”
As computing has matured, we’ve continued to store more data. In the 70’s and 80’s this wasn’t a huge problem. You could simply store the data in a single file or database. This made sense because the volume of data was small and access was targeted at specific users, for example accessing the CRM database to add a new lead or update customers’ details.
In the 90’s, we started to use more computers and store more data. The more we accumulated, the more people realised they could make use of the stored information. This led to the creation of “Data Warehouses”.
These are data stores usually organised around a number of fixed data domains. What is a data domain? It’s simply a collection of related data, for example customer or credit card data. Now we had much larger Data Warehouses that would curate and store fixed domain sets or “data assets” which users could then query.
However, these assets were fixed and if you needed data that wasn’t in the asset, you were stuck. You’d have to go to a specialised team and ask them to add the data. That would take weeks or months, and by then it was either too costly or the need had passed. So, it was better but not great. It was still quite limited in its usage, and again the data was comparatively small. As data quantities grew, the data warehouses started to creak under the load.
Then in the early 00’s we started to think of “Data Lakes”: large areas where data can be brought together for lots of different uses. Further, we could use “compute”, i.e., lots of computers working together, to crunch through data. This meant we could store much larger sets of data to begin with and we could build larger and more useful data assets.
However, we now needed specialised teams whose job it was to ingest data into the lake, and more people to wire up the data or “transform” it into data assets. We also had “concentration” risk, i.e., the data now sits in one place and can be seen by anyone who has access.
This presents many issues. Firstly, there is a bottleneck when you try to use the data: you need specialists to crunch the data for your specific need or “use case”, which means getting in the queue. Secondly, you have security concerns given the large amounts of data available, and therefore hoops to jump through to ensure you can even see the data (and only the correct data). Thirdly, and importantly, the data has been crunched just for your use case, so re-crunching it for someone else’s use case means the two results may not be consistent with each other. This means outputs could potentially be different. This is bad.
Finally, the cost of maintaining these environments grows with the size of the data and of the organisation you need to store, process, and provide it. So, we tried to find a better way, and this leads us to the Data Mesh.
A Data Mesh?
Imagine if you could access data wherever it is produced, rather than a copy, which may be stale. Further, imagine that the data is well described or annotated so you know what you’re looking at. It is also reasonable to assume that you might have teams of people building specific data assets (which is nothing more than doing the initial steps of joining separate, distinct datasets together), such as all your customer records in a “Customer” asset or all your account information in an “Accounts” asset. Now, you layer on top of this some form of descriptive dictionary which explains what the data is, and you provide a search engine or data catalogue to search for the data across the mesh, similar to how we use index cards in a library to find a book. Now you have something genuinely useful. It is also scalable: as we add new data sources to the mesh, they become available for use, and as we change our data requests or “use cases”, we can simply search for whatever data we want or need.
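To make the “index cards in a library” idea a little more concrete, here is a minimal sketch in Python of what a data catalogue could look like. It is an illustration only, not a real product or library: the `DataProduct` and `Catalogue` classes, the team names, and the `crm-db://` and `ledger-db://` locations are all made up for this example. The point to notice is that the catalogue holds only descriptions and locations; the data itself stays where it is produced.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataProduct:
    """One data asset, described by the team that owns it (hypothetical model)."""
    name: str
    owner: str
    location: str      # where the data lives; the catalogue never copies it
    description: str
    fields: Dict[str, str] = field(default_factory=dict)  # field name -> meaning


class Catalogue:
    """The 'index cards': descriptions of data assets, searchable by keyword."""

    def __init__(self) -> None:
        self._products: Dict[str, DataProduct] = {}

    def register(self, product: DataProduct) -> None:
        # Adding a new source to the mesh makes it discoverable straight away.
        self._products[product.name] = product

    def search(self, term: str) -> List[DataProduct]:
        # Match the term against the name, description and data dictionary.
        term = term.lower()
        hits = []
        for product in self._products.values():
            text = " ".join(
                [product.name, product.description,
                 *product.fields.keys(), *product.fields.values()]
            )
            if term in text.lower():
                hits.append(product)
        return sorted(hits, key=lambda p: p.name)


# Two made-up assets, mirroring the "Customer" and "Accounts" examples above.
catalogue = Catalogue()
catalogue.register(DataProduct(
    name="Customer",
    owner="crm-team",
    location="crm-db://customers",
    description="All customer records joined into one asset",
    fields={"customer_id": "Unique customer identifier",
            "email": "Primary contact email"},
))
catalogue.register(DataProduct(
    name="Accounts",
    owner="finance-team",
    location="ledger-db://accounts",
    description="All account information",
    fields={"account_id": "Unique account identifier",
            "customer_id": "Owner of the account"},
))

for hit in catalogue.search("customer_id"):
    print(hit.name, "->", hit.location)
```

Searching for `customer_id` finds both assets, because both describe that field in their dictionary. A real mesh would add access control, data quality checks, and much richer metadata, but the shape is the same: describe the data where it lives, then make the descriptions searchable.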
The concept isn’t new. In fact, you use a type of data mesh every day. The internet is a data mesh! If you think about it, you access data from each web server by typing in a known address (such as www.bbc.co.uk). This data is curated to be relevant and useful, and these datasets are well described and searchable using search engines. As the web has expanded, your ability to search and use it hasn’t changed; only the amount of data has grown.
So, there you have it. A whistlestop tour of recent data engineering history. The next step is to understand how you can, and if you should, build your own Data Mesh. What do you need and how should it be put together?
Another thorny one, for another day!