Data Warehouses: Past, Present, and Future
Welcome to the first in our new Informatica Analytics Chalk Talk series of videos and blog posts. I’m the host and I’ll be sharing advice and best practices gathered from the hundreds of consultations and deployments I’ve led with enterprise customers around data management to support analytics initiatives. Below is a transcript from the first Chalk Talk video, which delves into the evolution of the data warehouse. Watch the video or read the blog (or do both!) and please share your views about the future of the warehouse. Find me on Twitter @leanlyle.
I keep hearing this idea that the data warehouse is obsolete. I’ve heard it at trade shows, and I’ve also read it on blogs. I’m going to debunk this, or at least get people thinking a different way (I hope). The data warehouse is not obsolete. Because here’s the thing: haven’t we heard this before?
We saw this in 1996, when start-ups in the EII (Enterprise Information Integration) space were promising to virtualize different data sources so that you didn’t require a data warehouse. You could create a virtual data warehouse in your own ODBC space. Well, that didn’t work out, now did it?
Those of us who actually work with data and know it really deeply know that you can’t solve the really complex data problems that way. It just doesn’t work like that. The other thing that shows that data warehousing is not obsolete is a CIO Insight survey that found 70 percent of the respondents were increasing their data warehousing spending. That wouldn’t be happening if it were obsolete.
So, what’s going on?
Well, with a traditional data warehouse, you’ve got all sorts of different sources. You’re processing, preparing, cleaning, integrating the data so that you can then present it in your data warehouse for predictive analysis, business intelligence and everything else.
Okay, but the world’s gotten a little more complex now.
We’ve got cloud, we’ve got mobile, we’ve got social media data, we’ve got machine data, sensor data. All these different sources. The data’s growing.
In so many presentations on big data, you’ll see slides talking about the petabytes and zettabytes of data that you’ve got to be able to process to handle the new world. But to me, that’s missing the point.
You want to know why data warehousing is really, really relevant going forward, and why 70 percent of those survey respondents are increasing their data warehousing spending? It’s because the volume and velocity of data aren’t the problem. As those grow, we can just increase our hardware, increase our scale to be able to handle that.
The problem is the increasing complexity that comes with the variety, the heterogeneity of data. We’re being asked to relate and integrate this data. But the tools that we’re being given, the new stuff, Hadoop or Spark, or whatever the next thing’s going to be, don’t give us any magical solution to that. We’re still really fighting to solve the same problem: how do I take data that’s coming from different sources and then relate it together, so the business can ask and answer new questions? To solve that, we need to rely on a lot of the same skills that we’ve always needed.
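To make that concrete, here is a minimal sketch of the "relate it together" problem at toy scale. Everything in it is made up for illustration: the CRM rows, the web events, the field names, and the choice of email address as the matching key are all assumptions, not anything from a real system. The point it tries to show is that the hard part is agreeing on how records match, not how many records there are.

```python
# Hypothetical records from two sources that describe the same customers.
crm_rows = [
    {"customer_id": 101, "email": "Ana.Diaz@Example.com", "segment": "enterprise"},
    {"customer_id": 102, "email": "bo.chen@example.com", "segment": "smb"},
]
web_events = [
    {"user": " ana.diaz@example.com", "page": "/pricing"},
    {"user": "BO.CHEN@EXAMPLE.COM ", "page": "/docs"},
    {"user": "unknown@example.com", "page": "/home"},
]


def normalize_email(value):
    """Assumed matching rule: trim whitespace and ignore case."""
    return value.strip().lower()


def relate(crm, events):
    """Join web events to CRM customers on the normalized email key."""
    by_email = {normalize_email(row["email"]): row for row in crm}
    matched, unmatched = [], []
    for event in events:
        customer = by_email.get(normalize_email(event["user"]))
        if customer is None:
            # Keep what we could not relate so someone can investigate it.
            unmatched.append(event)
        else:
            matched.append({
                "customer_id": customer["customer_id"],
                "segment": customer["segment"],
                "page": event["page"],
            })
    return matched, unmatched


matched, unmatched = relate(crm_rows, web_events)
```

Without the normalization step none of these events would match a customer, and no amount of extra hardware fixes that. Somebody who knows the data still has to decide what the key is.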
Evolving the data warehouse
In fact, we can actually utilize the new technologies in ways that allow us to work a lot faster and integrate the business in the discussion much earlier.
So, let’s look at some of the technologies that let us be smart about when, and how deeply, we dive into the details of how to relate that data, and that let us apply what we know to help the business.
An assembly line for analytics
One of the really interesting things I see going on is companies building almost an assembly line of data where, as new data comes in, they can just dump it into a document-oriented NoSQL store. Almost like shoving it into a SharePoint server, but with much better searching capabilities.
We can ask some questions of this information as well as give the business the ability to see it and understand what they need. Then, they can give us feedback on how analytically interesting something is, and how important it might be to put more thought into understanding the schema, or structure, and how one data set might relate to others.
As this happens, it may suggest moving the data into a key-value NoSQL system that gives us a little more capability in terms of the questions that we can ask, and the things that we can answer.
Finally, we get to the point where we’ve found something that’s really interesting and we want to operationalize it in the data warehouse. The reason we want to do this is that the more relational you need the data to be (the more something relates to other things and your questions depend on those relations), the more you need to put it in an actual relational database.
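Here is a rough sketch of that assembly line in miniature. It is an illustration under stated assumptions, not a reference architecture: a Python list of JSON strings stands in for the document store, a plain dict stands in for the key-value system, and an in-memory SQLite database stands in for the warehouse. The sensor readings, the site and region lookup, and all function names are hypothetical.

```python
import json
import sqlite3

# Stage 1: raw documents land as-is (hypothetical sensor events).
raw_documents = [
    '{"device": "pump-1", "temp_c": 71.5, "site": "austin"}',
    '{"device": "pump-2", "temp_c": 64.0, "site": "denver", "note": "after service"}',
    '{"device": "pump-1", "temp_c": 73.0, "site": "austin"}',
]


def search_documents(docs, field, value):
    """Stage 1 question: scan the documents for a field value, no schema needed."""
    parsed = (json.loads(doc) for doc in docs)
    return [doc for doc in parsed if doc.get(field) == value]


def to_key_value(docs, key_field):
    """Stage 2: group documents under a chosen key for direct lookups."""
    store = {}
    for doc in map(json.loads, docs):
        store.setdefault(doc[key_field], []).append(doc)
    return store


def to_warehouse(docs):
    """Stage 3: commit to a schema and load the data into relational tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE site (name TEXT PRIMARY KEY, region TEXT)")
    conn.execute(
        "CREATE TABLE reading ("
        "device TEXT, site TEXT REFERENCES site(name), temp_c REAL)"
    )
    # Assumed reference data that only exists once we model the relations.
    conn.executemany(
        "INSERT INTO site VALUES (?, ?)",
        [("austin", "south"), ("denver", "west")],
    )
    rows = [(d["device"], d["site"], d["temp_c"]) for d in map(json.loads, docs)]
    conn.executemany("INSERT INTO reading VALUES (?, ?, ?)", rows)
    return conn


def avg_temp_by_region(conn):
    """A question that depends on relations between data sets."""
    cursor = conn.execute(
        "SELECT s.region, AVG(r.temp_c) "
        "FROM reading r JOIN site s ON r.site = s.name "
        "GROUP BY s.region ORDER BY s.region"
    )
    return cursor.fetchall()
```

Each stage costs more modeling effort than the one before it and answers richer questions in return. The last question, average temperature by region, needs a join against a second data set, and that is the kind of question that pushes data toward the relational end of the line.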
Now, we’ve got column stores and other technologies that we also should be looking at. So, even that area has gotten a lot more interesting, and we should be investigating these things. We can spin up these technologies in the cloud and test them out, so we can do things we couldn’t do several years ago.
The important thing is that, as the data warehouse team, we have to be diving into this, experimenting with it, and understanding it, so we stay ahead of the business. If we don’t do it, they will, and they’ll just create their own thing to answer their own question at their own speed. Then the shadow IT world begins to explode, and the hairball gets worse.
So, let’s come back a little bit.
We’ve got all these different tools that are available to us, but ultimately, the data warehousing and operationalizing of the data is just as important as it ever was. So, keep at it. Investigate the new technologies, and understand that the data and schema-oriented skills that you’ve developed are just as important now as they’ve ever been.
Next part of the Analytics Chalk Talks series:
Part II: Big Data Analytics Chalk Talk