Treating Big Data Performance Woes with the Data Replication Cure Blog Series – Part 1
“Big Data” is all the rage: it is virtually impossible to check out any information management media channel, online resource, or community of interest without having your eyeballs bathed in articles touting the benefits and inevitability of what has come to be known as big data. I have watched this transformation over the past few years as data warehousing and business analytics appliances have entered the mainstream. Pure and simple: what was the bleeding edge of high-performance computing twenty years ago is now commonplace, with Hadoop being the primary platform (or, more accurately, programming environment) for developing big data analytics applications.
I know this because it has been twenty years since I started working on data-parallel, high-performance platforms. The driving factor then is the same as the driving factor today: increasing performance to reduce the time it takes to deliver results to the right person at the right time.
Back in those old days, I was actively involved in evaluating bottlenecks in computational performance, but I was pretty much focused on the CPU itself. In reality, the bottleneck was always associated with data latency: streaming data from its persisted state on disk through memory, cache, and then into the registers so that the CPU could execute its instructions without having to sit idle waiting for the data to arrive.
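To make the latency point concrete, here is a rough back-of-the-envelope sketch in Python. It uses the standard weighted-average model of access time across a fast tier and a slow tier. The function name and the latency figures are my own illustrative assumptions, not measurements from any particular system.

```python
def effective_access_time(hit_rate, fast_latency, slow_latency):
    """Average time per data access when a fraction `hit_rate` of requests
    is served from the fast tier and the rest fall through to the slow tier."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * fast_latency + (1.0 - hit_rate) * slow_latency


# Illustrative, made-up latencies in microseconds (assumptions, not benchmarks).
MEMORY_US = 0.1
DISK_US = 10_000.0

for rate in (0.0, 0.9, 0.99, 1.0):
    avg = effective_access_time(rate, MEMORY_US, DISK_US)
    print(f"hit rate {rate:.2f}: about {avg:.1f} microseconds per access")
```

Under these assumed numbers, even a small fraction of accesses that fall through to disk dominates the average, which is why keeping the data close to the processor matters so much more than raw CPU speed.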
But let’s fast forward those twenty years, and the problem remains the same. In-memory analytics is a big feature of the frameworks for big data analytics, but the challenge is ensuring that the analytical appliances are not sitting idle waiting for data to stream in from disk. Luckily, we can learn from previous experience: the same caching methods used at the hardware and CPU level are just as reasonable to apply at the appliance level. Therefore, this month’s blog series is going to look at how data replication techniques can help ensure big data analytics performance by streamlining data access and delivery. I will discuss this topic further on May 23 during the Information-Management.com EspressoShot webinar, Treating Big Data Performance Woes with the Data Replication Cure.