Tag Archives: data growth
Talking to architects about analytics at a recent event, I kept hearing the familiar theme; data scientists are spending 80% of their time on “data wrangling” leaving only 20% for delivering the business insights that will drive the company’s innovation. It was clear to everybody that I spoke to that the situation will only worsen. The coming growth everybody sees in data volume and complexity, will only lengthen the time to value.
Gartner recently predicted that:
“by 2015, 50% of organizations will give up on managing growth and will redirect funds to improve classification and analytics.”
Some of the details of this study are interesting. In the end, many organizations are coming to two conclusions:
- It’s risky to delete data, so they keep it around as insurance.
- All data has potential business value, so more organizations are keeping it around for potential analytical purposes.
The other mega-trend here is that more and more organizations are looking to compete on analytics – and they need data to do it, both internal data and external data.
From an architect’s perspective, here are several observations:
- The floodgates are open and analytics is a top priority. Given that, the emphasis should be on architecting to manage the dramatic increases in both data quantity and data complexity rather than on trying to stop it.
- The immediate architectural priority has to be on simplifying and streamlining your current enterprise data architecture. Break down those data silos and standardize your enterprise data management tools and processes as much as possible. As discussed in other blogs, data integration is becoming the biggest bottleneck to business value delivery in your environment. Gartner has projected that “by 2018, more than half the cost of implementing new large systems will be spent on integration.” The more standardized your enterprise data management architecture is, the more efficient it will be.
- With each new data type, new data tool (Hive, Pig, etc.), and new data storage technology (Hadoop, NoSQL, etc.) ask first if your existing enterprise data management tools can handle the task before people go out and create a new “data silo” based on the cool, new technologies. Sometimes it will be necessary, but not always.
- The focus needs to be on speeding value delivery for the business. And the key bottleneck is highly likely to be your enterprise data architecture.
Rather than focusing on managing data growth, the priority should be on managing it in the most standardized and efficient way possible. It is time to think about enterprise data management as a function with standard processes, skills and tools (just like Finance, Marketing or Procurement.)
Several of our leading customers have built or are building a central “Data as a Service” platform within their organizations. This is a single, central place where all developers and analysts can go to get trustworthy data that is managed by IT through a standard architecture and served up for use by all.
For more information, see “The Big Big Data Workbook”
*Gartner Predicts 2015: Managing ‘Data Lakes’ of Unprecedented Enormity, December 2014 http://www.gartner.com/document/2934417#
The other comparison is that data is like solar power. Like solar power, data is abundant. In addition, it’s getting cheaper and more efficient to harness. The juxtaposition of these images captures the current sentiment around data’s potential to improve our lives in many ways. For this to happen, however, corporations and data custodians must effectively balance the power of data with security and privacy concerns.
Many people have a preconception of security as an obstacle to productivity. Actually, good security practitioners understand that the purpose of security is to support the goals of the company by allowing the business to innovate and operate more quickly and effectively. Think back to the early days of online transactions; many people were not comfortable banking online or making web purchases for fear of fraud and theft. Similar fears slowed early adoption of mobile phone banking and purchasing applications. But security ecosystems evolved, concerns were addressed, and now Gartner estimates that worldwide mobile payment transaction values surpass $235B in 2013. An astute security executive once pointed out why cars have brakes: not to slow us down, but to allow us to drive faster, safely.
The pace of digital change and the current proliferation of data is not a simple linear function – it’s growing exponentially – and it’s not going to slow down. I believe this is generally a good thing. Our ability to harness data is how we will better understand our world. It’s how we will address challenges with critical resources such as energy and water. And it’s how we will innovate in research areas such as medicine and healthcare. And so, as a relatively new Informatica employee coming from a security background, I’m now at a crossroads of sorts. While Informatica’s goal of “Putting potential to work” resonates with my views and helps customers deliver on the promise of this data growth, I know we need to have proper controls in place. I’m proud to be part of a team building a new intelligent, context-aware approach to data security (Secure@SourceTM).
We recently announced Secure@SourceTM during InformaticaWorld 2014. One thing that impressed me was how quickly attendees (many of whom have little security background) understood how they could leverage data context to improve security controls, privacy, and data governance for their organizations. You can find a great introduction summary of Secure@SourceTM here.
I will be sharing more on Secure@SourceTM and data security in general, and would love to get your feedback. If you are an Informatica customer and would like to help shape the product direction, we are recruiting a select group of charter customers to drive and provide feedback for the first release. Customers who are interested in being a charter customer should register and send email to SecureCustomers@informatica.com.
In my last blog, I talked about the dreadful experience of cleaning raw data by hand as a former analyst a few years back. Well, the truth is, I was not alone. At a recent data mining Meetup event in San Francisco bay area, I asked a few analysts: “How much time do you spend on cleaning your data at work?” “More than 80% of my time” and “most my days” said the analysts, and “they are not fun”.
But check this out: There are over a dozen Meetup groups focused on data science and data mining here in the bay area I live. Those groups put on events multiple times a month, with topics often around hot, emerging technologies such as machine learning, graph analysis, real-time analytics, new algorithm on analyzing social media data, and of course, anything Big Data. Cools BI tools, new programming models and algorithms for better analysis are a big draw to data practitioners these days.
That got me thinking… if what analysts said to me is true, i.e., they spent 80% of their time on data prepping and 1/4 of that time analyzing the data and visualizing the results, which BTW, “is actually fun”, quoting a data analyst, then why are they drawn to the events focused on discussing the tools that can only help them 20% of the time? Why wouldn’t they want to explore technologies that can help address the dreadful 80% of the data scrubbing task they complain about?
Having been there myself, I thought perhaps a little self-reflection would help answer the question.
As a student of math, I love data and am fascinated about good stories I can discover from them. My two-year math program in graduate school was primarily focused on learning how to build fabulous math models to simulate the real events, and use those formula to predict the future, or look for meaningful patterns.
I used BI and statistical analysis tools while at school, and continued to use them at work after I graduated. Those software were great in that they helped me get to the results and see what’s in my data, and I can develop conclusions and make recommendations based on those insights for my clients. Without BI and visualization tools, I would not have delivered any results.
That was fun and glamorous part of my job as an analyst, but when I was not creating nice charts and presentations to tell the stories in my data, I was spending time, great amount of time, sometimes up to the wee hours cleaning and verifying my data, I was convinced that was part of my job and I just had to suck it up.
It was only a few months ago that I stumbled upon data quality software – it happened when I joined Informatica. At first I thought they were talking to the wrong person when they started pitching me data quality solutions.
Turns out, the concept of data quality automation is a highly relevant and extremely intuitive subject to me, and for anyone who is dealing with data on the regular basis. Data quality software offers an automated process for data cleansing and is much faster and delivers more accurate results than manual process. To put that in math context, if a data quality tool can reduce the data cleansing effort from 80% to 40% (btw, this is hardly a random number, some of our customers have reported much better results), that means analysts can now free up 40% of their time from scrubbing data, and use that times to do the things they like – playing with data in BI tools, building new models or running more scenarios, producing different views of the data and discovering things they may not be able to before, and do all of that with clean, trusted data. No more bored to death experience, what they are left with are improved productivity, more accurate and consistent results, compelling stories about data, and most important, they can focus on doing the things they like! Not too shabby right?
I am excited about trying out the data quality tools we have here at Informtica, my fellow analysts, you should start looking into them also. And I will check back in soon with more stories to share..
The term “big data” has been bandied around so much in recent months that arguably, it’s lost a lot of meaning in the IT industry. Typically, IT teams have heard the phrase, and know they need to be doing something, but that something isn’t being done. As IDC pointed out last year, there is a concerning shortage of trained big data technology experts, and failure to recognise the implications that not managing big data can have on the business is dangerous. In today’s information economy, as increasingly digital consumers, customers, employees and social networkers we’re handing over more and more personal information for businesses and third parties to collate, manage and analyse. On top of the growth in digital data, emerging trends such as cloud computing are having a huge impact on the amount of information businesses are required to handle and store on behalf of their customers. Furthermore, it’s not just the amount of information that’s spiralling out of control: it’s also the way in which it is structured and used. There has been a dramatic rise in the amount of unstructured data, such as photos, videos and social media, which presents businesses with new challenges as to how to collate, handle and analyse it. As a result, information is growing exponentially. Experts now predict a staggering 4300% increase in annual data generation by 2020. Unless businesses put policies in place to manage this wealth of information, it will become worthless, and due to the often extortionate costs to store the data, it will instead end up having a huge impact on the business’ bottom line. Maxed out data centres Many businesses have limited resource to invest in physical servers and storage and so are increasingly looking to data centres to store their information in. As a result, data centres across Europe are quickly filling up. Due to European data retention regulations, which dictate that information is generally stored for longer periods than in other regions such as the US, businesses across Europe have to wait a very long time to archive their data. For instance, under EU law, telecommunications service and network providers are obliged to retain certain categories of data for a specific period of time (typically between six months and two years) and to make that information available to law enforcement where needed. With this in mind, it’s no surprise that investment in high performance storage capacity has become a key priority for many. Time for a clear out So how can organisations deal with these storage issues? They can upgrade or replace their servers, parting with lots of capital expenditure to bring in more power or more memory for Central Processing Units (CPUs). An alternative solution would be to “spring clean” their information. Smart partitioning allows businesses to spend just one tenth of the amount required to purchase new servers and storage capacity, and actually refocus how they’re organising their information. With smart partitioning capabilities, businesses can get all the benefits of archiving the information that’s not necessarily eligible for archiving (due to EU retention regulations). Furthermore, application retirement frees up floor space, drives the modernisation initiative, allows mainframe systems and older platforms to be replaced and legacy data to be migrated to virtual archives. Before IT professionals go out and buy big data systems, they need to spring clean their information and make room for big data. Poor economic conditions across Europe have stifled innovation for a lot of organisations, as they have been forced to focus on staying alive rather than putting investment into R&D to help improve operational efficiencies. They are, therefore, looking for ways to squeeze more out of their already shrinking budgets. The likes of smart partitioning and application retirement offer businesses a real solution to the growing big data conundrum. So maybe it’s time you got your feather duster out, and gave your information a good clean out this spring?
Businesses retain information in an Enterprise data archiving either for compliance – adhere to data retention regulations – or because business users are afraid to let go of data they are used to having access to. Many IT have told us they retain data in archives because they are looking to cut infrastructure costs and do not have retention requirements clearly articulated from the business. As a result, enterprise data archiving has morphed into serving multiple purposes for IT –they can eliminate costs associated with maintaining aging data in production applications, allow business users to access the information on demand, all while adhering to some – if any known or defined – retention policies. (more…)
Many years ago (over 30 to be precise) I can recall walking the halls of more than one fortune 500 company and seeing four-foot high stacks of boxes with computer printouts in the hallway outside of managers’ offices. In fact it was not uncommon to see pallet-loads of computer printouts in some companies. When I asked one manager what the reports were and why they had so many, he said “we don’t look at the reports any more but we don’t know how to get the data center to stop sending them.” (more…)
The “Dodd-Frank Wall Street Reform and Consumer Protection Act” has recently been passed by the US federal government to regulate financial institutions. Per this legislation, there will be more “watchdog” agencies that will be auditing banks, lending and investment institutions to ensure compliance. As an example, there will be an Office of Financial Research within the Federal Treasury responsible for collecting and analyzing data. This legislation brings with it a higher risk of fines for non-compliance. (more…)
The phrase ‘Data Tsunami’ has been used by numerous authors in the last few months and it’s difficult to find another suitable analogy because what’s approaching is of such an increased order of magnitude that the IT industries continued expectations for data growth will be swamped in the next few years.
However impressive a spectacle a Tsunami is, it still wreaks havoc to those who are unprepared or believe they can tread water and simply float to the surface when the trouble has passed.
In my last post I talked about airlines becoming more efficient (or not) and I started thinking about how everything today is about efficiency – not a bad thing when you consider the growth of data volumes we’re seeing everywhere (go to YouTube and search “exponential times” – some interesting videos). Efficiency is necessary for scale, but also efficiency is about better use of resources (think Green). (more…)