Category Archives: Big Data
While CIOs are urged to rethink their backup strategies following warnings from leading analysts that companies are wasting billions on unnecessary storage, consultants and IT solution vendors are selling “Big Data” narratives to these CIOs as a storage optimization strategy.
What a CIO must do is ask:
Do you think a Backup Strategy is the same as a Big Data strategy?
Is your MO – “I must invest in Big Data because my competitor is”?
Do you think Big Data and “data analysis” are synonyms?
Most companies invest very little in their storage technologies, while spending on server and network technologies primarily for backup. Further, the most common mistake businesses make is to fail to update their backup policies. It is not unusual for companies to be using backup policies that are years or even decades old, which do not discriminate between business-critical files and the personal music files of employees.
Even web giants like Facebook and Yahoo often aren’t dealing with Big Data. They run their own giant, in-house “clusters” – collections of powerful servers – for crunching data. But it appears that those clusters are unnecessary for many of the tasks they’re handed. In the case of Facebook, most of the jobs engineers ask their clusters to perform are in the “megabyte to gigabyte” range, which means they could easily be handled on a single computer – even a laptop.
The necessity of breaking problems into many small parts, and processing each on a large array of computers, characterizes classic Big Data problems like Google’s need to compute the rank of every single web page on the planet.
In “Nobody ever got fired for buying a cluster,” Microsoft Research points out that many of the problems solved by engineers at even the most data-hungry firms don’t need to be run on clusters. Why is that a problem? Because there are vast classes of problems for which these clusters are a relatively inefficient, or even inappropriate, solution.
Here is an example of a post exhorting readers to “Incorporate Big Data Into Your Small Business” that is about a quantity of data that probably wouldn’t strain Google Docs, much less Excel on a single laptop. In other words, most businesses are dealing with small data. It’s very important stuff, but it has little connection to the big kind.
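To make the “small data” point concrete, here is a sketch – with invented sales figures for a hypothetical small business – of the kind of analysis such posts describe, done in a few lines of plain Python on one laptop:

```python
import csv
import io
from collections import Counter

# Invented sales data for a hypothetical small business -- at this
# scale (even thousands of rows), plain Python on a laptop is plenty.
sales_csv = """date,product,amount
2015-01-05,widget,19.99
2015-01-06,gadget,34.50
2015-02-11,widget,19.99
2015-02-14,gizmo,9.95
2015-03-02,widget,19.99
"""

rows = list(csv.DictReader(io.StringIO(sales_csv)))

# Total revenue and best-selling product -- no cluster required.
total = sum(float(r["amount"]) for r in rows)
best_seller, _ = Counter(r["product"] for r in rows).most_common(1)[0]

print(round(total, 2))  # 104.42
print(best_seller)      # widget
```

The same code scales to files far larger than most small businesses will ever produce.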
Let’s lose the habit of putting “big” in front of “data” to make it sound important. After all, supersizing your data just because you can will cost you a lot more and may yield a lot less.
So what is it? Big Data, small Data, or Smart Data?
Gregor Mendel uncovered the secrets of genetic inheritance with just enough data to fill a notebook. The important thing is gathering the right data, not gathering some arbitrary quantity of it.
It is hard to miss all the commentary, commercials, ads and reviews on the soon to be released Apple Watch. It got me to thinking about how much has changed over the last 15 years when it comes to how people perceive and use technology and how the Apple Watch may just signal the next shift in technology usage. Yes, we have had wearables now for some time but when Apple does something new they have proven to be able to tap into the broader market conscious and in doing so take us to new places.
The iPad just turned 5. The iPhone was released just over 8 years ago. The iPod that started it all was released in 2001. (I have one of these and my kids think it is ancient; they always ask why it does not have a touch screen.)
Apple iPod Generation 1 (circa 2001)
There were no touch screens or “apps for that” in 2001. In fact, the touch wheel used for navigation was at best quaint at the time. Yet it ended up helping to change the entire music industry and set the stage for Apple and others to continue to innovate on this new technology platform for years to come.
What happened over the next 15 years were some really interesting trends that may be completely changed by how the Apple Watch adds to the discussion of wearables.
Some of the big trends pushed by the iPod → iPhone → iPad progression have included:
1. Increased access to technology at an entry-to-medium price point. As these devices became more powerful and open platforms developed for applications and internet access, the average person gained access to ever-increasing information and tools. While Apple products tend to sit at the higher price points of the market, they helped create opportunities for other vendors to enter at lower price points.
2. Task-based applications. An “app for that” mentality has grown and is now deeply ingrained in both the consumer and enterprise markets. This mentality is very much at odds with the traditional monolithic application-and-stack world and has created many opportunities for specialized applications and services. Even where a software vendor continues to offer a platform or stack, it is forced to think about the architecture and API access that would support smaller, more mobile applications.
3. Mobile first. While this is a bit of a chicken-and-egg question, it is worth giving the explosion of personal devices credit for helping drive a mobile-first approach in many consumer and enterprise solutions. We have also seen a huge expansion of (mostly) reliable Wi-Fi access in public locations, but it is reasonable to believe that user demand for access is what has driven so many places, from airports to McDonald’s, to offer free public Wi-Fi.
4. Social media. Really, would Twitter, Facebook, Instagram, Snapchat and the others be as big as they are today without the huge increase in access from mobile devices? Most likely the answer is no.
Ok, so why is the Apple Watch a possible shift in these usage patterns and not just the continuation?
1. Device Driven Attention Deficit Disorder (DDADD). Yes, I just made that term up, but it is a real problem, and we all either have it at times or know people who suffer from it. Unless your actual job involves doing social media posts, it is not really reasonable (or polite) to be posting away on your device every 5 minutes all night long. The watch/wearables may just provide a way for some people to strike a balance by streamlining interaction with all those applications on their larger phone or device. The review in the WSJ today really hit on this point of the smartwatch being able to drive user efficiency by bringing only specific tasks to the watch. It is too early to tell, but that sure sounds like a good thing.
2. Form and function. Just as the laptop, smartphone and tablet markets have changed the overall computer market (just ask those companies that sell desktop PCs how that is going) the smartwatch over time may do the same. This seems especially the case if the smartwatch can in some cases be a replacement to another device in addition to being used in conjunction with another device the way the Apple Watch and iPhone are used together.
3. New applications and services. These are coming and it is not easy to guess how much change is coming. In some ways the wearables market seems much harder compared to the micro-applications market but that could just be because all things new are hard to predict.
Applications by App Stores Explosion (originally appeared at appFigures)

4. Concerns over data and personal data. The data aspects here are complex, as we get into both usage data and personal data, both of which are valuable and highly regulated. It’s hard to say how the wearables market changes things other than to put a bigger spotlight on the need for industry solutions that put the user in charge of their data.
In summary, while the Apple Watch may or may not be a commercial success, it seems we could look back in 5, 10 or 15 years and see this as yet another huge shift in the way people perceive and use technology.
Informatica’s in Brussels this week for Hadoop Summit. We’re looking forward to spending time with our European customers who are leading the way on repeatably delivering trusted and timely data for big data analytics.
If you’re attending Hadoop Summit Brussels, definitely stop by our session with Belgacom International Carrier Services and our very own Bert Oosterhof to learn how Belgacom is easily driving more predictive analytics and a better customer experience using Informatica and Hadoop.
Europe is clearly becoming a hotbed for increasing use of Hadoop, especially in Telecom, Financial Services, and Public Sector. As organizations look to extend their information architectures with Hadoop, Informatica can help you repeatably deliver trusted and timely data for big data analytics.
Please stop by our booth at Hadoop Summit to learn more!
First, let’s start off with a description of what exactly Big Data is. Simply put: lots and lots of data. According to Wikipedia: “Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set.”
There are many different sources of data (claims systems, enrollment systems, benefits administration systems, survey results, consumer data, social media, personal health devices – like Fitbit). Each source generates an amazing amount of data. These data sets grow in size because they are being gathered by readily available and numerous information-sensing mobile devices, aerial sensors (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 exabytes (2.5×10¹⁸ bytes) of data were created. In order to make sense of all of this data, we need to be able to organize it, create linkages between the data and then perform analysis on it in order to provide meaningful actions.
In 2000, Seisint Inc. developed a C++-based distributed file-sharing framework for data storage and querying, to support the vast amount of storage that is necessary for this data. With this framework, structured, semi-structured and/or unstructured data can be stored and distributed across multiple servers.
In 2004, Google published a paper on a process called MapReduce that uses a similar distributed file-sharing framework. MapReduce provides a parallel processing model and an associated implementation for processing huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step); the results are then gathered and delivered (the Reduce step). The framework was very successful, and others wanted to replicate the algorithm, so an implementation of MapReduce was adopted by an Apache open-source project named Hadoop.
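The Map, shuffle, and Reduce steps described above can be sketched in a few lines of plain Python – a toy word count showing the programming model, not Hadoop’s actual implementation:

```python
from collections import defaultdict
from itertools import chain

# Toy word count in the MapReduce style: map emits (key, value)
# pairs, shuffle groups values by key, reduce aggregates each group.

def map_step(document):
    # Map: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    # Reduce: aggregate the values for one key.
    return key, sum(values)

documents = ["big data is data", "data beats opinion"]
mapped = chain.from_iterable(map_step(d) for d in documents)
counts = dict(reduce_step(k, v) for k, v in shuffle(mapped).items())
print(counts["data"])  # 3
```

In a real cluster the map and reduce steps run on many machines and the shuffle moves data between them; here everything runs in one process, which is exactly the point of the earlier posts about small data.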
With Hadoop, payers have the ability to store a vast amount of data at a fairly inexpensive price point. By distributing the framework, access to the data can happen in a timely manner and payers are able to interact effectively with their distributed data.
Within the Healthcare Payer market, there are a lot of potential use cases for Hadoop and big data. Once the data is stored and linked, and relationships between the data are created, some of the benefits we anticipate include:
- Re-Admission Risk Analysis – One of the key predictors of re-admission rates is whether or not the patient has someone to help them at home. The ability to determine household information (through relationships in member data, for example addresses and care team relationships available within a master data management solution populated with data from a Hadoop cluster) would be very helpful to identify at-risk patients and provide targeted care post discharge. Data from social media outlets can provide quite a bit of household information.
- STARS Rating Improvement – In addition to missed care management plans/drug adherence, another interesting thing that could be better aligned is the member/provider link. Perhaps one specific provider is more successful at getting patients to adhere to Diabetes management protocols, while another provider is not very successful at getting hip replacement patients to complete physical therapy. Being able to link the patient to the provider along with the clinical data can help identify where to focus remediation efforts for possibly modifying provider or member behavior.
- Member Engagement – Taking householding further and putting information from re-admission risk analysis to work – once payers are able to household a group of members and link the household to a specific address – payers might be able to better predict how a new member in the same physical location might behave, and then target outreach to new members from the beginning, utilizing engagement methodologies that have been successful for that physical location in the past.
In order to create the household, or determine how a member feels about a provider (which can then impact how they adhere to treatment plans) or understand how neighborhoods (which are groupings of households) may engage with their providers, payers need access to a vast amount of data. They also need to be able to sift through this data efficiently to create the relationship links as quickly as possible. Sifting through the data is enabled with Hadoop and Big Data. Relating the data can be done with master data management (which I will talk about next).
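As a rough illustration of the householding idea, here is a toy sketch with made-up member IDs that groups members by an exact normalized address. A real master data management solution uses far more sophisticated fuzzy matching; this only shows the shape of the relationship-linking step:

```python
from collections import defaultdict

# Toy householding: group members that share a normalized address.
# Member IDs and addresses are invented for illustration.

def normalize(address):
    # Lowercase, drop periods, collapse whitespace.
    return " ".join(address.lower().replace(".", "").split())

members = [
    ("M1", "12 Oak St."),
    ("M2", "12 Oak  St"),
    ("M3", "7 Elm Ave."),
]

households = defaultdict(list)
for member_id, address in members:
    households[normalize(address)].append(member_id)

print(households["12 oak st"])  # ['M1', 'M2']
```

At payer scale this grouping runs over millions of records, which is where Hadoop-style distribution earns its keep.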
Where is the best place to get started on a Big Data solution? The Big, Big Data Workbook addresses:
- How to choose the right big data project and make it bulletproof from the start – setting clear business and IT objectives, defining metrics that prove your project’s value, and being strategic about datasets, tools and hand-coding.
- What to consider when building your team and data governance framework – making the most of existing skills, thinking strategically about the composition of the team, and ensuring effective communication and alignment of the project goals.
- How to ensure your big data supply chain is lean and effective – establishing clear, repeatable, scalable, and continuously improving processes, and a blueprint for building the ideal big data technology and process architecture.
At the recent Bosch Connected World conference in Berlin, Stefan Bungart, Software Leader Europe at GE, presented a very interesting keynote, “How Data Eats the World” – a nod, I assume, to Marc Andreessen’s statement that “software is eating the world”. One of the key points he addressed was that generating actionable insight from Big Data – securely, in real-time, at every level from local to global, and at an industrial scale – will be the key to survival. Companies that do not invest in data now will eventually end up like the consumer companies that missed the Internet: it will be too late.
As software and the value of data are becoming a larger part of the business value chain, the lines between different industries become more vague, or as GE’s Chairman and CEO Jeff Immelt once stated: “If you went to bed last night as an industrial company, you’re going to wake up today as a software and analytics company.” This is not only true for an industrial company, but for many companies that produce “things”: cars, jet-engines, boats, trains, lawn-mowers, tooth-brushes, nut-runners, computers, network-equipment, etc. GE, Bosch, Technicolor and Cisco are just a few of the industrial companies that offer an Internet of Things (IoT) platform. By offering the IoT platform, they enter domains of companies such as Amazon (AWS), Google, etc. As Google and Apple are moving into new areas such as manufacturing cars and watches and offering insurance, the industry-lines are becoming blurred and service becomes the key differentiator. The best service offerings will be contingent upon the best analytics and the best analytics require a complete and reliable data-platform. Only companies that can leverage data will be able to compete and thrive in the future.
The idea of this “servitization” is that instead of selling assets, companies offer a service that utilizes those assets. For example, Siemens offers hospitals a body-scanning service instead of selling them the MRI scanner, and Philips sells lighting services to cities and large companies, not light bulbs. These business models give suppliers an incentive to minimize disruption and repairs, as these now cost them money. It is also more attractive to put as much device functionality as possible in software, so that upgrades or adjustments can be made without replacing physical components. All of this is made possible by the fact that the devices are connected, generate data, and can be monitored and managed from another location. The data is used to analyse functionality, power consumption and usage, but can also be utilised to predict malfunctions, plan proactive maintenance, etc.
So what impact does this have on data and on IT? First of all, the volumes are immense. Whereas the total global volume of, for example, Twitter messages is around 150GB, ONE gas turbine with around 200 sensors generates close to 600GB per day! Yet according to IDC, only 3% of potentially useful data is tagged and less than 1% is currently analysed. Secondly, the structure of the data is not always straightforward, and even similar devices can produce different content (messages) because they run at different software levels. This has an impact on the backend processing and on the reliability of any analysis of the data.
Also, the data often needs to be put into context with other master data – about things, locations or customers – for real-time decision making. This is a non-trivial task. Next, governance is an aspect that needs top-level support. Questions like: Who owns the data? Who may see or use the data? What data needs to be kept or archived, and for how long? These need to be answered and governed in IoT projects with the same priority as the data in more traditional applications.
To summarize, managing data and mastering data governance are becoming among the most important pillars of companies that lead the digital age. Companies that fail to do so risk becoming the next Blockbuster or Kodak: companies that didn’t adapt quickly enough. To avoid this, companies need to evaluate a data platform that can support a comprehensive data strategy – one that encapsulates scalability, quality, governance, security, ease of use and flexibility, and that enables them to choose the most appropriate data processing infrastructure, whether on premise, in the cloud, or, most likely, a hybrid combination of the two.
Data science should change how your business is run
The importance of data science is becoming more and more clear. Marc Benioff says, “I think for every company, the revolution in data science will fundamentally change how we run our business”. “There’s just a huge amount more data than ever before, our greatest challenge is making sense of that data”. He goes on to say that “we need a new generation of executives who understand how to manage and lead through data. And we also need a new generation of employees who are able to help us organize and structure our business around data”. Marc then says, “when I look at the next set of technologies that we have to build at Salesforce, it is all data science based technology.” Ram Charan, in his Fortune Magazine article, says that to thrive, “companies—and the execs who run them—must transform into math machines” (The Algorithmic CEO, Fortune Magazine, March 2015, page 45).
With such powerful endorsements for data science, the question you may be asking is when you should hire a data scientist or two. There are multiple answers. I liken data science to any business research: you need to do your upfront homework for the data scientists you hire to be effective.
Create a situation analysis before you start
You need to start by defining your problem – are you losing sales, is it taking too long to manufacture something, are you less profitable than you would like to be, and the list goes on. Next, you should create a situational analysis. You want to arm your data scientists with as much information as possible to define what you want them to solve or change. Make sure that you are as concrete as possible here; data scientists struggle when the business people they work with are vague. It is also important that you indicate what kinds of business changes will be considered if the model and data deliver this result or that result.
Next you need to catalog the data that you already have which is relevant to the business problem. Without relevant data there is little that the data scientist can do to help you. With relevant data sources in hand, you need to define the range of actions that you can possibly take once a model has been created.
Be realistic about what is required
With these things in hand, it may be time to hire some data scientists. As you start your process, you need to be realistic about the difficulty of getting a top-flight data scientist. Many of my customers have complained about the difficulty of competing with Google and other tech startups. Just as important, “there is a huge variance in the quality and ability of data scientists” (Data Science for Business, Foster Provost, O’Reilly, page 321). Once you have hired someone, keep in mind that effective data science requires collaboration between business and data science. And please know that data scientists struggle when business people don’t appreciate the effort needed to put together an appropriate training data set or model evaluation procedure.
Make sure internal or external data scientists give you an effective proposal
Once your data scientists are in place, you should realize that a data scientist worth their salt will create a proposal back to you. As we have said, it is important that you know what kinds of things will happen if the model and data deliver this result or that result. The data scientists, in turn, will be able to narrow things down to a dollar impact.
Their proposal should start by sharing their understanding of the business and the data that is available. What business problems are they trying to solve? Next, the data scientists may define things like whether supervised or unsupervised learning will be used. They should then openly discuss what effort will be involved in data preparation, and tell you about the target variable (whose values will be predicted). They should describe their modeling approach, whether more than one model will be evaluated, how models will be compared, and how the final model will be selected. And finally, they should discuss how the model will be evaluated and deployed. Are there evaluation and setup metrics? Data scientists can dedicate time and resources in their proposal to determining which drivers are real versus merely expected.
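As a sketch of the kind of evaluation plan such a proposal might spell out – using synthetic data and a deliberately simple threshold “model”, not any real project’s method – here is the shape of fitting on a training split and comparing against a baseline on held-out data:

```python
import random

# Illustrative sketch only: fit a candidate model on a training split,
# then compare it against a baseline on held-out data, as a data
# science proposal should specify in advance. Data is synthetic.
random.seed(0)

def make_row():
    x = random.random()
    label = int(x > 0.5)
    if random.random() < 0.1:  # 10% label noise
        label = 1 - label
    return x, label

data = [make_row() for _ in range(200)]
train, holdout = data[:150], data[150:]

def accuracy(predict, rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)

# Candidate model: the threshold that does best on the training split.
best_t = max((t / 20 for t in range(20)),
             key=lambda t: accuracy(lambda x: int(x > t), train))

# Baseline: always predict the training majority class.
majority = int(sum(y for _, y in train) * 2 > len(train))

# The final comparison happens only on held-out data.
print("model accuracy:", accuracy(lambda x: int(x > best_t), holdout))
print("baseline accuracy:", accuracy(lambda x: majority, holdout))
```

The details (supervised learning, evaluation metric, how the winner is chosen) are exactly what the proposal should pin down before any model is built.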
To make all this work, it can be a good idea for data scientists to talk in their proposal about likelihoods, because business people who have not been through a quantitative MBA often do not understand or remember statistics. It is important as well that data scientists ask business people the “so what” questions before they begin if the situation analysis is inadequate.
Leading an internal analytics team
In some cases, analytical teams will be built internally. Where this occurs, it is really important that the analytic leader have good people skills. They need as well to be able to set the expectation that people will be making decisions from data and analysis. This includes having the ability to push back when someone comes to them with a recommendation based on gut feel.
The leader needs to hire smart analysts. To keep them, they need a stimulating and supportive work environment. Tom Davenport says analysts are motivated by interesting and challenging work that allows them to utilize their highly specialized skills. Like millennials, analysts find money nice, but they are motivated more by exciting work and having the opportunity to grow and stretch their skills. Please know that data scientists want to spend time refining analytical models rather than doing simple analyses and report generation. Most importantly, they want to do important work that makes a meaningful contribution. To do this, they want to feel supported and valued but have autonomy at work, including the freedom to organize their work. At the same time, analysts like to work together, and they like to be surrounded by other smart and capable colleagues. Make sure to treat your data scientists as a strategic resource. This means you need development plans, career plans, and performance management processes.
As we have discussed, make sure to do your homework before contracting or hiring for data scientists. Once you have done your homework, if you are an analytic leader, make sure that you create a stimulating environment. Additionally, prove the value of analytics by signing up for results that demonstrate data modeling efficacy. To do this, look here for business problems that will lead to a big difference. And finally if you need an analytics leader to emulate, look no further than Brian Cornell, the new CEO of Target.
Myles on Twitter: @MylesSuer
The emergence of the business cloud is making the need for data ever more prevalent. Whatever your business, if your role is in the sales, marketing or service departments, chances are your productivity depends a great deal on the ability to move data quickly in and out of Salesforce and its ecosphere of applications.
With its built-in data transformation intelligence, the Data Wizard (click here to try the Beta version) changes the landscape of what traditional data loaders can do. The Data Wizard takes care of the following aspects so that you don’t have to:
- Data Transformations: We built in over 300 standard data transformations so you don’t have to format the data before bringing it in (e.g. combining first and last names into full names, adding numeric columns for totals, splitting address fields into their separate components).
- Built-in intelligence: We automate the mapping of data into Salesforce for a range of common use cases (e.g. automatically mapping matching fields, intelligently auto-generating date format conversions, concatenating multiple fields).
- App-to-app integration: We incorporated pre-built integration templates to encapsulate the logic required for integrating Salesforce with other applications (e.g. single-click update of customer addresses in a Cloud ERP application based on Account addresses in Salesforce).
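For illustration only – the Data Wizard’s actual transformations are built into the product – here are plain-Python equivalents of the kinds of transformations listed above. The function names and the “street, city, zip” address layout are my own assumptions, not the product’s API:

```python
# Plain-Python sketches of common data transformations: name
# concatenation, address splitting, and numeric totals.

def combine_name(first, last):
    # Combine first and last names into a full name.
    return f"{first} {last}".strip()

def split_address(address):
    # Split an address field into its components, assuming a
    # simple "street, city, zip" layout.
    street, city, zip_code = (part.strip() for part in address.split(","))
    return {"street": street, "city": city, "zip": zip_code}

def total_column(rows, key):
    # Add a numeric total across rows for one column.
    return sum(row[key] for row in rows)

print(combine_name("Ada", "Lovelace"))                         # Ada Lovelace
print(split_address("1 Main St, Springfield, 01101")["city"])  # Springfield
print(total_column([{"amt": 10}, {"amt": 5}], "amt"))          # 15
```

The value of a tool like this is precisely that users don’t have to write or maintain even simple logic like the above.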
Unlike the other data loading apps out there, the Data Wizard doesn’t presuppose any technical ability on the part of the user. It was purpose-built to solve the needs of every type of user, from the Salesforce administrator to the business analyst.
Despite the simplicity the Data Wizard offers, it is built on the robust Informatica Cloud integration platform, providing the same reliability and performance that is key to the success of Informatica Cloud’s enterprise customers, who integrate over 5 billion rows of data per day. We invite you to try the Data Wizard for free, and contribute to the Beta process by providing us with your feedback.
In case you haven’t noticed, data integration is all the rage right now. Why? There are three major reasons for this trend that we’ll explore below, but a recent USA Today story focused on corporate data as a much more valuable asset than it was just a few years ago. Moreover, the sheer volume of data is exploding.
For instance, in a report published by the research company IDC, analysts estimated that the total amount of data created or replicated worldwide in 2012 would add up to 2.8 zettabytes (ZB). By 2020, IDC expects the annual data-creation total to reach 40 ZB, which would amount to a 50-fold increase from where things stood at the start of 2010.
But the growth of data is only a part of the story. Indeed, I see three things happening that drive interest in data integration.
First, the growth of cloud computing. The growth of data integration alongside the growth of cloud computing is logical, considering that we’re relocating data to public clouds, and that data must be synced with systems that remain on-premise.
The data integration providers, such as Informatica, have stepped up. They provide data integration technology that can span enterprises, managed service providers, and clouds, while dealing with the special needs of cloud-based systems. At the same time, data integration improves the way we do data governance and data quality.
Second, the growth of big data. A recent IDC forecast shows that the big data technology and services market will grow at a 26.4% compound annual growth rate to $41.5 billion through 2018, or, about six times the growth rate of the overall information technology market. Additionally, by 2020, IDC believes that line of business buyers will help drive analytics beyond its historical sweet spot of relational to the double-digit growth rates of real-time intelligence and exploration/discovery of the unstructured worlds.
The world of big data revolves around data integration. The more that enterprises rely on big data, and the more that data needs to move from place to place, the more a core data integration strategy and technology is needed. That means you can’t talk about big data without talking about big data integration.
Data integration technology providers have responded with technology that keeps up with the volume of data that moves from place to place. As linked to the growth of cloud computing above, providers also create technology with the understanding that data now moves within enterprises, between enterprises and clouds, and even from cloud to cloud. Finally, data integration providers know how to deal with both structured and unstructured data these days.
Third, better understanding around the value of information. Enterprise managers always knew their data was valuable, but perhaps they did not understand the true value that it can bring.
With the growth of big data, we now have access to information that helps us drive our business in the right directions. Predictive analytics, for instance, allows us to take years of historical data and determine patterns that allow us to predict the future. Mashing up our business data with external data sources makes our data even more valuable.
Of course, data integration drives much of this growth. Thus the renewed focus on data integration approaches and technology. There are years and years of evolution still ahead of us, and much to be learned from the data we maintain.
A lot of my time is spent discussing the enterprise and end-user value of software solutions. Increasingly over the last few years, the solution focus has moved from being first about specific applications and business processes to being data-centric. People start by thinking and asking about what data is collected, displayed, manipulated and automated instead of what the task is (e.g. “we need to better understand how our customers make buying decisions” instead of “we need to streamline our account managers’ daily tasks”). I have been working on a mental model for these different types of solutions, one that would give me a better framework when discussing product, technical and marketing topics with clients or friends in the industry.
I came up with the following framework: a 2×2 matrix that uses two main axes to define the perceived value of data-centric solutions. These are the Volume & Complexity of Data Integration and the Completeness & Flexibility of Data Analytics.
The reason for these axes is that one very real change is that most clients I work with are constantly dealing with distributed applications and business processes, which means figuring out how to bring that data together, either in a new solution or in an analytics solution that can work across the various data sets. There is no single right answer to these issues, but there are very real patterns in how different companies and solutions approach the underlying issue of growing distributed data inside and outside the control of the company.
1. Personal Productivity. These are solutions that collect and present data mostly for individual use, team data sharing and organization. They tend to be single task oriented and provide data reporting functions.
2. Business Productivity. These solutions usually span multiple data sources and are focused on either decision support, communication or collaboration.
3. Business Criticality. These solutions provide new value or capabilities to an organization by adding advanced data analytics that provide automated responses or secondary views across distributed data sources.
4. Life Criticality. These solutions are a special subset aimed at individual, group or social impact. Traditionally these have been very proprietary, closed systems. The main trend in data-centric solutions is coming from more government and business data being exposed, which can be integrated into new solutions that we previously could not build, let alone think up. I do not yet have a good example of a real one, but I see it as the higher-level solution that evolves at the juncture where real-time data meets analytics and distributed data sets.
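One hedged way to make the framework concrete in code – the numeric scoring and thresholds below are my own illustration, not part of the framework itself:

```python
# Illustrative scoring for the 2x2 framework: rate a solution 0-10 on
# each axis and map the combined score to one of the four levels.
# The 0-10 scale and the bucketing thresholds are assumptions.

LEVELS = ["Personal Productivity", "Business Productivity",
          "Business Criticality", "Life Criticality"]

def classify(integration_complexity, analytics_flexibility):
    combined = integration_complexity + analytics_flexibility  # 0-20
    return LEVELS[min(combined // 6, 3)]

print(classify(2, 2))    # Personal Productivity
print(classify(9, 10))   # Life Criticality
```

A single-user to-do tracker would score low on both axes; a cross-agency public-safety system drawing on many real-time feeds would score high on both.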
Here are some examples of current solutions as I would map them on the perceived-value framework. Some of these are well known and others you have probably never heard of. Many of these new solutions would not have been easy to create without technology that provides easier access to data from distributed resources, or the compute power to support decision support.
What I really like about this value framework is that it allows us to get beyond all the buzzwords of IoT, Big Data, etc., and focus on the real needs and solutions – the ones that cross over these technical or singular topics but on their own are not high-value business solutions. Feedback welcome.