Tag Archives: Big Data
While CIOs are urged to rethink their backup strategies following warnings from leading analysts that companies are wasting billions on unnecessary storage, consultants and IT solution vendors are selling “Big Data” narratives to these CIOs as a storage optimization strategy.
What a CIO must do is ask:
Do you think a Backup Strategy is the same as a Big Data strategy?
Is your MO – “I must invest in Big Data because my competitor is”?
Do you think Big Data and “data analysis” are synonyms?
Most companies invest very little in their storage technologies, while spending on server and network technologies primarily for backup. Further, the most common mistake businesses make is to fail to update their backup policies. It is not unusual for companies to be using backup policies that are years or even decades old, which do not discriminate between business-critical files and the personal music files of employees.
Web giants like Facebook and Yahoo generally aren’t dealing with Big Data. They run their own giant, in-house “clusters” – collections of powerful servers – for crunching data. But, it appears that those clusters are unnecessary for many of the tasks which they’re handed. In the case of Facebook, most of the jobs engineers ask their clusters to perform are in the “megabyte to gigabyte” range, which means they could easily be handled on a single computer – even a laptop.
The necessity of breaking problems into many small parts, and processing each on a large array of computers, characterizes classic Big Data problems like Google’s need to compute the rank of every single web page on the planet.
In “Nobody ever got fired for buying a cluster,” Microsoft Research points out that a lot of the problems solved by engineers at even the most data-hungry firms don’t need to be run on clusters. Why is that a problem? Because there are vast classes of problems for which these clusters are a relatively inefficient, or even an inappropriate, solution.
Here is an example of a post exhorting readers to “Incorporate Big Data Into Your Small Business” that is about a quantity of data that probably wouldn’t strain Google Docs, much less Excel on a single laptop. In other words, most businesses are dealing with small data. It’s very important stuff but it has little connection to the big kind.
Let us lose the habit of putting “big” in front of data to make it sound important. After all, supersizing your data, just because you can, is going to cost you a lot more and may yield a lot less.
So what is it? Big Data, small Data, or Smart Data?
Gregor Mendel uncovered the secrets of genetic inheritance with just enough data to fill a notebook. The important thing is gathering the right data, not gathering some arbitrary quantity of it.
First – let’s start off with a description of what exactly Big Data is…simply put: lots and lots of data. According to Wikipedia: “Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or other certain advanced methods to extract value from data, and seldom to a particular size of data set.”
There are many different sources of data (claims systems, enrollment systems, benefits administration systems, survey results, consumer data, social media, personal health devices – like Fitbit). Each source generates an amazing amount of data. These data sets grow in size because they are being gathered by readily available and numerous information-sensing mobile devices, aerial (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 exabytes (2.5×10^18 bytes) of data were created. In order to make sense of all of this data, we need to be able to organize it, create linkages between the data and then perform analysis on the data in order to provide meaningful actions.
In 2000, Seisint Inc. developed a C++-based distributed file sharing framework for data storage and querying to support the vast amount of storage that is necessary for this data. With this framework, structured, semi-structured and/or unstructured data can be stored and distributed across multiple servers.
In 2004, Google published a paper on a process called MapReduce that uses a similar distributed architecture. The MapReduce framework provides a parallel processing model and associated implementation to process huge amounts of data. With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step). The results are then gathered and delivered (the Reduce step). The framework was very successful, so others wanted to replicate the algorithm. Therefore, an implementation of the MapReduce framework was adopted by an Apache open source project named Hadoop.
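To make the Map and Reduce steps concrete, here is a minimal single-machine sketch in Python. It only illustrates the programming model; it is not Hadoop's or Google's implementation, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example. A real cluster would run the map and reduce calls on many nodes in parallel.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different node; here we simply loop.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data small data", "smart data"])
print(counts)
```

The point of the split is that `map_phase` needs only one document and `reduce_phase` needs only one key, so both can be spread across machines without sharing state.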
With Hadoop, payers have the ability to store a vast amount of data at a fairly inexpensive price point. By distributing the framework, access to the data can happen in a timely manner and payers are able to interact effectively with their distributed data.
Within the Healthcare Payer market, there are a lot of potential use cases for Hadoop or big data. Once the data is stored, linked and relationships between the data are created – some of the benefits we anticipate include:
- Re-Admission Risk Analysis – One of the key predictors of re-admission rates is whether or not the patient has someone to help them at home. The ability to determine household information (through relationships in member data, for example addresses and care team relationships available within a master data management solution populated with data from a Hadoop cluster) would be very helpful to identify at-risk patients and provide targeted care post discharge. Data from social media outlets can provide quite a bit of household information.
- STARS Rating Improvement – In addition to missed care management plans/drug adherence, another interesting thing that could be better aligned is the member/provider link. Perhaps one specific provider is more successful at getting patients to adhere to diabetes management protocols, while another provider is not very successful at getting hip replacement patients to complete physical therapy. Being able to link the patient to the provider along with the clinical data can help identify where to focus remediation efforts for possibly modifying provider or member behavior.
- Member Engagement – Taking householding further, and putting information from re-admission risk analysis to work: once payers are able to household a group of members and link the household to a specific address, they might be able to better predict how a new member in the same physical location might behave. They could then target outreach to new members from the beginning, using engagement methodologies that have been successful for that physical location in the past.
In order to create the household, or determine how a member feels about a provider (which can then impact how they adhere to treatment plans) or understand how neighborhoods (which are groupings of households) may engage with their providers, payers need access to a vast amount of data. They also need to be able to sift through this data efficiently to create the relationship links as quickly as possible. Sifting through the data is enabled with Hadoop and Big Data. Relating the data can be done with master data management (which I will talk about next).
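As a rough illustration of what “householding” means in practice, the sketch below groups members by a normalized address. It is a toy example under loud assumptions: the record fields (`member_id`, `address`), the sample data and the crude normalization rule are all hypothetical, and a real master data management solution would use far more robust matching.

```python
from collections import defaultdict


def normalize_address(address):
    """Crude normalization: lowercase, drop punctuation, collapse whitespace."""
    cleaned = "".join(ch for ch in address.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())


def build_households(members):
    """Link members that share the same normalized address into one household."""
    households = defaultdict(list)
    for member in members:
        households[normalize_address(member["address"])].append(member["member_id"])
    return dict(households)


# Hypothetical member records, for illustration only.
members = [
    {"member_id": "M1", "address": "12 Oak St., Springfield"},
    {"member_id": "M2", "address": "12 oak st springfield"},
    {"member_id": "M3", "address": "98 Elm Ave, Springfield"},
]

households = build_households(members)
print(households)
```

Here M1 and M2 end up in the same household even though their addresses were typed differently, which is the kind of relationship link the use cases above depend on.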
Where is the best place to get started on a Big Data solution? The Big, Big Data Workbook addresses:
- How to choose the right big data project and make it bulletproof from the start – setting clear business and IT objectives, defining metrics that prove your project’s value, and being strategic about datasets, tools and hand-coding.
- What to consider when building your team and data governance framework – making the most of existing skills, thinking strategically about the composition of the team, and ensuring effective communication and alignment of the project goals.
- How to ensure your big data supply chain is lean and effective – establishing clear, repeatable, scalable, and continuously improving processes, and a blueprint for building the ideal big data technology and process architecture.
In case you haven’t noticed, data integration is all the rage right now. Why? There are three major reasons for this trend that we’ll explore below, but a recent USA Today story focused on corporate data as a much more valuable asset than it was just a few years ago. Moreover, the sheer volume of data is exploding.
For instance, in a report published by research company IDC, they estimated that the total count of data created or replicated worldwide in 2012 would add up to 2.8 zettabytes (ZB). By 2020, IDC expects the annual data-creation total to reach 40 ZB, which would amount to a 50-fold increase from where things stood at the start of 2010.
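The figures above imply a couple of numbers that are easy to check. The short sketch below works them out; the helper names are made up for this example, and treating 2012 to 2020 as an eight-year span is an assumption of mine, not something the IDC report states.

```python
def implied_baseline(target, fold_increase):
    """Starting value implied by a target and an N-fold increase."""
    return target / fold_increase


def compound_annual_growth_rate(start, end, years):
    """Constant yearly growth rate that turns `start` into `end` over `years`."""
    return (end / start) ** (1 / years) - 1


# A 50-fold increase to 40 ZB implies about 0.8 ZB at the start of 2010.
baseline_2010_zb = implied_baseline(40.0, 50)

# Growing from 2.8 ZB (2012) to 40 ZB (2020), assuming an 8-year span.
cagr_2012_2020 = compound_annual_growth_rate(2.8, 40.0, 8)

print(f"Implied 2010 baseline: {baseline_2010_zb:.1f} ZB")
print(f"Implied annual growth 2012-2020: {cagr_2012_2020:.1%}")
```

Under those assumptions the annual data-creation total would need to grow by roughly 39 percent a year to reach the 2020 forecast.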
But the growth of data is only a part of the story. Indeed, I see three things happening that drive interest in data integration.
First, the growth of cloud computing. The growth of data integration around the growth of cloud computing is logical, considering that we’re relocating data to public clouds, and that data must be synced with systems that remain on-premise.
The data integration providers, such as Informatica, have stepped up. They provide data integration technology that can span enterprises, managed service providers, and clouds, and that deals with the special needs of cloud-based systems. Moreover, at the same time, data integration improves the way we do data governance and data quality.
Second, the growth of big data. A recent IDC forecast shows that the big data technology and services market will grow at a 26.4% compound annual growth rate to $41.5 billion through 2018, or, about six times the growth rate of the overall information technology market. Additionally, by 2020, IDC believes that line of business buyers will help drive analytics beyond its historical sweet spot of relational to the double-digit growth rates of real-time intelligence and exploration/discovery of the unstructured worlds.
The world of big data revolves around data integration. The more that enterprises rely on big data, and the more that data needs to move from place to place, the more a core data integration strategy and technology is needed. That means you can’t talk about big data without talking about big data integration.
Data integration technology providers have responded with technology that keeps up with the volume of data that moves from place to place. As linked to the growth of cloud computing above, providers also create technology with the understanding that data now moves within enterprises, between enterprises and clouds, and even from cloud to cloud. Finally, data integration providers know how to deal with both structured and unstructured data these days.
Third, better understanding around the value of information. Enterprise managers always knew their data was valuable, but perhaps they did not understand the true value that it can bring.
With the growth of big data, we now have access to information that helps us drive our business in the right directions. Predictive analytics, for instance, allows us to take years of historical data and determine patterns that allow us to predict the future. Mashing up our business data with external data sources makes our data even more valuable.
Of course, data integration drives much of this growth. Thus the refocus on data integration approaches and tech. There are years and years of evolution still ahead of us, and much to be learned from the data we maintain.
For those hoping to push through a hard-hitting analytics effort that will serve as a beacon of light within an otherwise calcified organization, there’s probably a lot of work cut out for you. Evolving into an organization that fully grasps the power and opportunities of data analytics requires cultural change, and this is a challenge organizations have only begun to grasp.
“Sitting down with pizza and coffee can get you around most of the technical challenges,” explained Sam Ransbotham, Ph.D., associate professor at Boston College, at a recent panel webcast hosted by MIT Sloan Management Review, “but the cultural problems are much larger.”
That’s one of the key takeaways from the panel, in which Ransbotham was joined by Tuck Rickards, head of digital transformation practice at Russell Reynolds Associates, a digital recruiting firm, and Denis Arnaud, senior data scientist at Amadeus Travel Intelligence. The panel, which examined the impact of corporate culture on data analytics, was led by Michael Fitzgerald, contributing editor at MIT Sloan Management Review.
The path to becoming an analytics-driven company is a journey that requires transformation across most or all departments, the panelists agreed. “It’s fundamentally different to be a data-driven decision company than kind of a gut-feel decision-making company,” said Rickards. “Acquiring this capability to do things differently usually requires a massive culture shift.”
That’s because the cultural aspects of the organization – “the values, the behaviors, the decision making norms and the outcomes go hand in hand with data analytics,” said Ransbotham. “It doesn’t do any good to have a whole bunch of data processes if your company doesn’t have the culture to act on them and do something with them.” Rickards adds that bringing this all together requires an agile, open source mindset, with frequent, open communication across the organization.
So how does one go about building and promoting a culture that is conducive to getting the maximum benefit from data analytics? The most important piece is bringing aboard people who are aware of and skilled in analytics – both from within the enterprise and from outside, the panelists urged. Ransbotham points out that it may seem daunting, but it’s not. “This is not some gee-whiz thing,” he said. “We have to get rid of this mindset that these things are impossible. Everybody who has figured it out has figured it out somehow. We’re a lot more able to pick up on these things than we think – the technology is getting easier, it doesn’t require quite as much as it used to.”
The key to evolving corporate culture to becoming more analytics-driven is to identify or recruit enlightened and skilled individuals who can provide the vision and build a collaborative environment. “The most challenging part is looking for someone who can see the business more broadly, and can interface with the various business functions – ideally, someone who can manage change and transformation throughout the organization,” Rickards said.
Arnaud described how his organization – an online travel service – went about building an esprit de corps between data analytics staff and business staff to ensure the success of their company’s analytics efforts. “Every month all the teams would do a hands-on workshop, together in some place in Europe [Amadeus is headquartered in Madrid, Spain].” For example, a workshop may focus on a market analysis for a specific customer, and the participants would explore the entire end-to-end process for working with the customer, “from the data collection all the way through to data acquisition through data crunching and so on. The one knowing the data analysis techniques would explain them, and the one knowing the business would explain that, and so on.” As a result of these monthly workshops, business and analytics team members have found it “much easier to collaborate,” he added.
Web-oriented companies such as Amadeus – or Amazon and eBay for that matter — may be paving the way with analytics-driven operations, but companies in most other industries are not at this stage yet, both Rickards and Ransbotham point out. The more advanced web companies have built “an end-to-end supply chain, wrapped around customer interaction,” said Rickards. “If you think of most traditional businesses, financial services or automotive or healthcare are a million miles away from that. It starts with having analytic capabilities, but it’s a real journey to take that capability across the company.”
The analytics-driven business of the near future – regardless of industry – will likely be staffed with roles not yet seen today. “If you are looking to re-architect the business, you may be imagining roles that you don’t have in the company today,” said Rickards. Along with the need for chief analytics officers, data scientists, and data analysts, there will be many new roles created. “If you are on the analytics side of this, you can be in an analytics group or a marketing group, with more of a CRM or customer insights title. You can be in planning or business functions. In a similar way on the technology side, there are people very focused on architecture and security.”
Ultimately, the demand will be for leaders and professionals who understand both the business and technology sides of the opportunity, Rickards continued. In the end, he added, “you can have good people building a platform, and you can have good data scientists. But you better have someone on the top of that organization knowing the business purpose.”
On March 25th, Josh Lee, Global Director for Insurance Marketing at Informatica and Cindy Maike, General Manager, Insurance at Hortonworks, will be joining the Insurance Journal in a webinar on “How to Become an Analytics Ready Insurer”.
Register for the Webinar on March 25th at 10am Pacific/ 1pm Eastern
Josh and Cindy exchange perspectives on what “analytics ready” really means for insurers, and today we are sharing some of their views on the five questions posed here (join the webinar to learn more). Please join Insurance Journal, Informatica and Hortonworks on March 25th for more on this exciting topic.
See the Hortonworks site for a second posting of this blog and more details on exciting innovations in Big Data.
- What makes a big data environment attractive to an insurer?
CM: Many insurance companies are using new types of data to create innovative products that better meet their customers’ risk needs. For example, we are seeing insurance for “shared vehicles” and new products for prevention services. Much of this innovation is made possible by the rapid growth in sensor and machine data, which the industry incorporates into predictive analytics for risk assessment and claims management.
Customers who buy personal lines of insurance also expect the same type of personalized service and offers they receive from retailers and telecommunication companies. They expect carriers to have a single view of their business that permeates customer experience, claims handling, pricing and product development. Big data in Hadoop makes that single view possible.
JL: Let’s face it, insurance is all about analytics. Better analytics leads to better pricing, reduced risk and better customer service. But here’s the issue. Existing data stores are costly for holding vast amounts of data and too inflexible to adapt to the changing needs of innovative analytics. Imagine kicking off a simulation or modeling routine one evening only to return in the morning and find it incomplete or lacking data that requires a special request of IT.
This is where big data environments are helping insurers. Larger, more flexible data sets allow longer series of analytics to be run, generating better results. And imagine doing all that at a fraction of the cost and time of traditional data structures. Oh, and heaven forbid you ask a mainframe to do any of this.
- So we hear a lot about Big Data being great for unstructured data. What about traditional data types that have been used in insurance forever?
CM: Traditional data types are very important to the industry – they drive our regulatory reporting and much of the performance management reporting. This data will continue to play a very important role in the insurance industry and for companies.
However, big data can now enrich that traditional data with new data sources for new insights. In areas such as customer service and product personalization, it can make the difference between cross-selling the right products to meet customer needs and losing the business. For commercial and group carriers, the new data provides the ability to better analyze risk needs, price accordingly and enable superior service in a highly competitive market.
JL: Traditional data will always be around. I doubt that I will outlive a mainframe installation at an insurer; which makes me a little sad. And for many rote tasks like financial reporting, a sales report, or a commission statement, those are sufficient. However, the business of insurance is changing in leaps and bounds. Innovators in data science are interested in correlating those traditional sources to other creative data to find new products, or areas to reduce risk. There is just a lot of data that is either ignored or locked in obscure systems that needs to be brought into the light. This data could be structured or unstructured, it doesn’t matter, and Big Data can assist there.
- How does this fit into an overall data management function?
JL: At the end of the day, a Hadoop cluster is another source of data for an insurer. More flexible, more cost effective and higher speed; but yet another data source for an insurer. So that’s one more on top of relational, cubes, content repositories, mainframes and whatever else insurers have latched onto over the years. So if it wasn’t completely obvious before, it should be now. Data needs to be managed. As data moves around the organization for consumption, it is shaped, cleaned, copied and we hope there is governance in place. And the Big Data installation is not exempt from any of these routines. In fact, one could argue that it is more critical to leverage good data management practices with Big Data not only to optimize the environment but also to eventually replace traditional data structures that just aren’t working.
CM: Insurance companies are blending new and old data and looking for the best ways to leverage “all data”. We are witnessing the development of a new generation of advanced analytical applications to take advantage of the volume, velocity, and variety in big data. We can also enhance current predictive models, enriching them with the unstructured information in claim and underwriting notes or diaries along with other external data.
There will be challenges. Insurance companies will still need to make important decisions on how to incorporate the new data into existing data governance and data management processes. The Chief Data or Chief Analytics officer will need to drive this business change in close partnership with IT.
- Tell me a little bit about how Informatica and Hortonworks are working together on this?
JL: For years Informatica has been helping our clients to realize the value in their data and analytics. And while enjoying great success in partnership with our clients, unlocking the full value of data requires new structures, new storage and something that doesn’t break the bank for our clients. So Informatica and Hortonworks are on a continuing journey to show that value in analytics comes with strong relationships between the Hadoop distribution and innovative market leading data management technology. As the relationship between Informatica and Hortonworks deepens, expect to see even more vertically relevant solutions and documented ROI for the Informatica/Hortonworks solution stack.
CM: Informatica and Hortonworks optimize the entire big data supply chain on Hadoop, turning data into actionable information to drive business value. By incorporating data management services into the data lake, companies can store and process massive amounts of data across a wide variety of channels including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field.
Matching data from internal sources (e.g. very granular data about customers) with external data (e.g. weather data or driving patterns in specific geographic areas) can unlock new revenue streams.
See this video for a discussion on unlocking those new revenue streams. Sanjay Krishnamurthi, Informatica CTO, and Shaun Connolly, Hortonworks VP of Corporate Strategy, share their perspectives.
- Do you have any additional comments on the future of data in this brave new world?
CM: My perspective is that, over time, we will drop the reference to “big” or “small” data and get back to referring simply to “Data”. The term big data has been useful to describe the growing awareness of how the new data types can help insurance companies grow.
We can no longer use “traditional” methods to gain insights from data. Insurers need a modern data architecture to store, process and analyze data—transforming it into insight.
We will see an increase in new market entrants in the insurance industry, and existing insurance companies will improve their products and services based upon the insights they have gained from their data, regardless of whether that was “big” or “small” data.
JL: I’m sure that even now there is someone locked in their mother’s basement playing video games and trying to come up with the next data storage wave. So we have that to look forward to, and I’m sure it will be cool. But, if we are honest with ourselves, we’ll admit that we really don’t know what to do with half the data that we have. So while data storage structures are critical, the future holds even greater promise for new models, better analytical tools and applications that can make sense of all of this and point insurers in new directions. The trend that won’t change anytime soon is the ongoing need for good quality data, data ready at a moment’s notice, safe and secure and governed in a way that insurers can trust what those cool analytics show them.
Please join us for an interactive discussion on March 25th at 10am Pacific Time/ 1pm Eastern Time.
Despite more than $30 billion in annual spending on Big Data, successful big data implementations elude most organizations. That’s the sobering assessment of a recent study of 226 senior executives from Capgemini, which found that only 13 percent feel they have truly made any headway with their big data efforts.
The reasons for Big Data’s lackluster performance include the following:
- Data is in silos or legacy systems, scattered across the enterprise
- No convincing business case
- Ineffective alignment of Big Data and analytics teams across the organization
- Most data locked up in petrified, difficult to access legacy systems
- Lack of Big Data and analytics skills
Actually, there is nothing new about any of these issues – in fact, the perceived issues with Big Data initiatives so far map closely to the failed expectations of many other technology-driven initiatives. First, there’s the hype that tends to get way ahead of any actual well-functioning case studies. Second, there’s the notion that managers can simply take a solution of impressive magnitude and drop it on top of their organizations, expecting overnight delivery of profits and enhanced competitiveness.
Technology, and Big Data itself, is but a tool that supports the vision, well-designed plans and hard work of forward-looking organizations. Those managers seeking transformative effects need to look deep inside their organizations, at how deeply innovation is allowed to flourish, and in turn, how their employees are allowed to flourish. Think about it: if line employees suddenly have access to alternative ways of doing things, would they be allowed to run with it? If someone discovers through Big Data that customers are using a product differently than intended, do they have the latitude to promote that new use? Or do they have to go through chains of approval?
Big Data may be what everybody is after, but Big Culture is the ultimate key to success.
For its part, Capgemini provides some high-level recommendations for better baking in transformative values as part of Big Data initiatives, based on their observations of best-in-class enterprises:
The vision thing: “It all starts with vision,” says Capgemini’s Ron Tolido. “If the company executive leadership does not actively, demonstrably embrace the power of technology and data as the driver of change and future performance, nothing digitally convincing will happen. We have not even found one single exception to this rule. The CIO may live and breathe Big Data and there may even be a separate Chief Data Officer appointed – expect more of these soon – if they fail to commit their board of executives to data as the engine of success, there will be a dark void beyond the proof of concept.”
Establish a well-defined organizational structure: “Big Data initiatives are rarely, if ever, division-centric,” the Capgemini report states. “They often cut across various departments in an organization. Organizations that have clear organizational structures for managing rollout can minimize the problems of having to engage multiple stakeholders.”
Adopt a systematic implementation approach: Surprisingly, even the largest and most sophisticated organizations that do everything on process don’t necessarily approach Big Data this way, the report states. “Intuitively, it would seem that a systematic and structured approach should be the way to go in large-scale implementations. However, our survey shows that this philosophy and approach are rare. Seventy-four percent of organizations did not have well-defined criteria to identify, qualify and select Big Data use-cases. Sixty-seven percent of companies did not have clearly defined KPIs to assess initiatives. The lack of a systematic approach affects success rates.”
Adopt a “venture capitalist” approach to securing buy-in and funding: “The returns from investments in emerging digital technologies such as Big Data are often highly speculative, given the lack of historical benchmarks,” the Capgemini report points out. “Consequently, in many organizations, Big Data initiatives get stuck due to the lack of a clear and attributable business case.” To address this challenge, the report urges that Big Data leaders manage investments “by using a similar approach to venture capitalists. This involves making multiple small investments in a variety of proofs of concept, allowing rapid iteration, and then identifying PoCs that have potential and discarding those that do not.”
Leverage multiple channels to secure skills and capabilities: “The Big Data talent gap is something that organizations are increasingly coming face-to-face with. Closing this gap is a larger societal challenge. However, smart organizations realize that they need to adopt a multi-pronged strategy. They not only invest more on hiring and training, but also explore unconventional channels to source talent.” Capgemini advises reaching out to partner organizations for the skills needed to develop Big Data initiatives. These can be employee exchanges, or “setting up innovation labs in high-tech hubs such as Silicon Valley.” Startups may also be another source of Big Data talent.
Over and over, when talking with people who are starting to learn Data Science, there’s a frustration that comes up: “I don’t know which programming language to start with.”
Moreover, it’s not just programming languages; it’s also software systems like Tableau, SPSS, etc. There is an ever-widening range of tools and programming languages and it’s difficult to know which one to select.
I get it. When I started focusing heavily on data science a few years ago, I reviewed all of the popular programming languages at the time: Python, R, SAS, D3, not to mention a few that, in hindsight, really aren’t that great for analytics, like Perl, Bash, and Java. I once read a suggestion to use arcane tools like UNIX’s AWK and SED.
There are so many suggestions, so much material, so many options that it becomes difficult to know what to learn first. There’s a mountain of content, and it’s difficult to know where to find the “gold nuggets”: the things to learn that will bring you the highest return on your time investment.
That’s the crux of the problem. The fact is that time is limited. Learning a new programming language is a large investment of your time, so you need to be strategic about which one you select. To be clear, some languages will yield a very high return on your investment. Other languages are purely auxiliary tools that you might use only a few times per year.
Let me make this easy for you: learn R first. Here’s why:
R is becoming the “lingua franca” of data science
R is becoming the lingua franca for data science. That’s not to say that it’s the only language, or that it’s the best tool for every job. It is, however, the most widely used and it is rising in popularity.
As I’ve noted before, O’Reilly Media conducted a survey in 2014 to understand the tools that data scientists are currently using. They found that R is the most popular programming language (if you don’t count SQL as a “proper” programming language).
Looking more broadly, there are other rankings that look at programming language popularity in general. For example, Redmonk measures programming language popularity by examining discussion (on Stack Overflow) and usage (on GitHub). In their latest rankings, R placed 13th, the highest of any statistical programming language. Redmonk also noted that R has been rising in popularity over time.
A similar ranking by TIOBE, which ranks programming languages by the number of search engine searches, indicates a strong year-over-year rise for R.
Keep in mind that the Redmonk and TIOBE rankings are for all programming languages. When you look at these, R is now ranking among the most popular and most commonly used overall.
It’s often said that 80% of the work in data science is data manipulation. More often than not, you’ll need to spend significant amounts of your time “wrangling” your data; putting it into the shape you want. R has some of the best data management tools you’ll find.
The dplyr package in R makes data manipulation easy. It is the tool I wish I had years ago. When you “chain” the basic dplyr verbs together, you can dramatically simplify your data manipulation workflow.
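To make the idea of chaining concrete, here is a minimal sketch. It assumes the dplyr package is installed and uses the mtcars data set that ships with R; the choice of data set, the filter condition, and the name mpg_by_cyl are illustrative only.

```r
# Illustrative sketch; assumes the dplyr package is installed.
library(dplyr)

# mtcars is a small example data set bundled with R.
mpg_by_cyl <- mtcars %>%
  filter(cyl > 4) %>%                   # keep only 6- and 8-cylinder cars
  group_by(cyl) %>%                     # one group per cylinder count
  summarise(mean_mpg = mean(mpg)) %>%   # average fuel economy per group
  arrange(desc(mean_mpg))               # highest average first

print(mpg_by_cyl)
```

Each verb takes a data frame and returns a data frame, which is what lets the steps read top to bottom as a single pipeline instead of a series of nested calls or temporary variables.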
ggplot2 is one of the best data visualization tools around, as of 2015. What’s great about ggplot2 is that as you learn the syntax, you also learn how to think about data visualization.
I’ve said numerous times that there is a deep structure to all statistical visualizations. There is a highly structured framework for thinking about and creating all data visualizations. ggplot2 is based on that framework. By learning ggplot2, you will learn how to think about visualizing data.
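As a rough illustration of that structure, the sketch below builds a scatterplot from its parts: a data set, a mapping from variables to visual properties, and a geometric object. It assumes the ggplot2 package is installed; the mtcars data set and the variables chosen are only examples.

```r
# Illustrative sketch; assumes the ggplot2 package is installed.
library(ggplot2)

# A plot is assembled from a data set, aesthetic mappings, and a geom.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +  # data + variable-to-axis mappings
  geom_point()                               # draw each row as a point

# Changing one component changes the plot: here colour is mapped to
# the number of cylinders, and everything else stays the same.
p_by_cyl <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point()

print(p)
print(p_by_cyl)
```

Because the syntax mirrors the components of the plot, swapping the geom or adding a mapping is a small, local change to the code.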
Finally, there’s machine learning. While I think most beginning data science students should wait to learn machine learning (it is much more important to learn data exploration first), machine learning is an important skill. When data exploration stops yielding insight, you need stronger tools.
When you’re ready to start using (and learning) machine learning, R has some of the best tools and resources.
One of the best, most referenced introductory texts on machine learning, An Introduction to Statistical Learning, teaches machine learning using the R programming language. Additionally, the Stanford Statistical Learning course uses this textbook, and teaches machine learning in R.
Summary: Learn R, and focus your efforts
Once you start to learn R, don’t get “shiny new object” syndrome.
You’re likely to see demonstrations of new techniques and tools. Just look at some of the dazzling data visualizations that people are creating.
Seeing other people create great work (and finding out that they’re using a different tool) might lead you to try something else. Trust me on this: you need to focus. Don’t get “shiny new object” syndrome. You need to be able to devote a few months (or longer) to really diving into one tool.
And as I noted above, you really want to build up your competence in skills across the data science workflow. You need to have solid skills at least in data visualization and data manipulation. You need to be able to do some serious data exploration in R before you start moving on.
Spending 100 hours on R will yield vastly better returns than spending 10 hours on 10 different tools. In the end, your time ROI will be higher by concentrating your efforts. Don’t get distracted by the “latest, sexy new thing.”
Let’s face it, building a Data Governance program is no overnight task. As one CDO puts it: “data governance is a marathon, not a sprint.” Why? Because data governance is a complex business function that encompasses technology, people and process, all of which have to work together effectively to ensure the success of the initiative. Because of the scope of the program, Data Governance often calls for participants from different business units within an organization, and it can be disruptive at first.
Why bother, then, given that data governance is complex, disruptive, and could potentially introduce additional cost to a company? Well, the drivers for data governance can vary for different organizations. Let’s take a close look at some of the motivations behind a data governance program.
For companies in heavily regulated industries, establishing a formal data governance program is a mandate. When a company is not compliant, consequences can be severe. Penalties could include hefty fines, brand damage, loss in revenue, and even potential jail time for the person who is held accountable for the noncompliance. In order to meet ongoing regulatory requirements and adhere to data security policies and standards, companies need to rely on clean, connected and trusted data to enable transparency and auditability in their reporting, meet mandatory requirements and answer critical questions from auditors. Without a dedicated data governance program in place, the compliance initiative could become an ongoing nightmare for companies in regulated industries.
A data governance program can also be established to support a customer-centricity initiative. To make effective cross-sells and up-sells to your customers and grow your business, you need clear visibility into customer purchasing behaviors across multiple shopping channels and touch points. Customers’ shopping behaviors and attributes are captured in the data. Therefore, to gain a thorough understanding of your customers and boost your sales, a holistic Data Governance program is essential.
Other reasons for companies to start a data governance program include improving efficiency and reducing operational cost, supporting better analytics and driving more innovation. As long as it’s a business-critical area, data is at the core of the process, and the business case is sound, there is a compelling reason for launching a data governance program.
Now that we have identified the drivers for data governance, how do we start? This rather loaded question really gets into the details of the implementation. A few critical elements come into consideration, including: identifying and establishing various task forces such as a steering committee, a data governance team and business sponsors; identifying roles and responsibilities for the stakeholders involved in the program; and defining metrics for tracking the results. And soon you will find that on top of everything, communication, communication and more communication is probably the most important tactic of all for driving the initial success of the program.
A rule of thumb? Start small, take one step at a time and focus on producing something tangible.
Sounds easy, right? Well, let’s hear what the real-world practitioners have to say. Join us at this Informatica webinar to hear Michael Wodzinski, Director of Information Architecture, Lisa Bemis, Director of Master Data, and Fabian Torres, Director of Project Management, from Houghton Mifflin Harcourt, a global leader in publishing, as well as David Lyle, VP of product strategy from Informatica, discuss how to implement a successful data governance practice that brings business impact to an enterprise organization.
If you are currently kicking the tires on setting up a data governance practice in your organization, I’d like to invite you to visit a member-only website dedicated to Data Governance: http://governyourdata.com/. This site currently has over 1,000 members and is designed to foster open communications on everything data governance. There you will find conversations on best practices, methodologies, frameworks, tools and metrics. I would also encourage you to take a data governance maturity assessment to see where you currently stand on the data governance maturity curve, and compare the result against industry benchmarks. More than 200 members have taken the assessment to gain a better understanding of their current data governance program, so why not give it a shot?
Data Governance is a journey, likely a never-ending one. We wish you the best of luck on this effort and a joyful ride! We would love to hear your stories.
2014 was a turning point for Informatica as our investments in Hadoop and efforts to innovate in big data gathered momentum and became a core part of Informatica’s business. Our Hadoop-related big data revenue growth was in the ballpark of leading Hadoop startups, more than doubling over 2013.
In 2014, Informatica reached about 100 enterprise customers of our big data products, with an increasing number going into production with Informatica together with Hadoop and other big data technologies. Informatica’s big data Hadoop customers include companies in financial services, insurance, telecommunications, technology, energy, life sciences, healthcare and business services. These innovative companies are leveraging Informatica to accelerate their time to production and drive greater value from their big data investments.
These customers are in production with, or implementing, a wide range of use cases leveraging Informatica’s great data pipeline capabilities to better put the scale, efficiency and flexibility of Hadoop to work. Many Hadoop customers start by optimizing their data warehouse environments by moving data storage, profiling, integration and cleansing to Hadoop in order to free up capacity in their traditional analytics data warehousing systems. Customers that are further along in their big data journeys have expanded to use Informatica on Hadoop for exploratory analytics of new data types, 360 degree customer analytics, fraud detection, predictive maintenance, and analysis of massive amounts of Internet of Things machine data for optimization of energy exploration, manufacturing processes, network data, security and other large scale systems initiatives.
2014 was not just a year of market momentum for Informatica, but also one of new product development innovations. We shipped enhanced functionality for entity matching and relationship building at Hadoop scale (a key part of Master Data Management), end-to-end data lineage through Hadoop, as well as high performance real-time streaming of data into Hadoop. We also launched connectors to NoSQL and analytics databases including Datastax Cassandra, MongoDB and Amazon Redshift. Informatica advanced our capabilities to curate great data for self-serve analytics with a connector to output Tableau’s data format and launched our self-service data preparation solution, Informatica Rev.
Customers can now quickly try out Informatica on Hadoop by downloading the free trials for the Big Data Edition and Vibe Data Stream that we launched in 2014. Now that Informatica supports all five of the leading Hadoop distributions, customers can build their data pipelines on Informatica with confidence that no matter how the underlying Hadoop technologies evolve, their Informatica mappings will run. Informatica provides highly scalable data processing engines that run natively in Hadoop and leverage the best of open source innovations such as YARN, MapReduce, and more. Abstracting data pipeline mappings from the underlying Hadoop technologies combined with visual tools enabling team collaboration empowers large organizations to put Hadoop into production with confidence.
As we look ahead into 2015, we have ambitious plans to continue to expand and evolve our product capabilities with enhanced productivity to help customers rapidly get more value from their data in Hadoop. Stay tuned for announcements throughout the year.
Try some of Informatica’s products for Hadoop on the Informatica Marketplace here.
Strata 2015 – Making Data Work for Everyone with Cloud Integration, Cloud Data Management and Cloud Machine Learning
Are you ready to answer “Yes” to the questions:
a) “Are you Cloud Ready?”
b) “Are you Machine Learning Ready?”
I meet with hundreds of Informatica Cloud customers and prospects every year. While they are investing in Cloud, and seeing the benefits, they also know that there is more innovation out there. They’re asking me, what’s next for Cloud? And specifically, what’s next for Informatica with regard to Cloud Data Integration and Cloud Data Management? I’ll share more about my response throughout this blog post.
The spotlight will be on Big Data and Cloud at the Strata + Hadoop World conference taking place in Silicon Valley from February 17-20 with the theme “Make Data Work”. I want to focus this blog post on two topics related to making data work and business insights:
- How existing cloud technologies, innovations and partnerships can help you get ready for the new era in cloud analytics.
- How you can make data work in new and advanced ways for every user in your company.
Today, Informatica is announcing the availability of its Cloud Integration Secure Agent on Microsoft Azure and Linux Virtual Machines as well as an Informatica Cloud Connector for Microsoft Azure Storage. Users of Azure data services such as Azure HDInsight, Azure Machine Learning and Azure Data Factory can make their data work with access to the broadest set of data sources including on-premises applications, databases, cloud applications and social data. Read more from Microsoft about their news at Strata, including their relationship with Informatica, here.
“Informatica, a leader in data integration, provides a key solution with its Cloud Integration Secure Agent on Azure,” said Joseph Sirosh, Corporate Vice President, Machine Learning, Microsoft. “Today’s companies are looking to gain a competitive advantage by deriving key business insights from their largest and most complex data sets. With this collaboration, Microsoft Azure and Informatica Cloud provide a comprehensive portfolio of data services that deliver a broad set of advanced cloud analytics use cases for businesses in every industry.”
Even more exciting is how quickly any user can deploy a broad spectrum of data services for cloud analytics projects. The fully managed cloud service for building predictive analytics solutions from Azure and the wizard-based, self-service cloud integration and data management user experience of Informatica Cloud help overcome the challenges most users have in making their data work effectively and efficiently for analytics use cases.
The new solution enables companies to bring in data from multiple sources for use in Azure data services including Azure HDInsight, Azure Machine Learning, Azure Data Factory and others – for advanced analytics.
The broad availability of Azure data services, and Azure Machine Learning in particular, is a game changer for startups and large enterprises. Startups can now access cloud-based advanced analytics with minimal cost and complexity and large businesses can use scalable cloud analytics and machine learning models to generate faster and more accurate insights from their Big Data sources.
Success in using machine learning requires not only great analytics models, but also an end-to-end cloud integration and data management capability that brings in a wide breadth of data sources, ensures that data quality and data views match the requirements for machine learning modeling, and an ease of use that facilitates speed of iteration while providing high-performance and scalable data processing.
For example, the Informatica Cloud solution on Azure is designed to deliver on these critical requirements in a complementary approach and support advanced analytics and machine learning use cases that provide customers with key business insights from their largest and most complex data sets.
Using the Informatica Cloud solution on Azure connector with Informatica Cloud Data Integration enables optimized read-write capabilities for data to blobs in Azure Storage. Customers can use Azure Storage objects as sources, lookups, and targets in data synchronization tasks and advanced mapping configuration tasks for efficient data management using Informatica’s industry leading cloud integration solution.
As Informatica fulfills the promise of “making great data ready to use” to our 5,500 customers globally, we continue to form strategic partnerships and develop next-generation solutions to stay one step ahead of the market with our Cloud offerings.
My goal in 2015 is to help each of our customers say that they are Cloud Ready! And collaborating with solutions such as Azure ensures that our joint customers are also Machine Learning Ready!
To learn more, try our free Informatica Cloud trial for Microsoft Azure data services.