Tag Archives: Hadoop
I won’t say I’ve seen it all; I’ve only scratched the surface in the past 15 years. Below are some of the mistakes I’ve made or fixed during this time.
MongoDB as your Big Data platform
Ask yourself, why am I picking on MongoDB? The NoSQL database most abused at this point is MongoDB, while Mongo has an aggregation framework that tastes like MapReduce and even a very poorly documented Hadoop connector, its sweet spot is as an operational database, not an analytical system.
RDBMS schema as files
You dumped each table from your RDBMS into a file and stored that on HDFS, you now plan to use Hive on it. You know that Hive is slower than RDBMS; it’ll use MapReduce even for a simple select. Next, let’s look at row sizes; you have flat files measured in single-digit kilobytes.
Hadoop does best on large sets of relatively flat data. I’m sure you can create an extract that’s more de-normalized.
Instead of creating a single Data Lake, you created a series of data ponds or a data swamp. Conway’s law has struck again; your business groups have created their own mini-repositories and data analysis processes. That doesn’t sound bad at first, but with different extracts and ways of slicing and dicing the data, you end up with different views of the data, i.e., different answers for some of the same questions.
Schema-on-read doesn’t mean, “Don’t plan at all,” but it means “Don’t plan for every question you might ask.”
Missing use cases
Vendors, to escape the constraints of departmental funding, are selling the idea of the data lake. The byproduct of this is the business lost sight of real use cases. The data-lake approach can be valid, but you won’t get much out of it if you don’t have actual use cases in mind.
It isn’t hard to come up with use cases, but that is always an afterthought. The business should start thinking of the use cases when their databases can’t handle the load.
To do a larger bit of analytics, you may need a bigger tool set like that may include Hive, Pig, MapReduce, R, and more.
On March 25th, Josh Lee, Global Director for Insurance Marketing at Informatica and Cindy Maike, General Manager, Insurance at Hortonworks, will be joining the Insurance Journal in a webinar on “How to Become an Analytics Ready Insurer”.
Register for the Webinar on March 25th at 10am Pacific/ 1pm Eastern
Josh and Cindy exchange perspectives on what “analytics ready” really means for insurers, and today we are sharing some of our views (join the webinar to learn more). Josh and Cindy offer perspectives on the five questions posed here. Please join Insurance Journal, Informatica and Hortonworks on March 25th for more on this exciting topic.
See the Hortonworks site for a second posting of this blog and more details on exciting innovations in Big Data.
- What makes a big data environment attractive to an insurer?
CM: Many insurance companies are using new types of data to create innovative products that better meet their customers’ risk needs. For example, we are seeing insurance for “shared vehicles” and new products for prevention services. Much of this innovation is made possible by the rapid growth in sensor and machine data, which the industry incorporates into predictive analytics for risk assessment and claims management.
Customers who buy personal lines of insurance also expect the same type of personalized service and offers they receive from retailers and telecommunication companies. They expect carriers to have a single view of their business that permeates customer experience, claims handling, pricing and product development. Big data in Hadoop makes that single view possible.
JL: Let’s face it, insurance is all about analytics. Better analytics leads to better pricing, reduced risk and better customer service. But here’s the issue. Existing data sources are costly in storing vast amounts of data and inflexible to adapt to changing needs of innovative analytics. Imagine kicking off a simulation or modeling routine one evening only to return in the morning and find it incomplete or lacking data that requires a special request of IT.
This is where big data environments are helping insurers. Larger, more flexible data sets allowing longer series of analytics to be run, generating better results. And imagine doing all that at a fraction of the cost and time of traditional data structures. Oh, and heaven forbid you ask a mainframe to do any of this.
- So we hear a lot about Big Data being great for unstructured data. What about traditional data types that have been used in insurance forever?
CM: Traditional data types are very important to the industry – it drives our regulatory reporting and much of the performance management reporting. This data will continue to play a very important role in the insurance industry and for companies.
However, big data can now enrich that traditional data with new data sources for new insights. In areas such as customer service and product personalization, it can make the difference between cross-selling the right products to meet customer needs and losing the business. For commercial and group carriers, the new data provides the ability to better analyze risk needs, price accordingly and enable superior service in a highly competitive market.
JL: Traditional data will always be around. I doubt that I will outlive a mainframe installation at an insurer; which makes me a little sad. And for many rote tasks like financial reporting, a sales report, or a commission statement, those are sufficient. However, the business of insurance is changing in leaps and bounds. Innovators in data science are interested in correlating those traditional sources to other creative data to find new products, or areas to reduce risk. There is just a lot of data that is either ignored or locked in obscure systems that needs to be brought into the light. This data could be structured or unstructured, it doesn’t matter, and Big Data can assist there.
- How does this fit into an overall data management function?
JL: At the end of the day, a Hadoop cluster is another source of data for an insurer. More flexible, more cost effective and higher speed; but yet another data source for an insurer. So that’s one more on top of relational, cubes, content repositories, mainframes and whatever else insurers have latched onto over the years. So if it wasn’t completely obvious before, it should be now. Data needs to be managed. As data moves around the organization for consumption, it is shaped, cleaned, copied and we hope there is governance in place. And the Big Data installation is not exempt from any of these routines. In fact, one could argue that it is more critical to leverage good data management practices with Big Data not only to optimize the environment but also to eventually replace traditional data structures that just aren’t working.
CM: Insurance companies are blending new and old data and looking for the best ways to leverage “all data”. We are witnessing the development of a new generation of advanced analytical applications to take advantage of the volume, velocity, and variety in big data. We can also enhance current predictive models, enriching them with the unstructured information in claim and underwriting notes or diaries along with other external data.
There will be challenges. Insurance companies will still need to make important decisions on how to incorporate the new data into existing data governance and data management processes. The Chief Data or Chief Analytics officer will need to drive this business change in close partnership with IT.
- Tell me a little bit about how Informatica and Hortonworks are working together on this?
JL: For years Informatica has been helping our clients to realize the value in their data and analytics. And while enjoying great success in partnership with our clients, unlocking the full value of data requires new structures, new storage and something that doesn’t break the bank for our clients. So Informatica and Hortonworks are on a continuing journey to show that value in analytics comes with strong relationships between the Hadoop distribution and innovative market leading data management technology. As the relationship between Informatica and Hortonworks deepens, expect to see even more vertically relevant solutions and documented ROI for the Informatica/Hortonworks solution stack.
CM: Informatica and Hortonworks optimize the entire big data supply chain on Hadoop, turning data into actionable information to drive business value. By incorporating data management services into the data lake, companies can store and process massive amounts of data across a wide variety of channels including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field.
Matching data from internal sources (e.g. very granular data about customers) with external data (e.g. weather data or driving patterns in specific geographic areas) can unlock new revenue streams.
See this video for a discussion on unlocking those new revenue streams. Sanjay Krishnamurthi, Informatica CTO, and Shaun Connolly, Hortonworks VP of Corporate Strategy, share their perspectives.
- Do you have any additional comments on the future of data in this brave new world?
CM: My perspective is that, over time, we will drop the reference to “big” or ”small” data and get back to referring simply to “Data”. The term big data has been useful to describe the growing awareness on how the new data types can help insurance companies grow.
We can no longer use “traditional” methods to gain insights from data. Insurers need a modern data architecture to store, process and analyze data—transforming it into insight.
We will see an increase in new market entrants in the insurance industry, and existing insurance companies will improve their products and services based upon the insights they have gained from their data, regardless of whether that was “big” or “small” data.
JL: I’m sure that even now there is someone locked in their mother’s basement playing video games and trying to come up with the next data storage wave. So we have that to look forward to, and I’m sure it will be cool. But, if we are honest with ourselves, we’ll admit that we really don’t know what to do with half the data that we have. So while data storage structures are critical, the future holds even greater promise for new models, better analytical tools and applications that can make sense of all of this and point insurers in new directions. The trend that won’t change anytime soon is the ongoing need for good quality data, data ready at a moment’s notice, safe and secure and governed in a way that insurers can trust what those cool analytics show them.
Please join us for an interactive discussion on March 25th at 10am Pacific Time/ 1pm Eastern Time.
Register for the Webinar on March 25th at 10am Pacific/ 1pm Eastern
2014 was a pivotal turning point for Informatica as our investments in Hadoop and efforts to innovate in big data gathered momentum and became a core part of Informatica’s business. Our Hadoop related big data revenue growth was in the ballpark of leading Hadoop startups – more than doubling over 2013.
In 2014, Informatica reached about 100 enterprise customers of our big data products with an increasing number going into production with Informatica together with Hadoop and other big data technologies. Informatica’s big data Hadoop customers include companies in financial services, insurance, telcommunications, technology, energy, life sciences, healthcare and business services. These innovative companies are leveraging Informatica to accelerate their time to production and drive greater value from their big data investments.
These customers are in-production or implementing a wide range of use cases leveraging Informatica’s great data pipeline capabilities to better put the scale, efficiency and flexibility of Hadoop to work. Many Hadoop customers start by optimizing their data warehouse environments by moving data storage, profiling, integration and cleansing to Hadoop in order to free up capacity in their traditional analytics data warehousing systems. Customers that are further along in their big data journeys have expanded to use Informatica on Hadoop for exploratory analytics of new data types, 360 degree customer analytics, fraud detection, predictive maintenance, and analysis of massive amounts of Internet of Things machine data for optimization of energy exploration, manufacturing processes, network data, security and other large scale systems initiatives.
2014 was not just a year of market momentum for Informatica, but also one of new product development innovations. We shipped enhanced functionality for entity matching and relationship building at Hadoop scale (a key part of Master Data Management), end-to-end data lineage through Hadoop, as well as high performance real-time streaming of data into Hadoop. We also launched connectors to NoSQL and analytics databases including Datastax Cassandra, MongoDB and Amazon Redshift. Informatica advanced our capabilities to curate great data for self-serve analytics with a connector to output Tableau’s data format and launched our self-service data preparation solution, Informatica Rev.
Customers can now quickly try out Informatica on Hadoop by downloading the free trials for the Big Data Edition and Vibe Data Stream that we launched in 2014. Now that Informatica supports all five of the leading Hadoop distributions, customers can build their data pipelines on Informatica with confidence that no matter how the underlying Hadoop technologies evolve, their Informatica mappings will run. Informatica provides highly scalable data processing engines that run natively in Hadoop and leverage the best of open source innovations such as YARN, MapReduce, and more. Abstracting data pipeline mappings from the underlying Hadoop technologies combined with visual tools enabling team collaboration empowers large organizations to put Hadoop into production with confidence.
As we look ahead into 2015, we have ambitious plans to continue to expand and evolve our product capabilities with enhanced productivity to help customers rapidly get more value from their data in Hadoop. Stay tuned for announcements throughout the year.
Try some of Informatica’s products for Hadoop on the Informatica Marketplace here.
As we head into Strata + Hadoop World San Jose, Pivotal has made some interesting announcements that are sure to be the talk of the show. Pivotal’s move to open-source some of their advanced products (and to form a new organization to foster Hadoop community cooperation) are signs of the dynamism and momentum of the Big Data market.
Informatica applauds these initiatives by Pivotal and we hope that they will contribute to the accelerating maturity of Hadoop and its expansion beyond early adopters into mainstream industry adoption. By contributing HAWQ, GemFire and the Greenplum Database to the open source community, Pivotal creates further open options in the evolving Hadoop data infrastructure technology. We expect this to be well received by the open source community.
As Informatica has long served as the industry’s neutral data connector for more than 5,500 customers and have developed a rich set of capabilities for Hadoop, we are also excited to see efforts to try to reduce fragmentation in the Hadoop community.
Even before the new company Pivotal was formed, Informatica had a long history working with the Greenplum team to ensure that joint customers could confidently use Informatica tools to include the Greenplum Database in their enterprise data pipelines. Informatica has mature and high-performance native connectivity to load data in and out of Greenplum reliably using Informatica’s codeless, visual data pipelining tools. In 2014, Informatica expanded out Hadoop support to include Pivotal HD Hadoop and we have joint customers using Informatica to do data profiling, transformation, parsing and cleansing using Informatica Big Data Edition running on Pivotal HD Hadoop.
We expect these innovative developments driven by Pivotal in the Big Data technology landscape to help to move the industry forward and contribute to Pivotal’s market progress. We look forward to continuing to support Pivotal technology and to an ever increasing number of successful joint customers. Please reach out to us if you have any questions about how Informatica and Pivotal can help your organization to put Big Data into production. We want to ensure that we can help you answer the question … Are you Big Data Ready?
A Data Lake is a simple concept. They are a catchment area for data entering the organization. In the past, most businesses didn’t need to organize such a data store because almost all data was internal. It traveled via traditional ETL mechanisms from transactional systems to a data warehouse and then was sprayed around the business, as required.
When a good deal of data comes from external sources, or even from internal sources like log files, which never previously made it into the data warehouse, there is a need for an “operational data store.” This has definitely become the premier application for Hadoop and it makes perfect sense to me that such technology be used for a data catchment area. The neat thing about Hadoop for this application is that:
- It scales out “as far as the eye can see,” so there’s no likelihood of it being unable to manage the data volumes even when they grow beyond the petabyte level.
- It is a key-value store, which means that you don’t need to expend much effort in modeling data when you decide to accommodate a new data source. You just define a key and define the metadata at leisure.
- The cost of the software and the storage is very low.
So let’s imagine that we have a need for a data catchment area, because we have decided to collect data from log-files, mobile devices, social networks, from public data sources, or whatever. So let us also imagine that we have implemented Hadoop and some of its useful components and we have begun to collect data.
Is it reasonable to describe this as a data lake?
A Hadoop implementation should not be a set of servers randomly placed at the confluence of various data flows. The placement needs to be carefully considered and if the implementation is to resemble a “data lake” in any way, then it must be a well-engineered man-made lake. Since the data doesn’t just sit there until it evaporates but eventually flows to various applications, we should think of this as a “data reservoir” rather than a “data lake.”
There is no point in arranging all that data neatly along the aisles because when we get it, we may not know what we want to do with it at the time we get it. We should organize the data when we know that.
Another reason we should think of this as more like a reservoir than a lake is that we might like to purify the data a little before sending it down the pipes to applications or users that want to use it.
You Say Big Dayta, I say Big Dahta
Some say Big Data is a great challenge while others say Big Data creates new opportunities. Where do you stand? For most companies concerned with their Big Data challenges, it shouldn’t be so difficult – at least on paper. Computing costs (both hardware and software) have vastly shrunk. Databases and storage techniques have become more sophisticated and scale massively, and companies such as Informatica have made connecting and integrating all the “big” and disparate data sources much easier and have helped companies achieve a sort of “big data synchronicity”. As it is.
In the process of creating solutions to Big Data problems, humans (and the supra-species known as IT Sapiens) have a tendency to use theories based on linear thinking and the scientific method. There is data as our systems know it and data as our systems don’t. The reality, in my opinion, is that “Really Big Data” problems now and in the future will have complex correlations and unintuitive relationships that need to utilize mathematical disciplines, data models and algorithms that haven’t even been discovered or invented yet and when eventually discovered, will make current database science positively primordial.
At some point in the future, machines will be able to predict, based on big, perhaps unknown data types when someone is having a bad day or a good day, or more importantly whether a person may behave in a good or bad way. Many people do this now when they take a glance at someone across a room and infer how that person is feeling or what they will do next. They see eyes that are shiny or dull, crinkles around eyes or sides of mouths, then hear the “tone” in a voice and then their neurons put it altogether that this is a person that is having a bad day and needs a hug. Quickly. No one knows exactly how the human brain does this, but it does what it does and we go with it and we are usually right.
And some day, Big Data will be able to derive this and it will be an evolution point and it will also be a big business opportunity. Through bigger and better data ingestion and integration techniques and more sophisticated math and data models, a machine will do this fast and relatively speaking, cheaply. The vast majority won’t understand why or how it’s done, but it will work and it will be fairly accurate.
And my question to you all is this.
Do you see any other alternate scenarios regarding the future of big data? Is contextual computing an important evolution and will big data integration be more or less of a problem in the future.
PS. Oh yeah, one last thing to chew on concerning Big Data… If Big Data becomes big enough, does that spell the end of modelling as we know it?
I’ve been having some interesting conversations with work colleagues recently about the Big Data hubbub and I’ve come to the conclusion that “Big Data” as hyped is neither, really. In fact, both terms are relative. “Big” 20 years ago to many may have been 1 terabyte. “Data” 20 years ago may have been Flat files, Sybase, Oracle, Informix, SQL Server or DB2 tables. Fast forward to today and “Big” is now Exabytes (or millions of terabytes). “Data” are now expanded to include events, sensors, messages, RFID, telemetry, GPS, accelerometers, magnetometers, IoT / M2M and other new and evolving data classifications.
And then there’s social and search data.
Surely you would classify Google data as really really big data – I can tell when I do a search, and get 487,464,685 answers within fractions of a second that they appear to have gotten a handle on their big data speeds and feeds. However, it’s also telling that nearly all of those bazillion results are actually not relevant to what I am searching for.
My conclusion is that if you have the right algorithms, invest in and use the right hardware and software technology and make sure to measure the pertinent data sources, harnessing big data can yield speedy &“big”results.
So what’s the rub then?
It usually boils down to having larger and more sophisticated data stores and still not understanding its structure, OR it can’t be integrated into cohesive formats, OR there is important hidden meaning in the data that we don’t have the wherewithal to derive, see or understand a la Google? So how DO you find the timely and important information out of your company’s big data (AKA the needle in the haystack)?
More to the point, how do you better ingest, integrate, parse, analyze, prepare, and cleanse your data to get the speed, but also the relevancy in a Big Data world?
Hadoop related tools are one of the current technologies of choice when it comes to solving Big Data related problems, and as an Informatica customer, you can leverage these tools, regardless of whether it’s Big Data or Not So Big Data, fast data or slow data. In fact, it actually astounds me that many IT professionals would want to go back to hand coding with a Hadoop tool just because they don’t know that the tools to do so are right under their nose, installed and running in their familiar Informatica User Interface (AND that work with Hadoop right out of the box.)
So what does your company get out of using Informatica in conjunction with Hadoop tools? Namely, better customer service and responsiveness, better operational efficiencies, more effective supply chains, better governance, service assurance, and the ability to discover previously unknown opportunities as well as stopping problems when they are an issue – not after the fact. In other words, Big Data done right can be a great advantage to many of today’s organizations.
Much more to say on this this subject as I delve into the future of Big Data. For more, see Part 2.
The current trend is that new types of data and new types of physical storage are changing all of that.
When I got back from my trip I found a TDWI white paper by Philip Russom that describes the situation very well in a white paper detailing his research on this subject; Evolving Data Warehouse Architectures in the Age of Big Data.
From an enterprise data architecture and management point of view, this is a very interesting paper.
- First the DW architectures are getting complex because of all the new physical storage options available
- Hadoop – very large scale and inexpensive
- NoSQL DBMS – beyond tabular data
- Columnar DBMS – very fast seek time
- DW Appliances – very fast / very expensive
- What is driving these changes is the rapidly-increasing complexity of data. Data volume has captured the imagination of the press, but it is really the rising complexity of the data types that is going to challenge architects.
- But, here is what really jumped out at me. When they asked the people in their survey what are the important components of their data warehouse architecture, the answer came back; Standards and rules. Specifically, they meant how data is modeled, how data quality metrics are created, metadata requirements, interfaces for data integration, etc.
The conclusion for me, from this part of the survey, was that business strategy is requiring more complex data for better analyses (example: realtime response or proactive recommendations) and business processes (example: advanced customer service). This, in turn, is driving IT to look into more advanced technology to deal with different data types and different use cases for the data. And finally, the way they are dealing with the exploding complexity was through standards, particularly data standards. If you are dealing with increasing complexity and have to do it better, faster and cheaper, they only way you are going to survive is by standardizing as much as reasonably makes sense. But, not a bit more.
If you think about it, it is good advice. Get your data standards in place first. It is the best way to manage the data and technology complexity. …And a chance to be the driver rather than the driven.
I highly recommend reading this white paper. There is far more in it than I can cover here. There is also a Philip Russom webinar on DW Architecture that I recommend.
Not so long ago, Google created a Web site to figure out just how many people had influenza. How they did this was by tracking “flu-related search queries”, “location of the query,” and applied it to an estimation algorithm. According to the website, at the flu season’s peak in January, nearly 11 percent of the United States population may have influenza. This means that nearly 44 million of us will have had the flu or flu-like symptoms. In its weekly report the Centers for Disease Control and Prevention put this at 5.6%, which means that less than 23 million of us actually went to the doctor’s office to be tested for flu or to get a flu-shot.
Now, imagine if I were a drug manufacturer. There is a theory about what went wrong. The problems may be due to widespread media coverage of this year’s flu season. Then add social media, which helped news of the flu spread quicker than the virus itself. In other words, the algorithm is looking only at the numbers, not at the context of the search results.
In today’s digitally connected world, data is everywhere: in our phones, search queries, friendships, dating profiles, cars, food, and reading habits. Almost everything we touch is part of a larger data set. The people and companies that interpret the data may fail to apply background and outside conditions to the numbers they capture.
Now, while we build our big data repositories, we have to spend some time to explain how we collected the data and under what context.
2014 was the year that Big Data went mainstream from conversations asking “What is Big Data?” to “How do we harness the power of Big Data to solve real business problems”. It seemed like everyone jumped on the Big Data band wagon from new software start-ups offering the “next generation” predictive analytic applications to traditional database, data quality, business intelligence, and data integration vendors, all calling themselves Big Data providers. The truth is, they all play a role in this Big Data movement.
Earlier in 2014, Wikibon estimated the Big Data market is currently on pace to top $50 billion in 2017, which translates to a 38% compound annual growth rate over the six year period from 2011 (the first year Wikibon sized the Big Data market) to 2017. Most of the excitement around Big Data has been around Hadoop as early adopters who experimented with open source versions quickly grew to adopt enterprise-class solutions from companies like Cloudera™, HortonWorks™, MapR™, and Amazon’s RedShift™ to address real-world business problems including: (more…)