Tag Archives: Hadoop
2014 was a pivotal turning point for Informatica as our investments in Hadoop and efforts to innovate in big data gathered momentum and became a core part of Informatica’s business. Our Hadoop related big data revenue growth was in the ballpark of leading Hadoop startups – more than doubling over 2013.
In 2014, Informatica reached about 100 enterprise customers of our big data products with an increasing number going into production with Informatica together with Hadoop and other big data technologies. Informatica’s big data Hadoop customers include companies in financial services, insurance, telcommunications, technology, energy, life sciences, healthcare and business services. These innovative companies are leveraging Informatica to accelerate their time to production and drive greater value from their big data investments.
These customers are in-production or implementing a wide range of use cases leveraging Informatica’s great data pipeline capabilities to better put the scale, efficiency and flexibility of Hadoop to work. Many Hadoop customers start by optimizing their data warehouse environments by moving data storage, profiling, integration and cleansing to Hadoop in order to free up capacity in their traditional analytics data warehousing systems. Customers that are further along in their big data journeys have expanded to use Informatica on Hadoop for exploratory analytics of new data types, 360 degree customer analytics, fraud detection, predictive maintenance, and analysis of massive amounts of Internet of Things machine data for optimization of energy exploration, manufacturing processes, network data, security and other large scale systems initiatives.
2014 was not just a year of market momentum for Informatica, but also one of new product development innovations. We shipped enhanced functionality for entity matching and relationship building at Hadoop scale (a key part of Master Data Management), end-to-end data lineage through Hadoop, as well as high performance real-time streaming of data into Hadoop. We also launched connectors to NoSQL and analytics databases including Datastax Cassandra, MongoDB and Amazon Redshift. Informatica advanced our capabilities to curate great data for self-serve analytics with a connector to output Tableau’s data format and launched our self-service data preparation solution, Informatica Rev.
Customers can now quickly try out Informatica on Hadoop by downloading the free trials for the Big Data Edition and Vibe Data Stream that we launched in 2014. Now that Informatica supports all five of the leading Hadoop distributions, customers can build their data pipelines on Informatica with confidence that no matter how the underlying Hadoop technologies evolve, their Informatica mappings will run. Informatica provides highly scalable data processing engines that run natively in Hadoop and leverage the best of open source innovations such as YARN, MapReduce, and more. Abstracting data pipeline mappings from the underlying Hadoop technologies combined with visual tools enabling team collaboration empowers large organizations to put Hadoop into production with confidence.
As we look ahead into 2015, we have ambitious plans to continue to expand and evolve our product capabilities with enhanced productivity to help customers rapidly get more value from their data in Hadoop. Stay tuned for announcements throughout the year.
Try some of Informatica’s products for Hadoop on the Informatica Marketplace here.
As we head into Strata + Hadoop World San Jose, Pivotal has made some interesting announcements that are sure to be the talk of the show. Pivotal’s move to open-source some of their advanced products (and to form a new organization to foster Hadoop community cooperation) are signs of the dynamism and momentum of the Big Data market.
Informatica applauds these initiatives by Pivotal and we hope that they will contribute to the accelerating maturity of Hadoop and its expansion beyond early adopters into mainstream industry adoption. By contributing HAWQ, GemFire and the Greenplum Database to the open source community, Pivotal creates further open options in the evolving Hadoop data infrastructure technology. We expect this to be well received by the open source community.
As Informatica has long served as the industry’s neutral data connector for more than 5,500 customers and have developed a rich set of capabilities for Hadoop, we are also excited to see efforts to try to reduce fragmentation in the Hadoop community.
Even before the new company Pivotal was formed, Informatica had a long history working with the Greenplum team to ensure that joint customers could confidently use Informatica tools to include the Greenplum Database in their enterprise data pipelines. Informatica has mature and high-performance native connectivity to load data in and out of Greenplum reliably using Informatica’s codeless, visual data pipelining tools. In 2014, Informatica expanded out Hadoop support to include Pivotal HD Hadoop and we have joint customers using Informatica to do data profiling, transformation, parsing and cleansing using Informatica Big Data Edition running on Pivotal HD Hadoop.
We expect these innovative developments driven by Pivotal in the Big Data technology landscape to help to move the industry forward and contribute to Pivotal’s market progress. We look forward to continuing to support Pivotal technology and to an ever increasing number of successful joint customers. Please reach out to us if you have any questions about how Informatica and Pivotal can help your organization to put Big Data into production. We want to ensure that we can help you answer the question … Are you Big Data Ready?
A Data Lake is a simple concept. They are a catchment area for data entering the organization. In the past, most businesses didn’t need to organize such a data store because almost all data was internal. It traveled via traditional ETL mechanisms from transactional systems to a data warehouse and then was sprayed around the business, as required.
When a good deal of data comes from external sources, or even from internal sources like log files, which never previously made it into the data warehouse, there is a need for an “operational data store.” This has definitely become the premier application for Hadoop and it makes perfect sense to me that such technology be used for a data catchment area. The neat thing about Hadoop for this application is that:
- It scales out “as far as the eye can see,” so there’s no likelihood of it being unable to manage the data volumes even when they grow beyond the petabyte level.
- It is a key-value store, which means that you don’t need to expend much effort in modeling data when you decide to accommodate a new data source. You just define a key and define the metadata at leisure.
- The cost of the software and the storage is very low.
So let’s imagine that we have a need for a data catchment area, because we have decided to collect data from log-files, mobile devices, social networks, from public data sources, or whatever. So let us also imagine that we have implemented Hadoop and some of its useful components and we have begun to collect data.
Is it reasonable to describe this as a data lake?
A Hadoop implementation should not be a set of servers randomly placed at the confluence of various data flows. The placement needs to be carefully considered and if the implementation is to resemble a “data lake” in any way, then it must be a well-engineered man-made lake. Since the data doesn’t just sit there until it evaporates but eventually flows to various applications, we should think of this as a “data reservoir” rather than a “data lake.”
There is no point in arranging all that data neatly along the aisles because when we get it, we may not know what we want to do with it at the time we get it. We should organize the data when we know that.
Another reason we should think of this as more like a reservoir than a lake is that we might like to purify the data a little before sending it down the pipes to applications or users that want to use it.
You Say Big Dayta, I say Big Dahta
Some say Big Data is a great challenge while others say Big Data creates new opportunities. Where do you stand? For most companies concerned with their Big Data challenges, it shouldn’t be so difficult – at least on paper. Computing costs (both hardware and software) have vastly shrunk. Databases and storage techniques have become more sophisticated and scale massively, and companies such as Informatica have made connecting and integrating all the “big” and disparate data sources much easier and have helped companies achieve a sort of “big data synchronicity”. As it is.
In the process of creating solutions to Big Data problems, humans (and the supra-species known as IT Sapiens) have a tendency to use theories based on linear thinking and the scientific method. There is data as our systems know it and data as our systems don’t. The reality, in my opinion, is that “Really Big Data” problems now and in the future will have complex correlations and unintuitive relationships that need to utilize mathematical disciplines, data models and algorithms that haven’t even been discovered or invented yet and when eventually discovered, will make current database science positively primordial.
At some point in the future, machines will be able to predict, based on big, perhaps unknown data types when someone is having a bad day or a good day, or more importantly whether a person may behave in a good or bad way. Many people do this now when they take a glance at someone across a room and infer how that person is feeling or what they will do next. They see eyes that are shiny or dull, crinkles around eyes or sides of mouths, then hear the “tone” in a voice and then their neurons put it altogether that this is a person that is having a bad day and needs a hug. Quickly. No one knows exactly how the human brain does this, but it does what it does and we go with it and we are usually right.
And some day, Big Data will be able to derive this and it will be an evolution point and it will also be a big business opportunity. Through bigger and better data ingestion and integration techniques and more sophisticated math and data models, a machine will do this fast and relatively speaking, cheaply. The vast majority won’t understand why or how it’s done, but it will work and it will be fairly accurate.
And my question to you all is this.
Do you see any other alternate scenarios regarding the future of big data? Is contextual computing an important evolution and will big data integration be more or less of a problem in the future.
PS. Oh yeah, one last thing to chew on concerning Big Data… If Big Data becomes big enough, does that spell the end of modelling as we know it?
I’ve been having some interesting conversations with work colleagues recently about the Big Data hubbub and I’ve come to the conclusion that “Big Data” as hyped is neither, really. In fact, both terms are relative. “Big” 20 years ago to many may have been 1 terabyte. “Data” 20 years ago may have been Flat files, Sybase, Oracle, Informix, SQL Server or DB2 tables. Fast forward to today and “Big” is now Exabytes (or millions of terabytes). “Data” are now expanded to include events, sensors, messages, RFID, telemetry, GPS, accelerometers, magnetometers, IoT / M2M and other new and evolving data classifications.
And then there’s social and search data.
Surely you would classify Google data as really really big data – I can tell when I do a search, and get 487,464,685 answers within fractions of a second that they appear to have gotten a handle on their big data speeds and feeds. However, it’s also telling that nearly all of those bazillion results are actually not relevant to what I am searching for.
My conclusion is that if you have the right algorithms, invest in and use the right hardware and software technology and make sure to measure the pertinent data sources, harnessing big data can yield speedy &“big”results.
So what’s the rub then?
It usually boils down to having larger and more sophisticated data stores and still not understanding its structure, OR it can’t be integrated into cohesive formats, OR there is important hidden meaning in the data that we don’t have the wherewithal to derive, see or understand a la Google? So how DO you find the timely and important information out of your company’s big data (AKA the needle in the haystack)?
More to the point, how do you better ingest, integrate, parse, analyze, prepare, and cleanse your data to get the speed, but also the relevancy in a Big Data world?
Hadoop related tools are one of the current technologies of choice when it comes to solving Big Data related problems, and as an Informatica customer, you can leverage these tools, regardless of whether it’s Big Data or Not So Big Data, fast data or slow data. In fact, it actually astounds me that many IT professionals would want to go back to hand coding with a Hadoop tool just because they don’t know that the tools to do so are right under their nose, installed and running in their familiar Informatica User Interface (AND that work with Hadoop right out of the box.)
So what does your company get out of using Informatica in conjunction with Hadoop tools? Namely, better customer service and responsiveness, better operational efficiencies, more effective supply chains, better governance, service assurance, and the ability to discover previously unknown opportunities as well as stopping problems when they are an issue – not after the fact. In other words, Big Data done right can be a great advantage to many of today’s organizations.
Much more to say on this this subject as I delve into the future of Big Data. For more, see Part 2.
The current trend is that new types of data and new types of physical storage are changing all of that.
When I got back from my trip I found a TDWI white paper by Philip Russom that describes the situation very well in a white paper detailing his research on this subject; Evolving Data Warehouse Architectures in the Age of Big Data.
From an enterprise data architecture and management point of view, this is a very interesting paper.
- First the DW architectures are getting complex because of all the new physical storage options available
- Hadoop – very large scale and inexpensive
- NoSQL DBMS – beyond tabular data
- Columnar DBMS – very fast seek time
- DW Appliances – very fast / very expensive
- What is driving these changes is the rapidly-increasing complexity of data. Data volume has captured the imagination of the press, but it is really the rising complexity of the data types that is going to challenge architects.
- But, here is what really jumped out at me. When they asked the people in their survey what are the important components of their data warehouse architecture, the answer came back; Standards and rules. Specifically, they meant how data is modeled, how data quality metrics are created, metadata requirements, interfaces for data integration, etc.
The conclusion for me, from this part of the survey, was that business strategy is requiring more complex data for better analyses (example: realtime response or proactive recommendations) and business processes (example: advanced customer service). This, in turn, is driving IT to look into more advanced technology to deal with different data types and different use cases for the data. And finally, the way they are dealing with the exploding complexity was through standards, particularly data standards. If you are dealing with increasing complexity and have to do it better, faster and cheaper, they only way you are going to survive is by standardizing as much as reasonably makes sense. But, not a bit more.
If you think about it, it is good advice. Get your data standards in place first. It is the best way to manage the data and technology complexity. …And a chance to be the driver rather than the driven.
I highly recommend reading this white paper. There is far more in it than I can cover here. There is also a Philip Russom webinar on DW Architecture that I recommend.
Not so long ago, Google created a Web site to figure out just how many people had influenza. How they did this was by tracking “flu-related search queries”, “location of the query,” and applied it to an estimation algorithm. According to the website, at the flu season’s peak in January, nearly 11 percent of the United States population may have influenza. This means that nearly 44 million of us will have had the flu or flu-like symptoms. In its weekly report the Centers for Disease Control and Prevention put this at 5.6%, which means that less than 23 million of us actually went to the doctor’s office to be tested for flu or to get a flu-shot.
Now, imagine if I were a drug manufacturer. There is a theory about what went wrong. The problems may be due to widespread media coverage of this year’s flu season. Then add social media, which helped news of the flu spread quicker than the virus itself. In other words, the algorithm is looking only at the numbers, not at the context of the search results.
In today’s digitally connected world, data is everywhere: in our phones, search queries, friendships, dating profiles, cars, food, and reading habits. Almost everything we touch is part of a larger data set. The people and companies that interpret the data may fail to apply background and outside conditions to the numbers they capture.
Now, while we build our big data repositories, we have to spend some time to explain how we collected the data and under what context.
2014 was the year that Big Data went mainstream from conversations asking “What is Big Data?” to “How do we harness the power of Big Data to solve real business problems”. It seemed like everyone jumped on the Big Data band wagon from new software start-ups offering the “next generation” predictive analytic applications to traditional database, data quality, business intelligence, and data integration vendors, all calling themselves Big Data providers. The truth is, they all play a role in this Big Data movement.
Earlier in 2014, Wikibon estimated the Big Data market is currently on pace to top $50 billion in 2017, which translates to a 38% compound annual growth rate over the six year period from 2011 (the first year Wikibon sized the Big Data market) to 2017. Most of the excitement around Big Data has been around Hadoop as early adopters who experimented with open source versions quickly grew to adopt enterprise-class solutions from companies like Cloudera™, HortonWorks™, MapR™, and Amazon’s RedShift™ to address real-world business problems including: (more…)
It takes a village to build mainstream big data solutions. We often get so caught up in Hadoop use cases and customer successes that sometimes we don’t talk enough about the innovative partner technologies and integrations that enable our customers to put the enterprise data hub at the core of their data architecture and innovate with confidence. Cloudera and Informatica have been working together to integrate our products to enable new levels of productivity and lower deployment and production risk.
Going from Hadoop to an enterprise data hub, means a number of things. It means that you recognize the business value of capturing and leveraging all your data for exploration and analytics. It means you’re ready to make the move from Hadoop pilot project to production. And it means your data is important enough that it’s worth securing and making data pipelines visible. It’s the visibility layer, and in particular, the unique integration between Cloudera Navigator and Informatica that I want to focus on in this post.
The era of big data has ushered in increased regulations in a number of industries – banking, retail, healthcare, energy – most of which deal in how data is managed throughout its lifecycle. Cloudera Navigator is the only native end-to-end solution for governance in Hadoop. It provides visibility for analysts to explore data in Hadoop, and enables administrators and managers to maintain a full audit history for HDFS, HBase, Hive, Impala, Spark and Sentry then run reports on data access for auditing and compliance.The integration of Informatica Metadata Manager in the Big Data Edition and Cloudera Navigator extends this level of visibility and governance beyond the enterprise data hub.
Today, only Informatica and Cloudera provide end-to-end data lineage from source systems through Hadoop, and into BI/analytic and data warehouse systems. And you can view it from a single pane within Informatica.
This is important because Hadoop, and the enterprise data hub in particular, doesn’t function in a silo. It’s an integrated part of a larger enterprise-wide data management architecture. The better the insight into where data originated, where it traveled, who had access to it and what they did with it, the greater our ability to report and audit. No other combination of technologies provides this level of audit granularity.
But more so than that, the visibility Cloudera and Informatica provides our joint customers with the ability to confidently stand up an enterprise data hub as a part of their production enterprise infrastructure because they can verify the integrity of the data that undergirds their analytics. I encourage you to check out a demo of the Informatica-Cloudera Navigator integration at this link: http://infa.media/1uBpPbT
You can also check out a demo and learn a little more about Cloudera Navigator and the Informatica integration in the recorded TechTalk hosted by Informatica at this link:
Data warehousing systems remain the de facto standard for high performance reporting and business intelligence, and there is no sign that will change soon. But Hadoop now offers an opportunity to lower costs by transferring infrequently used data and data preparation workloads off of the data warehouse and process entirely new sources of data coming from the explosion of industrial and personal devices. This is motivating interest in new concepts like the “data lake” as adjunct environments to traditional data warehousing systems.
Now, let’s be real. Between the evolutionary opportunity of preparing data more cost effectively and the revolutionary opportunity of analyzing new sources of data, the latter just sounds cooler. This revolutionary opportunity is what has spurred the growth of new roles like data scientists and new tools for self-service visualization. In the revolutionary world of pervasive analytics, data scientists have the ability to use Hadoop as a low cost and transient sandbox for data. Data scientists can perform exploratory data analysis by quickly dumping data from a variety of sources into a schema-on-read platform and by iterating dumps as new data comes in. SQL-on-Hadoop technologies like Cloudera Impala, Hortonworks Stinger, Apache Drill, and Pivotal HAWQ enable agile and iterative SQL-like queries on datasets, while new analysis tools like Tableau enable self-service visualization. We are merely in the early phases of the revolutionary opportunity of big data.
But while the revolutionary opportunity is exciting, there’s an equally compelling opportunity for enterprises to modernize their existing data environment. Enterprises cannot rely on an iterative dump methodology for managing operational data pipelines. Unmanaged “data swamps” are simply unpractical for business operations. For an operational data pipeline, the Hadoop environment must be a clean, consistent, and compliant system of record for serving analytical systems. Loading enterprise data into Hadoop instead of a relational data warehouse does not eliminate the need to prepare it.
Now I have a secret to share with you: nearly every enterprise adopting Hadoop today to modernize their data environment has processes, standards, tools, and people dedicated to data profiling, data cleansing, data refinement, data enrichment, and data validation. In the world of enterprise big data, schemas and metadata still matter.
I’ll share some examples with you. I attended a customer panel at Strata + Hadoop World in October. One of the participants was the analytics program lead at a large software company whose team was responsible for data preparation. He described how they ingest data from heterogeneous data sources by mandating a standardized schema for everything that lands in the Hadoop data lake. Once the data lands, his team profiles, cleans, refines, enriches, and validates the data so that business analysts have access to high quality information. Another data executive described how inbound data teams are required to convert data into Avro before storing the data in the data lake. (Avro is an emerging data format alongside other new formats like ORC, Parquet, and JSON). One data engineer from one of the largest consumer internet companies in the world described the schema review committee that had been set up to govern changes to their data schemas. The final participant was an enterprise architect from one of the world’s largest telecom providers who described how their data schema was critical for maintaining compliance with privacy requirements since data had to be masked before it could be made available to analysts.
Let me be clear – these companies are not just bringing in CRM and ERP data into Hadoop. These organizations are ingesting patient sensor data, log files, event data, clickstream data, and in every case, data preparation was the first task at hand.
I recently talked to a large financial services customer who proposed a unique architecture for their Hadoop deployment. They wanted to empower line of business users to be creative in discovering revolutionary opportunities while also evolving their existing data environment. They decided to allow line of businesses to set up sandbox data lakes on local Hadoop clusters for use by small teams of data scientists. Then, once a subset of data was profiled, cleansed, refined, enriched, and validated, it would be loaded into a larger Hadoop cluster functioning as an enterprise information lake. Unlike the sandbox data lakes, the enterprise information lake was clean, consistent, and compliant. Data stewards of the enterprise information lake could govern metadata and ensure data lineage tracking from source systems to sandbox to enterprise information lakes to destination systems. Enterprise information lakes balance the quality of a data warehouse with the cost-effective scalability of Hadoop.
Building enterprise information lakes out of data lakes is simple and fast with tools that can port data pipeline mappings from traditional architectures to Hadoop. With visual development interfaces and native execution on Hadoop, enterprises can accelerate their adoption of Hadoop for operational data pipelines.
No one described the opportunity of enterprise information lakes better at Strata + Hadoop World than a data executive from a large healthcare provider who said, “While big data is exciting, equally exciting is complete data…we are data rich and information poor today.” Schemas and metadata still matter more than ever, and with the help of leading data integration and preparation tools like Informatica, enterprises have a path to unleashing information riches. To learn more, check out this Big Data Workbook