Category Archives: Hadoop
It takes a village to build mainstream big data solutions. We often get so caught up in Hadoop use cases and customer successes that sometimes we don’t talk enough about the innovative partner technologies and integrations that enable our customers to put the enterprise data hub at the core of their data architecture and innovate with confidence. Cloudera and Informatica have been working together to integrate our products to enable new levels of productivity and lower deployment and production risk.
Going from Hadoop to an enterprise data hub means a number of things. It means that you recognize the business value of capturing and leveraging all your data for exploration and analytics. It means you’re ready to make the move from Hadoop pilot project to production. And it means your data is important enough that it’s worth securing and making data pipelines visible. It’s the visibility layer, and in particular, the unique integration between Cloudera Navigator and Informatica that I want to focus on in this post.
The era of big data has ushered in increased regulations in a number of industries – banking, retail, healthcare, energy – most of which deal with how data is managed throughout its lifecycle. Cloudera Navigator is the only native end-to-end solution for governance in Hadoop. It provides visibility for analysts to explore data in Hadoop, and enables administrators and managers to maintain a full audit history for HDFS, HBase, Hive, Impala, Spark and Sentry, then run reports on data access for auditing and compliance. The integration of Informatica Metadata Manager in the Big Data Edition and Cloudera Navigator extends this level of visibility and governance beyond the enterprise data hub.
Today, only Informatica and Cloudera provide end-to-end data lineage from source systems through Hadoop, and into BI/analytic and data warehouse systems. And you can view it from a single pane within Informatica.
This is important because Hadoop, and the enterprise data hub in particular, doesn’t function in a silo. It’s an integrated part of a larger enterprise-wide data management architecture. The better the insight into where data originated, where it traveled, who had access to it and what they did with it, the greater our ability to report and audit. No other combination of technologies provides this level of audit granularity.
But more than that, the visibility that Cloudera and Informatica provide gives our joint customers the ability to confidently stand up an enterprise data hub as part of their production enterprise infrastructure, because they can verify the integrity of the data that undergirds their analytics. I encourage you to check out a demo of the Informatica-Cloudera Navigator integration at this link: http://infa.media/1uBpPbT
You can also check out a demo and learn a little more about Cloudera Navigator and the Informatica integration in the recorded TechTalk hosted by Informatica at this link:
Data warehousing systems remain the de facto standard for high performance reporting and business intelligence, and there is no sign that will change soon. But Hadoop now offers an opportunity to lower costs by moving infrequently used data and data preparation workloads off the data warehouse, and to process entirely new sources of data coming from the explosion of industrial and personal devices. This is motivating interest in new concepts like the “data lake” as adjunct environments to traditional data warehousing systems.
Now, let’s be real. Between the evolutionary opportunity of preparing data more cost effectively and the revolutionary opportunity of analyzing new sources of data, the latter just sounds cooler. This revolutionary opportunity is what has spurred the growth of new roles like data scientists and new tools for self-service visualization. In the revolutionary world of pervasive analytics, data scientists have the ability to use Hadoop as a low cost and transient sandbox for data. Data scientists can perform exploratory data analysis by quickly dumping data from a variety of sources into a schema-on-read platform and by iterating dumps as new data comes in. SQL-on-Hadoop technologies like Cloudera Impala, Hortonworks Stinger, Apache Drill, and Pivotal HAWQ enable agile and iterative SQL-like queries on datasets, while new analysis tools like Tableau enable self-service visualization. We are merely in the early phases of the revolutionary opportunity of big data.
But while the revolutionary opportunity is exciting, there’s an equally compelling opportunity for enterprises to modernize their existing data environment. Enterprises cannot rely on an iterative dump methodology for managing operational data pipelines. Unmanaged “data swamps” are simply impractical for business operations. For an operational data pipeline, the Hadoop environment must be a clean, consistent, and compliant system of record for serving analytical systems. Loading enterprise data into Hadoop instead of a relational data warehouse does not eliminate the need to prepare it.
Now I have a secret to share with you: nearly every enterprise adopting Hadoop today to modernize their data environment has processes, standards, tools, and people dedicated to data profiling, data cleansing, data refinement, data enrichment, and data validation. In the world of enterprise big data, schemas and metadata still matter.
I’ll share some examples with you. I attended a customer panel at Strata + Hadoop World in October. One of the participants was the analytics program lead at a large software company whose team was responsible for data preparation. He described how they ingest data from heterogeneous data sources by mandating a standardized schema for everything that lands in the Hadoop data lake. Once the data lands, his team profiles, cleans, refines, enriches, and validates the data so that business analysts have access to high quality information. Another data executive described how inbound data teams are required to convert data into Avro before storing the data in the data lake. (Avro is an emerging data format alongside other new formats like ORC, Parquet, and JSON). One data engineer from one of the largest consumer internet companies in the world described the schema review committee that had been set up to govern changes to their data schemas. The final participant was an enterprise architect from one of the world’s largest telecom providers who described how their data schema was critical for maintaining compliance with privacy requirements since data had to be masked before it could be made available to analysts.
Let me be clear – these companies are not just bringing CRM and ERP data into Hadoop. These organizations are ingesting patient sensor data, log files, event data, and clickstream data, and in every case, data preparation was the first task at hand.
I recently talked to a large financial services customer who proposed a unique architecture for their Hadoop deployment. They wanted to empower line of business users to be creative in discovering revolutionary opportunities while also evolving their existing data environment. They decided to allow lines of business to set up sandbox data lakes on local Hadoop clusters for use by small teams of data scientists. Then, once a subset of data was profiled, cleansed, refined, enriched, and validated, it would be loaded into a larger Hadoop cluster functioning as an enterprise information lake. Unlike the sandbox data lakes, the enterprise information lake was clean, consistent, and compliant. Data stewards of the enterprise information lake could govern metadata and ensure data lineage tracking from source systems to sandbox to enterprise information lakes to destination systems. Enterprise information lakes balance the quality of a data warehouse with the cost-effective scalability of Hadoop.
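To make the panel's two recurring themes concrete (a mandated schema for everything that lands, and masking before analysts get access), here is a minimal Python sketch. Everything in it is an assumption for illustration: the field names, the `SCHEMA` and `MASKED_FIELDS` definitions, and the choice of a truncated SHA-256 digest as the masking step are hypothetical and not taken from any of the companies described. A truncated hash is also only a stand-in for whatever masking a real privacy program would require.

```python
import hashlib

# Hypothetical standardized schema: field name -> required Python type.
SCHEMA = {"device_id": str, "timestamp": int, "reading": float, "patient_name": str}

# Hypothetical set of fields that must be masked before analysts see them.
MASKED_FIELDS = {"patient_name"}


def validate(record):
    """Return a list of schema violations for one record (empty if valid)."""
    problems = []
    for field, expected_type in SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


def mask(record):
    """Replace sensitive values with a truncated SHA-256 digest."""
    masked = dict(record)
    for field in MASKED_FIELDS:
        digest = hashlib.sha256(str(masked[field]).encode("utf-8")).hexdigest()
        masked[field] = digest[:12]
    return masked


def prepare(records):
    """Split records into masked, valid rows and rejected rows with reasons."""
    accepted, rejected = [], []
    for record in records:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(mask(record))
    return accepted, rejected


raw = [
    {"device_id": "d-1", "timestamp": 1413460800, "reading": 98.6, "patient_name": "Jane Doe"},
    {"device_id": "d-2", "timestamp": "yesterday", "reading": 97.9, "patient_name": "John Roe"},
]
accepted, rejected = prepare(raw)
print(len(accepted), "accepted;", len(rejected), "rejected")
```

The point of the sketch is the ordering: records are checked against the schema first, and only records that pass are masked and handed on, so analysts never see either malformed or unmasked data.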
Building enterprise information lakes out of data lakes is simple and fast with tools that can port data pipeline mappings from traditional architectures to Hadoop. With visual development interfaces and native execution on Hadoop, enterprises can accelerate their adoption of Hadoop for operational data pipelines.
No one described the opportunity of enterprise information lakes better at Strata + Hadoop World than a data executive from a large healthcare provider who said, “While big data is exciting, equally exciting is complete data…we are data rich and information poor today.” Schemas and metadata still matter more than ever, and with the help of leading data integration and preparation tools like Informatica, enterprises have a path to unleashing information riches. To learn more, check out this Big Data Workbook.
I ended my previous blog wondering if awareness of Data Gravity should change our behavior. While Data Gravity adds Value to Big Data, I find that the application of the Value is underexplained.
Exponential growth of data has naturally led us to want to categorize it into facts, relationships, entities, etc. This sounds very elementary. While this happens so quickly in our subconscious minds as humans, it takes significant effort to teach this to a machine.
A friend tweeted this to me last week: I paddled out today, now I look like a lobster. Since this tweet, Twitter has inundated my friend and me with promotions from Red Lobster. It is because the machine deconstructed the tweet: paddled <PROPULSION>, today <TIME>, like <PREFERENCE> and lobster <CRUSTACEANS>. While putting these together, the machine decided that the keyword was lobster. You and I both know that my friend was not talking about lobsters.
You may think that this may be just a funny edge case. You can confuse any computer system if you try hard enough, right? Unfortunately, this isn’t an edge case. The 140-character limit has not just changed people’s tweets, it has changed how people talk on the web. More and more information is communicated in smaller and smaller amounts of language, and this trend is only going to continue.
When will the machine understand that “I look like a lobster” means I am sunburned?
I believe the reason that there are not hundreds of companies exploiting machine-learning techniques to generate a truly semantic web is the lack of weighted edges in publicly available ontologies. Keep reading, it will all make sense in about 5 sentences. Lobster and Sunscreen are 7 hops away from each other in DBpedia – way too many to draw any correlation between the two. For that matter, any article in Wikipedia is connected to any other article within about 14 hops, and that’s the extreme. Completely unrelated concepts are often just a few hops from each other.
But by analyzing massive amounts of both written and spoken English text from articles, books, social media, and television, it is possible for a machine to automatically draw a correlation and create a weighted edge between the Lobsters and Sunscreen nodes that effectively short circuits the 7 hops necessary. Many organizations are dumping massive amounts of facts without weights into our repositories of total human knowledge because they are naïvely attempting to categorize everything without realizing that the repositories of human knowledge need to mimic how humans use knowledge.
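One simple way to picture how a machine could derive such a weighted edge from text is to count how often two concepts show up in the same sentence. The Python sketch below does exactly that. The four-sentence `CORPUS` and the `NODES` vocabulary are made up for illustration, and raw co-occurrence counting is a deliberately naive stand-in for whatever statistical technique a real system would use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical mini-corpus standing in for "massive amounts" of text.
CORPUS = [
    "forgot sunscreen and now I look like a lobster",
    "red as a lobster after the beach, should have used sunscreen",
    "grilled lobster with butter for dinner",
    "sunscreen is essential at the beach",
]

# Hypothetical vocabulary of ontology nodes we care about.
NODES = {"lobster", "sunscreen", "beach", "butter", "dinner"}


def cooccurrence_weights(sentences, nodes):
    """Count how often each pair of nodes appears in the same sentence."""
    weights = Counter()
    for sentence in sentences:
        # Lowercase each word and strip trailing punctuation before matching.
        words = {w.strip(",.").lower() for w in sentence.split()}
        present = sorted(words & nodes)
        for a, b in combinations(present, 2):
            weights[(a, b)] += 1
    return weights


weights = cooccurrence_weights(CORPUS, NODES)
for (a, b), count in weights.most_common():
    print(f"{a} -- {b}: weight {count}")
```

In this toy corpus, lobster and sunscreen end up directly connected with a weight of 2, which is the kind of shortcut edge the paragraph above describes, while sunscreen and butter never co-occur and so get no edge at all.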
For example – if you hear the name Babe Ruth, what is the first thing that pops to mind? Roman Catholics from Maryland born in the 1800s or Famous Baseball Player?
If you look in Wikipedia today, he is listed under 28 categories, each of them with the same level of attachment. 1895 births | 1948 deaths | American League All-Stars | American League batting champions | American League ERA champions | American League home run champions | American League RBI champions | American people of German descent | American Roman Catholics | Babe Ruth | Baltimore Orioles (IL) players | Baseball players from Maryland | Boston Braves players | Boston Red Sox players | Brooklyn Dodgers coaches | Burials at Gate of Heaven Cemetery | Cancer deaths in New York | Deaths from esophageal cancer | Major League Baseball first base coaches | Major League Baseball left fielders | Major League Baseball pitchers | Major League Baseball players with retired numbers | Major League Baseball right fielders | National Baseball Hall of Fame inductees | New York Yankees players | Providence Grays (minor league) players | Sportspeople from Baltimore, Maryland | Vaudeville performers.
Now imagine how confused a machine would get when the distance of unweighted edges between nodes is used as a scoring mechanism for relevancy.
If I were to design an algorithm that uses weighted edges (on a scale of 1-5, with 5 being the highest), the same search would yield a much more obvious result.
1895 births | 1948 deaths | American League All-Stars | American League batting champions | American League ERA champions | American League home run champions | American League RBI champions | American people of German descent | American Roman Catholics | Babe Ruth | Baltimore Orioles (IL) players | Baseball players from Maryland | Boston Braves players | Boston Red Sox players | Brooklyn Dodgers coaches | Burials at Gate of Heaven Cemetery | Cancer deaths in New York | Deaths from esophageal cancer | Major League Baseball first base coaches | Major League Baseball left fielders | Major League Baseball pitchers | Major League Baseball players with retired numbers | Major League Baseball right fielders | National Baseball Hall of Fame inductees | New York Yankees players | Providence Grays (minor league) players | Sportspeople from Baltimore, Maryland | Vaudeville performers.
Now the machine starts to think more like a human. The above example forces us to ask ourselves about the relevancy, a.k.a. the Value, of the response. This is where I think Data Gravity becomes relevant.
You can contact me on Twitter @bigdatabeat with your comments.
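As a rough sketch of the weighted-category idea, the Python snippet below ranks a handful of the Babe Ruth categories by weight on the 1-5 scale. The weights are invented purely for illustration and are not part of the original list or of Wikipedia; the function name `top_categories` and the cutoff of 4 are likewise hypothetical.

```python
# Hypothetical weights on a 1-5 scale (5 = strongest attachment).
# These values are illustrative assumptions, not data from Wikipedia.
CATEGORY_WEIGHTS = {
    "New York Yankees players": 5,
    "National Baseball Hall of Fame inductees": 5,
    "American League home run champions": 4,
    "Boston Red Sox players": 4,
    "American Roman Catholics": 1,
    "1895 births": 1,
    "Vaudeville performers": 1,
}


def top_categories(weights, minimum=4):
    """Return categories at or above `minimum`, strongest first, ties by name."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, weight in ranked if weight >= minimum]


print(top_categories(CATEGORY_WEIGHTS))
```

With weights like these, a query for Babe Ruth surfaces the baseball categories first and pushes birth year and religion to the bottom, which is closer to what a person would answer.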
If you’ve wondered why so many companies are eager to control data storage, the answer can be summed up in a simple term: data gravity. Ultimately, where data is determines where the money is. Services and applications are nothing without it.
Dave McCrory introduced his idea of Data Gravity with a blog post back in 2010. The core idea was – and is – interesting. More recently, Data Gravity featured in this year’s EMC World keynote. But, beyond the observation that large or valuable agglomerations of data exert a pull that tends to see them grow in size or value, what is a recognition of Data Gravity actually good for?
As a concept, Data Gravity seems closely associated with current enthusiasm for Big Data. In addition, like Big Data, the term’s real-world connotations can be unhelpful almost as often as they are helpful. Big Data exhibits at least three characteristics, which are Volume, Velocity, and Variety. Various other V’s, including Value, are mentioned from time to time, but with less consistency. Yet Big Data’s name says it’s all about size, as if the speed with which data must be ingested, processed, or excreted were less important, and the complexity and diversity of the data didn’t matter either.
On its own, the size of a data set is unimportant. Coping with lots of data certainly raises some not-insignificant technical challenges, but the community is actually doing a good job of coming up with technically impressive solutions. The interesting aspect of a huge data set isn’t its size, but the very different modes of working that become possible when you begin to unpick the complex interrelationships between data elements.
Sometimes, Big Data is the vehicle by which enough data is gathered about enough aspects of enough things from enough places for those interrelationships to become observable against the background noise. Other times, Big Data is the background noise, and any hope of insight is drowned beneath the unending stream of petabytes.
To a degree, Data Gravity falls into the same trap. More gravity must be good, right? And more mass leads to more gravity. Mass must be connected to volume, in some vague way that was explained when I was 11, and which involves STP. Therefore, bigger data sets have more gravity. This means that bigger data sets are better data sets. That assertion is clearly nonsense, but luckily, it’s not actually what McCrory is suggesting. His arguments are more nuanced than that, and potentially far more useful.
Instinctively, I like that the equation attempts to move attention away from ‘the application’ toward the pools of data that support many, many applications at once. The data is where the potential lies. Applications are merely the means to unlock that potential in various ways. So maybe notions of Potential Energy from elsewhere in Physics need to figure here.
But I’m wary of the emphasis given to real numbers that are simply the underlying technology’s vital statistics; network latency, bandwidth, request sizes, numbers of requests, and the rest. I realize that these are the measurable things that we have, but feel that more abstract notions of value need to figure just as prominently.
So I’m left reaffirming my original impression that Data Gravity is “interesting”. It’s also intriguing, and I keep feeling that it should be insightful. I’m just not — yet — sure exactly how. Is a resource with a Data Gravity of 6 twice as good as a resource with a Data Gravity of 3? Does a data set with a Data Gravity of 15 require three times as much investment/infrastructure/love as a data set scoring a humble 5? It’s unlikely to be that simple, but I do look forward to seeing what happens as McCrory begins to work with the parts of our industry that can lend empirical credibility to his initial dabbling in mathematics.
If real numbers show the equations to stand up, all we then need to do is work out what the numbers mean. Should an awareness of Data Gravity change our behavior, should it validate what gut feel led us to do already, or is it just another ‘interesting’ and ultimately self-evident number that doesn’t take us anywhere?
I don’t know, but I will continue to explore. You can contact me on Twitter @bigdatabeat
At Western Union, a multi-billion dollar global financial services and communications company, data is recognized as a core asset. Like many other financial services firms, Western Union thrives on data for both harvesting new business opportunities and managing its internal operations. And like many other enterprises, Western Union isn’t just ingesting data from relational data sources. They are mining a number of new information-rich sources like clickstream data and log data. With Western Union’s scale and speed demands, the data pipeline just has to work so they can optimize customer experience across multiple channels (e.g. retail, online, mobile, etc.) to grow the business.
Let’s level set on how important scale and speed are to Western Union. Western Union processes more than 29 financial transactions every second. Analytical performance simply can’t be the bottleneck for extracting insights from this blazing velocity of data. So to maximize the performance of their data warehouse appliance, Western Union offloaded data quality and data integration workloads onto a Cloudera Hadoop cluster. Using the Informatica Big Data Edition, Western Union capitalized on the performance and scalability of Hadoop while unleashing the productivity of their Informatica developers.
Informatica Big Data Edition enables data driven organizations to profile, parse, transform, and cleanse data on Hadoop with a simple visual development environment, prebuilt transformations, and reusable business rules. So instead of hand coding one-off scripts, developers can easily create mappings without worrying about the underlying execution platform. Raw data can be easily loaded into Hadoop using Informatica Data Replication and Informatica’s suite of PowerExchange connectors. After the data is prepared, it can be loaded into a data warehouse appliance for supporting high performance analysis. It’s a win-win solution for both data managers and data consumers. Using Hadoop and Informatica, the right workloads are processed by the right platforms so that the right people get the right data at the right time.
Using Informatica’s Big Data solutions, Western Union is transforming the economics of data delivery, enabling data consumers to create safer and more personalized experiences for Western Union’s customers. Learn how the Informatica Big Data Edition can help put Hadoop to work for you. And download a free trial to get started today!
Well, it’s been a little over a week since the Strata conference so I thought I should give some perspective on what I learned. I think it was summed up at my first meeting, on the first morning of the conference. The meeting was with a financial services company that has significant experience with Hadoop. The first words out of their mouths were, “Hadoop is hard.”
Later in the conference, after a Western Union representative spoke about their Hadoop deployment, they were mobbed by end user questions and comments. The audience was thrilled to hear about an actual operational deployment: not just a sandbox deployment, but an actual operational Hadoop deployment from a company that is over 160 years old.
The market is crossing the chasm from early adopters who love to hand code (and the macho culture of proving they can do the hard stuff) to more mainstream companies that want to use technology to solve real problems. These mainstream companies aren’t afraid to admit that it is still hard. For the early adopters, nothing is ever hard. They love hard. But the mainstream market doesn’t view it that way. They don’t want to mess around in the bowels of enabling technology. They want to use the technology to solve real problems. The comment from the financial services company represents the perspective of the vast majority of organizations. It is a sign Hadoop is hitting the mainstream market.
More proof we have moved to a new phase? Cloudera announced they were going from shipping six versions a year down to just three. I have been saying for a while that we will know that Hadoop is real when the distribution vendors stop shipping every 2 months and go to a more typical enterprise software release schedule. It isn’t that Hadoop engineering efforts have slowed down. It is still evolving very rapidly. It is just that real customers are telling the Hadoop suppliers that they won’t upgrade as fast because they have real business projects running and they can’t do it. So for those of you who are disappointed by the “slow down,” don’t be. To me, this is news that Hadoop is reaching critical mass.
Technology is closing the gap to allow organizations to use Hadoop as a platform without having to actually have an army of Hadoop experts. That is what Informatica does for data parsing, data integration, data quality and data lineage (recent product announcement). In fact, the number one demo at the Informatica booth at Strata was the demonstration of “end to end” data lineage for data, going from the original source all the way to how it was loaded and then transformed within Hadoop. This is purely an enterprise-class capability that becomes more interesting and important when you actually go into true production.
Informatica’s goal is to hide the complexity of Hadoop so companies can get on with the work of using the platform with the skills they already have in house. And from what I saw from all of the start-up companies that were doing similar things for data exploration and analytics and all the talk around the need for governance, we are finally hitting the early majority of the market. So, for those of you who still drop down to the underlying UNIX OS that powers a Mac, the rest of us will keep using the GUI. To the extent that there are “fit for purpose” GUIs on top of Hadoop, the technology will get used by a much larger market.
So congratulations Hadoop, you have officially crossed the chasm!
P.S. See me on theCUBE talking about a similar topic at: youtu.be/oC0_5u_0h2Q
Recent published research shows that “faster” is better than “slower.” The point, ladies and gentlemen, is that speed, for lack of a better word, is good. But granted, you won’t always have the need for speed. My Lamborghini is handy when I need to elude the Bakersfield fuzz on I-5, but it does nothing for my Costco trips. There, I go with capacity and haul home my 30-gallon tubs of ketchup with my Ford F150. (Note: this is a fictitious example, I don’t actually own an F150.)
But if speed is critical, like in your data streaming application, then Informatica Vibe Data Stream and the MapR Distribution including Apache™ Hadoop® are the technologies to use together. But since Vibe Data Stream works with any Hadoop distribution, my discussion here is more broadly applicable. I first discussed this topic earlier this year during my presentation at Informatica World 2014. In that talk, I also briefly described architectures that include streaming components, like the Lambda Architecture and enterprise data hubs. I recommend that any enterprise architect become familiar with these high-level architectures.
Data streaming deals with a continuous flow of data, often at a fast rate. As you might’ve suspected by now, Vibe Data Stream, based on the Informatica Ultra Messaging technology, is great for that. With its roots in high speed trading in capital markets, Ultra Messaging quickly and reliably gets high value data from point A to point B. Vibe Data Stream adds management features to make it consumable by the rest of us, beyond stock trading. Not surprisingly, Vibe Data Stream can be used anywhere you need to quickly and reliably deliver data (just don’t use it for sharing your cat photos, please), and that’s what I discussed at Informatica World. Let me discuss two examples I gave.
Large Query Support. Let’s first look at “large queries.” I don’t mean the stuff you type on search engines, which are typically no more than 20 characters. I’m referring to an environment where the query is a huge block of data. For example, what if I have an image of an unidentified face, and I want to send it to a remote facial recognition service and immediately get the identity? The image would be the query, the facial recognition system could be run on Hadoop for fast divide-and-conquer processing, and the result would be the person’s name. There are many similar use cases that could leverage a high speed, reliable data delivery system along with a fast processing platform, to get immediate answers to a data-heavy question.
Data Warehouse Onload. For another example, we turn to our old friend the data warehouse. If you’ve been following all the industry talk about data warehouse optimization, you know pumping high speed data directly into your data warehouse is not an efficient use of your high value system. So instead, pipe your fast data streams into Hadoop, run some complex aggregations, then load that processed data into your warehouse. And you might consider freeing up large processing jobs from your data warehouse onto Hadoop. As you process and aggregate that data, you create a data flow cycle where you return enriched data back to the warehouse. This gives your end users efficient analysis on comprehensive data sets.
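The onload pattern described above (land the fast stream, aggregate it, then load only the summarized rows into the warehouse) can be sketched in a few lines of plain Python. This is a toy, single-process illustration under stated assumptions: the `EVENTS` tuples, the per-minute bucketing, and the `aggregate_by_minute` helper are hypothetical, and a real deployment would run the equivalent logic as a distributed job on Hadoop rather than in memory.

```python
from collections import defaultdict

# Hypothetical raw events: (sensor_id, epoch_seconds, value).
EVENTS = [
    ("s1", 0, 10.0),
    ("s1", 20, 14.0),
    ("s2", 30, 7.0),
    ("s1", 65, 20.0),
    ("s2", 70, 9.0),
    ("s2", 110, 11.0),
]


def aggregate_by_minute(events):
    """Roll raw events up to one row per (sensor, minute) with count and mean."""
    buckets = defaultdict(list)
    for sensor_id, timestamp, value in events:
        minute = timestamp // 60
        buckets[(sensor_id, minute)].append(value)
    rows = []
    for (sensor_id, minute), values in sorted(buckets.items()):
        rows.append({
            "sensor_id": sensor_id,
            "minute": minute,
            "count": len(values),
            "mean": sum(values) / len(values),
        })
    return rows


# Only these summarized rows would be loaded into the warehouse.
warehouse_rows = aggregate_by_minute(EVENTS)
for row in warehouse_rows:
    print(row)
```

Six raw events collapse into four warehouse rows here. The ratio is trivial at this scale, but the same shape of reduction is what keeps a high-value warehouse from having to absorb every raw event.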
Hopefully this stirs up ideas on how you might deploy high speed streaming in your enterprise architecture. Expect to see many new stories of interesting streaming applications in the coming months and years, especially with the anticipated proliferation of internet-of-things and sensor data.
To learn more about Vibe Data Stream you can find it on the Informatica Marketplace.
The Informatica Cloud team has been busy updating connectivity to Hadoop using the Cloud Connector SDK. Updated connectors are available now for Cloudera and Hortonworks and new connectivity has been added for MapR, Pivotal HD and Amazon EMR (Elastic MapReduce).
Informatica Cloud’s Hadoop connectivity brings a new level of ease of use to Hadoop data loading and integration. Informatica Cloud provides a quick way to load data from popular on premise data sources and apps such as SAP and Oracle E-Business, as well as SaaS apps, such as Salesforce.com, NetSuite, and Workday, into Hadoop clusters for pilots and POCs. Less technical users are empowered to contribute to enterprise data lakes through the easy-to-use Informatica Cloud web user interface.
Informatica Cloud’s rich connectivity to a multitude of SaaS apps can now be leveraged with Hadoop. Data from SaaS apps for CRM, ERP and other lines of business are becoming increasingly important to enterprises. Bringing this data into Hadoop for analytics is now easier than ever.
Users of Amazon Web Services (AWS) can leverage Informatica Cloud to load data from SaaS apps and on premise sources into EMR directly. Combined with connectivity to Amazon Redshift, Informatica Cloud can be used to move data into EMR for processing and then onto Redshift for analytics.
Self service data loading and basic integration can be done by less technical users through Informatica Cloud’s drag and drop web-based user interface. This enables more of the team to contribute to and collaborate on data lakes without having to learn Hadoop.
Bringing the cloud and Big Data together to put the potential of data to work – that’s the power of Informatica in action.
Free trials of the Informatica Cloud Connector for Hadoop are available here: http://www.informaticacloud.com/connectivity/hadoop-connector.html
Today, 80% of the efforts in Big Data projects are related to extracting, transforming and loading data (ETL). Hortonworks and Informatica have teamed up so that organizations can use the Informatica Big Data Edition, along with their existing skills, to improve the efficiency of these operations and better leverage their resources in a modern data architecture (MDA).
Next Generation Data Management
The Hortonworks Data Platform and Informatica BDE enable organizations to optimize their ETL workloads with long-term storage and processing at scale in Apache Hadoop. With Hortonworks and Informatica, you can:
• Leverage all internal and external data to achieve the full predictive power that drives the success of modern data-driven businesses.
• Optimize the entire big data supply chain on Hadoop, turning data into actionable information to drive business value.
Imagine a world where you would have access to your most strategic data in a timely fashion, no matter how old the data is, where it is stored, or in what format. By leveraging Hadoop’s power of distributed processing, organizations can lower costs of data storage and processing and support large data distribution with high throughput and concurrency.
Overall, the alignment between business and IT grows. The Big Data solution based on Informatica and Hortonworks allows for a complete data pipeline to ingest, parse, integrate, cleanse, and prepare data for analysis natively on Hadoop, thereby increasing developer productivity by 5x over hand-coding.
Where Do We Go From Here?
At the end of the day, Big Data is not about the technology. It is about the deep business and social transformation every organization will go through. The possibilities to make more informed decisions, identify patterns, proactively address fraud and threats, and predict pretty much anything are endless.
This transformation will happen as the technology is adopted and leveraged by more and more business users. We are already seeing the transition from 20-node clusters to 100-node clusters and from a handful of technology-savvy users relying on Hadoop to hundreds of business users. Informatica and Hortonworks are accelerating the delivery of actionable Big Data insights to business users by automating the entire data pipeline.
Try It For Yourself
On September 10, 2014, Informatica announced a 60-day trial version of the Informatica Big Data Edition for the Hortonworks Sandbox. This free trial enables you to download and test out the Big Data Edition on your notebook or spare computer and experience your own personal Modern Data Architecture (MDA).
If you happen to be at Strata this October 2014, please meet us at our booths: Informatica #352 and Hortonworks #117. Don’t forget to participate in our Passport Program and join our session at 5:45 pm ET on Thursday, October 16, 2014.