It takes a village to build mainstream big data solutions. We often get so caught up in Hadoop use cases and customer successes that sometimes we don’t talk enough about the innovative partner technologies and integrations that enable our customers to put the enterprise data hub at the core of their data architecture and innovate with confidence. Cloudera and Informatica have been working together to integrate our products to enable new levels of productivity and lower deployment and production risk.
Going from Hadoop to an enterprise data hub means a number of things. It means that you recognize the business value of capturing and leveraging all your data for exploration and analytics. It means you’re ready to make the move from Hadoop pilot project to production. And it means your data is important enough that it’s worth securing and making data pipelines visible. It’s the visibility layer, and in particular the unique integration between Cloudera Navigator and Informatica, that I want to focus on in this post.
The era of big data has ushered in increased regulations in a number of industries – banking, retail, healthcare, energy – most of which deal with how data is managed throughout its lifecycle. Cloudera Navigator is the only native end-to-end solution for governance in Hadoop. It provides visibility for analysts to explore data in Hadoop, and enables administrators and managers to maintain a full audit history for HDFS, HBase, Hive, Impala, Spark and Sentry, and then run reports on data access for auditing and compliance. The integration of Informatica Metadata Manager in the Big Data Edition with Cloudera Navigator extends this level of visibility and governance beyond the enterprise data hub.
Today, only Informatica and Cloudera provide end-to-end data lineage from source systems through Hadoop, and into BI/analytic and data warehouse systems. And you can view it from a single pane within Informatica.
This is important because Hadoop, and the enterprise data hub in particular, doesn’t function in a silo. It’s an integrated part of a larger enterprise-wide data management architecture. The better the insight into where data originated, where it traveled, who had access to it and what they did with it, the greater our ability to report and audit. No other combination of technologies provides this level of audit granularity.
More than that, the visibility Cloudera and Informatica provide gives our joint customers the ability to confidently stand up an enterprise data hub as part of their production enterprise infrastructure, because they can verify the integrity of the data that undergirds their analytics. I encourage you to check out a demo of the Informatica-Cloudera Navigator integration at this link: http://infa.media/1uBpPbT
You can also check out a demo and learn a little more about Cloudera Navigator and the Informatica integration in the recorded TechTalk hosted by Informatica at this link:
Data warehousing systems remain the de facto standard for high performance reporting and business intelligence, and there is no sign that will change soon. But Hadoop now offers an opportunity to lower costs by moving infrequently used data and data preparation workloads off of the data warehouse, and to process entirely new sources of data coming from the explosion of industrial and personal devices. This is motivating interest in new concepts like the “data lake” as adjunct environments to traditional data warehousing systems.
Now, let’s be real. Between the evolutionary opportunity of preparing data more cost effectively and the revolutionary opportunity of analyzing new sources of data, the latter just sounds cooler. This revolutionary opportunity is what has spurred the growth of new roles like data scientists and new tools for self-service visualization. In the revolutionary world of pervasive analytics, data scientists have the ability to use Hadoop as a low cost and transient sandbox for data. Data scientists can perform exploratory data analysis by quickly dumping data from a variety of sources into a schema-on-read platform and by iterating dumps as new data comes in. SQL-on-Hadoop technologies like Cloudera Impala, Hortonworks Stinger, Apache Drill, and Pivotal HAWQ enable agile and iterative SQL-like queries on datasets, while new analysis tools like Tableau enable self-service visualization. We are merely in the early phases of the revolutionary opportunity of big data.
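To make that schema-on-read workflow concrete, here is a minimal sketch in PySpark (the paths and field names are hypothetical); the same exploratory query could just as easily be issued through Impala or Hive once the raw files have landed:

```python
# Minimal schema-on-read sketch: raw JSON is dumped into HDFS as-is, the
# schema is inferred at read time, and analysis proceeds with iterative
# SQL-style queries. Paths and field names below are made up for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("exploratory-sandbox").getOrCreate()

# Point at the raw landing directory; no upfront table design is required,
# and new dumps with extra fields can simply be re-read.
events = spark.read.json("hdfs:///landing/clickstream/2014/10/*.json")

# Register a temporary view and explore interactively.
events.createOrReplaceTempView("clickstream")
spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM clickstream
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""").show()
```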
But while the revolutionary opportunity is exciting, there’s an equally compelling opportunity for enterprises to modernize their existing data environment. Enterprises cannot rely on an iterative dump methodology for managing operational data pipelines. Unmanaged “data swamps” are simply impractical for business operations. For an operational data pipeline, the Hadoop environment must be a clean, consistent, and compliant system of record for serving analytical systems. Loading enterprise data into Hadoop instead of a relational data warehouse does not eliminate the need to prepare it.
Now I have a secret to share with you: nearly every enterprise adopting Hadoop today to modernize their data environment has processes, standards, tools, and people dedicated to data profiling, data cleansing, data refinement, data enrichment, and data validation. In the world of enterprise big data, schemas and metadata still matter.
I’ll share some examples with you. I attended a customer panel at Strata + Hadoop World in October. One of the participants was the analytics program lead at a large software company whose team was responsible for data preparation. He described how they ingest data from heterogeneous data sources by mandating a standardized schema for everything that lands in the Hadoop data lake. Once the data lands, his team profiles, cleans, refines, enriches, and validates the data so that business analysts have access to high quality information. Another data executive described how inbound data teams are required to convert data into Avro before storing the data in the data lake. (Avro is an emerging data format alongside other new formats like ORC, Parquet, and JSON). One data engineer from one of the largest consumer internet companies in the world described the schema review committee that had been set up to govern changes to their data schemas. The final participant was an enterprise architect from one of the world’s largest telecom providers who described how their data schema was critical for maintaining compliance with privacy requirements since data had to be masked before it could be made available to analysts.
Let me be clear – these companies are not just bringing CRM and ERP data into Hadoop. These organizations are ingesting patient sensor data, log files, event data, and clickstream data, and in every case, data preparation was the first task at hand.
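As a rough illustration of the “convert to Avro before it lands” rule mentioned above, here is a small sketch using the Apache Avro Python library (the record schema and field names are hypothetical); the point is that records violating the declared schema are rejected at write time rather than discovered downstream:

```python
# Hypothetical ingest-side schema enforcement: data may only enter the lake
# once it has been serialized against an agreed Avro schema.
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

schema = avro.schema.parse("""
{
  "type": "record",
  "name": "ClickEvent",
  "fields": [
    {"name": "user_id",   "type": "string"},
    {"name": "page",      "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "referrer",  "type": ["null", "string"], "default": null}
  ]
}
""")

writer = DataFileWriter(open("click_events.avro", "wb"), DatumWriter(), schema)
# Records that do not match the schema raise an error here, so every file
# in the lake has a known, consistent structure for downstream analysts.
writer.append({"user_id": "u-123", "page": "/pricing",
               "timestamp": 1413400000, "referrer": None})
writer.close()
```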
I recently talked to a large financial services customer who proposed a unique architecture for their Hadoop deployment. They wanted to empower line-of-business users to be creative in discovering revolutionary opportunities while also evolving their existing data environment. They decided to allow lines of business to set up sandbox data lakes on local Hadoop clusters for use by small teams of data scientists. Then, once a subset of data was profiled, cleansed, refined, enriched, and validated, it would be loaded into a larger Hadoop cluster functioning as an enterprise information lake. Unlike the sandbox data lakes, the enterprise information lake was clean, consistent, and compliant. Data stewards of the enterprise information lake could govern metadata and ensure data lineage tracking from source systems to sandbox to enterprise information lakes to destination systems. Enterprise information lakes balance the quality of a data warehouse with the cost-effective scalability of Hadoop.
Building enterprise information lakes out of data lakes is simple and fast with tools that can port data pipeline mappings from traditional architectures to Hadoop. With visual development interfaces and native execution on Hadoop, enterprises can accelerate their adoption of Hadoop for operational data pipelines.
No one described the opportunity of enterprise information lakes better at Strata + Hadoop World than a data executive from a large healthcare provider who said, “While big data is exciting, equally exciting is complete data…we are data rich and information poor today.” Schemas and metadata matter more than ever, and with the help of leading data integration and preparation tools like Informatica, enterprises have a path to unleashing information riches. To learn more, check out this Big Data Workbook.
At Western Union, a multi-billion dollar global financial services and communications company, data is recognized as a core asset. Like many other financial services firms, Western Union thrives on data, both for harvesting new business opportunities and for managing its internal operations. And like many other enterprises, Western Union isn’t just ingesting data from relational data sources. They are mining a number of new information-rich sources like clickstream data and log data. With Western Union’s scale and speed demands, the data pipeline just has to work so they can optimize the customer experience across multiple channels (e.g. retail, online, mobile, etc.) to grow the business.
Let’s level set on how important scale and speed are to Western Union. Western Union processes more than 29 financial transactions every second. Analytical performance simply can’t be the bottleneck for extracting insights from this blazing velocity of data. So to maximize the performance of their data warehouse appliance, Western Union offloaded data quality and data integration workloads onto a Cloudera Hadoop cluster. Using the Informatica Big Data Edition, Western Union capitalized on the performance and scalability of Hadoop while unleashing the productivity of their Informatica developers.
Informatica Big Data Edition enables data driven organizations to profile, parse, transform, and cleanse data on Hadoop with a simple visual development environment, prebuilt transformations, and reusable business rules. So instead of hand coding one-off scripts, developers can easily create mappings without worrying about the underlying execution platform. Raw data can be easily loaded into Hadoop using Informatica Data Replication and Informatica’s suite of PowerExchange connectors. After the data is prepared, it can be loaded into a data warehouse appliance for supporting high performance analysis. It’s a win-win solution for both data managers and data consumers. Using Hadoop and Informatica, the right workloads are processed by the right platforms so that the right people get the right data at the right time.
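To make the division of labor concrete, here is a rough hand-coded PySpark sketch of the same offload pattern, with hypothetical paths, columns, and rules. In practice a tool like Big Data Edition expresses this as a visual mapping rather than code, but the flow of work is the same: prepare and aggregate on Hadoop, then hand a compact result to the warehouse.

```python
# Sketch of the warehouse-offload pattern: heavy preparation runs on the
# Hadoop cluster, and only the refined result moves on to the appliance.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dw-offload").getOrCreate()

# 1. Raw transaction logs already replicated into HDFS.
raw = spark.read.json("hdfs:///landing/transactions/*.json")

# 2. Cleanse and standardize on Hadoop instead of inside the warehouse.
clean = (raw
         .dropDuplicates(["txn_id"])
         .filter(F.col("amount").isNotNull() & (F.col("amount") > 0))
         .withColumn("txn_date", F.to_date("txn_ts")))

# 3. Aggregate to the grain the analysts actually query.
daily = clean.groupBy("txn_date", "channel").agg(
    F.count("*").alias("txn_count"),
    F.sum("amount").alias("total_amount"))

# 4. Write a compact, warehouse-ready extract; a bulk loader (or a data
#    integration tool) picks it up from here.
daily.write.mode("overwrite").parquet("hdfs:///curated/daily_txn_summary")
```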
Using Informatica’s Big Data solutions, Western Union is transforming the economics of data delivery, enabling data consumers to create safer and more personalized experiences for Western Union’s customers. Learn how the Informatica Big Data Edition can help put Hadoop to work for you. And download a free trial to get started today!
Well, it’s been a little over a week since the Strata conference, so I thought I should give some perspective on what I learned. I think it was summed up at my first meeting, on the first morning of the conference. The meeting was with a financial services company that has significant experience with Hadoop. The first words out of their mouths were, “Hadoop is hard.”
Later in the conference, after a Western Union representative spoke about their Hadoop deployment, they were mobbed with questions and comments from end users. The audience was thrilled to hear about an actual operational deployment: not just a sandbox deployment, but a production Hadoop deployment from a company that is over 160 years old.
The market is crossing the chasm from early adopters who love to hand code (and the macho culture of proving they can do the hard stuff) to more mainstream companies that want to use technology to solve real problems. These mainstream companies aren’t afraid to admit that it is still hard. For the early adopters, nothing is ever hard. They love hard. But the mainstream market doesn’t view it that way. They don’t want to mess around in the bowels of enabling technology. They want to use the technology to solve real problems. The comment from the financial services company represents the perspective of the vast majority of organizations. It is a sign Hadoop is hitting the mainstream market.
More proof we have moved to a new phase? Cloudera announced they were going from shipping six versions a year down to just three. I have been saying for a while that we will know Hadoop is real when the distribution vendors stop shipping every two months and move to a more typical enterprise software release schedule. It isn’t that Hadoop engineering efforts have slowed down. The platform is still evolving very rapidly. It is just that real customers are telling the Hadoop suppliers that they won’t upgrade as fast, because they have real business projects running and can’t absorb constant upgrades. So for those of you who are disappointed by the “slow down,” don’t be. To me, this is news that Hadoop is reaching critical mass.
Technology is closing the gap to allow organizations to use Hadoop as a platform without having to maintain an army of Hadoop experts. That is what Informatica does for data parsing, data integration, data quality and data lineage (see the recent product announcement). In fact, the number one demo at the Informatica booth at Strata was the demonstration of “end to end” data lineage, tracing data from the original source all the way through how it was loaded and then transformed within Hadoop. This is purely an enterprise-class capability that becomes more interesting and important when you actually go into true production.
Informatica’s goal is to hide the complexity of Hadoop so companies can get on with the work of using the platform with the skills they already have in house. And from what I saw from all of the start-up companies that were doing similar things for data exploration and analytics and all the talk around the need for governance, we are finally hitting the early majority of the market. So, for those of you who still drop down to the underlying UNIX OS that powers a Mac, the rest of us will keep using the GUI. To the extent that there are “fit for purpose” GUIs on top of Hadoop, the technology will get used by a much larger market.
So congratulations Hadoop, you have officially crossed the chasm!
P.S. See me on theCUBE talking about a similar topic at: youtu.be/oC0_5u_0h2Q
Recently published research shows that “faster” is better than “slower.” The point, ladies and gentlemen, is that speed, for lack of a better word, is good. But granted, you won’t always have the need for speed. My Lamborghini is handy when I need to elude the Bakersfield fuzz on I-5, but it does nothing for my Costco trips. There, I go with capacity and haul home my 30-gallon tubs of ketchup with my Ford F150. (Note: this is a fictitious example, I don’t actually own an F150.)
But if speed is critical, as in your data streaming application, then Informatica Vibe Data Stream and the MapR Distribution including Apache™ Hadoop® are the technologies to use together. And since Vibe Data Stream works with any Hadoop distribution, my discussion here is more broadly applicable. I first discussed this topic earlier this year during my presentation at Informatica World 2014. In that talk, I also briefly described architectures that include streaming components, like the Lambda Architecture and enterprise data hubs. I recommend that any enterprise architect become familiar with these high-level architectures.
Data streaming deals with a continuous flow of data, often at a fast rate. As you might’ve suspected by now, Vibe Data Stream, based on the Informatica Ultra Messaging technology, is great for that. With its roots in high speed trading in capital markets, Ultra Messaging quickly and reliably gets high value data from point A to point B. Vibe Data Stream adds management features to make it consumable by the rest of us, beyond stock trading. Not surprisingly, Vibe Data Stream can be used anywhere you need to quickly and reliably deliver data (just don’t use it for sharing your cat photos, please), and that’s what I discussed at Informatica World. Let me discuss two examples I gave.
Large Query Support. Let’s first look at “large queries.” I don’t mean the queries you type into search engines, which are typically no more than 20 characters. I’m referring to an environment where the query is a huge block of data. For example, what if I have an image of an unidentified face, and I want to send it to a remote facial recognition service and immediately get the identity? The image would be the query, the facial recognition system could run on Hadoop for fast divide-and-conquer processing, and the result would be the person’s name. There are many similar use cases that could leverage a high speed, reliable data delivery system along with a fast processing platform to get immediate answers to a data-heavy question.
Data Warehouse Onload. For another example, we turn to our old friend the data warehouse. If you’ve been following all the industry talk about data warehouse optimization, you know pumping high speed data directly into your data warehouse is not an efficient use of your high value system. So instead, pipe your fast data streams into Hadoop, run some complex aggregations, then load that processed data into your warehouse. You might also consider offloading large processing jobs from your data warehouse onto Hadoop. As you process and aggregate that data, you create a data flow cycle where you return enriched data back to the warehouse. This gives your end users efficient analysis on comprehensive data sets.
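Here is a minimal sketch of that flow using Spark Streaming (the directory layout, record format, and batch interval are hypothetical, and any streaming delivery tool, Vibe Data Stream included, could be feeding the landing directory); the aggregation happens on Hadoop and only the summarized result is handed to the warehouse loader.

```python
# Micro-batch aggregation of fast-arriving events on Hadoop; each record is
# assumed to be a "device_id,reading" line written by the streaming layer.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stream-onload")
ssc = StreamingContext(sc, batchDuration=60)  # one micro-batch per minute

# Files written by the streaming delivery layer land here continuously.
lines = ssc.textFileStream("hdfs:///landing/sensor_events/")

# Parse each line and aggregate readings per device.
readings = lines.map(lambda l: l.split(",")) \
                .map(lambda f: (f[0], (float(f[1]), 1)))
totals = readings.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
averages = totals.mapValues(lambda s: s[0] / s[1])

# Persist each micro-batch of summaries; a scheduled job then bulk-loads
# these enriched results into the data warehouse.
averages.saveAsTextFiles("hdfs:///curated/sensor_avg/batch")

ssc.start()
ssc.awaitTermination()
```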
Hopefully this stirs up ideas on how you might deploy high speed streaming in your enterprise architecture. Expect to see many new stories of interesting streaming applications in the coming months and years, especially with the anticipated proliferation of internet-of-things and sensor data.
To learn more about Vibe Data Stream, you can find it on the Informatica Marketplace.
The Informatica Cloud team has been busy updating connectivity to Hadoop using the Cloud Connector SDK. Updated connectors are available now for Cloudera and Hortonworks and new connectivity has been added for MapR, Pivotal HD and Amazon EMR (Elastic Map Reduce).
Informatica Cloud’s Hadoop connectivity brings a new level of ease of use to Hadoop data loading and integration. Informatica Cloud provides a quick way to load data from popular on-premises data sources and apps such as SAP and Oracle E-Business, as well as SaaS apps such as Salesforce.com, NetSuite, and Workday, into Hadoop clusters for pilots and POCs. Less technical users are empowered to contribute to enterprise data lakes through the easy-to-use Informatica Cloud web user interface.
Informatica Cloud’s rich connectivity to a multitude of SaaS apps can now be leveraged with Hadoop. Data from SaaS apps for CRM, ERP and other lines of business are becoming increasingly important to enterprises. Bringing this data into Hadoop for analytics is now easier than ever.
Users of Amazon Web Services (AWS) can leverage Informatica Cloud to load data from SaaS apps and on-premises sources directly into EMR. Combined with connectivity to Amazon Redshift, Informatica Cloud can be used to move data into EMR for processing and then on to Redshift for analytics.
Self service data loading and basic integration can be done by less technical users through Informatica Cloud’s drag and drop web-based user interface. This enables more of the team to contribute to and collaborate on data lakes without having to learn Hadoop.
Bringing the cloud and Big Data together to put the potential of data to work – that’s the power of Informatica in action.
Free trials of the Informatica Cloud Connector for Hadoop are available here: http://www.informaticacloud.com/connectivity/hadoop-connector.html
Today, 80% of the effort in Big Data projects is related to extracting, transforming and loading data (ETL). Hortonworks and Informatica have teamed up so that organizations can leverage the power of Informatica Big Data Edition, use their existing skills to improve the efficiency of these operations, and make better use of their resources in a modern data architecture (MDA).
Next Generation Data Management
The Hortonworks Data Platform and Informatica BDE enable organizations to optimize their ETL workloads with long-term storage and processing at scale in Apache Hadoop. With Hortonworks and Informatica, you can:
• Leverage all internal and external data to achieve the full predictive power that drives the success of modern data-driven businesses.
• Optimize the entire big data supply chain on Hadoop, turning data into actionable information to drive business value.
Imagine a world where you would have access to your most strategic data in a timely fashion, no matter how old the data is, where it is stored, or in what format. By leveraging Hadoop’s power of distributed processing, organizations can lower the costs of data storage and processing and support large-scale data distribution with high throughput and concurrency.
Overall, the alignment between business and IT improves. The Big Data solution based on Informatica and Hortonworks allows for a complete data pipeline to ingest, parse, integrate, cleanse, and prepare data for analysis natively on Hadoop, thereby increasing developer productivity by 5x over hand-coding.
Where Do We Go From Here?
At the end of the day, Big Data is not about the technology. It is about the deep business and social transformation every organization will go through. The possibilities to make more informed decisions, identify patterns, proactively address fraud and threats, and predict pretty much anything are endless.
This transformation will happen as the technology is adopted and leveraged by more and more business users. We are already seeing the transition from 20-node clusters to 100-node clusters and from a handful of technology-savvy users relying on Hadoop to hundreds of business users. Informatica and Hortonworks are accelerating the delivery of actionable Big Data insights to business users by automating the entire data pipeline.
Try It For Yourself
On September 10, 2014, Informatica announced the availability of a 60-day trial version of the Informatica Big Data Edition inside the Hortonworks Sandbox. This free trial enables you to download and test out the Big Data Edition on your notebook or spare computer and experience your own personal Modern Data Architecture (MDA).
If you happen to be at Strata this October 2014, please meet us at our booths: Informatica #352 and Hortonworks #117. Don’t forget to participate in our Passport Program and join our session at 5:45 pm ET on Thursday, October 16, 2014.
Come and get it. For developers hungry to get their hands on Informatica on Hadoop, a downloadable free trial of Informatica Big Data Edition was launched today on the Informatica Marketplace. See for yourself the power of the killer app on Hadoop from the leader in data integration and quality.
Thanks to the generous help of our partners, the Informatica Big Data team has preinstalled the Big Data Edition inside the sandbox VMs of the two leading Hadoop distributions. This empowers Hadoop and Informatica developers to easily try the codeless, GUI driven Big Data Edition to build and execute ETL and data integration pipelines natively on Hadoop for Big Data analytics.
Informatica Big Data Edition is the most complete and powerful suite for Hadoop data pipelines and can increase productivity up to 5 times. Developers can leverage hundreds of out-of-the-box Informatica pre-built transforms and connectors for structured and unstructured data processing on Hadoop. With the Informatica Vibe Virtual Data Machine running directly on each node of the Hadoop cluster, the Big Data Edition can profile, parse, transform and cleanse data at any scale to prepare data for data science, business intelligence and operational analytics.
The Informatica Big Data Edition Trial Sandbox VMs will have a 60-day trial version of the Big Data Edition preinstalled inside a 1-node Hadoop cluster. The trials include sample data and mappings as well as getting-started documentation and videos. It is possible to try your own data with the trials, but processing is limited to the 1-node Hadoop cluster and the machine you have it running on. Any mappings you develop in the trial can be easily moved onto a production Hadoop cluster running the Big Data Edition. The Informatica Big Data Edition also supports the MapR and Pivotal Hadoop distributions; however, the trial is currently only available for Cloudera and Hortonworks.
Accelerate your ability to bring Hadoop from the sandbox into production by leveraging Informatica’s Big Data Edition. Informatica’s visual development approach means that more than one hundred thousand existing Informatica developers are now Hadoop developers without having to learn Hadoop or new hand coding techniques and languages. Informatica can help organizations easily integrate Hadoop into their enterprise data infrastructure and bring the PowerCenter data pipeline mappings running on traditional servers onto Hadoop clusters with minimal modification. Informatica Big Data Edition reduces the risk of Hadoop projects and increases agility by enabling more of your organization to interact with the data in your Hadoop cluster.
To get the Informatica Big Data Edition Trial Sandbox VMs and more information, please visit the Informatica Marketplace.
In my last blog, I talked about the dreadful experience of cleaning raw data by hand during my years as an analyst. Well, the truth is, I was not alone. At a recent data mining Meetup event in the San Francisco Bay Area, I asked a few analysts: “How much time do you spend on cleaning your data at work?” “More than 80% of my time” and “most of my days,” the analysts said, adding, “they are not fun.”
But check this out: there are over a dozen Meetup groups focused on data science and data mining here in the Bay Area where I live. Those groups put on events multiple times a month, with topics often around hot, emerging technologies such as machine learning, graph analysis, real-time analytics, new algorithms for analyzing social media data, and of course, anything Big Data. Cool BI tools and new programming models and algorithms for better analysis are a big draw for data practitioners these days.
That got me thinking… if what the analysts said to me is true, i.e., they spend 80% of their time on data prep and only the remaining 20% analyzing the data and visualizing the results (which, BTW, “is actually fun,” quoting a data analyst), then why are they drawn to events focused on the tools that can only help them 20% of the time? Why wouldn’t they want to explore technologies that can help address the dreadful 80% of the data scrubbing task they complain about?
Having been there myself, I thought perhaps a little self-reflection would help answer the question.
As a student of math, I love data and am fascinated by the good stories I can discover in it. My two-year math program in graduate school was primarily focused on learning how to build fabulous math models to simulate real events, and on using those formulas to predict the future or look for meaningful patterns.
I used BI and statistical analysis tools while at school, and continued to use them at work after I graduated. Those tools were great in that they helped me get to results and see what’s in my data, so I could develop conclusions and make recommendations based on those insights for my clients. Without BI and visualization tools, I would not have delivered any results.
That was the fun and glamorous part of my job as an analyst. But when I was not creating nice charts and presentations to tell the stories in my data, I was spending a great amount of time, sometimes up to the wee hours, cleaning and verifying my data. I was convinced that was part of my job and I just had to suck it up.
It was only a few months ago that I stumbled upon data quality software – it happened when I joined Informatica. At first I thought they were talking to the wrong person when they started pitching me data quality solutions.
Turns out, the concept of data quality automation is a highly relevant and extremely intuitive subject to me, and to anyone who deals with data on a regular basis. Data quality software offers an automated process for data cleansing that is much faster and delivers more accurate results than a manual process. To put that in math context, if a data quality tool can reduce the data cleansing effort from 80% to 40% (BTW, this is hardly a random number; some of our customers have reported much better results), that means analysts can free up 40% of their time from scrubbing data and use that time to do the things they like: playing with data in BI tools, building new models or running more scenarios, producing different views of the data, and discovering things they may not have been able to before, all with clean, trusted data. No more bored-to-death experiences; what they are left with is improved productivity, more accurate and consistent results, compelling stories about data, and most important, the freedom to focus on doing the things they like! Not too shabby, right?
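For a feel of what gets automated, here is a toy sketch of hand-rolled cleansing steps in Python with pandas (the file, column names, and rules are made up); a dedicated data quality tool layers profiling, reusable business rules, and validation reporting on top of steps like these, which is where the time savings come from.

```python
# Toy cleansing script: the kind of repetitive work an analyst would
# otherwise do cell by cell, captured once as a repeatable process.
import pandas as pd

df = pd.read_csv("survey_responses.csv")

# Standardize formats instead of fixing values one at a time.
df["email"] = df["email"].str.strip().str.lower()
df["state"] = df["state"].str.strip().str.upper()

# Drop exact duplicates and rows missing required fields.
df = df.drop_duplicates()
df = df.dropna(subset=["respondent_id", "email"])

# Validate values against a simple business rule; flag the rest for review.
valid_age = df["age"].between(18, 110)
clean, suspect = df[valid_age], df[~valid_age]

clean.to_csv("survey_clean.csv", index=False)
suspect.to_csv("survey_for_review.csv", index=False)
```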
I am excited about trying out the data quality tools we have here at Informatica. My fellow analysts, you should start looking into them too. And I will check back in soon with more stories to share.