Category Archives: Hadoop
Today, 80% of the efforts in Big Data projects are related to extracting, transforming and loading data (ETL). Hortonworks and Informatica have teamed-up to leverage the power of Informatica Big Data Edition to use their existing skills to improve the efficiency of these operations and better leverage their resources in a modern data architecture. (MDA)
Next Generation Data Management
The Hortonworks Data Platform and Informatica BDE enable organizations to optimize their ETL workloads with long-term storage and processing at scale in Apache Hadoop. With Hortonworks and Informatica, you can:
• Leverage all internal and external data to achieve the full predictive power that drives the success of modern data-driven businesses.
• Optimize the entire big data supply chain on Hadoop, turning data into actionable information to drive business value.
Imagine a world where you would have access to your most strategic data in a timely fashion, no matter how old the data is, where it is stored, or under what format. By leveraging Hadoop’s power of distributed processing, organizations can lower costs of data storage and processing and support large data distribution with high through put and concurrency.
Overall, the alignment between business and IT grows. The Big Data solution based on Informatica and Hortonworks allows for a complete data pipeline to ingest, parse, integrate, cleanse, and prepare data for analysis natively on Hadoop thereby increasing developer productivity by 5x over hand-coding.
Where Do We Go From Here?
At the end of the day, Big Data is not about the technology. It is about the deep business and social transformation every organization will go through. The possibilities to make more informed decisions, identify patterns, proactively address fraud and threats, and predict pretty much anything are endless.
This transformation will happen as the technology is adopted and leveraged by more and more business users. We are already seeing the transition from 20-node clusters to 100-node clusters and from a handful of technology-savvy users relying on Hadoop to hundreds of business users. Informatica and Hortonworks are accelerating the delivery of actionable Big Data insights to business users by automating the entire data pipeline.
Try It For Yourself
On September 10, 2014, Informatica announced the 60-day trial version of the Informatica Big Data Edition into the Hortonworks Sandbox. This free trial enables you to download and test out the Big Data Edition on your notebook or spare computer and experience your own personal Modern Data Architecture (MDA).
If you happen to be at Strata this October 2014, please meet us at our booths: Informatica #352 and Hortonworks #117. Don’t forget to participate in our Passport Program and join our session at 5:45 pm ET on Thursday, October 16, 2014.
Come and get it. For developers hungry to get their hands on Informatica on Hadoop, a downloadable free trial of Informatica Big Data Edition was launched today on the Informatica Marketplace. See for yourself the power of the killer app on Hadoop from the leader in data integration and quality.
Thanks to the generous help of our partners, the Informatica Big Data team has preinstalled the Big Data Edition inside the sandbox VMs of the two leading Hadoop distributions. This empowers Hadoop and Informatica developers to easily try the codeless, GUI driven Big Data Edition to build and execute ETL and data integration pipelines natively on Hadoop for Big Data analytics.
Informatica Big Data Edition is the most complete and powerful suite for Hadoop data pipelines and can increase productivity up to 5 times. Developers can leverage hundreds of out-of-the-box Informatica pre-built transforms and connectors for structured and unstructured data processing on Hadoop. With the Informatica Vibe Virtual Data Machine running directly on each node of the Hadoop cluster, the Big Data Edition can profile, parse, transform and cleanse data at any scale to prepare data for data science, business intelligence and operational analytics.
The Informatica Big Data Edition Trial Sandbox VMs will have a 60 day trial version of the Big Data Edition preinstalled inside a 1-node Hadoop cluster. The trials include sample data and mappings as well as getting started documentation and videos. It is possible to try your own data with the trials, but processing is limited to the 1-node Hadoop cluster and the machine you have it running on. Any mappings you develop in the trial can be easily moved on to a production Hadoop cluster running the Big Data Edition. The Informatica Big Data Edition also supports MapR and Pivotal Hadoop distributions, however, the trial is currently only available for Cloudera and Hortonworks.
Accelerate your ability to bring Hadoop from the sandbox into production by leveraging Informatica’s Big Data Edition. Informatica’s visual development approach means that more than one hundred thousand existing Informatica developers are now Hadoop developers without having to learn Hadoop or new hand coding techniques and languages. Informatica can help organizations easily integrate Hadoop into their enterprise data infrastructure and bring the PowerCenter data pipeline mappings running on traditional servers onto Hadoop clusters with minimal modification. Informatica Big Data Edition reduces the risk of Hadoop projects and increases agility by enabling more of your organization to interact with the data in your Hadoop cluster.
To get the Informatica Big Data Edition Trial Sandbox VMs and more information please visit Informatica Marketplace
This post was written by guest author Dale Kim, Director of Industry Solutions at MapR Technologies, a valued Informatica partner that provides a distribution for Apache Hadoop that ensures production success for its customers.
Apache Hadoop is growing in popularity as the foundation for an enterprise data hub. An Enterprise Data Hub (EDH) extends and optimizes the traditional data warehouse model by adding complementary big data technologies. It focuses your data warehouse on high value data by reallocating less frequently used data to an alternative platform. It also aggregates data from previously untapped sources to give you a more complete picture of data.
So you have your data, your warehouses, your analytical tools, your Informatica products, and you want to deploy an EDH… now what about Hadoop?
Requirements for Hadoop in an Enterprise Data Hub
Let’s look at characteristics required to meet your EDH needs for a production environment:
You already expect these from your existing enterprise deployments. Shouldn’t you hold Hadoop to the same standards? Let’s discuss each topic:
Enterprise-grade is about the features that keep a system running, i.e., high availability (HA), disaster recovery (DR), and data protection. HA helps a system run even when components (e.g., computers, routers, power supplies) fail. In Hadoop, this means no downtime and no data loss, but also no work loss. If a node fails, you still want jobs to run to completion. DR with remote replication or mirroring guards against site-wide disasters. Mirroring needs to be consistent to ensure recovery to a known state. Using file copy tools won’t cut it. And data protection, using snapshots, lets you recover from data corruption, especially from user or application errors. As with DR replicas, snapshots must be consistent, in that they must reflect the state of the data at the time the snapshot was taken. Not all Hadoop distributions can offer this guarantee.
Hadoop interoperability is an obvious necessity. Features like a POSIX-compliant, NFS-accessible file system let you reuse existing, file system-based applications on Hadoop data. Support for existing tools lets your developers get up to speed quickly. And integration with REST APIs enables easy, open connectivity with other systems.
You should be able to logically divide clusters to support different use cases, job types, user group, and administrators as needed. To avoid a complex, multi-cluster setup, choose a Hadoop distribution with multi-tenancy capabilities to simplify the architecture. This gives you less risk for error and no data/effort duplication.
Security should be a priority to protect against the exposure of confidential data. You should assess how you’ll handle authentication (with or without Kerberos), authorization (access controls), over-the-network encryption, and auditing. Many of these features should be native to your Hadoop distribution, and there are also strong security vendors that provide technologies for securing Hadoop.
Any large scale deployment needs fast read, write, and update capabilities. Hadoop can support the operational requirements of an EDH with integrated, in-Hadoop databases like Apache HBase™ and Accumulo™, as well as MapR-DB (the MapR NoSQL database). This in-Hadoop model helps to simplify the overall EDH architecture.
Using Hadoop as a foundation for an EDH is a powerful option for businesses. Choosing the correct Hadoop distribution is the key to deploying a successful EDH. Be sure not to take shortcuts – especially in a production environment – as you will want to hold your Hadoop platform to the same high expectations you have of your existing enterprise systems.
In my last blog, I talked about the dreadful experience of cleaning raw data by hand as a former analyst a few years back. Well, the truth is, I was not alone. At a recent data mining Meetup event in San Francisco bay area, I asked a few analysts: “How much time do you spend on cleaning your data at work?” “More than 80% of my time” and “most my days” said the analysts, and “they are not fun”.
But check this out: There are over a dozen Meetup groups focused on data science and data mining here in the bay area I live. Those groups put on events multiple times a month, with topics often around hot, emerging technologies such as machine learning, graph analysis, real-time analytics, new algorithm on analyzing social media data, and of course, anything Big Data. Cools BI tools, new programming models and algorithms for better analysis are a big draw to data practitioners these days.
That got me thinking… if what analysts said to me is true, i.e., they spent 80% of their time on data prepping and 1/4 of that time analyzing the data and visualizing the results, which BTW, “is actually fun”, quoting a data analyst, then why are they drawn to the events focused on discussing the tools that can only help them 20% of the time? Why wouldn’t they want to explore technologies that can help address the dreadful 80% of the data scrubbing task they complain about?
Having been there myself, I thought perhaps a little self-reflection would help answer the question.
As a student of math, I love data and am fascinated about good stories I can discover from them. My two-year math program in graduate school was primarily focused on learning how to build fabulous math models to simulate the real events, and use those formula to predict the future, or look for meaningful patterns.
I used BI and statistical analysis tools while at school, and continued to use them at work after I graduated. Those software were great in that they helped me get to the results and see what’s in my data, and I can develop conclusions and make recommendations based on those insights for my clients. Without BI and visualization tools, I would not have delivered any results.
That was fun and glamorous part of my job as an analyst, but when I was not creating nice charts and presentations to tell the stories in my data, I was spending time, great amount of time, sometimes up to the wee hours cleaning and verifying my data, I was convinced that was part of my job and I just had to suck it up.
It was only a few months ago that I stumbled upon data quality software – it happened when I joined Informatica. At first I thought they were talking to the wrong person when they started pitching me data quality solutions.
Turns out, the concept of data quality automation is a highly relevant and extremely intuitive subject to me, and for anyone who is dealing with data on the regular basis. Data quality software offers an automated process for data cleansing and is much faster and delivers more accurate results than manual process. To put that in math context, if a data quality tool can reduce the data cleansing effort from 80% to 40% (btw, this is hardly a random number, some of our customers have reported much better results), that means analysts can now free up 40% of their time from scrubbing data, and use that times to do the things they like – playing with data in BI tools, building new models or running more scenarios, producing different views of the data and discovering things they may not be able to before, and do all of that with clean, trusted data. No more bored to death experience, what they are left with are improved productivity, more accurate and consistent results, compelling stories about data, and most important, they can focus on doing the things they like! Not too shabby right?
I am excited about trying out the data quality tools we have here at Informtica, my fellow analysts, you should start looking into them also. And I will check back in soon with more stories to share..
I have a little fable to tell you…
This fable has nothing to do with Big Data, but instead deals with an Overabundance of Food and how to better digest it to make it useful.
And it all started when this SEO copywriter from IT Corporation walked into a bar, pub, grill, restaurant, liquor establishment, and noticed 2 large crowded tables. After what seemed like an endless loop, an SQL programmer sauntered in and contemplated the table problem. “Mind if I join you?”, he said? Since the tables were partially occupied and there were no virtual tables available, the host looked on the patio of the restaurant at 2 open tables. “Shall I do an outside join instead?” asked the programmer? The host considered their schema and assigned 2 seats to the space.
The writer told the programmer to look at the menu, bill of fare, blackboard – there were so many choices but not enough real nutrition. “Hmmm, I’m hungry for the right combination of food, grub, chow, to help me train for a triathlon” he said. With that contextual information, they thought about foregoing the menu items and instead getting in the all-you-can-eat buffer line. But there was too much food available and despite its appealing looks in its neat rows and columns, it seemed to be mostly empty calories. They both realized they had no idea what important elements were in the food, but came to the conclusion that this restaurant had a “Big Food” problem.
They scoped it out for a moment and then the writer did an about face, reversal, change in direction and the SQL programmer did a commit and quick pivot toward the buffer line where they did a batch insert of all of the food, even the BLOBS of spaghetti, mash potatoes and jello. There was far too much and it was far too rich for their tastes and needs, but they binged and consumed it all. You should have seen all the empty dishes at the end – they even caused a stack overflow. Because it was a batch binge, their digestive tracts didn’t know how to process all of the food, so they got a stomach ache from “big food” ingestion – and it nearly caused a core dump – in which case the restaurant host would have assigned his most dedicated servers to perform a thorough cleansing and scrubbing. There was no way to do a rollback at this point.
It was clear they needed relief. The programmer did an ad hoc query to JSON, their Server who they thought was Active, for a response about why they were having such “big food” indigestion, and did they have packets of relief available. No response. Then they asked again. There was still no response. So the programmer said to the writer, “Gee, the Quality Of Service here is terrible!”
Just then, the programmer remembered a remedy he had heard about previously and so he spoke up. “Oh, it’s very easy just <SELECT>Vibe.Data.Stream from INFORMATICA where REAL-TIME is NOT NULL.”
Informatica’s Vibe Data Stream enables streaming food collection for real-time Big food analytics, operational intelligence, and traditional enterprise food warehousing from a variety of distributed food sources at high scale and low latency. It enables the right food ingested at the right time when nutrition is needed without any need for binge or batch ingestion.
And so they all lived happily ever after and all was good in the IT Corporation once again.
Download Now and take your first steps to rapidly developing applications that sense and respond to streaming food (or data) in real-time.
Forget degrees from Harvard or MIT, forget NoSQL, Hadoop or OBIEE. These are all powerful tools but they will not win you the face-off. It starts with who you are (or are not), who or what you are going up against and what has happened in the past. Why should martial arts be a job requirement for Chief Data Officers? I boiled it down to three simple reasons to help you understand.
I started practicing Kendo three years ago and it surprises me every single practice how inadequate I still am, how much I can glean from my opponent to determine future behavior and how unimportant “background noise” really is. Even if I have a good day, some strike from a teenager or a retiree, who has been practicing for a decade or more, will remind me that I got only 1% better compared to last month and I have a long way to go. At our last practice, one of my Senseis told me that the higher ranks get their Ki-Ken-Tai-Ichi (alignment of spirit, sword, body) right maybe half the time. It’s a life lesson every time.
These three facts are probably also true for many one-on-one sports where adversaries study each other for more than just a couple of seconds before their next swing or shot. If you ask me, most western sports are about endurance, strength, mindset and strategy with a heavy focus on the physical aspects. Kendo is 90% strategy and mindset. That is why six and sixty-year olds alike can excel in it. It is more akin to chess with baseball bats.
You study your opponent from the second he walks up to the chair in the middle of the podium for a gentleman-like exchange of cerebral willpower but in the end you will smack him relentlessly with a bat. Everything you do is directly driven from how you feel, what you think, how your opponent moves, what your opponent feels and thinks. The goal is not to react to a hand being raised but to anticipate your opponent’s next move based on their most recent actions. By the time your eyes (and, a tenth of a second later, your brain) capture the right hand going up to strike towards you – you’ve already lost, as it is too late to react.
You are effectively analyzing core data domains and key attributes, like posture. Business data requires the same rigor and focus on the essential. There is also a tremendous amount of process (formalities like repeated bowing) and deeper meaning in everything you do; call it “Governance”.
A Chief Data Officer (CDO) needs to mind the same aspects in his or her existence.
- You are not the professional you think you are (humility)
- Someone else always knows something you don’t, so every additional bit helps to predict future actions (willingness to learn)
- How to eliminate all the noise detracting from the ultimate goal (focus)
In reality, the data problem has not been solved long ago. Something new can be learned to combat this age-old problem. The learning piece comes into play when we are willing to listen to people who have done or seen similar problems being fixed in another environment, not necessarily the same industry or department. The third is that political and technical detractors like procurement processes, M&A, new leadership or transactional volume spikes from more applications will continue to pop up. However, it is on you, the CDO, to uncover, isolate and preach that fixing a process may not always be the root cause of a business issue and as such needs to be put in perspective. As I always say “throwing bad data at a better process” just saved you a step but still renders errors, rework and bad decisions.
So what does this mean in “real” terms:
- Seek and accept opinions frequently, even if they don’t match your issue perfectly. Often a customer is a customer is a customer….admit it. Your business model may not be that special after all.
- Watch what the others do on a fundamental level, i.e. becoming data-driven organizations. These could be competitors, partners, organizations you (should) admire.
- Internalize and socialize what the core asset, goal of the organization is, which will move the needle the most. Often it will be your intelligence (speak for information or data).
I will leave you with these thoughts and invite you to sit down, cross your legs, close your eyes and get all esoteric on me, young grasshopper, but please envision what you organization should look like and how it should make its money in five years from now. Throwing more resources at new problems, ignoring core data issues and reacting when things bubble up at greater numbers will likely not cut it.
And here is where I will bow out. Take a moment and think about it; how does your take on life influence your assessment of what you encounter in your workplace?