Category Archives: Hadoop

Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users.

“It’s not you, it’s me!” – says Data Quality to Big Data

I couldn’t help but start this blog with George Costanza’s “You are giving me the – It’s not you, it’s me! – routine? I invented – It’s not you, it’s me …”

The thing that resonates today, in the odd context of big data, is that we may all need to look in the mirror, hold a thumb drive full of information in our hands, and concede once and for all: it’s not the data… it’s us.

Many organizations have a hard time making something useful from the ever-expanding universe of big-data, but the problem doesn’t lie with the data: It’s a people problem.

The contention is that big-data is falling short of the hype because people are:

  1. too unwilling to create cultures that value standardized, efficient, and repeatable information, and
  2. too complex to be reduced to “thin data” created from digital traces.

Evan Stubbs describes poor data quality as the data analyst’s single greatest problem.


About the only satisfying thing about having bad data is the schadenfreude that goes along with it. There’s cold solace in knowing that regardless of how poor your data is, everyone else’s is equally as bad. The thing is poor quality data doesn’t just appear from the ether. It’s created. Leave the dirty dishes for long enough and you’ll end up with cockroaches and cholera. Ignore data quality and eventually you’ll have black holes of untrustworthy information. Here’s the hard truth: we’re the reason bad data exists.


I will tell you that most data teams make “large efforts” to scrub their data. Those infrequent big cleanups, however, only treat the symptom, not the cause – and ultimately lead to inefficiency, cost, and even more frustration.

It’s intuitive and natural to think that data quality is a technological problem. It’s not; it’s a cultural problem. The real answer is that you need to create a culture that values standardized, efficient, and repeatable information.

If you do that, then you’ll be able to create data that is re-usable, efficient, and high quality. Rather than trying to manage a shanty of half-baked source tables, effective teams put the effort into designing, maintaining, and documenting their data. Instead of being a one-off activity, it becomes part of business as usual, something that’s simply part of daily life.

However, even if that data is the best it can possibly be, is it even capable of delivering on the big-data promise of greater insights about things like the habits, needs, and desires of customers?

Despite the enormous growth of data and the success of a few companies like Amazon and Netflix, “the reality is that deeper insights for most organizations remain elusive,” write Mikkel Rasmussen and Christian Madsbjerg in a Bloomberg Businessweek blog post that argues “big-data gets people wrong.”


Big-data delivers thin data. In the social sciences, we distinguish between two types of human behavior data. The first – thin data – is from digital traces: He wears a size 8, has blue eyes, and drinks pinot noir. The second – rich data – delivers an understanding of how people actually experience the world: He could smell the grass after the rain, he looked at her in that special way, and the new running shoes made him look faster. Big-data focuses solely on correlation, paying no attention to causality. What good is thin “information” when there is no insight into what your consumers actually think and feel?


Accenture reported only 20 percent of the companies it profiled had found a proven causal link between “what they measure and the outcomes they are intending to drive.”

Now, I contend that the key to transforming big data into strategic value is critical thinking skills.

Where do we get such skills? People, it seems, are both the problem and the solution. Are we failing on two fronts: failing to create the right data-driven cultures, and failing to interpret the data we collect?

Twitter @bigdatabeat


There are Three Kinds of Lies: Lies, Damned lies, and Data

The phrase Benjamin Disraeli used in the 19th century was: There are three kinds of lies: lies, damned lies, and statistics.

Not so long ago, Google created a website to figure out just how many people had influenza. It did this by tracking flu-related search queries and the location of each query, then feeding them into an estimation algorithm. According to the site, at the flu season’s peak in January, nearly 11 percent of the United States population may have had influenza – meaning nearly 44 million of us would have had the flu or flu-like symptoms. In its weekly report, the Centers for Disease Control and Prevention put the figure at 5.6 percent, which means that fewer than 23 million of us actually went to the doctor’s office to be tested for flu or to get a flu shot.

Now, imagine if I were a drug manufacturer relying on that estimate. There is a theory about what went wrong: widespread media coverage of this year’s flu season, amplified by social media, helped news of the flu spread quicker than the virus itself. In other words, the algorithm was looking only at the numbers, not at the context of the search results.
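As a minimal sketch of that kind of context-free scaling – the function and the numbers below are illustrative assumptions, not Google’s actual model:

```python
# Hypothetical query-volume flu estimator (illustrative only, not Google's model).
# It scales search activity against a historical baseline with no notion of context,
# which is exactly why a media-driven spike in searches inflates the estimate.

def estimate_flu_prevalence(flu_query_share, baseline_query_share, baseline_prevalence):
    """Scale the observed share of flu-related searches against a baseline week."""
    if baseline_query_share <= 0:
        raise ValueError("baseline share must be positive")
    return baseline_prevalence * (flu_query_share / baseline_query_share)

# If flu-related search share doubles versus the baseline week, the estimate doubles,
# whether the cause is more sick people or simply more news coverage.
print(estimate_flu_prevalence(flu_query_share=0.012,
                              baseline_query_share=0.006,
                              baseline_prevalence=0.055))  # -> ~0.11, i.e. ~11%
```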

In today’s digitally connected world, data is everywhere: in our phones, search queries, friendships, dating profiles, cars, food, and reading habits. Almost everything we touch is part of a larger data set. The people and companies that interpret the data may fail to apply background and outside conditions to the numbers they capture.

Now, while we build our big data repositories, we have to spend some time explaining how we collected the data and in what context.

Twitter @bigdatabeat


Enterprise Architecture for 8th Graders

Communicating a complex topic like Enterprise Architecture in a simple way that avoids technical terms and buzzwords is not easy. This blog builds on earlier articles by Rob Karel that explains Big Data and Metadata to Mom and by John Schmidt that explains Integration to kids, by focusing on Enterprise Architecture for an audience of 8th graders. The first thing I had to do was learn about the vocabulary and communication style of teenagers, so an internet article on “How 14-year-old teenagers communicate” was helpful. My findings, although not related to the topic of Enterprise Architecture, were revealing.

The article provided the following recommendations when talking to teenagers:

1. Talk with them and not at them.
2. Ask questions that go beyond “yes” or “no” answers to prompt more developed conversation.
3. Take advantage of time during car trips to talk with your teen.
4. Make time for sporting and school events, playing games, and talking about current events.

Let’s see how these recommendations could be applied:

1. Talk with them and not at them

Ask your teenager the following question – Have you ever heard about Enterprise Architecture?

“It is a way to help a company understand its customers and how its products are made and sold.  It helps managers improve the way the company works and how technology is used to help people do a better job”.

2. Ask questions that go beyond “yes” or “no” answers to prompt more developed conversation.

Ask your teenager the following question: What do you want to do when you grow up? Depending on the answer, you may need to customize the text below.

“Enterprise Architecture helps you understand the needs of (the industry selected by your teenager).  It will then tell you the typical activities that employees do and the systems and technologies that are used to simplify those activities”.

3. Take advantage of time during car trips to talk with your teen.

Imagine the following scenario with your teenager. Let’s assume your teenager wants to be part of an advertising agency serving the entertainment industry.

“We should count the number of billboards on the side of the road and note how many are movie advertisements. I am interested in your opinion of which advertising style sparks your interest in a specific movie”.

“If we do that we can then discuss the different activities that are required in making that advertising material and how to make the images speak to you. Enterprise Architecture also does that. It helps you understand the activities required in any business, step by step, allowing you to create templates or graphics that represent any industry”.

4. Make time for sports and school events, play games, and talk about current events.

Another sample conversation to support this recommendation could be:

“Let’s go to the movies this week. Once you select one you would like to see, see if you can identify why you chose this particular movie over the other ones. If you think of the billboards that we saw, can you remember what motivated or influenced you? We can understand together what the designers of those images were creating in the visual experience. Perhaps you have new ideas or suggestions on how they could have done it better? With the help of Enterprise Architecture, companies identify more efficient activities to generate more business”.

After researching the topic I realized that we could apply these recommendations to share Enterprise Architecture with our business partners. Perhaps the readers of this blog can help by using these recommendations with their teenagers at home to explain the basic concepts of Enterprise Architecture and collectively create a simpler way to talk about Enterprise Architecture.


How to Get the Biggest Returns from Your Hadoop and Big Data Investments in 2015

2014 was the year that Big Data went mainstream, with conversations shifting from “What is Big Data?” to “How do we harness the power of Big Data to solve real business problems?” It seemed like everyone jumped on the Big Data bandwagon, from new software start-ups offering “next generation” predictive analytic applications to traditional database, data quality, business intelligence, and data integration vendors, all calling themselves Big Data providers. The truth is, they all play a role in this Big Data movement.

Earlier in 2014, Wikibon estimated that the Big Data market is on pace to top $50 billion in 2017, which translates to a 38% compound annual growth rate over the six-year period from 2011 (the first year Wikibon sized the Big Data market) to 2017. Most of the excitement around Big Data has centered on Hadoop, as early adopters who experimented with open source versions quickly grew to adopt enterprise-class solutions from companies like Cloudera™, Hortonworks™, MapR™, and Amazon’s Redshift™ to address real-world business problems, including: (more…)


Data Visibility From the Source to Hadoop and Beyond with Cloudera and Informatica Integration

This is a guest post by Amr Awadallah, Founder, CTO at Cloudera, Inc.

It takes a village to build mainstream big data solutions. We often get so caught up in Hadoop use cases and customer successes that sometimes we don’t talk enough about the innovative partner technologies and integrations that enable our customers to put the enterprise data hub at the core of their data architecture and innovate with confidence. Cloudera and Informatica have been working together to integrate our products to enable new levels of productivity and lower deployment and production risk.

Going from Hadoop to an enterprise data hub means a number of things. It means that you recognize the business value of capturing and leveraging all your data for exploration and analytics. It means you’re ready to make the move from Hadoop pilot project to production. And it means your data is important enough that it’s worth securing and making data pipelines visible. It’s the visibility layer, and in particular the unique integration between Cloudera Navigator and Informatica, that I want to focus on in this post.

The era of big data has ushered in increased regulations in a number of industries – banking, retail, healthcare, energy – most of which deal with how data is managed throughout its lifecycle. Cloudera Navigator is the only native end-to-end solution for governance in Hadoop. It provides visibility for analysts to explore data in Hadoop, and enables administrators and managers to maintain a full audit history for HDFS, HBase, Hive, Impala, Spark, and Sentry, then run reports on data access for auditing and compliance. The integration of Informatica Metadata Manager in the Big Data Edition and Cloudera Navigator extends this level of visibility and governance beyond the enterprise data hub.

Today, only Informatica and Cloudera provide end-to-end data lineage from source systems through Hadoop, and into BI/analytic and data warehouse systems. And you can view it from a single pane within Informatica.

This is important because Hadoop, and the enterprise data hub in particular, doesn’t function in a silo. It’s an integrated part of a larger enterprise-wide data management architecture. The better the insight into where data originated, where it traveled, who had access to it and what they did with it, the greater our ability to report and audit. No other combination of technologies provides this level of audit granularity.

But more than that, the visibility Cloudera and Informatica provide gives our joint customers the ability to confidently stand up an enterprise data hub as part of their production enterprise infrastructure, because they can verify the integrity of the data that undergirds their analytics. I encourage you to check out a demo of the Informatica-Cloudera Navigator integration at this link: http://infa.media/1uBpPbT

You can also check out a demo and learn a little more about Cloudera Navigator  and the Informatica integration in the recorded  TechTalk hosted by Informatica at this link:

http://www.informatica.com/us/company/informatica-talks/?commid=133311


Pour Some Schema On Me: The Secret Behind Every Enterprise Information Lake

Has there ever been a more exciting time in the world of data management?  With exponentially faster computing resources and exponentially cheaper storage, emerging frameworks like Hadoop are introducing new ways to capture, process, and analyze data.  Enterprises can leverage these new capabilities to become more efficient, competitive, and responsive to their customers.

Data warehousing systems remain the de facto standard for high performance reporting and business intelligence, and there is no sign that will change soon.  But Hadoop now offers an opportunity to lower costs by transferring infrequently used data and data preparation workloads off of the data warehouse, and to process entirely new sources of data coming from the explosion of industrial and personal devices.  This is motivating interest in new concepts like the “data lake” as adjunct environments to traditional data warehousing systems.

Now, let’s be real.  Between the evolutionary opportunity of preparing data more cost effectively and the revolutionary opportunity of analyzing new sources of data, the latter just sounds cooler.  This revolutionary opportunity is what has spurred the growth of new roles like data scientists and new tools for self-service visualization.  In the revolutionary world of pervasive analytics, data scientists have the ability to use Hadoop as a low cost and transient sandbox for data.  Data scientists can perform exploratory data analysis by quickly dumping data from a variety of sources into a schema-on-read platform and by iterating dumps as new data comes in.  SQL-on-Hadoop technologies like Cloudera Impala, Hortonworks Stinger, Apache Drill, and Pivotal HAWQ enable agile and iterative SQL-like queries on datasets, while new analysis tools like Tableau enable self-service visualization.  We are merely in the early phases of the revolutionary opportunity of big data.
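As a rough sketch of what that schema-on-read, iterative exploration looks like – here using PySpark SQL as a stand-in for the SQL-on-Hadoop engines named above, with hypothetical paths and field names:

```python
# Minimal schema-on-read sketch (PySpark; paths and fields are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-sandbox").getOrCreate()

# No upfront schema: structure is inferred when the raw JSON dump is read,
# so new data can simply be dumped into the sandbox and re-read on each iteration.
events = spark.read.json("hdfs:///sandbox/raw/clickstream/2014-11-*.json")
events.createOrReplaceTempView("events")

# Agile, SQL-like exploration over whatever landed in the sandbox.
spark.sql("""
    SELECT page, COUNT(*) AS views
    FROM events
    GROUP BY page
    ORDER BY views DESC
    LIMIT 10
""").show()
```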

But while the revolutionary opportunity is exciting, there’s an equally compelling opportunity for enterprises to modernize their existing data environment.  Enterprises cannot rely on an iterative dump methodology for managing operational data pipelines.  Unmanaged “data swamps” are simply impractical for business operations.  For an operational data pipeline, the Hadoop environment must be a clean, consistent, and compliant system of record for serving analytical systems.  Loading enterprise data into Hadoop instead of a relational data warehouse does not eliminate the need to prepare it.

Now I have a secret to share with you:  nearly every enterprise adopting Hadoop today to modernize their data environment has processes, standards, tools, and people dedicated to data profiling, data cleansing, data refinement, data enrichment, and data validation.  In the world of enterprise big data, schemas and metadata still matter.

I’ll share some examples with you.  I attended a customer panel at Strata + Hadoop World in October.  One of the participants was the analytics program lead at a large software company whose team was responsible for data preparation.  He described how they ingest data from heterogeneous data sources by mandating a standardized schema for everything that lands in the Hadoop data lake.  Once the data lands, his team profiles, cleans, refines, enriches, and validates the data so that business analysts have access to high quality information.  Another data executive described how inbound data teams are required to convert data into Avro before storing the data in the data lake.  (Avro is an emerging data format alongside other new formats like ORC, Parquet, and JSON).  One data engineer from one of the largest consumer internet companies in the world described the schema review committee that had been set up to govern changes to their data schemas.  The final participant was an enterprise architect from one of the world’s largest telecom providers who described how their data schema was critical for maintaining compliance with privacy requirements since data had to be masked before it could be made available to analysts.
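As an illustration of that “standardized schema on ingest” pattern, here is a minimal sketch using the fastavro library – the landing schema, field names, and file name are assumptions for the example, not any particular company’s standard:

```python
# Convert every inbound feed to one mandated Avro layout before it lands in the lake
# (illustrative schema and file names; uses the fastavro library).
from fastavro import parse_schema, writer

landing_schema = parse_schema({
    "type": "record",
    "name": "LandingRecord",
    "fields": [
        {"name": "source_system", "type": "string"},
        {"name": "event_time",    "type": "long"},
        {"name": "payload",       "type": "string"},
    ],
})

records = [
    {"source_system": "crm",         "event_time": 1415000000, "payload": "{...}"},
    {"source_system": "clickstream", "event_time": 1415000042, "payload": "{...}"},
]

# Whatever the source, the bytes that land in the data lake share the same schema.
with open("landing_batch.avro", "wb") as out:
    writer(out, landing_schema, records)
```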

Let me be clear – these companies are not just bringing CRM and ERP data into Hadoop.  These organizations are ingesting patient sensor data, log files, event data, clickstream data, and in every case, data preparation was the first task at hand.

I recently talked to a large financial services customer who proposed a unique architecture for their Hadoop deployment.  They wanted to empower line of business users to be creative in discovering revolutionary opportunities while also evolving their existing data environment.  They decided to allow lines of business to set up sandbox data lakes on local Hadoop clusters for use by small teams of data scientists.  Then, once a subset of data was profiled, cleansed, refined, enriched, and validated, it would be loaded into a larger Hadoop cluster functioning as an enterprise information lake.  Unlike the sandbox data lakes, the enterprise information lake was clean, consistent, and compliant.  Data stewards of the enterprise information lake could govern metadata and ensure data lineage tracking from source systems to sandbox to enterprise information lakes to destination systems.  Enterprise information lakes balance the quality of a data warehouse with the cost-effective scalability of Hadoop.
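A minimal sketch of what such a promotion gate between sandbox and enterprise information lake might look like – plain Python with made-up field names and rules; a real deployment would rely on profiling and data quality tooling rather than hand-rolled checks:

```python
# Toy promotion gate: only records that pass basic validation move from the
# sandbox data lake to the enterprise information lake; the rest are quarantined
# for the data stewards to investigate. Field names and rules are hypothetical.

REQUIRED_FIELDS = {"customer_id", "event_time", "amount"}

def is_promotable(record):
    """Reject records with missing fields, empty keys, or impossible values."""
    if not REQUIRED_FIELDS.issubset(record):
        return False
    if record["customer_id"] in (None, ""):
        return False
    if record["amount"] is None or record["amount"] < 0:
        return False
    return True

def split_for_promotion(sandbox_records):
    clean = [r for r in sandbox_records if is_promotable(r)]
    quarantine = [r for r in sandbox_records if not is_promotable(r)]
    return clean, quarantine
```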

Building enterprise information lakes out of data lakes is simple and fast with tools that can port data pipeline mappings from traditional architectures to Hadoop.  With visual development interfaces and native execution on Hadoop, enterprises can accelerate their adoption of Hadoop for operational data pipelines.

No one described the opportunity of enterprise information lakes better at Strata + Hadoop World than a data executive from a large healthcare provider who said, “While big data is exciting, equally exciting is complete data…we are data rich and information poor today.”  Schemas and metadata still matter more than ever, and with the help of leading data integration and preparation tools like Informatica, enterprises have a path to unleashing information riches.  To learn more, check out this Big Data Workbook.


Remembering Big Data Gravity – Part 2

I ended my previous blog wondering if awareness of Data Gravity should change our behavior. While Data Gravity adds Value to Big Data, I find that how that Value should be applied is underexplained.

Exponential growth of data has naturally led us to want to categorize it into facts, relationships, entities, etc. This sounds very elementary. While this happens so quickly in our subconscious minds as humans, it takes significant effort to teach this to a machine.

A friend tweeted this to me last week: I paddled out today, now I look like a lobster. Since this tweet, Twitter has inundated my friend and me with promotions from Red Lobster. It is because the machine deconstructed the tweet: paddled <PROPULSION>, today <TIME>, like <PREFERENCE> and lobster <CRUSTACEANS>. While putting these together, the machine decided that the keyword was lobster. You and I both know that my friend was not talking about lobsters.
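As a toy illustration of that kind of context-free tagging – the tag dictionary and scores below are invented for the example, not Twitter’s actual pipeline:

```python
# Naive keyword tagging with no idiom or context (tag dictionary is made up):
# the literal token "lobster" wins, even though the tweet is really about sunburn.

TAG_DICTIONARY = {
    "paddled": ("PROPULSION", 0.4),
    "today":   ("TIME", 0.1),
    "like":    ("PREFERENCE", 0.2),
    "lobster": ("CRUSTACEANS", 0.9),
}

def pick_keyword(tweet):
    tokens = tweet.lower().replace(",", "").split()
    tagged = [(t, *TAG_DICTIONARY[t]) for t in tokens if t in TAG_DICTIONARY]
    # The highest-scoring literal match becomes "the" keyword.
    return max(tagged, key=lambda x: x[2]) if tagged else None

print(pick_keyword("I paddled out today, now I look like a lobster"))
# -> ('lobster', 'CRUSTACEANS', 0.9)
```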

You may think this is just a funny edge case. You can confuse any computer system if you try hard enough, right? Unfortunately, this isn’t an edge case. The 140-character limit hasn’t just changed people’s tweets, it has changed how people talk on the web. More and more information is communicated in smaller and smaller amounts of language, and this trend is only going to continue.

When will the machine understand that “I look like a lobster” means I am sunburned?

I believe the reason there are not hundreds of companies exploiting machine-learning techniques to generate a truly semantic web is the lack of weighted edges in publicly available ontologies. Keep reading – it will all make sense in about five sentences. Lobster and Sunscreen are 7 hops away from each other in dbPedia – way too many to draw any correlation between the two. For that matter, any article in Wikipedia is connected to any other article within about 14 hops, and that’s the extreme. Completely unrelated concepts are often just a few hops from each other.

But by analyzing massive amounts of both written and spoken English text from articles, books, social media, and television, it is possible for a machine to automatically draw a correlation and create a weighted edge between the Lobster and Sunscreen nodes that effectively short-circuits the 7 hops in between. Many organizations are dumping massive amounts of facts, without weights, into our repositories of total human knowledge. They are naïvely attempting to categorize everything without realizing that those repositories need to mimic how humans use knowledge.
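A toy sketch of that short-circuit using the networkx library – the intermediate nodes and weights are invented for illustration, not the actual dbPedia path:

```python
# Unweighted ontology hops vs. one corpus-derived weighted edge (networkx; toy graph).
import networkx as nx

G = nx.Graph()
# A chain of generic, unweighted hops between two "unrelated" concepts.
chain = ["Lobster", "Seafood", "Cuisine", "Summer", "Beach",
         "Skin", "UV exposure", "Sunscreen"]
for a, b in zip(chain, chain[1:]):
    G.add_edge(a, b, weight=1.0)

print(nx.shortest_path_length(G, "Lobster", "Sunscreen"))  # 7 hops apart

# Text analysis finds a strong association and adds a single weighted edge
# (low cost = strong association) that short-circuits the chain.
G.add_edge("Lobster", "Sunscreen", weight=0.1)
print(nx.shortest_path_length(G, "Lobster", "Sunscreen", weight="weight"))  # 0.1
```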

For example – if you hear the name Babe Ruth, what is the first thing that pops to mind? Roman Catholics from Maryland born in the 1800s or Famous Baseball Player?

If you look in Wikipedia today, he is categorized under 28 categories, each of them with the same level of attachment: 1895 births | 1948 deaths | American League All-Stars | American League batting champions | American League ERA champions | American League home run champions | American League RBI champions | American people of German descent | American Roman Catholics | Babe Ruth | Baltimore Orioles (IL) players | Baseball players from Maryland | Boston Braves players | Boston Red Sox players | Brooklyn Dodgers coaches | Burials at Gate of Heaven Cemetery | Cancer deaths in New York | Deaths from esophageal cancer | Major League Baseball first base coaches | Major League Baseball left fielders | Major League Baseball pitchers | Major League Baseball players with retired numbers | Major League Baseball right fielders | National Baseball Hall of Fame inductees | New York Yankees players | Providence Grays (minor league) players | Sportspeople from Baltimore | Maryland | Vaudeville performers.

Now imagine how confused a machine would get when the distance of unweighted edges between nodes is used as a scoring mechanism for relevancy.

If I were to design an algorithm that uses weighted edges (on a scale of 1-5, with 5 being the highest), the same search would yield a much more obvious result.

1895 births [2]| 1948 deaths [2]| American League All-Stars [4]| American League batting champions [4]| American League ERA champions [4]| American League home run champions [4]| American League RBI champions [4]| American people of German descent [2]| American Roman Catholics [2]| Babe Ruth [5]| Baltimore Orioles (IL) players [4]| Baseball players from Maryland [3]| Boston Braves players [4]| Boston Red Sox players [5]| Brooklyn Dodgers coaches [4]| Burials at Gate of Heaven Cemetery [2]| Cancer deaths in New York [2]| Deaths from esophageal cancer [1]| Major League Baseball first base coaches [4]| Major League Baseball left fielders [3]| Major League Baseball pitchers [5]| Major League Baseball players with retired numbers [4]| Major League Baseball right fielders [3]| National Baseball Hall of Fame inductees [5]| New York Yankees players [5]| Providence Grays (minor league) players [3]| Sportspeople from Baltimore [1]| Maryland [1]| Vaudeville performers [1].
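With weights attached, surfacing the most relevant associations becomes trivial – a minimal sketch reusing a few of the weights above:

```python
# Rank categories by their edge weight; the strongest associations surface first.
category_weights = {
    "New York Yankees players": 5,
    "Major League Baseball pitchers": 5,
    "National Baseball Hall of Fame inductees": 5,
    "American League All-Stars": 4,
    "American Roman Catholics": 2,
    "Sportspeople from Baltimore": 1,
    "Vaudeville performers": 1,
}

top = sorted(category_weights.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)  # the "famous baseball player" facets dominate the answer
```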

Now the machine starts to think more like a human. The above example forces us to ask ourselves about the relevancy – a.k.a. the Value – of a response. This is where I think Data Gravity becomes relevant.

You can contact me on twitter @bigdatabeat with your comments.


Remembering Big Data Gravity – Part 1

If you’ve wondered why so many companies are eager to control data storage, the answer can be summed up in a simple term: data gravity. Ultimately, where data is determines where the money is. Services and applications are nothing without it.

Dave McCrory introduced his idea of Data Gravity with a blog post back in 2010. The core idea was – and is – interesting. More recently, Data Gravity featured in this year’s EMC World keynote. But beyond the observation that large or valuable agglomerations of data exert a pull that tends to see them grow in size or value, what is a recognition of Data Gravity actually good for?

As a concept, Data Gravity seems closely associated with the current enthusiasm for Big Data. And like Big Data, the term’s real-world connotations can be unhelpful almost as often as they are helpful. Big Data exhibits at least three characteristics: Volume, Velocity, and Variety. Various other V’s, including Value, are mentioned from time to time, but with less consistency. Yet Big Data’s name says it’s all about size: the speed with which data must be ingested, processed, or excreted is less important, and the complexity and diversity of the data doesn’t seem to matter either.

On its own, the size of a data set is unimportant. Coping with lots of data certainly raises some not-insignificant technical challenges, but the community is actually doing a good job of coming up with technically impressive solutions. The interesting aspect of a huge data set isn’t its size, but the very different modes of working that become possible when you begin to unpick the complex interrelationships between data elements.

Sometimes, Big Data is the vehicle by which enough data is gathered about enough aspects of enough things from enough places for those interrelationships to become observable against the background noise. Other times, Big Data is the background noise, and any hope of insight is drowned beneath the unending stream of petabytes.

To a degree, Data Gravity falls into the same trap. More gravity must be good, right? And more mass leads to more gravity. Mass must be connected to volume, in some vague way that was explained when I was 11, and which involves STP. Therefore, bigger data sets have more gravity. This means that bigger data sets are better data sets. That assertion is clearly nonsense, but luckily, it’s not actually what McCrory is suggesting. His arguments are more nuanced than that, and potentially far more useful.

Instinctively, I like that the equation attempts to move attention away from ‘the application’ toward the pools of data that support many, many applications at once. The data is where the potential lies. Applications are merely the means to unlock that potential in various ways. So maybe notions of Potential Energy from elsewhere in Physics need to figure here.

But I’m wary of the emphasis given to real numbers that are simply the underlying technology’s vital statistics: network latency, bandwidth, request sizes, numbers of requests, and the rest. I realize that these are the measurable things we have, but I feel that more abstract notions of value need to figure just as prominently.
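Purely to make that concrete, here is a toy score built only from those vital statistics – an illustration of the kind of arithmetic involved, and emphatically not McCrory’s published Data Gravity equation:

```python
# Toy "gravity-like" score from a data set's vital statistics (illustrative only;
# not McCrory's equation). More mass and more application pull raise the score;
# network friction lowers it.

def toy_gravity_score(data_mass_tb, requests_per_sec, avg_request_mb,
                      bandwidth_mbps, latency_ms):
    network_friction = latency_ms / bandwidth_mbps   # cost of moving the data
    application_pull = requests_per_sec * avg_request_mb
    return data_mass_tb * application_pull / max(network_friction, 1e-9)

# Two hypothetical data sets: the first scores four times the second --
# but is it really four times as valuable? The numbers alone can't say.
print(toy_gravity_score(500, 2000, 0.5, 10_000, 20))
print(toy_gravity_score(250, 1000, 0.5, 10_000, 20))
```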

So I’m left reaffirming my original impression that Data Gravity is “interesting”. It’s also intriguing, and I keep feeling that it should be insightful. I’m just not — yet — sure exactly how. Is a resource with a Data Gravity of 6 twice as good as a resource with a Data Gravity of 3? Does a data set with a Data Gravity of 15 require three times as much investment/infrastructure/love as a data set scoring a humble 5? It’s unlikely to be that simple, but I do look forward to seeing what happens as McCrory begins to work with the parts of our industry that can lend empirical credibility to his initial dabbling in mathematics.

If real numbers show that the equations stand up, all we then need to do is work out what the numbers mean. Should an awareness of Data Gravity change our behavior, should it validate what gut feel led us to do already, or is it just another ‘interesting’ and ultimately self-evident number that doesn’t take us anywhere?

I don’t know, but I will continue to explore. You can contact me on twitter @bigdatabeat


Not Just For Play, Western Union Puts Hadoop to Work

Everyone’s talking about Hadoop for empowering analysts to quickly experiment, discover, and predict new insights.  But Hadoop isn’t just for play.  Leading enterprises like Western Union are putting Hadoop to work on their most mission-critical data pipelines.  Last week at Strata + Hadoop World, we had a chance to hear how Western Union uses Cloudera’s Hadoop-based enterprise data hub and Informatica to deliver faster, simpler, and cleaner data pipelines.

At Western Union, a multi-billion-dollar global financial services and communications company, data is recognized as a core asset.  Like many other financial services firms, Western Union thrives on data, both for harvesting new business opportunities and for managing its internal operations.  And like many other enterprises, Western Union isn’t just ingesting data from relational data sources.  They are mining a number of new information-rich sources like clickstream data and log data.  With Western Union’s scale and speed demands, the data pipeline just has to work so they can optimize the customer experience across multiple channels (retail, online, mobile, etc.) and grow the business.

Let’s level set on how important scale and speed are to Western Union.  Western Union processes more than 29 financial transactions every second.  Analytical performance simply can’t be the bottleneck for extracting insights from this blazing velocity of data.  So to maximize the performance of their data warehouse appliance, Western Union offloaded data quality and data integration workloads onto a Cloudera Hadoop cluster.  Using the Informatica Big Data Edition, Western Union capitalized on the performance and scalability of Hadoop while unleashing the productivity of their Informatica developers.

Informatica Big Data Edition enables data driven organizations to profile, parse, transform, and cleanse data on Hadoop with a simple visual development environment, prebuilt transformations, and reusable business rules.  So instead of hand coding one-off scripts, developers can easily create mappings without worrying about the underlying execution platform.  Raw data can be easily loaded into Hadoop using Informatica Data Replication and Informatica’s suite of PowerExchange connectors.  After the data is prepared, it can be loaded into a data warehouse appliance for supporting high performance analysis.  It’s a win-win solution for both data managers and data consumers.  Using Hadoop and Informatica, the right workloads are processed by the right platforms so that the right people get the right data at the right time.

Using Informatica’s Big Data solutions, Western Union is transforming the economics of data delivery, enabling data consumers to create safer and more personalized experiences for Western Union’s customers.  Learn how the Informatica Big Data Edition can help put Hadoop to work for you.  And download a free trial to get started today!


Has Hadoop Crossed The Chasm? Thoughts About Strata 2014

Well, it’s been a little over a week since the Strata conference, so I thought I should give some perspective on what I learned.  I think it was summed up at my first meeting, on the first morning of the conference. The meeting was with a financial services company that has significant experience with Hadoop. The first words out of their mouths were, “Hadoop is hard.”

Later in the conference, after a Western Union representative spoke about their Hadoop deployment, they were mobbed with end-user questions and comments. The audience was thrilled to hear about an actual operational deployment: not just a sandbox deployment, but an actual operational Hadoop deployment from a company that is over 160 years old.

The market is crossing the chasm from early adopters who love to hand code (and the macho culture of proving they can do the hard stuff) to more mainstream companies that want to use technology to solve real problems. These mainstream companies aren’t afraid to admit that it is still hard. For the early adopters, nothing is ever hard. They love hard. But the mainstream market doesn’t view it that way.  They don’t want to mess around in the bowels of enabling technology.  They want to use the technology to solve real problems.  The comment from the financial services company represents the perspective of the vast majority of organizations. It is a sign Hadoop is hitting the mainstream market.

More proof we have moved to a new phase?  Cloudera announced they were going from shipping six versions a year down to just three.  I have been saying for a while that we will know Hadoop is real when the distribution vendors stop shipping every two months and go to a more typical enterprise software release schedule.  It isn’t that Hadoop engineering efforts have slowed down.  It is still evolving very rapidly.  It is just that real customers are telling the Hadoop suppliers that they won’t upgrade as fast because they have real business projects running and they can’t do it.  So for those of you who are disappointed by the “slow down,” don’t be.  To me, this is news that Hadoop is reaching critical mass.

Technology is closing the gap to allow organizations to use Hadoop as a platform without having to have an army of Hadoop experts.  That is what Informatica does for data parsing, data integration, data quality, and data lineage (recent product announcement).  In fact, the number one demo at the Informatica booth at Strata was the demonstration of “end to end” data lineage, going from the original source all the way to how the data was loaded and then transformed within Hadoop.  This is purely an enterprise-class capability that becomes more interesting and important when you actually go into true production.

Informatica’s goal is to hide the complexity of Hadoop so companies can get on with the work of using the platform with the skills they already have in house.  And from what I saw from all of the start-up companies that were doing similar things for data exploration and analytics and all the talk around the need for governance, we are finally hitting the early majority of the market.  So, for those of you who still drop down to the underlying UNIX OS that powers a Mac, the rest of us will keep using the GUI.   To the extent that there are “fit for purpose” GUIs on top of Hadoop, the technology will get used by a much larger market.

So congratulations Hadoop, you have officially crossed the chasm!

P.S. See me on theCUBE talking about a similar topic at: youtu.be/oC0_5u_0h2Q
