Category Archives: Hadoop

Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. Hadoop is an Apache top-level project being built and used by a global community of contributors and users.

There is Just One V in Big Data

According to Gartner, 64% of organizations surveyed have invested or are planning to invest in Big Data systems. More and more companies are diving into their data, trying to put it to use to minimize customer churn, analyze financial risk, and improve the customer experience.

Of that 64%, 30% have already invested in Big Data technology, 19% plan to invest within the next year, and another 15% plan to invest within two years. Less than 8% of Gartner’s 720 respondents, however, have actually deployed Big Data technology. This is bad, because most companies simply don’t know what they’re doing when it comes to Big Data.

Over the years, we have heard that Big Data is Volume, Velocity, and Variety. I feel this limited view is one of the reasons why, despite the Big Data hype, most companies are still stuck in neutral.

  1. Volume: The sheer amount of data, from terabytes and petabytes to exabytes and zettabytes
  2. Velocity: Streaming data, milliseconds to seconds; how fast data is produced and how fast it must be processed to meet the need or demand
  3. Variety: Structured, unstructured, text, multimedia, video, audio, sensor data, meter data, HTML, e-mails, etc.

For us, the focus is on collection of data. After all, we are prone to be hoarders, wired by our survival instinct to collect and hoard for the leaner winter months that may come. So we hoard data, as much as we can, for the elusive “What if?” scenario: “Maybe this will be useful someday.” It’s this stockpiling of Big Data without application that makes it useless.

While Volume, Velocity, and Variety are focused on collection of data, Gartner, in 2014, introduced 3 additional Vs: Veracity, Variability, and Value which focus on usefulness of the data.

  1. Veracity: Uncertainty due to data inconsistency and incompleteness, ambiguities, latency, deception, model approximations, accuracy, quality, truthfulness or trustworthiness
  2. Variability: The differing ways in which the data may be interpreted; different questions require different interpretations
  3. Value: Data for co-creation and deep learning

I believe that perfecting as few as 5% of the relevant variables will get a business 95% of the same benefit. The trick is identifying that viable 5%, and extracting meaningful information from it. In other words, “Value” is the long pole in the tent.

Twitter @bigdatabeat

Posted in Big Data, Business Impact / Benefits, Business/IT Collaboration, CIO, Hadoop

Top 5 Big Data Mistakes

I won’t say I’ve seen it all; I’ve only scratched the surface in the past 15 years. Below are some of the mistakes I’ve made or fixed during this time.

MongoDB as your Big Data platform

Why am I picking on MongoDB? Because it is the NoSQL database most abused at this point. While Mongo has an aggregation framework that tastes like MapReduce, and even a very poorly documented Hadoop connector, its sweet spot is as an operational database, not an analytical system.

RDBMS schema as files

You dumped each table from your RDBMS into a file, stored it on HDFS, and now plan to use Hive on it. You know that Hive is slower than an RDBMS; it’ll use MapReduce even for a simple select. Next, let’s look at row sizes: you have flat files measured in single-digit kilobytes.

Hadoop does best on large sets of relatively flat data. I’m sure you can create an extract that’s more de-normalized.
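As a rough sketch of what a more de-normalized extract could look like, the Python below joins two small tables into one flat record per order. The `customers` and `orders` tables, their fields, and the `denormalize` helper are all hypothetical, made up purely for illustration; a real extract would depend on your own schema.

```python
# Hypothetical normalized tables, as they might come out of an RDBMS dump.
customers = [
    {"customer_id": 1, "name": "Acme", "region": "West"},
    {"customer_id": 2, "name": "Globex", "region": "East"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 250.0},
    {"order_id": 11, "customer_id": 2, "amount": 75.5},
    {"order_id": 12, "customer_id": 1, "amount": 40.0},
]


def denormalize(orders, customers):
    """Join each order to its customer so every output row is self-contained."""
    by_id = {c["customer_id"]: c for c in customers}
    flat = []
    for order in orders:
        customer = by_id[order["customer_id"]]
        flat.append({
            "order_id": order["order_id"],
            "amount": order["amount"],
            "customer_name": customer["name"],
            "customer_region": customer["region"],
        })
    return flat


# One wide row per order, with no join needed at query time.
flat_rows = denormalize(orders, customers)
```

The trade-off is deliberate: customer details are repeated on every row, which costs storage but spares the cluster a join over many tiny files.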

Data Ponds

Instead of creating a single Data Lake, you created a series of data ponds or a data swamp. Conway’s law has struck again; your business groups have created their own mini-repositories and data analysis processes. That doesn’t sound bad at first, but with different extracts and ways of slicing and dicing the data, you end up with different views of the data, i.e., different answers for some of the same questions.

Schema-on-read doesn’t mean “Don’t plan at all”; it means “Don’t plan for every question you might ask.”
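To make schema-on-read a little more concrete, here is a minimal Python sketch. It assumes raw events are stored as JSON lines and that a schema is simply a mapping from field name to a type converter; the event fields and the `read_with_schema` helper are hypothetical, not part of any Hadoop tool.

```python
import json

# Raw events are stored exactly as they arrived; no schema is enforced on write.
raw_lines = [
    '{"user": "ann", "amount": "19.99", "channel": "web"}',
    '{"user": "bob", "amount": "5", "device": "mobile"}',
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> type converter) only at read time."""
    for line in lines:
        record = json.loads(line)
        # Fields missing from a record come back as None rather than failing.
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }


# The schema for today's question; a different question can use a different one.
spend_schema = {"user": str, "amount": float}
spend_rows = list(read_with_schema(raw_lines, spend_schema))

# A second question reads the same raw data through a different schema.
channel_schema = {"user": str, "channel": str}
channel_rows = list(read_with_schema(raw_lines, channel_schema))
```

The planning still happens, just later: each question brings its own schema, while the stored data stays untouched.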

Missing use cases

Vendors, to escape the constraints of departmental funding, are selling the idea of the data lake. A byproduct of this is that the business loses sight of real use cases. The data-lake approach can be valid, but you won’t get much out of it if you don’t have actual use cases in mind.

It isn’t hard to come up with use cases, but that is always an afterthought. The business should start thinking of the use cases when their databases can’t handle the load.

SQL

You like SQL. Query languages and techniques have changed with time. Today, think of Pig as PL/SQL on steroids with maybe a touch of acid.

To do a larger bit of analytics, you may need a bigger tool set that may include Hive, Pig, MapReduce, R, and more.

Twitter @bigdatabeat

Posted in Architects, Big Data, Business Impact / Benefits, CIO, Hadoop

Informatica and Hortonworks Talk Analytics in Insurance

On March 25th, Josh Lee, Global Director for Insurance Marketing at Informatica, and Cindy Maike, General Manager, Insurance at Hortonworks, will be joining the Insurance Journal in a webinar on “How to Become an Analytics Ready Insurer”.

Register for the Webinar on March 25th at 10am Pacific/ 1pm Eastern

Josh and Cindy exchange perspectives on what “analytics ready” really means for insurers, and today we are sharing some of our views on the five questions posed here. Please join Insurance Journal, Informatica and Hortonworks on March 25th for more on this exciting topic.

See the Hortonworks site for a second posting of this blog and more details on exciting innovations in Big Data.

  1. What makes a big data environment attractive to an insurer?

CM: Many insurance companies are using new types of data to create innovative products that better meet their customers’ risk needs. For example, we are seeing insurance for “shared vehicles” and new products for prevention services. Much of this innovation is made possible by the rapid growth in sensor and machine data, which the industry incorporates into predictive analytics for risk assessment and claims management.

Customers who buy personal lines of insurance also expect the same type of personalized service and offers they receive from retailers and telecommunication companies. They expect carriers to have a single view of their business that permeates customer experience, claims handling, pricing and product development. Big data in Hadoop makes that single view possible.

JL: Let’s face it, insurance is all about analytics. Better analytics leads to better pricing, reduced risk and better customer service. But here’s the issue. Existing data stores are costly for holding vast amounts of data and too inflexible to adapt to the changing needs of innovative analytics. Imagine kicking off a simulation or modeling routine one evening only to return in the morning and find it incomplete or lacking data that requires a special request of IT.

This is where big data environments are helping insurers. Larger, more flexible data sets allow longer series of analytics to be run, generating better results. And imagine doing all that at a fraction of the cost and time of traditional data structures. Oh, and heaven forbid you ask a mainframe to do any of this.

  2. So we hear a lot about Big Data being great for unstructured data. What about traditional data types that have been used in insurance forever?

CM: Traditional data types are very important to the industry; they drive our regulatory reporting and much of the performance management reporting. This data will continue to play a very important role in the insurance industry and for companies.

However, big data can now enrich that traditional data with new data sources for new insights. In areas such as customer service and product personalization, it can make the difference between cross-selling the right products to meet customer needs and losing the business. For commercial and group carriers, the new data provides the ability to better analyze risk needs, price accordingly and enable superior service in a highly competitive market.

JL: Traditional data will always be around. I doubt that I will outlive a mainframe installation at an insurer, which makes me a little sad. And for many rote tasks like financial reporting, a sales report, or a commission statement, those sources are sufficient. However, the business of insurance is changing in leaps and bounds. Innovators in data science are interested in correlating those traditional sources with other creative data to find new products, or areas to reduce risk. There is just a lot of data that is either ignored or locked in obscure systems that needs to be brought into the light. This data could be structured or unstructured, it doesn’t matter, and Big Data can assist there.

  3. How does this fit into an overall data management function?

JL: At the end of the day, a Hadoop cluster is another source of data for an insurer. More flexible, more cost effective and higher speed; but yet another data source for an insurer. So that’s one more on top of relational, cubes, content repositories, mainframes and whatever else insurers have latched onto over the years. So if it wasn’t completely obvious before, it should be now. Data needs to be managed. As data moves around the organization for consumption, it is shaped, cleaned, copied and we hope there is governance in place. And the Big Data installation is not exempt from any of these routines. In fact, one could argue that it is more critical to leverage good data management practices with Big Data not only to optimize the environment but also to eventually replace traditional data structures that just aren’t working.

CM: Insurance companies are blending new and old data and looking for the best ways to leverage “all data”. We are witnessing the development of a new generation of advanced analytical applications to take advantage of the volume, velocity, and variety in big data. We can also enhance current predictive models, enriching them with the unstructured information in claim and underwriting notes or diaries along with other external data.

There will be challenges. Insurance companies will still need to make important decisions on how to incorporate the new data into existing data governance and data management processes. The Chief Data or Chief Analytics officer will need to drive this business change in close partnership with IT.

  4. Tell me a little bit about how Informatica and Hortonworks are working together on this?

JL: For years Informatica has been helping our clients to realize the value in their data and analytics. And while enjoying great success in partnership with our clients, unlocking the full value of data requires new structures, new storage and something that doesn’t break the bank for our clients. So Informatica and Hortonworks are on a continuing journey to show that value in analytics comes with strong relationships between the Hadoop distribution and innovative market leading data management technology. As the relationship between Informatica and Hortonworks deepens, expect to see even more vertically relevant solutions and documented ROI for the Informatica/Hortonworks solution stack.

CM: Informatica and Hortonworks optimize the entire big data supply chain on Hadoop, turning data into actionable information to drive business value. By incorporating data management services into the data lake, companies can store and process massive amounts of data across a wide variety of channels including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field.

Matching data from internal sources (e.g. very granular data about customers) with external data (e.g. weather data or driving patterns in specific geographic areas) can unlock new revenue streams.

See this video for a discussion on unlocking those new revenue streams. Sanjay Krishnamurthi, Informatica CTO, and Shaun Connolly, Hortonworks VP of Corporate Strategy, share their perspectives.

  5. Do you have any additional comments on the future of data in this brave new world?

CM: My perspective is that, over time, we will drop the reference to “big” or “small” data and get back to referring simply to “Data”. The term big data has been useful to describe the growing awareness of how the new data types can help insurance companies grow.

We can no longer use “traditional” methods to gain insights from data. Insurers need a modern data architecture to store, process and analyze data—transforming it into insight.

We will see an increase in new market entrants in the insurance industry, and existing insurance companies will improve their products and services based upon the insights they have gained from their data, regardless of whether that was “big” or “small” data.

JL: I’m sure that even now there is someone locked in their mother’s basement playing video games and trying to come up with the next data storage wave. So we have that to look forward to, and I’m sure it will be cool. But, if we are honest with ourselves, we’ll admit that we really don’t know what to do with half the data that we have. So while data storage structures are critical, the future holds even greater promise for new models, better analytical tools and applications that can make sense of all of this and point insurers in new directions. The trend that won’t change anytime soon is the ongoing need for good quality data, data ready at a moment’s notice, safe and secure and governed in a way that insurers can trust what those cool analytics show them.

Please join us for an interactive discussion on March 25th at 10am Pacific Time/ 1pm Eastern Time.

Posted in Big Data, Data Quality, Financial Services, Hadoop

Information = Data + R

Over and over, when talking with people who are starting to learn Data Science, there’s a frustration that comes up: “I don’t know which programming language to start with.”

Moreover, it’s not just programming languages; it’s also software systems like Tableau, SPSS, etc. There is an ever-widening range of tools and programming languages and it’s difficult to know which one to select.

I get it. When I started focusing heavily on data science a few years ago, I reviewed all of the popular programming languages at the time: Python, R, SAS, D3, not to mention a few that, in hindsight, really aren’t that great for analytics, like Perl, Bash, and Java. I once read a suggestion to use arcane tools like UNIX’s AWK and SED.

There are so many suggestions, so much material, so many options; it becomes difficult to know what to learn first. There’s a mountain of content, and it’s difficult to know where to find the “gold nuggets”: the things to learn that will bring you a high return on your time investment.

That’s the crux of the problem. The fact is – time is limited. Learning a new programming language is a large investment in your time, so you need to be strategic about which one you select. To be clear, some languages will yield a very high return on your investment. Other languages are purely auxiliary tools that you might use only a few times per year.

Let me make this easy for you: learn R first. Here’s why:

R is becoming the “lingua franca” of data science

R is becoming the lingua franca for data science. That’s not to say that it’s the only language, or that it’s the best tool for every job. It is, however, the most widely used and it is rising in popularity.

As I’ve noted before, O’Reilly Media conducted a survey in 2014 to understand the tools that data scientists are currently using. They found that R is the most popular programming language (if you exclude SQL as a “proper” programming language).

Looking more broadly, there are other rankings that look at programming language popularity in general. For example, Redmonk measures programming language popularity by examining discussion (on Stack Overflow) and usage (on GitHub). In their latest rankings, R placed 13th, the highest of any statistical programming language. Redmonk also noted that R has been rising in popularity over time.

A similar ranking by TIOBE, which ranks programming languages by the number of search engine results that mention them, indicates a strong year-over-year rise for R.

Keep in mind that the Redmonk and TIOBE rankings are for all programming languages. When you look at these, R is now ranking among the most popular and most commonly used overall.

Data wrangling

It’s often said that 80% of the work in data science is data manipulation. More often than not, you’ll need to spend significant amounts of your time “wrangling” your data; putting it into the shape you want. R has some of the best data management tools you’ll find.

The dplyr package in R makes data manipulation easy. It is the tool I wish I had years ago. When you “chain” the basic dplyr verbs together, you can dramatically simplify your data manipulation workflow.

Data visualization

ggplot2 is one of the best data visualization tools around, as of 2015. What’s great about ggplot2 is that as you learn the syntax, you also learn how to think about data visualization.

I’ve said numerous times that there is a deep structure to all statistical visualizations. There is a highly structured framework for thinking about and creating all data visualizations. ggplot2 is based on that framework. By learning ggplot2, you will learn how to think about visualizing data.

Moreover, when you combine ggplot2 and dplyr together (using the chaining methodology), finding insight in your data becomes almost effortless.

Machine learning

Finally, there’s machine learning. While I think most beginning data science students should wait to learn machine learning (it is much more important to learn data exploration first), machine learning is an important skill. When data exploration stops yielding insight, you need stronger tools.

When you’re ready to start using (and learning) machine learning, R has some of the best tools and resources.

One of the best, most referenced introductory texts on machine learning, An Introduction to Statistical Learning, teaches machine learning using the R programming language. Additionally, the Stanford Statistical Learning course uses this textbook, and teaches machine learning in R.

Summary: Learn R, and focus your efforts

Once you start to learn R, don’t get “shiny new object” syndrome.

You’re likely to see demonstrations of new techniques and tools. Just look at some of the dazzling data visualizations that people are creating.

Seeing other people create great work (and finding out that they’re using a different tool) might lead you to try something else. Trust me on this: you need to focus. Don’t get “shiny new object” syndrome. You need to be able to devote a few months (or longer) to really diving into one tool.

And as I noted above, you really want to build up your competence in skills across the data science workflow. You need to have solid skills at least in data visualization and data manipulation. You need to be able to do some serious data exploration in R before you start moving on.

Spending 100 hours on R will yield vastly better returns than spending 10 hours on 10 different tools. In the end, your time ROI will be higher by concentrating your efforts. Don’t get distracted by the “latest, sexy new thing.”

Twitter @bigdatabeat

Posted in Big Data, Data Transformation, General, Hadoop, Professional Services

Informatica Doubled Big Data Business in 2014 As Hadoop Crossed the Chasm

2014 was a turning point for Informatica as our investments in Hadoop and efforts to innovate in big data gathered momentum and became a core part of Informatica’s business. Our Hadoop-related big data revenue growth was in the ballpark of leading Hadoop startups, more than doubling over 2013.

In 2014, Informatica reached about 100 enterprise customers of our big data products with an increasing number going into production with Informatica together with Hadoop and other big data technologies. Informatica’s big data Hadoop customers include companies in financial services, insurance, telecommunications, technology, energy, life sciences, healthcare and business services. These innovative companies are leveraging Informatica to accelerate their time to production and drive greater value from their big data investments.

These customers are in production or implementing a wide range of use cases leveraging Informatica’s great data pipeline capabilities to better put the scale, efficiency and flexibility of Hadoop to work. Many Hadoop customers start by optimizing their data warehouse environments by moving data storage, profiling, integration and cleansing to Hadoop in order to free up capacity in their traditional analytics data warehousing systems. Customers that are further along in their big data journeys have expanded to use Informatica on Hadoop for exploratory analytics of new data types, 360 degree customer analytics, fraud detection, predictive maintenance, and analysis of massive amounts of Internet of Things machine data for optimization of energy exploration, manufacturing processes, network data, security and other large scale systems initiatives.

2014 was not just a year of market momentum for Informatica, but also one of new product development innovations.  We shipped enhanced functionality for entity matching and relationship building at Hadoop scale (a key part of Master Data Management), end-to-end data lineage through Hadoop, as well as high performance real-time streaming of data into Hadoop. We also launched connectors to NoSQL and analytics databases including Datastax Cassandra, MongoDB and Amazon Redshift. Informatica advanced our capabilities to curate great data for self-serve analytics with a connector to output Tableau’s data format and launched our self-service data preparation solution, Informatica Rev.

Customers can now quickly try out Informatica on Hadoop by downloading the free trials for the Big Data Edition and Vibe Data Stream that we launched in 2014.  Now that Informatica supports all five of the leading Hadoop distributions, customers can build their data pipelines on Informatica with confidence that no matter how the underlying Hadoop technologies evolve, their Informatica mappings will run.  Informatica provides highly scalable data processing engines that run natively in Hadoop and leverage the best of open source innovations such as YARN, MapReduce, and more.   Abstracting data pipeline mappings from the underlying Hadoop technologies combined with visual tools enabling team collaboration empowers large organizations to put Hadoop into production with confidence.

As we look ahead into 2015, we have ambitious plans to continue to expand and evolve our product capabilities with enhanced productivity to help customers rapidly get more value from their data in Hadoop. Stay tuned for announcements throughout the year.

Try some of Informatica’s products for Hadoop on the Informatica Marketplace here.

Posted in B2B Data Exchange, Big Data, Data Integration, Data Services, Hadoop

Informatica and Pivotal Delivering Great Data to Customers

As we head into Strata + Hadoop World San Jose, Pivotal has made some interesting announcements that are sure to be the talk of the show. Pivotal’s moves to open-source some of their advanced products (and to form a new organization to foster Hadoop community cooperation) are signs of the dynamism and momentum of the Big Data market.

Informatica applauds these initiatives by Pivotal and we hope that they will contribute to the accelerating maturity of Hadoop and its expansion beyond early adopters into mainstream industry adoption. By contributing HAWQ, GemFire and the Greenplum Database to the open source community, Pivotal creates further open options in the evolving Hadoop data infrastructure technology. We expect this to be well received by the open source community.

As Informatica has long served as the industry’s neutral data connector for more than 5,500 customers and has developed a rich set of capabilities for Hadoop, we are also excited to see efforts to try to reduce fragmentation in the Hadoop community.

Even before the new company Pivotal was formed, Informatica had a long history working with the Greenplum team to ensure that joint customers could confidently use Informatica tools to include the Greenplum Database in their enterprise data pipelines. Informatica has mature and high-performance native connectivity to load data in and out of Greenplum reliably using Informatica’s codeless, visual data pipelining tools. In 2014, Informatica expanded our Hadoop support to include Pivotal HD Hadoop and we have joint customers using Informatica to do data profiling, transformation, parsing and cleansing using Informatica Big Data Edition running on Pivotal HD Hadoop.

We expect these innovative developments driven by Pivotal in the Big Data technology landscape to help to move the industry forward and contribute to Pivotal’s market progress. We look forward to continuing to support Pivotal technology and to an ever increasing number of successful joint customers. Please reach out to us if you have any questions about how Informatica and Pivotal can help your organization to put Big Data into production. We want to ensure that we can help you answer the question … Are you Big Data Ready?

Posted in Big Data, Data Governance, Hadoop

Data Streams, Data Lakes, Data Reservoirs, and Other Large Data Bodies

A Data Lake is a simple concept. It is a catchment area for data entering the organization. In the past, most businesses didn’t need to organize such a data store because almost all data was internal. It traveled via traditional ETL mechanisms from transactional systems to a data warehouse and then was sprayed around the business, as required.

When a good deal of data comes from external sources, or even from internal sources like log files, which never previously made it into the data warehouse, there is a need for an “operational data store.” This has definitely become the premier application for Hadoop and it makes perfect sense to me that such technology be used for a data catchment area. The neat thing about Hadoop for this application is that:

  1. It scales out “as far as the eye can see,” so there’s no likelihood of it being unable to manage the data volumes even when they grow beyond the petabyte level.
  2. It handles data as key-value pairs (in MapReduce, and in stores such as HBase that run on top of it), which means that you don’t need to expend much effort in modeling data when you decide to accommodate a new data source. You just define a key and define the metadata at leisure.
  3. The cost of the software and the storage is very low.
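The “define a key now, define the metadata at leisure” idea from the list above can be sketched in a few lines of Python. This is a toy in-memory stand-in, not a Hadoop or HBase API; the `CatchmentStore` class, its methods, and the sample key are all hypothetical.

```python
class CatchmentStore:
    """Toy in-memory stand-in for a key-value landing area (not a Hadoop API)."""

    def __init__(self):
        self._values = {}
        self._metadata = {}

    def put(self, key, raw_value):
        # Land the data immediately; no modeling is required up front.
        self._values[key] = raw_value

    def describe(self, key, **metadata):
        # Metadata can be attached later, once the source is better understood.
        if key not in self._values:
            raise KeyError(key)
        self._metadata.setdefault(key, {}).update(metadata)

    def get(self, key):
        # Return the raw value together with whatever metadata exists so far.
        return self._values[key], self._metadata.get(key, {})


store = CatchmentStore()
store.put("weblog/2015-03-01", "GET /index.html 200")
before = store.get("weblog/2015-03-01")  # value is available, metadata is empty
store.describe("weblog/2015-03-01", source="web server", format="access log")
after = store.get("weblog/2015-03-01")   # same value, now with metadata
```

The point of the sketch is the ordering: the data is usable as soon as it lands, and the description of it arrives whenever someone gets around to writing it.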

So let’s imagine that we have a need for a data catchment area because we have decided to collect data from log files, mobile devices, social networks, public data sources, or whatever. Let us also imagine that we have implemented Hadoop and some of its useful components and we have begun to collect data.

Is it reasonable to describe this as a data lake?

A Hadoop implementation should not be a set of servers randomly placed at the confluence of various data flows. The placement needs to be carefully considered and if the implementation is to resemble a “data lake” in any way, then it must be a well-engineered man-made lake. Since the data doesn’t just sit there until it evaporates but eventually flows to various applications, we should think of this as a “data reservoir” rather than a “data lake.”

There is no point in arranging all that data neatly along the aisles, because at the time we get it we may not know what we want to do with it. We should organize the data when we know that.

Another reason we should think of this as more like a reservoir than a lake is that we might like to purify the data a little before sending it down the pipes to applications or users that want to use it.
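As one possible illustration of purifying data on its way out of the reservoir, the Python sketch below cleans records as they stream toward a consumer. The `user` and `source` fields and the cleaning rules are assumptions chosen for the example; real purification rules depend on the data and on who consumes it.

```python
def purify(records):
    """Yield cleaned records, skipping ones that cannot be used downstream."""
    for record in records:
        user = (record.get("user") or "").strip().lower()
        if not user:
            continue  # Drop records with no usable identifier.
        source = (record.get("source") or "unknown").strip()
        yield {"user": user, "source": source}


# Hypothetical raw records as they sit in the reservoir.
raw = [
    {"user": "  Ann ", "source": "web"},
    {"user": "", "source": "mobile"},
    {"user": "BOB"},
]
clean = list(purify(raw))
```

Because the cleaning happens on the way out, the raw records stay in the reservoir unchanged and a different consumer can apply different rules.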

Twitter @bigdatabeat

Posted in Architects, Big Data, CIO, Cloud Data Integration, Cloud Data Management, DaaS, Hadoop, IaaS

Big Data Is Neither-Part II

You Say Big Dayta, I Say Big Dahta

Some say Big Data is a great challenge while others say Big Data creates new opportunities. Where do you stand?  For most companies concerned with their Big Data challenges, it shouldn’t be so difficult – at least on paper. Computing costs (both hardware and software) have vastly shrunk. Databases and storage techniques have become more sophisticated and scale massively, and companies such as Informatica have made connecting and integrating all the “big” and disparate data sources much easier and have helped companies achieve a sort of “big data synchronicity”. As it is.

In the process of creating solutions to Big Data problems, humans (and the supra-species known as IT Sapiens) have a tendency to use theories based on linear thinking and the scientific method. There is data as our systems know it and data as our systems don’t. The reality, in my opinion, is that “Really Big Data” problems now and in the future will have complex correlations and unintuitive relationships that need to utilize mathematical disciplines, data models and algorithms that haven’t even been discovered or invented yet and when eventually discovered, will make current database science positively primordial.

At some point in the future, machines will be able to predict, based on big, perhaps unknown data types, when someone is having a bad day or a good day, or more importantly whether a person may behave in a good or bad way. Many people do this now when they take a glance at someone across a room and infer how that person is feeling or what they will do next. They see eyes that are shiny or dull, crinkles around eyes or sides of mouths, then hear the “tone” in a voice and then their neurons put it all together that this is a person that is having a bad day and needs a hug. Quickly. No one knows exactly how the human brain does this, but it does what it does and we go with it and we are usually right.

And some day, Big Data will be able to derive this and it will be an evolution point and it will also be a big business opportunity. Through bigger and better data ingestion and integration techniques and more sophisticated math and data models, a machine will do this fast and relatively speaking, cheaply. The vast majority won’t understand why or how it’s done, but it will work and it will be fairly accurate.

And my question to you all is this.

Do you see any alternate scenarios regarding the future of big data? Is contextual computing an important evolution, and will big data integration be more or less of a problem in the future?

PS. Oh yeah, one last thing to chew on concerning Big Data… If Big Data becomes big enough, does that spell the end of modelling as we know it?

Posted in Big Data, Business Impact / Benefits, Business/IT Collaboration, CMO, Complex Event Processing, Data Integration Platform, Hadoop, Intelligent Data Platform

“It’s not you, it’s me!” – says Data Quality to Big Data

I couldn’t help but start this post with George Costanza’s “You are giving me the – It’s not you, it’s me! – routine? I invented – It’s not you, it’s me …”

The thing that resonates today, in the odd context of big data, is that we may all need to look in the mirror, hold a thumb drive full of information in our hands, and concede once and for all It’s not the data… it’s us.

Many organizations have a hard time making something useful from the ever-expanding universe of big-data, but the problem doesn’t lie with the data: It’s a people problem.

The contention is that big-data is falling short of the hype because people are:

  1. too unwilling to create cultures that value standardized, efficient, and repeatable information, and
  2. too complex to be reduced to “thin data” created from digital traces.

Evan Stubbs describes poor data quality as the data analyst’s single greatest problem.


About the only satisfying thing about having bad data is the schadenfreude that goes along with it. There’s cold solace in knowing that regardless of how poor your data is, everyone else’s is equally as bad. The thing is poor quality data doesn’t just appear from the ether. It’s created. Leave the dirty dishes for long enough and you’ll end up with cockroaches and cholera. Ignore data quality and eventually you’ll have black holes of untrustworthy information. Here’s the hard truth: we’re the reason bad data exists.


I will tell you that most data teams make “large efforts” to scrub their data. Those “infrequent” big cleanups, however, only treat the symptom, not the cause, and ultimately lead to inefficiency, cost, and even more frustration.

It’s intuitive and natural to think that data quality is a technological problem. It’s not; it’s a cultural problem. The real answer is that you need to create a culture that values standardized, efficient, and repeatable information.

If you do that, then you’ll be able to create data that is re-usable, efficient, and high quality. Rather than trying to manage a shanty of half-baked source tables, effective teams put the effort into designing, maintaining, and documenting their data. Instead of being a one-off activity, it becomes part of business as usual, something that’s simply part of daily life.

However, even if that data is the best it can possibly be, is it even capable of delivering on the big-data promise of greater insights about things like the habits, needs, and desires of customers?

Despite the enormous growth of data and the success of a few companies like Amazon and Netflix, “the reality is that deeper insights for most organizations remain elusive,” write Mikkel Rasmussen and Christian Madsbjerg in a Bloomberg Businessweek blog post that argues “big-data gets people wrong.”


Big-data delivers thin data. In the social sciences, we distinguish between two types of human behavior data. The first – thin data – is from digital traces: He wears a size 8, has blue eyes, and drinks pinot noir. The second – rich data – delivers an understanding of how people actually experience the world: He could smell the grass after the rain, he looked at her in that special way, and the new running shoes made him look faster. Big-data focuses solely on correlation, paying no attention to causality. What good is thin “information” when there is no insight into what your consumers actually think and feel?


Accenture reported only 20 percent of the companies it profiled had found a proven causal link between “what they measure and the outcomes they are intending to drive.”

I contend that the keys to transforming big-data into strategic value are critical thinking skills.

Where do we get such skills? People, it seems, are both the problem and the solution. Are we failing on two fronts: failing to create the right data-driven cultures, and failing to interpret the data we collect?

Twitter @bigdatabeat

Posted in Architects, Big Data, Business Impact / Benefits, CIO, Data Governance, Data Quality, Data Transformation, Hadoop

There are Three Kinds of Lies: Lies, Damned lies, and Data

The 19th-century phrase often attributed to Benjamin Disraeli is: There are three kinds of lies: lies, damned lies, and statistics.

Not so long ago, Google created a website to estimate how many people had influenza. It did this by tracking flu-related search queries and the location of each query, then feeding both into an estimation algorithm. According to the website, at the flu season’s peak in January, nearly 11 percent of the United States population may have had influenza. At a U.S. population of roughly 315 million, that works out to about 35 million of us having had the flu or flu-like symptoms. In its weekly report, the Centers for Disease Control and Prevention put the figure at 5.6%, or fewer than 18 million of us who actually went to the doctor’s office to be tested for flu or to get a flu shot.
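
To make the size of that gap concrete, here is a minimal sketch of the arithmetic. The two rates come from the paragraph above; the population figure of 315 million and the helper name `implied_cases` are illustrative assumptions, not numbers or code from Google or the CDC, and the head counts shift with whatever population you plug in.

```python
def implied_cases(rate_percent: float, population: int) -> int:
    """Convert a percentage of a population into a head count."""
    return round(population * rate_percent / 100)


# Rates quoted in the post.
GOOGLE_RATE = 11.0  # percent, Google's peak estimate
CDC_RATE = 5.6      # percent, CDC weekly report

# Assumption: an illustrative U.S. population, not a figure from the post.
ASSUMED_US_POPULATION = 315_000_000

google_cases = implied_cases(GOOGLE_RATE, ASSUMED_US_POPULATION)
cdc_cases = implied_cases(CDC_RATE, ASSUMED_US_POPULATION)

# The ratio does not depend on the population assumption at all.
overestimate_ratio = GOOGLE_RATE / CDC_RATE

print(f"Google estimate: {google_cases:,} people")
print(f"CDC figure:      {cdc_cases:,} people")
print(f"Google / CDC:    {overestimate_ratio:.2f}x")
```

Whatever population you assume, the ratio stays the same: the search-based estimate is roughly double the CDC figure.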

Now, imagine if I were a drug manufacturer planning production around Google’s number: I would have prepared for roughly twice the demand the CDC figures suggest. There is a theory about what went wrong. The problems may be due to widespread media coverage of this year’s flu season. Then add social media, which helped news of the flu spread faster than the virus itself. In other words, the algorithm is looking only at the numbers, not at the context of the search queries.

In today’s digitally connected world, data is everywhere: in our phones, search queries, friendships, dating profiles, cars, food, and reading habits. Almost everything we touch is part of a larger data set. The people and companies that interpret the data may fail to apply background and outside conditions to the numbers they capture.

So, while we build our big data repositories, we have to take the time to document how we collected the data and in what context.

Twitter @bigdatabeat

Posted in Big Data, Cloud Data Management, Data Governance, Data Transformation, Data Warehousing, Hadoop