Tag Archives: Big Data

Remembering Big Data Gravity – PART 2

I ended my previous blog wondering if awareness of Data Gravity should change our behavior. While Data Gravity adds Value to Big Data, I find that how that Value is actually applied is underexplained.

Exponential growth of data has naturally led us to want to categorize it into facts, relationships, entities, etc. This sounds very elementary. Yet while this happens almost instantly in our subconscious minds as humans, it takes significant effort to teach it to a machine.

A friend tweeted this to me last week: “I paddled out today, now I look like a lobster.” Since that tweet, Twitter has inundated my friend and me with promotions from Red Lobster. That is because the machine deconstructed the tweet: paddled <PROPULSION>, today <TIME>, like <PREFERENCE> and lobster <CRUSTACEANS>. Putting these together, the machine decided that the keyword was lobster. You and I both know that my friend was not talking about lobsters.
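
To make that failure mode concrete, here is a minimal Python sketch of this kind of naive, literal-minded tagging. The toy lexicon and scores are invented for illustration and merely stand in for a real trained entity extractor.

```python
# A minimal sketch of naive entity tagging and keyword picking.
# The lexicon and scores are invented; a real system would use trained
# NER models and much larger vocabularies.
TOY_LEXICON = {
    "paddled": ("PROPULSION", 0.4),
    "today": ("TIME", 0.2),
    "like": ("PREFERENCE", 0.3),
    "lobster": ("CRUSTACEANS", 0.9),
}

def pick_keyword(tweet: str):
    """Tag each known token and return the highest-scoring one."""
    tags = []
    for token in tweet.lower().replace(",", "").split():
        if token in TOY_LEXICON:
            label, score = TOY_LEXICON[token]
            tags.append((score, token, label))
    return max(tags) if tags else None

print(pick_keyword("I paddled out today, now I look like a lobster"))
# (0.9, 'lobster', 'CRUSTACEANS') -- the literal reading wins,
# even though the tweet is really about sunburn.
```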

You may think that this may be just a funny edge case. You can confuse any computer system if you try hard enough, right? Unfortunately, this isn’t an edge case. The 140-character limit has not just changed people’s tweets; it has changed how people talk on the web. More and more information is communicated in smaller and smaller amounts of language, and this trend is only going to continue.

When will the machine understand that “I look like a lobster” means I am sunburned?

I believe the reason there are not hundreds of companies exploiting machine-learning techniques to generate a truly semantic web is the lack of weighted edges in publicly available ontologies. Keep reading, it will all make sense in about 5 sentences. Lobster and Sunscreen are 7 hops away from each other in dbPedia – way too many to draw any correlation between the two. For that matter, any article in Wikipedia is connected to any other article within about 14 hops, and that’s the extreme. Completely unrelated concepts are often just a few hops from each other.
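
To see why hop counting alone is such a weak signal, here is a small Python sketch that measures unweighted hop distance with a breadth-first search over a toy concept graph. The nodes and links are invented and only stand in for an ontology like dbPedia.

```python
from collections import deque

# A toy, hand-made concept graph standing in for an ontology like dbPedia.
GRAPH = {
    "Lobster": ["Crustacean", "Seafood"],
    "Crustacean": ["Lobster", "Marine life"],
    "Seafood": ["Lobster", "Cuisine"],
    "Marine life": ["Crustacean", "Ocean"],
    "Ocean": ["Marine life", "Beach"],
    "Beach": ["Ocean", "Sun exposure"],
    "Sun exposure": ["Beach", "Sunburn", "Sunscreen"],
    "Sunburn": ["Sun exposure"],
    "Sunscreen": ["Sun exposure"],
    "Cuisine": ["Seafood"],
}

def hops(graph, start, goal):
    """Breadth-first search: number of unweighted edges between two concepts."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

print(hops(GRAPH, "Lobster", "Sunscreen"))
# 6 in this toy graph -- far enough apart that a distance-only scorer
# sees no meaningful connection between the two concepts.
```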

But by analyzing massive amounts of both written and spoken English text from articles, books, social media, and television, it is possible for a machine to automatically draw a correlation and create a weighted edge between the Lobster and Sunscreen nodes that effectively short-circuits the 7 hops between them. Many organizations are dumping massive amounts of facts without weights into our repositories of total human knowledge, naïvely attempting to categorize everything without realizing that repositories of human knowledge need to mimic how humans use knowledge.
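
One rough sketch of how such a weighted edge could be derived is below: count co-occurrences across a corpus and compute pointwise mutual information (PMI). The tiny “corpus” is obviously invented; the real exercise would run over billions of sentences, and PMI is only one of several plausible weighting schemes.

```python
import math
from collections import Counter
from itertools import combinations

# Toy "documents" standing in for a massive corpus of written and spoken text.
docs = [
    "got a lobster tan at the beach forgot the sunscreen",
    "paddled all day no sunscreen now red as a lobster",
    "lobster dinner with butter and lemon",
    "reapply sunscreen or you will burn like a lobster",
    "the quarterly report is due on friday",
    "traffic on the highway was terrible this morning",
    "new coffee shop opened downtown today",
]

word_counts = Counter()
pair_counts = Counter()
for doc in docs:
    words = set(doc.split())
    word_counts.update(words)
    pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))

def pmi(a, b, n_docs=len(docs)):
    """How much more often a and b co-occur than chance would predict."""
    p_a = word_counts[a] / n_docs
    p_b = word_counts[b] / n_docs
    p_ab = pair_counts[frozenset((a, b))] / n_docs
    return math.log2(p_ab / (p_a * p_b)) if p_ab else float("-inf")

print(round(pmi("lobster", "sunscreen"), 2))
# ~0.81: a positive association -- grounds for a weighted "short-circuit"
# edge between the Lobster and Sunscreen nodes.
```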

For example – if you hear the name Babe Ruth, what is the first thing that pops to mind? Roman Catholics from Maryland born in the 1800s or Famous Baseball Player?

If you look at Wikipedia today, he is listed under 28 categories, each with the same level of attachment: 1895 births | 1948 deaths | American League All-Stars | American League batting champions | American League ERA champions | American League home run champions | American League RBI champions | American people of German descent | American Roman Catholics | Babe Ruth | Baltimore Orioles (IL) players | Baseball players from Maryland | Boston Braves players | Boston Red Sox players | Brooklyn Dodgers coaches | Burials at Gate of Heaven Cemetery | Cancer deaths in New York | Deaths from esophageal cancer | Major League Baseball first base coaches | Major League Baseball left fielders | Major League Baseball pitchers | Major League Baseball players with retired numbers | Major League Baseball right fielders | National Baseball Hall of Fame inductees | New York Yankees players | Providence Grays (minor league) players | Sportspeople from Baltimore | Maryland | Vaudeville performers.

Now imagine how confused a machine would get when the distance of unweighted edges between nodes is used as a scoring mechanism for relevancy.

If I were to design an algorithm that uses weighted edges (on a scale of 1-5, with 5 being the highest), the same search would yield a much more obvious result.

1895 births [2]| 1948 deaths [2]| American League All-Stars [4]| American League batting champions [4]| American League ERA champions [4]| American League home run champions [4]| American League RBI champions [4]| American people of German descent [2]| American Roman Catholics [2]| Babe Ruth [5]| Baltimore Orioles (IL) players [4]| Baseball players from Maryland [3]| Boston Braves players [4]| Boston Red Sox players [5]| Brooklyn Dodgers coaches [4]| Burials at Gate of Heaven Cemetery [2]| Cancer deaths in New York [2]| Deaths from esophageal cancer [1]| Major League Baseball first base coaches [4]| Major League Baseball left fielders [3]| Major League Baseball pitchers [5]| Major League Baseball players with retired numbers [4]| Major League Baseball right fielders [3]| National Baseball Hall of Fame inductees [5]| New York Yankees players [5]| Providence Grays (minor league) players [3]| Sportspeople from Baltimore [1]| Maryland [1]| Vaudeville performers [1].
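
As a quick illustration of what those weights buy you, the sketch below ranks a subset of the categories by the hand-assigned weights above. A production system would learn the weights from usage data rather than assign them by hand.

```python
# Scoring with weighted edges (weights 1-5, mirroring the list above).
categories = {
    "New York Yankees players": 5,
    "National Baseball Hall of Fame inductees": 5,
    "Major League Baseball pitchers": 5,
    "American League home run champions": 4,
    "Baseball players from Maryland": 3,
    "American Roman Catholics": 2,
    "1895 births": 2,
    "Vaudeville performers": 1,
}

def top_associations(weighted_edges, k=3):
    """Return the k categories a human would most likely think of first."""
    return sorted(weighted_edges, key=weighted_edges.get, reverse=True)[:k]

print(top_associations(categories))
# Baseball-related categories surface first; "Roman Catholics from Maryland
# born in the 1800s" sinks to the bottom, just as it does in a human mind.
```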

Now the machine starts to think more like a human. The above example forces us to ask ourselves about the relevancy, a.k.a. the Value, of the response. This is where I think Data Gravity’s Value becomes relevant.

You can contact me on Twitter @bigdatabeat with your comments.

Posted in Architects, Big Data, Cloud, Cloud Data Management, Data Aggregation, Data Archiving, Data Governance, General, Hadoop

Can Big Data Live Up To Its Promise For Retailers?

Was Sam Walton, founder of Walmart, talking about Big Data and Analytics in the ’90s when he said, “People think we got big by putting big stores in small towns. Really, we got big by replacing inventory with information”? Walmart clearly understood the value of the large volumes of data it had access to and turned that data into competitive advantage.

As retailers move from looking in the rear-view mirror (what happened) to the road ahead (what will happen), they have turned to Big Data and Analytics for answers. While Big Data holds great promise for retailers, many are skeptical. Retailers are already drinking from the data fire hose, whether it’s transaction data recording every product sold to every customer across all channels, research data covering detailed consumer profiles, or web log and social data. The questions retailers are asking: Will the investment drive more revenue, increase customer loyalty and create a more rewarding customer experience? Will I gain deeper insight into customer transactions and interactions across the organization? Can we use existing resources and infrastructure?

The answer is yes. Big Data presents the opportunity to better analyze everything from customer shopping behavior at each stage of the purchase journey, to inventory planning, to delivering relevant and personalized offers. Analyzing how shoppers found your products, how long they spent browsing product pages and which products they added to their basket provides greater insight into the decision process they went through before purchase, and helps retailers quickly identify cross-sell and up-sell opportunities in real time. In addition, combining transaction data with what your customers are saying on social channels (ratings, likes, dislikes, what’s trending, etc.) can feed into the decisions you make on placing the right product in the right store at the right price, and ultimately deliver very personalized and contextual offers to customers.
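
As a toy illustration of the cross-sell point, the Python sketch below mines co-purchase pairs from basket data. The products and baskets are hypothetical; a real implementation would run over full transaction logs joined across channels.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets standing in for real multi-channel transaction data.
baskets = [
    {"running shoes", "socks", "water bottle"},
    {"running shoes", "socks"},
    {"running shoes", "fitness tracker"},
    {"yoga mat", "water bottle"},
]

pair_counts = Counter(
    frozenset(pair)
    for basket in baskets
    for pair in combinations(sorted(basket), 2)
)

def cross_sell(product, k=2):
    """Products most often bought alongside the given one."""
    scores = {}
    for pair, count in pair_counts.items():
        if product in pair:
            (other,) = pair - {product}
            scores[other] = count
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(cross_sell("running shoes"))
# 'socks' ranks first -- a candidate for a real-time cross-sell offer.
```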

Data Driven Decisions: Getting Value from Big Data

Turning Big Data into actionable insight is not just about dumping data into a “Data Lake”, pointing an analytics tool at it and declaring the job done! Retailers need to take a number of steps to profit from Big Data and Analytics.

  • Firstly, you need to gather data from all available sources, in batch or real time, from internal and external systems, and from an ever-increasing number of devices (beacons, mobile devices). Once you have gathered the data, it needs to be connected, validated and cleansed, with a governance process put in place, before integrating with analytic tools and systems (a minimal cleansing sketch follows this list).
  • Secondly, put clean and trusted data in the hands of data scientists who can distill the relevant from the irrelevant and formulate commercial insights that the business can action and profit from.
  • Lastly, plan and organize for success. IT and business need to align behind the same agenda, regularly reviewing business priorities and adjusting as needed. Maximize scarce IT resources by leveraging existing technologies and Cloud platforms, and by forming alliances with 3rd-party vendors to fill skills gaps. Secure quick wins for your Big Data initiatives; maybe start with integrating historical transaction data with real-time purchase data to make personalized offers at the point of sale. Look outside your organization to other industries like retail banking or telecommunications and learn from their successes and failures.
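
Here is the minimal cleansing sketch referenced in the first point, using pandas on a hypothetical transactions feed. Real pipelines would add governance rules, survivorship logic and far richer quality checks.

```python
import pandas as pd

# Hypothetical raw feed: duplicate order, bad email, negative amount.
raw = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "email": ["a@shop.com", "b@shop.com", "not-an-email", "b@shop.com"],
    "amount": [25.0, 40.0, -5.0, 40.0],
})

cleaned = (
    raw.drop_duplicates(subset="order_id", keep="first")    # de-duplicate
       .loc[lambda df: df["amount"] > 0]                     # validate amounts
       .assign(valid_email=lambda df: df["email"].str.contains("@"))
)

print(cleaned)  # three clean rows, flagged and ready for analytics tools
```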

With the right approach, Big Data will deliver the return on investment for retailers.

Posted in Big Data, Retail

Remembering Big Data Gravity – Part 1

If you’ve wondered why so many companies are eager to control data storage, the answer can be summed up in a simple term: data gravity. Ultimately, where data is determines where the money is. Services and applications are nothing without it.

Dave McCrory introduced his idea of Data Gravity with a blog post back in 2010. The core idea was – and is – interesting. More recently, Data Gravity featured in this year’s EMC World keynote. But, beyond the observation that large or valuable agglomerations of data exert a pull that tends to see them grow in size or value, what is a recognition of Data Gravity actually good for?

As a concept, Data Gravity seems closely associated with the current enthusiasm for Big Data. And, like Big Data, the term’s real-world connotations can be unhelpful almost as often as they are helpful. Big Data exhibits at least three characteristics: Volume, Velocity, and Variety. Various other V’s, including Value, are mentioned from time to time, but with less consistency. Yet Big Data’s name says it’s all about size, not the speed with which data must be ingested, processed, or excreted, and not the complexity and diversity of the data.

On its own, the size of a data set is unimportant. Coping with lots of data certainly raises some not-insignificant technical challenges, but the community is actually doing a good job of coming up with technically impressive solutions. The interesting aspect of a huge data set isn’t its size, but the very different modes of working that become possible when you begin to unpick the complex interrelationships between data elements.

Sometimes, Big Data is the vehicle by which enough data is gathered about enough aspects of enough things from enough places for those interrelationships to become observable against the background noise. Other times, Big Data is the background noise, and any hope of insight is drowned beneath the unending stream of petabytes.

To a degree, Data Gravity falls into the same trap. More gravity must be good, right? And more mass leads to more gravity. Mass must be connected to volume, in some vague way that was explained when I was 11, and which involves STP. Therefore, bigger data sets have more gravity. This means that bigger data sets are better data sets. That assertion is clearly nonsense, but luckily, it’s not actually what McCrory is suggesting. His arguments are more nuanced than that, and potentially far more useful.

Instinctively, I like that the equation attempts to move attention away from ‘the application’ toward the pools of data that support many, many applications at once. The data is where the potential lies. Applications are merely the means to unlock that potential in various ways. So maybe notions of Potential Energy from elsewhere in Physics need to figure here.

But I’m wary of the emphasis given to real numbers that are simply the underlying technology’s vital statistics; network latency, bandwidth, request sizes, numbers of requests, and the rest. I realize that these are the measurable things that we have, but feel that more abstract notions of value need to figure just as prominently.

So I’m left reaffirming my original impression that Data Gravity is “interesting”. It’s also intriguing, and I keep feeling that it should be insightful. I’m just not — yet — sure exactly how. Is a resource with a Data Gravity of 6 twice as good as a resource with a Data Gravity of 3? Does a data set with a Data Gravity of 15 require three times as much investment/infrastructure/love as a data set scoring a humble 5? It’s unlikely to be that simple, but I do look forward to seeing what happens as McCrory begins to work with the parts of our industry that can lend empirical credibility to his initial dabbling in mathematics.

If real numbers show the equations to stand up, all we then need to do is work out what the numbers mean. Should an awareness of Data Gravity change our behavior, should it validate what gut feel led us to do already, or is it just another ‘interesting’ and ultimately self-evident number that doesn’t take us anywhere?

I don’t know, but I will continue to explore. You can contact me on Twitter @bigdatabeat.

Posted in General, Hadoop

Amazon Web Services and Informatica Deliver Data-Ready Cloud Computing Infrastructure for Every Business

At re:Invent 2014 in Las Vegas,  Informatica and AWS announced a broad strategic partnership to deliver data-ready cloud computing infrastructure to any type or size of business.

Informatica’s comprehensive portfolio across Informatica Cloud and PowerCenter solutions connects to multiple AWS Data Services including Amazon Redshift, RDS, DynamoDB, S3, EMR and Kinesis – the broadest pre-built connectivity available to AWS Data Services. Informatica and AWS offerings are pre-integrated, enabling customers to rapidly and cost-effectively implement data warehousing, large-scale analytics, lift and shift, and other key use cases in cloud-first and hybrid IT environments. Now, any company can use Informatica’s portfolio to get a plug-and-play on-ramp to the cloud with AWS.

Economical and Flexible Path to the Cloud

As business information needs intensify and data environments become more complex, the combination of AWS and Informatica enables organizations to increase the flexibility and reduce the costs of their information infrastructures through:

  • More cost-effective data warehousing and analytics – Customers benefit from lower costs and increased agility when unlocking the value of their data with no on-premise data warehousing/analytics environment to design, deploy and manage.
  • Broad, easy connectivity to AWS – Customers gain full flexibility in integrating data from any Informatica-supported data source (the broadest set of sources supported by any integration vendor) through the use of pre-built connectors for AWS.
  • Seamless hybrid integration – Hybrid integration scenarios across Informatica PowerCenter and Informatica Cloud data integration deployments are able to connect seamlessly to AWS services.
  • Comprehensive use case coverage – Informatica solutions for data integration and warehousing, data archiving, data streaming and big data across cloud and on-premise applications mesh with AWS solutions such as RDS, Redshift, Kinesis, S3, DynamoDB, EMR and other AWS ecosystem services to drive new and rapid value for customers.

New Support for AWS Services

Informatica introduced a number of new Informatica Cloud integrations with AWS services, including connectors for Amazon DynamoDB, Amazon Elastic MapReduce (Amazon EMR) and Amazon Simple Storage Service (Amazon S3), to complement the existing connectors for Amazon Redshift and Amazon Relational Database Service (Amazon RDS).

Additionally, the latest Informatica PowerCenter release for Amazon Elastic Compute Cloud (Amazon EC2) includes support for:

  • PowerCenter Standard Edition and Data Quality Standard Edition
  • Scaling options – Grid, high availability, pushdown optimization, partitioning
  • Connectivity to Amazon RDS and Amazon Redshift
  • Domain and repository DB in Amazon RDS for databases on the current PAM (Product Availability Matrix)

To learn more, try our 60-day free Informatica Cloud trial for Amazon Redshift.

If you’re in Vegas, please come by and see us at re:Invent, Nov. 11-14, at Booth #1031, Venetian / Sands Hall.

Posted in Business Impact / Benefits, Cloud, Cloud Application Integration, Cloud Computing

Big Data Driving Data Integration at the NIH

The National Institutes of Health announced new grants to develop big data technologies and strategies.

“The NIH multi-institute awards constitute an initial investment of nearly $32 million in fiscal year 2014 by NIH’s Big Data to Knowledge (BD2K) initiative and will support development of new software, tools and training to improve access to these data and the ability to make new discoveries using them,” NIH said in its announcement of the funding.

The grants will address issues around Big Data adoption, including:

  • Locating data and the appropriate software tools to access and analyze the information.
  • Lack of data standards, or low adoption of standards across the research community.
  • Insufficient policies to facilitate data sharing while protecting privacy.
  • Unwillingness to collaborate, which limits the data’s usefulness to the research community.

Among the tasks funded is the creation of a “Perturbation Data Coordination and Integration Center.”  The center will provide support for data science research that focuses on interpreting and integrating data from different data types and databases.  In other words, it will make sure the data moves to where it should move, in order to provide access to information that’s needed by the research scientist.  Fundamentally, it’s data integration practices and technologies.

This is very interesting from the standpoint that the movement into big data systems often drives a reevaluation of, or even new interest in, data integration.  As the data becomes strategically important, the need to provide core integration services becomes even more important.

The project at the NIH will be interesting to watch, as it progresses.  These are the guys who come up with the new paths to us being healthier and living longer.  The use of Big Data provides the researchers with the advantage of having a better understanding of patterns of data, including:

  • Patterns of symptoms that lead to the diagnosis of specific diseases and ailments.  Doctors may get these data points one at a time.  When the structured and unstructured data is pooled, researchers can find correlations, and thus provide better guidelines to the physicians who see the patients.
  • Patterns of cures that are emerging around specific treatments, and the ability to determine which treatments are most effective by looking at the data holistically.
  • Patterns of failure.  When the outcomes are less than desirable, what seems to be a common issue that can be identified and resolved?

Of course, the uses of big data technology are limitless, when considering the value of knowledge that can be derived from petabytes of data.  However, it’s one thing to have the data, and another to have access to it.

Data integration should always be systemic to all big data strategies, and the NIH clearly understands this to be the case.  Thus, they have funded data integration along with the expansion of their big data usage.

Most enterprises will follow much the same path in the next 2 to 5 years.  Information provides a strategic advantage to businesses.  In the case of the NIH, it’s information that can save lives.  Can’t get much more important than that.

Posted in Big Data, Cloud, Cloud Data Integration, Data Integration

Not Just For Play, Western Union Puts Hadoop to Work

Everyone’s talking about Hadoop for empowering analysts to quickly experiment, discover, and predict new insights.  But Hadoop isn’t just for play.  Leading enterprises like Western Union are putting Hadoop to work on their most mission-critical data pipelines.  Last week at Strata + Hadoop World, we had a chance to hear how Western Union uses Cloudera’s Hadoop-based enterprise data hub and Informatica to deliver faster, simpler, and cleaner data pipelines.

At Western Union, a multi-billion dollar global financial services and communications company, data is recognized as a core asset.  Like many other financial services firms, Western Union thrives on data, both for harvesting new business opportunities and for managing its internal operations.  And like many other enterprises, Western Union isn’t just ingesting data from relational data sources; it is also mining a number of new information-rich sources like clickstream data and log data.  With Western Union’s scale and speed demands, the data pipeline just has to work so they can optimize the customer experience across multiple channels (retail, online, mobile, etc.) and grow the business.

Let’s level-set on how important scale and speed are to Western Union.  Western Union processes more than 29 financial transactions every second.  Analytical performance simply can’t be the bottleneck for extracting insights from this blazing velocity of data.  So to maximize the performance of their data warehouse appliance, Western Union offloaded data quality and data integration workloads onto a Cloudera Hadoop cluster.  Using the Informatica Big Data Edition, Western Union capitalized on the performance and scalability of Hadoop while unleashing the productivity of their Informatica developers.

Informatica Big Data Edition enables data driven organizations to profile, parse, transform, and cleanse data on Hadoop with a simple visual development environment, prebuilt transformations, and reusable business rules.  So instead of hand coding one-off scripts, developers can easily create mappings without worrying about the underlying execution platform.  Raw data can be easily loaded into Hadoop using Informatica Data Replication and Informatica’s suite of PowerExchange connectors.  After the data is prepared, it can be loaded into a data warehouse appliance for supporting high performance analysis.  It’s a win-win solution for both data managers and data consumers.  Using Hadoop and Informatica, the right workloads are processed by the right platforms so that the right people get the right data at the right time.
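
For readers who want a feel for the offload pattern itself, here is a generic PySpark sketch of the kind of cleansing work that gets pushed down to a Hadoop cluster. To be clear, this is not Informatica Big Data Edition (which is a visual development environment, not hand-written code); the paths and column names are hypothetical.

```python
# Generic illustration of offloading data quality work to a Hadoop cluster.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-offload").getOrCreate()

# Hypothetical landing zone for raw clickstream events.
raw = spark.read.json("hdfs:///landing/clickstream/")

cleansed = (
    raw.dropDuplicates(["event_id"])                       # remove replayed events
       .filter(F.col("amount").isNotNull())                # basic validation
       .withColumn("country", F.upper(F.trim("country")))  # standardize codes
)

# Hand the prepared data back for high-performance analysis downstream,
# e.g. where the warehouse load job picks it up.
cleansed.write.mode("overwrite").parquet("hdfs:///curated/clickstream/")
```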

Using Informatica’s Big Data solutions, Western Union is transforming the economics of data delivery, enabling data consumers to create safer and more personalized experiences for Western Union’s customers.  Learn how the Informatica Big Data Edition can help put Hadoop to work for you.  And download a free trial to get started today!

Posted in Data Migration, Hadoop

Has Hadoop Crossed The Chasm? Thoughts About Strata 2014

Well, it’s been a little over a week since the Strata conference, so I thought I should give some perspective on what I learned.  I think it was summed up at my first meeting, on the first morning of the conference. The meeting was with a financial services company that has significant experience with Hadoop. The first words out of their mouths were, “Hadoop is hard.”

Later in the conference, after a Western Union representative spoke about their Hadoop deployment, they were mobbed by end user questions and comments. The audience was thrilled to hear about an actual operational deployment: Not just a sandbox deployment, but an actual operational Hadoop deployment from a company that is over 160 years old.

The market is crossing the chasm from early adopters who love to hand code (and the macho culture of proving they can do the hard stuff) to more mainstream companies that want to use technology to solve real problems. These mainstream companies aren’t afraid to admit that it is still hard. For the early adopters, nothing is ever hard. They love hard. But the mainstream market doesn’t view it that way.  They don’t want to mess around in the bowels of enabling technology.  They want to use the technology to solve real problems.  The comment from the financial services company represents the perspective of the vast majority of organizations. It is a sign Hadoop is hitting the mainstream market.

More proof we have moved to a new phase?  Cloudera announced they were going from shipping six versions a year down to just three.  I have been saying for a while that we will know that Hadoop is real when the distribution vendors stop shipping every 2 months and go to a more typical enterprise software release schedule.  It isn’t that Hadoop engineering efforts have slowed down.  It is still evolving very rapidly.  It is just that real customers are telling the Hadoop suppliers that they won’t upgrade as fast because they have real business projects running and they can’t do it.  So for those of you who are disappointed by the “slow down,” don’t be.  To me, this is news that Hadoop is reaching critical mass.

Technology is closing the gap to allow organizations to use Hadoop as a platform without having to actually have an army of Hadoop experts.  That is what Informatica does for data parsing, data integration,  data quality and data lineage (recent product announcement).  In fact, the number one demo at the Informatica booth at Strata was the demonstration of “end to end” data lineage for data, going from the original source all the way to how it was loaded and then transformed within Hadoop.  This is purely an enterprise-class capability that becomes more interesting and important when you actually go into true production.

Informatica’s goal is to hide the complexity of Hadoop so companies can get on with the work of using the platform with the skills they already have in house.  And from what I saw from all of the start-up companies that were doing similar things for data exploration and analytics and all the talk around the need for governance, we are finally hitting the early majority of the market.  So, for those of you who still drop down to the underlying UNIX OS that powers a Mac, the rest of us will keep using the GUI.   To the extent that there are “fit for purpose” GUIs on top of Hadoop, the technology will get used by a much larger market.

So congratulations Hadoop, you have officially crossed the chasm!

P.S. See me on theCUBE talking about a similar topic at: youtu.be/oC0_5u_0h2Q

Posted in Banking & Capital Markets, Big Data, Hadoop, Informatica Events

Fast and Fasterer: Screaming Streaming Data on Hadoop

Guest Post by Dale Kim

This is a guest blog post, written by Dale Kim, Director of Product Marketing at MapR Technologies.

Recently published research shows that “faster” is better than “slower.” The point, ladies and gentlemen, is that speed, for lack of a better word, is good. But granted, you won’t always have the need for speed. My Lamborghini is handy when I need to elude the Bakersfield fuzz on I-5, but it does nothing for my Costco trips. There, I go with capacity and haul home my 30-gallon tubs of ketchup with my Ford F150. (Note: this is a fictitious example, I don’t actually own an F150.)

But if speed is critical, like in your data streaming application, then Informatica Vibe Data Stream and the MapR Distribution including Apache™ Hadoop® are the technologies to use together. But since Vibe Data Stream works with any Hadoop distribution, my discussion here is more broadly applicable. I first discussed this topic earlier this year during my presentation at Informatica World 2014. In that talk, I also briefly described architectures that include streaming components, like the Lambda Architecture and enterprise data hubs. I recommend that any enterprise architect should become familiar with these high-level architectures.

Data streaming deals with a continuous flow of data, often at a fast rate. As you might’ve suspected by now, Vibe Data Stream, based on the Informatica Ultra Messaging technology, is great for that. With its roots in high speed trading in capital markets, Ultra Messaging quickly and reliably gets high value data from point A to point B. Vibe Data Stream adds management features to make it consumable by the rest of us, beyond stock trading. Not surprisingly, Vibe Data Stream can be used anywhere you need to quickly and reliably deliver data (just don’t use it for sharing your cat photos, please), and that’s what I discussed at Informatica World. Let me discuss two examples I gave.

Large Query Support. Let’s first look at “large queries.” I don’t mean the stuff you type on search engines, which are typically no more than 20 characters. I’m referring to an environment where the query is a huge block of data. For example, what if I have an image of an unidentified face, and I want to send it to a remote facial recognition service and immediately get the identity? The image would be the query, the facial recognition system could be run on Hadoop for fast divide-and-conquer processing, and the result would be the person’s name. There are many similar use cases that could leverage a high speed, reliable data delivery system along with a fast processing platform, to get immediate answers to a data-heavy question.

Data Warehouse Onload. For another example, we turn to our old friend the data warehouse. If you’ve been following all the industry talk about data warehouse optimization, you know pumping high speed data directly into your data warehouse is not an efficient use of your high value system. So instead, pipe your fast data streams into Hadoop, run some complex aggregations, then load that processed data into your warehouse. And you might consider freeing up large processing jobs from your data warehouse onto Hadoop. As you process and aggregate that data, you create a data flow cycle where you return enriched data back to the warehouse. This gives your end users efficient analysis on comprehensive data sets.
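
A bare-bones PySpark sketch of that onload cycle is below: aggregate the landed stream on Hadoop, then ship only the much smaller rollups to the warehouse over JDBC. The paths, columns and connection details are placeholders, not a reference architecture.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dw-onload").getOrCreate()

# Hypothetical location where the fast stream has been landed on Hadoop;
# event_time is assumed to be a timestamp column.
events = spark.read.parquet("hdfs:///streams/sensor_events/")

hourly = (
    events.groupBy(F.window("event_time", "1 hour"), "device_id")
          .agg(F.count("*").alias("events"),
               F.avg("reading").alias("avg_reading"))
)

# Load only the aggregates into the warehouse over JDBC
# (assumes the Postgres JDBC driver is on the Spark classpath).
(hourly.write.format("jdbc")
       .option("url", "jdbc:postgresql://warehouse:5432/analytics")
       .option("dbtable", "hourly_device_rollup")
       .option("user", "loader").option("password", "***")
       .mode("append").save())
```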

Hopefully this stirs up ideas on how you might deploy high speed streaming in your enterprise architecture. Expect to see many new stories of interesting streaming applications in the coming months and years, especially with the anticipated proliferation of internet-of-things and sensor data.

To learn more about Vibe Data Stream, you can find it on the Informatica Marketplace.

Posted in Big Data, Business Impact / Benefits, Data Services, Hadoop

Ebola: Why Big Data Matters

The Ebola virus outbreak in West Africa has now claimed more than 4,000 lives and has reached the United States. While emergency response teams, hospitals, charities, and non-governmental organizations struggle to contain the virus, could big data analytics help?

A growing number of Data Scientists believe so.

Recall the cholera outbreak in Haiti in 2010, after the tragic earthquake: a joint research team from the Karolinska Institute in Sweden and Columbia University in the US analyzed calling data from two million mobile phones on the Digicel Haiti network. This enabled the United Nations and other humanitarian agencies to understand population movements during the relief operations and during the subsequent cholera outbreak, so they could allocate resources more efficiently and identify areas at increased risk of new cholera outbreaks.

Mobile phones are widely owned even in the poorest countries in Africa, and they are a rich source of data in regions where other reliable sources are sorely lacking. Senegal’s Orange Telecom provided Flowminder, a Swedish non-profit organization, with anonymized voice and text data from 150,000 mobile phones. Using this data, Flowminder drew up detailed maps of typical population movements in the region.

Today, authorities use this information to evaluate the best places to set up treatment centers, check-posts, and issue travel advisories in an attempt to contain the spread of the disease.
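
For a feel of how such maps are built, here is a minimal pandas sketch that turns anonymized call-detail records into a per-region head count by day, in the spirit of the Flowminder work described above. The subscribers, days and tower regions are invented for illustration.

```python
import pandas as pd

# Toy anonymized call-detail records (one row per call).
cdr = pd.DataFrame({
    "subscriber":   ["a1", "a1", "b2", "b2", "c3", "c3"],
    "day":          ["d1", "d2", "d1", "d2", "d1", "d2"],
    "tower_region": ["Dakar", "Thies", "Dakar", "Dakar", "Thies", "Dakar"],
})

# Where was each (anonymized) subscriber seen most often each day?
daily_location = (cdr.groupby(["day", "subscriber"], as_index=False)["tower_region"]
                     .agg(lambda s: s.mode().iat[0]))

# Head-count per region per day: the raw material for a heat map and for
# spotting day-over-day population movement.
heat_map = pd.crosstab(daily_location["day"], daily_location["tower_region"])
print(heat_map)
```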

The first drawback is that this data is historic. Authorities really need to be able to map movements in real time, especially since people’s movements tend to change during an epidemic.

The second drawback is that the scope of the data provided by Orange Telecom is limited to a small region of West Africa.

Here is my recommendation to the Centers for Disease Control and Prevention (CDC):

  1. Increase the area for data collection to the entire region of West Africa, which covers over 2.1 million cell-phone subscribers.
  2. Collect mobile phone mast activity data to pinpoint where calls to helplines are mostly coming from, to draw population heat maps, and to track population movement. A sharp increase in calls to a helpline is usually an early indicator of an outbreak (see the sketch after this list).
  3. Overlay this data on census data to build up a richer picture.
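
Here is the sketch referenced in point 2: a few lines of pandas that flag areas where helpline call volume spikes against the recent baseline. The figures are invented and the threshold is arbitrary; it only illustrates the early-warning idea.

```python
import pandas as pd

# Toy daily helpline call counts per cell-tower area.
calls = pd.DataFrame({
    "area":           ["A", "A", "A", "A", "B", "B", "B", "B"],
    "day":            [1, 2, 3, 4, 1, 2, 3, 4],
    "helpline_calls": [12, 11, 13, 14, 10, 12, 11, 43],
})

def flag_spikes(df, factor=2.0):
    """Flag the latest day if calls exceed `factor` x the prior-day average."""
    flags = {}
    for area, grp in df.sort_values("day").groupby("area"):
        baseline = grp["helpline_calls"].iloc[:-1].mean()
        latest = grp["helpline_calls"].iloc[-1]
        flags[area] = bool(latest > factor * baseline)
    return flags

print(flag_spikes(calls))  # {'A': False, 'B': True} -> area B needs attention
```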

The most positive impact we can have is to help emergency relief organizations and governments anticipate how a disease is likely to spread. Until now, they have had to rely on anecdotal information, on-the-ground surveys, and police and hospital reports.

Posted in B2B Data Exchange, Big Data, Business Impact / Benefits, Business/IT Collaboration, Data Governance, Data Integration

Informatica’s Hadoop Connectivity Reaches for the Clouds

The Informatica Cloud team has been busy updating connectivity to Hadoop using the Cloud Connector SDK.  Updated connectors are available now for Cloudera and Hortonworks, and new connectivity has been added for MapR, Pivotal HD and Amazon EMR (Elastic MapReduce).

Informatica Cloud’s Hadoop connectivity brings a new level of ease of use to Hadoop data loading and integration.  Informatica Cloud provides a quick way to load data from popular on premise data sources and apps such as SAP and Oracle E-Business, as well as SaaS apps, such as Salesforce.com, NetSuite, and Workday, into Hadoop clusters for pilots and POCs.  Less technical users are empowered to contribute to enterprise data lakes through the easy-to-use Informatica Cloud web user interface.

Informatica Cloud’s rich connectivity to a multitude of SaaS apps can now be leveraged with Hadoop.  Data from SaaS apps for CRM, ERP and other lines of business are becoming increasingly important to enterprises. Bringing this data into Hadoop for analytics is now easier than ever.

Users of Amazon Web Services (AWS) can leverage Informatica Cloud to load data from SaaS apps and on premise sources into EMR directly.  Combined with connectivity to Amazon Redshift, Informatica Cloud can be used to move data into EMR for processing and then onto Redshift for analytics.
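
As a rough illustration of that last hop, and using plain psycopg2 and SQL rather than Informatica Cloud, the sketch below loads EMR output that has landed in S3 into Redshift with a COPY command. The cluster endpoint, bucket, IAM role and table names are placeholders.

```python
import psycopg2

# Redshift pulls Parquet files written by EMR directly from S3.
copy_sql = """
    COPY analytics.daily_orders
    FROM 's3://example-bucket/emr-output/daily_orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
"""

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="loader", password="***",
)
with conn, conn.cursor() as cur:   # commits on success
    cur.execute(copy_sql)
conn.close()
```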

Self service data loading and basic integration can be done by less technical users through Informatica Cloud’s drag and drop web-based user interface.  This enables more of the team to contribute to and collaborate on data lakes without having to learn Hadoop.

Bringing the cloud and Big Data together to put the potential of data to work – that’s the power of Informatica in action.

Free trials of the Informatica Cloud Connector for Hadoop are available here: http://www.informaticacloud.com/connectivity/hadoop-connector.html

Posted in Big Data, Data Services, Hadoop, SaaS