Tag Archives: Big Data
Recently published research shows that “faster” is better than “slower.” The point, ladies and gentlemen, is that speed, for lack of a better word, is good. But granted, you won’t always have the need for speed. My Lamborghini is handy when I need to elude the Bakersfield fuzz on I-5, but it does nothing for my Costco trips. There, I go with capacity and haul home my 30-gallon tubs of ketchup in my Ford F150. (Note: this is a fictitious example; I don’t actually own an F150.)
But if speed is critical, like in your data streaming application, then Informatica Vibe Data Stream and the MapR Distribution including Apache™ Hadoop® are the technologies to use together. But since Vibe Data Stream works with any Hadoop distribution, my discussion here is more broadly applicable. I first discussed this topic earlier this year during my presentation at Informatica World 2014. In that talk, I also briefly described architectures that include streaming components, like the Lambda Architecture and enterprise data hubs. I recommend that any enterprise architect should become familiar with these high-level architectures.
Data streaming deals with a continuous flow of data, often at a fast rate. As you might’ve suspected by now, Vibe Data Stream, based on the Informatica Ultra Messaging technology, is great for that. With its roots in high speed trading in capital markets, Ultra Messaging quickly and reliably gets high value data from point A to point B. Vibe Data Stream adds management features to make it consumable by the rest of us, beyond stock trading. Not surprisingly, Vibe Data Stream can be used anywhere you need to quickly and reliably deliver data (just don’t use it for sharing your cat photos, please), and that’s what I discussed at Informatica World. Let me discuss two examples I gave.
Large Query Support. Let’s first look at “large queries.” I don’t mean the stuff you type on search engines, which are typically no more than 20 characters. I’m referring to an environment where the query is a huge block of data. For example, what if I have an image of an unidentified face, and I want to send it to a remote facial recognition service and immediately get the identity? The image would be the query, the facial recognition system could be run on Hadoop for fast divide-and-conquer processing, and the result would be the person’s name. There are many similar use cases that could leverage a high speed, reliable data delivery system along with a fast processing platform, to get immediate answers to a data-heavy question.
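To make the “large query” idea concrete, here is a minimal Python sketch of one part of the problem: framing a large binary payload (such as a face image) into sequenced chunks so it can be delivered reliably and reassembled even if chunks arrive out of order. The chunk size and tuple framing are illustrative assumptions of mine, not Vibe Data Stream’s actual wire protocol.

```python
# A hypothetical framing for a "large query": split a big binary payload
# into sequenced chunks, deliver them (possibly out of order), and
# reassemble them on the receiving side. Illustrative only.

CHUNK_SIZE = 64 * 1024  # 64 KB per chunk (an assumption for illustration)

def to_chunks(payload: bytes, query_id: str):
    """Frame a payload as (query_id, sequence_no, total_chunks, data) tuples."""
    chunks = [payload[i:i + CHUNK_SIZE] for i in range(0, len(payload), CHUNK_SIZE)]
    total = len(chunks)
    return [(query_id, seq, total, data) for seq, data in enumerate(chunks)]

def reassemble(frames):
    """Rebuild the original payload, tolerating out-of-order delivery."""
    ordered = sorted(frames, key=lambda f: f[1])  # sort by sequence number
    total = ordered[0][2]
    if len(ordered) != total:
        raise ValueError("missing chunks")
    return b"".join(f[3] for f in ordered)

# A 200 KB stand-in for an image "query", delivered in reverse order:
image = b"\x7f" * 200_000
frames = to_chunks(image, "query-001")
restored = reassemble(frames[::-1])
```

In a real deployment the messaging layer handles the delivery guarantees; the point here is only that a “query” can be an arbitrarily large block of data rather than a short text string.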
Data Warehouse Onload. For another example, we turn to our old friend the data warehouse. If you’ve been following all the industry talk about data warehouse optimization, you know pumping high speed data directly into your data warehouse is not an efficient use of your high value system. So instead, pipe your fast data streams into Hadoop, run some complex aggregations, then load that processed data into your warehouse. And you might consider freeing up large processing jobs from your data warehouse onto Hadoop. As you process and aggregate that data, you create a data flow cycle where you return enriched data back to the warehouse. This gives your end users efficient analysis on comprehensive data sets.
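The onload pattern above can be sketched in a few lines: raw stream events are rolled up into compact hourly summaries, and only those summaries get loaded into the warehouse. The event fields and the hourly grain are assumptions for illustration; in practice the aggregation would run as a Hadoop job over the streamed data.

```python
from collections import defaultdict

# Sketch of "aggregate fast data in Hadoop, load summaries to the warehouse".
# Event shape (epoch-second timestamp, metric name, value) is illustrative.

def aggregate_hourly(events):
    """Roll raw stream events up to (hour, metric) totals -- the kind of
    compact summary worth loading into a data warehouse."""
    totals = defaultdict(float)
    for e in events:
        hour = e["ts"] - (e["ts"] % 3600)  # truncate epoch seconds to the hour
        totals[(hour, e["metric"])] += e["value"]
    # Warehouse-ready rows: one per hour/metric instead of one per raw event.
    return [{"hour": h, "metric": m, "total": v}
            for (h, m), v in sorted(totals.items())]

events = [
    {"ts": 1000, "metric": "sensor_a", "value": 2.0},
    {"ts": 1500, "metric": "sensor_a", "value": 3.0},
    {"ts": 4000, "metric": "sensor_a", "value": 1.0},
]
rows = aggregate_hourly(events)
# Two summary rows (hours 0 and 3600) replace three raw events.
```

The design point is simply that the warehouse receives the enriched, aggregated result, while the raw high-speed stream lands on the cheaper Hadoop tier.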
Hopefully this stirs up ideas on how you might deploy high speed streaming in your enterprise architecture. Expect to see many new stories of interesting streaming applications in the coming months and years, especially with the anticipated proliferation of internet-of-things and sensor data.
To learn more about Vibe Data Stream, you can find it on the Informatica Marketplace.
A growing number of Data Scientists believe so.
Recall the cholera outbreak in Haiti in 2010, after the tragic earthquake: a joint research team from the Karolinska Institute in Sweden and Columbia University in the US analyzed calling data from two million mobile phones on the Digicel Haiti network. This enabled the United Nations and other humanitarian agencies to understand population movements during the relief operations and the subsequent cholera outbreak. They could allocate resources more efficiently and identify areas at increased risk of new cholera outbreaks.
Mobile phones are widely owned, even in the poorest countries in Africa, and they are a rich source of data in regions where other reliable sources are sorely lacking. Senegal’s Orange Telecom provided Flowminder, a Swedish non-profit organization, with anonymized voice and text data from 150,000 mobile phones. Using this data, Flowminder drew up detailed maps of typical population movements in the region.
Today, authorities use this information to evaluate the best places to set up treatment centers, check-posts, and issue travel advisories in an attempt to contain the spread of the disease.
The first drawback is that this data is historic. Authorities really need to be able to map movements in real time, especially since people’s movements tend to change during an epidemic.
The second drawback is that the scope of the data provided by Orange Telecom is limited to a small region of West Africa.
Here is my recommendation to the Centers for Disease Control and Prevention (CDC):
- Increase the area for data collection to the entire region of Western Africa, which covers over 2.1 million cell-phone subscribers.
- Collect mobile phone mast activity data to pinpoint where calls to helplines are mostly coming from, draw population heat maps, and track population movement. A sharp increase in calls to a helpline is usually an early indicator of an outbreak.
- Overlay this data on census data to build up a richer picture.
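The early-warning signal in the second recommendation can be sketched very simply: count helpline calls per cell tower and flag towers whose volume spikes well above a historical baseline. The threshold factor, tower names, and record shape are assumptions for illustration.

```python
from collections import Counter

# Sketch: flag cell towers whose helpline call volume spikes above a
# baseline -- a crude stand-in for the outbreak early-warning idea above.

def flag_spikes(call_records, baseline, factor=3):
    """Return towers whose helpline call count exceeds factor x baseline."""
    counts = Counter(r["tower"] for r in call_records)
    return {t: c for t, c in counts.items() if c > factor * baseline.get(t, 0)}

baseline = {"T1": 10, "T2": 10}  # typical daily call counts (invented)
calls = [{"tower": "T1"}] * 12 + [{"tower": "T2"}] * 45
spikes = flag_spikes(calls, baseline)
# T2 (45 calls vs. a baseline of 10) is flagged; T1 (12 calls) is not.
```

Real systems would of course use proper anomaly detection over time windows, but even this coarse per-tower count, overlaid on census data, points responders at the right geography.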
The most positive impact we can have is to help emergency relief organizations and governments anticipate how a disease is likely to spread. Until now, they had to rely on anecdotal information, on-the-ground surveys, police, and hospital reports.
The Informatica Cloud team has been busy updating connectivity to Hadoop using the Cloud Connector SDK. Updated connectors are available now for Cloudera and Hortonworks and new connectivity has been added for MapR, Pivotal HD and Amazon EMR (Elastic Map Reduce).
Informatica Cloud’s Hadoop connectivity brings a new level of ease of use to Hadoop data loading and integration. Informatica Cloud provides a quick way to load data from popular on premise data sources and apps such as SAP and Oracle E-Business, as well as SaaS apps, such as Salesforce.com, NetSuite, and Workday, into Hadoop clusters for pilots and POCs. Less technical users are empowered to contribute to enterprise data lakes through the easy-to-use Informatica Cloud web user interface.
Informatica Cloud’s rich connectivity to a multitude of SaaS apps can now be leveraged with Hadoop. Data from SaaS apps for CRM, ERP and other lines of business are becoming increasingly important to enterprises. Bringing this data into Hadoop for analytics is now easier than ever.
Users of Amazon Web Services (AWS) can leverage Informatica Cloud to load data from SaaS apps and on premise sources into EMR directly. Combined with connectivity to Amazon Redshift, Informatica Cloud can be used to move data into EMR for processing and then onto Redshift for analytics.
Self service data loading and basic integration can be done by less technical users through Informatica Cloud’s drag and drop web-based user interface. This enables more of the team to contribute to and collaborate on data lakes without having to learn Hadoop.
Bringing the cloud and Big Data together to put the potential of data to work – that’s the power of Informatica in action.
Free trials of the Informatica Cloud Connector for Hadoop are available here: http://www.informaticacloud.com/connectivity/hadoop-connector.html
Every two years, the typical company doubles the amount of data it stores. However, this data is inherently “dumb.” Acquiring more of it only seems to compound its lack of intellect.
When revitalizing your business, I won’t ask to look at your data – not even a little bit. Instead, I look at the process of how you use the data. What I want to know is this:
How much of your day-to-day operations are driven by your data?
The Case for Smart Data
I recently learned that 7-Eleven Japan has pushed decision-making down to the store level – in fact, to the level of clerks. Store clerks decide what goes on the shelves in their individual 7-Eleven stores. These clerks push incredible inventory turns. Some 70% of the products on the shelves are new to stores each year. As a result, this chain has been the most profitable Japanese retailer for 30 years running.
Instead of just reading the data and making wild guesses about why something works and why something doesn’t, these clerks acquired the skill of looking at the quantitative and the qualitative and connecting the dots. Data tells them what people are talking about, how it relates to their product, and how much weight it carries. You can achieve this as well. To do so, you must introduce a culture that emphasizes discipline around processes. A disciplined process culture uses:
- A template approach to data with common processes, reuse of components, and a single face presented to customers
- Employees who consistently follow standard procedures
If you cannot develop such company-wide consistency, you will not gain the benefits of ERP or CRM systems.
Make data available to the masses. Like at 7-Eleven Japan, don’t centralize the data decision-making process. Instead, push it out to the ranks. By putting these cultures and practices into play, businesses can use data to run smarter.
A few months ago, while addressing a room full of IT and business professionals at an Information Governance conference, a CFO said, “…if we designed our systems today from scratch, they would look nothing like the environment we own.” He went on to elaborate that they arrived there by layering thousands of good and valid decisions on top of one another.
Similarly, Information Governance has evolved out of the good work done by those who preceded us, into something only a few could have envisioned at the time. Along the way, technology evolved and changed the way we interact with data to manage our daily tasks. What started as good engineering practices for mainframes gave way to data management.
Then, with technological advances, we encountered new problems, introduced new tasks and disciplines, and created Information Governance in the process. We were standing on the shoulders of data management, armed with new solutions to new problems. Now we face the four Vs of big data, and each of those new data system characteristics has introduced a new set of challenges, driving the need for Big Data Information Governance as a response to changing velocity, volume, veracity, and variety.
Before I answer this question, I must ask you “How comprehensive is the framework you are using today and how well does it scale to address the new challenges?”
There are several frameworks in the marketplace to choose from. In this blog, I will tell you what questions you need to ask yourself before replacing your old framework with a new one:
Q. Is it nimble?
Your data governance practices must allow for nimble responses to changes in technology, customer needs, and internal processes. The organization must be able to respond to emergent technology.
Q. Will it enable you to apply policies and regulations to data brought into the organization by a person or process?
- Public company: Meet the obligation to protect the investment of the shareholders and manage risk while creating value.
- Private company: Meet privacy laws even if financial regulations are not applicable.
- Fulfill the obligations of external regulations from international, national, regional, and local governments.
Q. How does it manage quality?
For big data, the data must be fit for purpose; context might need to be hypothesized for evaluation. Quality does not imply cleansing activities, which might mask the results.
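As a hedged illustration of assessing fitness for purpose without cleansing, the sketch below profiles completeness per field and simply reports the result; no values are altered or masked. The record and field names are invented for the example.

```python
# Sketch: quality assessment by profiling, not cleansing. Completeness per
# field is measured and reported; the records themselves stay untouched, so
# nothing is masked from downstream evaluation.

def profile_completeness(records, fields):
    """Fraction of records with a non-null, non-empty value, per field."""
    n = len(records)
    return {f: sum(1 for r in records if r.get(f) not in (None, "")) / n
            for f in fields}

records = [
    {"id": 1, "email": "a@x.com", "region": "EU"},
    {"id": 2, "email": None,      "region": "US"},
    {"id": 3, "email": "c@x.com", "region": ""},
]
report = profile_completeness(records, ["id", "email", "region"])
# e.g. email is 2/3 complete; a consumer can now judge fitness for purpose.
```

A consumer with a hypothesized context can then decide whether two-thirds completeness on a field is acceptable for their analysis, rather than having the gap silently “fixed” upstream.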
Q. Does it help you understand your complete business and information flow?
Attribution and lineage are very important in big data. Knowing the source and the destination of the data is crucial to validating analytics results as fit for purpose.
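A minimal lineage ledger makes the point concrete: every derived data set records its direct inputs, so an analytics result can be traced back to its original sources. The dict-based store and the data set names are assumptions for illustration; real lineage tools persist and visualize this graph.

```python
# Sketch: a minimal attribution/lineage graph. Each derived data set
# registers its direct inputs; trace_sources walks back to the originals.

lineage = {}  # data set name -> list of direct input names

def register(name, inputs):
    lineage[name] = list(inputs)

def trace_sources(name):
    """Walk the lineage graph back to original (input-less) sources."""
    inputs = lineage.get(name, [])
    if not inputs:
        return {name}
    sources = set()
    for i in inputs:
        sources |= trace_sources(i)
    return sources

register("raw_clicks", [])
register("raw_crm", [])
register("joined", ["raw_clicks", "raw_crm"])
register("churn_scores", ["joined"])
# churn_scores traces back to both raw feeds.
```

With even this much bookkeeping, a suspect analytics result can be challenged by asking exactly which raw feeds produced it.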
Q. How does it help you understand the language that you use, and can the framework actively manage that language to reduce ambiguity, redundancy, and inconsistency?
Big data might not have a logical data model, so any structured data should be mapped to the enterprise model. Big data still has context and thus modeling becomes increasingly important to creating knowledge and understanding. The definitions evolve over time and the enterprise must plan to manage the shifting meaning.
Q. Does it manage classification?
It is critical for the business/steward to classify the overall source and its contents as soon as the data is brought in by its owner, in support of information lifecycle management, access control, and regulatory compliance.
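One way to picture classification at ingest is a tag, attached by the steward, that downstream lifecycle, access-control, and compliance checks can all read. The classification levels and the policy table below are invented for illustration.

```python
# Sketch: tag a source at ingest with classification metadata that drives
# retention and access decisions downstream. Levels and policy are assumptions.

POLICY = {  # classification -> (retention_years, restricted_access)
    "public":       (1, False),
    "internal":     (3, False),
    "confidential": (7, True),
}

def ingest(source_name, owner, classification):
    """Attach classification-driven metadata to a newly ingested source."""
    if classification not in POLICY:
        raise ValueError(f"unknown classification: {classification}")
    retention, restricted = POLICY[classification]
    return {"source": source_name, "owner": owner,
            "classification": classification,
            "retention_years": retention, "restricted": restricted}

tag = ingest("clickstream_feed", "marketing", "internal")
# The tag travels with the source; lifecycle and access tooling read it.
```

Forcing the classification decision at ingest, rather than later, is what makes the downstream controls enforceable.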
Q. How does it protect data quality and access?
Your information protection must not be compromised for the sake of expediency, convenience, or deadlines. Protect not just what you bring in, but what you join/link it to, and what you derive. Your customers will fault you for failing to protect them from malicious links. The enterprise must formulate the strategy to deal with more data, longer retention periods, more data subject to experimentation, and less process around it, all while trying to derive more value over longer periods.
Q. Does it foster stewardship?
Ensuring the appropriate use and reuse of data requires the action of an employee; this role cannot be automated. It requires the active involvement of a member of the business organization to serve as the steward over the data element or source.
Q. Does it manage long-term requirements?
Policies and standards are the mechanism by which management communicates their long-range business requirements. They are essential to an effective governance program.
Q. How does it manage feedback?
As a companion to policies and standards, an escalation and exception process enables communication throughout the organization when policies and standards conflict with new business requirements. It forms the core process to drive improvements to the policy and standard documents.
Q. Does it foster innovation?
Governance must not squelch innovation. Governance can and should make accommodations for new ideas and growth. This is managed through management of the infrastructure environments as part of the architecture.
Q. How does it control third-party content?
Third-party data plays an expanding role in big data. There are three types of third-party content, and governance controls must be adequate for the circumstances. They must consider applicable regulations for the geographic regions in which you operate; therefore, you must understand and manage those obligations.
Today, 80% of the effort in Big Data projects is related to extracting, transforming, and loading data (ETL). Hortonworks and Informatica have teamed up to leverage the power of Informatica Big Data Edition, enabling organizations to use their existing skills to improve the efficiency of these operations and better leverage their resources in a modern data architecture (MDA).
Next Generation Data Management
The Hortonworks Data Platform and Informatica BDE enable organizations to optimize their ETL workloads with long-term storage and processing at scale in Apache Hadoop. With Hortonworks and Informatica, you can:
• Leverage all internal and external data to achieve the full predictive power that drives the success of modern data-driven businesses.
• Optimize the entire big data supply chain on Hadoop, turning data into actionable information to drive business value.
Imagine a world where you have access to your most strategic data in a timely fashion, no matter how old the data is, where it is stored, or in what format. By leveraging Hadoop’s power of distributed processing, organizations can lower the costs of data storage and processing and support large-scale data distribution with high throughput and concurrency.
Overall, the alignment between business and IT grows. The Big Data solution based on Informatica and Hortonworks allows for a complete data pipeline to ingest, parse, integrate, cleanse, and prepare data for analysis natively on Hadoop, thereby increasing developer productivity by 5x over hand-coding.
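The ingest/parse/cleanse/prepare stages can be pictured as composed functions. In the actual products these run as generated jobs on Hadoop; the field names, CSV format, and rules below are purely illustrative.

```python
# Sketch of a parse -> cleanse -> prepare pipeline as composed functions.
# Illustrative only; the real pipeline runs as generated jobs on Hadoop.

def parse(lines):
    """Turn raw 'id,amount' CSV lines into records."""
    return [dict(zip(("id", "amount"), ln.split(","))) for ln in lines]

def cleanse(rows):
    """Drop rows whose amount is not numeric."""
    out = []
    for r in rows:
        try:
            r["amount"] = float(r["amount"])
            out.append(r)
        except ValueError:
            pass
    return out

def prepare(rows):
    """Aggregate to one total per id, ready for analysis."""
    totals = {}
    for r in rows:
        totals[r["id"]] = totals.get(r["id"], 0.0) + r["amount"]
    return totals

raw = ["a,10.5", "b,oops", "a,4.5"]
result = prepare(cleanse(parse(raw)))
# The malformed "b,oops" row is dropped; a's two rows are aggregated.
```

The productivity claim in the text is about generating and managing pipelines like this visually rather than hand-coding each stage.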
Where Do We Go From Here?
At the end of the day, Big Data is not about the technology. It is about the deep business and social transformation every organization will go through. The possibilities to make more informed decisions, identify patterns, proactively address fraud and threats, and predict pretty much anything are endless.
This transformation will happen as the technology is adopted and leveraged by more and more business users. We are already seeing the transition from 20-node clusters to 100-node clusters and from a handful of technology-savvy users relying on Hadoop to hundreds of business users. Informatica and Hortonworks are accelerating the delivery of actionable Big Data insights to business users by automating the entire data pipeline.
Try It For Yourself
On September 10, 2014, Informatica announced a 60-day trial version of the Informatica Big Data Edition in the Hortonworks Sandbox. This free trial enables you to download and test the Big Data Edition on your notebook or a spare computer and experience your own personal Modern Data Architecture (MDA).
If you happen to be at Strata this October 2014, please meet us at our booths: Informatica #352 and Hortonworks #117. Don’t forget to participate in our Passport Program and join our session at 5:45 pm ET on Thursday, October 16, 2014.
“Victory won’t go to those with the most data. It will go to those who make the best use of data.” – Doug Henschen, Information Week, May 2014
But how do you actually make best use of your data and become one of the data success stories? If you are going to differentiate on data, you need to use your data to innovate. Common options include:
- New products & services which leverage a rich data set
- Different ways to sell & market existing products and services based on detailed knowledge
But there is no ‘app for that’. Think about it – if you can buy an application, you are already too late. Somebody else has identified a need and created a product they expect to sell repeatedly. Applications cannot provide you a competitive advantage if everyone has one. Most people agree they will not rise to the top simply because they have installed ERP, CRM, SRM, etc. So it will be with any application that claims to win you market share and profits based on data. If you want to differentiate, you need to stay ahead of the application curve and let your internal innovation drive you forward.
Simplistically, this is a 4-step process:
- Assemble a team of innovative employees, match them with skilled data scientists
- Identify data-based differentiation opportunities
- Feed the team high quality data at the rate at which they need it
- Provide them tools for data analysis and integrating data into business processes as required
Leaving aside the simplicity of these steps – there is one key change from a ‘normal’ IT project. Normally, data provisioning is an afterthought during IT projects. Now it must take priority. Frequently, data integration is poorly executed and barely documented, and data quality is rarely considered during projects. Poor data provisioning is a direct cause of the spaghetti charts which contribute to organisational inflexibility and poor data availability to the business. Does “it will take 6 months to make those changes” sound familiar?
We have been told Big Data will change our world; Data is a raw material; Data is the new oil.
The business world is changing. We are moving into a world where our data is one of our most valuable resources, especially when coupled with our internal innovation. Applications used to differentiate us; now they are becoming commodities to be replaced and upgraded, or acquired anew as rapidly as our business changes.
I believe that in order to differentiate on data, an organisation needs to treat data as the valuable resource we all say it is. Data Agility, Management, and Governance are the true differentiators of our era. This is a frustration for those trying to innovate but locked in an inflexible data world, built at a time when people still expected ERP to be the answer to everything.
To paraphrase a recent complaint I heard: “My applications should be like my phone. I buy a new one, turn it on and it already has all my data”.
This is the exact vision that is driving Informatica’s Intelligent Data Platform.
In the end, differentiating on data comes down to one key necessity: High quality data MUST be available to all who need it, when they need it.
Gartner’s official definition of Information Governance is “…the specification of decision rights and an accountability framework to encourage desirable behavior in the valuation, creation, storage, use, archival and deletion of information. It includes the processes, roles, standards, and metrics that ensure the effective and efficient use of information in enabling a business to achieve its goals.” It therefore looks to address important considerations that key stakeholders within an enterprise face.
A CIO of a large European bank once asked me – “How long do we need to keep information?”
Keeping Information Governance relevant
This bank had to govern, index, search, and provide content to auditors to show that it was managing data appropriately to meet Dodd-Frank regulations. In the past, this information was retrieved from a database or email. Now, however, the bank was required to produce voice recordings from phone conversations with customers, show the relevant Reuters feeds coming in, and document all appropriate IMs and social media interactions between employees.
All of these were systems the business had never considered before. These environments continued to capture and create data, and with it came complex challenges: islands of information that seemingly have nothing to do with each other, yet impact how the bank governs itself and how it saves the records associated with trading or financial information.
Coping with the sheer growth is one issue; what to keep and what to delete is another. There is also the issue of what to do with all the data once you have it. The data is potentially a gold mine for the business, but most businesses just store it and forget about it.
Legislation, in tandem, is becoming more rigorous and there are potentially thousands of pieces of regulation relevant to multinational companies. Businesses operating in the EU, in particular, are affected by increasing regulation. There are a number of different regulations, including Solvency II, Dodd-Frank, HIPAA, Gramm-Leach-Bliley Act (GLBA), Basel III and new tax laws. In addition, companies face the expansion of state-regulated privacy initiatives and new rules relating to disaster recovery, transportation security, value chain transparency, consumer privacy, money laundering, and information security.
Regardless, an enterprise should consider the following 3 core elements before developing and implementing a policy framework.
Whatever your size or type of business, there are several key processes you must undertake in order to create an effective information governance program. As a Business Transformation Architect, I can see 3 foundation stones of an effective Information Governance Program:
Assess Your Business Maturity
Understanding the full scope of requirements on your business is a heavy task. Assess whether your business is mature enough to embrace information governance. Many businesses in EMEA do not have an information governance team in place; instead, key stakeholders with responsibility for information assets are spread across their legal, security, and IT teams.
Undertake a Regulatory Compliance Review
Understanding the legal obligations of your business is critical to shaping an information governance program. Every business is subject to numerous compliance regimes managed by multiple regulatory agencies, which can differ across markets. Many compliance requirements depend on the number of employees and/or turnover reaching certain limits. For example, certain records may need to be stored for 6 years in Poland, yet the same records may need to be stored for only 3 years in France.
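Following the Poland/France example above, a jurisdiction-aware retention lookup is one simple way to encode such rules. The record type, the fallback period, and the mapping itself are illustrative assumptions, not legal advice; actual periods must come from your regulatory compliance review.

```python
# Sketch: a jurisdiction-aware retention lookup. The same record type can
# carry different retention periods per country. All values here are
# invented for illustration -- real periods come from legal review.

RETENTION_YEARS = {
    ("payroll", "PL"): 6,   # e.g. 6 years in Poland (hypothetical record type)
    ("payroll", "FR"): 3,   # e.g. 3 years in France
}
DEFAULT_YEARS = 7  # conservative fallback (an assumption)

def retention_for(record_type, country):
    """Resolve the retention period for a record type in a given country."""
    return RETENTION_YEARS.get((record_type, country), DEFAULT_YEARS)

# The same record type resolves differently per jurisdiction:
pl = retention_for("payroll", "PL")
fr = retention_for("payroll", "FR")
```

Centralizing the table like this also gives the governance team a single artifact to review when regulations change.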
Establish an Information Governance Team
It is important that a core team be assigned responsibility for the implementation and success of the information governance program. This steering group and a nominated information governance lead can then drive forward operational and practical issues, including: agreeing on and developing a work program, developing policy and strategy, and planning communication and awareness.