Category Archives: Big Data
A few months ago, while addressing a room full of IT and business professionals at an Information Governance conference, a CFO said, “… if we designed our systems today from scratch, they will look nothing like the environment we own.” He went on to elaborate that they arrived there by layering thousands of good and valid decisions on top of one another.
Similarly, Information Governance has also evolved out of the good work that was done by those who preceded us. These items evolve into something that only a few can envision today. Along the way, technology evolved and changed the way we interact with data to manage our daily tasks. What started as good engineering practices for mainframes gave way to data management.
Then, with technological advances, we encountered new problems, introduced new tasks and disciplines, and created Information Governance in the process. We were standing on the shoulders of data management, armed with new solutions to new problems. Now we face the four Vs of big data, and each of those new data system characteristics has introduced a new set of challenges, driving the need for Big Data Information Governance as a response to changing velocity, volume, veracity, and variety.
Before I answer this question, I must ask you “How comprehensive is the framework you are using today and how well does it scale to address the new challenges?”
There are several frameworks in the marketplace to choose from. In this blog, I will tell you what questions you need to ask yourself before replacing your old framework with a new one:
Q. Is it nimble?
The focus of data governance practices must allow for nimble responses to changes in technology, customer needs, and internal processes. The organization must be able to respond to emergent technology.
Q. Will it enable you to apply policies and regulations to data brought into the organization by a person or process?
- Public company: Meet the obligation to protect the investment of the shareholders and manage risk while creating value.
- Private company: Meet privacy laws even if financial regulations are not applicable.
- Fulfill the obligations of external regulations from international, national, regional, and local governments.
Q. How does it manage quality?
For big data, the data must be fit for purpose; context might need to be hypothesized for evaluation. Quality does not imply cleansing activities, which might mask the results.
Q. Does it understand your complete business and information flow?
Attribution and lineage are very important in big data. Knowing what is the source and what is the destination is crucial in validating analytics results as fit for purpose.
Q. How does it understand the language that you use, and can the framework manage it actively to reduce ambiguity, redundancy, and inconsistency?
Big data might not have a logical data model, so any structured data should be mapped to the enterprise model. Big data still has context and thus modeling becomes increasingly important to creating knowledge and understanding. The definitions evolve over time and the enterprise must plan to manage the shifting meaning.
Q. Does it manage classification?
It is critical for the business/steward to classify the overall source and the contents within as soon as it is brought in by its owner, to support information lifecycle management, access control, and regulatory compliance.
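As a rough illustration of classifying a source at the moment it is brought in, here is a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: the `SourceClassification` schema, the `Sensitivity` levels, the role names, and the sample values are assumptions, not part of any particular governance framework or product.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from enum import Enum


class Sensitivity(Enum):
    """Hypothetical sensitivity levels, lowest to highest."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    REGULATED = 4


@dataclass
class SourceClassification:
    """Classification recorded when a source is first brought in (illustrative schema)."""
    source_name: str
    owner: str                 # who brought the data in
    steward: str               # business steward accountable for it
    sensitivity: Sensitivity
    ingested_on: date
    retention_days: int        # supports information lifecycle management
    regulations: list = field(default_factory=list)  # supports regulatory compliance

    def allowed_roles(self) -> set:
        """Derive a coarse access-control list from the sensitivity level."""
        roles = {"steward"}
        if self.sensitivity in (Sensitivity.PUBLIC, Sensitivity.INTERNAL):
            roles.add("analyst")
        if self.sensitivity is Sensitivity.PUBLIC:
            roles.add("partner")
        return roles

    def expires_on(self) -> date:
        """Date after which the source is due for retention review."""
        return self.ingested_on + timedelta(days=self.retention_days)


# Hypothetical example: a clickstream feed classified on arrival.
clickstream = SourceClassification(
    source_name="web_clickstream",
    owner="marketing",
    steward="j.doe",
    sensitivity=Sensitivity.INTERNAL,
    ingested_on=date(2014, 9, 10),
    retention_days=365,
    regulations=["privacy"],
)
print(sorted(clickstream.allowed_roles()))
print(clickstream.expires_on())
```

The point of the sketch is only that one classification record, captured at ingestion, can feed all three needs named above: retention (lifecycle), roles (access control), and the list of applicable regulations (compliance).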
Q. How does it protect data quality and access?
Your information protection must not be compromised for the sake of expediency, convenience, or deadlines. Protect not just what you bring in, but what you join/link it to, and what you derive. Your customers will fault you for failing to protect them from malicious links. The enterprise must formulate the strategy to deal with more data, longer retention periods, more data subject to experimentation, and less process around it, all while trying to derive more value over longer periods.
Q. Does it foster stewardship?
Ensuring the appropriate use and reuse of data requires the action of an employee. That is, this role cannot be automated; it requires the active involvement of a member of the business organization to serve as the steward over the data element or source.
Q. Does it manage long-term requirements?
Policies and standards are the mechanism by which management communicates their long-range business requirements. They are essential to an effective governance program.
Q. How does it manage feedback?
As a companion to policies and standards, an escalation and exception process enables communication throughout the organization when policies and standards conflict with new business requirements. It forms the core process to drive improvements to the policy and standard documents.
Q. Does it foster innovation?
Governance must not squelch innovation. Governance can and should make accommodations for new ideas and growth. This is managed through management of the infrastructure environments as part of the architecture.
Q. How does it control third-party content?
Third-party data plays an expanding role in big data. There are three types and governance controls must be adequate for the circumstances. They must consider applicable regulations for the operating geographic regions; therefore, you must understand and manage those obligations.
Today, 80% of the effort in Big Data projects is related to extracting, transforming and loading data (ETL). Hortonworks and Informatica have teamed up so that organizations can use the power of Informatica Big Data Edition, along with their existing skills, to improve the efficiency of these operations and better leverage their resources in a modern data architecture (MDA).
Next Generation Data Management
The Hortonworks Data Platform and Informatica BDE enable organizations to optimize their ETL workloads with long-term storage and processing at scale in Apache Hadoop. With Hortonworks and Informatica, you can:
• Leverage all internal and external data to achieve the full predictive power that drives the success of modern data-driven businesses.
• Optimize the entire big data supply chain on Hadoop, turning data into actionable information to drive business value.
Imagine a world where you would have access to your most strategic data in a timely fashion, no matter how old the data is, where it is stored, or in what format. By leveraging Hadoop’s power of distributed processing, organizations can lower costs of data storage and processing and support large data distribution with high throughput and concurrency.
Overall, the alignment between business and IT grows. The Big Data solution based on Informatica and Hortonworks allows for a complete data pipeline to ingest, parse, integrate, cleanse, and prepare data for analysis natively on Hadoop thereby increasing developer productivity by 5x over hand-coding.
Where Do We Go From Here?
At the end of the day, Big Data is not about the technology. It is about the deep business and social transformation every organization will go through. The possibilities to make more informed decisions, identify patterns, proactively address fraud and threats, and predict pretty much anything are endless.
This transformation will happen as the technology is adopted and leveraged by more and more business users. We are already seeing the transition from 20-node clusters to 100-node clusters and from a handful of technology-savvy users relying on Hadoop to hundreds of business users. Informatica and Hortonworks are accelerating the delivery of actionable Big Data insights to business users by automating the entire data pipeline.
Try It For Yourself
On September 10, 2014, Informatica announced a 60-day trial version of the Informatica Big Data Edition in the Hortonworks Sandbox. This free trial enables you to download and test out the Big Data Edition on your notebook or spare computer and experience your own personal Modern Data Architecture (MDA).
If you happen to be at Strata this October 2014, please meet us at our booths: Informatica #352 and Hortonworks #117. Don’t forget to participate in our Passport Program and join our session at 5:45 pm ET on Thursday, October 16, 2014.
“Victory won’t go to those with the most data. It will go to those who make the best use of data.” – Doug Henschen, Information Week, May 2014
But how do you actually make best use of your data and become one of the data success stories? If you are going to differentiate on data, you need to use your data to innovate. Common options include:
- New products & services which leverage a rich data set
- Different ways to sell & market existing products and services based on detailed knowledge
But there is no ‘app for that’. Think about it – if you can buy an application, you are already too late. Somebody else has identified a need and created a product they expect to sell repeatedly. Applications cannot provide you a competitive advantage if everyone has one. Most people agree they will not rise to the top because they have installed ERP, CRM, SRM, etc. So it will be with any applications which claim to win you market share and profits based on data. If you want to differentiate, you need to stay ahead of the application curve, and let your internal innovation drive you forward.
Simplistically, this is a four-step process:
- Assemble a team of innovative employees and match them with skilled data scientists
- Identify data-based differentiation opportunities
- Feed the team high quality data at the rate at which they need it
- Provide them tools for data analysis and integrating data into business processes as required
Leaving aside the simplicity of these steps for a process – there is one key change to a ‘normal’ IT project. Normally data provisioning is an afterthought during IT projects. Now it must take priority. Frequently data integration is poorly executed, and barely documented. Data quality is rarely considered during projects. Poor data provisioning is a direct cause of spaghetti charts which contribute to organisational inflexibility and poor data availability to the business. Does “It will take 6 months to make those changes” sound familiar?
We have been told Big Data will change our world; Data is a raw material; Data is the new oil.
The business world is changing. We are moving into a world where our data is one of our most valuable resources, especially when coupled with our internal innovation. Applications used to differentiate us, now they are becoming commodities to be replaced and upgraded, or new ones acquired as rapidly as our business changes.
I believe that in order to differentiate on data, an organisation needs to treat data as the valuable resource we all say it is. Data Agility, Management and Governance are the true differentiators of our era. This is a frustration for those trying to innovate, but locked in an inflexible data world, built at a time people still expected ERP to be the answer to everything.
To paraphrase a recent complaint I heard: “My applications should be like my phone. I buy a new one, turn it on and it already has all my data”.
This is the exact vision that is driving Informatica’s Intelligent Data Platform.
In the end, differentiating on data comes down to one key necessity: High quality data MUST be available to all who need it, when they need it.
The future of lighting may first be peeking through at Newark Liberty Airport in New Jersey. The airport has installed 171 new LED-based light fixtures that include a variety of sensors to detect and record what’s going on in the airport, as reported by Diane Cardwell in The New York Times. Together they make a network of devices that communicates wirelessly and allows authorities to scan license plates of passing cars, watch out for lines and delays, and check out travelers for suspicious activities.
I get the feeling that Newark’s new gear will not be the last of lighting-based digital networks. Over the last few years, LED street lights have gone from something cities would love to have to the sector standard. That the market has shifted so swiftly is thanks to the efforts of early movers such as the City of Los Angeles, which last year completed the world’s largest LED street light replacement project, with LED fixtures installed on 150,000 streetlights.
Los Angeles is certainly not alone in making the switch to LED street lighting. In March 2013, Las Vegas outfitted 50,000 streetlights with LED fixtures. One month later, Austin, TX, announced plans to install 35,000 LED street lights. Not to be outdone, New York City is planning to go all-LED by 2017, which would save $14 million and many tons of carbon emissions each year.
The impending switch to LEDs is an excellent opportunity for LED light fixture makers and Big Data software vendors like Informatica. These fixtures are made with a wide variety of sensors that can be tailored to whatever the user wants to detect, including temperature, humidity, seismic activity, radiation, audio, and video, among other things. The sensors could even detect and triangulate the source of a gunshot.
This steady stream of real-time data collected from these fixtures can be transformed into torrents of small messages and events with unprecedented agility using Informatica Vibe Data Stream. Analyzed data can then be distributed to various governmental and non-governmental agencies, such as law enforcement, environmental monitors, and retailers.
If I were to guess the number of streetlights in the world, I would say 4 billion. Upgrading these is a “once-in-a-generation opportunity” to harness “lots of data, i.e., Sensory big data.”
Come and get it. For developers hungry to get their hands on Informatica on Hadoop, a downloadable free trial of Informatica Big Data Edition was launched today on the Informatica Marketplace. See for yourself the power of the killer app on Hadoop from the leader in data integration and quality.
Thanks to the generous help of our partners, the Informatica Big Data team has preinstalled the Big Data Edition inside the sandbox VMs of the two leading Hadoop distributions. This empowers Hadoop and Informatica developers to easily try the codeless, GUI driven Big Data Edition to build and execute ETL and data integration pipelines natively on Hadoop for Big Data analytics.
Informatica Big Data Edition is the most complete and powerful suite for Hadoop data pipelines and can increase productivity up to 5 times. Developers can leverage hundreds of out-of-the-box Informatica pre-built transforms and connectors for structured and unstructured data processing on Hadoop. With the Informatica Vibe Virtual Data Machine running directly on each node of the Hadoop cluster, the Big Data Edition can profile, parse, transform and cleanse data at any scale to prepare data for data science, business intelligence and operational analytics.
The Informatica Big Data Edition Trial Sandbox VMs will have a 60-day trial version of the Big Data Edition preinstalled inside a 1-node Hadoop cluster. The trials include sample data and mappings as well as getting-started documentation and videos. It is possible to try your own data with the trials, but processing is limited to the 1-node Hadoop cluster and the machine you have it running on. Any mappings you develop in the trial can be easily moved on to a production Hadoop cluster running the Big Data Edition. The Informatica Big Data Edition also supports MapR and Pivotal Hadoop distributions; however, the trial is currently only available for Cloudera and Hortonworks.
Accelerate your ability to bring Hadoop from the sandbox into production by leveraging Informatica’s Big Data Edition. Informatica’s visual development approach means that more than one hundred thousand existing Informatica developers are now Hadoop developers without having to learn Hadoop or new hand coding techniques and languages. Informatica can help organizations easily integrate Hadoop into their enterprise data infrastructure and bring the PowerCenter data pipeline mappings running on traditional servers onto Hadoop clusters with minimal modification. Informatica Big Data Edition reduces the risk of Hadoop projects and increases agility by enabling more of your organization to interact with the data in your Hadoop cluster.
To get the Informatica Big Data Edition Trial Sandbox VMs and more information please visit Informatica Marketplace
Get connected. Be connected. Make connections. Find connections. The Internet of Things (IoT) is all about connecting people, processes, data and, as the name suggests, things. The recent social media frenzy surrounding the ALS Ice Bucket Challenge has certainly reminded everyone of the power of social media, the Internet and a willingness to answer a challenge. Fueled by personal and professional connections, the craze has transformed fund raising for at least one charity. Similarly, IoT may potentially be transformational to the business of the public sector, should government step up to the challenge.
Government is struggling with the concept and reality of how IoT really relates to the business of government, and perhaps rightfully so. For commercial enterprises, IoT is far more tangible and simply more fun. Gaming, televisions, watches, Google Glass, smartphones and tablets are all about delivering over-the-top, new and exciting consumer experiences. Industry is delivering transformational innovations, which are connecting people to places, data and other people at a record pace.
It’s time to accept the challenge. Government agencies need to keep pace with their commercial counterparts and harness the power of the Internet of Things. The end game is not to deliver new, faster, smaller, cooler electronics; the end game is to create solutions that let devices connecting to the Internet interact and share data, regardless of their location, manufacturer or format and make or find connections that may have been previously undetectable. For some, this concept is as foreign or scary as pouring ice water over their heads. For others, the new opportunity to transform policy, service delivery, leadership, legislation and regulation is fueling a transformation in government. And it starts with one connection.
One way to start could be linking previously siloed systems together or creating a golden record of all citizen interactions through a Master Data Management (MDM) initiative. It could start with a big data and analytics project to determine and mitigate risk factors in education or linking sensor data across multiple networks to increase intelligence about potential hacking or breaches. Agencies could stop waste, fraud and abuse before it happens by linking critical payment, procurement and geospatial data together in real time.
This is the Internet of Things for government. This is the challenge. This is transformation.
You probably know this already, but I’m going to say it anyway: It’s time you changed your infrastructure. I say this because most companies are still running infrastructure optimized for ERP, CRM and other transactional systems. That’s all well and good for running IT-intensive, back-office tasks. Unfortunately, this sort of infrastructure isn’t great for today’s business imperatives of mobility, cloud computing and Big Data analytics.
Virtually all of these imperatives are fueled by information gleaned from potentially dozens of sources to reveal our users’ and customers’ activities, relationships and likes. Forward-thinking companies are using such data to find new customers, retain existing ones and increase their market share. The trick lies in translating all this disparate data into useful meaning. And to do that, IT needs to move beyond focusing solely on transactions, and instead shine a light on the interactions that matter to their customers, their products and their business processes.
They need what we at Informatica call a “Data First” perspective. You can check out my first blog about being Data First here.
A Data First POV changes everything from product development, to business processes, to how IT organizes itself and —most especially — the impact IT has on your company’s business. That’s because cloud computing, Big Data and mobile app development shift IT’s responsibilities away from running and administering equipment, onto aggregating, organizing and improving myriad data types pulled in from internal and external databases, online posts and public sources. And that shift makes IT a more-empowering force for business change. Think about it: The ability to connect and relate the dots across data from multiple sources finally gives you real power to improve entire business processes, departments and organizations.
I like to say that the role of IT is now “big I, little t,” with that lowercase “t” representing both technology and transactions. But that role requires a new set of priorities. They are:
- Think about information infrastructure first and application infrastructure second.
- Create great data by design. Architect for connectivity, cleanliness and security. Check out the eBook Data Integration for Dummies.
- Optimize for speed and ease of use – SaaS and mobile applications change often. Click here to try Informatica Cloud for free for 30 days.
- Make data a team sport. Get tools into your users’ hands so they can prepare and interact with it.
I never said this would be easy, and there’s no blueprint for how to go about doing it. Still, I recognize that a little guidance will be helpful. In a few weeks, Informatica’s CIO Eric Johnson and I will talk about how we at Informatica practice what we preach.
If you ask a CIO today about the importance of data to their enterprises, they will likely tell you about the need to “compete on analytics” and to enable faster business decisions. At the same time, CIOs believe they “need to provide the intelligence to make better business decisions”. One CIO said it was in fact their personal goal “to get the business to a new place faster, to enable them to derive new business insights, and to get to the gold at the end of the rainbow”.
Similarly, another CIO said that Big Data and Analytics were her highest priorities. “We have so much knowledge locked up in the data, it is just huge. We need the data cleaning and analytics to pull this knowledge out of data”. At the same time the CIOs that we talked to see their organizations as “entering an era of ubiquitous computing where users want all data on any device when they need it.”
Why does faster, better data really matter to the enterprise?
So why does it matter? Thomas H. Davenport says, “at a time when firms in many industries offer similar products and use comparable technologies, business processes are among the last remaining points of differentiation.” A CIO that we have talked to concurred in saying, “today, we need to move from ‘management by exception’ to ‘management by observation’”. Derek Abell amplified upon this idea when he said in his book Managing with Dual Strategies, “for control to be effective, data must be timely and provided at intervals that allow effective intervention”.
Davenport explains why timely data matters in this way: “analytics competitors wring every last drop of value from those processes”. Given this, “they know what products their customers want, but they also know what prices those customers will pay, how many items each will buy in a lifetime, and what triggers will make people buy more. Like other companies, they know compensation costs and turnover rates, but they can also calculate how much personnel contribute to or detract from the bottom line and how salary levels relate to individuals’ performance. Like other companies, they know when inventories are running low, but they can also predict problems with demand and supply chains, to achieve low rates of inventory and high rates of perfect orders”.
What then prevents businesses from competing on analytics?
Moving to what Davenport imagines requires more than a visualization tool. It involves fixing what is ailing IT’s systems. One CIO suggested this process can be thought of like an athlete building the muscles they need to compete. He said that businesses really need the same thing. In his eyes, data cleaning, data security, data governance, and master data management represent the muscles to compete effectively on analytics. Unless you do these things, you cannot truly compete on analytics. At UMass Memorial Health, for example, they “had four independent patient registration systems supporting the operations of their health system, with each of these having its own means of identifying patients, assigning medical record numbers, and recording patient care and encounter information”. As a result, “UMass lacked an accurate, reliable, and trustworthy picture of how many unique patients were being treated by its health system”. In order to fix things, UMass needed to “resolve patient, provider and encounter data quality problems across 11 source systems to allow aggregation and analysis of data”. Prior to fixing its data management system, this meant that “UMass lacked a top-down, comprehensive view of clinical and financial performance across its extended healthcare enterprise”.
UMass demonstrates how IT needs to fix its data management in order to improve its organization’s information intelligence and drive real and substantial business advantage. Fixing data management clearly involves delivering the good data that business users can safely use to make business decisions. It also involves ensuring that the data created is protected. CFOs that we have talked to say the Target data breach was a watershed event for them, and data protection is something that they expect will receive more and more auditing attention.
Once our data is good and safe, we need to connect current data sources and new data sources. And this must not take as long as it did in the past. The delivery of data needs to happen fast enough that business problems can be recognized as they occur and be solved before they become systemic. For this reason, users need to get access to data when and where it is needed.
With data management fixed, data intelligence is needed so that business users can make sense out of things faster. Business users need to be able to search and find data. They need self-service so they can combine existing and new unstructured data sources to test hypotheses about data interrelationships. This means the ability to assemble data from different sources at different times. Simply put, this is all about data orchestration without having any preconceived process. And lastly, they need the intelligence to automatically sense and respond to changes as new data is collected.
Some parting thoughts
The next question may be whether competing on data actually pays business dividends. Alvin Toffler says “Tiny insights can yield huge outputs”. In other words, the payoff can be huge. And those that do so will increasingly have the “right to win” against their competitors as they use information to wring every last drop of value from their business processes.
Malcolm Gladwell wrote an article in The New Yorker magazine in January, 2007 entitled “Open Secrets.” In the article, he pointed out that a national-security expert had famously made a distinction between puzzles and mysteries.
Osama bin Laden’s whereabouts were, for many years, a puzzle. We couldn’t find him because we didn’t have enough information. The key to the puzzle, it was assumed, would eventually come from someone close to bin Laden, and until we could find that source, bin Laden would remain at large. In fact, that’s precisely what happened. Al-Qaida’s No. 3 leader, Khalid Sheikh Mohammed, gave authorities the nickname of one of bin Laden’s couriers, who then became the linchpin of the CIA’s efforts to locate bin Laden.
By contrast, the problem of what would happen in Iraq after the toppling of Saddam Hussein was a mystery. It wasn’t a question that had a simple, factual answer. Mysteries require judgments and the assessment of uncertainty, and the hard part is not that we have too little information but that we have too much.
This was written before “Big Data” was a household word and it begs the very interesting question of whether organizations and corporations that are, by anyone’s standards, totally deluged with data, are facing puzzles or mysteries. Consider the amount of data that a company like Western Union deals with.
Western Union is a 160-year-old company. Having built scale in the money transfer business, the company is in the process of evolving its business model by enabling the expansion of digital products, growth of web and mobile channels, and a more personalized online customer experience. Sounds good – but get this: the company processes more than 29 transactions per second on average. That’s 242 million consumer-to-consumer transactions and 459 million business payments in a year. Nearly a billion transactions – a billion! As my six-year-old might say, that number is big enough “to go to the moon and back.” Layer on top of that the fact that the company operates in 200+ countries and territories, and conducts business in 120+ currencies. Senior Director and Head of Engineering Abhishek Banerjee has said, “The data is speaking to us. We just need to react to it.” That implies a puzzle, not a mystery – but only if data scientists are able to conduct statistical modeling and predictive analysis, systematically noting trends in sending and receiving behaviors. Check out what Banerjee and Western Union CTO Sanjay Saraf have to say about it here.
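To put the Western Union figures above side by side, here is a small back-of-the-envelope calculation in Python. It uses only the numbers quoted in this post and assumes a 365-day year. The two annual volumes sum to 701 million, which averages out to roughly 22 transactions per second, while a sustained 29 per second would come to roughly 915 million a year; the post does not say which transaction types the per-second figure covers, so treat this as a rough reconciliation rather than an official statistic.

```python
# Figures quoted in the post; the 365-day year is an assumption.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

c2c_per_year = 242_000_000        # consumer-to-consumer transactions
business_per_year = 459_000_000   # business payments
stated_avg_per_second = 29        # "more than 29 transactions per second"

# Annual total implied by the two stated volumes.
total_per_year = c2c_per_year + business_per_year

# Average rate implied by that annual total.
implied_avg_per_second = total_per_year / SECONDS_PER_YEAR

# Annual total implied by the stated per-second rate.
implied_total_at_stated_rate = stated_avg_per_second * SECONDS_PER_YEAR

print(f"Sum of stated annual volumes: {total_per_year:,}")
print(f"Average rate implied by that sum: {implied_avg_per_second:.1f} per second")
print(f"Annual total at 29 per second: {implied_total_at_stated_rate:,}")
```

Either way the order of magnitude is the same: hundreds of millions of transactions a year, each one a data point to be analyzed.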
Or consider General Electric’s aggressive and pioneering move into what has been dubbed the industrial internet. In a white paper entitled “The Case for an Industrial Big Data Platform: Laying the Groundwork for the New Industrial Age,” GE reveals some of the staggering statistics related to the industrial equipment that it manufactures and supports (services comprise 75% of GE’s bottom line):
- A modern wind turbine contains approximately 50 sensors and control loops which collect data every 40 milliseconds.
- A farm controller then receives more than 30 signals from each turbine at 160-millisecond intervals.
- At every one-second interval, the farm monitoring software processes 200 raw sensor data points with various associated properties with each turbine.
Phew! I’m no electricity operations expert, and you probably aren’t either. And most of us will get no further than simply wrapping our heads around the simple fact that GE turbines are collecting a LOT of data. But what the paper goes on to say should grab your attention in a big way: “The key to success for this wind farm lies in the ability to collect and deliver the right data, at the right velocity, and in the right quantities to a wide set of well-orchestrated analytics.” And the paper goes on to recommend that anyone involved in the Industrial Internet revolution strongly consider its talent requirements, with the suggestion that Chief Data Officers and/or Data Scientists may be the next critical hires.
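For a feel of what “a LOT of data” means, the turbine figures quoted above can be converted into per-second rates. This is a minimal sketch in Python that assumes each sensor or signal produces exactly one value per stated interval; the white paper’s figures are approximate (“approximately 50”, “more than 30”), so the results are rough estimates. The helper name `per_second` is made up for this example.

```python
def per_second(count: int, interval_ms: int) -> float:
    """Values per second if `count` items each emit one value every `interval_ms`."""
    return count * 1000 / interval_ms


# Approximately 50 sensors and control loops, sampled every 40 ms.
turbine_readings = per_second(50, 40)

# More than 30 signals per turbine reach the farm controller every 160 ms.
controller_signals = per_second(30, 160)

# 200 raw sensor data points per turbine are processed every second.
monitor_points = per_second(200, 1000)

print(f"Readings collected on each turbine: {turbine_readings:.0f} per second")
print(f"Signals to the farm controller:     {controller_signals:.1f} per second per turbine")
print(f"Points at the farm monitor:         {monitor_points:.0f} per second per turbine")
```

Under that assumption a single turbine generates on the order of a thousand readings every second, which is why the paper stresses delivering the right data at the right velocity rather than simply all of it.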
Which brings us back to Malcolm Gladwell. In the aforementioned article, Gladwell goes on to pull apart the Enron debacle, and argues that it was a prime example of the perils of too much information. “If you sat through the trial of (former CEO) Jeffrey Skilling, you’d think that the Enron scandal was a puzzle. The company, the prosecution said, conducted shady side deals that no one quite understood. Senior executives withheld critical information from investors… We were not told enough—the classic puzzle premise—was the central assumption of the Enron prosecution.” But in fact, that was not true. Enron employed complicated – but perfectly legal – accounting techniques used by companies that engage in complicated financial trading. Many journalists and professors have gone back and looked at the firm’s regulatory filings, and have come to the conclusion that, while complex and difficult to identify, all of the company’s shenanigans were right there in plain view. Enron cannot be blamed for covering up the existence of its side deals. It didn’t; it disclosed them. As Gladwell summarizes:
“Puzzles are ‘transmitter-dependent’; they turn on what we are told. Mysteries are ‘receiver dependent’; they turn on the skills of the listener.”
I would argue that this extremely complex, fast moving and seismic shift that we call Big Data will favor those who have developed the ability to attune, to listen and make sense of the data. Winners in this new world will recognize what looks like an overwhelming and intractable mystery, and break that mystery down into small and manageable chunks and demystify the landscape, to uncover the important nuggets of truth and significance.