Tag Archives: Analytics

Data Visibility From the Source to Hadoop and Beyond with Cloudera and Informatica Integration

This is a guest post by Amr Awadallah, Founder, CTO at Cloudera, Inc.

It takes a village to build mainstream big data solutions. We often get so caught up in Hadoop use cases and customer successes that sometimes we don’t talk enough about the innovative partner technologies and integrations that enable our customers to put the enterprise data hub at the core of their data architecture and innovate with confidence. Cloudera and Informatica have been working together to integrate our products to enable new levels of productivity and lower deployment and production risk.

Going from Hadoop to an enterprise data hub means a number of things. It means that you recognize the business value of capturing and leveraging all your data for exploration and analytics. It means you’re ready to make the move from Hadoop pilot project to production. And it means your data is important enough that it’s worth securing and making data pipelines visible. It’s this visibility layer, and in particular the unique integration between Cloudera Navigator and Informatica, that I want to focus on in this post.

The era of big data has ushered in increased regulations in a number of industries – banking, retail, healthcare, energy – most of which deal in how data is managed throughout its lifecycle. Cloudera Navigator is the only native end-to-end solution for governance in Hadoop. It provides visibility for analysts to explore data in Hadoop, and it enables administrators and managers to maintain a full audit history for HDFS, HBase, Hive, Impala, Spark, and Sentry, and then run reports on data access for auditing and compliance. The integration of Informatica Metadata Manager in the Big Data Edition with Cloudera Navigator extends this level of visibility and governance beyond the enterprise data hub.

Today, only Informatica and Cloudera provide end-to-end data lineage from source systems through Hadoop, and into BI/analytic and data warehouse systems. And you can view it from a single pane within Informatica.

This is important because Hadoop, and the enterprise data hub in particular, doesn’t function in a silo. It’s an integrated part of a larger enterprise-wide data management architecture. The better the insight into where data originated, where it traveled, who had access to it and what they did with it, the greater our ability to report and audit. No other combination of technologies provides this level of audit granularity.
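
To make that audit reporting concrete, here is a minimal Python sketch of pulling recent access events from Cloudera Navigator’s REST audit API. The host, port, API version, credentials, and parameter names below are assumptions that will vary by deployment; treat this as an illustration rather than a reference for the API.

```python
import requests

# Assumed Navigator endpoint and credentials -- adjust for your deployment.
NAVIGATOR_AUDITS_URL = "http://navigator-host:7187/api/v9/audits"
AUTH = ("nav_admin", "nav_password")

def fetch_audit_events(start_ms, end_ms, limit=100):
    """Fetch audit events (who accessed what, and when) for a time window."""
    params = {"startTime": start_ms, "endTime": end_ms, "limit": limit, "format": "JSON"}
    response = requests.get(NAVIGATOR_AUDITS_URL, params=params, auth=AUTH, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    # Example: events from a one-hour window, expressed in epoch milliseconds.
    for event in fetch_audit_events(1414800000000, 1414803600000):
        print(event.get("username"), event.get("operation"), event.get("resource"))
```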

But more than that, the visibility Cloudera and Informatica provide gives our joint customers the ability to confidently stand up an enterprise data hub as part of their production enterprise infrastructure, because they can verify the integrity of the data that undergirds their analytics. I encourage you to check out a demo of the Informatica-Cloudera Navigator integration at this link: http://infa.media/1uBpPbT

You can also check out a demo and learn a little more about Cloudera Navigator  and the Informatica integration in the recorded  TechTalk hosted by Informatica at this link:

http://www.informatica.com/us/company/informatica-talks/?commid=133311

Posted in Big Data, Cloud Data Integration, Governance, Risk and Compliance, Hadoop

Should Analytics Be Focused on Small Questions Versus Big Questions?

Should the analytic resources of your company be focused on small questions or big questions? For many, this is not an easy question to answer. Some find key managers preferring to make decisions from personal intuition or experience. When I worked for a large computer peripheral company, I remember executives making major decisions about product direction from their gut even when there was clear evidence that a major technology shift was about to happen. This company went from being a multi-billion dollar company to a $50 million company in a matter of a few years.

In other cases, the entire company may not see the relationship between data and good decision making. When this happens, silos of the business collect data of value to them, but there is no coordinated, focused effort placed toward enterprise-level strategic targets. This naturally leads to silos of analytical activity. Clearly, answering small questions may provide the value of having analytics quickly. However, answering the bigger questions will have the most value to the business as a whole. And while the big questions are often harder to answer, they can be pivotal to the business going forward. Here are just a few examples of the big questions that are worthy of being answered by most enterprises.

  • Which performance factors have the greatest impact on our future growth and profitability?
  • How can we anticipate and influence changing market conditions?
  • If customer satisfaction improves, what is the impact on profitability?
  • How should we optimize investments across our products, geographies, and market channels?

However, most businesses cannot easily answer these questions. Why then do they lack the analytical solutions to answer these questions?

Departmental BI does not yield strategically relevant data

Let’s face it: business intelligence to date has largely been a departmental exercise. In most enterprises, as we have been saying, analytics starts as pockets of activity rather than as an enterprise-wide capability. The departmental approach leads business analysts to buy the same data or software that others in the organization have already bought. Enterprises end up with hundreds of data marts, reporting packages, forecasting tools, data management solutions, integration tools, and methodologies. According to Thomas Davenport, one firm he knows well has “275 data marts and a thousand different information resources, but it couldn’t pull together a single view of the business in terms of key performance metrics and customer data” (Analytics at Work, Thomas Davenport, Harvard Business Press, page 47).

Clearly, answering the Big Questions requires leadership and a coordinated approach. Remarkably, taking this road often even reduces enterprise analytical expenditure, as silos of information – including data marts and spaghetti-code integrations – are eliminated and replaced with a single enterprise capability. But if you want to take this approach, how do you make sure that you get the right business questions answered?

Strategic Approach

The strategic approach starts with enterprise strategy. In enterprise strategy, leadership defines opportunities for business growth, innovation, differentiation, and marketplace impact. According to Derek Abell, this should occur as a three-cycle strategic planning approach: the enterprise does business planning, followed by functional planning, and lastly budgeting. Each cycle provides fodder for the stages that follow. For each stage, a set of overarching, cascading objectives can be derived. From these, the business can define a set of critical success factors that will let it know whether or not business objectives are being met. Supporting each critical success factor are quantitative key performance indicators that, in aggregate, say whether the success factors are going to be met. Finally, these key performance indicators determine the data that is needed to support them, in terms of metrics or the supporting dimensional data for analysis. So the art and science here is defining critical success factors and KPIs that answer the big questions.
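
As a rough illustration of this cascade (a hypothetical sketch, not part of Abell’s framework itself), the hierarchy from objective to critical success factor to KPI to supporting data can be expressed as a simple structure; every name and target below is made up.

```python
# Hypothetical cascade from a business objective down to the data that feeds its KPIs.
strategy_cascade = {
    "objective": "Grow profitable revenue in existing markets",
    "critical_success_factors": [
        {
            "name": "Improve customer retention",
            "kpis": [
                {
                    "name": "12-month customer churn rate",
                    "target": 0.08,  # keep churn at or below 8%
                    "supporting_data": ["customer master", "order history", "support tickets"],
                }
            ],
        },
        {
            "name": "Increase share of wallet in top accounts",
            "kpis": [
                {
                    "name": "Average products held per top-100 account",
                    "target": 3.5,
                    "supporting_data": ["account hierarchy", "product holdings"],
                }
            ],
        },
    ],
}

def kpis_for_objective(cascade):
    """List every KPI that rolls up to the stated business objective."""
    return [kpi["name"]
            for csf in cascade["critical_success_factors"]
            for kpi in csf["kpis"]]

print(kpis_for_objective(strategy_cascade))
```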

Core Capabilities

As we saw above, the strategic approach is about tying questions to business strategy. In the capabilities approach, we tie questions to the capabilities that drive business competitive advantage. To determine these business capabilities, we need to start by looking at “the underlying mechanism of value creation in the company (what they do best) and what the opportunities are for meeting the market effectively” (The Essential Advantage, Paul Leinwand, Harvard Business Review Press, page 19). Typically, this identifies three to six distinctive capabilities that impact the success of the enterprise’s service or product portfolio. These are the things that “enable your company to consistently outperform rivals” (The Essential Advantage, Paul Leinwand, Harvard Business Review Press, page 14). The goal is to optimize these key business capabilities over time, and to innovate and operate in ways that differentiate the business in the eyes and experience of customers (Analytics at Work, Thomas Davenport, Harvard Business Press, page 73). Here we want to target analytics investments at these distinctive capabilities. Here are some examples of potential target capabilities by industry:

  • Financial services: Credit scoring
  • Retail: Replenishment
  • Manufacturing: Supply Chain Optimization
  • Healthcare: Disease Management

Parting remarks

So, as we have discussed, many firms are spending too much on analytic solutions that do not solve real business problems. Getting after this is not a technical issue – it is a business issue. It starts by asking the right business questions, which can come from business strategy, your core business capabilities, or some mix of the two.

Related links

Related Blogs

Analytics Stories: A Banking Case Study
Analytics Stories: A Financial Services Case Study
Analytics Stories: A Healthcare Case Study
Who Owns Enterprise Analytics and Data?
Competing on Analytics: A Follow Up to Thomas H. Davenport’s Post in HBR
Thomas Davenport Book “Competing On Analytics”

Solution Brief: The Intelligent Data Platform

Author Twitter: @MylesSuer

Posted in Big Data, CIO, Data Governance

Gaining a Data-First Perspective with Salesforce Wave

Data-First with Salesforce Wave

Salesforce.com made waves (pardon the pun) at last month’s Dreamforce conference when it unveiled the Salesforce Wave Analytics Cloud. You know Big Data has reached prime-time when Salesforce, which has a history of knowing when to enter new markets, decides to release a major analytics service.

Why now? Because companies need help making sense of the data deluge, Salesforce’s CEO Marc Benioff said at Dreamforce: “Did you know 90% of the world’s data was created in the last two years? There’s going to be 10 times more mobile data by 2020, 19 times more unstructured data, and 50 times more product data by 2020.” Average business users want to understand what that data is telling them, he said. Given Salesforce’s marketing expertise, this could be the spark that gets mainstream businesses to adopt the Data-First perspective I’ve been talking about.

As I’ve said before, a Data First POV shines a light on important interactions so that everyone inside a company can see and understand what matters. As a trained process engineer, I can tell you, though, that good decisions depend on great data – and great data doesn’t just happen. At the most basic level, you have to clean it, relate it, connect it, and secure it, so that information from, say, SAP can be viewed in the same context as data from Salesforce. Informatica obviously plays a role in this. If you want to find out more, click on this link to download our Salesforce Integration for Dummies brochure.

But those are just the basics for getting started. The bigger issue – and the one so many people seem to have trouble with – is deciding which metrics to explore. Say, for example, that the sales team keeps complaining about your marketing leads. Chances are, it’s a familiar complaint. How do you discover what the problem really is?

One obvious place to start is to look at the conversion rates for every sales rep and group. Next, explore the marketing leads they do accept, looking at attributes such as deal size, product type, or customer category. Now take it deeper: examine which sales reps like to hunt for new customers and which prefer to mine their current base. That will tell you whether you’re sending opportunities to the right profiles.

The key is to never look at the sales organization only as a whole. If it’s EMEA, for instance, have a look at how France is doing selling to emerging markets vs. the team in Germany. These metrics are digital trails of human behavior. Data First allows you to explore that behavior and either optimize it or change it.
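
As a purely illustrative sketch of that drill-down (the column names and numbers below are hypothetical, not from any real sales system), here is how lead conversion rates by rep and by region could be computed with pandas:

```python
import pandas as pd

# Hypothetical lead records: one row per marketing lead handed to sales.
leads = pd.DataFrame({
    "sales_rep": ["Anna", "Anna", "Marc", "Marc", "Claire", "Claire"],
    "region":    ["DE",   "DE",   "FR",   "FR",   "FR",     "DE"],
    "segment":   ["emerging", "enterprise", "emerging", "emerging", "enterprise", "emerging"],
    "converted": [1, 0, 0, 1, 1, 0],
})

# Conversion rate for every sales rep and group.
by_rep = leads.groupby("sales_rep")["converted"].mean()

# Drill down: how each region converts emerging-market opportunities.
emerging = leads[leads["segment"] == "emerging"]
by_region = emerging.groupby("region")["converted"].mean()

print(by_rep)
print(by_region)
```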

But for this exploration to pay off, you actually have to do some of the work. You can’t just job it out to an analyst. This exercise doesn’t become meaningful until you are mentally engaged in the process. And that’s how it should be: If you are a Data First company, you have to be a Data First leader.

Posted in Data Archiving, Data First, Data Governance, Data Integration

Analytics Stories: A Banking Case Study

As I have shared in other posts in this series, businesses are using analytics to improve their internal and external facing business processes and to strengthen their “right to win” in the markets in which they operate. In banking, the right to win increasingly comes from improving two core sets of business capabilities: risk management and customer service.

Significant change has occurred in risk management over the last few years following the subprime crisis and the subsequent credit crunch. These environmental changes have put increased regulatory pressure on banks around the world. Among other things, banks need to comply with measures aimed at limiting the overvaluation of real estate assets and at preventing money laundering. A key element of handling these is ensuring that business decisions going forward are made consistently, using the most accurate business data available. It seems clear that data consistency can determine the quality of business operations, especially the management of business risk.

At the same time as banks need to strengthen their business capabilities around operations, and in particular risk management, they also need to use better data to improve the loyalty of their existing customer base.

Banco Popular launches itself into the banking vanguard

Banco Popular is an early responder to the need for better banking data consistency. Its leadership created a Quality of Information Office (uniquely, the Office is not based within IT but within the Office of the President) with the mandate of delivering on two business objectives:

  1. Ensuring compliance with governmental regulations
  2. Improving customer satisfaction based on accurate and up-to-date information

Part of the second objective is aimed at ensuring that each of Banco Popular’s customers is offered the ideal products for their specific circumstances. This is interesting because, by its nature, it assists in attaining the first objective. To validate that it achieves both mandates, the Office started by creating an “Information Quality Index”. The Index is built from many different types of data relating to each of the bank’s six million customers, including addresses, contact details, socioeconomic data, occupation data, and banking activity data. The index is expressed in percentage terms and reflects the quality of the information collected for each individual customer. The overarching target set for the organization is a score of 90 percent; presently, the figure sits at 75 percent. There is room to grow and improve!
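
The case study does not disclose how Banco Popular actually calculates its index, but a minimal sketch of one plausible approach – scoring the weighted share of key customer fields that are populated – might look like the following; the fields, weights, and validity checks are assumptions for illustration only.

```python
# Hypothetical information-quality index: weighted share of customer fields
# that are present; the weights mimic an "essential vs. desirable" classification.
FIELD_WEIGHTS = {
    "address": 3,
    "phone": 2,
    "occupation": 1,
    "socioeconomic_segment": 1,
    "last_activity_date": 2,
}

def quality_index(customer: dict) -> float:
    """Return a 0-100 quality score for a single customer record."""
    earned = sum(weight for field, weight in FIELD_WEIGHTS.items()
                 if customer.get(field) not in (None, "", "UNKNOWN"))
    return 100.0 * earned / sum(FIELD_WEIGHTS.values())

customer = {
    "address": "Calle Mayor 1, Madrid",
    "phone": "+34 600 000 000",
    "occupation": None,                  # a missing field lowers the score
    "socioeconomic_segment": "B",
    "last_activity_date": "2014-09-30",
}
print(round(quality_index(customer), 1))  # 88.9 for this example record
```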

Current data management systems limit attainment of its business goals

Unfortunately, the millions of records needed by the Quality of Information Office are spread across different tables in the organization’s central computing system and must be combined into one information file for each customer to be useful to business users. The problem is that the bank had depended on third parties to manually pull and clean up this data, an approach that proved too slow to execute in a timely fashion given the above mandates. This, in turn, impacted the quality of its business capabilities for risk and customer service. This approach did not allow the bank to create the index and other analyses “with the frequency that we wanted and examining the variables of interest to us,” explains Federico Solana, an analyst at the Banco Popular Quality of Information Office.

Creating the Quality Index was simply too time consuming and costly, and the failure to improve data delivery performance had a direct impact on decision making.

Automation proves key to better business processes

To speed up delivery of its Quality Index, Banco Popular determined it needed to automate the creation of great data – data that is trustworthy and timely. According to Tom Davenport, “you can’t be analytical without data and you can’t be really good at analytics without really good data” (Analytics at Work, 2010, Harvard Business Press, page 23). Banco Popular felt that automating the tasks of analyzing and comparing variables would increase the value of data at lower cost while ensuring a faster return on data.

In addition to fixing the Quality Index, Banco Popular needed to improve its business capabilities around risk and customer service automation. This effort aimed to improve the analysis of mortgages while reducing the cost of data, accelerating the return on data, and boosting business and IT productivity.

Everything, however, needed to start with the Quality Index. After the Quality Index was created for individuals, Banco Popular created a Quality of Information Index for Legal Entities and is planning to extend the return on data by creating indexes for Products and Activities. For the Quality Index related to legal entities, the bank included variables aimed at preventing the consumption of capital as well as other variables used to calculate the probability of underpayments and Basel models. Variables are classified as essential, required, or desirable. This evaluation of data quality allows for the subsequent definition of new policies and initiatives for transactions, the network of branches, and internal processes, among other aspects. In addition, the bank is also working on the in-depth analysis of quality variables to improve its critical business processes, including mortgages.

Some Parting Remarks

In the end, Banco Popular has shown the way forward for analytics. In banking, the measures of performance are often known; what is problematic is ensuring the consistency of decision making across branches and locations. By working first on data quality, Banco Popular ensured that its quality measures are consistent, and it can now focus its attention on improving underlying business effectiveness and efficiency.

Related links

Related Blogs

Analytics Stories: A Financial Services Case Study
Analytics Stories: A Healthcare Case Study
Who Owns Enterprise Analytics and Data?
Competing on Analytics: A Follow Up to Thomas H. Davenport’s Post in HBR
Thomas Davenport Book “Competing On Analytics”

Solution Brief: The Intelligent Data Platform

Author Twitter: @MylesSuer

 

Posted in CIO, Data Governance

Who Owns Enterprise Analytics and Data?

With the increasing importance of enterprise analytics, the question becomes: who should own the analytics and data agenda? This question really matters today because, according to Thomas Davenport, “business processes are among the last remaining points of differentiation.” For this reason, Davenport even suggests that businesses that create a sustainable right to win use analytics to “wring every last drop of value from their processes”.

Is the CFO the logical choice?

When I talk with CIOs about enterprise analytics and data, they are clear that they do not want to become their company’s data steward. They insist instead that they want to be an enabler of the analytics and data function. So which business function should own enterprise analytics and data? Last week, an interesting answer came from a CFO Magazine article by Frank Friedman. Frank contends that CFOs are “the logical choice to own analytics and put them to work to serve the organization’s needs”.

To justify his position, Frank made the following claims:

  1. CFOs own most of the unprecedented quantities of data that businesses create from supply chains, product processes, and customer interactions
  2. Many CFOs already use analytics to address their organization’s strategic issues
  3. CFOs uniquely can act as a steward of value and an impartial guardian of truth across the organization. This gives them the credibility and trust needed when analytics produce insights that debunk currently accepted wisdom

Frank contends as well that owning the analytics agenda is a good thing because it allows CFOs to expand their strategic leadership role in doing the following:

  • Growing top line revenue
  • Strengthening their business ties
  • Expanding the CFO’s influence outside the finance function.

Frank suggests as well that analytics empowers the CFO to exercise more centralized control of operational business decision making. The question is what do other CFOs think about Frank’s position?

CFOs clearly have an opinion about enterprise analytics and data

A major Retail CFO says that finance needs to own “the facts for the organization” – the metrics and KPIs. And while he honestly admits that finance organizations in the past have not used data well, he claims finance departments need to make the time to become truly data centric. He said, “I do not consider myself a data expert, but finance needs to own enterprise data and the integrity of this data.” This CFO claims as well that “finance needs to use data to make sure that resources are focused on the right things; decisions are based on facts; and metrics are simple and understandable”. A Food and Beverage CFO agreed with the Retail CFO, saying that almost every piece of data is financial in one way or another: CFOs need to manage all of this data since they own operational performance for the enterprise, and CFOs should own the key performance indicators of the business.

CIOs should own data, data interconnect, and system selection

A Healthcare CFO said, however, that he wants the CIO to own data systems, data interconnects, and system selection, while the finance organization is the recipient of data. “CFOs have a major stake in data. CFOs need to dig into operational data to be able to relate operations to internal accounting and to analyze things like costs versus price.” He said that “CFOs can’t function without good operational data”.

An Accounting Firm CFO agreed with the Healthcare CFO, saying that CIOs are a means to get data. She said that CFOs need to make sense of data in their performance management role; CFOs, therefore, are big consumers of both business intelligence and analytics. An Insurance CFO concurred, saying CIOs should own how data is delivered.

CFOs should be data validators

The Insurance CFO said, however, that CFOs need to be validators of data and reports. They should, as a result, be very knowledgeable about BI and analytics. In other words, CFOs need to be the Underwriters Laboratories (UL) for corporate data.

Now it is your chance

So the question is what do you believe? Does the CFO own analytics, data, and data quality as a part of their operational performance role? Or is it a group of people within the organization? Please share your opinions below.

Related links

Solution Brief: The Intelligent Data Platform

Related Blogs

CFOs Move to Chief Profitability Officer
CFOs Discuss Their Technology Priorities
The CFO Viewpoint upon Data
How CFOs can change the conversation with their CIO?
New type of CFO represents a potent CIO ally
Competing on Analytics
The Business Case for Better Data Connectivity

Twitter: @MylesSuer

 

Posted in CIO, Data First, Data Governance, Enterprise Data Management

Analytics Stories: A Financial Services Case Study

As I indicated in my last case study regarding competing on analytics, Thomas H. Davenport believes “business processes are among the last remaining points of differentiation.” For this reason, Davenport contends that businesses that create a sustainable right to win use analytics to “wring every last drop of value from their processes”. For financial services, the mission-critical areas needing process improvement center around improving the consistency of decision making and making the management of regulatory compliance more efficient and effective.

Why does Fannie Mae need to compete on analytics?

Fannie Mae is in the business of enabling people to buy, refinance, or rent homes. As part of this, Fannie Mae says it is all about keeping people in their homes and getting people into new homes. Foundational to this mission is the accurate collection and reporting of data for decision making and risk management. According to Tracy Stephan at Fannie Mae, their “business needs to have the data to make decisions in a more real time basis. Today, this is all about getting the right data to the right people at the right time”.

Fannie Mae says that when the mortgage crisis hit, a lot of the big banks stopped lending, and this meant that Fannie Mae, among others, needed to pick up the slack. Its actions here, however, caused the Federal Government to require it to report monthly and quarterly against goals that the government set for it. “This meant that there was no room for error in how data gets reported.” In the end, Fannie Mae says three business imperatives drove its need to improve its reporting and its business processes:

  1. To ensure that business decisions going forward were made consistently using the most accurate business data available
  2. To avoid penalties by adhering to Dodd-Frank and other regulatory requirements established for it after the 2008 Global Financial Crisis
  3. To comply with reporting to the Federal Reserve and Wall Street regarding overall business risk as a function of data quality and accuracy, the credit-worthiness of loans, and the risk levels of investment positions.

Delivering on these imperatives required Fannie Mae to change how it managed data

Given these business imperatives, IT leadership quickly realized it needed to enable the business to use data to truly drive better business processes from end to end of the organization. However, this meant enabling Fannie Mae’s business operations teams to manage data more effectively and efficiently. This led Fannie Mae to determine that it needed a single source of truth, whether for mortgage applications or for the passing of information securely to investors. This need required Fannie Mae to establish the ability to share the same data across every Fannie Mae repository.

But there was a problem. Fannie Mae needed clean and correct data collected and integrated from more than 100 data sources. Fannie Mae determined that doing so with its current data processes could not scale, and that those processes would not allow it to meet its compliance reporting requirements. At the same time, Fannie Mae needed to deliver more proactive management of compliance. This required that it know how critical business data enters and flows through each of its systems, including how data is changed by multiple internal processing and reporting applications. As well, Fannie Mae leadership felt that ensuring traceability to the individual user was critical.
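
To illustrate the kind of lineage bookkeeping described above (a sketch only, not Fannie Mae’s or Informatica’s actual implementation; all system and field names are hypothetical), a minimal record of how one data element moves and changes across systems might look like this:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LineageHop:
    """One step in a data element's journey between systems."""
    source_system: str
    target_system: str
    transformation: str   # what changed in this hop
    changed_by: str       # application or user id, for traceability

@dataclass
class DataElementLineage:
    element: str
    hops: List[LineageHop] = field(default_factory=list)

    def trace(self) -> str:
        """Render the end-to-end path this element travels."""
        path = " -> ".join([self.hops[0].source_system] + [h.target_system for h in self.hops])
        return f"{self.element}: {path}"

# Hypothetical example: a loan balance flowing from origination to regulatory reporting.
lineage = DataElementLineage(
    element="unpaid_principal_balance",
    hops=[
        LineageHop("loan_origination", "servicing_db", "currency normalized", "svc_etl_job"),
        LineageHop("servicing_db", "risk_warehouse", "aggregated monthly", "risk_etl_job"),
        LineageHop("risk_warehouse", "regulatory_reports", "mapped to report line item", "jdoe"),
    ],
)
print(lineage.trace())
```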

The solution

Per its discussions with business customers, Fannie Mae’s IT leadership determined that it needed to get real-time, trustworthy data to improve its business operations, its business processes, and its decision making. As said, these requirements could not be met with its historical approaches to integrating and managing data.

Fannie Mae determined that it needed to create a platform that was highly available and scalable, and that largely automated its management of data quality. At the same time, the platform needed to provide the ability to create a set of business glossaries with clear data lineage. Fannie Mae determined that it effectively needed a single source of truth across all of its business systems. According to Tracy Stephan, IT Director, Fannie Mae, “Data quality is the key to the success of Fannie Mae’s mission of getting the right people into the right homes. Now all our systems look at the same data – that one source of truth – which gives us great comfort.” To learn more specifics about how Fannie Mae improved its business processes and demonstrated that it is truly “data driven”, please click on this video of their IT leadership.

Related links
Solution Brief: The Intelligent Data Platform
Related Blogs
Thomas Davenport Book “Competing On Analytics”
Competing on Analytics
The Business Case for Better Data Connectivity
The CFO Viewpoint upon Data
What an enlightened healthcare CEO should tell their CIO?

Twitter: @MylesSuer

Posted in CIO, Financial Services

The Streetlight Is Watching You

We are hugely dependent upon technology and sometimes take it for granted. It is always worth reminding ourselves where it all began so we can fully appreciate how lucky we are. Take light-emitting diodes (LEDs), for example. They have come a long way in a relatively short time. When they were first used as low-intensity light emitters in electronic devices, it would have been difficult to believe anyone could foresee them one day lighting our homes.

The future of lighting may first be peeking through at Newark Liberty Airport in New Jersey. The airport has installed 171 new LED-based light fixtures that include a variety of sensors to detect and record what’s going on in the airport, as reported by Diane Cardwell in The New York Times. Together they make a network of devices that communicates wirelessly and allows authorities to scan the license plates of passing cars, watch out for lines and delays, and check travelers for suspicious activities.

I get the feeling that Newark’s new gear will not be the last of lighting-based digital networks. Over the last few years, LED street lights have gone from something cities would love to have to the sector standard. That the market has shifted so swiftly is thanks to the efforts of early movers such as the City of Los Angeles, which last year completed the world’s largest LED street light replacement project, with LED fixtures installed on 150,000 streetlights.

Los Angeles is certainly not alone in making the switch to LED street lighting. In March 2013, Las Vegas outfitted 50,000 streetlights with LED fixtures. One month later, Austin, TX, announced plans to install 35,000 LED street lights. Not to be outdone, New York City is planning to go all-LED by 2017, which would save $14 million and many tons of carbon emissions each year.

The impending switch to LEDs is an excellent opportunity for LED light fixture makers and Big Data software vendors like Informatica. These fixtures are made with a wide variety of sensors that can be tailored to whatever the user wants to detect, including temperature, humidity, seismic activity, radiation, audio, and video, among other things. The sensors could even detect and triangulate the source of a gunshot.

This steady stream of real-time data collected from these fixtures can be transformed into torrents of small messages and events with unprecedented agility using Informatica Vibe Data Stream. Analyzed data can then be distributed to various governmental and non-governmental agencies, such as law enforcement, environmental monitors, and retailers.
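
As a rough, generic illustration of what turning sensor readings into small event messages involves (this deliberately does not use Vibe Data Stream’s actual API; the fixture IDs and fields are invented), consider:

```python
import json
import time
import random

def read_fixture_sensors(fixture_id: str) -> dict:
    """Stand-in for a real sensor read on one light fixture (values are simulated)."""
    return {
        "fixture_id": fixture_id,
        "timestamp": int(time.time() * 1000),
        "temperature_c": round(random.uniform(-5, 35), 1),
        "humidity_pct": round(random.uniform(20, 90), 1),
        "sound_level_db": round(random.uniform(30, 100), 1),
    }

def to_event_message(reading: dict) -> bytes:
    """Package a reading as a small JSON message, ready to publish to a stream."""
    return json.dumps(reading, separators=(",", ":")).encode("utf-8")

if __name__ == "__main__":
    for fixture in ("EWR-TERM-A-001", "EWR-TERM-A-002"):
        message = to_event_message(read_fixture_sensors(fixture))
        print(len(message), "bytes:", message.decode("utf-8"))
```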

If I were to guess the number of streetlights in the world, I would say 4 billion. Upgrading them is a “once-in-a-generation opportunity” to harness “lots of data, i.e., sensory big data.”

Posted in Big Data, Utilities & Energy, Vibe

Download the Informatica Big Data Edition Trial and Unleash the Power of Hadoop

Big Data Edition Trial Sandbox for Cloudera

Come and get it. For developers hungry to get their hands on Informatica on Hadoop, a free, downloadable trial of Informatica Big Data Edition was launched today on the Informatica Marketplace. See for yourself the power of the killer app on Hadoop from the leader in data integration and quality.

Thanks to the generous help of our partners, the Informatica Big Data team has preinstalled the Big Data Edition inside the sandbox VMs of the two leading Hadoop distributions.  This empowers Hadoop and Informatica developers to easily try the codeless, GUI driven Big Data Edition to build and execute ETL and data integration pipelines natively on Hadoop for Big Data analytics.

Informatica Big Data Edition is the most complete and powerful suite for Hadoop data pipelines and can increase productivity up to 5 times. Developers can leverage hundreds of out-of-the-box Informatica pre-built transforms and connectors for structured and unstructured data processing on Hadoop.  With the Informatica Vibe Virtual Data Machine running directly on each node of the Hadoop cluster, the Big Data Edition can profile, parse, transform and cleanse data at any scale to prepare data for data science, business intelligence and operational analytics.

The Informatica Big Data Edition Trial Sandbox VMs will have a 60 day trial version of the Big Data Edition preinstalled inside a 1-node Hadoop cluster.  The trials include sample data and mappings as well as getting started documentation and videos.  It is possible to try your own data with the trials, but processing is limited to the 1-node Hadoop cluster and the machine you have it running on.  Any mappings you develop in the trial can be easily moved on to a production Hadoop cluster running the Big Data Edition. The Informatica Big Data Edition also supports MapR and Pivotal Hadoop distributions, however, the trial is currently only available for Cloudera and Hortonworks.

Big Data Edition Trial Sandbox for Hortonworks

Accelerate your ability to bring Hadoop from the sandbox into production by leveraging Informatica’s Big Data Edition. Informatica’s visual development approach means that more than one hundred thousand existing Informatica developers are now Hadoop developers without having to learn Hadoop or new hand coding techniques and languages. Informatica can help organizations easily integrate Hadoop into their enterprise data infrastructure and bring the PowerCenter data pipeline mappings running on traditional servers onto Hadoop clusters with minimal modification. Informatica Big Data Edition reduces the risk of Hadoop projects and increases agility by enabling more of your organization to interact with the data in your Hadoop cluster.

To get the Informatica Big Data Edition Trial Sandbox VMs and more information please visit Informatica Marketplace

Posted in Big Data, Data Integration, Hadoop

Reflections Of A Former Data Analyst (Part 2) – Changing The Game For Data Plumbing

 

Cleaning – sometimes it is challenging!

In my last blog I promised I would report back on my experience using Informatica Data Quality, a software tool that helps automate the hectic, tedious data plumbing task – a task that routinely consumes more than 80% of an analyst’s time. Today, I am happy to share what I’ve learned in the past couple of months.

But first, let me confess something. The reason it took me so long to get here was that I dreaded trying the software. Never a savvy computer programmer, I was convinced that I would not be technical enough to master the tool and that it would turn into a lengthy learning experience. The mental barrier dragged me down for a couple of months before I finally bit the bullet and got my hands on the software. I am happy to report that my fear was truly unnecessary: it took me half a day to get a good handle on most features in the Analyst Tool, a component of Data Quality designed for analysts and business users. I then spent three days figuring out how to maneuver the Developer Tool, another key piece of the Data Quality offering used mostly by – you guessed it – developers and technical users. I have to admit that I am no master of the Developer Tool after three days of wrestling with it, but I got the basics. More importantly, my hands-on interaction with the entire product helped me understand the logic behind the overall design and see for myself how analysts and business users can easily collaborate with their IT counterparts within the Data Quality environment.

To break it all down, first comes Profiling. As analysts, we understand all too well the importance of profiling, as it provides an anatomy of the raw data we collect. In many cases, it is a must-have first step in data preparation (especially when our raw data comes from different places and can also carry different formats). As a heavy user of Excel, I used to rely on all the tricks available in the spreadsheet to gain visibility into my data. I would filter, sort, build pivot tables, and make charts to learn what was in my raw data. Depending on how many columns were in my data set, it could take hours, sometimes days, just to figure out whether the data I received was any good at all, and how good it was.

which one do you like better?

Switching to the Analyst Tool in Data Quality, learning my raw data becomes a task of a few clicks – a maximum of six if I am picky about how I want it done. Basically, I load my data, click on a couple of options, and let the software do the rest. A few seconds later I am able to visualize the statistics of the data fields I choose to examine, and I can also measure the quality of the raw data by using the Scorecard feature in the software. No more fiddling with spreadsheets and staring at busy rows and columns. Take a look at the above screenshots and let me know your preference.
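
For readers who want a code-level feel for what profiling produces, here is a rough pandas analogy (not the Informatica Analyst Tool itself; the sample data is made up) that computes the same kinds of column statistics – completeness, distinct counts, and basic numeric summaries:

```python
import pandas as pd

# Hypothetical raw extract; in practice this would be the file you receive.
raw = pd.DataFrame({
    "customer_id": [101, 102, 103, 103, None],
    "state": ["CA", "ca", "NY", "NY", ""],
    "order_total": [250.0, 99.5, None, 4300.0, 75.25],
})

# Column-by-column profile: completeness, distinct values, and data types.
profile = pd.DataFrame({
    "dtype": raw.dtypes.astype(str),
    "non_null": raw.notna().sum(),
    "null_pct": (raw.isna().mean() * 100).round(1),
    "distinct": raw.nunique(dropna=True),
})
print(profile)
print(raw.describe())  # numeric summary: count, mean, min, max, quartiles
```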

Once I decide that my raw data is adequate enough to use after profiling, I still need to clean up the nonsense in it before performing any analysis work; otherwise bad things can happen – we call it garbage in, garbage out. Again, to clean and standardize my data, Excel came to the rescue in the past. I would play with different functions and learn new ones, write macros, or simply do it by hand. It was tedious, but it worked as long as I was working on a static data set. The problem, however, was that when I needed to incorporate new data sources in a different format, many of the previously built formulas would break and become inapplicable. I would have to start all over again. Spreadsheet tricks simply don’t scale in those situations.

Rule Builder in Analyst Tool

With the Data Quality Analyst Tool, I can use the Rule Builder to create a set of logical rules in a hierarchical manner based on my objectives, and test those rules to see the immediate results. The nice thing is that those rules are not tied to data format, location, or size, so I can reuse them when new data comes in. Profiling can be done at any time, so I can re-examine my data after applying the rules, as many times as I like. Once I am satisfied with the rules, they are passed on to my peers in IT so they can create executable rules based on the logic I defined and run them automatically in production. No more worrying about differences in format, volume, or other discrepancies in the data sets; all the complexity is taken care of by the software, and all I need to do is build meaningful rules that transform the data into the appropriate condition so I have good quality data to work with for my analysis. Best part? I can do all of the above without hassling IT – feeling empowered is awesome!
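
To give a feel for what reusable cleansing rules look like in code (a plain-Python analogy rather than the Rule Builder itself; the rules and field names are made up), here is a small rule set that can be re-applied to any new batch of data, regardless of where it came from:

```python
import pandas as pd

# Each rule is a small, named function that takes and returns a DataFrame,
# so the same rule set can be re-applied to any new batch of data.
def standardize_state(df: pd.DataFrame) -> pd.DataFrame:
    df["state"] = df["state"].str.strip().str.upper().replace({"": None})
    return df

def drop_invalid_totals(df: pd.DataFrame) -> pd.DataFrame:
    return df[df["order_total"].notna() & (df["order_total"] >= 0)]

CLEANSING_RULES = [standardize_state, drop_invalid_totals]

def apply_rules(df: pd.DataFrame) -> pd.DataFrame:
    """Run every cleansing rule in order over a batch of records."""
    for rule in CLEANSING_RULES:
        df = rule(df)
    return df

new_batch = pd.DataFrame({
    "state": [" ca", "NY ", ""],
    "order_total": [99.5, None, 250.0],
})
print(apply_rules(new_batch))
```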

Changing The Game For Data Plumbing

Use the Right Tool for the Job

Using the right tool for the right job will improve our results, save us time, and make our jobs much more enjoyable. For me, there will be no more Excel for data cleansing after trying our Data Quality software, because now I can get more done in less time, and I am no longer stressed out by the lengthy process.

I encourage my analyst friends to try Informatica Data Quality, or at least the Analyst Tool in it. If you are like me, feeling wary about a steep learning curve, then fear no more. Besides, if Data Quality can cut your data cleansing time in half (mind you, our customers have reported higher numbers), how many more predictive models could you build, how much more would you learn, and how much faster could you build your reports in Tableau, with more confidence?

Posted in Data Governance, Data Quality