Informatica’s Vibe virtual data machine can streamline big data work and allow data scientists to be more efficient
Informatica introduced an embeddable Vibe engine not only for transformation, but also for data quality, data profiling, data masking and a host of other data integration tasks. It will have a meaningful impact on the data scientist shortage.
Some clear economic facts are already apparent in the current world of data. Hadoop provides a significantly less expensive platform for gathering and analyzing data; cloud computing (potentially) is a more economical computing location than on-premises, if managed well. These are clearly positive developments. On the other hand, the human resources required to exploit these new opportunities are actually quite expensive. When there is greater demand than can be met in the short term for a hot product, suppliers put customers “on allocation” to manage the distribution to the most strategic customers.
This is the situation with “data scientists,” this new breed of experts with quantitative skills, data management skills, presentation skills and deep domain expertise. Current estimates are that there are 60,000 – 120,000 unfilled positions in the US alone. Naturally, data scientists are “allocated” to the most critical (economically lucrative) efforts, and their time is limited to those tasks that most completely leverage their unique skills.
To address this shortage, industry turns to universities to develop curricula to manufacture data scientists, but this will take time. In the meantime, salaries for data scientists are very high. Unfortunately, most data science work involves a great deal of effort that does not require data science skills, especially in the areas of managing the data prior to the insightful analytics. Some estimates are that data scientists spend 50-80% of their time finding and cleaning data, managing their computing platforms and writing programs. Reducing this effort with better tools can not only make data scientists more effective, it can also have an impact on the most expensive component of big data – human resources.
Informatica today introduced Vibe, its embeddable virtual data machine, to do exactly that. Informatica has, for over 20 years, provided tools that allow developers to design and execute transformation of data without the need for writing or maintaining code. With Vibe, this capability is extended to include data quality, masking and profiling, and the engine itself can be embedded in the platforms where the work is performed. In addition, the engine can generate separate code from a single data management design.
In the case of Hadoop, Informatica designers can continue to operate in the familiar design studio, and have Vibe generate the code for whatever platform is needed. In this way, it is possible for an Informatica developer to develop these data management routines for Hadoop, without learning Hadoop or writing code in Java. And the real advantage is that the data scientist is freed from work that can be performed by those in lower pay grades, and that work can be parallelized too – multiple programmers and integration developers to one data scientist.
Vibe is a major innovation for Informatica that provides many interesting opportunities for its customers. Easing the data scientist problem is only one.
This is a guest blog penned by Neil Raden, a well-known industry figure as an author, lecturer and practitioner. He has in-depth experience as a developer, consultant and analyst in all areas of Analytics and Decision Services including Big Data strategy and implementation, Business Intelligence, Data Warehousing, Statistical/Predictive Modeling, Decision Management, and IT systems integration including assessment, architecture, planning, project management and execution. Neil has authored dozens of sponsored white papers and articles, is a blogger, and is co-author of “Smart (Enough) Systems” (Prentice Hall, 2007). He has 25 years of experience as an actuary, software engineer and systems integrator.
The data warehouse’s goal is timely delivery of trusted data to support decision-enabling insights. However, it’s difficult to get insights out of an environment that’s hard to see inside of. This is why, as much as is possible given the necessities of data privacy, a data warehouse should be turned into a glass house, allowing us to see data quality and business intelligence challenges as they truly are.
Trusted data is not perfect data. Trusted data is transparent data, honest about its imperfections, and realistic about the practical trade-offs between delivery and quality. You can’t fix what you can’t see, but even more important, concealing or ignoring known data quality issues is only going to decrease business users’ trust of the data warehouse. Perfect data is impossible, but the more control enforced wherever data originates, and the more monitoring performed wherever data flows, the better overall data quality will be in the warehouse. (more…)
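To make the idea of data that is "honest about its imperfections" concrete, here is a minimal sketch of a column profile that reports quality problems instead of hiding them. The function name `profile_column`, the row layout and the sample rows are all assumptions made up for illustration; they do not come from any particular warehouse or product.

```python
def profile_column(rows, column):
    """Return simple quality counts for one column of a list of dict rows."""
    total = len(rows)
    # Treat None and the empty string as "missing" (an assumption for this sketch).
    present = [r[column] for r in rows if r.get(column) not in (None, "")]
    missing = total - len(present)
    return {
        "rows": total,
        "missing": missing,
        "missing_rate": missing / total if total else 0.0,
        "distinct": len(set(present)),
    }


# Made-up sample rows, for illustration only.
sample = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": ""},
    {"customer_id": 3, "email": None},
    {"customer_id": 4, "email": "a@example.com"},
]

report = profile_column(sample, "email")
print(report)
```

Publishing a report like this next to each load is one way to show business users the known gaps rather than leaving them to discover the gaps on their own.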
The reality in data warehousing is that the primary focus is on delivery. The data warehouse team is tasked with extracting, transforming, integrating, and loading data into the warehouse within increasingly tight timeframes. Twenty years ago, monthly data warehouse loads were common. Ten years ago, weekly loads became the norm. Five years ago, daily loads were called for. Nowadays, near-real-time analytics demands the data warehouse be loaded more frequently than once a day. (more…)
The Benefits of Product Information Management, by Andy Hayler, CEO of “The Information Difference”
A recent survey by The Information Difference of well over 100 large organisations found that, on average, they had nine separate systems providing competing sources of product data (13% of respondents had over 100 sources). As can be imagined, that diversity of product data creates headaches for anyone trying to measure business performance, e.g. “what are our most profitable products?” is an easy question to ask but a tough one to answer if no one can agree what a product is, or into which category it is placed.
It also presents operational problems: if you are a retailer who has high street stores, a print catalogue operation, and also an eCommerce web site, then how are you to ensure a consistent process for onboarding and updating product information if different parts of the business have different systems and definitions? Customers that see a special offer online will expect that offer to be available in a store or vice versa, and will not be happy if it is not. There are further issues with eCommerce compared to a retail store: in a store customers can touch and see a product, so online they need more detailed information in order to have the confidence to purchase, such as detailed images of the product and its specifications.
Various phases of application consolidation, including ERP, have failed to improve this situation. Master data management has evolved as a discipline and technology to provide dedicated hubs of high quality data in an enterprise that can serve other systems as needed. It may be impractical to switch off all those legacy systems, but you can put in place a new hub for your product data where the data is trustworthy. This can then be linked directly back to other systems, either in batch or in real time via a web service, so that new product data, when updated in the master data hub, can be immediately used in other systems such as an inventory system or eCommerce web site.
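The hub-and-subscriber pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions: `ProductHub`, its methods and the two consuming "systems" are hypothetical names, and a real hub would push updates over a web service or batch interface rather than in-process callbacks.

```python
class ProductHub:
    """Toy master data hub: one trusted record per product, pushed to subscribers."""

    def __init__(self):
        self._records = {}      # product_id -> attribute dict (the master copy)
        self._subscribers = []  # callables notified on every update

    def subscribe(self, callback):
        """Register a downstream system that wants to hear about updates."""
        self._subscribers.append(callback)

    def upsert(self, product_id, **attributes):
        """Create or update a master record, then notify every subscriber."""
        record = self._records.setdefault(product_id, {})
        record.update(attributes)
        for notify in self._subscribers:
            # Hand out a copy so consumers cannot mutate the master record.
            notify(product_id, dict(record))
        return dict(record)


# Two stand-in downstream systems (hypothetical).
inventory = {}
web_shop = {}

hub = ProductHub()
hub.subscribe(lambda pid, rec: inventory.__setitem__(pid, rec))
hub.subscribe(lambda pid, rec: web_shop.__setitem__(pid, rec["name"]))

hub.upsert("P-1", name="Hypothetical widget", price=9.99)
hub.upsert("P-1", price=8.49)  # the price change reaches the inventory system at once
```

The point of the design is that downstream systems never edit product data themselves; they only receive what the hub publishes.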
There are various approaches to master data management: some technologies are designed to deal with all kinds of different master data (customer, product, asset, location etc.) while others specialise in a particular data domain, such as product or customer. There are reasons why specialising can make sense. Product data is much more complex than customer name and address data, with materials master files often appearing in unstructured files. Such data needs to be parsed and structured and then validated, requiring different approaches to those used to handle address data. Moreover the classification of products can be complex, with something like a camera having a large number of components and options, so systems to handle product data must be strong at handling complex classification hierarchies.
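As a rough illustration of the "parse, structure, then validate" step for product data, the sketch below pulls a name and a dimension out of a free-text description. The text format, the regular expression and the validation rules are assumptions invented for this example; real materials master files vary widely and need far richer rules.

```python
import re

# Assumed layout of a free-text line: "<name>, <number> <unit> ...".
PATTERN = re.compile(
    r"(?P<name>.+?),\s*(?P<size>\d+(?:\.\d+)?)\s*(?P<unit>mm|cm|m)\b",
    re.IGNORECASE,
)


def parse_product(text):
    """Turn a free-text description into a structured dict, or None if it does not fit."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return {
        "name": match.group("name").strip(),
        "size": float(match.group("size")),
        "unit": match.group("unit").lower(),
    }


def validate(product):
    """Return a list of problems; an empty list means the record passed."""
    if product is None:
        return ["unparseable"]
    errors = []
    if not product["name"]:
        errors.append("missing name")
    if product["size"] <= 0:
        errors.append("size must be positive")
    return errors


parsed = parse_product("Hex bolt, 12 MM steel")
print(parsed, validate(parsed))
```

Address data would need an entirely different pattern and rule set, which is one reason domain-specific tools can make sense.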
One example of a company confronting this issue is Kramp, Europe’s largest wholesaler of spare parts for motorized equipment. With 2,000 suppliers, it used to take weeks to transfer new product data from suppliers into the company’s internal systems and its eCommerce hub. By implementing a product data hub they were able to radically streamline this process, allowing suppliers to interact directly with the product data hub, and for this data to be consistently updated in the systems that need it, without need for time-consuming interactions with the suppliers to discuss particular data formats. This has led to higher margins due to being able to take advantage of “Long Tail” niche items, lower process costs and quicker reaction time, important in new markets.
Improved multichannel processes, such as in this example, are why more and more companies are evaluating master data management solutions in order to finally tackle the issue of inconsistent product data. Such improvements bring businesses real, quantified benefits, which is why master data management is arguably the fastest growing enterprise software segment right now.
Over the last few years most enterprises have implemented several (if not more) large ERP and CRM suites. Although these applications were meant to have self-contained data models, it turns out that many enterprises still need to manage “master data” between the various applications. So the traditional IT role of hardware administration and custom programming has evolved to packaged application implementation and large scale data management. According to Wikipedia: “MDM has the objective of providing processes for collecting, aggregating, matching, consolidating, quality-assuring, persisting and distributing such data throughout an organization to ensure consistency and control in the ongoing maintenance and application use of this information.” Instead of designing large data warehouses to maintain the master data, many organizations turn to packaged Master Data Management (MDM) products (such as Informatica MDM). With these tools at hand, IT shops can then build true Customer Master, Product Master (Product Information Management – PIM), Employee, or Supplier Master solutions. (more…)
By John Wollman, Executive Vice President, HighPoint Solutions, www.highpoint-solutions.com
Over the next two years, leading up to the ICD-10 “go live” date of October 1, 2014, there will be many procurement cycles for ICD-10 mapping and crosswalk tools. At present, we are seeing an evolution in the philosophy relating to the management of codes and mappings that will influence buyer decisions on the types of tools to incorporate into an ICD-10 program.
In considering mapping and crosswalk tools, leading companies are viewing the problem from an ongoing operational perspective, and not simply through a transition/conversion lens. The notion is that the complexity of ICD-10 is not limited to a one-time transition or conversion. Rather, the complexity will continue to be a problem well after October 1, 2014. The ongoing operational requirements call more for Master Data Management (MDM) approaches to managing codesets, mappings and the enterprise artifacts that are composed of codes. (more…)
Big Data, Big Problems: Leveraging Informatica 9.5 to Build an Effective Data Governance Strategy to Meet the Big Data Challenge
By: Chris Cingrani, Informatica DQ & MDM Practice Lead, Data Management Practice at SSG Ltd., www.ssglimited.com
Big data is something that I am continually asked about by clients, as the subject continues to gain significant press. While discussing this topic, I often address it from the angle that bigger data volumes will result in bigger data problems. Although this seems like a logical premise, what it really means to an organization, and how to plan accordingly, is often overlooked. Rather than solve the problem in this blog post, I want to focus on two key considerations from a data governance standpoint, as well as discuss why SSG sees Informatica 9.5 as a core component of a sound data governance strategy that can ensure an organization’s business decision-making success. (more…)
A post from the TABB Group
For the biggest swaps dealers, creation of their new OTC derivatives infrastructure will include rebuilding existing platforms, buying key elements from technology providers, leveraging technology already in place in other asset classes and, of course, building new platforms from scratch. This is not a buy-versus-build decision—it’s a careful balancing act of process and technology decisions to create a best-of-breed infrastructure. (more…)
When asked by the Conference Board in 2011 to rank the challenges that keep them up at night, U.S.-based CEOs put business growth in the number one position. Growing the business means growing the customer base by delivering a superior customer experience—and that demands leadership for the elimination of customer data silos and delivering complete, reliable customer data to the business.
The CIO is in a unique strategic position to help out—and emerge as (an unexpected) customer champion. Cases of CIOs taking on the role of customer champion, in my opinion, are not prevalent enough and represent a missed opportunity to advance the organization’s quest for customer profits. Companies need to focus on such immediately actionable key metrics as understanding the value of gained and lost customers, quantities of referrals, and the movement of customers from one level of profitability to another. I call these the “Guerrilla Metrics” because they power the customer onto the corporate agenda—and will help the CEO determine the value of the corporation based on its ability to manage customers as assets. This requires enabling the integration of customer data and driving that as a priority. (more…)