Category Archives: CIO
Speed is the top challenge facing IT today, and it’s reaching crisis proportions at many organizations. Specifically, IT needs to deliver business value at the speed that the business requires.
The challenge does not end there; this has to be accomplished without compromising cost or quality. Many people have argued that you only get two out of three on the Speed/Cost/Quality triangle, but I believe that achieving all three is the central challenge facing Enterprise Architects today. Many people I talk to are looking at agile technologies, and in particular Agile Data Integration.
There have been a lot of articles written about the challenges, but it’s not all doom and gloom. Here is something you can do right now to dramatically increase the speed of your project delivery while improving cost and quality at the same time: take a fresh look at your Agile Data Integration environment, and specifically at Data Virtualization. Data Virtualization offers the opportunity to simplify and speed up the data part of enterprise projects, which is where more and more projects are spending 40% or more of their time. For more information and an industry perspective, you can download the latest Forrester Wave report for Data Virtualization Q1 2015.
Here is a quick example of how you can use Data Virtualization technology for rapid prototyping to speed up business value delivery:
- Use data virtualization technology to present a common view of your data to your business-IT project teams.
- IT and business can collaborate in realtime to access and manage data from a wide variety of very large data sources – eliminating the long, slow cycles of passing specifications back and forth between business and IT.
- Your teams can discover, profile, and manage data using a single virtual interface that hides the complexity of the underlying data.
- By working with a virtualization layer, you are assured that your teams are using the right data, data that can be verified by linking it to a Business Glossary with clear terms, definitions, owners, and business context, which reduces the chance of misunderstandings and errors.
- Leading offerings in this space include data quality and data masking tools in the interface, ensuring that you improve data quality in the process.
- Data virtualization means that your teams can be delivering in days rather than months and faster delivery means lower cost.
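To make the idea of a single virtual interface concrete, here is a minimal sketch in Python. It is not any vendor's product: it only uses SQLite from the standard library, and the two "source systems" (crm and billing), their tables, and the customer_360 view are hypothetical names I made up for illustration. The point is that consumers query one view, while the joining and a simple email mask happen behind it.

```python
import sqlite3

# Hypothetical setup: two separate "source systems", modeled here as two
# in-memory SQLite databases attached to a single connection.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?, ?)",
    [(1, "Ada", "ada@example.com"), (2, "Grace", "grace@example.com")],
)
conn.executemany(
    "INSERT INTO billing.invoices VALUES (?, ?)",
    [(1, 120.0), (1, 80.0), (2, 50.0)],
)

# The "virtual" layer: a TEMP view that joins both sources and masks the
# email address. No data is copied; the view is evaluated at query time.
conn.execute("""
    CREATE TEMP VIEW customer_360 AS
    SELECT c.id,
           c.name,
           substr(c.email, 1, 1) || '***' || substr(c.email, instr(c.email, '@'))
               AS masked_email,
           COALESCE(SUM(i.amount), 0) AS total_billed
    FROM crm.customers AS c
    LEFT JOIN billing.invoices AS i ON i.customer_id = c.id
    GROUP BY c.id, c.name, c.email
""")

# Business and IT users only ever see the common view.
rows = conn.execute("SELECT * FROM customer_360 ORDER BY id").fetchall()
for row in rows:
    print(row)
```

A real data virtualization platform federates far more varied sources than this, but the shape is the same: the project team prototypes against the view, and the underlying sources can change without the consumers noticing.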
There has been a lot of interest in agile development, especially as it relates to data projects. Data Virtualization is a key tool to accelerate your team in this direction.
Informatica has a leading position in the Forrester report due to the productivity of the Agile Data Integration environment but also because of the integration with the rest of the Informatica platform. From an architect’s point of view it is critical to start standardizing on an enterprise data management platform. Continuing data and data tool fragmentation will only slow down future project delivery. The best way to deal with the growing complexity of both data and tools is to drive standardization within your organizations.
I recently got to talk to several senior IT leaders about their views on information governance and analytics. Participating were a telecom company, a government transportation entity, a consulting company, and a major retailer. Each shared openly in what was a free flow of ideas.
The CEO and corporate culture are critical to driving a fact-based culture
I started this discussion by sharing the COBIT Information Life Cycle. Everyone agreed that the starting point for information governance needs to be business strategy and business processes. However, this prompted an extremely interesting discussion about enterprise analytics readiness. Most said that they are in the midst of leading the proverbial horse to water; in this case the horse is the business. The CIO in the group said that he personally is all about the data and making factual decisions, but his business is not really there yet. I asked everyone at this point about the importance of culture and the CEO. Everyone agreed that the CEO is incredibly important in driving a fact-based culture. Apparently, people like the new CEO of Target are in the vanguard and not yet the mainstream.
KPIs need to be business drivers
The above CIO said that too many of his managers are focused on day-to-day operations and don’t understand the value of analytics or of predictive analytics. This CIO said that he needs to teach the business to think analytically and to understand how analytics can help drive the business, as well as how to use Key Performance Indicators (KPIs). The enterprise architect in the group shared at this point that he had previously worked for a major healthcare organization. When the organization was asked to determine a list of KPIs, it came back with 168 KPIs. Obviously, this could not work, so he explained to the business that an effective KPI must be a “driver of performance”. He stressed to the healthcare organization’s leadership the importance of having fewer KPIs and of building those that do get produced around business capabilities and performance drivers.
IT increasingly needs to understand its customers’ business models
I shared at this point that I visited a major Italian bank a few years ago. The key leadership had high definition displays that would roll by an analytic every five minutes. Everyone laughed at the absurdity of having so many KPIs. But with this said, everyone felt that they needed to get business buy-in because only the business can derive the value from acting upon the data. According to this group of IT leaders, this is causing them more and more to understand their customers’ business models.
Others said that they were trying to create an omni-channel view of customers. The retailer wanted to get more predictive. While Theodore Levitt said the job of marketing is to create and keep a customer, this retailer is focused on keeping the customer and bringing the customer back more often. They want to give customers offers that use customer data to increase sales, much like what I described recently was happening at 58.com, eBay, and Facebook.
Most say they have limited governance maturity
We talked about where people are in their governance maturity. Even though I wanted to gloss over this topic, the group wanted to spend time here and compare notes with each other. Most said that they were at stage 2 or 3 in a five-stage governance maturity process. One CIO said, gee, does anyone ever get to level 5? Like analytics, governance was being pushed forward by IT rather than the business. Nevertheless, everyone said that they are working to get data stewards defined for each business function. At this point, I asked about the elements that COBIT 5 suggests go into good governance. I shared that it should include the following four elements: 1) clear information ownership; 2) timely, correct information; 3) clear enterprise architecture and efficiency; and 4) compliance and security. Everyone felt the definition was fine but wanted specifics with each element. I referred them and you to my recent article in COBIT Focus.
CIO says they are the custodians of data only
At this point, one of the CIOs said something incredibly insightful: we are not data stewards. This has to be done by the business; IT is the custodian of the data. More specifically, we should not manage data, but we should make sure that what the business needs done gets done with data. Everyone agreed with this point and even reused the term “data custodians” several times during the next few minutes. Debbie Lew of COBIT said the same thing just last week. According to her, “IT does not own the data. They facilitate the data”. From here, the discussion moved to security and data privacy. The retailer in the group was extremely concerned about privacy and felt that they needed masking and other data-level technologies to ensure a breach minimally impacts their customers. At this point, another IT leader in the group said that it is the job of IT leadership to make sure the business does the right things in security and compliance. I shared here that one of my CIO friends had said that “the CIOs at the retailers with breaches weren’t stupid—it is just hard to sell the business impact”. The CIO in the group said that we need to do risk assessments (also a big thing for COBIT 5) that get the business to say we have to invest to protect. “It is IT’s job to adequately explain the business risk”.
Is mobility a driver of better governance and analytics?
Several shared towards the end of the evening that mobility is an increasing impetus for better information governance and analytics. Mobility is driving business users and business customers to demand better information and thereby, better governance of information. Many said that a starting point for providing better information is data mastering. These attendees felt as well that data governance involves helping the business determine its relevant business capabilities and business processes. It seems that these should come naturally, but once again, IT for these organizations seems to be pushing the business across the finish line.
This is an age of technology disruption and digitization. Winners will be those organizations that can adapt quickly and drive business transformation on an ongoing basis.
When I first met John Schmidt, Vice President of Global Integration Services at Informatica, he asked me to visualize Business Transformation as “A modern tool like the internet and Google Maps, with which planning a road trip from New York to San Francisco with a number of stops along the way to visit friends or see some sights takes just minutes. So you’re halfway through the trip and a friend calls to say he has suddenly been called out of town, you get on your mobile phone and within a few minutes, you have a new roadmap and a new plan.”
So, why is it that creating a roadmap for an enterprise initiative takes months or even years, and that once such a plan is developed, it is nearly impossible to change even when new information or external events invalidate it? A single transformation is useful, but what you really want is the ability to transform your business on an ongoing basis. You need to be agile in planning the transformation initiative itself. Is it even feasible to achieve a planning capability for complex enterprise initiatives that could approach the speed and agility of cross-country road-trip planning?
The short answer is YES; you can get much faster if you do three things:
First, throw out old notions of how planning in complex corporate environments is done, while keeping in mind that planning an enterprise transformation is fundamentally different than planning a focused departmental initiative.
Second, invest in tools equivalent to Google Maps for building the enterprise roadmap. Google Maps works because it leverages a database of information about roads, rules of the roads, related local services, and points of interest. In short, Google Map the enterprise, which is not as onerous as it sounds.
Third, develop a team of Enterprise Architects and planners with the skills and discipline to use the BOST™ Framework to maintain the underlying reference data about the business, its operations, the systems that support it, and the technologies that they are based on. This will provide the execution framework for your organization to deliver the data to fuel your business initiatives and digital strategy.
This results in a closer alignment of your business and IT organizations and fewer errors due to communication issues. And because your business plans are linked directly to the underlying technical implementation, your business value will be delivered more quickly.
This is not some “pie in the sky” theory or a futuristic dream. What you need is a tool like Google Maps for Business Transformation. That tool is the BOST™ Toolkit, which leverages the BOST™ Framework. Through models, elements, and associated relationships built around an underlying Metamodel, the framework interprets enterprise processes using a 4-dimensional view driven by business, operations, systems, and technology. Informatica built the BOST™ Framework in collaboration with certified partners. It provides an Architecture-led Planning approach for business transformation.
Benefits of Architecture-led Planning
The Architecture-led Planning approach is effective when applied with governance and oversight. The following four features describe the benefits:
Enablement of Business and IT Collaboration – Uses a common reference model to facilitate cross-functional business alignment, as well as alignment between business and IT. The model gets everyone on the same page, regardless of line of business, location, or IT function. This model explicitly and dynamically starts with business strategy and links from there to the technical implementation.
Data-driven Planning – Being able to capture data in a structured repository helps with rapid planning. A data-driven plan makes it dynamic and adaptable to changing circumstances. When the plan changes, rather than updating dozens of documents, simply apply the change to the relevant components in the enterprise model repository and all business and technical model views that reference that component update automatically.
Cross-Functional Decision Making – Cross-functional decision-making is facilitated in several ways. First, by showing interdependencies between functions, business operations, and systems, the holistic view helps each department or team to understand the big-picture and its role in the overall process. Second, the future state architectural models are based on a view of how business operations will change. This provides the foundation to determine the business value of the initiative, measure your progress, and ultimately report the achievement of the goals. Quantifiable metrics help decision makers look beyond the subjective perspectives and agree on fact-based success metrics.
Reduced Execution Risk – Reduced execution risk results from having a robust and holistic plan based on a rigorous analysis of all the dependent enterprise components in the business, operations, systems and technology view. Risk is reduced with an effective governance discipline both from a program management as well as from an architectural change perspective.
Business Transformation with Informatica
Integrated Program Planning is for organizations that need large or complex Change Management assistance. Examples of candidates for Integrated Program Planning include:
Enterprise Initiatives: Large-scale mergers or acquisitions, switching from a product-centric operating model to more customer-centric operations, restructuring channel or supplier relationships, rationalizing the company’s product or service portfolio, or streamlining end-to-end processes such as order-to-cash, procure-to-pay, hire-to-retire or customer on-boarding.
Top-level Directives: Examples include board-mandated data governance, regulatory compliance initiatives that have broad organizational impacts such as data privacy or security, or risk management initiatives.
Expanding Departmental Solutions into Enterprise Solutions: Successful solutions in specific business areas can often be scaled-up to become cross-functional enterprise-wide initiatives. For example, expanding a successful customer master data initiative in marketing to an enterprise-wide Customer Information Management solution used by sales, product development, and customer service for an Omni-channel customer experience.
The BOST™ Framework identifies and defines enterprise capabilities. These capabilities are modularized as reconfigurable and scalable business services. These enterprise capabilities are independent of organizational silos and politics, which provide strategists, architects, and planners the means to drive for high performance across the enterprise, regardless of the shifting set of strategic business drivers. The BOST™ Toolkit facilitates building and implementing new or improved capabilities, adjusting business volumes, and integrating with new partners or acquisitions through common views of these building blocks and through reusing solution components. In other words, Better, Faster, Cheaper projects.
The BOST™ View creates a visual understanding of the relationship between business functions, data, and systems. It helps with the identification of relevant operational capabilities and underlying support systems that need to change in order to achieve the organization’s strategic objectives. The result will be a more flexible business process with greater visibility and the ability to adjust to change without error.
I won’t say I’ve seen it all; I’ve only scratched the surface in the past 15 years. Below are some of the mistakes I’ve made or fixed during this time.
MongoDB as your Big Data platform
Why am I picking on MongoDB? Because it is the most abused NoSQL database at this point. While Mongo has an aggregation framework that tastes like MapReduce and even a very poorly documented Hadoop connector, its sweet spot is as an operational database, not an analytical system.
RDBMS schema as files
You dumped each table from your RDBMS into a file and stored that on HDFS, and you now plan to use Hive on it. You know that Hive is slower than an RDBMS; it’ll use MapReduce even for a simple select. Next, let’s look at row sizes: you have flat files measured in single-digit kilobytes.
Hadoop does best on large sets of relatively flat data. I’m sure you can create an extract that’s more de-normalized.
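As a rough illustration of what a more de-normalized extract looks like, here is a minimal Python sketch. The customers and orders tables and their fields are hypothetical; the idea is simply to resolve the join once, at extract time, and write one wide, flat record per order instead of one small file per source table.

```python
import json

# Hypothetical normalized source tables, as they might come out of an RDBMS.
customers = {
    1: {"name": "Ada", "region": "EMEA"},
    2: {"name": "Grace", "region": "AMER"},
}
orders = [
    {"order_id": 10, "customer_id": 1, "total": 25.0},
    {"order_id": 11, "customer_id": 2, "total": 40.0},
    {"order_id": 12, "customer_id": 1, "total": 15.0},
]


def denormalize(orders, customers):
    """Yield one flat record per order with the customer columns folded in."""
    for order in orders:
        customer = customers[order["customer_id"]]
        yield {
            **order,
            "customer_name": customer["name"],
            "customer_region": customer["region"],
        }


# One JSON document per line is a common flat layout for bulk analysis.
flat_records = list(denormalize(orders, customers))
lines = [json.dumps(record, sort_keys=True) for record in flat_records]
print("\n".join(lines))
```

With the join already done, a downstream query can scan one large file rather than stitching several small ones back together.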
Instead of creating a single Data Lake, you created a series of data ponds or a data swamp. Conway’s law has struck again; your business groups have created their own mini-repositories and data analysis processes. That doesn’t sound bad at first, but with different extracts and ways of slicing and dicing the data, you end up with different views of the data, i.e., different answers for some of the same questions.
Schema-on-read doesn’t mean, “Don’t plan at all,” but it means “Don’t plan for every question you might ask.”
Missing use cases
Vendors, to escape the constraints of departmental funding, are selling the idea of the data lake. The byproduct of this is the business lost sight of real use cases. The data-lake approach can be valid, but you won’t get much out of it if you don’t have actual use cases in mind.
It isn’t hard to come up with use cases, but that is always an afterthought. The business should start thinking of the use cases when their databases can’t handle the load.
To do a larger bit of analytics, you may need a bigger tool set that may include Hive, Pig, MapReduce, R, and more.
As I have shared within the posts of this series, businesses are using analytics to improve their internal and external facing business processes and to strengthen their “right to win” within the markets in which they operate. Like healthcare institutions across the country, UPMC is striving to improve its quality of care and business profitability. One educational healthcare CEO put it to me this way: “if we can improve our quality of service, we can reduce costs while we increase our pricing power”. In UPMC’s case, they believe that the vast majority of their costs are in a fraction of their patients, but they want to prove this with real data and then use this information to drive their go-forward business strategies.
Getting more predictive to improve outcomes and reduce cost
Armed with this knowledge, UPMC’s leadership wanted to use advanced analytics and predictive modeling to improve clinical and financial decision making. Taking this action was seen as a way to produce better patient outcomes and reduce costs. A focus area for analysis involved creating “longitudinal records” for the complete cost of providing particular types of care. For those who aren’t versed in time series analysis, longitudinal analysis uses a series of observations obtained from many respondents over time to derive a relevant business insight. When I was involved in healthcare, I used this type of analysis to relate employee and patient engagement results to healthcare outcomes. In UPMC’s case, they wanted to use this type of analysis to understand, for example, the total end-to-end cost of a spinal surgery. UPMC wanted to look beyond the cost of surgery and account for the pre-surgery care and recovery-related costs. To do this for the entire hospital, however, it needed to bring together data from hundreds of sources across UPMC and outside entities, including labs and pharmacies. With this information, UPMC’s leadership saw the potential to create an accurate and comprehensive view which could be used to benchmark future procedures. Additionally, UPMC saw the potential to automate the creation of patient problem lists or examine clinical practice variations. But like the other case studies that we have reviewed, these steps required trustworthy and authoritative data to be accessed with agility and ease.
UPMC starts with a large, multiyear investment
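To show what a longitudinal cost record might look like in practice, here is a small Python sketch. It is purely illustrative and not UPMC's method: the event tuples, phase names, sources, and dollar amounts are invented. It rolls up cost events from several sources into a per-patient view of pre-surgery, surgery, and recovery costs plus an end-to-end total.

```python
from collections import defaultdict
from datetime import date

# Hypothetical cost events: (patient, date, phase of care, source system, cost).
events = [
    ("p1", date(2014, 1, 5), "pre_surgery", "lab", 300.0),
    ("p1", date(2014, 1, 9), "pre_surgery", "imaging", 1200.0),
    ("p1", date(2014, 2, 1), "surgery", "hospital", 25000.0),
    ("p1", date(2014, 2, 10), "recovery", "pharmacy", 150.0),
    ("p1", date(2014, 3, 1), "recovery", "physical_therapy", 900.0),
    ("p2", date(2014, 2, 3), "surgery", "hospital", 24000.0),
    ("p2", date(2014, 2, 20), "recovery", "pharmacy", 400.0),
]


def episode_costs(events):
    """Return {patient: {phase: cost, ..., 'total': end-to-end cost}}."""
    by_patient = defaultdict(lambda: defaultdict(float))
    # Sorting by patient and date keeps each patient's record in time order.
    for patient, _day, phase, _source, cost in sorted(events, key=lambda e: (e[0], e[1])):
        by_patient[patient][phase] += cost
    return {
        patient: {**phases, "total": sum(phases.values())}
        for patient, phases in by_patient.items()
    }


costs = episode_costs(events)
for patient, breakdown in costs.items():
    print(patient, breakdown)
```

Once every episode has the same shape, benchmarking a future procedure against past ones becomes a comparison of like with like.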
In October 2012, UPMC made a $100 million investment to establish an enterprise analytics initiative that would bring clinical, financial, administrative, genomic, and other information together in one place for the first time. Tom Davenport, the author of Competing on Analytics, suggests in his writing that establishing an enterprise analytics capability represents a major step forward because it allows enterprises to answer the big questions, to better tie strategy and analytics, and to finally rationalize application interconnect and business intelligence spending. As UPMC put its plan together, it realized that it needed to impact more than 1200 applications. It also realized that it needed one system to manage, with data integration, master data management, and eventually complex event processing capabilities. At the same time, it addressed the people side of things by creating a governance team to manage data integrity improvements, ensuring that trusted data populates enterprise analytics and providing transparency into data integrity challenges. One of UPMC’s goals was to provide self-service capabilities. According to Terri Mikol, a project leader, “We can’t have people coming to IT for every information request. We’re never going to cure cancer that way.” Here is an example of the promise shown within the first eight months of this project. Researchers were able to integrate, for the first time ever, clinical and genomic information on 140 patients previously treated for breast cancer. Traditionally, these data have resided in separate information systems, making it difficult, if not impossible, to integrate and analyze dozens of variables. The researchers found intriguing molecular differences in the makeup of pre-menopausal vs. post-menopausal breast cancer, findings which will be further explored. For UPMC, this initial cancer insight is just the starting point of their efforts to mine massive amounts of data in the pursuit of smarter medicines.
Building the UPMC Enterprise Analytics Capability
To create their enterprise analytics platform, UPMC determined it was critical to establish “a single, unified platform for data integration, data governance, and master data management,” according to Terri Mikol. The solution required a number of key building blocks. The first was data integration, which collects and cleanses data from hundreds of sources and organizes it into repositories that enable fast, easy analysis and reporting by and for end users.
Specifically, the UPMC enterprise analytics capability pulls clinical and operational data from a broad range of sources, including systems for managing hospital admissions, emergency room operations, patient claims, health plans, electronic health records, as well as external databases that hold registries of genomic and epidemiological data needed for crafting personalized and translational medicine therapies. UPMC has integrated quality checked source data in accordance with industry-standard healthcare information models. This effort included putting together capabilities around data integration, data quality and master data management to manage transformations and enforce consistent definitions of patients, providers, facilities and medical terminology.
As noted, the cleansed and harmonized data is organized into specialized genomics databases, multidimensional warehouses, and data marts. The approach makes use of traditional data warehousing approaches as well as big data capabilities to handle unstructured data and natural language processing. UPMC has also deployed analytical tools that allow end users to exploit the data made available by the Enterprise Analytics platform. The tools drive everything from predictive analytics and cohort tracking to business and compliance reporting. And UPMC did not stop here. If their data had value, then it needed to be secured. UPMC created data audits and data governance practices. They also implemented a dynamic data masking solution that ensures data security and privacy.
As I have discussed, many firms are pushing point silo solutions into their environments, but as UPMC shows, this limits their ability to ask the bigger business questions or, in UPMC’s case, to discover things that can change people’s lives. Analytics are more and more a business enabler if they are organized as an enterprise analytics capability. I have also come to believe that analytics have become a foundational capability for every firm’s right to win. They inform a coherent set of capabilities and establish a firm’s go-forward right to win. For this, UPMC is a shining example of getting things right.
Author Twitter: @MylesSuer
Recently, I got to attend the Predictive Analytics Summit in San Diego. It felt great to be in a room full of data scientists from around the world; all my hidden statistics, operations research, and even modeling background came back to me instantly. I was most interested to learn what this vanguard was doing, as well as any lessons learned that could be shared with the broader analytics audience. Presenters ranged from Internet leaders to more traditional companies like Scotts Miracle Gro. Brendan Hodge of Scotts Miracle Gro in fact said that, as a 125-year-old company, he feels like “a dinosaur at a mammal convention”. So in the space that follows, I will share my key take-aways from some of the presenters.
Fei Long from 58.com
58.com is the Craigslist, Yelp, and Monster of China. Fei shared that 58.com is using predictive analytics to recommend resumes to employers and to drive more intelligent real time bidding for its products. Fei said that 58.com has 300 million users—about the number of people in the United States. Most interesting, Fei said that predictive analytics has driven a 10-20% increase in 58.com’s click through rate.
Ian Zhao from eBay
Ian said that eBay is starting to increase the footprint of its data science projects. He said that historically the focus for eBay’s data science was marketing, but today eBay is applying data science to sales and HR. Provost and Fawcett agree in “Data Science for Business”, saying that “the widest applications of data mining techniques are in marketing for tasks such as target marketing, online advertising, and recommendations for cross-selling”.
Ian said that in the non-marketing areas, they are finding a lot less data. The data is scattered across data sources and requires a lot more cleansing. Ian is using things like time series and ARIMA to look at employee attrition. One thing that Ian found particularly interesting is that there is a strong correlation between attrition and bonus payouts. Ian said it is critical to leave ample time for data prep. He said that it is important to start the data prep process by doing data exploration and discovery. This includes confirming that data is available for hypothesis testing. Sometimes, Ian said, the data prep process can include inputting data that is not available in the data set and validating data summary statistics. With this, Ian said that data scientists need to dedicate time and resources to determining what things are drivers. He said that with the business, data scientists should talk about likelihood because business people in general do not understand statistics. It is important as well that data scientists ask business people the “so what” questions. Data scientists should narrow things down to a dollar impact.
Barkha Saxena from Poshmark
Barkha is trying to model the value of user growth. Barkha said that this matters because Poshmark wants to be the #1 community driven marketplace. They want to use data to create a “personal boutique experience”. With 700,000 transactions a day, they are trying to measure customer lifetime value by implementing a cohort analysis. What was the most interesting in Barkha’s data is she discovered repeatable performance across cohorts. In their analysis, different models work better based upon the data—so a lot of time goes into procedurally determining the best model fit.
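For readers who have not run a cohort analysis, here is a minimal Python sketch of the mechanics. It is not Poshmark's model: the users, signup months, and purchase amounts are invented. Users are grouped by the month they joined, and revenue per user is tracked by months since joining, so that cohorts of different ages can be compared on the same footing.

```python
from collections import defaultdict

# Hypothetical data: signup month per user, and (user, month, amount) purchases.
signup_month = {"u1": 0, "u2": 0, "u3": 1, "u4": 1}
purchases = [
    ("u1", 0, 20.0), ("u1", 1, 10.0),
    ("u2", 0, 30.0), ("u2", 2, 10.0),
    ("u3", 1, 24.0), ("u3", 2, 12.0),
    ("u4", 1, 26.0),
]


def cohort_table(signup_month, purchases):
    """Return {(cohort, months_since_signup): average revenue per cohort member}."""
    cohort_size = defaultdict(int)
    for month in signup_month.values():
        cohort_size[month] += 1

    revenue = defaultdict(float)
    for user, month, amount in purchases:
        cohort = signup_month[user]
        revenue[(cohort, month - cohort)] += amount

    # Divide by the full cohort size, not just the buyers, so that the
    # figures are comparable across cohorts.
    return {key: total / cohort_size[key[0]] for key, total in revenue.items()}


table = cohort_table(signup_month, purchases)
for (cohort, age), value in sorted(table.items()):
    print(f"cohort {cohort}, month +{age}: {value:.2f} per user")
```

In this toy data both cohorts spend the same amount per user in their first month, which is the kind of repeatable pattern across cohorts that makes a lifetime value estimate believable.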
Meagan Huth from Google
Meagan said that Google is creating something that they call People Analytics. They are trying to make all people decisions by science and data. They want to make it cheaper and easier to work at Google. They have found through their research that good managers lower turnover, increase performance, and increase workplace happiness. The most interesting thing that she says they have found is that the best predictor of being a good manager is being a good coach. They have developed predictive models around text threads, including those that occur in employee surveys, to ensure they have the data needed to improve.
Hobson Lane from Sharp Labs
Hobson reminded everyone of the importance of Nyquist (you need to sample data at least twice as fast as the fastest data event). This is especially important for organizations moving to the so-called Internet of Things. Many of these devices have extremely large data event rates. Hobson also discussed the importance of looking at variance against the line that gets drawn in a regression analysis. Sometimes, multiple lines can be drawn. He also discussed the problem of not having enough data to support the complexity of the decision that needs to be made.
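The Nyquist point is easy to see with a few lines of Python. This is my own sketch, not something shown at the summit, and the frequencies are arbitrary: a 9 Hz signal sampled at only 10 Hz produces exactly the same samples (apart from the sign) as a 1 Hz signal, so the fast event becomes indistinguishable from a slow one.

```python
import math


def sample(freq_hz, rate_hz, n_samples):
    """Sample a unit sine wave of freq_hz at rate_hz samples per second."""
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(n_samples)]


def minimum_rate(freq_hz):
    """Sampling rate the rule of thumb above asks for: twice the fastest event."""
    return 2 * freq_hz


fast_event = sample(9.0, 10.0, 20)   # 9 Hz signal, undersampled at 10 Hz
slow_event = sample(1.0, 10.0, 20)   # 1 Hz signal, comfortably sampled at 10 Hz

# Every undersampled 9 Hz reading matches the 1 Hz reading with the sign
# flipped: the samples alone cannot tell the two signals apart.
aliased = all(
    math.isclose(fast, -slow, abs_tol=1e-9)
    for fast, slow in zip(fast_event, slow_event)
)

print("9 Hz sampled at 10 Hz looks like 1 Hz:", aliased)
print("rate needed for a 9 Hz event:", minimum_rate(9.0), "Hz")
```

For a sensor on the Internet of Things, the practical consequence is that the polling interval has to be chosen from the fastest behavior you care about, not from what is convenient to store.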
Ravi Iyer from Ranker
Ravi started by saying Ranker is a Yelp for everyone else. He then discussed the importance of having systematic data. A nice quote from him is as follows: “better data=better predictions”. Ravi also discussed the topic of response bias. He said that a question can get a different answer depending on whether you ask about Coke or about Coke at a movie. Interestingly, he discussed how their research shows that millennials are really all about “the best”. I see this happening every time that I take my children out to dinner; there is no longer a cheap dinner out.
Ranjan Sinha at eBay
Ranjan discussed the importance of customer-centric commerce and creating predictive models around it. At eBay, they want to optimize the customer experience and improve their ability to make recommendations. eBay is finding customer expectations are changing. For this reason, they want customer context to be modeled by looking at transactions, engagement, intent, account, and inferred social behaviors. With modeling completed, they are using complex event processing to drive a more automated response to data. An amazing example given was for Valentine’s Day, where they use a man’s partner’s data to predict the items that the man should get for his significant other.
Andrew Ahn from LinkedIn
Andrew is using analytics to create what he calls an economic graph and to make professionals more productive. One area where he personally is applying predictive analytics is LinkedIn’s sales solutions. In LinkedIn Sales Navigator, they display potential customers based upon the sales person’s demographic data; effectively, the system makes lead recommendations. However, they want to de-risk this potential interaction for sales professionals and potential customers. Andrew says at the same time that they have found through data analysis that small changes in a LinkedIn profile can lead to big changes. To put this together, they have created something that they call the social selling index. It looks at predictors that they have determined are statistically relevant, including member demographic, site engagement, and social network. The SSI score is viewed as a predictive index. Andrew says that they are trying to go from serendipity to data science.
Robert Wilde from Slacker Radio
Robert discussed the importance of simplicity and elegance in model building. He then went through a set of modeling issues to avoid. He said that modelers need to own the discussion of cause and effect and how it can bias data interpretation. In addition, he stressed looking at data variance: what does one do when a line doesn’t have a single point fall on it? Additionally, Robert discussed what to do when correlation is strong, weak, or mistaken. Is it X or Y that has the relationship? Or, worse yet, what do you do when the correlation is coincidental? This led to a discussion of forward and reverse causal inference. For this reason, Robert argued strongly for principal component analysis, which in his view eliminates regression causational bias. At the same time, he suggested that models should be valued by complexity versus error rates.
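To make the X-versus-Y point concrete, here is a minimal sketch of my own (not code Robert showed). It compares the slope from regressing Y on X, the slope implied by regressing X on Y, and the slope of the first principal component for two variables. The function name `fit_slopes` and the sample numbers are made up for illustration.

```python
import math


def fit_slopes(xs, ys):
    """Return three slopes for the same 2-D data (hypothetical helper).

    y_on_x: ordinary least squares treating x as the cause.
    x_on_y: slope in the x-y plane implied by regressing x on y instead.
    pca:    slope of the first principal component, which treats
            both variables symmetrically.
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    if sxx == 0 or sxy == 0:
        raise ValueError("need varying x and a non-zero covariance")
    # Orientation of the major axis of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return {
        "y_on_x": sxy / sxx,
        "x_on_y": syy / sxy,
        "pca": math.tan(theta),
    }


# Made-up sample data, purely for illustration.
result = fit_slopes([1, 2, 3, 4, 5], [1.2, 1.9, 3.2, 3.8, 5.1])
print(result)
```

The two regression slopes differ depending on which variable you call the dependent one, while the principal component slope sits between them and does not change if X and Y swap roles. That symmetry is one way to read the argument for principal component analysis; it does not by itself establish causality.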
Parsa Bakhtary from Facebook
Parsa has been looking at which games generate revenue for Facebook and which do not. Amazingly, Facebook has over 1,000 revenue-bearing games. For this reason, Facebook wants to look at the lifetime value of customers for Facebook games, that is, the dollar value of a relationship. Parsa said, however, that there is a problem: only 20% pay for their games. Parsa argued that customer lifetime value (which was developed in the 1950s) doesn’t really work for apps where everyone’s lifetime is not the same. Additionally, social and mobile gamers are not particularly loyal. He says that he therefore has to model individual games for their first 90 days across all periods of joining and then look at the cumulative revenue curves.
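As a rough sketch of what a 90-day cumulative revenue curve could look like in code (my own illustration, not Facebook’s method), the function below aligns every player to their own join day and averages cumulative revenue per player. The function name, the data layout, and the sample players are all assumptions.

```python
def cumulative_revenue_curve(joins, purchases, horizon_days=90):
    """Average cumulative revenue per player by days since joining.

    joins:     dict of player id -> calendar day the player joined.
    purchases: list of (player id, calendar day, amount) tuples.
    Purchases outside the first `horizon_days` days of a player's
    life are ignored, so players who joined at different times
    are compared on the same footing.
    """
    daily = [0.0] * horizon_days
    for player, day, amount in purchases:
        age = day - joins[player]  # days since this player joined
        if 0 <= age < horizon_days:
            daily[age] += amount

    curve, running_total = [], 0.0
    for revenue in daily:
        running_total += revenue
        curve.append(running_total / len(joins))
    return curve


# Hypothetical data: three players, one of whom never pays.
joins = {"a": 0, "b": 10, "c": 20}
purchases = [("a", 0, 3.0), ("b", 12, 6.0), ("a", 95, 100.0)]
curve = cumulative_revenue_curve(joins, purchases)
print(curve[0], curve[2], curve[89])  # 1.0 3.0 3.0
```

Note that non-paying players stay in the denominator, and the day-95 purchase is dropped because it falls outside the 90-day window.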
So we have seen here a wide variety of predictive analytics techniques being used by today’s data scientists. To me this says that predictive analytical approaches are alive and kicking. This is good news and shows that data scientists are trying to enable businesses to make better use of their data. Clearly, a key step that holds data scientists back today is data prep. While it is critical to leave ample time for data prep, it is also essential to get quality data to ensure models are working appropriately. At the same time, data prep needs to support inputting data that is not available within the original data set.
Solution Brief: Data Prep
Author Twitter: @MylesSuer
I’ve spent most of my career working with new technology, most recently helping companies make sense of mountains of incoming data. This means, as I like to tell people, that I have the sexiest job in the 21st century.
Harvard Business Review put the data scientist into the national spotlight with its article “Data Scientist: The Sexiest Job of the 21st Century.” Job trends data from Indeed.com confirms the rise in popularity for the position, showing that the number of job postings for data scientist positions increased by 15,000%.
In the meantime, the role of data scientist has changed dramatically. Data used to reside on the fringes of the operation. It was usually important but seldom vital – a dreary task reserved for the geekiest of the geeks. It supported every function but never seemed to lead them. Even the executives who respected it never quite absorbed it.
For every Big Data problem, the solution often rests on the shoulders of a data scientist. The role of the data scientist is similar in responsibility to the Wall Street “quants” of the 80s and 90s. Now, these data experts are tasked with the management of databases previously thought too hard to handle and too unstructured to yield any value.
So, is it the sexiest job of the 21st Century?
Think of a data scientist as more of a business analyst-plus: part mathematician, part business strategist. These statistical savants are able to apply their background in mathematics to help companies tame their data dragons. But these individuals aren’t just math geeks, per se.
A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It’s almost like a renaissance individual who really wants to learn and bring change to an organization.
If this sounds like you, the good news is demand for data scientists is far outstripping supply. Nonetheless, with the rising popularity of the data scientist – not to mention the companies that are hiring for these positions – you have to be at the top of your field to get the jobs.
Companies look to build teams around data scientists that ask the most questions about:
- How the business works
- How it collects its data
- How it intends to use this data
- What it hopes to achieve from these analyses
These questions are important because data scientists will often unearth information that can “reshape an entire company.” Obtaining a better understanding of the business’s underpinnings not only directs the data scientist’s research, but also helps them present the findings and communicate with the less-analytical executives within the organization.
While it’s important to understand your own business, learning about the successes of other corporations will help a data scientist in their current job–and the next.
Talking to architects about analytics at a recent event, I kept hearing a familiar theme: data scientists are spending 80% of their time on “data wrangling,” leaving only 20% for delivering the business insights that will drive the company’s innovation. It was clear to everybody I spoke to that the situation will only worsen. The coming growth everybody sees in data volume and complexity will only lengthen the time to value.
Gartner recently predicted that:
“by 2015, 50% of organizations will give up on managing growth and will redirect funds to improve classification and analytics.”
Some of the details of this study are interesting. In the end, many organizations are coming to two conclusions:
- It’s risky to delete data, so they keep it around as insurance.
- All data has potential business value, so more organizations are keeping it around for potential analytical purposes.
The other mega-trend here is that more and more organizations are looking to compete on analytics – and they need data to do it, both internal data and external data.
From an architect’s perspective, here are several observations:
- The floodgates are open and analytics is a top priority. Given that, the emphasis should be on architecting to manage the dramatic increases in both data quantity and data complexity rather than on trying to stop it.
- The immediate architectural priority has to be on simplifying and streamlining your current enterprise data architecture. Break down those data silos and standardize your enterprise data management tools and processes as much as possible. As discussed in other blogs, data integration is becoming the biggest bottleneck to business value delivery in your environment. Gartner has projected that “by 2018, more than half the cost of implementing new large systems will be spent on integration.” The more standardized your enterprise data management architecture is, the more efficient it will be.
- With each new data type, new data tool (Hive, Pig, etc.), and new data storage technology (Hadoop, NoSQL, etc.) ask first if your existing enterprise data management tools can handle the task before people go out and create a new “data silo” based on the cool, new technologies. Sometimes it will be necessary, but not always.
- The focus needs to be on speeding value delivery for the business. And the key bottleneck is highly likely to be your enterprise data architecture.
Rather than focusing on managing data growth, organizations should prioritize managing data in the most standardized and efficient way possible. It is time to think about enterprise data management as a function with standard processes, skills and tools (just like Finance, Marketing or Procurement).
Several of our leading customers have built or are building a central “Data as a Service” platform within their organizations. This is a single, central place where all developers and analysts can go to get trustworthy data that is managed by IT through a standard architecture and served up for use by all.
For more information, see “The Big Big Data Workbook”
*Gartner Predicts 2015: Managing ‘Data Lakes’ of Unprecedented Enormity, December 2014 http://www.gartner.com/document/2934417#
Customers often inquire about the best way to get their team up to speed on the Informatica solutions. The question Informatica University hears frequently is whether a team should attend our public scheduled courses or hold a Private training event. The number of resources to be skilled on the products will help to determine which option to choose. If your team, or multiple teams within your company, has 7 or more resources that require getting up to speed on the Informatica products, then a Private training event is the recommended choice.
The break-even point is seven (7) resources for a remote instructor and nine (9) for an onsite instructor; at or above those numbers, a private training event is the most cost-efficient delivery for a team. In addition to the cost benefit, customers who have taken this option value the daily access to their team members, which keeps business operations humming along, and the opportunity to collaborate with key team members who are not attending by allowing them to provide project input and perspective.
These reserved events can also be adapted to focus on a customer’s needs. Course materials can be tailored to highlight topics that will be key to a project’s implementation, which provides creative options to get a team up to speed on the Informatica projects at hand.
With Informatica University’s new flexible pricing, hosting a Private Training event is easy. All it takes is:
- A conference room
- Training PCs or laptops for participants
- Access to the Internet
- An LCD projector, screen, white board, and appropriate markers
Private training events provide the opportunity to get your resources comfortable and efficient with the Informatica Solutions and have a positive impact on the success of your projects.
To understand more about Informatica’s New Flexible Pricing, contact firstname.lastname@example.org