Category Archives: CIO
As I have shared in the posts of this series, businesses are using analytics to improve their internal and external facing business processes and to strengthen their “right to win” within the markets in which they operate. Like healthcare institutions across the country, UPMC is striving to improve its quality of care and business profitability. One educational healthcare CEO put it to me this way: “if we can improve our quality of service, we can reduce costs while we increase our pricing power”. In UPMC’s case, they believe that the vast majority of their costs are in a fraction of their patients, but they want to prove this with real data and then use this information to drive their go-forward business strategies.
Getting more predictive to improve outcomes and reduce cost
Armed with this knowledge, UPMC’s leadership wanted to use advanced analytics and predictive modeling to improve clinical and financial decision making. Taking this action was seen as producing better patient outcomes and reducing costs. A focus area for analysis involved creating “longitudinal records” for the complete cost of providing particular types of care. For those who aren’t versed in time series analysis, longitudinal analysis uses a series of observations obtained from many respondents over time to derive a relevant business insight. When I was involved in healthcare, I used this type of analysis to relate employee and patient engagement results to healthcare outcomes. In UPMC’s case, they wanted to use this type of analysis to understand, for example, the total end-to-end cost of a spinal surgery. UPMC wanted to look beyond the cost of surgery and account for the pre-surgery care and recovery-related costs. To do this for the entire hospital, however, UPMC needed to bring together data from hundreds of sources across UPMC and outside entities, including labs and pharmacies. With this information in hand, UPMC’s leadership saw the potential to create an accurate and comprehensive view which could be used to benchmark future procedures. Additionally, UPMC saw the potential to automate the creation of patient problem lists or examine clinical practice variations. But like the other case studies that we have reviewed, these steps required trustworthy and authoritative data to be accessed with agility and ease.
UPMC starts with a large, multiyear investment
In October 2012, UPMC made a $100 million investment to establish an enterprise analytics initiative to bring together, for the first time, clinical, financial, administrative, genomic and other information in one place. Tom Davenport, the author of Competing on Analytics, suggests in his writing that establishing an enterprise analytics capability represents a major step forward because it allows enterprises to answer the big questions, to better tie strategy and analytics, and to finally rationalize application interconnect and business intelligence spending. As UPMC put its plan together, it realized that it needed to impact more than 1200 applications. It also realized that it needed one system to manage data integration, master data management, and eventually complex event processing capabilities. At the same time, it addressed the people side of things by creating a governance team to manage data integrity improvements, ensuring that trusted data populates enterprise analytics and providing transparency into data integrity challenges. One of UPMC’s goals was to provide self-service capabilities. According to Terri Mikol, a project leader, “We can’t have people coming to IT for every information request. We’re never going to cure cancer that way.” Here is an example of the promise that emerged within the first eight months of this project. Researchers were able to integrate, for the first time ever, clinical and genomic information on 140 patients previously treated for breast cancer. Traditionally, these data have resided in separate information systems, making it difficult, if not impossible, to integrate and analyze dozens of variables. The researchers found intriguing molecular differences in the makeup of pre-menopausal vs. post-menopausal breast cancer, findings which will be further explored. For UPMC, this initial cancer insight is just the starting point of their efforts to mine massive amounts of data in the pursuit of smarter medicines.
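To make the idea of a longitudinal cost record concrete, here is a minimal Python sketch. It is not UPMC’s implementation; the event fields, source names, and dollar amounts are all hypothetical, and the point is only to show cost events from separate systems being ordered in time and rolled up per patient across the pre-surgery, surgery, and recovery phases.

```python
from collections import defaultdict
from datetime import date

# Hypothetical cost events from separate source systems (illustrative values only).
events = [
    {"patient_id": "P1", "source": "clinic",   "date": date(2014, 1, 10), "phase": "pre-surgery", "cost": 400.0},
    {"patient_id": "P1", "source": "lab",      "date": date(2014, 1, 12), "phase": "pre-surgery", "cost": 150.0},
    {"patient_id": "P1", "source": "hospital", "date": date(2014, 2, 1),  "phase": "surgery",     "cost": 30000.0},
    {"patient_id": "P1", "source": "pharmacy", "date": date(2014, 2, 5),  "phase": "recovery",    "cost": 250.0},
    {"patient_id": "P2", "source": "hospital", "date": date(2014, 3, 3),  "phase": "surgery",     "cost": 28000.0},
    {"patient_id": "P2", "source": "rehab",    "date": date(2014, 3, 20), "phase": "recovery",    "cost": 1200.0},
]


def build_longitudinal_records(events):
    """Group events by patient, order them in time, and total the cost per phase."""
    records = defaultdict(
        lambda: {"timeline": [], "cost_by_phase": defaultdict(float), "total_cost": 0.0}
    )
    for event in sorted(events, key=lambda e: (e["patient_id"], e["date"])):
        record = records[event["patient_id"]]
        record["timeline"].append((event["date"], event["source"], event["cost"]))
        record["cost_by_phase"][event["phase"]] += event["cost"]
        record["total_cost"] += event["cost"]
    return records


records = build_longitudinal_records(events)
for patient_id, record in records.items():
    print(patient_id, dict(record["cost_by_phase"]), record["total_cost"])
```

In practice the hard part is the step this sketch assumes away: agreeing on a single patient identifier and consistent phase definitions across hundreds of sources.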
Building the UPMC Enterprise Analytics Capability
To create their enterprise analytics platform, UPMC determined it was critical to establish “a single, unified platform for data integration, data governance, and master data management,” according to Terri Mikol. The solution required a number of key building blocks. The first was data integration that collects and cleanses data from hundreds of sources and organizes it into repositories that enable fast, easy analysis and reporting by and for end users.
Specifically, the UPMC enterprise analytics capability pulls clinical and operational data from a broad range of sources, including systems for managing hospital admissions, emergency room operations, patient claims, health plans, electronic health records, as well as external databases that hold registries of genomic and epidemiological data needed for crafting personalized and translational medicine therapies. UPMC has integrated quality checked source data in accordance with industry-standard healthcare information models. This effort included putting together capabilities around data integration, data quality and master data management to manage transformations and enforce consistent definitions of patients, providers, facilities and medical terminology.
As said, the cleansed and harmonized data is organized into specialized genomics databases, multidimensional warehouses, and data marts. The approach makes use of traditional data warehousing approaches as well as big data capabilities to handle unstructured data and natural language processing. UPMC has also deployed analytical tools that allow end users to exploit the data made available by the Enterprise Analytics platform. The tools drive everything from predictive analytics and cohort tracking to business and compliance reporting. And UPMC did not stop there. If their data had value, then it needed to be secured. UPMC created data audits and data governance practices. They also implemented a dynamic data masking solution to ensure data security and privacy.
As I have discussed, many firms are pushing point silo solutions into their environments, but as UPMC shows, this limits their ability to ask the bigger business questions or, in UPMC’s case, to discover things that can change people’s lives. Analytics are more and more a business enabler when they are organized as an enterprise analytics capability. I have also come to believe that analytics have become a foundational capability for every firm’s right to win. It informs a coherent set of capabilities and establishes a firm’s go-forward right to win. For this, UPMC is a shining example of getting things right.
Author Twitter: @MylesSuer
Recently, I got to attend the Predictive Analytics Summit in San Diego. It felt great to be in a room full of data scientists from around the world; all my hidden statistics, operations research, and even modeling background came back to me instantly. I was most interested to learn what this vanguard was doing as well as any lessons learned that could be shared with the broader analytics audience. Presenters ranged from Internet leaders to more traditional companies like Scotts Miracle Gro. Brendan Hodge of Scotts Miracle Gro in fact said that, at a 125-year-old company, he feels like “a dinosaur at a mammal convention”. So in the space that follows, I will share my key take-aways from some of the presenters.
Fei Long from 58.com
58.com is the Craigslist, Yelp, and Monster of China. Fei shared that 58.com is using predictive analytics to recommend resumes to employers and to drive more intelligent real time bidding for its products. Fei said that 58.com has 300 million users—about the number of people in the United States. Most interesting, Fei said that predictive analytics has driven a 10-20% increase in 58.com’s click through rate.
Ian Zhao from eBay
Ian said that eBay is starting to increase the footprint of its data science projects. He said that historically the focus for eBay’s data science was marketing, but today eBay is applying data science to sales and HR. Provost and Fawcett agree in “Data Science for Business”, saying that “the widest applications of data mining techniques are in marketing for tasks such as target marketing, online advertising, and recommendations for cross-selling”.
Ian said that in the non-marketing areas, they are finding a lot less data. The data is scattered across data sources and requires a lot more cleansing. Ian is using things like time series and ARIMA to look at employee attrition. One thing Ian found that was particularly interesting is that there is a strong correlation between attrition and bonus payouts. Ian said it is critical to leave ample time for data prep. He said that it is important to start the data prep process by doing data exploration and discovery. This includes confirming that data is available for hypothesis testing. Sometimes, Ian said, the data prep process can include inputting data that is not available in the data set and validating data summary statistics. With this, Ian said that data scientists need to dedicate time and resources to determining which things are drivers. He said that with the business, data scientists should talk about likelihood because business people in general do not understand statistics. It is important as well that data scientists ask business people the “so what” questions. Data scientists should narrow things down to a dollar impact.
Barkha Saxena from Poshmark
Barkha is trying to model the value of user growth. Barkha said that this matters because Poshmark wants to be the #1 community driven marketplace. They want to use data to create a “personal boutique experience”. With 700,000 transactions a day, they are trying to measure customer lifetime value by implementing a cohort analysis. What was most interesting in Barkha’s data is that she discovered repeatable performance across cohorts. In their analysis, different models work better depending on the data, so a lot of time goes into procedurally determining the best model fit.
Meagan Huth from Google
Meagan said that Google is creating something that they call People Analytics. They are trying to make all people decisions by science and data. They want to make it cheaper and easier to work at Google. They have found through their research that good managers lower turnover, increase performance, and increase workplace happiness. The most interesting thing that she says they have found is that the best predictor of being a good manager is being a good coach. They have developed predictive models around text threads, including those that occur in employee surveys, to ensure they have the data needed to improve.
Hobson Lane from Sharp Labs
Hobson reminded everyone of the importance of Nyquist (you need to sample data at least twice as fast as the fastest data event). This is especially important for organizations moving to the so-called Internet of Things. Many of these devices have extremely high data event rates. Hobson also discussed the importance of looking at variance against the line that gets drawn in a regression analysis. Sometimes, multiple lines can be drawn. He also discussed the problem of not having enough data to support the complexity of the decision that needs to be made.
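The Nyquist rule of thumb is easy to check with a little arithmetic. The Python sketch below is my own illustration rather than anything Hobson presented, and the function names and frequencies are made up. It computes the minimum sampling rate for a given event rate and shows the frequency a periodic signal appears to have when it is sampled too slowly (aliasing).

```python
def min_sample_rate(max_event_hz):
    """Nyquist rule of thumb: sample at least twice as fast as the fastest event."""
    return 2.0 * max_event_hz


def apparent_frequency(signal_hz, sample_hz):
    """Frequency seen after sampling; it folds back (aliases) when undersampled."""
    folded = signal_hz % sample_hz
    return min(folded, sample_hz - folded)


print(min_sample_rate(50.0))            # 100.0
print(apparent_frequency(50.0, 200.0))  # 50.0, sampled fast enough
print(apparent_frequency(50.0, 60.0))   # 10.0, undersampled and aliased
```

A sensor reporting a 50 Hz event at only 60 samples per second would therefore look like a 10 Hz event, which is the kind of silent error that makes the sampling rate worth checking before any modeling starts.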
Ravi Iyer from Ranker
Ravi started by saying Ranker is a Yelp for everyone else. He then discussed the importance of having systematic data. A nice quote from him is as follows: “better data=better predictions”. Ravi discussed as well the topic of response bias. He said that asking about Coke can lead to a different answer than asking about Coke at a movie. He also discussed, interestingly, how their research shows that millennials are really all about “the best”. I see this happening every time that I take my children out to dinner; there is no longer a cheap dinner out.
Ranjan Sinha at eBay
Ranjan discussed the importance of customer centric commerce and creating predictive models around it. At eBay, they want to optimize the customer experience and improve their ability to make recommendations. eBay is finding customer expectations are changing. For this reason, they want customer context to be modeled by looking at transactions, engagement, intent, account, and inferred social behaviors. With modeling completed, they are using complex event processing to drive a more automated response to data. An amazing example given was for Valentine’s Day, where they use a man’s partner’s data to predict the items that the man should get for his significant other.
Andrew Ahn from LinkedIn
Andrew is using analytics to create what he calls an economic graph and to make professionals more productive. One area to which he personally is applying predictive analytics is LinkedIn’s sales solutions. In LinkedIn Sales Navigator, they display potential customers based upon the sales person’s demographic data; effectively, the system makes lead recommendations. However, they want to de-risk this potential interaction for sales professionals and potential customers. Andrew says at the same time that they have found through data analysis that small changes in a LinkedIn profile can lead to big changes. To put this together, they have created something that they call the social selling index. It looks at predictors that they have determined are statistically relevant, including member demographic, site engagement, and social network. The SSI score is viewed as a predictive index. Andrew says that they are trying to go from serendipity to data science.
Robert Wilde from Slacker Radio
Robert discussed the importance of simplicity and elegance in model building. He then went through a set of modeling issues to avoid. He said that modelers need to own the discussion of causality and cause and effect and how this can bias data interpretation. In addition, he stressed looking at data variance: what does one do when a line doesn’t have a single point fall on it? Additionally, Robert discussed what to do when correlation is strong, weak, or mistaken. Is it X or Y that has the relationship? Or worse yet, what do you do when there is coincidental correlation? This led to a discussion of forward and reverse causal inference. For this reason, Robert argued strongly for principal component analysis. In his view, this eliminates regression causational bias. At the same time, he suggested that models should be valued by complexity versus error rates.
Parsa Bakhtary from Facebook
Parsa has been looking at which games generate revenue for Facebook and which do not. Facebook, amazingly, has over 1,000 revenue-bearing games. For this reason, Facebook wants to look at the lifetime value of customers for Facebook games, that is, the dollar value of a relationship. Parsa said, however, that there is a problem: only 20% pay for their games. Parsa argued that customer lifetime value (which was developed in the 1950s) doesn’t really work for apps where everyone’s lifetime is not the same. Additionally, social and mobile gamers are not particularly loyal. He says that he therefore has to model individual games for their first 90 days across all periods of joining and then look at the cumulative revenue curves.
So we have seen here a wide variety of predictive analytics techniques being used by today’s data scientists. To me this says that predictive analytical approaches are alive and kicking. This is good news and shows that data scientists are trying to enable businesses to make better use of their data. Clearly, a key step that holds data scientists back today is data prep. While it is critical to leave ample time for data prep, it is also essential to get quality data to ensure models are working appropriately. At the same time, data prep needs to support inputting data that is not available within the original data set.
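As a rough illustration of the cumulative revenue curves Parsa described, here is a short Python sketch. It is not Facebook’s method. The payment log is invented, and grouping players into cohorts by join month is my assumption; a per-game key could be added in the same way.

```python
from collections import defaultdict
from datetime import date

# Hypothetical payment log: (player_id, join_date, payment_date, amount).
payments = [
    ("a", date(2015, 1, 1), date(2015, 1, 3), 5.0),
    ("a", date(2015, 1, 1), date(2015, 2, 10), 10.0),
    ("b", date(2015, 1, 15), date(2015, 1, 15), 2.0),
    ("b", date(2015, 1, 15), date(2015, 6, 1), 50.0),  # outside the window, ignored
    ("c", date(2015, 2, 2), date(2015, 3, 4), 8.0),
]

WINDOW_DAYS = 90  # only the first 90 days after joining are modeled


def cumulative_revenue_by_cohort(payments, window_days=WINDOW_DAYS):
    """Map each join-month cohort to its cumulative revenue per day since joining."""
    daily = defaultdict(lambda: [0.0] * window_days)
    for _player, joined, paid, amount in payments:
        age = (paid - joined).days  # days since the player joined
        if 0 <= age < window_days:
            daily[(joined.year, joined.month)][age] += amount
    curves = {}
    for cohort, revenue in daily.items():
        running, curve = 0.0, []
        for amount in revenue:
            running += amount
            curve.append(running)
        curves[cohort] = curve
    return curves


curves = cumulative_revenue_by_cohort(payments)
for cohort, curve in sorted(curves.items()):
    print(cohort, "after 30 days:", curve[29], "after 90 days:", curve[-1])
```

Because every cohort is measured over the same 90-day window, the curves can be compared directly even though players joined at different times, which is the problem with a single lifetime value number that Parsa pointed out.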
Author Twitter: @MylesSuer
I’ve spent most of my career working with new technology, most recently helping companies make sense of mountains of incoming data. This means, as I like to tell people, that I have the sexiest job in the 21st century.
Harvard Business Review put the data scientist into the national spotlight in their publication Data Scientist: The Sexiest Job of the 21st Century. Job trends data from Indeed.com confirms the rise in popularity for the position, showing that the number of job postings for data scientist positions increased by 15,000%.
In the meantime, the role of data scientist has changed dramatically. Data used to reside on the fringes of the operation. It was usually important but seldom vital – a dreary task reserved for the geekiest of the geeks. It supported every function but never seemed to lead them. Even the executives who respected it never quite absorbed it.
For every Big Data problem, the solution often rests on the shoulders of a data scientist. The role of the data scientist is similar in responsibility to the Wall Street “quants” of the 80s and 90s. Now, these data experts are tasked with the management of databases previously thought too hard to handle and too unstructured to yield any value.
So, is it the sexiest job of the 21st Century?
Think of a data scientist as a business analyst-plus. Part mathematician, part business strategist, these statistical savants are able to apply their background in mathematics to help companies tame their data dragons. But these individuals aren’t just math geeks, per se.
A data scientist is somebody who is inquisitive, who can stare at data and spot trends. It’s almost like a renaissance individual who really wants to learn and bring change to an organization.
If this sounds like you, the good news is demand for data scientists is far outstripping supply. Nonetheless, with the rising popularity of the data scientist – not to mention the companies that are hiring for these positions – you have to be at the top of your field to get the jobs.
Companies look to build teams around data scientists that ask the most questions about:
- How the business works
- How it collects its data
- How it intends to use this data
- What it hopes to achieve from these analyses
These questions are important because data scientists will often unearth information that can “reshape an entire company.” Obtaining a better understanding of the business’ underpinnings not only directs the data scientist’s research, but helps them present the findings and communicate with the less-analytical executives within the organization.
While it’s important to understand your own business, learning about the successes of other corporations will help a data scientist in their current job–and the next.
Talking to architects about analytics at a recent event, I kept hearing the familiar theme: data scientists are spending 80% of their time on “data wrangling”, leaving only 20% for delivering the business insights that will drive the company’s innovation. It was clear to everybody that I spoke to that the situation will only worsen. The coming growth everybody sees in data volume and complexity will only lengthen the time to value.
Gartner recently predicted that:
“by 2015, 50% of organizations will give up on managing growth and will redirect funds to improve classification and analytics.”
Some of the details of this study are interesting. In the end, many organizations are coming to two conclusions:
- It’s risky to delete data, so they keep it around as insurance.
- All data has potential business value, so more organizations are keeping it around for potential analytical purposes.
The other mega-trend here is that more and more organizations are looking to compete on analytics – and they need data to do it, both internal data and external data.
From an architect’s perspective, here are several observations:
- The floodgates are open and analytics is a top priority. Given that, the emphasis should be on architecting to manage the dramatic increases in both data quantity and data complexity rather than on trying to stop it.
- The immediate architectural priority has to be on simplifying and streamlining your current enterprise data architecture. Break down those data silos and standardize your enterprise data management tools and processes as much as possible. As discussed in other blogs, data integration is becoming the biggest bottleneck to business value delivery in your environment. Gartner has projected that “by 2018, more than half the cost of implementing new large systems will be spent on integration.” The more standardized your enterprise data management architecture is, the more efficient it will be.
- With each new data type, new data tool (Hive, Pig, etc.), and new data storage technology (Hadoop, NoSQL, etc.) ask first if your existing enterprise data management tools can handle the task before people go out and create a new “data silo” based on the cool, new technologies. Sometimes it will be necessary, but not always.
- The focus needs to be on speeding value delivery for the business. And the key bottleneck is highly likely to be your enterprise data architecture.
Rather than focusing on managing data growth, the priority should be on managing it in the most standardized and efficient way possible. It is time to think about enterprise data management as a function with standard processes, skills and tools (just like Finance, Marketing or Procurement.)
Several of our leading customers have built or are building a central “Data as a Service” platform within their organizations. This is a single, central place where all developers and analysts can go to get trustworthy data that is managed by IT through a standard architecture and served up for use by all.
For more information, see “The Big Big Data Workbook”
*Gartner Predicts 2015: Managing ‘Data Lakes’ of Unprecedented Enormity, December 2014 http://www.gartner.com/document/2934417#
Customers often inquire about the best way to get their team up to speed on the Informatica solutions. The question Informatica University hears frequently is whether a team should attend our public scheduled courses or hold a Private training event. The number of resources to be skilled on the products will help to determine which option to choose. If your team, or multiple teams within your company, has 7 or more resources that require getting up to speed on the Informatica products, then a Private training event is the recommended choice.
The break-even point for a private training event is seven (7) resources with a remote instructor and nine (9) with an onsite instructor; at or above that number, private training is the most cost-efficient delivery for a team. In addition to the cost benefit, customers who have taken this option value the daily access to their team members to keep business operations humming along, and the opportunity to collaborate with key team members not attending by allowing them to provide input on the project perspective.
These reserved events can also be adapted to focus on a customer’s needs by tailoring course materials to highlight topics that will be key to a project’s implementation, which provides creative options to get a team up to speed on the Informatica projects at hand.
With Informatica University’s new flexible pricing, hosting a Private Training event is easy. All it takes is:
- A conference room
- Training PC’s or laptops for participants
- Access to the Internet
- An LCD projector, screen, white board, and appropriate markers
Private training events provide the opportunity to get your resources comfortable and efficient with the Informatica Solutions and have a positive impact on the success of your projects.
To understand more about Informatica’s New Flexible Pricing, contact firstname.lastname@example.org
A Data Lake is a simple concept. It is a catchment area for data entering the organization. In the past, most businesses didn’t need to organize such a data store because almost all data was internal. It traveled via traditional ETL mechanisms from transactional systems to a data warehouse and then was sprayed around the business, as required.
When a good deal of data comes from external sources, or even from internal sources like log files, which never previously made it into the data warehouse, there is a need for an “operational data store.” This has definitely become the premier application for Hadoop and it makes perfect sense to me that such technology be used for a data catchment area. The neat thing about Hadoop for this application is that:
- It scales out “as far as the eye can see,” so there’s no likelihood of it being unable to manage the data volumes even when they grow beyond the petabyte level.
- It is a key-value store, which means that you don’t need to expend much effort in modeling data when you decide to accommodate a new data source. You just define a key and define the metadata at leisure.
- The cost of the software and the storage is very low.
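The “define a key and define the metadata at leisure” idea in the list above can be sketched in a few lines of Python. This is a toy in-memory stand-in and not a Hadoop API; the store, function names, and sample records are all hypothetical. It only shows raw data being landed under a key first and given a structure later, at read time.

```python
import json

raw_store = {}  # key -> raw bytes, stored exactly as received
schemas = {}    # source name -> parser, registered whenever it is ready


def ingest(source, record_id, payload):
    """Land the raw payload under a key; no data modeling is needed up front."""
    raw_store[f"{source}/{record_id}"] = payload


def register_schema(source, parser):
    """Attach metadata (here, a parser) to a source after the data has arrived."""
    schemas[source] = parser


def read(source):
    """Apply the source's schema at read time to every record stored for it."""
    parser = schemas[source]
    prefix = source + "/"
    return [parser(value) for key, value in sorted(raw_store.items()) if key.startswith(prefix)]


# Data arrives first...
ingest("weblog", "0001", b"2015-01-05 GET /home 200")
ingest("mobile", "0001", b'{"device": "phone", "event": "open"}')

# ...and its structure is defined later, once we know how we want to use it.
register_schema("weblog", lambda raw: dict(zip(("date", "method", "path", "status"), raw.decode().split())))
register_schema("mobile", lambda raw: json.loads(raw.decode()))

print(read("weblog"))
print(read("mobile"))
```

The trade-off is that nothing checks the data on the way in, so any cleansing has to happen when the data is read or moved downstream.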
So let’s imagine that we have a need for a data catchment area, because we have decided to collect data from log-files, mobile devices, social networks, from public data sources, or whatever. So let us also imagine that we have implemented Hadoop and some of its useful components and we have begun to collect data.
Is it reasonable to describe this as a data lake?
A Hadoop implementation should not be a set of servers randomly placed at the confluence of various data flows. The placement needs to be carefully considered and if the implementation is to resemble a “data lake” in any way, then it must be a well-engineered man-made lake. Since the data doesn’t just sit there until it evaporates but eventually flows to various applications, we should think of this as a “data reservoir” rather than a “data lake.”
There is no point in arranging all that data neatly along the aisles because when we get it, we may not know what we want to do with it at the time we get it. We should organize the data when we know that.
Another reason we should think of this as more like a reservoir than a lake is that we might like to purify the data a little before sending it down the pipes to applications or users that want to use it.
The start of the year is a great time to refresh and take a new look at your capabilities, goals, and plans for your future-state architecture. That being said, you have to take into consideration that the most scarce resource in your architecture is probably your own personal time.
Looking forward, here are three things that I would recommend that every architect do. I realize that all three of these relate to data, but as I have said in the eBook, Think “Data First” to Drive Business Value, we believe that data is the key bottleneck in your enterprise architecture in terms of slowing the delivery of business initiatives in support of your organization’s business strategy.
So, here are the recommendations. None of these will cost you anything if you are a current Informatica PowerCenter customer. And #2 and #3 are free regardless. It is only a matter of your time:
1. Take a look at the current Informatica Cloud offering and in particular the templating capabilities.
Informatica Cloud is probably much more capable than you think. The standard templating functionality supports very complex use cases and does it all from a very easy to use, no-coding user interface. It comes with a strong library of integration stubs that can be dragged and dropped into Microsoft Visio to create complex integrations. Once the flow is designed in Visio, it can be easily imported into Informatica Cloud, and from there users have a wizard-driven UI to do the final customization for sources, targets, mappings, transformations, filters, etc. It is all very powerful and easy to use.
- YouTube: Building Custom templates https://www.youtube.com/watch?v=yHmFkxov6bs
- 30 day free Informatica Cloud trial. http://more.informatica.com/en/cloud_trial/org?offer=30day-ICwebPage
Why This Matters to Architects
- You will see how easy it is for new groups to get going with fairly complex integrations.
- This is a great tool for departmental or new user use, and it will be completely compatible with the rest of your Informatica architecture – not another technology silo for you to manage.
- Any mapping created for Informatica on-premise can also run on the cloud version.
2. Download Informatica Rev and understand what it can do for your analysts and “data wranglers.”
Your data analysts are spending 80% of their time managing their data and only 20% on the actual analysis they are trying to provide. Informatica Rev is a great way to prepare your data before use in analytics tools such as Qlik, Tableau, and others.
With Informatica Rev, people who are not data experts can access, mashup, prototype and cleanse their data all in a User Interface that looks like a spreadsheet and requires no previous experience in data tools.
- For a free Informatica Rev download https://rev.informatica.com/
- Informatica Rev (Project Springbok) demo https://www.youtube.com/watch?v=0F_58bHKDDs
Why This Matters for Architects
- Your data analysts are going to use analytics tools with or without the help of IT. This enables you to help them while ensuring that they are managing their data well and optimizing their productivity.
- This tool will also enable them to share their “data recipes” and for IT to be involved in how they access and use the organization’s data.
3. Look at the new features in PowerCenter 9.6. First, upgrade to 9.6 if you haven’t already, and particularly take a good look at these new capabilities that are bundled in every version. Many people we talk to have 9.6 but don’t realize the power of what they already own.
- Profiling: Discover and analyze your data quickly. Find relationships and data issues.
- Data Services: This presents any JDBC or ODBC repository as a logical data object. From there you can rapidly prototype new applications using these logical objects without worrying about the complexities of the underlying repositories. It can also do data cleansing on the fly.
- Webinar: Great Data by Design. https://www.brighttalk.com/webcast/10477/104939
- PowerCenter 9.6 deep dive demo https://www.brighttalk.com/webcast/10477/110535
Why This Matters for Architects
- The key challenge for IT and for Architects is to be able to deliver at the “speed of business.” These tools can dramatically improve the productivity of your team and speed the delivery of projects for your business “customers.”
Taking the time to understand what these tools can do in terms of increasing the productivity of your IT team and enabling your end users to self-service will make you a better business partner overall and increase your influence across the organization. Have a great year!
The thing that resonates today, in the odd context of big data, is that we may all need to look in the mirror, hold a thumb drive full of information in our hands, and concede once and for all It’s not the data… it’s us.
Many organizations have a hard time making something useful from the ever-expanding universe of big-data, but the problem doesn’t lie with the data: It’s a people problem.
The contention is that big-data is falling short of the hype because people are:
- too unwilling to create cultures that value standardized, efficient, and repeatable information, and
- too complex to be reduced to “thin data” created from digital traces.
Evan Stubbs describes poor data quality as the data analyst’s single greatest problem.
About the only satisfying thing about having bad data is the schadenfreude that goes along with it. There’s cold solace in knowing that regardless of how poor your data is, everyone else’s is equally as bad. The thing is poor quality data doesn’t just appear from the ether. It’s created. Leave the dirty dishes for long enough and you’ll end up with cockroaches and cholera. Ignore data quality and eventually you’ll have black holes of untrustworthy information. Here’s the hard truth: we’re the reason bad data exists.
I will tell you that most data teams make “large efforts” to scrub their data. Those “infrequent” big cleanups, however, treat only the symptom, not the cause, and ultimately lead to inefficiency, cost, and even more frustration.
It’s intuitive and natural to think that data quality is a technological problem. It’s not; it’s a cultural problem. The real answer is that you need to create a culture that values standardized, efficient, and repeatable information.
If you do that, then you’ll be able to create data that is re-usable, efficient, and high quality. Rather than trying to manage a shanty of half-baked source tables, effective teams put the effort into designing, maintaining, and documenting their data. Instead of being a one-off activity, it becomes part of business as usual, something that’s simply part of daily life.
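One way to picture quality checks as business as usual is a small, repeatable profile that runs on every load rather than during an occasional big cleanup. The sketch below is only an illustration: the `profile_quality` helper, the field names, and the sample records are all assumptions made up for this example.

```python
from collections import Counter


def profile_quality(records, key_field, required_fields):
    """Return simple, repeatable quality metrics for a list of dict records.

    Hypothetical helper: reports the share of missing values per required
    field and any key values that appear more than once.
    """
    total = len(records)
    if total == 0:
        return {"rows": 0, "missing_rate": {}, "duplicate_keys": []}

    missing_rate = {}
    for field in required_fields:
        # Treat absent keys, None, and empty strings as missing.
        missing = sum(1 for r in records if r.get(field) in (None, ""))
        missing_rate[field] = missing / total

    key_counts = Counter(r.get(key_field) for r in records)
    duplicate_keys = sorted(
        k for k, n in key_counts.items() if n > 1 and k is not None
    )
    return {
        "rows": total,
        "missing_rate": missing_rate,
        "duplicate_keys": duplicate_keys,
    }


# Made-up sample data for illustration only.
SAMPLE = [
    {"customer_id": 1, "email": "a@example.com", "country": "US"},
    {"customer_id": 2, "email": "", "country": "DE"},
    {"customer_id": 2, "email": "b@example.com", "country": None},
    {"customer_id": 3, "email": "c@example.com", "country": "FR"},
]

report = profile_quality(SAMPLE, "customer_id", ["email", "country"])
print(report)
```

Run on a schedule and tracked over time, numbers like these show whether quality is drifting long before a big cleanup becomes necessary.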
However, even if that data is the best it can possibly be, is it even capable of delivering on the big-data promise of greater insights about things like the habits, needs, and desires of customers?
Despite the enormous growth of data and the success of a few companies like Amazon and Netflix, “the reality is that deeper insights for most organizations remain elusive,” write Mikkel Rasmussen and Christian Madsbjerg in a Bloomberg Businessweek blog post that argues “big-data gets people wrong.”
Big-data delivers thin data. In the social sciences, we distinguish between two types of human behavior data. The first – thin data – is from digital traces: He wears a size 8, has blue eyes, and drinks pinot noir. The second – rich data – delivers an understanding of how people actually experience the world: He could smell the grass after the rain, he looked at her in that special way, and the new running shoes made him look faster. Big-data focuses solely on correlation, paying no attention to causality. What good is thin “information” when there is no insight into what your consumers actually think and feel?
Accenture reported only 20 percent of the companies it profiled had found a proven causal link between “what they measure and the outcomes they are intending to drive.”
I contend that the keys to transforming big-data into strategic value are critical thinking skills.
Where do we get such skills? People, it seems, are both the problem and the solution. Are we failing on two fronts: failing to create the right data-driven cultures, and failing to interpret the data we collect?
The current trend is that new types of data and new types of physical storage are changing all of that.
When I got back from my trip, I found a TDWI white paper by Philip Russom that describes the situation very well and details his research on this subject: Evolving Data Warehouse Architectures in the Age of Big Data.
From an enterprise data architecture and management point of view, this is a very interesting paper.
- First, the DW architectures are getting complex because of all the new physical storage options available:
  - Hadoop – very large scale and inexpensive
  - NoSQL DBMS – beyond tabular data
  - Columnar DBMS – very fast seek time
  - DW Appliances – very fast / very expensive
- What is driving these changes is the rapidly-increasing complexity of data. Data volume has captured the imagination of the press, but it is really the rising complexity of the data types that is going to challenge architects.
- But here is what really jumped out at me. When they asked the people in their survey which components of their data warehouse architecture are important, the answer came back: standards and rules. Specifically, they meant how data is modeled, how data quality metrics are created, metadata requirements, interfaces for data integration, etc.
The conclusion for me, from this part of the survey, was that business strategy is requiring more complex data for better analyses (example: real-time response or proactive recommendations) and business processes (example: advanced customer service). This, in turn, is driving IT to look into more advanced technology to deal with different data types and different use cases for the data. And finally, the way they are dealing with the exploding complexity is through standards, particularly data standards. If you are dealing with increasing complexity and have to do it better, faster and cheaper, the only way you are going to survive is by standardizing as much as reasonably makes sense. But not a bit more.
If you think about it, it is good advice. Get your data standards in place first. It is the best way to manage the data and technology complexity. …And a chance to be the driver rather than the driven.
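To make “standards first” a little more concrete: a data standard can be written down once as declarative rules and then applied to records from any source, whatever storage sits underneath. What follows is a minimal sketch under stated assumptions; the field names, the rules, and the `violations` and `conformance_rate` helpers are hypothetical and do not come from the TDWI paper.

```python
import re

# Hypothetical data standard: field name -> (rule description, check function).
STANDARD = {
    "order_id": (
        "positive integer",
        lambda v: isinstance(v, int) and not isinstance(v, bool) and v > 0,
    ),
    "currency": (
        "three uppercase letters",
        lambda v: isinstance(v, str) and re.fullmatch(r"[A-Z]{3}", v) is not None,
    ),
    "order_date": (
        "date written as YYYY-MM-DD",
        lambda v: isinstance(v, str)
        and re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None,
    ),
}


def violations(record, standard=STANDARD):
    """Return (field, rule description) pairs for every rule the record fails."""
    return [
        (field, description)
        for field, (description, check) in standard.items()
        if not check(record.get(field))
    ]


def conformance_rate(records, standard=STANDARD):
    """Share of records that satisfy every rule: a simple data quality metric."""
    if not records:
        return 1.0
    conforming = sum(1 for r in records if not violations(r, standard))
    return conforming / len(records)


# Made-up records, as if they came from two different storage systems.
good = {"order_id": 10, "currency": "USD", "order_date": "2014-01-15"}
bad = {"order_id": -1, "currency": "usd", "order_date": "2014-01-15"}

print(violations(bad))
print(conformance_rate([good, bad]))
```

Because the rules live in one place, adding a new storage technology means feeding its records through the same checks rather than inventing a new definition of “good data.”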
I highly recommend reading this white paper. There is far more in it than I can cover here. There is also a Philip Russom webinar on DW Architecture that I recommend.
A month ago, I shared that Frank Friedman believes CFOs are “the logical choice to own analytics and put them to work to serve the organization’s needs”. Even though many CFOs are increasingly taking on what could be considered an internal CEO or COO role, many readers protested my post, which reviewed Frank Friedman’s argument. At the same time, CIOs have been very clear with me that they do not want to personally become their company’s data steward. So the question becomes: should companies create a CDO or CAO role to lead this important function? And if yes, how common are these two roles anyway?
Regardless of eventual ownership, extracting value out of data is becoming a critical business capability. It is clear that data scientists should not be shoehorned into the traditional business analyst role. Data scientists have the unique ability to derive mathematical models “for the extraction of knowledge from data” (Data Science for Business, Foster Provost, 2013, pg 2). For this reason, Thomas Davenport claims that data scientists need to be able to network across an entire business and be able to work at the intersection of business goals, constraints, processes, available data and analytical possibilities. Given this, many organizations today are starting to experiment with the notion of having either a chief data officer (CDO) or a chief analytics officer (CAO). The open question is whether an enterprise should have a CDO, a CAO, or both. Just as important is determining where each of these roles should report in the organization.
Data policy versus business questions
In my opinion, it is critical to first look into the substance of each role before making a decision with regard to the above question. The CDO should be about ensuring that information is properly secured, stored, transmitted or destroyed. This includes, according to COBIT 5, ensuring that there are effective security and controls over information systems. To do this, procedures need to be defined and implemented to ensure the integrity and consistency of information stored in databases, data warehouses, and data archives. According to COBIT 5, data governance requires the following four elements:
- Clear information ownership
- Timely, correct information
- Clear enterprise architecture and efficiency
- Compliance and security
To me, these four elements should be the essence of the CDO role. Having said this, the CAO is related but very different in terms of the nature of the role and the business skills required. The CRISP model points out just how different the two roles are. According to CRISP, the CAO role should be focused upon business understanding, data understanding, data preparation, data modeling, and data evaluation. As such, the CAO is focused upon using data to solve business problems while the CDO is about protecting data as a business-critical asset. I was living in Silicon Valley during the “Internet Bust”. I remember seeing very few job descriptions, and the few that existed said that they wanted a developer who could also act as a product manager and do some marketing as a part-time activity. This of course made no sense. I feel the same way about the idea of combining the CDO and CAO. One is about compliance and protecting data and the other is about solving business problems with data. Peanut butter and chocolate may work in a Reese’s cup but it will not work here: the orientations are too different.
So which business leader should own the CDO and CAO?
Clearly, having two more C’s in the C-Suite creates a more crowded list of corporate officers. Some have even said that this will extend what is called senior executive bloat. And of course, how do these new roles work with and affect the CIO? The answer depends on the organization’s culture. However, where there isn’t an executive staff office, I suggest that these roles go to different places. Many companies already have their CIO function reporting to finance. Where this is the case, it is important to determine whether a COO function is in place. The COO clearly could own the CDO and CAO functions because the COO has a significant role in improving processes and capabilities. Where there isn’t a COO function and the CIO reports to the CEO, I think you could have the CDO report to the CIO even though CIOs say they do not want to be a data steward. This could be a third function in parallel with the VP of Ops and VP of Apps. And in this case, I would have the CAO report to one of the following: the CFO, Strategy, or IT. Again, this all depends on current organizational structure and corporate culture. Regardless of where it reports, the important thing is to focus the CAO on an enterprise analytics capability.
Author Twitter: @MylesSuer