Category Archives: Big Data
As I continue to counsel insurers about master data, they all agree immediately that it is something they need to get their arms around fast. Ask participants in a workshop at any carrier, whether life, P&C, health or excess, and they all raise their hands when I ask, "Do you have a broadband bundle at home for internet, voice and TV, as well as wireless voice and data?", followed by, "Would you want your company to be the insurance version of this?"
Now let me be clear: while communication service providers offer very sophisticated bundles, they are still grappling with building a comprehensive view of a client across all services (data, voice, text, residential, business, international, TV, mobile, etc.) and across each of their touch points (website, call center, local store). They are also miles away from including any sort of meaningful network data (jitter, dropped calls, failed call setups, etc.).
Similarly, my insurance investigations typically touch most of the frontline consumer (business and personal) contact points, including agencies, marketing (incl. CEM & VOC) and the service center. Across all of these we typically see a significant lack of productivity, because policy, billing, payments and claims systems are service-line specific, while supporting functions from lead development and underwriting to claims adjudication often handle more than one type of claim.
This lack of performance is worsened by sub-optimal campaign response and conversion rates. Because the CRM applications that enable these touch points also lack complete or consistent contact-preference information, interactions may violate local privacy regulations. In addition, service centers may capture leads only to log them into a black-box AS/400 policy system, never to be seen again.
Here again we often hear that the fix is simply to scrub data before it goes into the data warehouse. However, the cleansed data typically does not sync back to the source systems, so any interaction with a client via chat, phone or face-to-face will not have the real-time, accurate information needed to execute a flawless transaction.
On the insurance IT side we also see enormous overhead: from scrubbing every database from source via staging to the analytical reporting environment every month or quarter, to one-off clean-up projects for the next acquired book of business. For a mid-sized, regional carrier (ca. $6B net premiums written) we find an average of $13.1 million in annual benefits from a central customer hub. This translates into an ROI of 600-900%, depending on requirement complexity, distribution model, IT infrastructure and service lines. The number includes baseline revenue improvements, productivity gains, and cost avoidance as well as cost reduction.
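For the arithmetic behind that range, here is a back-of-the-envelope sketch. The $13.1 million annual benefit is the figure above; the program cost and benefit horizon are pure assumptions chosen for illustration.

```python
# Back-of-the-envelope sketch of the ROI math. Only the $13.1M annual
# benefit comes from the post; cost and horizon are illustrative assumptions.

annual_benefit = 13_100_000      # average annual benefit cited above
years = 3                        # assumed benefit horizon
program_cost = 5_000_000         # assumed fully loaded cost of the customer hub

total_benefit = annual_benefit * years
roi_pct = (total_benefit - program_cost) / program_cost * 100
print(f"ROI over {years} years: {roi_pct:.0f}%")  # ~686% with these assumptions
```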
On the health insurance side, my clients have complained about regional data sources contributing incomplete (often driven by local process and law) and incorrect data (name, address, etc.) to untrusted membership, claims and sales data warehouses. This makes budgeting for items like nurse-staffed medical advice lines, sales compensation planning, and even identifying high-risk members (now driven by the Affordable Care Act) a true mission impossible, and it makes life challenging for the pricing teams.
Over in the life insurance category, whole and universal life plans now face a situation where high-value clients first saw lower-than-expected yields, thanks to the low interest rate environment on top of front-loaded fees and the front-loaded cost of the term component. Now, with bonds forecast to decrease in value in the near future, publicly traded carriers will likely be forced to sell bonds before maturity to make good on term life commitments and whole life minimum-yield commitments and keep policies in force.
This means that insurers need a full profile of clients as they experience life changes like a move, the loss of a job, a promotion or a birth. Each such change calls for the proper mitigation strategy to protect a baseline of coverage while maintaining or improving the premium. These strategies can range from splitting term from whole life to using managed investment portfolio yields to temporarily pad premium shortfalls.
Overall, without a true, timely and complete picture of a client and his or her personal and professional relationships over time, and of which strategies were presented, found appealing and ultimately put in force, how will margins improve? Surely, social media data can help here, but it should be a second step after mastering what is already available in-house. What are some of your experiences with how carriers have tried to collect and use core customer data?
Recommendations and illustrations contained in this post are estimates only and are based entirely upon information provided by the prospective customer and on our observations. While we believe our recommendations and estimates to be sound, the degree of success achieved by the prospective customer is dependent upon a variety of factors, many of which are not under Informatica's control, and nothing in this post shall be relied upon as representative of the degree of success that may, in fact, be realized. No warranty or representation of success, either express or implied, is made.
The greatest challenge to big data management and analysis isn’t necessarily the technical underpinnings, but rather, lingering executive confusion and uncertainty about what it is and what it can do for their organizations.
The main issue – and root of executive befuddlement – is the abject and ongoing confusion about what, exactly, is meant by “big data.” It’s certainly a hyped-up term for something that has been around for a long time. If you had a one-terabyte database around the turn of the century, you had big data, that’s for sure. If you had a 500-megabyte database in 1990, that would have been big data.
So the "volume" has always been there, and it has always been a relative measure. The same goes for the "variety" aspect of big data. Unstructured data – such as Word documents or machine log data – has been floating around organizations for decades now. How about the "velocity"? Real-time processing has been on corporate radar screens for well over a decade.
So, what has changed that we suddenly see this data as an enabler, a game-changer, opening up the gates to a brave new world of analytics-driven purpose? The rise of relatively cheap open-source tools and platforms, for one. Capturing and analyzing large volumes of fast-moving data of various structures used to require very expensive equipment and consulting assistance. The expensive consultants may still be needed, but the technology is now within reach for many organizations.
With this in mind, it is interesting to see that business leaders are warming up to the possibilities of big data, as a new industry survey shows. But what exactly they think they're warming up to is still a big question mark. The survey, conducted among 500 business and IT executives by CompTIA, shows the big data phenomenon has caught the eyes of executives. The vast majority of organizations, 78%, say they feel more positive about big data as a business initiative this year compared to a similar survey conducted a year ago. And, remarkably, 57% feel they've made progress in moving in the right direction with data-driven programs, compared with 37% the year before.
To its credit, the CompTIA study's authors question how accurately these findings actually translate to progress on the big data front: they note that while this year's survey finds 42% of respondents claiming to be engaged in some form of big data initiative – more than double from a year ago (19%) – such initiatives may be "big data" in name only. "This may stem from confusion or reflect the possibility of different users interpreting the concept of big data in different ways," they observe.
So what we have is a lot of organizations diving into what they see as "big data" projects because that's what everybody tells them they should be doing. But how much of this is simply the same type of data management and analytics project that might have been undertaken five or ten years ago?
To really be making the most of big data as we understand it today, organizations should be addressing the following questions:
How much unstructured data is coursing through the organization, and how much of it is worth harvesting? It's usually easy to measure the amount of structured data, such as that stored in relational databases or data warehouses, but unstructured data is a huge question mark. In many cases, management is clueless about what types of unstructured assets (user-generated files, machine-generated data) are actually available. It's going to take a lot of research and discovery to uncover the unstructured data assets that are truly meaningful for the business (a minimal inventory sketch follows these questions).
Does the current data architecture support the introduction and integration of new data sources? Most traditional data architectures are fairly rigid, built to support the inputs and outputs of relational data. Efforts involving other forms of data tend to be one-off projects, in which connectors or interfaces are hand-built for a single purpose and then forgotten. Reaching out and exploring new and varied types of data requires an architecture in which new sources can be rapidly and seamlessly introduced, without the usual silos.
Is the organization moving to an analytics culture? Big data analytics will never be "big" if it is only available to a select few decision makers or analysts. Big data will pack its punch when it enables decision makers at all levels of the organization – from customer care centers to production floors to the executive suite – to access analytics from various data sources. Even more helpful would be a way for decision makers to access analytical tools and back-end data sources through self-service approaches.
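On that first question, here is a minimal, hedged sketch of an unstructured-data inventory: it walks a file share and tallies bytes per file type, which is often the cheapest way to size the discovery problem. The root path is a hypothetical example.

```python
# Minimal sketch of an unstructured-data inventory: walk a file share and
# tally size and count per file type. The root path is an assumption.
import os
from collections import defaultdict

ROOT = "/mnt/shared"               # hypothetical file share to inventory
sizes, counts = defaultdict(int), defaultdict(int)

for dirpath, _, filenames in os.walk(ROOT):
    for name in filenames:
        ext = os.path.splitext(name)[1].lower() or "<none>"
        try:
            sizes[ext] += os.path.getsize(os.path.join(dirpath, name))
            counts[ext] += 1
        except OSError:
            pass                   # skip unreadable or vanished files

# Print the 20 largest buckets; these are the candidates worth profiling.
for ext, total in sorted(sizes.items(), key=lambda kv: -kv[1])[:20]:
    print(f"{ext:>8}  {counts[ext]:8d} files  {total / 1e9:8.1f} GB")
```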
Unlike some of my friends, I truly enjoyed history as a subject in high school and college. I particularly appreciated biographies of favorite historical figures because they put a human face on the past and gave it meaning and color. I also vowed at that time to navigate my life and future by the principle attributed to Harvard professor Jorge Agustín Nicolás Ruiz de Santayana y Borrás, better known as George Santayana: "Those who cannot remember the past are condemned to repeat it."
So that’s a little ditty regarding my history regarding history.
Fast-forward now to the present, in which I have carved out my career in technology, and in particular enterprise software. I'm afforded a great platform from which to talk to lots of IT and business leaders, and when I do, I usually ask them, "How are you implementing advanced projects that help the business become more agile, more effective or opportunistically proactive?" They usually answer something along the lines of "this is the age and renaissance of data science and analytics" and then end up talking exclusively about their meat-and-potatoes business intelligence software projects and how 300 reports now run their business.
Then, when I probe and hear their answers in more depth, I am once again reminded of THE history quote, and I think to myself that there is an amusing irony at play here. The business intelligence systems of today are mostly designed to "remember" and report on the historical past through large data warehouses of a gazillion transactions, along with basic but numerous shipping and billing histories and maybe assorted support records.
But when it comes right down to it, business intelligence "history" is still just that. Nothing is really learned and applied right when and where it counts, which is when it would have made all the difference had the company been able to react in time.
So, in essence, by using standalone BI systems as they are designed today, companies are indeed condemned to repeat the past, because the lessons they learn arrive too late, and the same mistakes will be repeated again and again.
This means the challenge for BI is to reduce latency, measure the pertinent data, sensors and events, and become scalable: extremely scalable, and flexible enough to handle the volume and variety of the forthcoming data onslaught.
There's a part 2 to this story, so keep an eye out for my next blog post, History Repeats Itself (Part 2).
That tag line got your attention, did it not? Last week I talked about how companies are trying to squeeze more value out of their asset data (e.g. equipment of any kind) and the systems that house it. I also highlighted the fact that IT departments in many companies with physical-asset-heavy business models have tried (and often failed) to create a consistent view of asset data in a new ERP or data warehouse application. These environments are neither equipped to deal with all life cycle aspects of asset information, nor do they fix the root of the data problem in the source systems, i.e. where the stuff is and what it looks like. It is like a teenager whose parents have spent thousands of dollars buying him the latest garments, but who always wears the same three outfits because he cannot find the others in the pile he hoards under his bed. And now they have bought him a smartphone to fix it. So before you buy him the next black designer shirt, maybe it would be good to find out how many of the same designer shirts he already has, what state they are in and where they are.
Recently, I had the chance to work on a similar problem with a large overseas oil & gas company and a North American utility. Both are by definition asset-heavy, very conservative in their business practices, highly regulated, heavily dependent on outside market forces such as the oil price, and geographically very dispersed; and thus, by default, a classic system integration spaghetti dish.
My challenge was to find out where the biggest opportunities were in terms of harnessing data for financial benefit.
The initial sense in oil & gas was that most of the financial opportunity hidden in asset data was in G&G (geophysical & geological), and the least was on the retail side (lubricants and gas for sale at operated gas stations). On the utility side, the go-to area for opportunity appeared to be maintenance operations. Let's say that I was about right with these assertions, but there were a lot more skeletons in the closet, with diamond rings on their fingers, than I anticipated.
After talking extensively with a number of department heads in the oil company, starting with the IT folks running half of the 400 G&G applications, the ERP instances (it turns out there were 5, not 1) and the data warehouses (3), I queried the people in charge of lubricant and crude plant operations, hydrocarbon trading, finance (tax, insurance, treasury) as well as supply chain, production management, land management and HSE (health, safety, environmental).
The net-net was that the production management people said there was no issue, as they had already cleaned up the ERP instance around customer and asset (well) information. The supply chain folks indicated that they had used another vendor's MDM application to clean up their vendor data, which, funnily enough, was never put back into the procurement system responsible for ordering parts. The data warehouse/BI team was comfortable that it cleaned up any information for supply chain, production and finance reports before dimension and fact tables were populated for any data marts.
All of this was pretty much a series of denial sessions on the 12-step road to recovery, as the IT folks had very little interaction with the business and thus little sense of how relevant, correct, timely and useful these actions were for the end consumers of the information. They also had to run and adjust fixes every month or quarter as source systems changed, new legislation dictated adjustments and new executive guidelines were announced.
While every department tried to run semi-automated, monthly clean-up jobs with scripts and some off-the-shelf software to fix its particular situation, the corporate (holding) company and any downstream consumers had no consistent basis for making sensible decisions on where and how to invest, other than throwing another legion of bodies (by now over 100 FTEs in total) at the same problem.
So at every stage of the data flow, from the sources to the ERP to the operational BI and finally the finance BI environment, people repeated the same tasks: profile, understand, move, aggregate, enrich, format and load.
Despite the departmental clean-up efforts, areas like production operations did not know with certainty (even after their clean-up) how many well heads and bores they had, where they were downhole, and who had last changed a characteristic as mundane as the well name, and why (governance, location match).
Marketing (Trading) was surprisingly open about its issues. The team could not process incoming, anchored crude shipments into inventory, or assess who owned the counterparty they sold to and what payment terms were appropriate given the associated credit or concentration risk (reference data, hierarchy management). As a consequence, operating cash accuracy was low despite ongoing improvements in the process, incurring opportunity cost.
Operational assets like rig equipment carried excess insurance coverage (location, operational data linkage), and fines paid to local governments for incorrectly filing or failing to renew work visas were not recovered for up to two years, incurring opportunity cost (employee reference data).
A big chunk of savings was locked up in unplanned NPT (non-productive time), because inconsistent, incorrect well data triggered incorrect maintenance intervals. Similarly, OEM-specific DCS (drill control system) component software lacked a central reference data store, so no alerts were triggered before components failed. Add on top of this the lack of linkage between the data served up by thousands of sensors via well logs and PI historians and its ever-changing roll-up for operations and finance, and the resulting chaos is complete.
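To make the reference-data point concrete, here is a hypothetical sketch of the kind of alert a central reference store enables: compare each component's last recorded service date against a reference interval and flag the gaps. All component names, intervals and dates below are invented.

```python
# Hypothetical sketch: a central reference store of service intervals lets a
# simple job flag overdue or undocumented components before they fail.
from datetime import date

reference_intervals = {            # component -> service interval in days
    "mud_pump_a": 90,
    "top_drive": 120,
    "drawworks": 180,
}
last_service = {                   # as reconciled from maintenance systems
    "mud_pump_a": date(2013, 6, 1),
    "top_drive": date(2013, 9, 15),
}

today = date(2013, 11, 1)
for component, interval in reference_intervals.items():
    serviced = last_service.get(component)
    if serviced is None:
        print(f"ALERT {component}: no service record found")
    elif (today - serviced).days > interval:
        overdue = (today - serviced).days - interval
        print(f"ALERT {component}: service overdue by {overdue} days")
```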
One approach we employed for sizing the NPT improvements was to take the revenue-from-production figure from the company's 10-K and combine it with the industry benchmark for NPT days per 100 days of production (typically about 30%, averaged across onshore and offshore wells of varying depths). You then overlay a benchmark (if the company does not know its actual figure) for how many of those NPT days were due to bad data rather than equipment failure or the like. Even fixing just a portion of that yields big numbers, as the sketch below shows.
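Here is that back-of-the-envelope logic as a minimal sketch. Only the ~30% NPT benchmark comes from the text above; every other input is an assumption chosen for illustration.

```python
# Sketch of the NPT opportunity sizing. All inputs except the ~30% NPT
# benchmark are illustrative assumptions, not client data.

production_revenue = 8_000_000_000   # assumed revenue-from-production (10-K)
npt_rate = 0.30                      # benchmark share of days lost to NPT
bad_data_share = 0.15                # assumed share of NPT caused by bad data
fixable_share = 0.25                 # assume a quarter of that can be fixed

revenue_at_risk = production_revenue * npt_rate
data_driven_loss = revenue_at_risk * bad_data_share
recoverable = data_driven_loss * fixable_share

print(f"Revenue lost to NPT:       ${revenue_at_risk / 1e6:,.0f}M")
print(f"  attributable to data:    ${data_driven_loss / 1e6:,.0f}M")
print(f"  recoverable per year:    ${recoverable / 1e6:,.0f}M")  # ~$90M here
```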
When I sat back and looked at all the potential, it came to more than $200 million in savings over 5 years, and this before any sensor data from rig equipment, such as the myriad of siloed applications running within a drill control system, is integrated and leveraged via a Hadoop cluster to influence operational decisions like drill string configuration or azimuth.
Next time I'll share some insight into the results of my most recent utility engagement, but I would love to hear what your experience has been in these two or other similar industries.
Recommendations contained in this post are estimates only and are based entirely upon information provided by the prospective customer and on our observations. While we believe our recommendations and estimates to be sound, the degree of success achieved by the prospective customer is dependent upon a variety of factors, many of which are not under Informatica's control, and nothing in this post shall be relied upon as representative of the degree of success that may, in fact, be realized. No warranty or representation of success, either express or implied, is made.
I'm glad to hear you feel comfortable explaining data to your friends, and I completely understand why you'll avoid discussing metadata with them. You're in great company – most business leaders also avoid discussing metadata at all costs! You mentioned during our last call that you keep reading articles in the New York Times about this thing called "Big Data", so as promised I'll try to explain it as best I can.
Data Warehouse Optimization (DWO) is becoming a popular term describing how an organization optimizes its data storage and processing for cost and performance while data volumes continue to grow from an ever-increasing variety of data sources.
Data warehouses are reaching their capacity much too quickly, as the demand for more data and more types of data forces IT organizations into very costly upgrades. Compounding the problem, many organizations don't have a strategy for managing the lifecycle of their data. It is not uncommon for much of the data in a data warehouse to be unused or infrequently used, or for too much compute capacity to be consumed by extract-load-transform (ELT) processing. This is sometimes the result of one-off business reports that are no longer used, or of staging raw data in the data warehouse.

The pattern shows up across industries. A large global bank's data warehouse was exploding with 200TB of data, forcing it to consider an upgrade that would cost $20 million. The bank discovered that much of the data was no longer being used and could be archived to lower-cost storage, thereby avoiding the upgrade and saving millions; it continues to retire data monthly, resulting in ongoing savings of $2-3 million annually. A large healthcare insurance company discovered that fewer than 2% of its ELT scripts were consuming 65% of its data warehouse CPU capacity; it is now looking at Hadoop as a staging platform to offload the storage of raw data and the ELT processing, freeing up the data warehouse to support hundreds of concurrent business users. And a global media & entertainment company saw its data increase 20x per year, and the associated costs increase 3x within 6 months, as it on-boarded more data such as web clickstream data from thousands of web sites and in-game telemetry data. A minimal triage sketch for finding archiving candidates follows.
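As a hedged illustration of the first kind of triage (finding cold data), here is a minimal sketch against an invented usage-metadata extract. Real warehouses expose similar query logs or system catalogs, but the table, columns and sample rows here are assumptions.

```python
# Minimal triage sketch: flag tables with no reads in the past year as
# archiving candidates. Table, columns and sample rows are all invented.
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_usage (table_name TEXT, size_gb REAL, last_read_at TEXT)"
)
conn.executemany(
    "INSERT INTO table_usage VALUES (?, ?, ?)",
    [
        ("stg_raw_clicks", 4200.0, "2012-08-01"),  # illustrative rows only
        ("dim_customer", 35.0, "2013-11-10"),
        ("old_campaign_rpt", 120.0, None),
    ],
)

cutoff = (datetime(2013, 11, 15) - timedelta(days=365)).strftime("%Y-%m-%d")
rows = conn.execute(
    """
    SELECT table_name, size_gb, MAX(last_read_at) AS last_read
    FROM table_usage
    GROUP BY table_name, size_gb
    HAVING MAX(last_read_at) IS NULL OR MAX(last_read_at) < ?
    ORDER BY size_gb DESC
    """,
    (cutoff,),
).fetchall()

for table_name, size_gb, last_read in rows:
    print(f"archive candidate: {table_name} ({size_gb:.0f} GB, last read: {last_read})")
```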
In this era of big data, not all data is created equal: most raw data, originating from machine log files, social media, or years of original transactions, is considered to be of lower value, at least until it has been prepared and refined for analysis. Such raw data should be staged in Hadoop to reduce storage and data preparation costs, while data warehouse capacity should be reserved for refined, curated and frequently used datasets. It is therefore time to consider optimizing your data warehouse environment to lower costs, increase capacity, optimize performance, and establish an infrastructure that can support growing data volumes from a variety of data sources. Informatica has a complete solution available for data warehouse optimization.
The first step in the optimization process, as illustrated in Figure 1 below, is to identify inactive and infrequently used data, as well as ELT performance bottlenecks, in the data warehouse. Step 2 is to offload the data and ELT processing identified in step 1 to Hadoop. PowerCenter customers have the advantage of Vibe, which allows them to map once and deploy anywhere, so that ELT processing executed through PowerCenter pushdown capabilities can be converted to ETL processing on Hadoop as part of a simple configuration step during deployment. Step 3 is to stage most raw data, such as original transaction data, log files (e.g. Internet clickstream), social media, sensor device, and machine data, in Hadoop. Informatica provides near-universal connectivity to all types of data so that you can load data directly into Hadoop; you can even replicate entire schemas and files into Hadoop, capture just the changes, and stream millions of transactions per second, such as machine data, into Hadoop. The Informatica PowerCenter Big Data Edition makes every PowerCenter developer a Hadoop developer, without having to learn Hadoop, so that all ETL, data integration and data quality processing can be executed natively on Hadoop using readily available skills, while increasing productivity up to 5x over hand-coding. Informatica also provides data discovery and profiling tools on Hadoop to help data science teams collaborate and understand their data. The final step is to move the resulting high-value, frequently used data sets, prepared and refined on Hadoop, into the data warehouse that supports your enterprise BI and analytics applications. The sketch below illustrates, under stated assumptions, what steps 2 through 4 can look like in practice.
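As a rough illustration only (this is generic PySpark, not Informatica's product API), here is a minimal sketch of raw clickstream data staged and refined on Hadoop, with only the curated result handed to the warehouse load zone. All paths and column names are assumptions.

```python
# Rough PySpark sketch of steps 2 through 4; paths and columns are invented.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dwo-offload").getOrCreate()

# Step 3: raw data lands in HDFS instead of warehouse staging tables.
raw = spark.read.json("hdfs:///staging/clickstream/2013-11-*.json")

# Step 2: the heavy ELT work (filter, dedupe, aggregate) runs on Hadoop.
refined = (
    raw.filter(F.col("user_id").isNotNull())
       .dropDuplicates(["event_id"])
       .groupBy("user_id", F.to_date("event_ts").alias("event_date"))
       .agg(F.count("*").alias("page_views"))
)

# Step 4: only the curated, frequently used result moves toward the
# warehouse; the raw events stay on cheap Hadoop storage.
refined.write.mode("overwrite").parquet("hdfs:///curated/daily_page_views")
```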
To get started, Informatica has teamed up with Cloudera to deliver a reference architecture for data warehouse optimization, so organizations can lower infrastructure and operational costs, optimize performance and scalability, and ensure enterprise-ready deployments that meet business SLAs. To learn more, please join the webinar A Big Data Reference Architecture for Data Warehouse Optimization on Tuesday, November 19 at 8:00am PST.
Figure 1: Process steps for Data Warehouse Optimization
Most in the software business would agree that it is tough enough to calculate, and hence financially justify, the purchase or build of an application (especially middleware) to a business leader or even a CIO. Most business-centric IT initiatives involve improving processes (order, billing, service) and visualization (scorecarding, trending) so that end users can engage accounts more efficiently. Some of these have actually migrated to targeting improvements at customers rather than at their logical placeholders, like accounts. Similar strides have been made in the realm of other party types (vendor, employee) as well as product data. These initiatives also tackle analyzing larger or smaller data sets and providing a visual set of clues on how to interpret historical or predictive trends in orders, bills, usage, clicks, conversions, etc.
If you think this is a tough enough proposition in itself, imagine the challenge of quantifying the financial benefit derived from understanding where your "hardware" is physically located, how it is configured, and who maintained it, when and how. Depending on the business model, you may even have to figure out who built it or who owns it. All of this has bottom-line effects on how, by whom and when expenses are paid and revenues get realized and recognized. And then there is the added complication that these dimensions of hardware are often fairly dynamic, since assets can change ownership and/or physical location and hence tax treatment, insurance risk, etc.
Such hardware could be a pump, a valve, a compressor, a substation, a cell tower, a truck, or components within these assets. Over time, as new technologies and acquisitions come about, the systems that plan for, install and maintain these assets become very departmentalized in scope and specialized in function. The application that designs an asset for department A or region B is not the same as the one accounting for its value, which is not the same as the one reading its operational status, which is not the same as the one scheduling maintenance, which is not the same as the one billing for any repairs or replacement. The same folks who said the Data Warehouse was the "Golden Copy" now say the "new ERP system" is the new central source for everything. Practitioners know that this is either naiveté or maliciousness. And then there are manual adjustments….
Moreover, to truly squeeze value out of these assets as they are installed and upgraded, the massive amounts of data they generate, in a myriad of formats and intervals, need to be understood, moved, formatted, fixed and interpreted at the right time, and stored for future use in a cost-sensitive, easy-to-access and contextually meaningful way.
I wish I could tell you that one application does it all, but the unsurprising reality is that it takes a concoction of several. Few, if any, legacy applications supporting the asset life cycle will be retired, as they often house data in formats commensurate with the age of the assets they were built for. It makes little financial sense to shut down these systems in a big-bang approach; rather, migrate region after region and process after process to the new system. After all, some of the assets have been in service for 50 or more years, and the institutional knowledge tied to them is becoming nearly as old. Also, it is probably easier to carry out the often-required manual data fixes (hopefully only outliers) bit by bit, especially to accommodate imminent audits.
So what do you do in the meantime, until all the relevant data is in a single system, to get an enterprise-level way to fix your asset tower of Babel and leverage the data volume rather than treat it like an unwanted stepchild? Most companies that operate asset-heavy, fixed-cost-heavy business models do not want to create a disruption but rather a steady tuning effect (squeezing the data orange), something rather unsexy in this internet day and age. This is especially true in "older" industries, where data is still considered a necessary evil, not an opportunity ready to be exploited. Fact is, though, that in order to improve the bottom line, we had better get going, even if it is with baby steps.
If you are aware of business models and their difficulties in leveraging data, write to me. If you know of an annoying, peculiar or esoteric data "domain" that does not lend itself to being easily leveraged, share your thoughts. Next time, I will share some examples of how certain industries try to work in this environment, what they envision and how they go about getting there.
I missed Strata this year, so I can only report back what I heard from my team. I was out on the road talking with customers while the gang was at Strata, talking to customers and prospective customers. That said, the conversations they had with the cool new Hadoop companies and my own conversations were quite similar: lots of talk about trials on Hadoop, but outside of the big internet firms, a few startups focused on solving "big data" problems and some Wall Street firms, most companies are still kicking the Hadoop tires.
Which reminds me of a picture my neighbor took of a presentation that he saw on Hadoop. The presenter had a slide with a rehash of an old joke that went something like this (I am paraphrasing here as I don’t have the exact quote):
“Hadoop is a lot like teenage sex. Everyone says they do it, but most are not. And for those who are doing it, most of them aren’t very good at it yet. “
So if you haven’t gotten started on your Hadoop project, don’t worry, you aren’t as far behind as you think.
My wife invited my new neighbors over for dinner this past Saturday night. They are a French couple with a super-cute 5-year-old son. Dinner was nice, and like most expats in the San Francisco Bay Area, he is in high tech. His company is a successful internet company in Europe but has had a hard time penetrating the U.S. market, which is why they moved to the Bay Area. He is starting up a satellite engineering organization in Palo Alto, and he asked me where he can find good "big data" engineers. He is having a hard time finding people.
This is a story I am hearing quite a bit from the customers I have been talking to as well. They want to start up big data teams but can't find enough skilled engineers who understand how to develop in Pig or Hive or YARN or whatever is coming next in the Hadoop/MapReduce world.
This reminds me of when I worked in the telecom software business 20 years ago, when everyone was looking at technologies like DCE and CORBA to build out distributed computing environments to solve complex problems that couldn't be solved easily on a single computing system. If you don't know what DCE or CORBA are (or were), that's OK; it is kind of the point. They were distributed computing development platforms that failed because they were too damn hard, and there just weren't enough people who could understand how to use them effectively. Now, DCE and CORBA were not trying to solve the same problems as Hadoop, but the basic point still stands: they were damn hard, and the reality is that programming on a Hadoop platform is damn hard as well. The sketch below gives a taste.
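For a sense of what "damn hard" means in practice, here is an illustrative hand-coded Hadoop Streaming word count in Python: two small scripts of plumbing just to count words, before any real parsing, cleansing or joining logic enters the picture.

```python
# Illustrative Hadoop Streaming word count. The two halves below would
# normally live in separate mapper.py and reducer.py scripts, run roughly as:
#   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py ...
import sys

def mapper():
    # Emit "<word>\t1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop delivers keys sorted, so sum each run of equal keys.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")
```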
So could Hadoop fail, just like CORBA and DCE? I doubt it, for a few key reasons. One, there is a considerable amount of venture and industrial investment going into Hadoop to make it work; not since Java has there been such a concerted effort by the industry to make a new technology successful. Second, much of that investment is going into graphical development environments and applications that use the storage and compute power of Hadoop but hide its complexity. That is what Informatica is doing with PowerCenter Big Data Edition: we are making it possible for data integration developers to parse, cleanse, transform and integrate data using Hadoop as the underlying storage and engine, without the developer having to know anything about Hadoop. The same thing is happening at the analytics layer, the data prep layer and the visualization layer.
Bit by bit, software vendors are hiding the underlying complexity of Hadoop so organizations won’t have to hire an army of big data scientists to solve interesting problems. They will still need a few of them, but not so many that Hadoop will end up like those other technologies that most Hadoop developers have never even heard of.
Power to the elephant. And more later about my dinner guest and his super cute 5 year old son.