Tag Archives: data warehouse
We have been looking at how data management issues can be classified. In my last post I provided five categories and broke them down into two groups: Systemic and System. Systemic issues are ones in which process or management gaps allow data flaws to be introduced. A good example occurs when consumers of reports from the data warehouse insist that the data sets are incomplete, and the root cause is that the processes in which the data is initially collected or created do not comply with the downstream requirement for capturing the missing values.
Data warehouses are applications, so why not manage them like one? In fact, data grows at a much faster rate in data warehouses, since they integrate data from multiple applications and cater to many different groups of users who need different types of analysis. Data warehouses also keep historical data for a long time, so data grows exponentially in these systems. Infrastructure costs in data warehouses also escalate quickly, since analytical processing on large amounts of data requires big, beefy boxes, to say nothing of the software license and maintenance costs for such a large amount of data. Imagine how much backup media is required to back up tens to hundreds of terabytes of data warehouse content on a regular basis. But do you really need to keep all that historical data in production?
One of the challenges of managing data growth in data warehouses is that it is hard to determine which data is actually used, which data is no longer being used, or whether some of it was ever used at all. Unlike transactional systems, where the application logic determines when records are no longer being transacted upon, the usage of analytical data in data warehouses follows no definite business rules. Age or seasonality may determine data usage in data warehouses, but business users are usually loath to let go of the availability of all that data at their fingertips. The only clear-cut way to prove that some data is no longer being used in a data warehouse is to monitor its usage.
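To make the monitoring idea concrete, here is a minimal sketch: scan the warehouse's query log and count how often each table is referenced, flagging tables that never appear as archiving candidates. The log format, table names, and matching rule are all invented for illustration; real warehouses expose usage through their own audit or system views.

```python
import re
from collections import Counter

def count_table_usage(query_log, known_tables):
    """Count how often each known table name appears in logged SQL queries."""
    usage = Counter({t: 0 for t in known_tables})
    for query in query_log:
        for table in known_tables:
            # Match the table name as a whole word, case-insensitively.
            if re.search(r"\b" + re.escape(table) + r"\b", query, re.IGNORECASE):
                usage[table] += 1
    return usage

# Hypothetical sample: three tables, one of which is never queried.
tables = ["sales_fact", "customer_dim", "legacy_orders_2004"]
log = [
    "SELECT region, SUM(amount) FROM sales_fact GROUP BY region",
    "SELECT * FROM sales_fact f JOIN customer_dim c ON f.cust_id = c.id",
]
usage = count_table_usage(log, tables)
unused = [t for t, n in usage.items() if n == 0]
print(unused)  # tables that are candidates for archiving
```

Over a long enough monitoring window, the tables that never show up in the log are the ones whose absence from production no user would notice.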
With just a few days remaining in what has been an eventful year, I thought I'd take some time to reflect on the world of data quality as I've observed it over the past twelve months. While the idea of data quality improvement in general didn't change much, the way that companies are viewing and approaching it most certainly has. Here are three areas that seemed to come up quite frequently:
Data governance awareness grew
In thinking about all the customer interactions that I was involved in throughout the year, it's hard to come up with one where the topic of data governance didn't surface. Whereas before, the topic of data governance only seemed to come up for companies with more mature data management organizations, now it seems everyone is looking to build a governance framework in conjunction with their data quality efforts. Furthermore, while previously the conversation was largely driven by IT, now it's both IT and business stakeholders that are looking for answers to how data governance can help them drive better business outcomes. In increasingly competitive market conditions, we can only expect this trend to continue. Whether it's focused on increasing revenue, driving out cost or managing risk and compliance, data quality with data governance is where companies of all sizes are turning to create and sustain a differentiated edge. Trends like big data will only make this need more acute.
Several years ago I had the opportunity to participate in a post-mortem study of a $100 million project failure. No one likes to be associated with a project failure, but in this case it was fortunate: the write-off was large enough that it forced the team to take a very hard look at root causes rather than do a cursory analysis. As a result we finally got to the heart of a challenge that has been plaguing data architects and designers for 20 years: how to effectively use canonical data models.
The devil, as they say, is in the detail. Your organization might have invested years of effort and millions of dollars in an enterprise data warehouse, but unless the data in it is accurate and free of contradiction, it can lead to misinformed business decisions and wasted IT resources.
We’re seeing an increasing number of organizations confront the issue of data quality in their data warehousing environments in efforts to sharpen business insights in a challenging economic climate. Many are turning to master data management (MDM) to address the devilish data details that can undermine the value of a data warehousing investment.
Consider this: Just 24 percent of data warehouses deliver “high value” to their organizations, according to a survey by The Data Warehousing Institute (TDWI). Twelve percent are low value and 64 percent are moderate value “but could deliver more,” TDWI’s report states. For many organizations, questionable data quality is the reason why data warehouses fall short of their potential.
Businesses have seen great success in using virtualization to gain greater efficiencies from their hardware and network resources. Now, the concept of virtualization has been extended to the data layer.
The bottom line: data virtualization is about providing a logical abstraction of all underlying data, so that it appears as one data source to consuming applications.
However, given that your data is often distributed, heterogeneous, and error-ridden, it's not enough to simply federate it and pass this off as data virtualization. The data you deliver to your end users must be data they can trust; traditional data federation approaches seem to ignore this fact. They simply propagate inconsistent and inaccurate data, quickly. So where is the gap?
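The gap can be shown with a minimal sketch, assuming two hypothetical source systems that disagree on formatting: plain federation just unions the records, duplicates and inconsistencies included, while a virtualization layer with quality rules standardizes and de-duplicates before delivery. The record layouts, cleansing rule, and system names below are all invented for illustration.

```python
def federate(*sources):
    """Plain federation: union the records from every source, flaws and all."""
    combined = []
    for source in sources:
        combined.extend(source)
    return combined

def standardize(record):
    """Toy cleansing rule: normalize name casing and strip whitespace."""
    return {"id": record["id"], "name": record["name"].strip().title()}

def virtualize(*sources):
    """Federation plus cleansing and de-duplication on a shared key."""
    seen = {}
    for record in federate(*sources):
        clean = standardize(record)
        seen[clean["id"]] = clean  # later sources override earlier duplicates
    return list(seen.values())

crm = [{"id": 1, "name": "  jane doe"}]
erp = [{"id": 1, "name": "JANE DOE"}, {"id": 2, "name": "john smith"}]

print(len(federate(crm, erp)))  # 3 records, including a conflicting duplicate
print(virtualize(crm, erp))     # 2 clean, consistent records
```

Federation alone hands the consuming application both versions of the same customer; the quality step is what turns "one data source" from an interface claim into a trustworthy one.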
In December 2005, Sun Microsystems conducted an interview with Bill Inmon, the father of the data warehouse concept. He said, “ILM keeps a data warehouse from costing huge amounts of money and maintains good performance consistently throughout the data warehouse environment.” Four years later, the average size of a data warehouse has increased by 200%, surpassing the multi-terabyte size benchmark.
With these mammoth databases comes an increase in the cost to manage them and a potential deterioration in performance. It is common practice to leverage techniques like indexing and database partitioning to address query performance issues with very large databases, but those techniques do not address the challenges associated with the raw volume of data.
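Addressing raw volume is where the ILM idea from Inmon's interview comes in: apply a retention policy so that rows older than a cutoff move out of the production store entirely, rather than just being partitioned within it. A minimal sketch, with the retention window and record layout invented for illustration:

```python
from datetime import date

RETENTION_YEARS = 3  # hypothetical policy: keep three years in production

def split_by_retention(rows, today, retention_years=RETENTION_YEARS):
    """Split warehouse rows into (active, archivable) sets by a date cutoff."""
    cutoff = date(today.year - retention_years, today.month, today.day)
    active = [r for r in rows if r["txn_date"] >= cutoff]
    archivable = [r for r in rows if r["txn_date"] < cutoff]
    return active, archivable

# Hypothetical fact rows: one recent, one well past the retention window.
rows = [
    {"txn_date": date(2009, 6, 1), "amount": 120.0},
    {"txn_date": date(2005, 1, 15), "amount": 75.5},
]
active, archivable = split_by_retention(rows, today=date(2009, 12, 1))
print(len(active), len(archivable))
```

The archivable set can be moved to cheaper storage and dropped from production, which shrinks backup windows and keeps the "big beefy boxes" working only on data that queries still touch.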