Tag Archives: database
Column-oriented Database Management Systems (CDBMS), also referred to as columnar databases and CBAT, have been getting a lot of attention recently in the data warehouse marketplace and trade press. Interestingly, some of the newer companies offering CDBMS-based products give the impression that this is an entirely new development in the RDBMS arena. This technology has actually been around for quite a while. But the market has only recently started to recognize the many benefits of CDBMS. So, why is CDBMS now coming to be recognized as the technology that offers the best support for very large, complex data warehouses intended to support ad hoc analytics? In my opinion, one of the fundamental reasons is the reduction in I/O workload that it enables. (more…)
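The I/O argument sketched above can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only (the table shape, column count, and value width are assumptions, not figures from the post): an analytic query that references two columns of a twenty-column table forces a row store to read all twenty, while a column store reads just the two it needs.

```python
# Illustrative sketch (assumed figures): why a column store reduces I/O
# for analytic queries that touch only a few columns of a wide table.

ROWS = 1_000_000
COLUMNS = 20
BYTES_PER_VALUE = 8  # assume fixed-width values for simplicity

# A row store must read every column of every row it scans.
row_store_io = ROWS * COLUMNS * BYTES_PER_VALUE

# A column store reads only the columns the query references,
# e.g. an aggregate over "amount" grouped by "region" touches 2 of 20.
columns_touched = 2
column_store_io = ROWS * columns_touched * BYTES_PER_VALUE

print(f"row store scan:    {row_store_io / 1e6:.0f} MB")    # 160 MB
print(f"column store scan: {column_store_io / 1e6:.0f} MB") # 16 MB
print(f"I/O reduction:     {1 - column_store_io / row_store_io:.0%}")
```

Under these assumptions the columnar scan does 90% less I/O; the real-world gain depends on how many columns the workload's queries actually touch.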
Columnar Deduplication and Column Tokenization: Improving Database Performance, Security and Interoperability
For some time now, a special technique called columnar deduplication has been implemented by a number of commercially available relational database management systems. In today’s blog post, I discuss the nature and benefits of this technique, which I will refer to as column tokenization for reasons that will become evident.
Column tokenization is a process in which a unique identifier (called a token ID) is assigned to each unique value in a column and then used to represent that value wherever it appears in the column. With this approach, data size reductions of up to 50% can be achieved, depending on the number of unique values in the column (that is, on the column’s cardinality). Some RDBMSs use this technique simply as a way of compressing data: the column tokenization process is integrated into the buffer and I/O subsystems, and when a query is executed, each row must be materialized and its token IDs replaced by their corresponding values. At Informatica, in the File Archive Service (FAS) component of the Information Lifecycle Management product family, column tokenization is the core of our technology: the tokenized structure is used directly during query execution, with row materialization occurring only when the final result set is returned. We also apply special compression algorithms to achieve further size reduction, typically on the order of 95%.
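The mechanism described above can be sketched in a few lines. This is a minimal, hypothetical illustration of dictionary-style tokenization, not Informatica's actual implementation; the function names are invented for the example.

```python
# Hypothetical sketch of column tokenization (dictionary encoding):
# each unique column value is assigned a token ID, and the column
# stores only token IDs. Not the FAS implementation; names invented.

def tokenize_column(values):
    """Return (dictionary, encoded) where encoded holds token IDs."""
    dictionary = {}  # value -> token ID
    encoded = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)  # assign the next token ID
        encoded.append(dictionary[v])
    return dictionary, encoded

def materialize(dictionary, encoded):
    """Replace token IDs with values, as done for the final result set."""
    reverse = {tid: v for v, tid in dictionary.items()}
    return [reverse[tid] for tid in encoded]

states = ["CA", "NY", "CA", "TX", "NY", "CA"]
dictionary, encoded = tokenize_column(states)
print(encoded)                                     # [0, 1, 0, 2, 1, 0]
print(materialize(dictionary, encoded) == states)  # True
```

Note how the savings track cardinality: six values collapse to three dictionary entries plus small token IDs, and a low-cardinality column (say, a state code repeated millions of times) shrinks dramatically, while a unique-key column gains nothing.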
I was at an IT conference a few years ago. The speaker was talking about application testing. At the beginning of his talk, he asked the audience:
“Please raise your hand if you flew here from out of town.”
Most of the audience raised their hands. The speaker then said:
“OK, now if you knew that the airplane you flew on had been tested the same way your company tests its applications, would you have still flown on that plane?”
After some uneasy chuckling, every hand went down. Not a great affirmation of the state of application testing in most IT shops. (more…)
In terms of data integration, the notion of data virtualization lets us think about collections of data or services as abstract entities. Thus the abstractions can be represented in a form that is most useful to the integration server or the data integration architect. It’s this notion of abstraction that provides for the grouping of related pieces of information. These groups are independent of their physical location and structure, and allow us to define and understand what meaningful operations can be performed on the data or services.
We leverage data virtualization for a few core reasons: (more…)
The ability to create abstract schemas that are mapped to back-end physical databases provides a huge advantage for those enterprises looking to get their data under control. However, given the power of data virtualization, there are a few things that those in charge of data integration should know. Here are a few quick tips.
Tip 1: Start with a new schema that is decoupled from the data sources. (more…)
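Tip 1 can be sketched with a toy mapping layer. Everything here is hypothetical (the field names, source names, and `resolve` helper are invented for illustration); the point is simply that the abstract schema is designed first and bound to physical sources only through mappings, so consumers never depend on any one back-end.

```python
# Illustrative sketch of Tip 1: define an abstract schema first, then
# map it onto physical sources. All names here are hypothetical.

# Abstract schema designed around the business view, not any one database.
CUSTOMER_VIEW = ["customer_id", "full_name", "region"]

# Mappings bind each abstract field to a (source, column) pair; swapping
# a back-end means changing a mapping, not the consumers of the schema.
MAPPINGS = {
    "customer_id": ("crm_db", "cust.id"),
    "full_name":   ("crm_db", "cust.name"),
    "region":      ("erp_db", "acct.sales_region"),
}

def resolve(field):
    """Translate an abstract field into its physical location."""
    source, column = MAPPINGS[field]
    return f"{source}:{column}"

for field in CUSTOMER_VIEW:
    print(field, "->", resolve(field))
```

Because the schema was not derived from any single source, fields can draw from multiple systems (here, CRM and ERP) without the consumer knowing or caring.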
I regularly receive questions regarding the types of skills data quality analysts should have in order to be effective. In my experience, regardless of scope, high-performing data quality analysts need to possess a well-rounded, balanced skill set – one that marries technical know-how and aptitude with solid business understanding and acumen. But far too often, it seems that undue importance is placed on what I call the data quality “hard skills”, which include: a firm grasp of database concepts, hands-on data analysis experience using standard analytical tool sets, expertise with commercial data quality technologies, knowledge of data management best practices, and an understanding of the software development life cycle. (more…)
Database partitioning and database archiving are both methods for improving application performance. Many IT organizations use one or the other, but using them together can provide incremental value to an organization.
Database partitioning is a well-known method to DBAs and is supported by most of the commercially available databases. The benefits of partitioning include: (more…)
Data services, data services, data services. Do I sound like a broken record? Forgive me if I seem obsessed with the topic, but I truly believe that technology can change your enterprise, and allow IT to finally get a handle on data in the shortest amount of time.
The real value lies in SOA data services. These services allow enterprises to place an easy-to-configure layer between the source physical databases and those that wish to consume the data, either applications or humans. If this seems simple, why, you are right! It is. Why is it so simple? It’s because the complexity is hidden from you, including the access mechanisms to the physical data, the transformation of schemas from physical to abstract, and even the management of data quality and integrity.
So where is the value? There are three core points to consider here: (more…)
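The "easy-to-configure layer" described above can be sketched as a simple facade. This is a toy illustration under stated assumptions, not any real product's API: the class and function names are invented, and the "physical source" is faked with an in-memory dictionary. What it shows is the shape of the idea: consumers call one abstract interface, while the access mechanism and the physical-to-abstract schema transformation stay hidden behind it.

```python
# Hedged sketch of a data service layer: consumers call one abstract
# interface; physical access and schema transformation are hidden.
# Class and method names are illustrative, not a real product API.

class DataService:
    def __init__(self, fetchers, transform):
        self._fetchers = fetchers    # physical access mechanisms
        self._transform = transform  # physical -> abstract schema mapping

    def get(self, entity, key):
        raw = self._fetchers[entity](key)  # hidden physical access
        return self._transform(raw)        # hidden schema transformation

# A fake physical source that returns rows in its own (physical) schema.
def fetch_customer(key):
    return {"CUST_ID": key, "CUST_NM": "Acme Corp"}

# Transformation from the physical schema to the abstract one.
def to_abstract(row):
    return {"id": row["CUST_ID"], "name": row["CUST_NM"]}

service = DataService({"customer": fetch_customer}, to_abstract)
print(service.get("customer", 42))  # {'id': 42, 'name': 'Acme Corp'}
```

If the physical source later moves or its schema changes, only the fetcher and transform registered with the service change; every consumer of the abstract interface is untouched.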
Many people ask me what additional benefits archiving can provide when you already partition your database. The answer is two-fold:
- Archiving allows you to eliminate more data from being processed, further improving response time.
- Partitioning doesn’t necessarily reduce your backup window. The only way to shorten your backup window is to remove data from your production database by archiving (or purging) it to another location.
Let me expand on these points. (more…)
Last month, Informatica and EMC announced a strategic partnership at EMC’s annual user conference in Boston. This is a significant new relationship for both companies, which in itself is interesting. You would have thought that the company responsible for storing more data than just about anybody in the world and the company responsible for moving more data than anybody in the world would have come together many years ago. So why now? What’s different?
Virtualization changes everything. Customers have moved beyond virtualizing their infrastructure and their operating systems and are now trying to apply the same principles to their data. Whether we’re moving the data to the processing, or the processing to the data, it’s clear that where data physically lives has become increasingly irrelevant. Customers want data as a service, and they don’t want to be hung up on the artificial boundaries created by applications, databases, schemas, or physical devices. (more…)