Tag Archives: RDBMS
In my previous blog, I explained how Column-oriented Database Management Systems (CDBMS), also known as columnar databases or CBAT, offer a distinct advantage over the traditional row-oriented RDBMS in terms of I/O workload, deriving primarily from basing the granularity of I/O operations on the column rather than the entire row. This technological advantage has a direct impact on the complexity of data modeling tasks and on the end-user’s experience of the data warehouse, and this is what I will discuss in today’s post. (more…)
Column-oriented Database Management Systems (CDBMS), also referred to as columnar databases and CBAT, have been getting a lot of attention recently in the data warehouse marketplace and trade press. Interestingly, some of the newer companies offering CDBMS-based products give the impression that this is an entirely new development in the RDBMS arena. This technology has actually been around for quite a while. But the market has only recently started to recognize the many benefits of CDBMS. So, why is CDBMS now coming to be recognized as the technology that offers the best support for very large, complex data warehouses intended to support ad hoc analytics? In my opinion, one of the fundamental reasons is the reduction in I/O workload that it enables. (more…)
Columnar Deduplication and Column Tokenization: Improving Database Performance, Security and Interoperability
For some time now, a special technique called columnar deduplication has been implemented by a number of commercially available relational database management systems. In today’s blog post, I discuss the nature and benefits of this technique, which I will refer to as column tokenization for reasons that will become evident.
Column tokenization is a process in which a unique identifier (called a Token ID) is assigned to each unique value in a column, and then employed to represent that value anywhere it appears in the column. Using this approach, data size reductions of up to 50% can be achieved, depending on the number of unique values in the column (that is, on the column’s cardinality). Some RDBMSs use this technique simply as a way of compressing data: the column tokenization process is integrated into the buffer and I/O subsystems, and when a query is executed, each row must be materialized, with the token IDs replaced by their corresponding values. At Informatica, in the File Archive Service (FAS), part of the Information Lifecycle Management product family, column tokenization is the core of our technology: the tokenized structure is actually used during query execution, with row materialization occurring only when the final result set is returned. We also use special compression algorithms to achieve further size reduction, typically on the order of 95%.
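To make the idea concrete, here is a minimal sketch in Python of tokenizing a single column and filtering on the tokens, with values materialized only for the final result. It is an illustration under simplifying assumptions, not the FAS implementation: the function names (`tokenize_column`, `filter_rows`, `materialize`) and the sample data are hypothetical, and the additional compression step is not shown.

```python
def tokenize_column(values):
    """Assign a Token ID to each distinct value in a column.

    Returns (dictionary, token_ids): dictionary[token_id] is the original
    value, and token_ids holds one Token ID per row.
    """
    token_of = {}      # value -> Token ID
    dictionary = []    # Token ID -> value
    token_ids = []     # one Token ID per row
    for value in values:
        token_id = token_of.get(value)
        if token_id is None:
            token_id = len(dictionary)
            token_of[value] = token_id
            dictionary.append(value)
        token_ids.append(token_id)
    return dictionary, token_ids


def filter_rows(dictionary, token_ids, predicate):
    """Evaluate the predicate once per distinct value, then scan tokens only."""
    matching = {tid for tid, value in enumerate(dictionary) if predicate(value)}
    return [row for row, tid in enumerate(token_ids) if tid in matching]


def materialize(dictionary, token_ids, rows):
    """Replace Token IDs with their values, only for the selected rows."""
    return [dictionary[token_ids[row]] for row in rows]


# Hypothetical sample column with three distinct values over six rows.
cities = ["Austin", "Boston", "Austin", "Chicago", "Boston", "Austin"]
dictionary, token_ids = tokenize_column(cities)
rows = filter_rows(dictionary, token_ids, lambda v: v.startswith("A"))
result = materialize(dictionary, token_ids, rows)
```

In this sketch the predicate runs three times (once per distinct value) rather than six times (once per row), and values are looked up only for the rows that survive the filter. The lower the column’s cardinality relative to its row count, the larger that saving becomes.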
In my previous blog I briefly mentioned the term “data temperature.” But what exactly does this term mean? Picture yourself logging in to your bank’s website to look for a transaction in your checking account. Very frequently, you want to look at pending transactions and at debits and credits from the last 10 days. Frequently, you need to look further back, maybe at a one-month statement, to find a check whose purpose you don’t remember. Maybe once a quarter, you need to get information about a debit that happened three months ago, for a subscription to a new magazine that is not arriving in your mailbox. And of course, once a year you check yearly statements for your tax return. Give or take a few other scenarios, I am pretty sure I covered most of your use cases, right? (more…)
“The report of my death was an exaggeration.”
– Mark Twain
Ah yes, another conference, another old technology declared dead. Mainframe… dead. Any programming language other than Java… dead. 8-track tapes… OK, well, some things thankfully do die, along with the Ford Pinto in which I used to listen to the Beatles Greatest Hits Red Album over and over again on that 8-track… ah yes, the good old days, but I digress. (more…)
Informatica Data Replication has triggered a lot of interest and also a lot of questions about why replication is seeing such a resurgence in the market today. The answer is simple: the same conditions that caused its creation for mission-critical operational systems back in the early 1990s are now happening with data warehousing as well. (more…)