For some time now, a special technique called columnar deduplication has been implemented by a number of commercially available relational database management systems. In today’s blog post, I discuss the nature and benefits of this technique, which I will refer to as column tokenization for reasons that will become evident.
Column tokenization is a process in which a unique identifier (called a Token ID) is assigned to each unique value in a column, and then employed to represent that value anywhere it appears in the column. Using this approach, data size reductions of up to 50% can be achieved, depending on the number of unique values in the column (that is, on the column’s cardinality). Some RDBMSs use this technique simply as a way of compressing data; the column tokenization process is integrated into the buffer and I/O subsystems, and when a query is executed, each row needs to be materialized and the token IDs replaced by their corresponding values. At Informatica for the File Archive Service (FAS) part of the Information Lifecycle Management product family, column tokenization is the core of our technology: the tokenized structure is actually used during query execution, with row materialization occurring only when the final result set is returned. We also use special compression algorithms to achieve further size reduction, typically on the order of 95%.
The scope of the column tokenization process can vary, depending on the specific implementation: some databases will create and maintain a single pool of unique values for the entire database, while others will create a hidden pool for each unit of page size or block size. Another major difference relates to how dynamic the process is: in some RDBMS’s, the cardinality of the column must be specified in advance in order for the system to select the correct Token ID size (a Token ID is usually an unsigned integer number stored into 1, 2, 4 or 8 bytes and 1 bit). Making this specification requires a survey of the data, which can be a time-consuming operation. Informatica FAS implementation is dynamic, in that the Token ID size is automatically selected during the tokenization process, and can vary within a given column. Finally, some databases perform multi-column tokenization, where a single Token ID is used to replace the values of multiple contiguous columns taken together. For example, if a table contains the contiguous columns “ProductId”, “ProductName” and “Product_Description”, a single Token ID might be used to replace the associated values in these three columns. Multi-column tokenization of this sort has some constraints: the columns must be contiguous, they can’t be updatable, and the table rows need to be materialized before being used to respond to a query.
The list of unique values created by the column tokenization process can be defined as a hidden “dictionary”, or can be a real database object created by the DBA. At Informatica, we call this structure a Domain. Once created, a given Domain can be associated with different Columns: for example a Domain defined to store IP addresses can be used by any column of any table in the database that contains IP address data. Having columns that share the same underlying domain enables quick comparisons, since integer values will be compared rather than character values. This can improve the overall operational performance of the database. Another interesting aspect of the tokenization process is the logic of Token ID assignment: Token IDs can be assigned in chronological order or in a Lexical order. The Lexical ordering transforms the underlying Token ID into a Lexical ID, which enables fast ordering.
A specific type of tokenization can be used for integer values, where the integer value itself can become the Token ID. This representation enables definition of Universal Token IDs, which can be exchanged between distributed databases that share the same underlying architecture but not the same value set. Often, a database will contain multiple columns that act as flags with only two possible values, for example Y and N or 1 and 0. For those columns, the underlying tokenization process can be associated with a bitmap, where 0 will associated with one value, and 1 with the other.
As I mentioned at the beginning of this post, Column Tokenization techniques have been in use for a long time, and have demonstrated great advantages for very large databases in terms of I/O workload reduction. Currently there are not many commercial databases that operate directly on tokenized structures to enable query performance improvement. The notion of Universal Token IDs, or pseudo-Universal Token IDs, is an interesting concept which could be used to improve and secure database data exchange and physical data partitioning.