SAP HANA Demystified

SAP HANA
Storage and Compression Techniques

"High Performance Analytic Appliance"

SAP's HANA is a technology that processes massive amounts of real-time data in main memory to deliver the results of analyses and transactions.

It is an IMDB (In-Memory Database), a next-generation database technology.

HANA defines its advantage through the following properties:

Big Data (ever-growing data)
Mobile (real time)
Cloud (on demand)
Groundbreaking yet not disruptive to existing landscapes.

To process massive amounts of data, the HANA DB has to hold that much data without compromising on performance. To meet this requirement, very efficient compression techniques have been put in place. Note that the HANA DB applies several of these techniques in succession, re-compressing already compressed data to keep the compression ratio high.

Another point to note is that these compression techniques were chosen so that they work on both column-store and row-store data.

Data compression has a wide variety of uses in IMDB (In-Memory Database) technology and keeps the IMCE (In-Memory Computing Engine) supplied with data to avoid processor idle time.

With the help of compression, the whole database is kept in main memory (RAM), which is the basis of HANA. Compression techniques can provide a compression ratio of up to 20:1, i.e. 20 GB of data can fit into 1 GB of main memory.

Compressed data can be loaded into the CPU cache faster. Because the limiting factor is the data transport between memory and CPU cache, the performance gain exceeds the additional computing time needed for decompression. Compression can also speed up operations such as scans and aggregations if the operator is aware of the compression.

The compression techniques used are:

Run Length Encoding

Cluster Encoding

Dictionary Encoding

They are explained below:

Run Length Encoding: This algorithm consists of replacing large sequences of repeating data with only one item of this data followed by a counter showing how many times this item is repeated. The original column is replaced with a two-column list. The first column contains the values and the second column contains the counts of consecutive occurrences.

Note that the original column can be reconstructed from the compressed list. This algorithm is particularly useful for data in which the same value repeats in long consecutive runs.
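As a rough illustration of the value/count list described above, here is a minimal Python sketch. The function names rle_encode and rle_decode are made up for this example and are not part of any HANA interface; the sketch only shows the idea, not how HANA implements it internally.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse each run of consecutive equal values into (value, count)."""
    return [(value, len(list(run))) for value, run in groupby(column)]


def rle_decode(pairs):
    """Rebuild the original column by repeating each value count times."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical sample column of country codes.
column = ["DE", "DE", "DE", "US", "US", "DE"]
pairs = rle_encode(column)   # [("DE", 3), ("US", 2), ("DE", 1)]
restored = rle_decode(pairs)  # identical to the original column
```

Note that the second run of "DE" gets its own entry: only consecutive occurrences are merged, which is why the technique pays off mostly on sorted or highly repetitive columns.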

Cluster Encoding: Cluster Encoding works by searching for multiple occurrences of the same sequence of values within the original column. The compressed column consists of a two-column list, with the first column containing the elements of a particular sequence and the second column containing the row numbers where the sequence starts in the original column. Note that this algorithm replaces strings of characters in memory.
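The sequence/start-row list described above can be sketched in Python as follows. This is a simplified illustration under one loud assumption: the column is cut into consecutive chunks of a fixed length (the seq_len parameter, invented for this example), whereas a real engine would choose its sequences differently. The names cluster_encode and cluster_decode are likewise hypothetical.

```python
def cluster_encode(column, seq_len=2):
    """Map each distinct chunk of seq_len values to the rows where it starts.

    Assumption: fixed-length, non-overlapping chunks (the last one may be
    shorter). Returns a two-column list of (sequence, start_rows).
    """
    starts = {}
    for row in range(0, len(column), seq_len):
        chunk = tuple(column[row:row + seq_len])
        starts.setdefault(chunk, []).append(row)
    return list(starts.items())


def cluster_decode(encoded):
    """Rebuild the column by putting every sequence back at its start rows."""
    placed = {}
    for chunk, rows in encoded:
        for row in rows:
            placed[row] = chunk
    column = []
    for row in sorted(placed):
        column.extend(placed[row])
    return column


# Hypothetical sample column: the sequence A, B occurs three times.
column = ["A", "B", "A", "B", "C", "D", "A", "B"]
encoded = cluster_encode(column)
# [(("A", "B"), [0, 2, 6]), (("C", "D"), [4])]
```

Each repeated sequence is stored only once, so the saving grows with the number of repetitions and the length of the sequence.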

Dictionary Encoding: With dictionary encoding, the columns are stored as sequences of bit-coded integers. That means that a check for equality can be executed on the integers (for example during scans or join operations). This is much faster than comparing, for example, string values. This technique leads to high compression rates, e.g. for country codes or customer numbers.

Table columns that contain only a comparatively small number of distinct values can be compressed effectively by enumerating the distinct values and storing only their numbers.

This technique requires maintaining an additional table, the dictionary, whose first column contains the original values and whose second column contains the numbers representing them.
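To make the dictionary idea concrete, here is a minimal Python sketch. It assumes a sorted dictionary in which a value's position serves as its number; that ordering, and the function names dictionary_encode, dictionary_decode and scan_equal, are choices made for this example rather than a description of HANA's internals.

```python
def dictionary_encode(column):
    """Return (dictionary, ids): distinct values plus one integer per row."""
    dictionary = sorted(set(column))  # assumption: sorted, position = number
    value_to_id = {value: i for i, value in enumerate(dictionary)}
    ids = [value_to_id[value] for value in column]
    return dictionary, ids


def dictionary_decode(dictionary, ids):
    """Rebuild the original column by looking every number up again."""
    return [dictionary[i] for i in ids]


def scan_equal(dictionary, ids, value):
    """Find matching rows by comparing integers instead of strings."""
    if value not in dictionary:
        return []
    target = dictionary.index(value)  # one string lookup, then integer compares
    return [row for row, i in enumerate(ids) if i == target]


# Hypothetical sample column of country codes.
column = ["DE", "US", "DE", "FR", "US", "DE"]
dictionary, ids = dictionary_encode(column)
# dictionary == ["DE", "FR", "US"], ids == [0, 2, 0, 1, 2, 0]
rows = scan_equal(dictionary, ids, "US")  # [1, 4]
```

With only three distinct values, each row needs just two bits instead of a full string, and the scan touches nothing but small integers.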


Friday, August 10, 2012
