Random data compression – Is it possible? (Part 2)
This is continuation of my previous article on Random data compression possibilities.
Compression of unique values
As I indicated in previous article if for every 256 bytes each value appears only ones, then we can achieve the compression using any of the below techniques.
1. Using insertion sort – by remembering the position.
Using this technique for every 256 byte – we can save exactly 32 bytes, i.e for first element no need to remember any position, for next value one bit sufficient to remember weather current element inserted after or before previous number, for next two numbers we need two bits for each value, for next 4 numbers we need 3 bits for each number and so on until for last 128 values we need 8 bit for each value. It is constant save – i.e for every possible combination we will have 1792 bits or 224bytes. (We save 256 – 224 = 32bytes)
2. Using factorial of 256
In this approach, all possible variations can represented using factorial of 256, that will end up with 1684bits for every possible combination. It is constant save – i.e for every possible combination we will have 1684 bits or 210.5 bytes. (We save 256 – 210.5 = 45.5bytes)
3. Using reverse merge sort – Using this approach use a merge sort and store bit information of the list from where we picked the smaller number. Advantage of this technique is, when one list gets over, we don’t require to store or remember unsorting information for remaining element in other list, we just need to append it as illustrated below.
In best case scenario(when all elements are in ascending order) we just need (1/2*8bit) x 256 = 128 bytes for total 256 input bytes, in worst case we need 1793 bits as illustrated below.
Depending on the unique numbers appear in the list we can save between 128 byte(best case) to 31.75 byte (worst case)
Similarly we can use reverse quick sort technique, however merge sort is better than quick in storing minimal unsorting information, because in merge when one list gets over, we don’t require unsorting information for remaining elements in other list, this feature missing if we use quick sort, but still reverse quick sort can be used with lesser than 2048bits.
How to use above theory when all numbers are not unique?
Well above said theory works when all numbers are unique, but in real time such scenario is rare, instead there will be some duplicates and some unique numbers missing.
In fact random digits created by the RAND group contains on an average 100bytes duplicates(repeated) in every block of 256 numbers. in such data we need to patch the dataset by removing non unique, and inserting missing unique numbers, so that above reverse merge sort can be applied. When we remove duplicates we need to retain its position and actual value. (No need to retain the introduced missing numbers, which will be discarded after unsorting. But to patch this way we need lot of byte, e.g: if 100 numbers is duplicates (i.e 100 numbers are missing), we need to have 100*2=200 bytes. (very costly).
However there is better patch as below
1. Read 256 block of data, mark the duplicates/non unique, remember these relative to current position of the number (so we don’t require 8bits for position), Now we have missing unique numbers.
2. Rearrange the input data by eliminating duplicates so that unique numbers moves upper part of the list, fill missing unique numbers at the end of list. (make sure all missing numbers are filled in ascending order so that we can avoid storing unsorting information of missing numbers – which is of nu use for us – This way we can save few bytes)
3. Perform merge sort on the list excluding introduced missing numbers, now the compressed file will have Unsorting information + duplicate number positions + duplicate values.
4. If missing numbers are less than 32 byte we can actually store position using eight bits per position, if it is more than 32, then we use a bit for each byte indicating duplicate or not.
Using above technique I developed a compressor which can compress random data up to 42 duplicates or missing unique numbers. (I used million digit of Rand with some modification to maintain maximum 42 missing numbers in every 256 input bytes. However I am yet to find way to address another 60 bytes so that million digit of Rand group can be compressed at least by a byte.
I will fine tune my compressor and announce in next article in next year. (After Xmas/new year), next article I will introduce another technique of transition or state representation for random data.
Wish you all happy new year 2010.