Random data compression – One bit diff encoding
This is a continuation of my previous article, Random data compression – How to use merge sort?
As I mentioned earlier, it is hard, or impossible as of now, to compress the million random digits. As per my analysis, the million random digit file contains around 90–110 bytes of duplicates (repeated values) within every block of 256 bytes. When input data is very pure or nearly pure (I mean unique), or when input data is highly polluted or noisy (I mean full of duplicates), it is easy to compress. Maybe we can borrow the idea from chemistry: if you refer to the periodic table, lighter elements can be combined using nuclear fusion (stars), and heavier elements can be broken apart using nuclear fission (uranium). In both cases energy is released, which we can relate to compression. However, the periodic table also shows that elements with a highly stable nucleus are hard to break or fuse, e.g. lead.
I can see the same pattern in the million random digits: when uniqueness reaches an extreme high or an extreme low, we can easily compress the data.
The compressor I have developed so far could achieve compression only when there are fewer than 42 or more than 156 duplicates in every 256 bytes.
Before I go further with my compression theory, I have to write one more article about the one-bit-difference variable encoder. This is a known algorithm based on probability, or rather on the number of possibilities.
Regarding bijective coding, you might have observed that it is a one-to-one mapping; however, this fails to address the relations within the elements of the donor domain set or the acceptor domain set.
One-bit variant encoding is based on the number of possibilities: if we have 4 unique possibilities, then we need 2 bits (00, 01, 10, 11) to represent all possible values. How many bits do we require when we have 5 possibilities, or say 6?
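This is essentially the situation handled by what is commonly called truncated binary (phase-in) coding: with n possibilities, some values can be coded in floor(log2 n) bits and the rest need one bit more. A minimal sketch in Python, assuming this standard scheme (the function name and structure are mine, not from this article):

```python
def truncated_binary(x, n):
    """Prefix-free code for a value x among n equally likely possibilities.

    With k = floor(log2(n)), the first u = 2**(k+1) - n values get short
    k-bit codes; the remaining n - u values get (k + 1)-bit codes.
    """
    if n == 1:
        return ''                       # only one possibility: no bits needed
    k = n.bit_length() - 1              # floor(log2(n))
    u = (1 << (k + 1)) - n              # number of short codewords
    if x < u:
        return format(x, f'0{k}b')      # short code: k bits
    return format(x + u, f'0{k + 1}b')  # long code: k + 1 bits

# With 5 possibilities: 0 -> 00, 1 -> 01, 2 -> 10, 3 -> 110, 4 -> 111
# With 6 possibilities: 0 -> 00, 1 -> 01, 2 -> 100, 3 -> 101, 4 -> 110, 5 -> 111
```

So five possibilities cost between 2 and 3 bits per value, and the stream stays uniquely decodable because no short code is a prefix of any long one.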
In one of my earlier articles I mentioned the use of insertion sort or bubble sort, stating that with this technique we can represent every possible combination in 1792 bits (224 bytes). I need to refactor that statement: we need far fewer bits.
Assume that you have a sorted list of all 256 unique values. When a random input arrives, say 26, you pick it from the initialized list; actually, you pick its position. Since it is the first input and there are 256 possibilities, we need eight bits to represent the position. For the next random input we choose among the remaining 255 possibilities; in this case, except for one position, we again need eight bits. From there on, the number of bits depends on which position we pick from the remaining elements. We can calculate the required bits for the best case (inputs arrive in sorted order) and the worst case (inputs arrive in reverse sorted order) as below.
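The shrinking-list idea above can be sketched as follows. This is my own hypothetical reconstruction, not the article's exact implementation: I am assuming a truncated binary (phase-in) code for each pick, and the names are mine.

```python
def truncated_binary(x, n):
    # Prefix-free code for x in [0, n): floor(log2 n) or one more bit.
    if n == 1:
        return ''                     # forced pick needs no bits
    k = n.bit_length() - 1            # floor(log2(n))
    u = (1 << (k + 1)) - n            # count of short codewords
    return format(x, f'0{k}b') if x < u else format(x + u, f'0{k + 1}b')

def encode_block(block):
    """Encode a block of distinct byte values as one bit string by
    repeatedly picking each value's position from a shrinking
    sorted list of the remaining possibilities."""
    remaining = list(range(256))      # the initialized sorted list
    bits = ''
    for b in block:
        pos = remaining.index(b)      # position among what is left
        bits += truncated_binary(pos, len(remaining))
        remaining.pop(pos)            # one fewer possibility next time
    return bits

# The first pick always costs 8 bits, since there are 256 possibilities:
# encode_block([26]) -> '00011010'
```

The first pick always costs 8 bits; later picks cost 8 or fewer as the list shrinks, which is where the saving over a flat 8 bits per value comes from.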
If you refer to the illustration below, you will quickly understand that as the number of possibilities changes, the number of required bits changes while maintaining the uniqueness of the bit stream.
The cells in yellow are one-bit-diff encoded based on the first entry in each row.
Using the above calculation, you can represent every 256 unique bytes using 1545–1793 bits (193.125–224.125 bytes).
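As a sanity check on that range, here is a short tally under my truncated-binary assumption: counting the short code for every pick gives the best case, and the long code for every pick gives the worst case. A straightforward count gives 1546 bits for the best case, one bit above the figure quoted here (the small difference presumably comes from the yellow-cell adjustment in the illustration); the worst case matches at 1793 bits.

```python
# Bits needed to pick positions from a list shrinking from 256
# possibilities down to 1, assuming truncated binary codes per pick.
# n.bit_length() - 1 == floor(log2(n)); (n - 1).bit_length() == ceil(log2(n)).
best  = sum(n.bit_length() - 1 for n in range(2, 257))    # always the short code
worst = sum((n - 1).bit_length() for n in range(2, 257))  # always the long code
print(best, worst, best / 8, worst / 8)  # -> 1546 1793 193.25 224.125
```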
In the next article I will explain how I achieved compression for 42 doubles.