## Random data compression – One bit diff encoding

This is a continuation of my previous article, Random data compression – How to use merge sort?.

As I mentioned earlier, it is hard, or (as of now) impossible, to compress the million random digits. From my analysis, the million random digits file contains around 90–110 duplicate (repeated) bytes within every block of 256 bytes. When the input data is very pure or near pure (I mean unique), or when it is highly polluted with noise (I mean full of duplicates), it is easy to compress. Perhaps we can borrow an idea from chemistry: in the periodic table, lighter elements can be combined using nuclear fusion (as in stars), and heavier elements can be split using nuclear fission (as with uranium). In both cases energy is released, which we can relate to compression. However, the periodic table also shows that elements with a highly stable nucleus are hard to break or fuse, e.g. lead.

I can see the same pattern in the million random digits: when uniqueness reaches an extreme high or an extreme low, we can easily compress the data.

The compressor I have developed so far can achieve compression when there are fewer than 42, or more than 156, doubles in every 256 bytes.

Before I go further with my compression theory, I have to write one more article about the one bit difference variable encoder. This is a known algorithm based on probability, or rather on the number of possibilities.

About bijective encoding: you might have observed that it is a one-to-one mapping; however, it fails to address the relations within the elements of the donor domain set or the acceptor domain set.

The one bit variant encoding is based on the number of possibilities, i.e. if we have 4 unique possibilities, then we need 2 bits (00, 01, 10, 11) to represent all possible values. How many bits do we require when we have 5 possibilities, or say 6 possibilities?

In one of my earlier articles I mentioned the usage of insertion sort or bubble sort, stating that with this technique every possible combination can be represented in 1792 bits, or 224 bytes. I need to revise that statement: we need far fewer bits.

Assume that you have a sorted list of 256 all-unique values. When a random input arrives, say 26, you pick it from the initialized list; actually, you pick its position. Since it is the first input and there are 256 possibilities, we need eight bits to represent the position. For the next random input we need to choose among the remaining 255 possibilities; in this case, except for the last position, we again need eight bits. From here on it depends on which position we pick from the remaining elements. We can calculate the required bits for the best case (when the inputs arrive in sorted order) and the worst case (when the inputs arrive in reverse sorted order) as below.

If you refer to the illustration below, you will quickly understand that as the possibilities change, the required bits change while maintaining the uniqueness of the bit stream.

The cells in yellow are one bit diff encoded, based on the first entry in each row.

Using the above calculation, every 256 unique bytes can be represented in 1546–1793 bits (193.25–224.125 bytes).

In the next article I will explain how I achieved compression for 42 doubles.

Keshav Shetty (author) said: Some of the readers raised a query about the meaning of doubles or duplicates.

Actually, it was my mistake to use the wrong word; the correct term should be non-unique value.

When I say unique value, I mean a number (byte) that has not appeared before in the list, whereas a double means a number already present in the list.

e.g. let's assume the input bytes are 8, 10, 15, 0, 6, 12, 15, 9, 11, 1, 12, 15.

From this list, 8, 10, 15, 0, 6, 12, 9, 11, 1 are unique.

Whereas 15, 12, 15 are doubles or duplicates (non-unique).

Note: the first 15 and the first 12 are counted as unique.

A first appearance is never treated as non-unique.

In the article I mentioned that the million random digits file contains around 90–110 doubles in every block of 256 numbers. That means if a block has 90 non-unique numbers, then 256 − 90 = 166 are unique.

From here onwards I will use the term non-unique instead of doubles or duplicates.

Thanks & regards

Keshav K Shetty

Ibrahim said: Please explain:

1545–1793 bits (193.125–224.125 bytes)

Thanks

Ibrahim said: As far as my results with differently formatted blocks of 256 unique values go, I achieved:

Symbols = 256

Result: 1554 bits = 194.25 bytes

Did you mean that one?

Keshav Shetty (author) said: Hi Ibrahim,

The best case happens when the data are in descending order. In this case:

for the first input we need 8 bits = 1 × 8 = 8 bits

for the next 128 inputs we need 7 bits each = 128 × 7 = 896 bits

for the next 64 we need 6 bits each = 64 × 6 = 384 bits, and so on

total 1546 bits (or 193.25 bytes)

In the worst case, i.e. when the data are in ascending order:

for the first 128 inputs we need 8 bits each = 128 × 8 = 1024 bits

for the next 64 we need 7 bits each = 64 × 7 = 448 bits, and so on

total 1793 bits (or 224.125 bytes)

The best way to visualize this is to apply the diff encoding to some input and observe the bits it generates (try it on a smaller data set).

Ibrahim said: Keshav,

Can you please provide the source that generates these bit-diff codes as unsigned 8-bit integers? What I am using right now is your 256Btidff text file, from which I created a multi-dimensional array for each number of symbols.

For example, a function:

char get_bitdff_code(int total_symbols, int index);

so that this function returns the code, and we can pick bits from the result and populate the output.

I am waiting for your kind response.

bye

Keshav Shetty (author) said: Hi Ibrahim,

I have a different approach to building these 256 bit diff encodings, where we initialize a 1-D matrix and go on reducing and reorganizing it for the remaining elements.

Thanks & regards

Keshav K Shetty

Ibrahim said: Here is my alternate email @apitalk.com, also attached to this comment. I will be waiting for your mail.

Thanks