Random data compression – diff encoding with merge sort
Last article didn’t go well with my readers, so I decided to elaborate with example.
Lets assume we have eight random input say 5, 1, 3, 1, 2, 3, 0, 5
In this input 4th, 6th and 8th element are non unique i.e they already appeared at least once.
For each number we need a identifier or a flag to identify it as unique or duplicate. So we will have bit info like 0 0 0 1 0 1 0 1
When 4th element appears which is duplicate of one of the past unique, that should exist within (5, 1, 3)
Similarly 6th element should exist within (5, 1, 3, 2) and same applies to 8th element which should exist within (5, 1, 3, 2, 0)

Lets go back to the algorithm mentioned in my previous article.
1. Read 256 bytes from the input (In the above example we will read 8 input of each 3 bit)
2. Sequential process and mark unique and duplicates, that results into 0 0 0 1 0 1 0 1. We need to store this info to be used at the time of decompression. For above 8 input we need 8bits (for 256 input we need 256bits).
3. For each marked as non unique data, remember the position using one bit diff encoding.
i.e for 4th element which is duplicate of 2nd element, using one bit diff encoding we get “01″ and for for 6th element we get “10″ for 8th element we get “000″. The complete table of one bit diff encoding available here
4. Remove non unique data and rearrange the list by moving forward, at the end empty space fill with missing numbers. Refer above image bottom array is rearranged one.
Now the array contains only unique elements, we can apply reverse merge sort to sort and store these bit info.
So compressed file contains unique or duplicate flags generated in step(2) + one bit diff encoding info of duplicate numbers generated in above step(3) + merge sort bit info.
While decompressing use merge sort bit info to re create unique array and use unique or duplicate flags to note which was duplicate and then use bit dif info refill duplicates.
I hope this example clears many of your doubts.
Next article will be interesting one, which is based on B Tree.










I get it.
I didn’t at first, now I do. Thank you for your clarifications.
Hi Keshav,
step 1 –> ok
step 2 –> ok
step 3 : not ok
4th element is duplicate of 2nd element:
Pos[4] 00 01 10 11 –> we take the 2nb element, that is to say “01″
6th element is duplicate of 3rd element:
Pos[6] 000 001 010 011 10 11 –> we take the third element, that is to say “010″
and you said you take “10″ ; how do you choose the right position ?
thank you
H Peter,
4th Element is duplicate which should be present in unique list of (5,1,3) or bit position(00,01,1) we get 01.
6th Element is duplicate which should be present in unique list of (5,1,3,2) or bit position(00,01,10,11) we get 10.
We should use only unique numbers in finding position and bypass duplicates
Kashav you asked me to try Mark Nelson’s file
Update: Maximum I can squeeze so far in single pass is 265 byes less than original good case after delta coding, and if I not coding prior, then 240 less.
So what I could do is 414976 bytes Mark Nelson file. I did not check multiple passes as it is just a game and not my actual work.
Let me know if you want plain C source, sorry I don’t have C++ one.
Bye
Leave your response!
About
Check my other blog at:
Subscribe
Get this Wordpress newsletter widget
for newsletter software
User menu
Pages
Categories
Blogroll
Recent Posts
Recent Comments
Most Commented
Most Viewed