Home » Data compression, Featured, Headline

Random data compression – diff encoding with merge sort

2 October 2010 by Keshav Shetty 6,001 views 4 Comments

Last article didn’t go well with my readers, so I decided to elaborate with example.

Lets assume we have eight random input say 5, 1, 3, 1, 2, 3, 0, 5

In this input 4th, 6th and 8th element are non unique i.e they already appeared at least once.

For each number we need a identifier or a flag to identify it as unique or duplicate. So we will have bit info like 0 0 0 1 0 1 0 1

When 4th element appears which is duplicate of one of the past unique, that should exist within (5, 1, 3)

Similarly 6th element should exist within (5, 1, 3, 2) and same applies to 8th element which should exist within (5, 1, 3, 2, 0)


Lets go back to the algorithm mentioned in my previous article.

1. Read 256 bytes from the input (In the above example we will read 8 input of each 3 bit)

2. Sequential process and mark unique and duplicates, that results into 0 0 0 1 0 1 0 1. We need to store this info to be used at the time of decompression. For above 8 input we need 8bits (for 256 input we need 256bits).

3. For each marked as non unique data, remember the position using one bit diff encoding.

i.e for 4th element which is duplicate of 2nd element, using one bit diff encoding we get “01” and for for 6th element we get “10” for 8th element we get “000”. The complete table of one bit diff encoding available here

4. Remove non unique data and rearrange the list by moving forward, at the end empty space fill with missing numbers. Refer above image bottom array is rearranged one.

Now the array contains only unique elements, we can apply reverse merge sort to sort and store these bit info.

So compressed file contains unique or duplicate flags generated in step(2) + one bit diff encoding info of duplicate numbers generated in above step(3) + merge sort bit info.

While decompressing use merge sort bit info to re create unique array and use unique or duplicate flags to note which was duplicate and then use bit dif info refill duplicates.

I hope this example clears many of your doubts.

Next article will be interesting one, which is based on B Tree.

1 Star2 Stars3 Stars4 Stars5 Stars (1 votes, average: 5.00 out of 5)


  • Jules Gilbert said:

    I get it.

    I didn’t at first, now I do. Thank you for your clarifications.

  • Peter said:

    Hi Keshav,
    step 1 –> ok
    step 2 –> ok
    step 3 : not ok

    4th element is duplicate of 2nd element:
    Pos[4] 00 01 10 11 –> we take the 2nb element, that is to say “01”
    6th element is duplicate of 3rd element:
    Pos[6] 000 001 010 011 10 11 –> we take the third element, that is to say “010”

    and you said you take “10” ; how do you choose the right position ?

    thank you

  • Keshav Shetty (author) said:

    H Peter,

    4th Element is duplicate which should be present in unique list of (5,1,3) or bit position(00,01,1) we get 01.
    6th Element is duplicate which should be present in unique list of (5,1,3,2) or bit position(00,01,10,11) we get 10.
    We should use only unique numbers in finding position and bypass duplicates

  • Ibrahim said:

    Kashav you asked me to try Mark Nelson’s file
    Update: Maximum I can squeeze so far in single pass is 265 byes less than original good case after delta coding, and if I not coding prior, then 240 less.

    So what I could do is 414976 bytes Mark Nelson file. I did not check multiple passes as it is just a game and not my actual work.

    Let me know if you want plain C source, sorry I don’t have C++ one.


Leave your response!

Add your comment below, or trackback from your own site. You can also subscribe to these comments via RSS.

Be nice. Keep it clean. Stay on topic. No spam.

You can use these tags:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

This is a Gravatar-enabled weblog. To get your own globally-recognized-avatar, please register at Gravatar.