## Random data compression – One bit diff encoding continued

This is continuation of my previous article on Random data compression-One bit dif encoding.

Sorry for the long gap between this and last article. I was too busy.

After my my previous article many had asked how I achieved compression up to 42 bytes duplicates.

Here we go with the algorithm I used.

1. Read 256 bytes from the input.

2. Sequential process and mark unique and duplicates, we need 256 bits or 32 bytes. (Can’t stop this loss)

3. For each marked as non unique data, remember the position using one bit diff encoding. (Details in previous article) Requires much lesser bits than 8bits depending on non unique data appearance. We will consume lesser bits if the non unique data appears at the beginning rather than at the end of the list.

4. Remove non unique data and rearrange the list by moving forward, at the end empty space fill with missing numbers.

5. Apply merge sort and store bit information of sorting data. (Refer how to use merge sort article)

While decompressing

1. Use merge sort data to recreate the unique data list.

2. Apply non unique flags to rearrange the list.

3. Use one bit dif encoding to fill duplicate or non unique numbers.

Using above technique I could compress random data up to 42 bytes. I repeat up to, not always. I mean if the non unique appears at the beginning of the list we can accommodate up to 42 bytes, whereas if the non unique appears at the end of the list we can hardly compress data up to 12~14bytes.

Note: This method is not effective for Million Random Digit from RAND. We need different approach.

Jules Gilbertsaid:Could you put this into C for us, please. It’s awfully hard to grasp.

Petersaid:Could you give a short example ? Explanations are not very clear, sorry to say that; my purpose is to understand your algorithm…

Thank you

Keshav Shetty (author)said:Hi Peter,

Please check my next article at http://blog.adityon.com/2010/10/random-data-compression-diff-encoding-with-merge-sort/

Thanks & regards

Keshav K Shetty

Dave Mullensaid:I’m having a bit of trouble with this step …

2. Sequential process and mark unique and duplicates, we need 256 bits or 32 bytes. (Canβt stop this loss)

At this stage, you have three possibilities for each of the bytes 0 to 255

a) Does not appear in the input stream

b) Appears once in the input stream

c) Appears more than once in the input stream

It would seem like your method does not account for case a) at all, as you seem to be just using a 0/1 flag for unique / duplicate ?

In fact, I’m curious to know how this method works at all … for every duplicate in a 256 byte block, by inference, there must be one missing value from the range.

Keshav Shetty (author)said:I guess you missed step(4) in above steps.i.e

“4. Remove non unique data and rearrange the list by moving forward, at the end empty space fill with missing numbers.”

When encoding you know which are missing number, when decoding you know it since they are at the end of unsorted list π

Please refer the article http://blog.adityon.com/2010/06/random-data-compression-is-it-possible-how-to-use-merge-sort/

You can even try this way

prepare a list of 256 numbers, out of which say 150 is unique and 100 are duplicates of 150. That means 106 missing, which you will fill at end of the unique list. Apply mergesort and store bit info.

While decoding apply reverse merge sort, you will get unique list with last 106 numbers are missing from your input π

Let me know if you need further info.

Ibrahimsaid:Hello Keshav

I succeeded compression upto 47 duplicates, I tried more than 134 different datasets with 209 unique symbols and 47 doubles. What I experimented was

a) range coding the tray allocated for bits marking on duplicates (taking the smallest and largest 0’s blocks within tray) and using bitdiff codes your provided), I can share that version that mostly shrinks down the 256 bits (32 bytes which you called unavoidable loss) down to about 12+ bytes

b) rearrange (reversible shuffle) the message (256 bytes) to bring source + duplicates more closer (costs < 1 byte)

Would like you to comment on it.

Thanks

Keshav Shetty (author)said:Hi Ibraham,

Apply your algorithm on first set of 256 input from Million random digit from Marl Nelson website.

Also let us how you could manage range encoding used to identify duplicates.

Ibrahimsaid:Hello Keshav

First, it is not my algorithm, secondly this popular million digit file’s first 256 bytes chunk has 162 symbols in total I have no solution for :). So far I can encode symbols 1..141 and 209..256.

As far squeezing the 256 bits tray I am marking symbols in tray not duplicates, and using your bit diff codes to later identify symbols. At some cases I dont consider duplicates rather I try rearranging them. I can post code if you want.

Thanks for reply.