Home » Data compression, Headline

Random data compression – Is it possible? How to use merge sort?

3 June 2010 by Keshav Shetty 18,437 views 7 Comments

This is continuation of my previous article on Random data compression possibilities.

Some of the readers asked how reverse merge sort (merge unsort) can be used to represent 256 unique values using 128bytes(best case) or 224.25(worst case).

As illustrated in previous article using merge sort we sort the random input and store bit information of the list from where we picked the smaller number. Actually we are reshuffling the original position and stored bit information represents how a input changed its position from original list to sorted list. Lets take an example of a random array of 16 numbers varying from 0…15.

MergeUnsort.jpg

As shown in above image after sorting, the position are reshuffled and stored bit information exactly represent from where the each element moved.

The encoding code is recursive function, to start applying merge from two list of size 2 to merge the two list of 128. You can refer the mergesort in action here.

The merge sort encoding code given below(C++).

   void Transform::mergeSort(unsigned int beginPosition, unsigned int endPosition) {
                unsigned int midPosition = beginPosition + (endPosition-beginPosition)/2;    
                if (midPosition>beginPosition && midPosition<endPosition) { // Recursive call until list is of size 2
                        mergeSort(beginPosition,midPosition);
                        mergeSort(midPosition+1,endPosition);
                }    
                unsigned int elementsInEachList = (endPosition-beginPosition+1)/2;    
                unsigned char list1[256];    
                unsigned char list2[256];    
                for(unsigned int i=beginPosition;i<beginPosition+elementsInEachList;i++) {
                        list1[i] = transformedByte[i];    
                }    
                for(unsigned int i=midPosition+1;i<midPosition+1+elementsInEachList;i++) {    
                        list2[i] = transformedByte[i];    
                }    
                unsigned int pendingInList1 = elementsInEachList;    
                unsigned int list1Pointer = beginPosition;
                unsigned int pendingInList2 = elementsInEachList;    
                unsigned int list2Pointer = midPosition+1;    
                unsigned int mainArrayPointer = beginPosition;    
                while(pendingInList1>0 && pendingInList2>0) {    
                        if (list1[list1Pointer]<list2[list2Pointer]) {    
                                transformedByte[mainArrayPointer] = list1[list1Pointer];    
                                mainArrayPointer++;
                                list1Pointer++;    
                                pendingInList1--;    
                                bitWriter.write(false); // Store bit information 0 as item picked from first list    
                        } else {    
                                transformedByte[mainArrayPointer] = list2[list2Pointer];    
                                mainArrayPointer++;    
                                list2Pointer++;    
                                pendingInList2--;    
                                bitWriter.write(true); // Store bit information 1 as item picked from second list    
                        }    
           }    
           if (pendingInList1>0) {    
                        while(pendingInList1>0) {    
                                transformedByte[mainArrayPointer] = list1[list1Pointer];    
                                mainArrayPointer++;    
                                list1Pointer++;    
                                pendingInList1--;    
                        }    
                } else {    
                        while(pendingInList2>0) {    
                                transformedByte[mainArrayPointer] = list2[list2Pointer];
                                mainArrayPointer++;    
                                list2Pointer++;    
                                pendingInList2--;    
                        }    
           }
        }

As you observed above when items in one list gets over (either first list or second list) for remaining element we don’t require sorting bit information.

Please note in above code random input is stored in transformedByte variable & initilized in separate method for each 256byte input. The object bitWriter is a separate class of type BitWriter(Utility class) which is initialized in constructor. BitWriter will accumulates bit information and writes to output file when accumulated bits crosses 1 byte.

The merge sort decoding code given below(C++).

 
  void Transform::reverseMergeSort(unsigned int beginPosition, unsigned int endPosition) {     
        unsigned int midPosition = beginPosition + (endPosition-beginPosition)/2;    
        if (midPosition>beginPosition && midPosition<endPosition) { // Recursive call until list is of size 2
                mergeSort(beginPosition,midPosition);
                mergeSort(midPosition+1,endPosition);
        }    
        unsigned int elementsInEachList = (endPosition-beginPosition+1)/2;    
        unsigned char list1[256];    
        unsigned char list2[256];    
        for(unsigned int i=beginPosition;i<beginPosition+elementsInEachList;i++) {
                list1[i] = inputPositionArray[i];    
        }    
        for(unsigned int i=midPosition+1;i<midPosition+1+elementsInEachList;i++) {    
                list2[i] = inputPositionArray[i];       
        }    
        unsigned int pendingInList1 = elementsInEachList;    
        unsigned int list1Pointer = beginPosition;
        unsigned int pendingInList2 = elementsInEachList;    
        unsigned int list2Pointer = midPosition+1;    
        unsigned int mainArrayPointer = beginPosition;    
        while(pendingInList1>0 && pendingInList2>0) {
                bool nextBit = bitReader.genNextBit(); // Read bit information from sorted encoding  
                if (nextBit==false) {    
                        inputPositionArray[mainArrayPointer] = list1[list1Pointer];    
                        mainArrayPointer++;
                        list1Pointer++;    
                        pendingInList1--;
                } else {    
                        inputPositionArray[mainArrayPointer] = list2[list2Pointer];    
                        mainArrayPointer++;    
                        list2Pointer++;    
                        pendingInList2--;
                }       
        }    
        if (pendingInList1>0) {    
                while(pendingInList1>0) {    
                        inputPositionArray[mainArrayPointer] = list1[list1Pointer];    
                        mainArrayPointer++;    
                        list1Pointer++;    
                        pendingInList1--;    
                }    
        } else {    
                while(pendingInList2>0) {    
                        inputPositionArray[mainArrayPointer] = list2[list2Pointer];
                        mainArrayPointer++;    
                        list2Pointer++;    
                        pendingInList2--;    
                }    
        }
}

Before calling reverseMergeSort inputPositionArray is initialized with 0,1,2.. as default initial position. The object bitReader is a separate class of type BitReader(Utility class) which is initialized in constructor and provides API to read a bit at a time(not byte), which internally reads a byte and gives a bit at a time. These utility classes are developed by me exclusively for above purpose.

After exiting reverseMergeSort the inputPositionArray does not represent original input, instead it represent the positions of input in original data. After exiting we have to go thru one more loop to construct original input as below.

   for(int i=0; i<256;i++) {   
       inputData[inputPositionArray[i]] = i;
   }

Now inputData contains original data.

I will inform in next article how I compressed random data in which we have maximum 42 duplicates(non unique) for every 256 input byte.

1 Star2 Stars3 Stars4 Stars5 Stars (3 votes, average: 4.67 out of 5)
Loading...

7 Comments »

  • Ibrahim said:

    Please post some about your tests on AMilliondigit bin file by Matt Mahoney and Mark Nelson. They said even single byte cannot be reduced.

    bye

  • Keshav Shetty (author) said:

    Hi Ibrahim,

    Thanks for visiting my blog, I guess you have not read other article I mentioned. Have a look at http://blog.adityon.com/category/data-compression

    Million random digit is not yet compressible. The reason is million random digit contains approximately 90-110 bytes of duplicates(doubles) for every block of 256bytes, where as my compression program works up to 42 duplicates within every block of 256bytes.
    I am sharing my finding so that others may comment and discuss their ideas, so that we can reach from 42 to 110 doubles.

    Thanks & regards
    Keshav K Shetty

  • Ernst Berg said:

    I enjoy that folks are thinking on this. I am one of those who have been inspired to try this “Million-Digit challenge and I too have good ideas.
    I have something to chat about this year.

    My take on compressing the Milliondigit.bin file is that what people have been saying is true; there isn’t any exploit as they knew it. So I decided to explore expressing the file in different ways with systems I constructed and I have worked at it at a hobbyist’s pace for many years. This has brought me to “one to one bijective transforms” this year.

    This last January I discovered that all binary strings can be cycled through all other binary strings bijectively of that length on a one to one level.

    Example: Given some 8 bit long string there are 255 strings it will cycle through and will become that original object again.

    Now the processing time goes up the longer the string so length of Million-Digit file is 3321928 bits long so the cycle is 2^3321928 strings or 9.363453492e+999999 strings if I have my math straight in my head today.. LOL I realize I posted some silly numbers the other day.. LOL I am a hobbyist so I guess I can be forgiven. Needless to say it takes an extremely long time per step/transform on a 2ghz core and there are a huge number of strings to step through.
    Cycling through all the strings of milliondigit.bin length on a 2 GHZ core takes forever..

    How would it compress?

    I am thinking the distance from our source string to some pattern string we can assume is a number we can represent with smaller data.
    That is how I see this bijective transform “compressing.”
    How can we lose if any string can be any other string and back with a simple index?

    I hope I am being clear.. I make silly mistakes too often LOL.

    I am at the point now that I could benefit from a mentor/friend..
    That and access to a fast processor. Do they make 100ghz processors?

    Thanks for the forum..

    Ernst_Berg@sbcglobal.net

  • Adityon » Blog Archive » Random data compression – One bit diff encoding continued - Blog by Keshav Shetty said:

    […] 5. Apply merge sort and store bit information of sorting data. (Refer how to use merge sort article) […]

  • Ibrahim said:

    Why is this sort necessary when we already have marked the duplicates? Sorry but I did not understand the purpose of thus sort, as far I see the input is just shuffled, I have confusion may be is that you are finding the missing values after this sort?

    bye

    Ibrahim

  • Keshav Shetty (author) said:

    Hi Ibrahim,
    with a bit we just know or marked duplicates, but we don’t have information of actual shuffling of non unique numbers.
    In order re generate original shuffling order we need to use sorting/unsorting technique.
    Please check the code sample I posted in my merge sort article.

Leave your response!

Add your comment below, or trackback from your own site. You can also subscribe to these comments via RSS.

Be nice. Keep it clean. Stay on topic. No spam.

You can use these tags:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

This is a Gravatar-enabled weblog. To get your own globally-recognized-avatar, please register at Gravatar.