Home » Data compression, Headline

Random data compression – Is it possible? How to use merge sort?

This is continuation of my previous article on Random data compression possibilities.

Some of the readers asked how reverse merge sort (merge unsort) can be used to represent 256 unique values using 128bytes(best case) or 224.25(worst case).

As illustrated in previous article using merge sort we sort the random input and store bit information of the list from where we picked the smaller number. Actually we are reshuffling the original position and stored bit information represents how a input changed its position from original list to sorted list. Lets take an example of a random array of 16 numbers varying from 0…15.

MergeUnsort.jpg

As shown in above image after sorting, the position are reshuffled and stored bit information exactly represent from where the each element moved.

The encoding code is recursive function, to start applying merge from two list of size 2 to merge the two list of 128. You can refer the mergesort in action here.

The merge sort encoding code given below(C++).

   void Transform::mergeSort(unsigned int beginPosition, unsigned int endPosition) {
                unsigned int midPosition = beginPosition + (endPosition-beginPosition)/2;
                if (midPosition>beginPosition && midPosition<endPosition) { // Recursive call until list is of size 2
                        mergeSort(beginPosition,midPosition);
                        mergeSort(midPosition+1,endPosition);
                }
                unsigned int elementsInEachList = (endPosition-beginPosition+1)/2;
                unsigned char list1[256];
                unsigned char list2[256];
                for(unsigned int i=beginPosition;i<beginPosition+elementsInEachList;i++) {
                        list1[i] = transformedByte[i];
                }
                for(unsigned int i=midPosition+1;i<midPosition+1+elementsInEachList;i++) {
                        list2[i] = transformedByte[i];
                }
                unsigned int pendingInList1 = elementsInEachList;
                unsigned int list1Pointer = beginPosition;
                unsigned int pendingInList2 = elementsInEachList;
                unsigned int list2Pointer = midPosition+1;
                unsigned int mainArrayPointer = beginPosition;
                while(pendingInList1>0 && pendingInList2>0) {
                        if (list1[list1Pointer]<list2[list2Pointer]) {
                                transformedByte[mainArrayPointer] = list1[list1Pointer];
                                mainArrayPointer++;
                                list1Pointer++;
                                pendingInList1--;
                                bitWriter.write(false); // Store bit information 0 as item picked from first list
                        } else {
                                transformedByte[mainArrayPointer] = list2[list2Pointer];
                                mainArrayPointer++;
                                list2Pointer++;
                                pendingInList2--;
                                bitWriter.write(true); // Store bit information 1 as item picked from second list
                        }
           }
           if (pendingInList1>0) {
                        while(pendingInList1>0) {
                                transformedByte[mainArrayPointer] = list1[list1Pointer];
                                mainArrayPointer++;
                                list1Pointer++;
                                pendingInList1--;
                        }
                } else {
                        while(pendingInList2>0) {
                                transformedByte[mainArrayPointer] = list2[list2Pointer];
                                mainArrayPointer++;
                                list2Pointer++;
                                pendingInList2--;
                        }
           }
        }

As you observed above when items in one list gets over (either first list or second list) for remaining element we don’t require sorting bit information.

Please note in above code random input is stored in transformedByte variable & initilized in separate method for each 256byte input. The object bitWriter is a separate class of type BitWriter(Utility class) which is initialized in constructor. BitWriter will accumulates bit information and writes to output file when accumulated bits crosses 1 byte.

The merge sort decoding code given below(C++).


  void Transform::reverseMergeSort(unsigned int beginPosition, unsigned int endPosition) {
        unsigned int midPosition = beginPosition + (endPosition-beginPosition)/2;
        if (midPosition>beginPosition && midPosition<endPosition) { // Recursive call until list is of size 2
                mergeSort(beginPosition,midPosition);
                mergeSort(midPosition+1,endPosition);
        }
        unsigned int elementsInEachList = (endPosition-beginPosition+1)/2;
        unsigned char list1[256];
        unsigned char list2[256];
        for(unsigned int i=beginPosition;i<beginPosition+elementsInEachList;i++) {
                list1[i] = inputPositionArray[i];
        }
        for(unsigned int i=midPosition+1;i<midPosition+1+elementsInEachList;i++) {
                list2[i] = inputPositionArray[i];
        }
        unsigned int pendingInList1 = elementsInEachList;
        unsigned int list1Pointer = beginPosition;
        unsigned int pendingInList2 = elementsInEachList;
        unsigned int list2Pointer = midPosition+1;
        unsigned int mainArrayPointer = beginPosition;
        while(pendingInList1>0 && pendingInList2>0) {
                bool nextBit = bitReader.genNextBit(); // Read bit information from sorted encoding
                if (nextBit==false) {
                        inputPositionArray[mainArrayPointer] = list1[list1Pointer];
                        mainArrayPointer++;
                        list1Pointer++;
                        pendingInList1--;
                } else {
                        inputPositionArray[mainArrayPointer] = list2[list2Pointer];
                        mainArrayPointer++;
                        list2Pointer++;
                        pendingInList2--;
                }
        }
        if (pendingInList1>0) {
                while(pendingInList1>0) {
                        inputPositionArray[mainArrayPointer] = list1[list1Pointer];
                        mainArrayPointer++;
                        list1Pointer++;
                        pendingInList1--;
                }
        } else {
                while(pendingInList2>0) {
                        inputPositionArray[mainArrayPointer] = list2[list2Pointer];
                        mainArrayPointer++;
                        list2Pointer++;
                        pendingInList2--;
                }
        }
}

Before calling reverseMergeSort inputPositionArray is initialized with 0,1,2.. as default initial position. The object bitReader is a separate class of type BitReader(Utility class) which is initialized in constructor and provides API to read a bit at a time(not byte), which internally reads a byte and gives a bit at a time. These utility classes are developed by me exclusively for above purpose.

After exiting reverseMergeSort the inputPositionArray does not represent original input, instead it represent the positions of input in original data. After exiting we have to go thru one more loop to construct original input as below.

   for(int i=0; i<256;i++) {
       inputData[inputPositionArray[i]] = i;
   }

Now inputData contains original data.

I will inform in next article how I compressed random data in which we have maximum 42 duplicates(non unique) for every 256 input byte.

1 Star2 Stars3 Stars4 Stars5 Stars (2 votes, average: 4.50 out of 5)
Loading ... Loading ...

3 Comments »

  1. Please post some about your tests on AMilliondigit bin file by Matt Mahoney and Mark Nelson. They said even single byte cannot be reduced.

    bye

  2. Hi Ibrahim,

    Thanks for visiting my blog, I guess you have not read other article I mentioned. Have a look at http://blog.adityon.com/category/data-compression

    Million random digit is not yet compressible. The reason is million random digit contains approximately 90-110 bytes of duplicates(doubles) for every block of 256bytes, where as my compression program works up to 42 duplicates within every block of 256bytes.
    I am sharing my finding so that others may comment and discuss their ideas, so that we can reach from 42 to 110 doubles.

    Thanks & regards
    Keshav K Shetty

  3. I enjoy that folks are thinking on this. I am one of those who have been inspired to try this “Million-Digit challenge and I too have good ideas.
    I have something to chat about this year.

    My take on compressing the Milliondigit.bin file is that what people have been saying is true; there isn’t any exploit as they knew it. So I decided to explore expressing the file in different ways with systems I constructed and I have worked at it at a hobbyist’s pace for many years. This has brought me to “one to one bijective transforms” this year.

    This last January I discovered that all binary strings can be cycled through all other binary strings bijectively of that length on a one to one level.

    Example: Given some 8 bit long string there are 255 strings it will cycle through and will become that original object again.

    Now the processing time goes up the longer the string so length of Million-Digit file is 3321928 bits long so the cycle is 2^3321928 strings or 9.363453492e+999999 strings if I have my math straight in my head today.. LOL I realize I posted some silly numbers the other day.. LOL I am a hobbyist so I guess I can be forgiven. Needless to say it takes an extremely long time per step/transform on a 2ghz core and there are a huge number of strings to step through.
    Cycling through all the strings of milliondigit.bin length on a 2 GHZ core takes forever..

    How would it compress?

    I am thinking the distance from our source string to some pattern string we can assume is a number we can represent with smaller data.
    That is how I see this bijective transform “compressing.”
    How can we lose if any string can be any other string and back with a simple index?

    I hope I am being clear.. I make silly mistakes too often LOL.

    I am at the point now that I could benefit from a mentor/friend..
    That and access to a fast processor. Do they make 100ghz processors?

    Thanks for the forum..

    Ernst_Berg@sbcglobal.net

Have your say!

Add your comment below, or trackback from your own site. You can also subscribe to these comments via RSS.

Be nice. Keep it clean. Stay on topic. No spam.

You can use these tags:
<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>