Random data compression – Is it possible? How to use merge sort?
Print This Post
Email This Post
This is continuation of my previous article on Random data compression possibilities.
Some of the readers asked how reverse merge sort (merge unsort) can be used to represent 256 unique values using 128bytes(best case) or 224.25(worst case).
As illustrated in previous article using merge sort we sort the random input and store bit information of the list from where we picked the smaller number. Actually we are reshuffling the original position and stored bit information represents how a input changed its position from original list to sorted list. Lets take an example of a random array of 16 numbers varying from 0…15.

As shown in above image after sorting, the position are reshuffled and stored bit information exactly represent from where the each element moved.
The encoding code is recursive function, to start applying merge from two list of size 2 to merge the two list of 128. You can refer the mergesort in action here.
The merge sort encoding code given below(C++).
void Transform::mergeSort(unsigned int beginPosition, unsigned int endPosition) {
unsigned int midPosition = beginPosition + (endPosition-beginPosition)/2;
if (midPosition>beginPosition && midPosition<endPosition) { // Recursive call until list is of size 2
mergeSort(beginPosition,midPosition);
mergeSort(midPosition+1,endPosition);
}
unsigned int elementsInEachList = (endPosition-beginPosition+1)/2;
unsigned char list1[256];
unsigned char list2[256];
for(unsigned int i=beginPosition;i<beginPosition+elementsInEachList;i++) {
list1[i] = transformedByte[i];
}
for(unsigned int i=midPosition+1;i<midPosition+1+elementsInEachList;i++) {
list2[i] = transformedByte[i];
}
unsigned int pendingInList1 = elementsInEachList;
unsigned int list1Pointer = beginPosition;
unsigned int pendingInList2 = elementsInEachList;
unsigned int list2Pointer = midPosition+1;
unsigned int mainArrayPointer = beginPosition;
while(pendingInList1>0 && pendingInList2>0) {
if (list1[list1Pointer]<list2[list2Pointer]) {
transformedByte[mainArrayPointer] = list1[list1Pointer];
mainArrayPointer++;
list1Pointer++;
pendingInList1--;
bitWriter.write(false); // Store bit information 0 as item picked from first list
} else {
transformedByte[mainArrayPointer] = list2[list2Pointer];
mainArrayPointer++;
list2Pointer++;
pendingInList2--;
bitWriter.write(true); // Store bit information 1 as item picked from second list
}
}
if (pendingInList1>0) {
while(pendingInList1>0) {
transformedByte[mainArrayPointer] = list1[list1Pointer];
mainArrayPointer++;
list1Pointer++;
pendingInList1--;
}
} else {
while(pendingInList2>0) {
transformedByte[mainArrayPointer] = list2[list2Pointer];
mainArrayPointer++;
list2Pointer++;
pendingInList2--;
}
}
}
As you observed above when items in one list gets over (either first list or second list) for remaining element we don’t require sorting bit information.
Please note in above code random input is stored in transformedByte variable & initilized in separate method for each 256byte input. The object bitWriter is a separate class of type BitWriter(Utility class) which is initialized in constructor. BitWriter will accumulates bit information and writes to output file when accumulated bits crosses 1 byte.
The merge sort decoding code given below(C++).
void Transform::reverseMergeSort(unsigned int beginPosition, unsigned int endPosition) {
unsigned int midPosition = beginPosition + (endPosition-beginPosition)/2;
if (midPosition>beginPosition && midPosition<endPosition) { // Recursive call until list is of size 2
mergeSort(beginPosition,midPosition);
mergeSort(midPosition+1,endPosition);
}
unsigned int elementsInEachList = (endPosition-beginPosition+1)/2;
unsigned char list1[256];
unsigned char list2[256];
for(unsigned int i=beginPosition;i<beginPosition+elementsInEachList;i++) {
list1[i] = inputPositionArray[i];
}
for(unsigned int i=midPosition+1;i<midPosition+1+elementsInEachList;i++) {
list2[i] = inputPositionArray[i];
}
unsigned int pendingInList1 = elementsInEachList;
unsigned int list1Pointer = beginPosition;
unsigned int pendingInList2 = elementsInEachList;
unsigned int list2Pointer = midPosition+1;
unsigned int mainArrayPointer = beginPosition;
while(pendingInList1>0 && pendingInList2>0) {
bool nextBit = bitReader.genNextBit(); // Read bit information from sorted encoding
if (nextBit==false) {
inputPositionArray[mainArrayPointer] = list1[list1Pointer];
mainArrayPointer++;
list1Pointer++;
pendingInList1--;
} else {
inputPositionArray[mainArrayPointer] = list2[list2Pointer];
mainArrayPointer++;
list2Pointer++;
pendingInList2--;
}
}
if (pendingInList1>0) {
while(pendingInList1>0) {
inputPositionArray[mainArrayPointer] = list1[list1Pointer];
mainArrayPointer++;
list1Pointer++;
pendingInList1--;
}
} else {
while(pendingInList2>0) {
inputPositionArray[mainArrayPointer] = list2[list2Pointer];
mainArrayPointer++;
list2Pointer++;
pendingInList2--;
}
}
}
Before calling reverseMergeSort inputPositionArray is initialized with 0,1,2.. as default initial position. The object bitReader is a separate class of type BitReader(Utility class) which is initialized in constructor and provides API to read a bit at a time(not byte), which internally reads a byte and gives a bit at a time. These utility classes are developed by me exclusively for above purpose.
After exiting reverseMergeSort the inputPositionArray does not represent original input, instead it represent the positions of input in original data. After exiting we have to go thru one more loop to construct original input as below.
for(int i=0; i<256;i++) {
inputData[inputPositionArray[i]] = i;
}
Now inputData contains original data.
I will inform in next article how I compressed random data in which we have maximum 42 duplicates(non unique) for every 256 input byte.

(2 votes, average: 4.50 out of 5)
(4.00 out of 5)
Please post some about your tests on AMilliondigit bin file by Matt Mahoney and Mark Nelson. They said even single byte cannot be reduced.
bye
7 June 2010 at 12:51 pm