## Random data compression – Lowerband & upper band filtering

Wish you all happy new year 2011.

in this article I will explain how a set of unique elements can be represented with less bits using filtering elements into two sets of upper and lower band.

Few users suggested to change the title, because title says “Random data” where as article describes about unique elements. Let me clarify that million random digit contains average 90-110 bytes of duplicate or non unique for every 256 bytes. If we could represent unique data set <=202 byte(worst case), we can use remaining 54 byte to represent duplicates. Refer previous article here |

Coming to filtering technique, In every set of size 2^n, we will have exactly 2 set where half elements belongs to upper part and half elements belongs to lower part. e.g: In a set 256 unique values we can filter first or upper array containing element between 0 to 127 and lower array 128 to 255. By parsing and remembering using a flag(false for upper and true for lower set) weather element moved to upper or lower set we can locate the original position. Apply this method for newly created set recursively.

Lets take the same the input used in previous article of size 16, i.e: (7, 12, 9, 3, 13, 6, 8, 15, 1, 4, 2, 11, 0, 14, 5, 10)

Element 0-7 belongs to upper set and 8-15 belongs to lower set. Upper set or lower set will have equal number of elements of size (input set size)/2. When a element moves to upper set we can remember position as false and if element moves to lower set we can remember position as true. Once either upper or lower set size touches the size of the element in new set (i.e actually half of the original input size) we can stop parsing further.

In the above example – Total number of element=16, Size of new subset = 8, Elements in first/upper set is 0-7 and elements in lower set 8-15. After applying first time filter we get following output.

This method is similar to reverse merge sort, difference is here we use top to bottom approach, whereas merge sort uses bottom to up.

Using this method we can represent 256 unique elements and save 128 byte(best case) to 31.75 byte (worst case) as illustrated below.

Worst case happens when last element of each set belongs to other set and vice versa.

As I said beginning this method not sufficient to address million random digit, where I need to save at least 54 byte for every 256 byte.

In the next article I will write about using unidirectional graph how we can represent set of unique elements. Till then please pass your comments.

Ibrahimsaid:1. You have overhead of 64 bytes already, as for each unit of 8bit you are writing 2 bits.

2. Why not try it on a dataset that doesn’t have 256 unique values? As this kind of distribution brining parts at nibbles level might increase entropy in one set.

Your comments on this are requested.

Bye

Ibrahimsaid:Correction for my last message:

dataset was (7, 12, 9, 3, 13, 6, 8, 15, 1, 4, 2, 11, 0, 14, 5, 10), which is clearly values at nibbles level, and if you intentionally put them for demonstration purposes then overhead is indeed 32 bytes = 256 bits.

Keshav Shetty (author)said:Hi Ibrahim,

Yes, The sample I provided is for demonstration purpose.

In order to use these technique we need large set of data where we can reduce/distribute these overheads more efficiently.

Thanks & regards

Keshav K Shetty