regards

I have gone through your articles. I had a similar idea, but not as clear or as in-depth as yours. Using your statically generated symbol table I have reached the point where:

Total Symbols : 162

Hard Symbols : 90 // symbols that are not duplicated

Max position of duplicate : 227

Max duplicates of single symbol : 4

Out Length bits : 1610 bits without encoding duplicates

Out Length Bytes : 201.25 bytes

————————————–

Task : 94 bytes to be encoded in 63

Now the task is to encode only the 94 duplicates that are definitely known. You mentioned: “If we could represent the unique data set in <=202 bytes (worst case), we can use the remaining 54 bytes to represent duplicates.” Please explain: do you propose Huffman codes, or the statically generated codes that you posted? And remember, the total number of duplicated symbols is 72: the total is 162 symbols, and 90 are hard symbols that don’t repeat. As per your bitdiffmap, here are the proposed codes for the 72 symbols:

Pos[72] 0000000 0000001 0000010 0000011 0000100 0000101 0000110 0000111 0001000 0001001 0001010 0001011 0001100 0001101 0001110 0001111 001000 001001 001010 001011 001100 001101 001110 001111 010000 010001 010010 010011 010100 010101 010110 010111 011000 011001 011010 011011 011100 011101 011110 011111 100000 100001 100010 100011 100100 100101 100110 100111 101000 101001 101010 101011 101100 101101 101110 101111 110000 110001 110010 110011 110100 110101 110110 110111 111000 111001 111010 111011 111100 111101 111110 111111
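As a quick sanity check (my own sketch, not from the post): the 72 codes above are 16 seven-bit codes under the 000 prefix plus 56 six-bit codes, which form a complete prefix code — the Kraft sum is exactly 1, so no codeword space is wasted:

```python
# Sketch: verify the proposed 72-code table is a complete prefix code.
# The two ranges below regenerate the code list from the comment above.
codes = [format(i, "07b") for i in range(16)]       # 0000000 .. 0001111
codes += [format(i, "06b") for i in range(8, 64)]   # 001000 .. 111111

assert len(codes) == 72
# Prefix-free: no code is a prefix of any other code.
assert all(not b.startswith(a) for a in codes for b in codes if a != b)
# Kraft-McMillan sum of 2^-len over all codes; 1.0 means the code is complete.
kraft = sum(2.0 ** -len(c) for c in codes)
print(len(codes), kraft)   # 72 1.0
```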

regards

About storing positions, I mentioned:

“Read 256 block of data, mark the duplicates, remember these relative to current position of the number (so we don’t require 8bits for position)”

It is the same thing you are referring to, i.e. if the third item is a duplicate, then that value is present among the earlier two numbers, so we need just 1 bit to store the position.
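A minimal sketch of that back-reference idea (the function name is mine, for illustration): a duplicate that points at one of k previously seen values costs ceil(log2(k)) bits rather than a full 8-bit value, so the third item, choosing between two earlier values, costs exactly 1 bit:

```python
import math

def bits_for_backref(candidates):
    """Bits needed to index one of `candidates` previously seen values (candidates >= 2)."""
    return math.ceil(math.log2(candidates))

print(bits_for_backref(2))   # 1  (third item: two earlier candidates)
print(bits_for_backref(4))   # 2
print(bits_for_backref(5))   # 3
```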

About the implementation: I already have a working copy in which a maximum of 42 duplicates can be compressed. I want to fine-tune it further and announce it on this blog next week.

Thanks & regards

Keshav K Shetty

Regarding: “How to use above theory when all numbers are not unique?”.

There are some more effective solutions for storing information about unique/non-unique number positions.

You can use combinatorics (I use the combinadic; Wikipedia has a nice article about that).

Let’s say there are 100 non-unique and 156 unique values: ranking the set of positions takes fewer bits than a raw 256-bit mask — but the count of unique numbers must also be stored, which takes 7 or 8 bits.
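A rough sketch of the combinadic idea (my illustration of the combinatorial number system the comment refers to): the set of k non-unique positions out of 256 can be ranked as a single integer below C(256, k), so it fits in ceil(log2(C(256, k))) bits instead of 256:

```python
from math import comb, ceil, log2

def combinadic_rank(positions):
    """Rank a sorted k-subset of positions in the combinatorial number system.

    The i-th smallest position p (0-based i) contributes C(p, i + 1);
    the resulting rank lies in [0, C(n, k)) for a k-subset of range(n).
    """
    return sum(comb(p, i + 1) for i, p in enumerate(sorted(positions)))

# 100 non-unique positions out of 256: the rank always fits in
# ceil(log2(C(256, 100))) bits, fewer than the raw 256-bit mask.
bits = ceil(log2(comb(256, 100)))
print(bits)
```

The count k itself (7 or 8 bits, as noted above) must be stored alongside the rank so the decoder knows which C(256, k) to unrank against.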

Now, the way to store information about the repeated values: you use positions into the unique values to represent each non-unique number.

Let’s say we have sequence (0-255):

4 6 10 32 6 85 10 2 …

we know which numbers are unique; to represent this I’ll use a bit sequence marking the unique ones:

1 1 1 1 0 1 0 1 …

All zeroes (not unique) can be stored using far fewer bits than the original 8. In this example the second 6 can be stored using 2 bits, because only 4 unique values have been seen so far, and so on…
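The worked example above can be sketched as follows (my code, with variable names of my own choosing): walk the sequence, emit a 1/0 flag per value, and cost each duplicate at ceil(log2(#uniques seen so far)) bits for its index into the unique list:

```python
import math

seq = [4, 6, 10, 32, 6, 85, 10, 2]   # the example sequence from the comment
seen = []                            # unique values, in order of first appearance
flags, dup_bits = [], []
for v in seq:
    if v in seen:
        flags.append(0)
        # index into `seen` costs ceil(log2(len(seen))) bits
        dup_bits.append(math.ceil(math.log2(len(seen))))
    else:
        flags.append(1)
        seen.append(v)

print(flags)     # [1, 1, 1, 1, 0, 1, 0, 1]
print(dup_bits)  # [2, 3]  -- the second 6 costs 2 bits (4 uniques so far)
```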

The one thing I did not use is sorting for data storage, but this field is very interesting to me.

One piece of advice: try to implement (program) this before going crazy about good results on paper 😉

Best regards!

Raimonds
