
Post a Comment On: cbloom rants

"09-02-12 - Encoding Values in Bytes Part 3"

8 Comments

Blogger Fabian 'ryg' Giesen said...

"furthermore in practice you probably don't want to eat the cost of the divide and modulo in the encoder"
For fixed divisors (i.e. "mod" isn't a function parameter, or when inlining) and with speed optimization enabled, all compilers you're likely to be using will synthesize div/mod using integer multiplies. It's only with variable divisors that they actually generate divide instructions these days.
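A minimal sketch of the point being made, with a made-up constant divisor (BASE is mine, just for illustration): because the divisor is a compile-time constant, gcc/clang/MSVC at -O2 replace the % and / below with a multiply-by-reciprocal plus shifts rather than an actual divide instruction.

```c
#include <stdint.h>

enum { BASE = 200 };  /* example fixed divisor, not from the post */

/* Split a value into base-200 "digit" bytes, least significant first.
   The compiler strength-reduces both the mod and the div, typically
   sharing one multiply between them. */
static int encode_base(uint32_t v, uint8_t *out)
{
    int n = 0;
    do {
        out[n++] = (uint8_t)(v % BASE); /* compiled as mul + shifts */
        v /= BASE;                      /* reuses the same multiply */
    } while (v != 0);
    return n; /* number of digit bytes written */
}
```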

September 2, 2012 at 4:20 PM

Blogger Turtle said...

Assuming the usage context is compression using byte based coding then Google Snappy has an interesting approach for literal lengths.

They have 6 bits in the byte for the length. Values up to 60 are stored as length-1. For values above that, the code stores the number of bytes that follow, 1 to 4 (for codes 60-63). Those bytes form a little-endian length.

Though oddly that little-endian number stores length-1; I'd have thought length-60 would have been better, so the single extra byte would encode lengths of 61-316 rather than just 1-256. It doesn't make much difference after the first extra byte.
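A sketch of the literal-length scheme described above (the function name and layout details are mine, not taken from the Snappy source): lengths 1..60 go in the 6-bit field as length-1, and longer lengths put a byte count 1..4 in codes 60..63, followed by length-1 in little-endian order. The low 2 bits of the tag byte hold the element type, with 00 assumed for a literal.

```c
#include <stdint.h>

/* Encode a Snappy-style literal-length tag; returns bytes written. */
static int put_literal_len(uint32_t len, uint8_t *out)
{
    uint32_t v = len - 1;
    if (v < 60) {                        /* short form: tag byte only */
        out[0] = (uint8_t)(v << 2);      /* low 2 bits = literal tag (00) */
        return 1;
    }
    int extra = 1;                       /* bytes needed for v, 1..4 */
    while (extra < 4 && (v >> (8 * extra)))
        extra++;
    out[0] = (uint8_t)((59 + extra) << 2);   /* codes 60..63 */
    for (int i = 0; i < extra; i++)
        out[1 + i] = (uint8_t)(v >> (8 * i)); /* little-endian */
    return 1 + extra;
}
```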

For literal runs the fixed break @ 60 makes sense - longer runs of literals are uncompressed data anyway.

Only makes sense for literals though. For match lengths or offsets (assuming an LZ77 type encoder), improving the range while still keeping byte based coding is interesting.

September 3, 2012 at 12:52 AM

Blogger cbloom said...

@Turtle - 6 bits for the initial literal runlen sounds like bad code design. You really only want 3-4 for the initial value.

Maybe I'll add two notes to the original post.

September 3, 2012 at 9:04 AM

Blogger won3d said...

There's LZ4, which seems to be generally better than Snappy.

http://fastcompression.blogspot.com/2011/05/lz4-explained.html

One IMO important kind of byte-oriented variable-length encoding is UTF-8, if only because it is ubiquitous. There are many inefficiencies, but they are generally reasonable.

I actually use it as a variable-length encoding for my JS compressed mesh format, since it is "free" in a browser. Being byte-oriented like this is also nice if the data might be further compressed by a text compressor like GZIP (which is also "free" in a browser).

Great blog series.

September 4, 2012 at 1:18 PM

Blogger cbloom said...

Reminding myself : UTF-8 is basically the unary-flag-bit encoding (mod 128) but with some extra redundancy added to make payload bytes distinguishable from first bytes.
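A minimal encoder sketch of that scheme: the first byte carries the sequence length in unary (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx) and every payload byte is 10xxxxxx, which is the redundancy that makes continuation bytes distinguishable from first bytes. This covers codepoints up to U+10FFFF and does no validity or overlong checking.

```c
#include <stdint.h>

/* Write codepoint cp as UTF-8; returns number of bytes written (1..4). */
static int utf8_put(uint32_t cp, uint8_t *out)
{
    if (cp < 0x80) { out[0] = (uint8_t)cp; return 1; }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));          /* 110xxxxx */
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));        /* 10xxxxxx */
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));         /* 1110xxxx */
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));             /* 11110xxx */
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}
```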

September 4, 2012 at 10:51 PM

Blogger won3d said...

So I think Protocol Buffer varints or MIDI encoding is more similar to MOD 128, in that both use the sign bit to indicate whether more bytes are coming. Proto is in LSB order and MIDI is in MSB order. (In general, you don't really have endian issues, since you only ever read one byte at a time, which is another nice property of byte encodings as a serialization format.)
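A sketch of the Protocol-Buffer-style (LSB-first) variant: the top bit of each byte flags "more bytes follow," and the low 7 bits carry payload. Function names are mine.

```c
#include <stdint.h>

/* Encode v as an LSB-first varint; returns bytes written. */
static int varint_put(uint32_t v, uint8_t *out)
{
    int n = 0;
    while (v >= 0x80) {
        out[n++] = (uint8_t)(v | 0x80); /* continuation bit set */
        v >>= 7;
    }
    out[n++] = (uint8_t)v;              /* final byte, top bit clear */
    return n;
}

/* Decode an LSB-first varint; stores the byte count in *len. */
static uint32_t varint_get(const uint8_t *in, int *len)
{
    uint32_t v = 0;
    int shift = 0, n = 0;
    uint8_t b;
    do {
        b = in[n++];
        v |= (uint32_t)(b & 0x7F) << shift;
        shift += 7;
    } while (b & 0x80);
    *len = n;
    return v;
}
```

The MSB-first MIDI form differs only in emitting the high groups first; the per-byte continuation flag is the same.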

Anyway, UTF-8 is different because the length is unary encoded in the first byte (we call this a prefix varint encoding). You probably know this. As a result, UTF-8 gets only 5 more bits per byte because you lose a bit from the first byte and you only get 6 more bits per new byte. For text processing, it is nice to be able to rewind to the previous character.

You can take the prefix varint coding one more step by grouping them together into a shared tag byte, or a group varint. For a while, we were using this:

http://stackoverflow.com/questions/2982957/looking-for-more-details-about-group-varint-encoding-decoding-presented-in-jef

We don't use it that much anymore, but if it were important, decoding might be faster using a tag lookup table and pshufb to decode 4 words in parallel. Haven't tried this.
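A scalar sketch of the group-varint format being described (layout assumptions are mine: one tag byte holds four 2-bit fields, each storing length-1 in bytes, followed by the four values little-endian). A pshufb decoder would use the tag to index a shuffle table instead of this loop; this only shows the byte format, and it assumes a little-endian host.

```c
#include <stdint.h>
#include <string.h>

/* Decode one group of four values; returns pointer past the group. */
static const uint8_t *gv_decode4(const uint8_t *p, uint32_t out[4])
{
    uint8_t tag = *p++;
    for (int i = 0; i < 4; i++) {
        int len = ((tag >> (2 * i)) & 3) + 1;  /* 1..4 bytes each */
        uint32_t v = 0;
        memcpy(&v, p, (size_t)len);  /* little-endian partial load */
        out[i] = v;
        p += len;
    }
    return p;
}
```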

September 5, 2012 at 10:09 AM

Blogger cbloom said...

Ah, I hadn't thought about the fact that web people are interested in these codes for putting numbers in web protocols and such.

Yeah, for these purposes I've been discounting the arrangement of the bits and considering any encoding scheme which produces the same output lengths to be the same.

Unary prefix codes are especially fast to decode because you can use countlz. Perhaps an addendum on this is warranted.
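A sketch of the countlz trick for a UTF-8-style first byte, using the gcc/clang `__builtin_clz` intrinsic (compiler-specific; MSVC would use `_BitScanReverse`): invert the byte so the leading ones become leading zeros, then count them. It assumes a valid first byte, so a stray continuation byte is not detected.

```c
#include <stdint.h>

/* Sequence length (1..4) from the unary prefix of a first byte. */
static int seq_len_from_first_byte(uint8_t b)
{
    if (b < 0x80) return 1;  /* 0xxxxxxx: single byte */
    /* Place the byte in bits 24..31 of a 32-bit register, invert, and
       count leading zeros = number of leading one bits in b. */
    return __builtin_clz(~((uint32_t)b << 24));
}
```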

September 5, 2012 at 10:44 AM

Blogger won3d said...

Well, I wouldn't really say for "web" coding, but certainly for "network." This distinction is getting somewhat blurred by things like binary AJAX requests, WebRTC, etc. Of course, you're still having to do the decoding in JS, which makes bit-bashing painful enough that I would still avoid it.

Interestingly, the varint codings that we tend to use don't actually do the obvious exclusion you're talking about. UTF-8 is like this; the following byte sequence (I put a gap between tag and payload):

110 00000
10 000000

is actually an illegal encoding of 0, not a legal encoding of 128.

September 5, 2012 at 2:56 PM
