undefined

points

[-]

I think these are different use cases. If you talk about SIMD, you talk about the CPU and efficient processing of large numbers of integers. I think that when a solution like this crops up, it's about storage or transmission, and dense packing at the cost of non-uniformity. It's more like time-series databases pack numbers by delta encoding.

by kstenerud1 hours ago|

parent|

[-]

The thing is, most real-world numbers will fit within 1-3 bytes (even at 7 bits per byte), so ultradense packing doesn't actually buy much outside of benchmarks.

I spent WAYYYYYYYY too much time exploring this...

by the-lazy-guy1 hours ago|

parent|

prev|

[-]

Stored and transmitted data also has to be decoded. With modern datacenter hardware bottleneck is often CPU rather than network or disk (SSD). It depends on specific properties of the data. (I used to work on search index implementation which is about decoding and intersecting large amounts of hit-lists; and right SIMD-friendly varint encoding is obviously crucial)

by jaen1 hours ago|

prev|

[-]

This doesn't seem particularly hard to SIMD, especially when the CPU architecture has "compress/expand" horizontal instructions. The first byte fully encodes the length, which is not harder than the continuation bits of (U)LEB128. It's a basically a common length-prefixed encoding with an extra subtract added in, so someone has probably figured out an efficient algorithm.

It might be slightly more instructions than some other serial VL (variable-length) integer codec choices, but overall I don't think it's more difficult.

The very efficient SIMD VL codecs tend to stripe (separate) the control and data bits, so they're in a different design space anyway.

by kstenerud1 hours ago|

parent|

[-]

It can't be done, because the next bytes are dependent upon the first byte (which only works in limited circumstances, and where you have constant spacing between the values).

ULEB128 works in SIMD because there's only one dependent bit per byte, so you can speculatively decode and then correct later cheaply. Bijou requires you to check the first byte and then branch based on the value using all 8 bits in the decision matrix (to handle branches 0-247, 248, 249, 250, 251, 252, 253, 254, 255). This absolutely DESTROYS any parallelization opportunities.

Not to mention that non-canonical sized ints (3, 5, 6, 7) have abysmal performance compared to unaligned 2, 4, and 8 byte reads on modern processors.

by jaen51 minutes ago|

parent|

[-]

Right, I think we have a slightly different definition of SIMD: You mean byte-parallel, I mean "doable with SIMD instructions". I also didn't imply the performance would be better than other methods...

Even though decoding the lengths must be serial (since's there's no unambiguous way to differentiate a tag and data byte), it's still doable within the wider SIMD registers, so there's some theoretical efficiency gain to be had (depending on the shape of the data).

On a general note, the continuation bit and prefix byte forms are equivalent, you just broadcast the prefix byte and compare against an increasing vector to convert it to a mask. Yeah, there's probably more fiddly SIMD if there are multiple prefixes in the register, but doable (it's just not byte-parallel, you eg. unroll the serial decode loop 8 times or whatever your maximum output byte width is, and mask out).

Simplified:

  // Just maps a byte to its position in the register
  __m128i idx = _mm_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
  // Broadcast the prefix
  __m128i nn = _mm_set1_epi8((char)prefix_byte);
  // Get applicable locations: prefix_byte contains the length, if byte_pos < len, the corresponding byte will be set
  __m128i m = _mm_cmpgt_epi8(nn, idx);
  // If you *really* want a high-bit mask:
  m = _mm_and_si128(m, _mm_set1_epi8((char)0x80));

by kstenerud41 minutes ago|

parent|

[-]

Yeah, sorry, I didn't say that very well. Single value decoding of Bijou values is of course trivial in SIMD, but the performance benefits of SIMD come from deterministic boundaries across a window. ULEB128's continuation bit is fixed position, so it's data independent. One pmovmskb gives you every boundary in the window.

Interleaved Bijou has no such signal (tag and payload bytes both span 0x00–0xFF), so finding the boundaries is a dependent per-value walk with no opportunities for parallelism.

by jaen25 minutes ago|

parent|

[-]

There's still speculation though - if eg. most values are of 1 or 2-byte length, you can speculate that any control-valued byte is actually control. You can even do a compensation pass to try to fix some amount of mis-speculations, and then bomb out if that fails.

With that, it's mostly byte-parallel (though data-dependent as I mentioned).

by 59 minutes ago|

parent|

prev|

[-]

deleted