undefined

points

[-]

For coding tasks 27B is reported to be much more effective, altho you can probably only run 4b or 5b quants @ this memory.

Recommend https://www.reddit.com/r/LocalLLaMA/ as a great source for this type of discussion.

by pixelesque6 hours ago|

prev|

[-]

Thank you - I'll give that a go!

by julianlam6 hours ago|

prev|

[-]

May I ask why the M instead of XL?

Obviously bigger != better but I don't know what the differences are.

by DiabloD33 hours ago|

parent|

[-]

These are dynamic quants, and they're basically just an indication of how far away from the desired quant it is allowed to go to achieve the goal. Generally, unsloth's toolchain moves quants up, rarely down.

* _0 and _1 do not use K quant and scales 32x32 blocks according to the original (B)F16 values; _0 scales the block using the original max and min values. _1 does this per row instead of per block.

* K quants do something similar, but now splits blocks into subblocks inside a superblock where the superblock has min/max scaling, but the subblocks also have scaling in the range of the superblock's scaling and are stored using less bits.

* K's M, L, XL are just how aggressively the subblocks and their scaling factors are chosen. Generally, it puts a max on how far you can deviate from the chosen quant to maintain the desired quality, but also gives them a bigger budget to perform that excursion in. XL most aggressively tries to preserve the intended quality, while S does the least.

* Dynamic quant on top of this scales entire layers, full of blocks, according to how much they effect various measurements (such as KLD and perplexity).

That said, there is no reason K_S is even produced by anyone, same with Q_0, Q_1, and I_NL. People should no longer be using those. M only is meaningful if you're trying to restrict the upper bounds: K_XL can reach BF16 for some weights, but rarely; people think this has a speed implication for hardware that has native 8bit in their tensor units (but it doesn't).

Unless you're specifically trying to cure a problem, stick with K_XL.

by rao-v1 hours ago|

parent|

[-]

Hey some of us are on hardware (gfx906 based Radeon MI50s with 32GB of stupidly fast VRAM and basically no compute) that inference significantly faster with Q_0 and Q_1 quants

by srcrip1 hours ago|

parent|

prev|

[-]

You seem to understand this stuff pretty well, any recommendations on resources (blogs, YouTube channels, whatever) for software engineers that want to keep up with this stuff on this kind of level?

A lot of the content about AI out there is kind of produced to the lowest common denominator. Basically a never ending scheme of get rich quick/passive income kinds of AI content.