upvote
Q4_K is a type of quantization. It means that all weights will be at a minimum 4bits using the K method.

But if you're willing to give more bits to only certain important weights, you get to preserve a lot more quality for not that much more space.

The S/M/L/XL is what tells you how many tensors get to use more bits.

The difference between S and M is generally noticeable (on benchmarks). The difference between M and L/XL is less so, let alone in real use (ymmv).

Here's an example of the contents of a Q4_K_:

    S
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  136 tensors
    llama_model_loader: - type q5_0:   43 tensors
    llama_model_loader: - type q5_1:   17 tensors
    llama_model_loader: - type q6_K:   15 tensors
    llama_model_loader: - type q8_0:   55 tensors
    M
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  106 tensors
    llama_model_loader: - type q5_0:   32 tensors
    llama_model_loader: - type q5_K:   30 tensors
    llama_model_loader: - type q6_K:   15 tensors
    llama_model_loader: - type q8_0:   83 tensors
    L
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  106 tensors
    llama_model_loader: - type q5_0:   32 tensors
    llama_model_loader: - type q5_K:   30 tensors
    llama_model_loader: - type q6_K:   14 tensors
    llama_model_loader: - type q8_0:   84 tensors
reply
They are different quantization types, you can read more here https://huggingface.co/docs/hub/gguf#quantization-types
reply
Just start with q4_k_m and figure out the rest later.
reply