by nextaccountic 15 hours ago

by nl 15 hours ago

It can.

It's something that is implemented by the thing that runs the model (e.g. llama.cpp) rather than by the model itself.

Note that it is hard to make this work if you turn thinking on, because the grammar gets complicated quickly (I don't recall if Qwen 0.6B can do thinking).

by aesthesia 14 hours ago

Thinking shouldn't be too hard to deal with: just let the model generate freely until it hits a </think> token, then do constrained decoding, right?
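A minimal sketch of that two-phase idea, assuming a toy vocabulary and a stand-in model. `toy_logits`, `allowed_next`, and `generate` are hypothetical names made up for illustration, not the API of llama.cpp or any other runtime. Decoding is unconstrained until `</think>` is emitted, after which every step is masked down to the tokens a (trivial) answer grammar allows.

```python
import math
from typing import Callable, List, Set

# Toy vocabulary (assumption): a real tokenizer has tens of thousands of entries.
VOCAB = ["hmm", "</think>", "{", "}", "<eos>"]
THINK_END = VOCAB.index("</think>")
EOS = VOCAB.index("<eos>")


def toy_logits(prefix: List[int]) -> List[float]:
    """Stand-in for a model: thinks for two tokens, closes the think block,
    and would then keep rambling ('hmm') if nothing constrained it."""
    if len(prefix) < 2:
        return [5.0, 1.0, 0.0, 0.0, 0.0]
    if len(prefix) == 2:
        return [1.0, 5.0, 0.0, 0.0, 0.0]
    return [5.0, 0.0, 2.0, 1.0, 0.5]


def allowed_next(answer: List[int]) -> Set[int]:
    """Hypothetical answer grammar: exactly '{', then '}', then end of sequence."""
    order = [VOCAB.index("{"), VOCAB.index("}"), EOS]
    return {order[len(answer)]} if len(answer) < len(order) else set()


def generate(logits_fn: Callable[[List[int]], List[float]],
             max_tokens: int = 16) -> List[int]:
    out: List[int] = []
    answer_start = None  # set once </think> has been emitted
    while len(out) < max_tokens:
        logits = logits_fn(out)
        if answer_start is not None:
            # Phase 2: constrained decoding over the answer only.
            allowed = allowed_next(out[answer_start:])
            if not allowed:
                break
            logits = [x if i in allowed else -math.inf
                      for i, x in enumerate(logits)]
        # Greedy pick; masked tokens sit at -inf and can never win.
        tok = max(range(len(logits)), key=logits.__getitem__)
        out.append(tok)
        if answer_start is None and tok == THINK_END:
            answer_start = len(out)  # Phase 1 (free generation) ends here
        if tok == EOS:
            break
    return out


tokens = [VOCAB[t] for t in generate(toy_logits)]
```

The grammar state only ever sees the tokens after `</think>`, so the grammar itself does not need to describe the thinking text.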

by stymaar 11 hours ago

Sure, but does llama.cpp support that?

by thomascountz 13 hours ago

Yes, you can use constrained decoding techniques like logit masking: force the logits of all invalid tokens in the vocabulary to -inf, which effectively removes them from selection. I believe llama.cpp exposes this by accepting a formatted grammar.
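For illustration, logit masking on a single decoding step might look like the sketch below. The four-token vocabulary, the logit values, and the helper names `mask_logits` and `softmax` are assumptions made up for the example; a real runtime would derive the allowed set from the grammar state rather than hard-coding it.

```python
import math
from typing import List, Set


def mask_logits(logits: List[float], allowed_ids: Set[int]) -> List[float]:
    """Set every token outside allowed_ids to -inf so it cannot be sampled."""
    if not allowed_ids:
        raise ValueError("at least one token must remain allowed")
    return [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; exp(-inf) is 0.0, so masked tokens get zero mass."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["yes", "no", "maybe", "{"]   # toy vocabulary (assumption)
logits = [1.0, 2.0, 5.0, 0.5]         # the model alone would prefer "maybe"
allowed = {0, 1}                      # the grammar only permits "yes" or "no"

probs = softmax(mask_logits(logits, allowed))
best = vocab[max(range(len(probs)), key=probs.__getitem__)]
```

After masking, all probability mass is renormalized over the allowed tokens, so sampling or greedy selection can proceed as usual.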

by mijoharas 7 hours ago

This was my thought as well. I'm surprised that it's not being used here (afaict).
