The model is open weights, so you can download it from the link given at the top.

Then you can run it with an inference backend such as llama.cpp, on any hardware that backend supports.

However, this is a big model, so even after quantization you need a lot of memory to run it.
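
The memory requirement can be estimated with back-of-the-envelope arithmetic: weight bytes scale with parameter count times bits per weight, plus some overhead for the KV cache and activations. A minimal sketch (the function name, the 10% overhead factor, and the 70B example are assumptions for illustration, not properties of any specific model):

```python
def model_memory_gib(n_params_billion, bits_per_weight, overhead=1.1):
    """Rough GiB needed to hold quantized weights in memory.

    overhead is an assumed fudge factor for KV cache and activations.
    """
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 2**30

# Hypothetical example: a 70B-parameter model at 4-bit quantization
print(round(model_memory_gib(70, 4)))  # roughly 36 GiB
```

This is why quantization alone often isn't enough: halving bits per weight halves the requirement, but a large enough model still exceeds typical consumer RAM.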

The alternative is to run it much more slowly by storing the weights on an SSD. Some results on optimizing inference for this setup have already been published, and I expect it will become more common in the future.
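
The "much more slowly" can be quantified with a rough upper bound: for a dense model, generating each token requires reading every weight once, so decode speed is capped by SSD read bandwidth divided by weight size. A sketch under that assumption (the numbers in the example are hypothetical):

```python
def ssd_bound_tokens_per_sec(weight_gib, ssd_gib_per_sec):
    """Upper bound on decode speed when weights stream from SSD.

    Assumes a dense model where every weight is read once per
    generated token; MoE models with few active experts do better.
    """
    return ssd_gib_per_sec / weight_gib

# Hypothetical: 200 GiB of quantized weights on a 7 GiB/s NVMe drive
print(ssd_bound_tokens_per_sec(200, 7))  # ~0.035 tokens/s
```

At that rate you are waiting minutes per token, which is why this only makes sense for batch or offline use, or for sparse models where only a fraction of the weights are touched per token.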

There are cases where running a better model slowly is still preferable to running a worse model quickly, if the fast model gives poor results.
