Thanks for sharing the link to your instance. It was blazing fast in responding. I tried throwing a few things at it, with the following results:

1. Generating an R script that takes a city and country name, finds its lat/long, and maps it using ggmaps. It generated a pretty decent script (could be more optimal, but impressive for the model size), with warnings about using geojson if possible.

2. Generating a LaTeX script to display the Gaussian integral equation. It generated a (I think) non-standard version using probability distribution functions instead of the general version (see below), but I still give it points for that. It also gave explanations of the formula and its parameters, as well as instructions on how to compile the script using BASH etc.

3. Generating a LaTeX script to display the Euler identity equation. This one it nailed.
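
For reference, written from memory rather than copied from the model's output: the general form of the Gaussian integral, the normal-distribution (PDF) normalization it apparently produced instead, and Euler's identity, in plain LaTeX:

    \[ \int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi} \]                                        % general form
    \[ \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx = 1 \]   % PDF form
    \[ e^{i\pi} + 1 = 0 \]                                                                          % Euler identity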

Strongly agree that the knowledge density is impressive for a 1-bit model of such a small size, with such blazing fast responses.

reply
> Was blazing fast in responding.

I should note this is running on an RTX 6000 pro, so it's probably at the max speed you'll get for "consumer" hardware.

reply
consumer hardware?

That... pft. Nevermind, I'm just jealous

reply
Look, it was my present to myself after the Figma IPO (I worked there 5 years). If you want to feel less jealous, look at the stock price since then.
reply
Holy hell ... that's a monster of a card
reply
I must add that I also tried out the standard "should I walk or drive to the carwash 100 meters away for washing the car" question, and it made the usual error of suggesting a walk given the distance, health reasons, etc. But then this does not claim to be a reasoning model, and I did not expect, even in the remotest case, for this to be answered correctly. Even previous-generation, larger reasoning models struggle with this.
reply
I ran it through a rudimentary thinking harness, and it still failed, fwiw:

    The question is about the best mode of transportation to a car wash located 100 meters away. Since the user is asking for a recommendation, it's important to consider practical factors like distance, time, and convenience.

    Walking is the most convenient and eco-friendly option, especially if the car wash is within a short distance. It avoids the need for any transportation and is ideal for quick errands.
    Driving is also an option, but it involves the time and effort of starting and stopping the car, parking, and navigating to the location.
    Given the proximity of the car wash (100 meters), walking is the most practical and efficient choice. If the user has a preference or if the distance is longer, they can adjust accordingly.
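
For anyone curious, a rudimentary harness like that can be approximated with two passes against llama-server's OpenAI-compatible endpoint; the URL, port, and prompt wording below are assumptions, not exactly what I ran:

    # minimal two-pass "think, then answer" sketch (endpoint/port are assumptions)
    API=http://localhost:8080/v1/chat/completions
    Q="Should I walk or drive to the carwash 100 meters away to wash my car?"
    # pass 1: ask for step-by-step notes only
    THOUGHTS=$(curl -s "$API" -H 'Content-Type: application/json' \
      -d "$(jq -n --arg q "$Q" '{messages:[{role:"user",content:("Think step by step, but do not answer yet: "+$q)}]}')" \
      | jq -r '.choices[0].message.content')
    # pass 2: feed the notes back and ask for the final answer
    curl -s "$API" -H 'Content-Type: application/json' \
      -d "$(jq -n --arg q "$Q" --arg t "$THOUGHTS" '{messages:[{role:"user",content:($q+"\n\nDraft notes:\n"+$t+"\n\nNow give a short final answer.")}]}')" \
      | jq -r '.choices[0].message.content'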
reply
As someone whose brain was addled by exposure to art history, I strongly support the suggested pelican on bicycle.
reply
here's the google colab link, https://colab.research.google.com/drive/1EzyAaQ2nwDv_1X0jaC5... since the ngrok link likely got ddosed by the number of individuals coming along
reply
Thanks, that works. I only tested the 1.7B. It has that original GPT3 feel to it. Hallucinates like crazy when it doesn't know something. For something that will fit on a GTX1080, though, it's solid.

We're only a couple of years into optimization tech for LLMs. How many other optimizations are we yet to find? Just how small can you make a working LLM that doesn't emit nonsense? With the right math could we have been running LLMs in the 1990s?

reply
Good call. Right now, though, traffic is low (1 req per min). With the speed of completion I should be able to handle ~100x that, but if the ngrok link doesn't work, defo use the google colab link.
reply
The link didn't work for me personally, but that may be a bandwidth issue with me fighting for a connection in the EU
reply
Thanks. Did you need to use Prism's llama.cpp fork to run this?
reply
Yep.
reply
Could you elaborate on what you did to get it working? I built it from source, but couldn't get it (the 4B model) to produce coherent English.

Sample output below (the model's response to "hi" in the forked llama-cli):

X ( Altern as the from (.. Each. ( the or,./, and, can the Altern for few the as ( (. . ( the You theb,’s, Switch, You entire as other, You can the similar is the, can the You other on, and. Altern. . That, on, and similar, and, similar,, and, or in

reply
I have an older M1 Air with 8GB, but I'm still getting over 23 t/s on the 4B model, and the quality of outputs is on par with top models of similar size.

1. Clone their forked repo: `git clone https://github.com/PrismML-Eng/llama.cpp.git`

2. Then build it (assuming you already have the Xcode build tools installed):

  cd llama.cpp
  cmake -B build -DGGML_METAL=ON
  cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)

3. Finally, run it with (you can adjust the arguments):

  ./build/bin/llama-server -m ~/Downloads/Bonsai-8B.gguf --port 80 --host 0.0.0.0 --ctx-size 0 --parallel 4 --flash-attn on --no-perf --log-colors on --api-key some_api_key_string

The model was first downloaded from: https://huggingface.co/prism-ml/Bonsai-8B-gguf/tree/main
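
Once llama-server is up, a quick sanity check against its OpenAI-compatible API looks something like this (matching the port and API key from the command above):

    curl -s http://localhost:80/v1/chat/completions \
      -H "Authorization: Bearer some_api_key_string" \
      -H "Content-Type: application/json" \
      -d '{"messages":[{"role":"user","content":"hi"}],"max_tokens":64}'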
reply
To the author: why is this taking 4.56GB? I was expecting this to be under 1GB for a 4B model. https://ibb.co/CprTGZ1c

And this is when I'm serving zero prompts, having just loaded the model (using llama-server).

reply
I did this: https://image.non.io/2093de83-97f6-43e1-a95e-3667b6d89b3f.we...

Literally just downloaded the model into a folder, opened Cursor in that folder, and told it to get it running.

Prompt: The gguf for bonsai 8b are in this local project. Get it up and running so I can chat with it. I don't care through what interface. Just get things going quickly. Run it locally - I have plenty of vram. https://huggingface.co/prism-ml/Bonsai-8B-gguf/tree/main

I had to ask it to increase the context window size to 64k, but other than that it got it running just fine. After that I just told ngrok the port I was serving it on and voila.
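
If anyone wants to replicate the ngrok step, it's just a one-liner (8080 here is a guess; use whatever port llama-server ended up listening on):

    ngrok http 8080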

reply
It reminds me of very early ChatGPT, with mostly correct answers but some nonsense. Given its speed, it might be interesting to run it through a 'thinking' phase where it double-checks its answers, and/or to use search grounding, which would make it significantly more useful.
reply
The speed is impressive; I wish it could be set up for something similar to speculative decoding.
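
Recent mainline llama.cpp builds do support speculative decoding in llama-server via a draft model, roughly like the sketch below; whether a Bonsai GGUF would actually work as a draft for a larger target (the vocabularies have to be compatible) is untested on my end, and the file names are placeholders:

    # hypothetical: use the small, fast model as a draft for a larger target model
    ./build/bin/llama-server \
      -m  ~/models/larger-target-model.gguf \
      -md ~/Downloads/Bonsai-8B.gguf \
      --draft-max 16 --draft-min 4 \
      --port 8080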
reply
man, that is really really quick. What is your desktop setup??? GPU?
reply
It is fast, but I do have good hardware. A few people have asked for my local inference build, so I have an existing guide that mirrors my setup: https://non.io/Local-inference-build
reply
Thanks, I tested it; it failed the strawberry test. Qwen 3.5 0.8B, at a similar size, passes it and is far more usable.
reply
Does asking it to think step by step, or character by character, improve the answer? It might be a tokenization issue plus unawareness of its own tokenization shortcomings.
reply
No, it did not; going character by character, it concluded 2 :-)
reply
Interesting. Qwen 3.5 0.8B failed the test for me.
reply
Wow, that was cooler than I expected. Now I'm curious to embed this for some lightweight semantic workflows.
reply