The data gets downloaded via curl from Hugging Face. Sure, you can use your own data: simply dump all the text you want the model to be trained on into "corpus.txt" and skip "make data".

As a tokenizer adds substantial complexity, this implementation does not include any tokenization logic and works on raw bytes. Feel free to add your own tokenizer with the help of the coding model of your choice.
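To illustrate what "works on raw bytes" means in practice, here is a minimal sketch of byte-level token ids. The names `VOCAB_SIZE` and `bytes_to_tokens` are made up for this example and are not taken from train.c; the only assumption is that each input byte becomes one token id in the range 0 to 255.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration, not taken from train.c. */
enum { VOCAB_SIZE = 256 }; /* one token id per possible byte value */

/* Map each raw byte of `text` to a token id in [0, VOCAB_SIZE). */
static void bytes_to_tokens(const char *text, size_t n, int *tokens) {
    assert(text != NULL && tokens != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Cast through unsigned char: plain char may be signed, which
           would turn bytes >= 0x80 (e.g. UTF-8) into negative ids. */
        tokens[i] = (unsigned char)text[i];
    }
}
```

A real tokenizer would replace this mapping with a larger vocabulary, at which point a token id no longer fits in a single byte.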

You can stop the training using CTRL+C. You can train on as little memory as you have: simply reduce the batch size and/or model dimensions in train.c. You can change the context window size in train.c via the "seq_len" variable.
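For anyone who wants CTRL+C to end the run cleanly (for example to save weights first), one common pattern is a SIGINT handler that only sets a flag, which the training loop then checks. This is a sketch under that assumption; whether train.c already does something like this is not stated here, and `stop_requested`, `on_sigint` and `install_stop_handler` are hypothetical names.

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical names, not taken from train.c. */
static volatile sig_atomic_t stop_requested = 0;

/* Signal handler: only set a flag, do no other work here. */
static void on_sigint(int sig) {
    (void)sig;
    stop_requested = 1;
}

/* Register the handler for CTRL+C (SIGINT). */
static void install_stop_handler(void) {
    void (*prev)(int) = signal(SIGINT, on_sigint);
    assert(prev != SIG_ERR);
    (void)prev;
}

/* Intended use inside a training loop (pseudocode):
 *   install_stop_handler();
 *   while (!stop_requested) { run one training step }
 *   save weights, free memory, exit
 */
```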

Regarding Ruby, LoRA and quantization, I'll have to refer you to the coding agent of your choice.

Maybe add a simple step between start and train:

convert the text data to binary data. This would help when converting different kinds of data.

(please support 8-bit, 16-bit and 32-bit formats)
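A rough sketch of what such a conversion step could look like, assuming one token per input byte and little-endian output. The function names and the file layout are invented for this example and are not part of the project. Note that with raw bytes the 8-bit output is identical to the input file, so the wider formats only start to matter once a tokenizer with more than 256 ids is added.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Encode one token id as little-endian bytes.
   Returns the number of bytes written (1, 2 or 4), or 0 if the width
   is unsupported or the id does not fit into that width. */
static size_t encode_token(uint32_t id, int width_bits, unsigned char out[4]) {
    assert(out != NULL);
    if (width_bits != 8 && width_bits != 16 && width_bits != 32) return 0;
    if (width_bits < 32 && (id >> width_bits) != 0) return 0;
    size_t nbytes = (size_t)width_bits / 8;
    for (size_t i = 0; i < nbytes; i++) {
        out[i] = (unsigned char)((id >> (8 * i)) & 0xFFu);
    }
    return nbytes;
}

/* Convert a text file into a binary token file, one token per input byte.
   Returns 0 on success, -1 on any error. */
static int convert_file(const char *in_path, const char *out_path, int width_bits) {
    FILE *in = fopen(in_path, "rb");
    if (!in) return -1;
    FILE *out = fopen(out_path, "wb");
    if (!out) { fclose(in); return -1; }

    int c, rc = 0;
    unsigned char buf[4];
    while ((c = fgetc(in)) != EOF) {
        size_t n = encode_token((uint32_t)c, width_bits, buf);
        if (n == 0 || fwrite(buf, 1, n, out) != n) { rc = -1; break; }
    }
    if (ferror(in)) rc = -1;
    fclose(in);
    if (fclose(out) != 0) rc = -1;
    return rc;
}
```

Something like `convert_file("corpus.txt", "corpus.bin", 16)` would then produce the 16-bit variant; the output file name is just a placeholder.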
