by himata4113 5 days ago

by logicprog 5 days ago

DSv4 is nearly in the 2T range, but yes, you're generally right.

by himata4113 5 days ago

MoE experts were likely trained independently / in a sparse format. Training anything beyond 2T on typical systems would be infuriatingly slow. You could do 4T on NVIDIA's room-scale solution, but for a reasonable training speed / batch size it caps out around 3T.

by sosodev 5 days ago

Do you have any resources to share regarding independent expert training? I was under the impression that it's not feasible.

by himata4113 5 days ago

The concept is similar to how it works in inference: instead of performing regressive writes to the entire model, you run the whole model, but part of it can live in system memory and get swapped in/out on demand. So only XB parameters are active in training.

edit: I am not really sure if it works like that. I haven't looked too deeply into DeepSeek V4 Pro specifically.
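
To make the swapping idea concrete, here is a toy sketch of the forward-pass side only. It assumes top-k routing and a small LRU cache standing in for accelerator memory; `ExpertCache`, `moe_forward`, and the toy experts are made-up names for illustration. It says nothing about how DeepSeek actually trains, and it does not cover gradients or optimizer state.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy LRU cache standing in for limited accelerator memory (hypothetical)."""

    def __init__(self, host_experts, capacity):
        self.host_experts = host_experts  # all experts, kept in "system memory"
        self.capacity = capacity          # how many experts fit "on device"
        self.resident = OrderedDict()     # experts currently "on device"
        self.swap_ins = 0                 # number of host -> device loads

    def fetch(self, expert_id):
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)   # mark as recently used
        else:
            if len(self.resident) >= self.capacity:
                self.resident.popitem(last=False)  # evict least recently used
            self.resident[expert_id] = self.host_experts[expert_id]
            self.swap_ins += 1
        return self.resident[expert_id]


def top_k(scores, k):
    """Indices of the k highest router scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def moe_forward(x, router_scores, cache, k=2):
    """Average the outputs of the k routed experts; only those are touched."""
    chosen = top_k(router_scores, k)
    outputs = [cache.fetch(i)(x) for i in chosen]
    return sum(outputs) / k, chosen


# Eight toy "experts": expert i just multiplies its input by i + 1.
host_experts = {i: (lambda x, w=i + 1: w * x) for i in range(8)}
cache = ExpertCache(host_experts, capacity=2)

# First token routes to experts 1 and 3, so both are swapped in.
y1, chosen1 = moe_forward(1.0, [0.1, 0.9, 0.0, 0.8, 0.2, 0.0, 0.0, 0.0], cache)

# Second token routes to experts 3 and 4: 3 is already resident, 4 evicts 1.
y2, chosen2 = moe_forward(1.0, [0.0, 0.0, 0.0, 0.9, 0.8, 0.0, 0.0, 0.0], cache)
```

Out of eight experts, only three are ever loaded across the two calls, which is the "only XB parameters are active" point in miniature. A real system would also have to move gradients and optimizer state, which this sketch ignores.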

by axpy906 5 days ago

We’ll see it distilled first.

by OtomotO 5 days ago

Ah, American hubris ... I don't blame you; Hollywood is the world's greatest propaganda machine of all time.