xbar 9 hours ago
At 0.1 GB per full-attention layer, and given that "The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention," only the 15 full-attention layers contribute KV cache: 15 × 0.1 GB = 1.5 GB.
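A minimal back-of-the-envelope sketch of that arithmetic, assuming the 0.1 GB/layer figure above applies to every full-attention layer and that the GatedDeltaNet layers keep only a fixed-size recurrent state (so they add no growing KV cache):

  # Figures taken from the comment above; 0.1 GB/layer is an assumed KV-cache cost
  total_layers = 60
  linear_attention_layers = 45  # GatedDeltaNet: fixed-size state, no KV cache growth
  full_attention_layers = total_layers - linear_attention_layers  # 15
  gb_per_full_attention_layer = 0.1
  kv_cache_gb = full_attention_layers * gb_per_full_attention_layer
  print(kv_cache_gb)  # prints 1.5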