xbar 9 hours ago
At 0.1 GB per full-attention layer, and given that "The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention," only the 15 full-attention layers contribute KV cache: 15 × 0.1 GB = 1.5 GB.
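A minimal back-of-the-envelope sketch of that arithmetic, assuming the 0.1 GB/layer figure above applies to every full-attention layer and that the GatedDeltaNet layers keep only a fixed-size recurrent state (so they add no growing KV cache):

  # Figures taken from the comment above; 0.1 GB/layer is an assumed KV-cache cost
  total_layers = 60
  linear_attention_layers = 45  # GatedDeltaNet: fixed-size state, no KV cache growth
  full_attention_layers = total_layers - linear_attention_layers  # 15
  gb_per_full_attention_layer = 0.1
  kv_cache_gb = full_attention_layers * gb_per_full_attention_layer
  print(kv_cache_gb)  # prints 1.5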