Hacker News
rao-v 15 hours ago
ACCount37 15 hours ago
Full distributions are a fucking pain to save; at that point, just save the hidden states. But there are lossy compression tricks there.
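For a rough sense of scale, here is a back-of-the-envelope sketch of per-token storage. The vocabulary size, hidden width, and dtypes are hypothetical numbers picked for illustration, not taken from any specific model, and it assumes "hidden states" means the final-layer state, from which logits can be recomputed if you still have the output head.

```python
# Rough per-token storage cost under assumed (hypothetical) model sizes.
VOCAB_SIZE = 128_000   # assumption: vocabulary size
HIDDEN_SIZE = 4_096    # assumption: width of the final hidden state
TOP_K = 10             # number of logits kept in the top-k variant
BYTES_FP16 = 2         # one half-precision value
BYTES_INT32 = 4        # one token id


def bytes_per_token():
    """Return the storage cost in bytes per token for each option."""
    return {
        # One fp16 value for every vocabulary entry.
        "full_distribution": VOCAB_SIZE * BYTES_FP16,
        # One fp16 value per hidden dimension.
        "hidden_state": HIDDEN_SIZE * BYTES_FP16,
        # One fp16 logit plus one int32 token id per kept entry.
        "top_k_logits": TOP_K * (BYTES_FP16 + BYTES_INT32),
    }


costs = bytes_per_token()
for name, size in costs.items():
    print(f"{name}: {size} bytes/token")
```

Under these assumptions the hidden state is roughly 30x smaller than the full distribution, and top-k is smaller again, at the cost of throwing away the tail.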
rao-v 8 hours ago
To the previous poster's point, soft distributions are useful: even saving just the top 10 logits gives significantly more training signal than the final token alone.
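A minimal sketch of what that could look like, in plain Python so it runs anywhere. The function names are made up for illustration, and renormalizing the teacher over only the saved tokens is one simple choice among several, not something specified in this thread.

```python
import math


def top_k_logits(logits, k=10):
    """Return the k largest logits as (token_id, logit) pairs."""
    ranked = sorted(enumerate(logits), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(values)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in values))
    return [v - log_z for v in values]


def soft_target_loss(saved_top_k, student_logits):
    """Cross-entropy of the student against the saved top-k teacher logits.

    Assumption: the teacher distribution is renormalized over the saved
    tokens only, so the discarded tail contributes nothing to the loss.
    """
    teacher_probs = softmax([logit for _, logit in saved_top_k])
    student_log_probs = log_softmax(student_logits)
    return -sum(
        p * student_log_probs[token_id]
        for (token_id, _), p in zip(saved_top_k, teacher_probs)
    )


def hard_target_loss(token_id, student_logits):
    """Ordinary cross-entropy against a single sampled token."""
    return -log_softmax(student_logits)[token_id]


# Toy 5-token vocabulary; keep the teacher's top 3 logits.
teacher_logits = [2.0, 1.0, 0.1, -1.0, -3.0]
saved = top_k_logits(teacher_logits, k=3)
student_logits = [0.5, 1.5, 0.0, 0.0, 0.0]

print("saved:", saved)
print("soft loss:", soft_target_loss(saved, student_logits))
print("hard loss:", hard_target_loss(saved[0][0], student_logits))
```

The hard loss only tells the student about one token per position; the soft loss also carries the teacher's relative preferences among the runner-up tokens.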