Everything can be represented as f(); a full-scale SotA transformer model is also just f(context). That does not mean one layer is sufficient. It all depends on how much expressivity this f needs in order to be a good model.
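To make the "one layer is not automatically enough" point concrete, here is a minimal sketch using the textbook XOR case. Everything in it (the function names, the hand-picked weights, the search grid) is an assumption made up for illustration, and the brute-force search is a demonstration, not a proof.

```python
from itertools import product

# Target function: XOR on two binary inputs.
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def relu(v):
    return max(0.0, v)

def two_layer(a, b):
    """Hand-picked weights: two ReLU hidden units, then a linear readout."""
    h1 = relu(a + b)
    h2 = relu(a + b - 1.0)
    return h1 - 2.0 * h2

def some_single_unit_fits(grid):
    """Brute-force search for one thresholded linear unit that matches XOR."""
    for w1, w2, c in product(grid, repeat=3):
        if all((w1 * a + w2 * b + c > 0) == bool(t) for (a, b), t in XOR.items()):
            return True
    return False

grid = [i / 2 for i in range(-6, 7)]  # -3.0 ... 3.0 in steps of 0.5 (arbitrary)
two_layer_ok = all(two_layer(a, b) == t for (a, b), t in XOR.items())
single_ok = some_single_unit_fits(grid)
print(two_layer_ok, single_ok)
```

Both are "just f(x)", but only the deeper one has enough expressivity for this particular target.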

by getnormality, 5 hours ago

What you're suggesting seems to go implausibly far beyond what the paper says.

RL post-training alters the parameters of the transformer, while your f(manifold) idea seems to suggest that a new layer on top would suffice, with no need to alter the transformer itself at all.

It would be extremely handy if that were so, but I'm guessing it isn't, or it would be the prevailing approach.

by wrs, 3 hours ago

The manifold is in the middle (“small input space is expanded onto a big manifold and contracted again”) so f(manifold) would need to be in the middle too.

by earthnail, 6 hours ago

Took me a short time to understand what you mean by "autoencoders on steroids", but I believe you mean they are autoencoders with an inverse bottleneck: an intermediate representation that isn't smaller than the input space but much larger. Is my understanding of your comment correct?

by usernametaken29, 6 hours ago

Kind of. Autoencoders don’t need to have an embedding that’s smaller than the input. Their only requirement is that they compress information and thus create reconstruction loss. Typically, however, they are not trained this way because they don’t converge. Transformers do the same thing, but they can squeeze many more bits of information through one pass because of the way they are designed. This holds true even for decoder-only networks, because they’re still doing the same thing.

by earthnail, 39 minutes ago

If the embedding isn’t smaller than the input, how is it compressing information? It might lose information in its mapping to the embedding space, but in my understanding, the definition of compression means it has to use fewer bits than the original to hold the same information. As such, the embedding space must be smaller.
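For what it's worth, here is a minimal sketch of the distinction being argued, in plain Python. The function names and the toy vector are made up for illustration. It only shows that a wider-than-input embedding can pass everything through unchanged, so width alone does not force any loss, whereas a narrower one has to drop something.

```python
def encode_wide(x, width):
    """Embed x into a larger space by zero-padding (the linear map [I; 0])."""
    return list(x) + [0.0] * (width - len(x))

def decode_wide(z, n):
    """Read the original coordinates back out (the linear map [I, 0])."""
    return list(z[:n])

def encode_narrow(x, k):
    """Keep only the first k coordinates: a real bottleneck."""
    return list(x[:k])

def decode_narrow(z, n):
    """Without extra knowledge, fill the dropped coordinates with zeros."""
    return list(z) + [0.0] * (n - len(z))

def reconstruction_error(x, x_hat):
    """Squared reconstruction error between input and output."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

x = [0.5, -1.0, 2.0, 3.0]
wide_err = reconstruction_error(x, decode_wide(encode_wide(x, 16), len(x)))
narrow_err = reconstruction_error(x, decode_narrow(encode_narrow(x, 2), len(x)))
print(wide_err, narrow_err)  # 0.0 13.0
```

So if a wide model does lose information, that loss has to come from something other than the size of the embedding, such as noise, regularisation, or the training objective.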

by soraki_soladead, 6 hours ago

I might be misunderstanding your point, but this conflates the distinguishing features of each. You mention expansion, but autoencoders canonically compress their inputs. Autoencoders have an explicit encoder and decoder. Most transformers we interact with these days (LLMs) are decoder-only. The manifold isn't typically something the model is applied to directly. We apply the function/model to the latent representations. Those are what live on the manifold.

by usernametaken29, 5 hours ago

Now that’s interesting. What exactly distinguishes latent representations from the manifold? IMHO, those are the same, and you’re constructing a piecewise function of the manifold itself. Decoders also produce manifolds in much the same way, with the distinction being that the encoder isn’t learned but static after initialisation. So fundamentally it is still DOING the same operation.

by soraki_soladead, 5 hours ago

The latent representations of the data are like points on a surface. That surface is the manifold. We don't typically have the full manifold and can only sample points from it by embedding data into it.

Worth noting a different manifold "exists" after each transformation (e.g. layer). You only sample from the same manifold when you apply the same transformation(s).
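A toy sketch of both points, assuming a unit circle as the manifold and a made-up `layer` function standing in for one transformation (neither is from the paper): we only ever hold a handful of sampled points, and after the transformation those points sit on a different curve.

```python
import math

def sample_circle(n):
    """n samples from a 1-D manifold (the unit circle) living in 2-D."""
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n)]

def layer(p):
    """Hypothetical 'layer': a fixed map that stretches the first axis."""
    x, y = p
    return (2.0 * x, y)

points = sample_circle(8)             # all we ever see are samples
mapped = [layer(p) for p in points]   # the same samples after one transformation

# Check which curve each set of samples lies on.
on_circle = all(abs(x * x + y * y - 1.0) < 1e-9 for x, y in points)
still_on_circle = all(abs(x * x + y * y - 1.0) < 1e-9 for x, y in mapped)
on_ellipse = all(abs((x / 2.0) ** 2 + y * y - 1.0) < 1e-9 for x, y in mapped)
print(on_circle, still_on_circle, on_ellipse)  # True False True
```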

by CuriouslyC, 4 hours ago

Also worth noting that in reality manifolds will be "spiky" in very high dimensions, so the idea of a "surface" is best understood through patterns of distance between samples in embedding space and the way they collapse in low dimensions.
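One well-known effect behind this kind of intuition is that pairwise distances between random points become nearly uniform as the dimension grows. A rough sketch, assuming points drawn uniformly from a unit cube (an arbitrary choice; real embeddings are not distributed like this) and a hypothetical helper `relative_spread`:

```python
import math
import random

def relative_spread(dim, n_points=40, seed=0):
    """Std/mean of all pairwise Euclidean distances among random points."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    mean = sum(dists) / len(dists)
    var = sum((d - mean) ** 2 for d in dists) / len(dists)
    return math.sqrt(var) / mean

low = relative_spread(2)      # distances vary a lot relative to their mean
high = relative_spread(1000)  # distances are almost all the same
print(low, high)
```

In the high-dimensional case the spread is a small fraction of the mean distance, which is part of why low-dimensional pictures of a smooth "surface" can mislead.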