What if you still have to obtain the best result possible for a given coefficient/tokenization budget?
I think my comment expresses the general case, while yours provides some exceptions.
>What if you need to reduce number of layers
Delete some.
> and/or width of hidden layers?
Randomly drop x% of the parameters. No doubt there are better methods involving distillation, but this works.
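To make the two crude options above concrete, here is a rough sketch in NumPy. It assumes the network is just a list of same-shaped weight matrices; `drop_layers`, `random_prune`, and the `keep_every` parameter are names made up for illustration, not anything from a real library. Note that zeroing random entries cuts the effective parameter count but leaves the matrix shapes alone; actually narrowing a layer would mean dropping whole rows or columns instead.

```python
import numpy as np

def drop_layers(layers, keep_every=2):
    """Hypothetical helper: keep every `keep_every`-th layer, delete the rest.

    Assumes all layers have the same shape, so the survivors still chain together.
    """
    return layers[::keep_every]

def random_prune(weights, fraction, rng):
    """Hypothetical helper: zero out a random `fraction` of the entries."""
    flat = weights.flatten()  # flatten() returns a copy, so the input is untouched
    n_drop = int(round(fraction * flat.size))
    # Pick exactly n_drop distinct positions to zero.
    idx = rng.choice(flat.size, size=n_drop, replace=False)
    flat[idx] = 0.0
    return flat.reshape(weights.shape)

rng = np.random.default_rng(0)
# Toy stand-in for a trained network: six 8x8 weight matrices.
layers = [rng.standard_normal((8, 8)) for _ in range(6)]

shallower = drop_layers(layers)                          # 6 layers -> 3 layers
pruned = [random_prune(w, 0.25, rng) for w in shallower] # drop 25% of each matrix
```

In practice you would then keep training the smaller network for a while to recover some of the lost quality.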
> would the process of "layers to add" selection be considered training?
Er, no?
> What if you still have to obtain the best result possible for a given coefficient/tokenization budget?
We don't know how to get "the best result possible", or even how to define such a thing. We only know how to throw compute at an existing network to get a "better" network, with diminishing returns. Re-using existing weights lowers the amount of compute you need to get to level X.