You need to train independently and merge rarely. The problem is the merge step: the weights are too entangled, so you are not going to get an improvement commensurate with the effort. Otherwise, everyone would do it. It is an open research problem.
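
To make the entanglement point concrete, here is a minimal sketch of one well-known reason naive merging can fail: hidden units have no canonical order, so two networks can compute the same function with differently arranged weights, and averaging them element-wise breaks both. Everything here is an illustrative assumption (the tiny `mlp` helper, the layer sizes, and random weights standing in for trained ones); it is not a claim about how any particular merge scheme works.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, b1, w2):
    """Tiny one-hidden-layer network with a ReLU activation (hypothetical helper)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2

# Model A: random weights standing in for an independently trained network.
w1_a = rng.normal(size=(4, 8))
b1_a = rng.normal(size=8)
w2_a = rng.normal(size=(8, 1))

# Model B: exactly the same function as A, but with the hidden units reordered.
perm = np.roll(np.arange(8), 1)
w1_b, b1_b, w2_b = w1_a[:, perm], b1_a[perm], w2_a[perm, :]

x = rng.normal(size=(16, 4))
out_a = mlp(x, w1_a, b1_a, w2_a)
out_b = mlp(x, w1_b, b1_b, w2_b)

# Naive merge: element-wise average of the two weight sets.
out_m = mlp(x, (w1_a + w1_b) / 2, (b1_a + b1_b) / 2, (w2_a + w2_b) / 2)

same_function = np.allclose(out_a, out_b)            # A and B agree on every input
merge_gap = float(np.max(np.abs(out_a - out_m)))     # the averaged model does not

print("A and B compute the same function:", same_function)
print("max |A - merged| on the sample inputs:", merge_gap)
```

Here the two models are identical as functions, yet their average is a different model. Independently trained networks differ by far more than a permutation, which is part of why the merge step is hard.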
That sounds like the way. Everyone trains on their own small problems to maximally compressed weights and then merges.