I don't see how this is an architectural problem though. The problem is that music datasets are highly multimodal, and the training process is relying almost entirely on this dataset instead of incorporating basic musical knowledge to allow it to explore a bit further. That's what happens when computer scientists aim to "upset" a field without consulting with experts in said field.
reply