CNNs excel at vision tasks where you have limited compute, limited memory, and limited data, and want something that works well and runs fast. People also don't usually hook CNNs up to a transformer to get language understanding; you have to train a bespoke CNN for each specific task.
ViTs excel where you're unbounded in compute and data, and also want text understanding or want to have a conversation about an image.
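
To put rough numbers on the compute side of this, here's a back-of-the-envelope sketch, not a benchmark. It counts multiply-accumulates for one 3x3 conv layer versus one self-attention layer as the input image gets bigger. The function names and the sizes (64 channels, 768-dim tokens, 16-pixel patches) are my own illustrative assumptions, and it ignores the MLP blocks, multiple heads, and everything else in a real model. It also says nothing about the data-efficiency side.

```python
def conv_macs(height, width, channels, kernel=3):
    """Multiply-accumulates for one conv layer (stride 1, same padding,
    equal input and output channels). Grows linearly with pixel count."""
    return height * width * channels * channels * kernel * kernel


def attention_macs(height, width, dim, patch=16):
    """Multiply-accumulates for one self-attention layer over patch tokens.
    The projection term is linear in token count, the score term quadratic."""
    tokens = (height // patch) * (width // patch)
    projections = 4 * tokens * dim * dim   # Q, K, V and output projections
    scores = 2 * tokens * tokens * dim     # QK^T plus the weighted sum over V
    return projections + scores


# Illustrative sizes only (assumed, not taken from any particular model).
for side in (224, 448, 896):
    c = conv_macs(side, side, channels=64)
    a = attention_macs(side, side, dim=768)
    print(f"{side}x{side}: conv {c:,} MACs, attention {a:,} MACs")
```

Doubling the image side multiplies the conv cost by 4, but the attention cost by more than 4, because the token-to-token score term grows with the square of the token count. That's one reason attention gets expensive at high resolution.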