To do this lookup in the first place, you pull a number of bits from the address you're looking up, which tells you which bucket to start at. If you want to start indexing with virtual-address bits before translation finishes (which is the whole point for L1 latency), those bits have to be ones that are identical in the virtual and physical address, i.e. the page-offset bits -- so the minimum page size determines how many bits you can safely use to refer to unique buckets. If you don't have a lot of bits, you can't count very high (6 bits = 2^6 = 64 buckets), so to increase the size of the cache you have to increase the associativity instead, which makes latency worse. For L1 cache you basically never want to make latency worse, so you are practically capped here.
Platforms like Apple Silicon instead set the minimum page size to 16KB, so you get more bits to count buckets (8 bits = 256 buckets). Thus you can increase the size of the cache while keeping associativity low; L1 cache on Apple Silicon is something crazy like 192KB, and L2 (for the same reasons) is 16MB+. x86 machines and software, for legacy reasons, are very much tied to a 4KB page size, which puts something of a practical limit on the size of their downstream caches.
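To put rough numbers on that (the 64-byte line size and the way counts below are my assumptions chosen to make the arithmetic line up, not published specs), here's a quick sketch of the "set index has to fit in the page offset" math:

    #include <stdio.h>

    /* Back-of-the-envelope VIPT sizing: the set index must come out of the
       page-offset bits, so the max cache size without aliasing tricks is
       page_size * ways. Assumes 64-byte cache lines. */
    static void vipt_limit(const char *name, unsigned page_size, unsigned ways) {
        const unsigned line_size = 64;
        unsigned sets = page_size / line_size;        /* buckets addressable from the page offset */
        unsigned max_bytes = sets * line_size * ways; /* == page_size * ways */
        printf("%s: %u sets, %u-way -> %u KB max\n", name, sets, ways, max_bytes / 1024);
    }

    int main(void) {
        vipt_limit("4 KB pages, 8-way (typical x86 L1)", 4096, 8);    /* 64 sets  -> 32 KB  */
        vipt_limit("16 KB pages, 12-way (M1-ish L1)", 16384, 12);     /* 256 sets -> 192 KB */
        return 0;
    }

which is exactly the gap between the ~32KB x86 L1s and the 192KB figure above.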
Look up "Virtually Indexed, Physically Tagged" (VIPT) caches for more info if you want it.
OTOH, L1 sizes haven't increased since my first processor, so those CPU designers probably know more than I do.
Divide the cache into "meta-caches" indexed by the virtual bits and treat them as separate caches from the L2's point of view. Duplicate the data, and if somebody writes back, invalidate all the other copies. The hardware for doing this already exists on any multicore system. Sure, you will end up duplicating data sometimes, and it will actually be slower if you're writing to aliased locations. But does that happen often enough to be a problem compared to generally having a bigger cache?
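Roughly what I'm picturing, as a toy software model (the four "colors", the sizes, and all the names are made up for illustration; real hardware would fold this into the invalidate machinery coherence already needs):

    #include <stdio.h>
    #include <stdbool.h>

    /* Toy model of "duplicate per virtual color, invalidate on write".
       The 2 virtual index bits above a 4KB page offset give 4 "colors";
       the same physical line may be cached once per color, like separate
       mini-caches. A write through one color invalidates the other copies. */
    #define COLORS 4

    struct line { bool valid; unsigned long paddr; int data; };
    static struct line cache[COLORS];   /* toy: one line per color */

    static int color_of(unsigned long vaddr) { return (int)((vaddr >> 12) & (COLORS - 1)); }

    static int read_line(unsigned long vaddr, unsigned long paddr) {
        struct line *l = &cache[color_of(vaddr)];
        if (!l->valid || l->paddr != paddr) {            /* miss: fill this color's copy */
            l->valid = true; l->paddr = paddr; l->data = 42;  /* pretend memory fill */
            printf("fill color %d for physical line %#lx\n", color_of(vaddr), paddr);
        }
        return l->data;
    }

    static void write_line(unsigned long vaddr, unsigned long paddr, int v) {
        for (int c = 0; c < COLORS; c++)                 /* invalidate aliased copies */
            if (c != color_of(vaddr) && cache[c].valid && cache[c].paddr == paddr)
                cache[c].valid = false;
        struct line *l = &cache[color_of(vaddr)];
        l->valid = true; l->paddr = paddr; l->data = v;
    }

    int main(void) {
        /* Two virtual mappings of the same physical line, different colors. */
        read_line(0x1000, 0xAA000);      /* fills color 1 */
        read_line(0x7000, 0xAA000);      /* fills color 3: duplicated data */
        write_line(0x1000, 0xAA000, 7);  /* write via color 1 kills color 3's copy */
        printf("aliased copy in color %d still valid? %s\n", color_of(0x7000),
               cache[color_of(0x7000)].valid ? "yes" : "no");
        return 0;
    }

You only pay (an extra fill, plus invalidate traffic) when aliased lines actually get written, which was my point about the tradeoff.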
It sounds to me like an engineering tradeoff that might or might not make sense, not a hard limit, which is at least what I think was being asserted. But as I also said, L1 sizes haven't increased in a while and smart people are working on it, so there is probably something I don't know.
(Of course, it only took a few years for this to be rectified with the byte-word extension, which became required by ~all "real software" that supported Alpha)
It's also one of the few architectures Windows NT supported that didn't have 4KB pages, the other being Itanium. I've wondered how (or whether) it handled programs that expect 4KB pages, especially in the x86 translation subsystem.