undefined

points

[-]

Anthropic is obviously also aware of the benefits of MoE and distilling a larger model into a smaller one, so they could run a model of the same size as Alibaba's for the same inference cost if they want to. Or they can run a slightly larger model for slightly higher cost. They definitely aren't running a much larger model (except potentially as a teacher for distillation training) because then they wouldn't be able to hit the output speeds they're hitting.