If you're going to go with linguistic self-report and a single item, you really want something like an 11-point Likert scale. A smarter design might collect each person's rating of "blue-ness vs. green-ness" on that 11-point scale, then determine the optimal cutpoint via clustering, logistic regression, or some other method, to get at something actually meaningful.
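A minimal sketch of that cutpoint idea, with made-up ratings and a hand-rolled 1-D logistic regression (a real analysis would use a proper stats library and real data):

```python
import math

# Hypothetical data: each rating is a 0-10 "green-ness vs. blue-ness" score,
# each label is what that rater actually called the swatch (1 = "blue", 0 = "green").
ratings = [0, 1, 1, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8, 8, 9, 10, 10]
labels  = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

# Fit p(blue | x) = sigmoid(w*x + b) by plain gradient descent on the log loss.
w, b = 0.0, 0.0
for _ in range(20000):
    gw = gb = 0.0
    for x, y in zip(ratings, labels):
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        gw += (p - y) * x
        gb += (p - y)
    w -= 0.05 * gw / len(ratings)
    b -= 0.05 * gb / len(ratings)

# The fitted cutpoint is where p = 0.5, i.e. where w*x + b = 0.
cutpoint = -b / w
print(f"blue/green cutpoint on the 11-point scale: {cutpoint:.2f}")
```

With toy data like this the cutpoint lands where the labels flip (around the middle of the scale); the interesting part with real subjects would be comparing where different people's cutpoints sit.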
IMO, unless you lived under a rock growing up, you will have experienced different people pointing at the same colour and uttering very different colour labels (pink vs. red, blue vs. green, black vs. deep blue/purple, etc.) from the ones you would have applied yourself. Differing/shared colour perception isn't exactly a rare topic (it's almost the canonical stoner topic, and it's common online too), so I'd be surprised if this demo is actually introducing anyone to the concept. Any excitement is surely about other implications people think the demo has.
But unfortunately there are no interesting implications in what this site shows. Yes, it demonstrates the mundane fact that different people assign different colour labels to the same physical stimulus (and even that falsely assumes everyone's monitor renders the stimulus identically), but if you didn't already know this... I'm not sure what social context you could possibly have grown up in.
We are usually not specific in our day-to-day colour language, and the demo simply exposes/clarifies that.
And you would get some number of people arguing that "several" is a distinct category, in the same way this post has people talking about cyan.