They are using the “gold standard for the evaluation of expert medical computing systems,” not a proxy for what a doctor actually does when diagnosing someone.
It may have some utility after diagnosis, but this test doesn’t demonstrate utility for patients.