So there are absolutely a bunch of tasks that could be evaled/benchmarked, but "hallucination rate" isn't a particularly applicable or interesting metric of how good the tool is.
That said, we do use various LLMs (mostly local, fine-tuned, small ones for things like NER, parsing, and metadata comparison). They can and do hallucinate, but we put very hard constraints on validation: for example, any extraction results that don't match 1:1 back to the input text are discarded. So again, rather than managing hallucination risk, we prefer hard constraints.
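A minimal sketch of what that kind of hard-constraint validation can look like, assuming extractions come back as a list of strings (the function name and data shapes here are hypothetical, not our actual pipeline):

```python
def validate_extractions(source_text: str, extractions: list[str]) -> list[str]:
    """Keep only extracted spans that match 1:1 back to the input text.

    A hallucinated span, by definition, doesn't appear verbatim in the
    source, so it can never survive this filter.
    """
    return [span for span in extractions if span in source_text]


text = "Acme Corp acquired Widgets Ltd in 2021."
llm_output = ["Acme Corp", "Widgets Ltd", "Gadgets GmbH"]  # last one hallucinated
print(validate_extractions(text, llm_output))
# -> ['Acme Corp', 'Widgets Ltd']
```

The point is that the LLM's output is treated as a proposal, not a result: anything that can't be verified mechanically against the input is simply dropped.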