This being said, here are my top recommendations:
1. Build your system against real targets. Had HackerRank continually tested their prompt against 2-3 real resumes that had been scored by hand, I think some of the issues would have popped out immediately. The people who built the prompt thought they could magically skip the hard part, articulating a preferred decision-making process, by having the LLM do it. But LLMs are much better at scaling a pre-existing decision-making process than at inventing one from scratch, let alone inventing the same one every time.
2. Think about what it would take to get motivated undergraduate interns to do the task from end to end, step by step. That's essentially what your workflow will need.
3. If the LLM can't do a step or subtask reliably, then it's time to decompose that step into even smaller chunks.
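
To make the first recommendation concrete, here's a minimal sketch of what "testing against hand-scored resumes" could look like. Everything in it is made up for illustration: `GoldenCase`, `evaluate`, the sample resumes and their scores are hypothetical, and `stub_scorer` is a keyword counter standing in for whatever prompt + LLM call you actually use, so the sketch runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenCase:
    name: str
    resume_text: str
    hand_score: int  # score assigned by a human reviewer (hypothetical data)


def evaluate(
    score_fn: Callable[[str], int],
    cases: List[GoldenCase],
    tolerance: int = 1,
) -> List[str]:
    """Return names of cases where the scorer misses the hand score by more than `tolerance`."""
    failures = []
    for case in cases:
        predicted = score_fn(case.resume_text)
        if abs(predicted - case.hand_score) > tolerance:
            failures.append(case.name)
    return failures


def stub_scorer(resume_text: str) -> int:
    # Placeholder for the real prompt + LLM call (an assumption, not a real API):
    # it just counts keyword hits so the example is runnable without a model.
    keywords = ("python", "sql", "led")
    text = resume_text.lower()
    return sum(1 for keyword in keywords if keyword in text)


# Hypothetical hand-scored resumes; in practice these would be real ones.
GOLDEN = [
    GoldenCase("strong", "Led a team; built Python and SQL pipelines.", 3),
    GoldenCase("weak", "Enjoys hiking.", 0),
    GoldenCase("mixed", "Wrote SQL reports.", 3),
]

if __name__ == "__main__":
    # Re-run this every time the prompt changes and look at what disagrees.
    print(evaluate(stub_scorer, GOLDEN))
```

Swap `stub_scorer` for a function that calls your actual prompt, and the list of failing case names tells you where the LLM's decision process diverges from the one you wrote down by hand.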
I'm sorry I can't be more helpful!
I recommend reading Hamel.dev posts. Here’s an example: https://hamel.dev/blog/posts/evals/