You'd think a model that scores this high on SWE-bench could easily maximize F1 on a precision-recall problem. What's the missing piece?
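In this case, being distilled is sort of an existential threat to them, while a false positive just costs them some revenue (and, depending on whether that revenue is profitable, not even any profit). So F1 isn't really the target: a miss costs them far more than a false alarm.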