I would also consider - perhaps as a separate pass, with scoring set differently - to analyze comments (especially docstrings in Python). If I read the code correctly, you're currently just stripping them, which is the right thing to do when looking for code duplication, but duplicated docstrings are also often a signal that something is wrong in the codebase. The "different scoring" is because we expect docstring to be structured similarly (at least more than normal code), so some tweaking would be needed.
Finally: very nice project, congrats! :)
[1] https://github.com/rafal-qa/slopo/blob/main/src/slopo/indexi...