For duplicate detection I am using DBSCAN
https://scikit-learn.org/stable/modules/generated/sklearn.cl...
and found some parameters where I get almost no false positives, but a lot of duplicates get missed. When I lowered the threshold to make clusters, I started getting false positives fast. I don't find duplicates to be a big problem in my system, with the 110 feeds I have and the subjects I am interested in, but insofar as they are a problem there tend to be structured relationships between articles. For example, site A syndicates articles from site B, but for some reason articles from site A usually get selected and site B articles don't. Or an article from site A links to one or more other articles, often from sites I don't have a feed for, and it would be nice if the system looked at the whole constellation. Stuff like that.
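For anyone following along, here is a minimal sketch of what DBSCAN-based duplicate detection could look like. It is not the code from my system: the TF-IDF features, the cosine metric, the `eps=0.3` value, and the `find_duplicate_groups` name are all assumptions chosen for illustration. `eps` is the threshold that trades missed duplicates against false positives.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer


def find_duplicate_groups(texts, eps=0.3, min_samples=2):
    """Return lists of indices into `texts` that DBSCAN puts in one cluster."""
    # Assumption: plain TF-IDF vectors; an embedding model could be swapped in.
    vectors = TfidfVectorizer().fit_transform(texts)

    # eps is the maximum cosine distance at which two articles count as neighbors.
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(vectors)

    groups = {}
    for index, label in enumerate(labels):
        if label == -1:  # -1 marks noise, i.e. an article with no near-duplicate
            continue
        groups.setdefault(int(label), []).append(index)
    return list(groups.values())


# Made-up headlines: the first two are near-duplicates, the rest are unrelated.
example_titles = [
    "apple unveils faster laptop chip today",
    "apple unveils faster laptop chip today update",
    "council approves road repair budget",
    "scientists discover deep sea fish",
]
duplicate_groups = find_duplicate_groups(example_titles)
```

A smaller `eps` keeps only very close matches (few false positives, more misses), and a larger one merges more aggressively, which is the tradeoff described above.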
Effective clustering is the really interesting technology Google News has had for a long time.
Edit: I just looked around for your YOShInOn RSS reader code and couldn't find it. I did find a number of references to it that you appear to have made on various forums, etc., over the years.
You mean the k-means for diversity or DBSCAN for duplicates? Either way it is about 10 lines of scikit-learn code. Send me an email.
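To give a rough idea of the k-means side, here is a sketch of one common way to use clustering for diversity: cluster the candidates, then keep the best-scoring article from each cluster. This is an assumption about the general technique, not my actual code. The `pick_diverse` name, the TF-IDF features, and the `scores` input (some relevance score per article) are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def pick_diverse(texts, scores, k):
    """Cluster texts into k groups and return the top-scoring index from each."""
    vectors = TfidfVectorizer().fit_transform(texts)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)

    scores = np.asarray(scores)
    picks = []
    for cluster in range(k):
        members = np.flatnonzero(labels == cluster)
        if members.size:
            # Keep only the highest-scoring article in this cluster.
            picks.append(int(members[np.argmax(scores[members])]))
    return sorted(picks)


# Made-up articles on three topics, two per topic, with made-up scores.
example_articles = [
    "python rust compiler release notes",
    "python rust compiler release benchmarks",
    "football league final match report",
    "football league final match highlights",
    "galaxy telescope image survey data",
    "galaxy telescope image survey results",
]
example_scores = [0.2, 0.9, 0.8, 0.1, 0.3, 0.7]
diverse_picks = pick_diverse(example_articles, example_scores, k=3)
```

The point is that the selection ends up with one article per topic instead of several top-scoring articles on the same topic.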
Nuzzel did something similar for Twitter but shut down (https://daringfireball.net/linked/2021/05/05/nuzzel).
That would be a good addition to feed readers, especially for news feeds.
You specify your interests as free form text, it ranks articles by how closely they match, and you can consume your Scour feed as an RSS feed to read it in NNW.
Disclaimer: I’m the developer
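The general idea of ranking articles against free-form interest text can be sketched roughly as below. This thread doesn't say how Scour actually does its matching, so TF-IDF plus cosine similarity is a stand-in assumption, and `rank_by_interest` is a hypothetical name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_by_interest(interest, articles):
    """Return (article_index, similarity) pairs, best match first."""
    vectorizer = TfidfVectorizer(stop_words="english")
    article_vectors = vectorizer.fit_transform(articles)
    # Project the interest text into the same vocabulary as the articles.
    interest_vector = vectorizer.transform([interest])

    similarities = cosine_similarity(interest_vector, article_vectors)[0]
    order = sorted(range(len(articles)), key=lambda i: similarities[i], reverse=True)
    return [(i, float(similarities[i])) for i in order]


# Made-up interest text and article titles.
example_feed = [
    "rust compiler performance improvements",
    "city council road budget vote",
    "new rust async runtime released",
]
ranked = rank_by_interest("rust compiler internals", example_feed)
```

The ranked list could then be written out as an RSS feed for a reader to consume.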