My YOShInOn RSS reader uses an SBERT model for classification (will I upvote this or not?) and large-scale clustering (20 k-means clusters, then show me the top N in each cluster so I get a diverse mix of articles).
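
For readers who want to try the diversity step, here is a minimal sketch of one way it could look in scikit-learn. It is not YOShInOn's actual code: the function name `diverse_top_n`, the 384-dimensional random vectors standing in for SBERT embeddings, and the random `scores` standing in for classifier output are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans


def diverse_top_n(embeddings, scores, n_clusters=20, top_n=3, seed=0):
    """Cluster articles with k-means, then keep the top_n best-scored per cluster.

    Hypothetical helper: `scores` is any per-article relevance score,
    for example the predicted probability of an upvote.
    """
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(embeddings)

    picks = {}
    for cluster_id in range(n_clusters):
        members = np.flatnonzero(labels == cluster_id)
        # Sort this cluster's members by score, highest first.
        ranked = members[np.argsort(scores[members])[::-1]]
        picks[cluster_id] = ranked[:top_n].tolist()
    return picks


# Stand-in data (assumption): random vectors instead of real SBERT
# embeddings, random numbers instead of real classifier scores.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 384))
scores = rng.random(500)

picks = diverse_top_n(embeddings, scores)
for cluster_id, article_ids in picks.items():
    print(cluster_id, article_ids)
```

Taking the top N per cluster instead of the top N overall is what forces variety: a cluster of highly scored articles on one topic cannot crowd out everything else.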

For duplicate detection I am using DBSCAN

https://scikit-learn.org/stable/modules/generated/sklearn.cl...

and found some parameters where I get almost no false positives, but a lot of duplicates get missed. When I lowered the threshold to make more clusters, I started getting false positives fast. I don't find duplicates are a big problem in my system with the 110 feeds I have and the subjects I am interested in, but insofar as they are a problem, there tend to be structured relationships between articles. For example, site A syndicates articles from site B, but for some reason articles from site A usually get selected and site B articles don't. Or an article from site A links to one or more other articles, often from sites I don't have a feed for, and it would be nice if the system looked at the whole constellation. Stuff like that.
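
A minimal sketch of DBSCAN-based duplicate detection over embeddings, for anyone who wants to experiment with the trade-off described above. The parameters here (`eps=0.05`, `min_samples=2`, cosine distance) are illustrative assumptions, not the values used in YOShInOn, and the data is synthetic: random vectors with two planted near-duplicates.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def duplicate_groups(embeddings, eps=0.05, min_samples=2):
    """Return lists of article indices that DBSCAN groups together.

    Hypothetical helper. `eps` is the maximum cosine distance between
    neighbours; a smaller value is stricter (fewer false positives,
    more missed duplicates), a larger value is looser.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(embeddings)

    groups = {}
    for idx, label in enumerate(labels):
        if label != -1:  # -1 marks noise, i.e. articles with no duplicate
            groups.setdefault(int(label), []).append(idx)
    return list(groups.values())


# Synthetic stand-in for SBERT embeddings (assumption): 50 unrelated
# articles, plus slightly perturbed copies of articles 0 and 1.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 384))
near_copies = base[:2] + 0.01 * rng.normal(size=(2, 384))
embeddings = np.vstack([base, near_copies])

groups = duplicate_groups(embeddings)
print(groups)
```

With `min_samples=2` a single pair of close articles is enough to form a group, so `eps` is the only knob that moves between "misses duplicates" and "false positives".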

Effective clustering is the really interesting technology Google News has had for a long time.

I have been attempting this exact sort of clustering solution for a few years now (on and off as a side project). Do you have source code available, or more detailed explanations/resources of how to approach this?

Edit: I just looked around for your YOShInOn RSS reader code and couldn't find it. I did find a number of references you appear to have made to it on various forums, etc. over the years.

The technical report on YOShInOn is about 2 years overdue!

You mean the k-means for diversity or DBSCAN for duplicates? Either way it is about 10 lines of scikit-learn code. Send me an email.

Both. Just sent an email. Thanks!

That was partially the original promise of Fever, which is the API many RSS services still support and that somehow lives on.

Nuzzel did something similar for Twitter but shut down (https://daringfireball.net/linked/2021/05/05/nuzzel).

That would be a good addition to feed readers, especially for news feeds.

You should try Scour (https://scour.ing)!

You specify your interests as free form text, it ranks articles by how closely they match, and you can consume your Scour feed as an RSS feed to read it in NNW.

Disclaimer: I’m the developer

I haven't used it much but I think Iconfactory's Tapestry[0] does some of this.

[0]: https://usetapestry.com/
