Do you need to take it to a million in the same place? Is that still "small"?
Why not have 2000 hand-curated directories instead?
For example, I have several non-commercial, personal websites that I think anyone would agree are "small web", but each of them fails the Kagi inclusion criteria for a different reason. One is not a blog, another is a blog but with the wrong cadence of posts, etc.
1) The requirement that it needs to be a blog. There are plenty of small-web sites by people who obsess over really wonderful and wacky stuff (e.g., https://www.fleacircus.co.uk/History.htm) that don't qualify here.
2) The requirement that it needs to be updated regularly. Same as above - I get that infrequently updated websites don't generate a "daily morning" feed, but admitting them wouldn't do any harm.
3) Blanket ban on Substack-like platforms while allowing Blogspot, Wordpress.com, YouTube, etc. Bloggers follow trends, so you're effectively excluding a significant proportion of personal blogs created in the last six years, including the stuff that isn't monetized or behind interstitials. The outcomes are pretty weird: for example, noahpinionblog.blogspot.com is on your list, but noahpinion.blog is apparently no longer small web.
2) 'Regularly' just means a site needs at least one post in the last 2 years to be included
3) Substack has an annoying subscribe popup, and ads/popups are against the spirit of what this represents
So a similarity-based graph/network of webpages should cluster good with good, bad with bad. That is what I've seen so far, anyway.
With that, you just need to enter the graph in the right place, something that is fairly trivial.
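To make that concrete, here's a rough sketch of the idea, not anything Kagi actually runs. The assumptions are all mine: each site is reduced to a set of features (say, outbound link domains or topic tags), similarity is plain Jaccard overlap, the 0.3 threshold is arbitrary, and the site names and features are made up. You build the graph once, then walk outward from one site you already trust.

```python
from collections import deque


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_graph(features, threshold=0.3):
    """Connect every pair of sites whose feature sets are similar enough."""
    sites = sorted(features)
    graph = {s: set() for s in sites}
    for i, a in enumerate(sites):
        for b in sites[i + 1:]:
            if jaccard(features[a], features[b]) >= threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def explore(graph, seed, max_hops=2):
    """Breadth-first walk from a known-good seed, at most max_hops away."""
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbour in graph[node]:
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return sorted(s for s in hops if s != seed)


# Made-up toy data: site -> hypothetical feature set.
features = {
    "flea-circus.example": {"history", "hobby", "handmade"},
    "model-trains.example": {"hobby", "handmade", "photos"},
    "knot-tying.example": {"hobby", "photos", "handmade", "diagrams"},
    "seo-farm.example": {"ads", "affiliate", "popups"},
    "listicles.example": {"ads", "popups", "affiliate", "tracking"},
}

graph = build_graph(features)
print(explore(graph, "flea-circus.example"))
```

In this toy data the hobby sites only link to each other and the ad-heavy ones only link to each other, so starting from one good seed never reaches the junk cluster. Real data would be far messier, and the choice of features and threshold would matter a lot.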