Crawling is much more difficult than it used to be. Significantly more content is behind a login, Javascript is required for way more than it should be, and almost the entire web is behind cloudflare or another type of captcha.
reply
These things are actually fairly small problems.

The parts that absolutely require JS can't be reliably linked to, and nobody indexes that stuff anyway. Most apparent SPAs serve an HTML alternative if you don't claim to be a web browser in the User-Agent header.
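A minimal sketch of that idea, using only the standard library; the crawler name and URLs here are made up for illustration, not from the comment:

```python
# Build a request that identifies honestly as a crawler instead of
# spoofing a browser User-Agent; some SPA-heavy sites respond to
# non-browser UAs with a plain-HTML fallback.
import urllib.request

def crawler_request(url: str) -> urllib.request.Request:
    # Hypothetical crawler identity; real crawlers usually include a
    # contact URL so site operators can reach the bot owner.
    return urllib.request.Request(
        url,
        headers={"User-Agent": "ExampleCrawler/1.0 (+https://example.com/bot)"},
    )

req = crawler_request("https://example.com/article")
print(req.get_header("User-agent"))
```

Whether a given site actually serves the fallback still has to be checked per site; this only shows the request side.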

Cloudflare and the like are also fairly easy to deal with as long as your crawler is well behaved. You can register your crawler's fingerprint and mostly get access to Cloudflare-fronted websites.
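One concrete piece of "well behaved" is honoring robots.txt before fetching anything. A sketch with the standard library, using made-up rules parsed inline rather than fetched from a real site:

```python
# Parse illustrative robots.txt rules and check paths against them.
from urllib import robotparser

rules = [
    "User-agent: *",
    "Crawl-delay: 10",
    "Disallow: /private/",
]
rp = robotparser.RobotFileParser()
rp.parse(rules)

# Public pages are allowed, the disallowed prefix is not.
print(rp.can_fetch("ExampleCrawler", "https://example.com/index.html"))  # True
print(rp.can_fetch("ExampleCrawler", "https://example.com/private/a"))   # False
```

A real crawler would call `rp.set_url(...)` and `rp.read()` to fetch each site's robots.txt, and respect the crawl delay between requests.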

reply
I think there are two factors that helped Google. First, the search engine landscape back then was absolutely abysmal. I'm sure someone will chime in saying that it's abysmal today as well, but the reality is that 99%+ of consumer searches get good results today. And that's simply because the nature of search has changed: we have billions of people using the internet, and they overwhelmingly just search for products to buy, local restaurants that offer takeout, or for familiar pop content to watch or listen to. And there's some SEO spam there, but also pretty fierce quality assurance by search engines.

Second, the internet was different: when all nerds declared that Google is good, that was CNN-grade newsworthy (and CNN used to matter a lot more back then), simply because the internet seemed kinda important, but there was no other authority on the topic. Today, that's not the case. If you need someone to opine on the internet on air, you invite some political pundit or a business analyst.

So no, I don't think you can repeat the success of Google the same way. It was a product of its time.

reply
Google Maps is probably a big moat that's very hard to replicate. You can't just crawl all of that data, and generating directions is not easy either. The average user doesn't want to use your search engine for one thing and Google for everything else; they just want a one-stop shop for search.
reply
We have Marginalia which serves a specific use-case: https://about.marginalia-search.com/
reply
That's what I was expecting this submission to be about, although to be honest I'm not certain that Marginalia would want the influx of tire-kickers a Fast Company-sized audience would bring.
reply
To be fair I'm on a colocated server now. No more apartment hosting for me.
reply
More to the point, it's a shame that we can't collectively grok (dammit, they took that from us too) concepts like "personal" and/or "curated" directories, e.g. individual and group wikis and so forth on perhaps more directed topics with lists of good links.
reply
Other than the obvious (but surmountable) technical challenges with crawling and indexing, trying to establish "goodness" for a given user is tough. For a blogger it will be "hey, you are reading this so you probably like what I like". That's often true but as soon as you try to have a centralized service with arbitrary users, it is hard to do anything better than filtering purely commercial content.
reply
What do you mean we can't? There are a lot of curated content directories out there.
reply
Right, I suppose I mean "getting more people to think about why a few of these bookmarked for your favorite topics, especially tied to a trustworthy person, is a million times better than just hitting up Google."

Or, perhaps, "a better Google should just take you to these."

Something like that.

reply
Among other things, I think crawling is a lot harder now.
reply
Google basically invented the modern cloud in order to efficiently use the hardware necessary to actually build those search engine indices. It's not really a question of implementing a good algorithm and away we go.
reply
Provided they have the kind of massive government support Google has had from the get-go, sure!
reply
The actual underlying problem has changed altogether. Pagerank is easily gamed by SEO.

Search candidates and rankings now require assessment by an LLM. Moreover, by default, users want the results intelligently synthesized into a text response with references rather than presented as raw results.

Crawling too requires innovative approaches to bypass server filters.

I doubt any independent person can afford to run a vector database or LLMs at immense scale.

reply
> users want the results intelligently synthesized into a text response with references rather than as raw results.

The reason I pay for Kagi is that I specifically don't want this to occur.

reply
If you pay for a service (web search) that 99.9% use for free, you're an extreme outlier, and not necessarily a justifiable one either. After all, DDG, Google and various others still have raw results for free.
reply
How much do you technologically relate to the average person on the street though?

Every person I have seen (outside the tiny tech bubble) google something has just read the AI overview without skipping a beat.

reply
That's worrisome since I've seen those be for-sure wrong a pretty high percentage of the time.

[EDIT] Incidentally, are there any sites that do actual web search any more, better than Yandex? I'd rather avoid a Russian site if I can, but there are whole topics where it's impossible to find anything useful on heavily "massaged" allegedly-web-search-but-not-really sites like Google and DDG (Bing), yet I can find what I want on page 1 or 2 of a Yandex search. Is Kagi as good as that, or is their index simply ignoring a whole bunch of the web like so many others? I don't mind paying.

reply
Google "Web" results (not the default results you get when you search) still seem okay for me. You can force them with the udm=14 url trick, or select the "Web" tab in the results. No AI, no images or shopping results, and slightly better text results.
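The udm=14 trick above amounts to appending one query parameter to the search URL. A small sketch of constructing such a URL; the helper name is made up:

```python
# Build a Google search URL that requests the plain "Web" results view
# via the udm=14 parameter mentioned in the comment above.
from urllib.parse import urlencode

def google_web_url(query: str) -> str:
    return "https://www.google.com/search?" + urlencode({"q": query, "udm": "14"})

print(google_web_url("marginalia search"))
# https://www.google.com/search?q=marginalia+search&udm=14
```

Some browsers also let you register this pattern as a custom search engine so every address-bar search skips the AI overview.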
reply
Yep, same here. Ask it "should I wash venison tenderloin" and you get an initial "No, because..." followed by a general "yes, it's important to clean it, including with water" in the longer description. Wow, a self-contradictory answer! Good job!
reply
We’re being force fed them. I’m an AI hater and I catch myself reading those sometimes.

Yes, people want the answer directly. Google wants you to stay on their site to read some mishmash. I think the ideal would be to immediately go to the source’s site.

reply
At this point the web is also so centralized you only need 3 bookmarks these days (your news, youtube and Amazon)

Search is just a way of learning what you don't know, and AI does a better job of that for me than search ever has, and I'm in tech.

reply
> users want the results intelligently synthesized into a text response with references rather than as raw results

This leads directly to another big change.

People used to submit their sites to search engines and now they might actively block search engines. So a search engine author might have to spend a lot of effort in adversarial games.

reply
>Pagerank

Also, a lot of site owners are reluctant to link out, so much so that 'nofollow' has been reduced to a hint rather than a directive.

reply
> Moreover, as a default, users want the results intelligently synthesized into a text response with references rather than as raw results.

Citation needed

reply
You mean all the users of chat services aren't evidence? Chat services increasingly include web links as references in their responses, because that's what users are asking for. The tide continues to shift from traditional search to LLM synthesis.
reply
I suspect there are more users of traditional search than there are of llm chat apps.
reply
I suspect that chat apps dominate (80+%?) the under-20 demographic, and have a sizable chunk of the under-30 demographic. Within the next five years it will probably represent 50+% of total search traffic. Maybe it already does. It makes sense that any search site that wants to be in the game tomorrow would keep racing down the AI chat path.
reply