In September he burned through $3,000 in API credits, but I think that was before we finally bought Max plans for everyone who wanted one.
Now you can install it easily with `pixi`.
It's a bit surprising to me that this doesn't appear to have a mechanism to say "filter for just documents matching terms X and Y, then sort by BM25 relevance" - it looks like this extension currently handles just the BM25 ranking but not the FTS filtering. Are you planning to address that in the future?
I found this example in the README quite confusing:
    SELECT * FROM documents
    WHERE content <@> to_bm25query('search terms', 'docs_idx') < -5.0
    ORDER BY content <@> 'search terms'
    LIMIT 10;
That -5.0 is a magic number which, based on my understanding of BM25, is difficult to predict in advance, since the threshold you'd want varies across datasets.

Re filtering: there are often reasonable workarounds in the SQL context, which is what led me to deprioritize it for GA. With your example, the workaround is to apply post-filtering to select just the matches containing all desired terms. The ergonomics aren't ideal, since you may have to play with the LIMIT to get enough results back, but it's already a familiar pattern if you're using vector indexes. For very selective conditions, pre-filtering on those conditions and ranking afterwards is also an option for the planner, provided you've created indexes on the columns in question.
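A rough sketch of that post-filtering workaround, reusing the `documents` table and `<@>` operator from the README example above (the oversample size and ILIKE predicates are illustrative assumptions, not part of the extension):

```sql
-- Oversample from the BM25 index, then post-filter for rows
-- that actually contain every required term.
WITH candidates AS (
    SELECT *, content <@> 'search terms' AS score
    FROM documents
    ORDER BY content <@> 'search terms'
    LIMIT 100   -- tune upward until enough rows survive the filter
)
SELECT *
FROM candidates
WHERE content ILIKE '%search%'   -- one predicate per required term
  AND content ILIKE '%terms%'
ORDER BY score
LIMIT 10;
```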
All this is just an argument about priorities for GA. Now that v1.0 is out, we'll get signal about which features to prioritize next.
I think I can also set up partitions for this, but while you're here… I'm very excited to start rolling this out.
P.S. been shipping it for a while https://www.freshports.org/databases/pg_textsearch/ :)
My knowledge on that subject roughly begins and ends with this excellent article, so I'd love to hear how this relates to that.
https://www.anthropic.com/engineering/contextual-retrieval
Especially since what Anthropic describes there is a bit of a Rube Goldberg machine that also involves preprocessing (contextual summarization) and a reranking model, so I was wondering if there are any "good enough" out-of-the-box solutions for it.
It's also a great way to combine fuzzy search with stricter phrase or term matching, as opposed to fiddling with boosts or otherwise trying to merge result sets by hand.
Elastic has a decent explanation of how RRF works.
https://www.elastic.co/docs/reference/elasticsearch/rest-api...
Simple enough that you can just hack this into a select statement. Or do some easy post processing.
My own querylight-ts library implements BM25, vector search, RRF, and more for in-browser search; I've been experimenting with it over the last few weeks. Very effective if you want somewhat more advanced search on your website. Having decent BM25 indexing in PostgreSQL opens up a lot of new possibilities. It already had vector search and trigram support, and of course traditional wildcard matching, normalization functions, etc. BM25 fills in a big missing piece.
There's still value in having your search index separated from your main datastore. What you query is not necessarily what you store. That's why people have ETL pipelines to extract, transform (crucial), and load. Even if your search index is going to be PostgreSQL, you might want to think about how to pump data around and what happens when you change your mind about how you want to query and index your data. Migrating your single source of truth is probably going to be an anti-pattern there.

Honestly, ETL is the one thing I see a lot of companies architect wrong when they consult me on how to improve/fix their search solutions. Classic probing question: "When is the last time you reindexed your data?" If the answer is "a long time ago", they have no effective ETL capability, basically. That's usually the first problem to sort out with clients like that. Even if it's just a separate table in the same DB, how you rebuild it is crucial to experimenting with new querying and indexing strategies.
In SQL Server you just use a case-insensitive collation (which is the default), add a non-clustered index, and call it a day.

In Postgres you need to go above and beyond just for that. It's like the Postgres guys were "nah dog, everybody just uses lowercase; you don't need to worry about people writing john doe as John Doe".
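The "above and beyond" in question, sketched against a hypothetical `users(name)` table (both are standard Postgres features, but neither is the default the way it is in SQL Server):

```sql
-- Option 1: expression index, plus remembering lower() in every query.
CREATE INDEX users_name_lower_idx ON users (lower(name));
SELECT * FROM users WHERE lower(name) = lower('John Doe');

-- Option 2: the citext extension, a case-insensitive text type.
CREATE EXTENSION IF NOT EXISTS citext;
ALTER TABLE users ALTER COLUMN name TYPE citext;
SELECT * FROM users WHERE name = 'john doe';  -- matches 'John Doe'
```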
And don't get me started on storing datetimes with a timezone offset (e.g. “4/2/2007 7:23:57 PM -07:00”). In SQL Server you have datetimeoffset; in Postgres you fuck off :-)
`TIMESTAMPTZ` / `TIMESTAMP WITH TIME ZONE` exists?
also shoutout to tj for being super responsive on github issues!
What a wonderful ad for ParadeDB, and clear signal that "TigerData" is a pernicious entity.
TigerData:

> pg_textsearch v1.0 is freely available via open source (Postgres license)
They deemed AGPL untenable for their business and decided to create an OSS solution that used a license they were comfortable with and they are somehow "pernicious"? Perhaps take a moment to reflect on your characterization of a group that just contributed an alternative OSS project for a specific task. Not only that, but they used a VERY permissive license. I'd argue that they are being a better OSS community member for selecting a more permissive license.