In September he burned through $3,000 in API credits, but I think that was before we finally bought Max plans for everyone who wanted one.
Now you can install it easily with `pixi`.
It's a bit surprising to me that this doesn't appear to have a mechanism to say "filter for just documents matching terms X and Y, then sort by BM25 relevance" - it looks like this extension currently handles just the BM25 ranking but not the FTS filtering. Are you planning to address that in the future?
I found this example in the README quite confusing:
    SELECT * FROM documents
    WHERE content <@> to_bm25query('search terms', 'docs_idx') < -5.0
    ORDER BY content <@> 'search terms'
    LIMIT 10;
That -5.0 is a magic number which, based on my understanding of BM25, is difficult to predict in advance, since the threshold you'd want varies across datasets.

Re filtering: there are often reasonable workarounds in the SQL context, which is what led me to deprioritize it for GA. With your example, the workaround is to apply post-filtering to select just the matches containing all desired terms. The ergonomics aren't ideal, since you may have to play with the LIMIT to get enough results back, but it's already a familiar pattern if you're using vector indexes. For very selective conditions, pre-filtering on those conditions and ranking afterwards is also an option for the planner, provided you've created indexes on the columns in question.
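A rough sketch of that post-filtering workaround, reusing the `documents` table and `<@>` operator from the README example above (the oversample size and ILIKE predicates are illustrative assumptions, not part of the extension):

```sql
-- Oversample from the BM25 index, then post-filter for rows
-- that actually contain every required term.
WITH candidates AS (
    SELECT *, content <@> 'search terms' AS score
    FROM documents
    ORDER BY content <@> 'search terms'
    LIMIT 100   -- tune upward until enough rows survive the filter
)
SELECT *
FROM candidates
WHERE content ILIKE '%search%'   -- one predicate per required term
  AND content ILIKE '%terms%'
ORDER BY score
LIMIT 10;
```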
All this is just an argument about priorities for GA. Now that v1.0 is out, we'll get signal about which features to prioritize next.
I think I can also set up partitions for this, but while you're here… I'm very excited to start rolling this out.
P.S. been shipping it for a while https://www.freshports.org/databases/pg_textsearch/ :)
My knowledge on that subject roughly begins and ends with this excellent article, so I'd love to hear how this relates to that.
https://www.anthropic.com/engineering/contextual-retrieval
Especially since what Anthropic describes there is a bit of a Rube Goldberg machine that also involves preprocessing (contextual summarization) and a reranking model, so I was wondering if there are any "good enough" out-of-the-box solutions for it.
It's also a great way to combine fuzzy search with stricter phrase or term matching, as opposed to fiddling with boosts or otherwise trying to merge result sets by hand.
Elastic has a decent explanation of how RRF works.
https://www.elastic.co/docs/reference/elasticsearch/rest-api...
Simple enough that you can just hack this into a select statement. Or do some easy post processing.
My own querylight-ts library implements BM25, vector search, RRF, and more for in-browser search; I've been experimenting with it over the last few weeks. Very effective if you want somewhat more advanced search on your website. Having decent BM25 indexing in PostgreSQL opens up a lot of new possibilities. It already had vector search and trigram support, and of course traditional wildcard matching, normalization functions, etc. BM25 fills in a big missing piece.
There's still value in having your search index separated from your main datastore. What you query is not necessarily what you store. That's why people have ETL pipelines to extract, transform (crucial), and load. Even if your search index is going to be PostgreSQL, you might want to think about how to pump data around and what happens when you change your mind about how you want to query and index your data. Migrating your single source of truth is probably going to be an anti-pattern there.

Honestly, ETL is the one thing I see a lot of companies architect wrong when they consult me on how to improve/fix their search solutions. Classic probing question: "When is the last time you reindexed your data?" If the answer is "a long time ago", they have no effective ETL capability, basically. That's usually the first problem to sort out with clients like that. Even if it's just a separate table in the same DB, how you rebuild it is crucial to experimenting with new querying and indexing strategies.
In SQL Server you just use a case-insensitive collation (which is the default), add a non-clustered index, and call it a day.

In Postgres you need to go above and beyond just for that. It's like the Postgres guys were "nah dog, everybody just uses lowercase; you don't need to worry about people writing john doe as John Doe".
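The "above and beyond" in question, sketched against a hypothetical `users(name)` table (both are standard Postgres features, but neither is the default the way it is in SQL Server):

```sql
-- Option 1: expression index, plus remembering lower() in every query.
CREATE INDEX users_name_lower_idx ON users (lower(name));
SELECT * FROM users WHERE lower(name) = lower('John Doe');

-- Option 2: the citext extension, a case-insensitive text type.
CREATE EXTENSION IF NOT EXISTS citext;
ALTER TABLE users ALTER COLUMN name TYPE citext;
SELECT * FROM users WHERE name = 'john doe';  -- matches 'John Doe'
```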
And don't get me started on storing datetimes with a timezone offset (e.g. “4/2/2007 7:23:57 PM -07:00”). In SQL Server you have datetimeoffset; in Postgres you fuck off :-)
`TIMESTAMPTZ` / `TIMESTAMP WITH TIME ZONE` exists?
also shoutout to tj for being super responsive on github issues!
What a wonderful ad for ParadeDB, and clear signal that "TigerData" is a pernicious entity.
TigerData:

> pg_textsearch v1.0 is freely available via open source (Postgres license)
They deemed AGPL untenable for their business and decided to create an OSS solution that used a license they were comfortable with and they are somehow "pernicious"? Perhaps take a moment to reflect on your characterization of a group that just contributed an alternative OSS project for a specific task. Not only that, but they used a VERY permissive license. I'd argue that they are being a better OSS community member for selecting a more permissive license.