Why Vector DBs Are the Wrong Abstraction – and What We Built Instead
23 points by Void_ 4 months ago | 8 comments
- Santas 4 months ago
Multi-tenancy: how do you keep it fair? If some dude in another team runs a crazy expensive query, does that just wreck everyone else’s perf? We’ve seen this happen in other distributed systems where ‘fair scheduling’ is more of a wish than reality.
- tmvst 4 months ago
We already mix vector and text search with Elastic + Vespa. What exactly makes your thing better? Not trying to be rude, but why should we care about Yet Another Search Engine?
- jerguslejko 4 months agodisclaimer: co-founder of topk here :)
not rude at all, it's a common sentiment. TopK gives you all the capabilities under one roof -- text, vectors, filters, embeddings, re-ranking, all behind a single API (so single SDK, single vendor, single bill, single observability plane, etc). This simplifies your integration (less code on your part) and gives you more flexibility to define custom scoring rules (elastic-style) but over text + vectors + any other pre-computed factors. Take a look at [this](https://docs.topk.io/concepts/unified-retrieval#custom-scori...) docs page which goes into detail if you are interested.
I'd be curious to learn about your setup though. Vespa comes with lexical search so what made you choose this combo? Was Elastic already present in your stack before adopting Vespa?
- mscavnicky 4 months ago
DataFusion wasn’t a fit because it doesn’t do external indexes. But why not extend it?
- marekgalovic 4 months ago
We actually tried to extend DataFusion at first but ultimately decided against it, since we can get most of the value by using Arrow and its compute kernels directly. DataFusion also executes filters in a way that rebuilds the underlying arrays (a data copy) and requires a strict schema, which is not a good fit for our schemaless, document-oriented model. In the end, switching from DataFusion to Reactor gave us 3x better latencies.
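For anyone wondering what "rebuilds the underlying arrays" costs in practice, here's a rough plain-Python sketch of the general idea. It is not Reactor or DataFusion code; `filter_copy`, `filter_selection`, and the column names are made up for illustration. One common alternative to copying on every filter is to carry a selection vector of surviving row indices and only materialize values at the very end.

```python
def filter_copy(columns, predicate_col, predicate):
    """Apply a filter by rebuilding every column (one data copy per filter)."""
    keep = [i for i, v in enumerate(columns[predicate_col]) if predicate(v)]
    return {name: [values[i] for i in keep] for name, values in columns.items()}


def filter_selection(columns, predicate_col, predicate, selection=None):
    """Apply a filter by narrowing a list of row indices; columns are untouched."""
    rows = range(len(columns[predicate_col])) if selection is None else selection
    return [i for i in rows if predicate(columns[predicate_col][i])]


# Hypothetical columnar data: two columns, four rows.
columns = {"price": [5, 20, 12, 40], "stock": [0, 3, 7, 1]}

# Chain two filters without copying any column data in between.
sel = filter_selection(columns, "price", lambda p: p > 10)
sel = filter_selection(columns, "stock", lambda s: s > 1, selection=sel)

# Materialize only the column we need, only once, at the end.
prices = [columns["price"][i] for i in sel]

# The copying variant produces full new columns after a single filter.
copied = filter_copy(columns, "price", lambda p: p > 10)
```

With chained filters, the copying version pays for a full rewrite of every column at each step, while the selection version only shrinks an index list.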
- tsc 4 months ago
So you’re using Arrow but also ‘custom layouts’. What was missing? Arrow’s pretty flexible, and we’ve been able to get it to do some weird stuff in memory without having to hack it. What broke for you?
- Equiet 4 months ago
You’re saying vector DBs are the wrong abstraction, but companies keep throwing money at them. Why? Are they just slow to catch on, or are there legit cases where vectors actually make sense?
- marekgalovic 4 months ago
Founder of TopK here. There are legit use cases for vector-based retrieval (e.g. semantic search, recommendations, multi-modal search, etc.) but that only requires supporting vectors as a data type, not building the whole database around vectors as a first-class citizen (which is what vector DBs do). In practice, you also want to combine multiple vectors, text filters, and metadata alongside custom scoring functions to optimize relevance in your domain, which is not possible with a database built around a vector index.
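To make "combine vectors, text, and metadata under one scoring function" concrete, here's a minimal sketch in plain Python. To be clear, this is not TopK's API: the function names, document fields, and weights (`w_vec`, `w_text`, `w_pop`) are all assumptions chosen for illustration, and the keyword score is a naive term-overlap ratio rather than BM25.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def keyword_score(query, text):
    """Fraction of query terms that appear in the text (naive stand-in for BM25)."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(t in words for t in terms) / len(terms) if terms else 0.0


def search(docs, query_text, query_vec, category, k=2,
           w_vec=0.6, w_text=0.3, w_pop=0.1):
    """Filter on metadata, then rank by a weighted blend of three signals."""
    scored = []
    for doc in docs:
        if doc["category"] != category:  # metadata filter
            continue
        score = (w_vec * cosine(query_vec, doc["embedding"])
                 + w_text * keyword_score(query_text, doc["body"])
                 + w_pop * doc["popularity"])  # pre-computed factor
        scored.append((score, doc["id"]))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical documents with toy 2-dimensional embeddings.
DOCS = [
    {"id": "a", "body": "fast vector search engine",
     "embedding": [1.0, 0.0], "category": "db", "popularity": 0.2},
    {"id": "b", "body": "hybrid text and vector search",
     "embedding": [0.6, 0.8], "category": "db", "popularity": 0.9},
    {"id": "c", "body": "vector search cookbook",
     "embedding": [1.0, 0.0], "category": "blog", "popularity": 1.0},
]

results = search(DOCS, "hybrid search", [1.0, 0.0], category="db")
```

The point of the sketch is that the ranking expression sees all three signals at once, so changing the weights or adding another factor is a one-line change instead of a cross-system merge step.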