RAG at scale: Synchronizing and ingesting billions of text embeddings
160 points by picohen 1 year ago | 55 comments- dluc 1 year agoWe are also developing an open-source solution for those who would like to test it out and/or contribute; it can be consumed as a web service or embedded into .NET apps. The project is codenamed "Semantic Memory" (available on GitHub) and offers customizable external dependencies, such as Azure Queues, RabbitMQ, or other alternatives, and options for Azure Cognitive Search, Qdrant (with plans to include Weaviate and more). The architecture is similar, with queues and pipelines.
We believe that enabling custom dependencies and logic, as well as the ability to add/remove pipeline steps, is crucial. As of now, there is no definitive answer to the best chunk size or embedding model, so our project aims to provide the flexibility to inject and replace components and pipeline behavior.
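To make the "inject and replace" idea concrete, here's a minimal sketch of a pluggable dependency (illustrative Python with made-up names; the actual project exposes analogous abstractions in .NET):

    # Sketch of a swappable queue dependency; names are hypothetical.
    from abc import ABC, abstractmethod

    class QueueBackend(ABC):
        @abstractmethod
        def enqueue(self, message: dict) -> None: ...
        @abstractmethod
        def dequeue(self) -> dict | None: ...

    class InMemoryQueue(QueueBackend):
        """Stand-in implementation; an Azure Queues or RabbitMQ class would
        satisfy the same interface, so pipeline code never needs to know
        which backend was injected."""
        def __init__(self) -> None:
            self._items: list[dict] = []
        def enqueue(self, message: dict) -> None:
            self._items.append(message)
        def dequeue(self) -> dict | None:
            return self._items.pop(0) if self._items else None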
Regarding scalability, LLM text generators and GPUs remain a limiting factor in this area as well. LLMs hold great potential for analyzing input data, and I believe the focus should be less on the speed of queues and storage and more on finding the optimal way to integrate LLMs into these pipelines.
- ddematheu 1 year agoThe queues and storage are the foundation on which some of these other integrations can be built. Agree fully on the need for LLMs within the pipelines to help with data analysis.
Our current perspective has been to leverage LLMs as part of async processes to help analyze data. This only really works when the data follows a template, so the same analysis can be applied across a vast number of documents. Otherwise it becomes too expensive to do on a per-document basis.
What types of analysis are you doing with LLMs? Have you started to integrate some of these into your existing solution?
- dluc 1 year agoCurrently we use LLMs to generate a summary, used as an additional chunk. As you might guess, this can take time, so we postpone the summarization to the end (the current default pipeline is: extract, partition, generate embeddings, save embeddings, summarize, generate embeddings of the summary, save embeddings).
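Roughly, as a sketch (step names are shorthand, not the project's actual identifiers):

    # Sketch of the default pipeline order described above; `handlers` maps
    # step name -> callable, so steps can be added, removed, or swapped.
    DEFAULT_PIPELINE = [
        "extract",          # pull raw text out of the source document
        "partition",        # split the text into chunks
        "gen_embeddings",   # embed each chunk
        "save_embeddings",  # persist the chunk embeddings
        "summarize",        # LLM-generated summary, stored as an extra chunk
        "gen_embeddings",   # embed the summary
        "save_embeddings",  # persist the summary embedding
    ]

    def run_pipeline(document, handlers, steps=DEFAULT_PIPELINE):
        state = {"document": document}
        for step in steps:
            state = handlers[step](state)
        return state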
Initial tests, though, are showing that summaries can hurt the quality of answers, so we'll probably remove that step from the default flow and use it only for specific data types (e.g. chat logs).
There's a bunch of synthetic-data scenarios we want to leverage LLMs for. Without going into too much detail: sometimes "reading between the lines", some memory consolidation patterns (e.g. a "dream phase"), etc.
- ddematheu 1 year agoMakes sense. Interesting that summaries sometimes hurt quality.
For synthetic-data scenarios, are you also thinking about synthetic queries over the data? (i.e. trying to predict which chunks might be used more than others)
- bradneuberg 1 year agoReally interesting library.
Is anyone aware of something similar but hooked into Google Cloud infra instead of Azure?
- dluc 1 year agoWe could easily add that if there's interest, e.g. using Pub/Sub and Cloud Storage. If there are .NET libraries, it should be straightforward to implement the relevant interfaces. Similar considerations apply to the inference part: embedding and text generation.
- derekperkins 1 year agoGCP also has a hosted vector db https://cloud.google.com/vertex-ai/docs/vector-search/overvi...
- CharlieDigital 1 year agoWhy .NET apps specifically?
- dluc 1 year agoMultiple reasons, some subjective as usual with these choices: customers, performance, the existing SK community, experience, etc.
However, the recommended use is running it as a web service, so from a consumer perspective the language doesn't really matter.
- juxtaposicion 1 year agoWe’re also building billion-scale pipeline for indexing embeddings. Like the author, most of our pain has been scaling. If you only had to do millions, this whole pipeline would be a 100 LoC. but billions? Our system is at 20k LoC and growing.
The biggest surprise to me here is using Weavite at the scale of billions — my understanding was that this would require tremendous memory requirements (of order a TB in RAM) which are prohibitively expensive (10-50k/m for that much memory).
Instead, we’ve been using Lance, which stores its vector index on disk instead of in memory.
- ddematheu 1 year agoCo-author of article here.
Yeah, a ton of the time and effort has gone into building robustness and observability into the process. When dealing with millions of files, it is imperative to be able to recover from a failure halfway through.
RE: Weaviate: Yeah, we needed to use large amounts of memory with Weaviate, which has been a drawback from a cost perspective, but which from a performance perspective delivers on the requirements of our customers. (On Weaviate we explored using product quantization.)
What type of performance have you gotten with Lance, both on ingestion and retrieval? Is disk retrieval fast enough?
- juxtaposicion 1 year agoDisk retrieval is definitely slower. In-memory retrieval can typically be ~1ms or less, whereas disk retrieval on a fast network drive is 50-100ms. But frankly, for any use case I can think of, 50ms of latency is good enough. The best part is that the cost is driven by disk, not RAM, which means instead of $50k/month for ~1TB of RAM you're talking about $1k/month for a fast NVMe on a fast link. That's 50x cheaper, because disks are 50x cheaper. An extra 50ms of latency in exchange for ~$49k/month in savings is a pretty clear, easy tradeoff.
- bryan0 1 year agoWe've been using pgvector at the 100M scale without any major problems so far, but I guess it depends on your specific use case. We've also been using Elasticsearch dense vector fields, which also seem to scale well; it's pricey, of course, but we already have it in our infra so it works well for us.
- ddematheu 1 year agoWhat type of latency requirements are you dealing with? (i.e. lookup time, ingestion time)
Were you using Postgres already, or did you migrate data into it?
- juxtaposicion 1 year agoI'd love to know the answer here too!
I've run a few tests on pg, and retrieving 100 random indices from a billion-scale table -- without vectors, just a vanilla table with an int64 primary key -- easily took 700ms on beefy GCP instances. And that was without a vector index.
Entirely possible my take was too cursory; would love to know what latencies you're getting, bryan0!
- bryan0 1 year agoWe have lookup-latency requirements on the Elasticsearch side. On pgvector it is currently a staging and aggregation database, so lookup latency is not so important. Our requirement right now is to be able to embed and ingest ~100M vectors/day, which we can achieve without any problems.
For future lookup queries on pgvector, we can almost always pre-filter on an index before the vector search.
yes, we use postgres pretty extensively already.
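To make the pre-filter + vector search pattern concrete, a rough sketch with psycopg2 and the pgvector Python adapter (table, column, and filter names are made up):

    # Pre-filter on a regular B-tree index, then rank by cosine distance.
    import numpy as np
    import psycopg2
    from pgvector.psycopg2 import register_vector

    conn = psycopg2.connect("dbname=rag")   # placeholder DSN
    register_vector(conn)                   # lets psycopg2 pass numpy arrays as vectors

    query_vec = np.random.rand(384)
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, text
            FROM chunks
            WHERE tenant_id = %s              -- pre-filter on an indexed column
            ORDER BY embedding <=> %s         -- pgvector cosine distance
            LIMIT 10
            """,
            (42, query_vec),
        )
        rows = cur.fetchall()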
- omneity 1 year agoWhat size are your embeddings?
- bryan0 1 year ago384 dims. we're using: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v...
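For reference, generating those 384-dim vectors looks roughly like this (assuming the v2 checkpoint of that model):

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embeddings = model.encode(["first chunk of text", "second chunk of text"])
    print(embeddings.shape)  # (2, 384)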
- esafak 1 year agoWhat kind of retrieval performance are you observing with Lance?
- juxtaposicion 1 year agoFor a "small" dataset of 50M vectors and 0.5TB in size, fetching 20 results takes around 50-100ms.
- dkatz23238 1 year agoWhat statistics/metrics are used to evaluate RAG systems? Is there any paper that systematically compares different RAG methods (chunking strategies, models, etc.)? I would assume such metrics would be similar to those used for evaluating summarization or question answering, but I am curious whether there are specific methods/metrics used to evaluate RAG systems.
- joewferrara 1 year agoThis is a great article about the technical difficulties of building a RAG system at scale from an engineering perspective, where performance is about speed and compute. A topic that is not addressed is how to evaluate a RAG system, where performance is about whether the system is retrieving the correct context and answering questions accurately. A RAG system should be built so that the different parts (retriever, embedder, etc.) can easily be taken out and modified to improve how accurately the system answers questions. Whether a RAG system is answering questions accurately should be assessed during development and then continuously monitored.
- ddematheu 1 year agoCo-author of the article here.
You are right, retrieval accuracy is important as well. From an accuracy perspective, are there any tools you have found useful for validating retrieval accuracy?
In our current architecture, all the different pieces within the RAG ingestion pipeline are modifiable to be able to improve loading, chunking and embedding.
As part of our development process, we have started to enable other tools that we don't talk about as much in the article, including a pre-processing and embeddings playground (https://www.neum.ai/post/pre-processing-playground) to test different combinations of modules against a piece of text. The idea is that you can establish your ideal pipeline/transformations, which can then be scaled.
- visarga 1 year agoDid you consider pre-processing each chunk separately to generate useful information - summary, title, topics - that would enrich embeddings and aid retrieval? Embeddings only capture surface form: "third letter of the second word" won't match the embedding for the letter "t". Information has surface and depth. We get depth through chain-of-thought, but that requires first digesting the raw text with an LLM.
Even LLMs are dumb during training but smart during inference. So to make more useful training examples, we need to first "study" them with a model, making the implicit explicit, before training. This allows training to benefit from inference-stage smarts.
Hopefully we avoid cases where "A is B" fails to recall "B is A" (the reversal curse). The reversal should be predicted during "study" and get added to the training set, reducing fragmentation. Fragmented data in the dataset remains fragmented in the trained model. I believe many of the problems of RAG are related to data fragmentation and superficial presentation.
A RAG system should have an ingestion LLM step for retrieval augmentation, and probably hierarchical summarisation up to a decent level. It would add insight to the system by processing the raw documents into a more useful form.
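A rough sketch of what such an ingestion-time enrichment step could look like (call_llm is a placeholder for whatever completion API you use, not a real library function):

    def enrich_chunk(chunk_text: str, call_llm) -> str:
        prompt = (
            "Given the passage below, produce a one-sentence summary, a short "
            "title, and a list of topics.\n\nPassage:\n" + chunk_text
        )
        enrichment = call_llm(prompt)
        # Embed the enrichment together with the raw text so retrieval can match
        # on implicit information made explicit, not just surface form.
        return enrichment + "\n\n" + chunk_text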
- ddematheu 1 year agoNot at scale. Currently we do some extraction for metadata, but it's pretty simple. Doing LLM-based pre-processing of each chunk like this can be quite expensive, especially with billions of them; summarizing each document before ingestion could cost thousands of dollars when you have billions of documents.
We have been experimenting with semantic chunking (https://www.neum.ai/post/contextually-splitting-documents) and semantic selectors (https://www.neum.ai/post/semantic-selectors-for-structured-d...), but from a scale perspective. For example, if we have 1 million docs but we know they are generally similar in format/template, then we can bypass having to use an LLM to analyze them one by one and instead help create scripts to extract the right info.
We think there are clever approaches like this that can help improve RAG while still being scalable.
- dartos 1 year agoDo you have any more resources on this topic? I’m currently very interested in scaling and verifying RAG systems.
- janalsncm 1 year ago> From an accuracy perspective, any tools you have found useful in helping validate retrieval accuracy?
You’ll probably want to start with the standard rank-based metrics like MRR, nDCG, and precision/recall@K.
Plus if you’re going to spend $$$ embedding tons of docs you’ll want to compare to a “dumb” baseline like bm25.
- ac2u 1 year agoYeah, especially if you're experimenting with training and applying a matrix to the embeddings generated by an off-the-shelf model to help it surface subtleties unique to your domain.
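A sketch of what that can look like: fit a matrix W so adapted query embeddings land nearer their known relevant document embeddings (illustrative only; assumes you have labeled query/document pairs):

    import numpy as np

    def fit_adapter(query_embs: np.ndarray, doc_embs: np.ndarray) -> np.ndarray:
        # Least-squares solution to query_embs @ W ≈ doc_embs
        W, *_ = np.linalg.lstsq(query_embs, doc_embs, rcond=None)
        return W

    def adapt(embs: np.ndarray, W: np.ndarray) -> np.ndarray:
        out = embs @ W
        return out / np.linalg.norm(out, axis=1, keepdims=True)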
- typest 1 year agoIt seems to me that RAG is really search, and search is generally a hard problem without an easy one-size-fits-all solution. E.g., as people push retrieval further and further in the context of LLM generation, they're going to go further down the rabbit hole of how to build a good search system.
Is everyone currently reinventing search from first principles?
- zby 1 year agoI am convinced that we should teach LLMs to use search as a tool instead of creating special search that is useful for LLMs. We already have a lot of search systems, and LLMs can in theory use any kind of text interface; the only problem is the limited context that LLMs can consume. But that is quite orthogonal to what kind of index we use for the search. In fact, for humans it is also useful that search returns limited chunks - we already have that with the 'snippets' that, for example, Google shows - we just need to tweak them a bit so there are maybe two kinds of snippets: shorter ones, as they are now, and longer ones.
You can use LLMs to do semantic search on top of a keyword search - by telling the LLM to come up with a good search term that includes all the synonyms. But if vector search over embeddings really gives better results than keyword search, then we should start using it in all the other search tools used by humans.
LLMs are the more general tool - so adjusting them to the more restricted search technology should be easier and quicker than doing it the other way around.
By the way - this prompted me to create my Opinionated RAG wiki: https://github.com/zby/answerbot/wiki
- isaacfung 1 year agoDepends on what you mean by search. Do you consider all Question Answering as search?
Some questions require multi-hop reasoning or have to be decomposed into simpler subproblems. When you google a question, often the answer is not trivially included in the retrieved text, and you have to process it (filter irrelevant information, resolve conflicting information, extrapolate to cases not covered, align the same entities referred to by two different names, etc.), formulate an answer to the original question, and maybe even predict your intent based on your history to personalize the result or customize it in the format you like (markdown, json, csv, etc.).
Researchers have developed many different techniques to solve the related problems. But as LLMs are getting hyped, many people try to tell you LLM+vector store is all you need.
- fkyoureadthedoc 1 year agoWe're using a product from our existing enterprise search vendor, which they pitch as NLP search. Not convinced it's better than the one we already had, considering we have to use an intermediate step of having the LLM turn the user's junk input into a keyword search query, but it's definitely more expensive...
- mrfox321 1 year agoYour intuition that search is being reinvented is correct.
It's still TBD whether these new generations of language models will democratize search over bespoke corpora.
There's going to be a lot of arbitrary alchemy and tribal knowledge...
- ddematheu 1 year agoTo some degree. The amount of data that will be brought into search solutions will be enormous; it seems like a good time to try to reimagine what that process might look like.
- antupis 1 year agoAlso, this is search for LLMs, not for humans, so the optimal solution will be different. Even across models, it is not hard to imagine that Mistral 7B will need different results than GPT-4, which reportedly has 1.76 trillion parameters.
- zby 1 year agoI think this is premature optimisation. LLMs are the general tool here - in principle we should try first to adjust LLMs to search instead of doing it the other way around.
But really I think that LLMs should use search as just one of their tools - just like humans do. I would call it Tool Augmented Generation. And they should also be able to reason through many hops. A good system answers the question _What is the 10th Fibonacci number?_ by looking up the definition on Wikipedia, writing code for computing the sequence, testing and debugging it, and executing it to compute the 10th number.
- wanderingmind 1 year agoAre there any good implementations of RAG within the PostgreSQL ecosystem? I have seen blog posts from Supabase[0] and TimescaleDB[1] but not a full-fledged project. Full-text search is very good within Postgres at the moment, and having semantic search within the same ecosystem is quite helpful, at least for simple use cases.
[0] https://supabase.com/docs/guides/database/extensions/pgvecto...
[1] https://www.timescale.com/blog/postgresql-as-a-vector-databa...
- losteric 1 year agoIsn't RAG "just" dynamically injecting relevant text into a prompt? What more would one implement to achieve RAG, beyond using Postgres' built-in full-text or kNN search?
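In its most bare-bones form that's something like the sketch below (search is whatever retriever you already have, call_llm a placeholder):

    def answer(question: str, search, call_llm, k: int = 5) -> str:
        chunks = search(question, limit=k)        # retrieve relevant text
        context = "\n\n".join(chunks)
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        return call_llm(prompt)                   # inject it into the prompt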
- wanderingmind 1 year agoWhat I'm looking for is a neat Python library (or equivalent) that integrates end to end, say with Postgres/pgvector using SQLAlchemy, enables parallel processing of large numbers of documents, and creates interfaces for embeddings using OpenAI/Ollama, etc. FastRAG [0] from Intel looks close to what I'm envisioning, but it doesn't appear to have integration with the Postgres ecosystem yet, I guess.
- ddematheu 1 year agoThrough the platform (Neum AI) we support the ability to do this with Postgres; it is a cloud platform though, not a Python library.
Curious what type of customization you are looking to add that makes you want something like a library?
- avthar 1 year agoTimescale recently released Timescale Vector [0], which adds a scalable search index (DiskANN) and efficient time-based vector search, in addition to all the capabilities of pgvector and vanilla PostgreSQL. We plan to add the document processing and embedding creation capabilities you discuss to our Python client library [1] next, but Timescale Vector integrates with LangChain and LlamaIndex today [2], which both have document chunking and embedding creation capabilities. (I work on Timescale Vector.)
[0]: https://www.timescale.com/blog/how-we-made-postgresql-the-be... [1]: https://github.com/timescale/python-vector [2]: https://www.timescale.com/ai/#resources
- antupis 1 year agoOr, more generally, what are good vector DBs? I have tried LlamaIndex, Pinecone, and Milvus, but they all kinda sucked in different ways.
- ddematheu 1 year agoWhat about them sucked?
- vimota 1 year agoThanks for writing this up! I'm working on a very similar service (https://embeddingsync.com/) and implemented almost the same thing as you've described here, but using a poll-based stateful workflow model instead of queueing.
The biggest challenge - which I haven't solved as seamlessly as I'd like - is supporting updates/deletes in the source. You don't seem to discuss it in this post; does Neum handle that?
- ddematheu 1 year agoCo-author of the article here.
We do support updates for some sources; deletes not yet. For some sources we do polling, which is then dumped onto the queues. For others we have listeners that subscribe to changes.
What are the challenges you are facing in supporting this?
- vimota 1 year agoSimilar to you: with polling you only see new data, not the deletion events, so I can't delete embeddings unless I keep track of state and do a diff. To properly support that you'd effectively need CDC, which gets more complex for arbitrary / self-serve databases.
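A sketch of the state-and-diff approach (a poor man's CDC; where the id sets come from depends on the source):

    def sync(source_ids: set[str], previously_seen: set[str], vector_store):
        added = source_ids - previously_seen
        deleted = previously_seen - source_ids
        for doc_id in deleted:
            vector_store.delete(doc_id)      # drop embeddings for removed docs
        # `added` flows into the normal ingestion pipeline; persist `source_ids`
        # as the new state for the next poll.
        return added, source_ids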
- bluelightning2k 1 year agoGood article BUT I can't fathom that people would use a managed service to generate and store embeddings.
The OpenAI or Replicate embeddings APIs are already a managed service... You would still need to manage it all yourself, just against a different API.
And dealing with embeddings is the kind of fun work every engineer wants to do anyway.
Still a good article but very perplexing how the company can exist
- raverbashing 1 year agoSounds like the same people who use langchain's "Prompt replacement" methods instead of, you know, just using string formatting
https://python.langchain.com/docs/modules/model_io/prompts/p...
- ddematheu 1 year agoSome engineers find it fun, others might not. Same as everything.
IMO the fun parts are actually prototyping and figuring out the right pattern I want to use for my solution. Once you have done that, scaling and dealing with robustness tends to be a bit less fun.
- joelthelion 1 year agoCan anyone who has used such systems for some time comment on their usefulness? Is it something you can't live without, a nice-to-have, or something you tend to forget is available after a while?
- vtuulos 1 year agohere's how we solved engineering challenges related to RAG using open-source Metaflow: https://outerbounds.com/blog/retrieval-augmented-generation/
- arzelaascoli 1 year agoWe also shared an article about how we run these indexing jobs at scale at deepset, with Kubernetes, SQS, S3, and KEDA.
TL;DR: Queue upload events via SQS, upload files to S3, scale consumers based on queue length with KEDA, and use Haystack to turn files into embeddings.
This also works for arbitrary pipelines with your own models and custom nodes (Python code snippets), and is pretty efficient.
Part1 (application&architecture): https://medium.com/@ArzelaAscoli/scaling-nlp-indexing-pipeli... Part2 (scaling): https://medium.com/@ArzelaAscoli/scaling-nlp-indexing-pipeli... Example code: https://github.com/ArzelaAscoIi/haystack-keda-indexing
We actually also started with Celery, but moved to SQS to improve stability.
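For anyone curious what the consumer side looks like, a rough sketch with boto3 (queue URL and message shape are placeholders; the KEDA scaling itself lives in the Kubernetes ScaledObject config, not shown here):

    import json
    import boto3

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    QUEUE_URL = "https://sqs.eu-central-1.amazonaws.com/123/upload-events"  # placeholder

    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            event = json.loads(msg["Body"])
            obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])
            file_bytes = obj["Body"].read()
            # ...run the indexing pipeline (e.g. Haystack) on file_bytes here...
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])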