Show HN: Neum AI – Open-source large-scale RAG framework

155 points by picohen 1 year ago | 30 comments
Over the last couple of months we have been helping developers build large-scale RAG pipelines that process millions of pieces of data.

We documented our approach in an HN post (https://news.ycombinator.com/item?id=37824547) a couple of weeks ago. Today, we are open-sourcing the framework we have developed.

The framework focuses on RAG data pipelines and provides scale, reliability, and data synchronization capabilities out of the box.

For those newer to RAG, it is a technique for providing context to Large Language Models. It consists of grabbing pieces of information (e.g. excerpts from news articles, papers, descriptions, etc.) and incorporating them into prompts to help ground the responses. The hard part is finding the right pieces of information to incorporate, and that search for relevant information is done through the use of vector embeddings and vector databases.

Those news articles, papers, etc. are transformed into vector embeddings that represent the semantic meaning of the information. These vector representations are organized into indexes where we can quickly search for the pieces of information that most closely resemble (from a semantic perspective) a given question or query. For example, if I take news articles from this year, vectorize them, and add them to an index, I can quickly search for pieces of information about the US elections.
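
To make that concrete, here is a minimal sketch of the embed-and-search idea (illustrative only, not our framework's API) using sentence-transformers and numpy; the model name and toy documents are just examples:

    from sentence_transformers import SentenceTransformer
    import numpy as np

    model = SentenceTransformer("all-MiniLM-L6-v2")   # any embedding model works
    docs = [
        "Candidates announced their platforms ahead of the US elections.",
        "A new battery chemistry promises cheaper grid storage.",
    ]
    doc_vecs = model.encode(docs, normalize_embeddings=True)     # one vector per doc
    query_vec = model.encode(["news about the US elections"],
                             normalize_embeddings=True)
    scores = doc_vecs @ query_vec.T       # cosine similarity (vectors are normalized)
    print(docs[int(np.argmax(scores))])   # the semantically closest document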

To help achieve this, the Neum AI framework features:

Starting with built-in data connectors for common data sources, embedding services, and vector stores, the framework provides the modularity to build data pipelines to your specification.

The connectors support pre-processing capabilities to define loading, chunking, and selecting strategies that optimize the content to be embedded. This also includes extracting the metadata that will be associated with a given vector.
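
As an illustration (a sketch, not our actual connector API), a chunking step that carries metadata alongside each chunk might look like:

    def chunk_with_metadata(doc_id: str, text: str, size: int = 500) -> list[dict]:
        # naive fixed-size chunking; real strategies can be swapped in
        return [
            {
                "text": text[i:i + size],
                # this metadata travels with the chunk to the vector store
                "metadata": {"doc_id": doc_id, "offset": i},
            }
            for i in range(0, len(text), size)
        ]

    chunks = chunk_with_metadata("article-42", "Some long article text. " * 100)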

The generated pipelines support large-scale jobs through a high-throughput distributed architecture. The connectors allow you to parallelize tasks like downloading documents, processing them, generating embeddings, and ingesting data into the vector DB.
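
The shape of that fan-out, with stand-in stage functions rather than our actual connectors:

    from concurrent.futures import ThreadPoolExecutor

    def embed(chunk: str) -> list[float]:
        return [float(len(chunk))]   # stand-in for an embedding service call

    def ingest(doc: str) -> int:
        chunks = [doc[i:i + 500] for i in range(0, len(doc), 500)]
        for chunk in chunks:
            vector = embed(chunk)
            # a real pipeline would upsert (chunk, vector, metadata) into the vector DB
        return len(chunks)

    docs = ["first document " * 100, "second document " * 100]
    with ThreadPoolExecutor(max_workers=8) as pool:   # documents processed in parallel
        total = sum(pool.map(ingest, docs))
    print(total, "chunks processed")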

For data sources that are continuously changing, the framework supports data scheduling and synchronization. This includes delta syncs, where only new data is pulled.
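
For instance, a timestamp-based delta sync (field names here are illustrative) boils down to:

    from datetime import datetime, timezone

    last_sync = datetime(2023, 11, 1, tzinfo=timezone.utc)   # persisted between runs

    rows = [   # imagine these coming from the source connector
        {"id": 1, "text": "old article",
         "updated_at": datetime(2023, 10, 1, tzinfo=timezone.utc)},
        {"id": 2, "text": "fresh article",
         "updated_at": datetime(2023, 11, 20, tzinfo=timezone.utc)},
    ]

    delta = [r for r in rows if r["updated_at"] > last_sync]
    for row in delta:
        print(f"re-embedding and upserting row {row['id']}")   # only changed rows
    last_sync = max((r["updated_at"] for r in rows), default=last_sync)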

Once data is loaded into a vector database, the framework supports querying it, including hybrid search using the metadata added during pre-processing. As part of the querying process, the framework provides capabilities to capture feedback on retrieved data as well as run evaluations against different pipeline configurations.
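
Conceptually, hybrid search combines vector similarity with a metadata filter; a toy illustration (not the framework's internals):

    import numpy as np

    index = [   # entries as they might land in the vector store
        {"vec": np.array([0.1, 0.9]), "text": "election coverage",
         "meta": {"source": "news", "year": 2023}},
        {"vec": np.array([0.8, 0.2]), "text": "battery research",
         "meta": {"source": "papers", "year": 2021}},
    ]

    def hybrid_search(query_vec, metadata_filter, k=5):
        candidates = [e for e in index
                      if all(e["meta"].get(f) == v for f, v in metadata_filter.items())]
        candidates.sort(key=lambda e: float(np.dot(e["vec"], query_vec)), reverse=True)
        return candidates[:k]

    results = hybrid_search(np.array([0.2, 0.8]), {"source": "news"})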

Try it out, and if you're interested in chatting more about this, shoot us an email at founders@tryneum.com.

  • hrpnk 1 year ago
    Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4. Asks GPT for the python code and executes it: https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neu...
    • smokel 1 year ago
      You seem to have found a very polite way of highlighting this.

      I assume that in the not-so-distant future, a malware scanner will detect this and prevent one from running it locally.

      • ddematheu 1 year ago
        Yeah, we were playing around with doing some semantic chunking. Works okay for some use cases. We have some ideas to go further on that.

        Generally, we have found that recursive chunking and character chunking tend to be short-sighted.

        • hrpnk 1 year ago
          Don't you find it dangerous to just run the code w/o any sanitizing?

          Why not capture a few strategies that the LLM returns as code that can be properly audited (and run locally, improving overall performance)?

          • ddematheu 1 year ago
            It is dangerous, which is part of the reason we haven't productized that further. One of the ideas we had for productizing the capabilities was to leverage edge / lambda functions to compartmentalize the generated code. (Plus it becomes a general extensibility point for folks who are not using semantic code generation and simply want to write their own code.)

            The idea of auditing the strategy is interesting. The flow we have used for the semantic chunkers to date has been along these lines: 1) use the utility to generate the code snippets (and do some manual inspection), 2) test the code snippets against some sample text, 3) validate the results.
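
            Roughly, that flow looks like this (a sketch; llm_generate_chunker stands in for the GPT-4 call):

                from pathlib import Path

                def llm_generate_chunker(prompt: str) -> str:
                    # hypothetical stand-in for the GPT-4 call that returns chunking code
                    return "def chunk(text):\n    return text.split('\\n\\n')\n"

                snippet = llm_generate_chunker("Write chunk(text) that splits on paragraphs")
                Path("chunker_candidate.py").write_text(snippet)   # 1) saved for manual inspection

                namespace = {}
                exec(Path("chunker_candidate.py").read_text(), namespace)    # 2) run the audited code
                chunks = namespace["chunk"]("First para.\n\nSecond para.")   # 3) validate on sample text
                assert chunks == ["First para.", "Second para."]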

          • treprinum 1 year ago
            Why not use Stanford Stanza?
        • ramoz 1 year ago
          Very bearish on these frameworks and abstractions.

          Yes, obviously useful for prototyping and creating hype articles & tweets with fun examples. However, any engineer is capable of doing their own RAG with the same effort (minimal data extraction using the ancient pdf/scrape tools that are still open SOTA, or cloud OCR for the best results -> brute-force chunking -> embed -> load into an ANN index with a complementary metadata store).

          Anyone doing prod needs to know the intricacies and make advanced engineering decisions. There's a reason there aren't similar end-to-end abstractions over creating Lucene (Solr/Elastic) indexes. Hmm, why not, after many decades? …

          In reality, the RAG tech is not entirely novel; it's ETL. And complex ETL is often a serious data-curation effort. LLMs are the closest thing to enabling better data curation, and as long as you aren't competing with OpenAI (arguably any commercial system is), you can use ChatGPT to create your chunks.

          Beyond this, embedding strategies are nice to abstract, but the best approach to embeddings still remains to create your own and figure out contextual integration on your own. Creating your own can also just mean fine-tuning. Inference is often an ensemble, depending on your use case.

          • ddematheu 1 year ago
            I don't disagree with all your points. That said, what we have built has proven useful for us as we have built pipelines for customers and think it might be useful for others.

            Probably the main point where I disagree with you is that RAG is just ETL. If that were the case, all of the AI apps people are building would be AMAZING, because we solved the ETL problem years ago. Yet app after app being released has issues like hallucinations and incorrect data. IMO, the second you insert a non-deterministic entity in the middle of an ETL pipeline, it is no longer just ETL. To try to add value here, our focus has been on adding capabilities to the framework around data synchronization (which is actually more of a vector-management problem), contextualization of data through metadata, and retrieval (the part where we have spent the least time to date, but are currently spending the most).

            • hobs 1 year ago
              The problem is that business people have been just rubbing together libraries for decades making money (though maybe not the MOST) and will see these frameworks as a "simple" way to accelerate development, when in fact most of them are opinionated (and bad) ETL. Just put langchain together and it's done, right?

              I went through building a RAG pipeline for a company and brought up at each stage that there had been no tuning, no efficacy testing for different scenarios, no testing of different chunking strategies, just the most basic work, and they released it almost immediately. Surprisingly, to not much fanfare.

              It doesn't really matter

            • westurner 1 year ago
              DAIR.AI > Prompt Engineering Guide > Techniques > Retrieval Augmented Generation (RAG): https://www.promptingguide.ai/techniques/rag

              https://github.com/topics/rag

              • eigenvalue 1 year ago
                Cool. Do you do any of the relevance calculations directly, or is that all handled by Weaviate? If so, is there any way to influence that part of it, or is it something of a black box?
                • picohen 1 year ago
                  Relevance calculations are handled by the vector DB, but we try to improve that relevance with the use of metadata (you will see how our components have "selectors" so that metadata can flow all the way to the vector database at the vector level and influence the results/scores retrieved at search time).
                  • eigenvalue 1 year ago
                    Got it. I'd encourage you to expose more of that functionality at the level of your application if possible. I think there is a lot of potential in using more than just cosine similarity, especially when there are lots of candidates and you really want to sharpen up the top few recommendations to the best ones. You might find this open-source library I made recently useful for that:

                    https://github.com/Dicklesworthstone/fast_vector_similarity

                    I've had good results from starting with cosine similarity (using FAISS) and then "enriching" the top results from that with more sophisticated measures of similarity from my library to get the final ranking.
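
                    The shape of that two-stage flow, with a placeholder re-rank rather than my library's actual API:

                        import faiss
                        import numpy as np

                        d = 64
                        doc_vecs = np.random.rand(1000, d).astype("float32")
                        faiss.normalize_L2(doc_vecs)   # inner product == cosine after normalizing
                        index = faiss.IndexFlatIP(d)
                        index.add(doc_vecs)

                        query = np.random.rand(1, d).astype("float32")
                        faiss.normalize_L2(query)
                        scores, ids = index.search(query, 50)   # stage 1: cheap cosine top-50

                        def rerank(q, candidate_ids):
                            # stage 2: stand-in for a more sophisticated similarity measure
                            return sorted(candidate_ids,
                                          key=lambda i: -float(np.dot(q[0], doc_vecs[i])))

                        final = rerank(query, ids[0].tolist())[:5]   # sharpened top results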

                • J_Shelby_J 1 year ago
                  How does this improve upon retrieval compared to just using any vector DB and semantic search?
                  • ddematheu 1 year ago
                    Co-founder here :)

                    Today, it is mostly about convenience. We provide abstractions in the form of a pipeline that encompasses a data source, an embed definition, and a sink definition. This means you don't have to think about embedding your query or which class you used to add the data into the vector DB.

                    In the future, we have some additional abstractions that will add more convenience. For example, we are working on a concept of pipeline collections so that you can search across multiple indexes but get unified results. We are also adding more automation around metadata: as part of the pipeline configuration we know what metadata was added, along with examples of it, so we can help translate queries into hybrid search. I think of it as a self-query retriever from LangChain or LlamaIndex, but one that automatically has context on the data at hand (no need to provide attributes).
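
                    As a toy illustration of that translation step (simple keyword matching standing in for the LLM):

                        KNOWN_VALUES = {"source": ["news", "papers"]}   # captured during the pipeline

                        def to_hybrid_query(user_query: str) -> dict:
                            filters = {}
                            for field, values in KNOWN_VALUES.items():
                                for value in values:
                                    if value in user_query.lower():
                                        filters[field] = value   # an LLM would handle this in practice
                            return {"vector_query": user_query, "filter": filters}

                        print(to_hybrid_query("What did news articles say about the elections?"))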

                    Are there any specific retrieval capabilities you are looking for?

                  • matmulbro 1 year ago
                    This sums up the current wave of AI 'companies':

                    submissions by this user (https://news.ycombinator.com/submitted?id=picohen):

                    Show HN: Neum AI – Open-source large-scale RAG framework (github.com/neumtry)

                    Show HN: ElectionGPT – easy-to-consume information about U.S. candidates (electiongpt.ai)

                    Efficiently sync context for your LLM application (neum.ai)

                    Show HN: Neum AI – Improve your AI's accuracy with up-to-date context (neum.ai)

                  • rmonvfer 1 year ago
                    How is this any different from LlamaIndex [1]?

                    [1] https://www.llamaindex.ai

                    • ddematheu 1 year ago
                      LlamaIndex is pretty awesome.

                      There are a couple areas where we think we are driving some differentiation.

                      1. The management of metadata as a first-class citizen. This includes capturing metadata at every stage of the pipeline.

                      2. Being infra-ready. We are still evolving this point, but we want to add abstractions that can help developers apply this type of framework to a large-scale distributed architecture.

                      3. Enabling different types of data synchronization natively. So far we enable both full and delta syncs, but we have work in the pipeline to bring in abstractions for real-time syncing.

                    • alchemist1e9 1 year ago
                      If someone is about to start their project using Haystack would you suggest they instead look at Neumtry?
                      • picohen 1 year ago
                        Well, of course I'm biased on the answer :). But to give a not-so-biased answer, I would first try to understand what the project is about and whether RAG is a priority in it. If the project is leveraging agents and LLMs without worrying too much about context/up-to-date data, then Haystack could be a good option. If the focus is to eventually use RAG, then our framework could help.

                        Additionally, there might be a potential route where both are used, depending on the use case.

                        Feel free to dm if you want to chat further on this!

                        • zansara 1 year ago
                          Actually Haystack is very focused on RAG lately, just have a look at the latest blog articles: https://haystack.deepset.ai/blog

                          (Disclaimer: I am a Haystack maintainer)

                          • alchemist1e9 1 year ago
                            A bit odd that they might not be aware of that. Any ideas from this project you see that might benefit Haystack?
                          • alchemist1e9 1 year ago
                            I understood Haystack as doing RAG but your comment seems to define it differently than my understanding.
                        • imperio59 1 year ago
                          Why MySQL and not PostgreSQL or Redis on the roadmap for sources?
                          • picohen 1 year ago
                            Postgres is already available :)
                            • picohen 1 year ago
                              In fact, we will be releasing a blog post in the next few days on how we do real-time syncing for a RAG application with Postgres hosted on Supabase.
                          • omarfarooq 1 year ago
                            Have you guys connected with MemGPT?