Show HN: Chat with your data using LangChain, Pinecone, and Airbyte
220 points by mtricot 1 year ago | 59 comments
A few of our team members at Airbyte (and Joe, who killed it!) recently played with building our own internal support chatbot, using Airbyte, LangChain, Pinecone, and OpenAI, to answer any questions we have when developing a new connector on Airbyte.
As we prototyped it, we realized it could be applied to many other use cases and sources of data, so... we created a tutorial that other community members can leverage [http://airbyte.com/tutorials/chat-with-your-data-using-opena...] and a GitHub repo to run it [https://github.com/airbytehq/tutorial-connector-dev-bot]
The tutorial shows:
- How to extract unstructured data from a variety of sources using Airbyte Open Source
- How to load data into a vector database (here, Pinecone), preparing the data for LLM usage along the way
- How to integrate a vector database into ChatGPT to ask questions about your proprietary data
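One piece of the "preparing the data for LLM usage" step is splitting long documents into overlapping chunks before they are embedded and loaded into the vector store. A minimal sketch of that chunking step (function name and sizes are illustrative, not Airbyte's actual implementation):

```python
# Hypothetical sketch of the chunking done before embedding: long text is
# split into overlapping pieces so each one fits an embedding model's input.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    each overlapping the previous chunk by `overlap` characters."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

doc = "word " * 100  # a 500-character stand-in for a support document
pieces = chunk_text(doc, chunk_size=200, overlap=50)
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks.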
I hope some of it is useful, and would love your feedback!
- jmorgan 1 year ago
LangChain supports local LLMs like Llama 2 with Ollama (https://github.com/jmorganca/ollama) as of this morning, in both its Python and JavaScript versions:
https://python.langchain.com/docs/integrations/llms/ollama
This can be a great option if you'd like to keep your data local rather than submitting it to a cloud LLM, with the added benefit of saving costs if you're submitting many questions in a row (e.g. in batches).
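For illustration, a minimal sketch of talking to a local Ollama server over its HTTP API instead of a cloud LLM. The endpoint path and payload shape follow Ollama's documented `/api/generate` interface; the `llama2` model name and options are assumptions, so check the Ollama docs for your version:

```python
# Sketch of calling a local Llama 2 model through Ollama's HTTP API.
import json

def build_ollama_request(prompt: str, model: str = "llama2") -> dict:
    """Build the JSON payload for Ollama's /api/generate endpoint."""
    return {"model": model, "prompt": prompt, "stream": False}

payload = build_ollama_request("Why is the sky blue?")
body = json.dumps(payload)

if __name__ == "__main__":
    # Requires a local Ollama server (`ollama serve`) with llama2 pulled.
    import urllib.request
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body.encode(),
        headers={"Content-Type": "application/json"},
    )
    # resp = urllib.request.urlopen(req).read()  # uncomment with a live server
```

Because everything stays on localhost, no document text ever leaves your machine.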
- mtricot 1 year ago
I am sure we can build something around that. Going to take a look at it. Thanks for mentioning it.
- hnhg 1 year ago
Thanks. How would this differ from running Llama 2 through the Hugging Face LangChain integration? I haven't tried it, but it looked like the way to go until you shared this.
- zwaps 1 year ago
This one is for Mac
- wahnfrieden 1 year ago
Hugging Face launched Swift/Mac support today or in recent days
- zarazas 1 year ago
Can Llama 2 also be used to create the embeddings?
- bestcoder69 1 year ago
How well does it work?
- bnchrch 1 year ago
Always so happy to see a tutorial with actual substance.
So much written about LLMs lately is blog spam for SEO, but this is information-dense and practical. Definitely bookmarking this for tonight.
Also really happy to see a bonus section on pulling in data from third-party websites. I think this is where LLMs get really interesting. Not only is data much easier to query with these new models, it's also orders of magnitude easier to ingest from traditionally malformatted sources.
- BoorishBears 1 year ago
It's a little weird writing a comment worded like this if you work at Airbyte, isn't it?
- bnchrch 1 year ago
You know what, reading it back, this is a very fair callout. I'm sorry.
Last week I had seen a draft of this and thought to myself all these same things.
It was a well-written tutorial, exactly the type of format I would've hoped for as an engineer: to the point, with a lot of examples.
It really is the same style I like to write in. So when I saw we were starting to share this, I wanted to support what I thought was great work and writing.
But it was really unfair not to disclose that I work here.
Honestly, really sorry.
Also, I owe Michel an apology because I don't think he realized either.
- kaveet 1 year ago
That's disappointing to see, and posted within 10 minutes, no less.
- bnchrch 1 year ago
Hey Kaveet, I don't know who all has seen the original post, but I know you have, so I wanted to say I'm sorry. This was my fault. I wanted to support good work as a member of the engineering community, but I should've added a disclaimer about my employment.
- hubraumhugo 1 year ago
You're right, LLMs are really good at extracting and structuring data from third-party sources. I've been working on a "Zapier for data extraction" for this reason: https://kadoa.com
- omarfarooq 1 year ago
Looks great! Does it work behind a login?
In any case, just signed up for the waitlist. Would love to get bumped up if possible!
- bnchrch 1 year ago
A late but still very applicable disclaimer: I do work here!
My original comment was intended as my own personal opinion, as someone who reads and writes tutorials for fun.
But I owe the readers an apology because I did not add any disclosure, and honestly I should've.
- mtricot 1 year ago
Let me know how that works out for you and if you would add anything to this tutorial!
- wanderingmind 1 year ago
When there are so many awesome FOSS vector databases available, I wonder what motivated the Airbyte team to use Pinecone, the one database that is anti-FOSS?
- ramesh31 1 year ago
I don't know what datasets you guys are working with that have no issues being shared in plain text across three separate proprietary paid services, but this is a nonstarter for me.
- mtricot 1 year ago
The tutorial describes one stack for building a specific app, but the stack is made of building blocks that you can replace with others if you need to.
- Airbyte has two self-hosted options: OSS & Enterprise
- LangChain: OSS
- OpenAI: you can host an OSS model if you want to
- Pinecone: there are OSS/self-hosted alternatives
- _pdp_ 1 year ago
A fantastic starting point for beginners! Personally, I believe this tutorial provides a solid foundation, but there's so much more to explore. Building something truly effective involves tackling various nuanced situations and special cases.
While querying records in Pinecone can sometimes give you the right results, it can also be a bit unpredictable, depending on what and how you query. You might want to check out options like Weaviate, or even delve into the world of sparse indexes for an added layer of complexity.
The models themselves have their own quirks too. For example, GPT-3.5 Turbo tends to respond well when given clear instructions at the beginning of the context, while GPT-4, although more flexible, still comes with its own set of challenges.
Despite this, I'm genuinely excited about the push to highlight the potential of LLM applications (more of that, please!). Just remember, while tutorials like this are a great step, achieving seamless results might require some hands-on experience and learning along the way.
- mtricot 1 year ago
Thanks! I agree with your point. There is a lot of tuning that needs to happen, including context-aware splitting and other kinds of transformation before the unstructured data gets indexed. This is one of the big challenges of productionizing LLM apps with external data. So far we are using it internally, since the team has experience building these connectors, and it is becoming a great co-pilot.
The great thing we get by plugging this whole stack together is that the data stays refreshed as more issues/connectors get created.
- rahimnathwani 1 year ago
I'm curious: did you have ChatGPT lightly edit this comment before posting? A few things about the style (like the final sentence) sound similar to GPT-4 output.
- fakedang 1 year ago
We are reaching peak HN. Animated discussions on everything by chatbots, while the humans are the lurkers.
- replwoacause 1 year ago
Not the person you're asking, but yeah… that looks like ChatGPT
- oefnak 1 year ago
Just remember, ChatGPT always likes to end with an unsolicited warning and a smile. :-)
- sandGorgon 1 year ago
Hi folks, when will you have pgvector as a destination? We (https://github.com/arakoodev/edgechains) work with a lot of enterprises, and they would not move away from using Redis or pgvector as their vector store. Is there a way we can leverage that?
Second, a LOT of enterprises want to use non-OpenAI embedding models (MiniLM, GTE, BGE). Will you support those? For example, in Edgechains we natively support BGE and MiniLM. Would you be able to support that?
- amanivan 1 year ago
This is cool. I would like access to the code contents as well, not just the issues. Is that possible with Airbyte? If so, how?
- anupsurendran 1 year ago
I feel that there are too many moving pieces here, especially for prototyping. There was a much simpler app I took a look at in a recent Hacker News post: https://news.ycombinator.com/item?id=36894142
They still have work to do with different connectors (e.g. PDFs), but the simple realtime document pipeline helps a lot.
- gz5 1 year ago
Very well written and illustrated, thank you.
When using a local vector DB, what is the security model between my data and Airbyte? For example, do I need to permit Airbyte IPs into my environment, and is there a VPN-type option for private connectivity?
- mtricot 1 year ago
It depends.
Airbyte comes in 3 flavors: OSS, Cloud, Enterprise.
For OSS & Enterprise, data doesn't leave your infra, since Airbyte is running in your infrastructure. For Cloud, you would have to allow-list some IPs so we can access your local DB.
- r_thambapillai 1 year ago
How are you thinking about preventing customer PII from making it to OpenAI?
- mtricot 1 year ago
For the purpose of the tutorial that we built, it really comes down to the type of data that you're using.
If you have data with PII:
One option would be to use Airbyte to bring the data into files/a local DB rather than directly into the vector store, add an extra step that strips all PII from the data, and then configure Airbyte to move the clean files/records to the vector store.
The option that jmorgan mentioned is relevant here: using a "self-hosted" model.
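A minimal sketch of that intermediate PII-stripping step. The regexes here are hypothetical and only catch obvious patterns; a production pipeline would use a dedicated PII-detection library instead:

```python
# Hypothetical "strip PII before the vector store" step: replace email
# addresses and phone-like numbers with placeholders before embedding.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(record: str) -> str:
    """Replace emails and phone-like numbers with placeholder tokens."""
    record = EMAIL.sub("[EMAIL]", record)
    record = PHONE.sub("[PHONE]", record)
    return record

clean = scrub_pii("Contact jane.doe@example.com or +1 (555) 123-4567.")
```

Running the scrubber between the extract and load steps means only the placeholder tokens ever reach the cloud embedding API.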
- frankfrank13 1 year ago
This is always the first good question to ask about any chat bot IMO
- swyx 1 year ago
Congrats, team!
What was the thinking behind choosing to support "Vector Database (powered by LangChain)" instead of supporting Pinecone, Chroma, et al. directly, as you do with the other destinations? When is direct integration the right approach, vs. when is it better to have a (possibly brittle, but faster time to market) integration of an integration?
- mtricot 1 year ago
Great question :) We want to get to value as fast as possible. I am certain that at some point we will need to go deeper with those integrations, and they will likely need to be separate destinations. It will also depend on how they differentiate from each other; we will need more granularity in configurations.
- zarazas 1 year ago
I am playing around with LangChain these last few days as well, and if I understand correctly, all LangChain is really doing for you is giving you a guideline of recommended steps for a vector-assisted LLM. In your example it actually just adds some text to the prompt, something along the lines of: "Answer the following question with the context provided here. If you don't find the right info, don't make something up."
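As a rough illustration of the prompt assembly described in this comment, here is what stuffing retrieved chunks into a prompt can look like. The wording and function name are illustrative, not LangChain's exact template:

```python
# Sketch of a retrieval-augmented prompt: retrieved chunks are pasted into
# the prompt with an instruction not to invent answers.
def build_rag_prompt(question: str, context_chunks: list[str]) -> str:
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "How do I add a new stream?",
    ["Streams are defined in the connector's catalog.", "Use the CDK."],
)
```

The resulting string is what actually gets sent to the LLM; the vector database's only job is picking which chunks end up in the `Context:` section.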
- mritchie712 1 year ago
Have you considered supporting pgvector? I'd imagine that'd be easier since you already have Postgres as a destination.
- mtricot 1 year ago
On the roadmap! We want to get more clarity on how to fit the embedding part into the ELT model. Once we figure it out, we will add it to PG.
- rschwabco 1 year ago
A version supporting Pinecone directly is coming soon!
- amelius 1 year ago
I like to keep my tools simple, so just give me a single AI that can do everything: browse through my data, generate pictures, give me suggestions in my code editor, etc., instead of a different AI for every tool out there.
- mtricot 1 year ago
Isn't it the dream? Today there is a lot of stack that needs to be built to enable what you're describing. This is actually what we are doing with this post: figuring out what foundations we need to build so that the UX for the end user is what you're describing. It will take some time to get there :)
- kingforaday 1 year ago
The next great debate: MonolithicAI vs Micro-serviceAI.
- johndhi 1 year ago
How large of a dataset can I submit? I have hundreds of thousands of words of text.
- mtricot 1 year ago
There shouldn't be any limits here. Can you let us know how it goes?
- johndhi 1 year ago
Hmm, as a person of low technical savvy, do you expect there will be a point at which I can upload a large text file and have you do all the work to let me chat with it? I'd pay for that today if it exists, but can't put a ton of effort into building/implementing something myself.
- tomr75 1 year ago
ChatPDF..?
- zby 1 year ago
So I guess after all those discussions we are still stuck with LangChain for everything to do with LLMs.
- BoorishBears 1 year ago
I spend all day talking to people shipping AI products, and approximately zero of them actually use LangChain.
LangChain doesn't make sense for a ton of reasons, but the top few are the code quality being horrid, the scope being ill-defined, and the fact that most of the tasks it does are better done with a prompt designed for your exact use case.
- electrondood 1 year ago
We're actually shipping "AI products" and love LangChain.
Your criticism doesn't line up with anything in my experience. Nearly all of the prompts you use are your own, and you can customize any of the prompts used under the hood for chains like routing.
- electrondood 1 year ago
We're using it in production for several products, and are quite happy with it.
- everythingmeta 1 year ago
Nice to see a tutorial that recognizes the case where the underlying data can change and the embeddings need to be updated.
Any plans to write a tutorial for fine-tuning local models?
- mtricot 1 year ago
Not at the moment, but let me bring that to the team so we can brainstorm what it could look like.
- croes 1 year ago
Why is the OpenAI from the article title missing?
- mtricot 1 year ago
No good reason. Does "it made the post's title too long" work?
- replwoacause 1 year ago
Works for me!