Silkenweb Example: Hackernews Clone

Show HN: Chat with your data using LangChain, Pinecone, and Airbyte

220 points by mtricot 1 year ago | 59 comments

Hi HN,

A few of our team members at Airbyte (and Joe, who killed it!) recently played with building our own internal support chat bot, using Airbyte, Langchain, Pinecone and OpenAI, that would answer any questions we ask when developing a new connector on Airbyte.

As we prototyped it, we realized that it could be applied for many other use cases and sources of data, so... we created a tutorial that other community members can leverage [http://airbyte.com/tutorials/chat-with-your-data-using-opena...] and the Github repo to run it [https://github.com/airbytehq/tutorial-connector-dev-bot]

The tutorial shows: - How to extract unstructured data from a variety of sources using Airbyte Open Source - How to load data into a vector database (here Pinecone), preparing the data for LLM usage along the way - How to integrate a vector database into ChatGPT to ask questions about your proprietary data

I hope some of it is useful, and would love your feedback!

jmorgan 1 year ago
LangChain supports local LLMs like Llama 2 with Ollama (https://github.com/jmorganca/ollama) as of this morning, in both their Python and Javascript versions:
https://python.langchain.com/docs/integrations/llms/ollama
This can be a great option if you'd like to keep your data local versus submitting it to a cloud LLM, with the added benefit of saving costs if you're submitting many questions in a row (e.g. in batches)
- mtricot 1 year ago
  I am sure we can build something around that. Going to take a look at it. Thanks for mentioning it.
- hnhg 1 year ago
  Thanks. How would this differ from running Llama2 through the Huggingface-Langchain integration? I haven't tried it but it looked like the way to go until you shared this.
  - zwaps 1 year ago
    This one is for Mac
    - wahnfrieden 1 year ago
      Huggingface launched swift/mac today or in recent days
- zarazas 1 year ago
  Can llama 2 also be used to create the embeddings?
- bestcoder69 1 year ago
  How well does it work?
bnchrch 1 year ago
Always so happy to see a tutorial with actual substance.
So much on LLMs lately is mostly blog spam for SEO but this actually is information dense and practical. Definitely bookmarking this for tonight.
Also really happy to see a bonus section on pulling in data from third party websites. I think this is where LLMs get really interesting. Not only is data much easier to query with these new models, its also orders of magnitude easier to ingest from traditionally malformated sources.
- BoorishBears 1 year ago
  It's a little weird writing a comment worded like this if you work at Airbyte isn't it?
  - bnchrch 1 year ago
    You know what, reading it, this is a very fair callout. Im sorry.
    Last week I had seen a draft of this and thought to myself all these same things.
    It was a well written tutorial, and that it was exactly the type of format I wouldve hopped for as a engineer. To the point, with a lot of examples.
    It really is the same style I like to write in. So when I saw we were starting to share this I wanted to support what I thought was great work and writing.
    But it was really unfair not to put a disclaimer on that I do work here.
    Honestly, really sorry.
    Also I owe Michel an apology because I dont think he realized either.
  - kaveet 1 year ago
    That's disappointing to see, and posted within 10 minutes no less.
    - bnchrch 1 year ago
      Hey Kaveet, I dont know who have all seen the original post but I know you have so I wanted to say I'm sorry. This was my fault. I wanted to support good work as a member of the engineering community but I shouldve added a disclaimer about employment.
- hubraumhugo 1 year ago
  You're right, LLMs are really good at extracting and structuring data from third party sources. I've been working on a "Zapier for data extraction" for this reason: https://kadoa.com
  - omarfarooq 1 year ago
    Looks great! Does it work behind a login?
    In any case, just signed up for the waitlist. Would love to get bumped up if possible!
- bnchrch 1 year ago
  A late but still very applicable disclaimer: I do work here!
  My original comment had the intent of my own personal opinion as someone who reads/writes tutorials as fun.
  But I owe an apology to the readers because I did not add any disclosure, and honestly I shouldve
- mtricot 1 year ago
  Let me know how that works out for you and if you would add anything to this tutorial!
wanderingmind 1 year ago
When there are so many awesome FOSS vector databases available, I wonder what motivated the airbyte team to use Pinecone, the one database that is anti-FOSS?
ramesh31 1 year ago
I don't know what datasets you guys are working with that have no issues being shared in plain text across three separate proprietary paid services, but this is a nonstarter for me.
- mtricot 1 year ago
  When reading the tutorial, we are describing one stack to build a specific app. But the stack is made of building blocks that you can replace with others if you need to.
  - Airbyte has two self-hosted options: OSS & Enterprise
  - Langchain: OSS
  - OpenAI: you can host an OSS model if you want to
  - Pinecone: there are OSS/self-hosted alternatives
  - samspenc 1 year ago
    > - OpenAI: you can host an OSS model if you want to
    Just to confirm: you mean models like Facebook's Llama 2 and variants right? Since OpenAI hasn't released any OSS models.
    - mtricot 1 year ago
      correct
  - zarazas 1 year ago
    What about the embedding?
_pdp_ 1 year ago
A fantastic starting point for beginners! Personally, I believe this tutorial provides a solid foundation, but there's so much more to explore. Building something truly effective involves tackling various nuanced situations and special cases. While querying records in Pinecone can sometimes give you the right results, it can also be a bit unpredictable, depending on what and how you query. You might want to check out options like Weaviate, or even delve into the world of sparse indexes for an added layer of complexity. The models themselves have their own quirks too. For example, GPT3.5 Turbo tends to respond well when given clear instructions at the beginning of the context, while GPT4, although more flexible, still comes with its own set of challenges. Despite this, I'm genuinely excited about the push to highlight the potential of LLM applications (more of that, please!). Just remember, while tutorials like this are a great step, achieving seamless results might require some hands-on experience and learning along the way.
- mtricot 1 year ago
  Thanks! I agree with your point. There is a lot of tuning that needs to happen, including context aware splitting and any other kind of transformation before the unstructured data gets indexed. This is one of the big challenge of productionizing LLM apps with external data. So far we are using internally since the team as experience dealing with building these connectors and that becomes a great co-pilot.
  The great thing we get by plugging this whole stack together is that we get all the refreshed data as more issues/connectors get created.
- rahimnathwani 1 year ago
  I'm curious: did you have ChatGPT lightly edit this comment before posting? A few things about the style (like the final sentence) sound similar to GPT-4 output.
  - fakedang 1 year ago
    We are reaching peak HN. Animated discussions on everything by chatbots, while the humans are the lurkers.
  - replwoacause 1 year ago
    Not the person you’re asking but yeah…that looks likes ChatGPT
    - oefnak 1 year ago
      Just remember ChatGPT always likes to end with an unsolicited warning and a smile. :-)
sandGorgon 1 year ago
hi folks, when will you have pgvector as a destination ? we (https://github.com/arakoodev/edgechains) work with a lot of enterprises and they would not move away from using redis or pgvector even as their vector store. Is there a way where we can leverage that ?
Second, for a LOT of enterprises, they want to use non-openai embedding models (minilm, GTE, BGE), will you support that. For e.g. in Edgechains we natively support BGE and minilm. Would you be able to support that ?
amanivan 1 year ago
This is cool, I would like access to the code contents as well, not just the issue. Is that possible with airbyte? If so, how?
anupsurendran 1 year ago
I feel that there are too many moving pieces here especially for prototyping. There was a much more simpler app recently I took a look at on a recent hackernews post : https://news.ycombinator.com/item?id=36894142
They still have work to do with different connectors (e.g. PDF etc) but the realtime simple document pipeline is what helps a lot.
gz5 1 year ago
Very well written and illustrated, thank you.
When using a local vector db, what is the security model between my data and Airbyte? For example, do I need to permit Airbyte IPs into my enviro, and is there a VPN type option for private connectivity?
- mtricot 1 year ago
  It depends.
  Airbyte comes in 3 flavors: OSS, Cloud, Enterprise.
  For OSS & Enterprise, data doesn't leave your infra since Airbyte is running in your infrastructure. For Cloud, you would have to allow some IPs to allow us to access your local db.
r_thambapillai 1 year ago
How are you thinking about preventing customer PII making it to OpenAI?
- mtricot 1 year ago
  For the purpose of the tutorial that we built, it really comes down to the type of data that you're using.
  If you have data with PII:
  One option would be to use Airbyte and bring the data into files/local db rather than directly to the vector store, add an extra step that strips the data from all PII and then configure Airbyte to move the clean file/record to the vector store.
  The option that jmorgan mention is relevant here, using a "self-hosted" model.
- frankfrank13 1 year ago
  This is always the first good question to ask about any chat bot IMO
- 1 year ago
swyx 1 year ago
congrats team!
what was the thinking behind choosing to support "Vector Database (powered by LangChain)" instead of directly supporting Pinecone Chroma et al directly as you do in the other destinations? when is direct integration the right approach vs when is it better to have an (possibly brittle, but faster time to market) integration of an integration?
- mtricot 1 year ago
  Great question :) We want to get to value as fast as possible. I am certain that at some point we will need to go deeper with those integrations and they will likely require to be separate destinations. It will also depend on how they differentiate from each others, we will need more granularity with configurations.
  - zarazas 1 year ago
    I ak playing around with langchain the last days as well and when I checked right all langchain is really doing for you is giving you a guideline about recommended steps for a vector assisted LLM. In your example it actually just adds some text to the prompt like: "Answer the following question with the context provided here, If you dont find the right info dont make something up" sth. Along those lines
  - mritchie712 1 year ago
    have you considered supporting pgvector? I'd imagine that'd be easier since you already have pg as a destination.
    - mtricot 1 year ago
      On the roadmap! We want to get more clarity on how to fit the Embedding part in the ELT model. Once we figure it out we will add it to PG.
- rschwabco 1 year ago
  A version supporting Pinecone directly is coming soon!
amelius 1 year ago
I like to keep my tools simple so just give me a single AI that can do everything, browse through my data, generate pictures and give me suggestions in my code editor, etc. etc., instead of a different AI for every tool out there.
- mtricot 1 year ago
  Isn't it the dream? Today there is a lot of stack that needs to be built to enable what you're describing. This is actually what we are doing with that post. What foundations do we need to build so that the UX for the end user is what you're describing. Will take some time to get there :)
- kingforaday 1 year ago
  The next great debate. MonolithicAI vs Micro-serviceAI.
johndhi 1 year ago
How large of a dataset can I submit? I have hundreds of thousands of words of text.
- mtricot 1 year ago
  Shouldn't have any limits here. Can you let us know how it goes?
  - johndhi 1 year ago
    hmm, as a person of low technical savvy, do you expect there will be a point at which I can upload a large text file and have you do all the work to let me chat with it? I'd pay for that today if it exists, but can't put a ton of effort into building/implementing something myself.
    - tomr75 1 year ago
      chatpdf..?
zby 1 year ago
So I guess after all those discussions we are still stuck with LangChain for everything to do with LLMs.
- BoorishBears 1 year ago
  I spend all day talking to people shipping AI products and approximately zero of them actually use LangChain.
  LangChain doesn't make sense for a ton of reasons, but the top few are the code quality being horrid, the scope being ill defined, and the fact that most of the tasks it does are better done with a prompt that was designed for your exact use case.
  - electrondood 1 year ago
    We're actually shipping "AI products" and love LangChain.
    Your criticism doesn't line up with anything in my experience. Nearly all of the prompts you use are your own, and you can customize any of the prompts used under the hood for chains like routing.
- electrondood 1 year ago
  We're using it in production for several products, and are quite happy with it.
everythingmeta 1 year ago
nice to see a tutorial that recognizes the case where the underlying data can change and the embedding needs to be updated.
Any plans to write a tutorial for fine-tuning local models?
- mtricot 1 year ago
  Not at the moment but let me bring that to the team so we can brainstorm what it could look like.
croes 1 year ago
Why is the OpenAI from the article title missing?
- mtricot 1 year ago
  No good reason. Does "it made the post's title too long" work?
  - replwoacause 1 year ago
    Works for me!