After All Is Said and Indexed – Unlocking Information in Recorded Speech

57 points by jeadie 2 years ago | 13 comments
  • rektide 2 years ago
    Hadn't heard of the thing they were putting their data into, Marqo, a "tensor search for humans", https://github.com/marqo-ai/marqo
    • jeadie 2 years ago
      It's a great tool. Unlike vector DBs alone, Marqo handles the full process that a lot of people end up wanting to use vector DBs for (e.g. take structured data, use LLMs to create embeddings, and perform search/CRUD on both the embeddings and the original data).
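
      Roughly, the flow looks like this (a minimal sketch against a local Marqo instance; the index name and document fields are made up):

          import marqo

          # assumes a Marqo instance is running locally, e.g. via its Docker image
          mq = marqo.Client(url="http://localhost:8882")
          mq.create_index("my-transcripts")

          # Marqo creates the embeddings itself and keeps the original document
          mq.index("my-transcripts").add_documents(
              [{"_id": "call-42", "speaker": "agent",
                "text": "Thanks for calling, how can I help?"}],
              tensor_fields=["text"],
          )

          # one call searches the embeddings and returns the original data
          results = mq.index("my-transcripts").search(q="customer asking about refunds")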
    • jeadie 2 years ago
      A really interesting blog post I found that uses LLMs for audio search, which I think is a pretty nifty/new idea.

      I've found it cumbersome to use some of the new vector DBs (Chroma, FAISS, etc.) to build end-to-end systems, but with Marqo it doesn't seem too hard.

      • thomasahle 2 years ago
        > I've found it cumbersome to use some of the new vector DBs (Chroma, FAISS, etc.) to build end-to-end systems

        What parts are cumbersome?

        • jeadie 2 years ago
          Most people who end up needing vector DBs, like me, want to use LLMs on a specific, often private dataset/use case. Typically you start with something like unstructured JSON data, then need to pick and manage LLMs to create embeddings, then store those embeddings along with the original JSON in a vector DB. The application is then some variety of CRUD operations plus searching over both the original data and the embeddings.

          Chroma, Pinecone, and I guess FAISS/HNSWlib/etc. only handle the vector operations. What I really want, and what Marqo does, is to handle everything end to end.
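
          For concreteness, the do-it-yourself version looks roughly like this (a sketch using sentence-transformers and FAISS; the file and field names are made up, and keeping the index in sync with the original JSON stays your problem):

              import json

              import faiss
              from sentence_transformers import SentenceTransformer

              # hypothetical JSONL file with a "text" field per record
              records = [json.loads(line) for line in open("data.jsonl")]

              # 1. pick and manage an embedding model yourself
              model = SentenceTransformer("all-MiniLM-L6-v2")
              embeddings = model.encode([r["text"] for r in records])

              # 2. FAISS only stores vectors; the original JSON lives elsewhere
              index = faiss.IndexFlatL2(embeddings.shape[1])
              index.add(embeddings)

              # 3. search returns row ids; mapping back to the data is on you
              _, ids = index.search(model.encode(["refund policy"]), 5)
              hits = [records[i] for i in ids[0]]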

      • notjulianjaynes 2 years ago
        This is interesting, but what problem does it solve better than CTRL+F-ing a transcript? It seems like this would be a worse solution when the precise way someone says something could be important (e.g. journalists parsing an interview, students studying their recorded lectures), and most useful when you're working with a large volume of recorded audio, such as customer service calls. This makes me somewhat uncomfortable, but perhaps I am not fully understanding how it works.

        Edit: wording

        • jeadie 2 years ago
          Being able to handle and ask questions of audio data is a pretty big field. https://www.assemblyai.com/, for example, is a company entirely dedicated to audio intelligence. They have some great example use cases on their page.
          • UncleEntity 2 years ago
            > This is interesting but what problem does it solve better than CTRL+F-ing a transcript?

            Producing the transcript?

            Being able to classify and search data seems like a pretty big deal these days too.

          • password4321 2 years ago
            Both speaker and speech recognition are done in the article using huggingface.

            Is there anything as good ready to use on-prem for the diarization (speaker recognition)?

            I've heard good things about whisper(.cpp) for speech recognition, and vosk used to be king of that hill...
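
            For reference, local transcription with the openai whisper package is only a few lines (a sketch; the file name is made up):

                import whisper

                # runs fully on-prem; the model weights are downloaded once
                model = whisper.load_model("base")
                result = model.transcribe("meeting.wav")
                print(result["text"])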

            • rolisz 2 years ago
              Diarization can be done on-premise using pyannote (what they use in the article). Huggingface offers a library to run things locally and an API to run things on their cloud. Pyannote is available under an MIT licence.
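
              A minimal local run looks like this (a sketch; the pretrained pipeline is gated on Huggingface, so a token is needed once to download the weights):

                  from pyannote.audio import Pipeline

                  # the token only fetches the gated weights; inference runs locally
                  pipeline = Pipeline.from_pretrained(
                      "pyannote/speaker-diarization",
                      use_auth_token="hf_...",  # placeholder token
                  )
                  diarization = pipeline("meeting.wav")

                  # who spoke when
                  for turn, _, speaker in diarization.itertracks(yield_label=True):
                      print(f"{turn.start:.1f}s-{turn.end:.1f}s: {speaker}")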
              • boredemployee 2 years ago
                vosk is really good, but it's also an example of an open source project with great potential that doesn't scale up because the person behind it is a douchebag.

                the documentation is poor, and what you find on the web is sparse, outdated shit, so it's really hard to find help.

              • moneywoes 2 years ago
                How does this compare to using Whisper, feeding the transcript into a vector DB, and querying with an LLM?

                Pardon the dumb question, I only have an elementary understanding.

                • jeadie 2 years ago
                  Not a dumb question at all! Essentially, what Marqo can do, and what this blog shows, is handle all the logic and work it takes to do what you said (i.e. pass raw data into an LLM, get embeddings, store them in a vector DB, then query both the embeddings and the original data).
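
                  Put together, the audio version is roughly this (a sketch, assuming a local Marqo instance as in the example further up the thread; the file and index names are made up):

                      import marqo
                      import whisper

                      # 1. speech -> text, run locally with Whisper
                      text = whisper.load_model("base").transcribe("call.wav")["text"]

                      # 2. Marqo handles embedding, storage, and search in one place
                      mq = marqo.Client(url="http://localhost:8882")
                      mq.create_index("calls")
                      mq.index("calls").add_documents(
                          [{"_id": "call.wav", "text": text}],
                          tensor_fields=["text"],
                      )

                      # 3. one call queries the embeddings and the original data
                      results = mq.index("calls").search(q="what did the caller complain about?")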