A simple search engine from scratch

296 points by bertman 1 month ago | 59 comments
  • franczesko 1 month ago
    On the topic of search engines, I really liked classes by David Evans. The task was also building a simple search engine from scratch. It's really for beginners, as the emphasis is on coding in general, but I've found it to be very approachable.

    https://www.cs.virginia.edu/~evans/courses/

  • ktallett 1 month ago
    I always wonder if the days of search engines for specific topics could return. With LLMs providing less than accurate results in some areas, and Google, Bing, etc. being taken over by adverts or well-organised SEO, there feels like a place for accurate, specialised search.
    • wolfgang42 1 month ago
      Yeah, the (relative) rise of Kagi and Marginalia shows that from a technical perspective, this is within the grasp of a dedicated hobbyist.[1] If Google continues their current trajectory, and overwhelming numbers of AI crawlers don’t cause an insurmountable rise in CAPTCHA pages, I hope to see a resurgence of niche search engines that focus on some specialty small enough that one or a few people can curate the content and produce a much better experience than the current crop of general Web search engines.

      Self-plug: I run such a search engine (for programmers) in my living room, at <https://search.feep.dev/>. I don’t spend a ton of time maintaining it, so I’m interested to see what someone really dedicated could do.

      [1] I wrote a 2004-vs-2014 comparison, and things have only gotten better since then: https://search.feep.dev/blog/post/2022-07-23-write-your-own

      • iLoveOncall 1 month ago
        Please, Kagi doesn't even have 50,000 active members, it's definitely not "rising" to become a serious contender at any sort of market share, it's a micro-project. You just feel it's bigger than that because for some reason all of its 50,000 users post relentlessly about it on HN.
        • wolfgang42 1 month ago
          Hence the (relative), yes. Did “dedicated hobbyist” not tip you off that I wasn’t thinking about how to maximize market share?
      • cyanydeez 1 month ago
        Just gotta build a search engine that properly contextualizes scams, bait & switch sites, SEO, and the rest, and you're back in business.

        To do that, you probably still need humans to properly curate the dataset: essentially hire 100 librarians and set up a workflow for them to continually prune results.

        Right now, everything is all batch processes. None of these LLMs use active feedback since there's no real models using updates.

        • datadrivenangel 1 month ago
          The curation of an index of resources is what's needed for niche search
          • cyanydeez 1 month ago
            I know the answer is never distributed services, but if one could build a sufficiently complex SDK to make something like Bluesky but for niche search indexes, you could chain a bunch of vetted resources together.
            • dcist 1 month ago
              Westlaw and LexisNexis provide this for legal search, but quite frankly, these services are subpar. It's amazing that these two companies rake in hundreds of millions but they are both slower than Google, Bing, Yandex, or any LLM service (ChatGPT, Claude, Gemini, etc.) while scouring a universe of text that is orders of magnitude smaller. The user experience is also terrible (you have to log in and specify a client each and every time you attempt to use the service, and both services log you out after a short -- in my opinion -- period of inactivity, creating friction and needless annoyance to the user). There's an opportunity there.
              • ahi 1 month ago
                LN and Westlaw's real service is their ubiquity. Every law student has access to it and every firm expects proficiency. While they generally suck, the last time I used it (looong time ago), their boolean search was quite nice. That kind of text search has mostly been replaced by non-deterministic black boxes which aren't great for legal research.
                • ktallett 1 month ago
                  I haven't personally used the mentioned services as they aren't in my field, however what is the accuracy of their results? Are they double checked? I don't find LLMs particularly accurate in my field (that's being kind), if anything I find they make up sources that simply don't exist.

                  I mean poor UX has no excuse but slow speed can be reasoned if it makes the quality of the service better.

                • cosmicgadget 1 month ago
                  My hope is that content self-indexes, so instead of curation it just has to be aggregated.
                  • econ 1 month ago
                    Depends how tiny the niche is. A few dozen domains is easily done by hand and worth having.
                  • raydenvm 1 month ago
                    Which is not scalable, right?
                    • 1 month ago
                      • cosmicgadget 1 month ago
                        It's scalable if you are okay with not searching exhaustively.
                    • fanwood 1 month ago
                      I already directly search on Wikipedia for most topics (with a search shortcut on URL bar)
                      • ktallett 1 month ago
                        Wikipedia is useful up to a point for sure. I wonder whether it could be an expansion of Wikipedia in its current use case, but for emerging research and niche topics it can sometimes be less useful.
                    • snowstormsun 1 month ago
                      Nice idea, but this approach does not handle out-of-vocabulary words well, which is one major motivation for using a vector-based search. It might not perform significantly better than lexical matching like TF-IDF or BM25, while being slower because of linear complexity. But cool regardless.
                      • netdevphoenix 1 month ago
                        It is supposed to be a simple search engine. Keyword: simple.

                        As long as it does what it is meant to, as a simple search engine, it seems fine

                        • snowstormsun 1 month ago
                          Using TF-IDF or BM25 would actually be simpler than a vector search.

                          I understand this is just for fun, just wanted to point that out.
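For anyone curious how small the lexical route can be, here is a rough TF-IDF sketch in plain Python. It is not the author's code; the function names, the whitespace tokenizer, the `+ 1.0` smoothing on the IDF, and the toy `docs` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def build_index(docs):
    """docs maps a post name to its text. Returns (term counts per post, idf per term)."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    doc_freq = Counter()
    for c in counts.values():
        doc_freq.update(c.keys())  # each post counts once per distinct term
    n_docs = len(docs)
    # The "+ 1.0" keeps terms that appear in every post from scoring zero.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    return counts, idf


def search(query, counts, idf):
    """Return post names ranked by the summed tf * idf of the query terms."""
    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        score = sum((c[t] / total) * idf[t] for t in tokenize(query) if t in c)
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy corpus, not the blog's real posts.
docs = {
    "compiling-a-lisp": "compiling a lisp compiler in c",
    "search-engine": "a simple search engine from scratch",
    "pranks": "a prank on my friends",
}
counts, idf = build_index(docs)
print(search("lisp compiler", counts, idf))
```

Query terms that appear in no post simply contribute nothing, so an unknown word returns an empty result rather than a wrong one.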

                          • LunaSea 1 month ago
                            TF/IDF does not support out-of-vocabulary keywords as far as I know.
                        • cosmicgadget 1 month ago
                          Or since OP has both the cosine similarity matching and naive matching, a heuristic combination of the two since they address each other's weaknesses.
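One way such a heuristic combination could look, as a sketch only: blend a vector score with a keyword-overlap score using a weight. The names `keyword_score`, `combined_score`, and the `alpha` weight are hypothetical, and the vector score is assumed to come from whatever cosine-similarity ranking is already in place.

```python
def keyword_score(query, text):
    # Fraction of distinct query words that literally appear in the text.
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def combined_score(vector_score, kw_score, alpha=0.5):
    # alpha (assumption) weights the vector score; 1 - alpha weights the keyword score.
    return alpha * vector_score + (1 - alpha) * kw_score


# A post with a weak vector match but an exact keyword hit still surfaces.
print(combined_score(vector_score=0.2, kw_score=keyword_score("lisp", "compiling a lisp")))
```

The weight would need tuning per corpus; 0.5 is only a starting point.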
                          • janalsncm 1 month ago
                            Vector based approaches either don’t handle OOV terms at all or will perform poorly, depending on implementation. If you limit to alphanumeric trigrams for example you can technically cover all terms but badly depending on training data.
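To make the trigram idea concrete, here is a sketch in the spirit of subword embeddings: a word vector is the sum of vectors for its character trigrams, so any word gets some vector. The trigram vectors below are derived from a hash purely as a stand-in for trained values (an assumption, as are `DIM` and all function names), which is exactly why quality would depend on the training data in a real system.

```python
import hashlib
import math

DIM = 16  # hypothetical embedding size


def trigrams(word):
    # Pad with boundary markers so prefixes and suffixes get their own trigrams.
    padded = f"<{word.lower()}>"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def trigram_vector(gram):
    # Stand-in for a trained embedding: deterministic values in [-1, 1] from a hash.
    digest = hashlib.sha256(gram.encode("utf-8")).digest()
    return [(b - 127.5) / 127.5 for b in digest[:DIM]]


def word_vector(word):
    # Sum the trigram vectors; works for words never seen before.
    vec = [0.0] * DIM
    for gram in trigrams(word):
        for i, x in enumerate(trigram_vector(gram)):
            vec[i] += x
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


print(trigrams("lisp"))
print(cosine(word_vector("lisp"), word_vector("lisps")))
```

Words that share trigrams share summands, so spelling variants tend to land near each other, but nothing here captures meaning unless the trigram vectors are actually trained.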
                            • haasisnoah 1 month ago
                              How would you handle those in word2vec?

                              And isn’t a big advantage that synonyms are handled correctly? This implementation still has that advantage.

                            • leumassuehtam 1 month ago
                              The author has a nice series on compiling a Lisp [0], but unfortunately his search engine fails to find it by querying it with "lisp" or "Lisp".

                              [0] https://bernsteinbear.com/blog/compiling-a-lisp-0/

                              • tekknolagi 1 month ago
                                I wonder if that's just not in the top 10k words :/
                                • freilanzer 1 month ago
                                  And it's unfinished since 2020.
                                • sp0rk 1 month ago
                                  The SVG equation is very difficult to read if you're using a dark OS theme because the blog uses the OS preference for dark/light theme (and doesn't seem to give an option to change it manually, either.)
                                  • dheera 1 month ago
                                    On the side, not criticizing OP but I hate the word "cosine similarity" and I wish people would just call it a "normalized dot product" because anyone who took sophomore-level university calculus would get it, but instead we all invented another word
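For reference, the equivalence is easy to check in a few lines. This is a plain-Python sketch with hypothetical helper names: cosine similarity computed directly, and the same value obtained by normalizing both vectors first and taking an ordinary dot product.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    # Euclidean length of the vector.
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    # Dot product divided by the product of the two lengths.
    return dot(a, b) / (norm(a) * norm(b))


def normalize(a):
    # Scale the vector to unit length.
    n = norm(a)
    return [x / n for x in a]


a, b = [3.0, 4.0], [4.0, 3.0]
print(cosine_similarity(a, b))            # dot = 24, lengths 5 and 5
print(dot(normalize(a), normalize(b)))    # same value up to float rounding
```

Both helpers assume non-zero vectors; a zero vector has no direction, so the similarity is undefined there.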
                                    • tekknolagi 1 month ago
                                      Fixed, I think? Let me know
                                      • DylanSp 1 month ago
                                        Works now (I noticed the same issue).
                                    • kaycebasques 1 month ago
                                      > The idea behind the search engine is to embed each of my posts into this domain by adding up the embeddings for the words in the post.

                                      Ah, OK! I never really grokked how to use word-level embeddings. Makes more sense now.
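A toy version of the quoted idea, for illustration only: sum the word vectors of a post to get a post vector, do the same for the query, and rank posts by cosine similarity. The 3-dimensional `EMBEDDINGS` table and the `posts` dict are made up here; a real setup would load trained word2vec vectors instead.

```python
import math

# Hypothetical toy embeddings; real ones would come from a trained model.
EMBEDDINGS = {
    "lisp": [0.9, 0.1, 0.0],
    "compiler": [0.8, 0.3, 0.1],
    "search": [0.1, 0.9, 0.2],
    "engine": [0.2, 0.8, 0.3],
    "prank": [0.0, 0.1, 0.9],
}


def embed(text):
    # Add up the embeddings of every known word; unknown words are skipped.
    vec = [0.0, 0.0, 0.0]
    for word in text.lower().split():
        for i, x in enumerate(EMBEDDINGS.get(word, [])):
            vec[i] += x
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, posts):
    # Best match first.
    q = embed(query)
    return sorted(posts, key=lambda name: cosine(q, embed(posts[name])), reverse=True)


posts = {
    "compiling-a-lisp": "compiling a lisp compiler",
    "search-engine": "a simple search engine",
    "pranks": "a prank",
}
print(rank("lisp", posts))
```

Note how "compiling" and "a" contribute nothing because they are missing from the table, which is the out-of-vocabulary problem discussed elsewhere in the thread.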

                                      • skarz 1 month ago
                                        Is 'grokked' a common verb now? I had never even heard the word until Musk's AI.
                                    • cosmicgadget 1 month ago
                                      This was a really nice read. Now I have no excuse not to upgrade my blog search. I do feel that I'll have a ton of long tail words like 'prank'.
                                      • vojtechrichter 1 month ago
                                        I really like people playing around with technology many take for granted without understanding its core, underlying principles
                                        • swyx 1 month ago
                                          this embeds words with word2vec, which is like 10 years old. at least use BERT or sentencetransformers :)
                                          • gthompson512 1 month ago
                                            I have been thinking a bit lately about how much sense that makes compared to just using word vectors, since traditional queries are super short and often keyword based (like searching for "ground beef" when wanting "ground beef recipes I can cook easily tonight") and so lack most of the context that BERT or similar gives you. I know there are methods like using separate embeddings for queries and such, but maybe a basic word-based search could be more useful, especially with something like fastText for out-of-vocabulary terms.