Show HN: I'm building a personal web search engine

8 points by a5huynh 3 years ago | 3 comments
  • a5huynh 3 years ago
    Hey HN, I'm building an open source search platform that lives on your device, indexing what you want and exposing it to you through a super simple & super fast interface.

    I took the idea of adding "site:reddit.com" to your Google searches and expanded on it with "lenses", which add context to your search query and give the crawler direction in terms of what to crawl & index. Since the index lives on your device, all queries run locally; your searches are never relayed to any third-party search engine. Think of it as your personal bookcase at home vs. the Library of Congress.
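
    To give a concrete picture, a lens is essentially a scoped allow-list. Very roughly (field names here are illustrative, not the actual on-disk format):

        // Rough sketch only: a lens scopes the crawler to a set of
        // domains / URL prefixes and tags queries with that same context
        // at search time. All names below are hypothetical.
        struct Lens {
            name: String,              // e.g. "rustlang"
            domains: Vec<String>,      // crawl whole domains, e.g. "docs.rs"
            url_prefixes: Vec<String>, // or narrower slices of a site
        }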

    It's still in a super early state, but I'd love for people to start using it, provide feedback, and tell me what sort of lenses they'd want to build and search through!

    Some details about the stack for the interested:

        * All Rust, with some HTML/CSS for the client.

        * Client is built with yew + tauri.

        * Backend uses tantivy to index the web pages and sqlite3 to hold metadata and the crawl queue (quick indexing sketch below).
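
    To give a flavor of the indexing side, here's a minimal tantivy sketch (in-memory index, recent tantivy API; the real pipeline obviously persists to disk and is fed pages by the crawler):

        use tantivy::schema::{Schema, STORED, TEXT};
        use tantivy::{doc, Index};

        fn main() -> tantivy::Result<()> {
            // Minimal schema: the URL is stored for display, the page
            // body is tokenized and indexed for full-text search.
            let mut builder = Schema::builder();
            let url = builder.add_text_field("url", TEXT | STORED);
            let content = builder.add_text_field("content", TEXT);
            let index = Index::create_in_ram(builder.build());

            let mut writer = index.writer(50_000_000)?; // 50 MB indexing budget
            writer.add_document(doc!(
                url => "https://example.com/",
                content => "Example page text handed over by the crawler",
            ))?;
            writer.commit()?;
            Ok(())
        }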
    
    Thanks in advance!
    • marginalia_nu 3 years ago
      Cool.

      But a warning, based on doing quite a lot of crawling from home through my own search engine: it's very easy for your IP or IP block to end up on annoying graylists where basically every other website you visit throws a CAPTCHA in your face. I'm aware of this risk and use a VPN for most of my private web surfing anyway, so it's not much of a bother for me, but it's a bit sketchy to expose other people to that risk through something like this.

      It would probably be wise to use canned crawls for major websites, maybe something like trading WARCs <https://en.wikipedia.org/wiki/Web_ARChive> over BitTorrent or whatever. Most of these sites don't change that often in the places that matter.
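
      For anyone curious, reading WARCs is simple enough that a std-only Rust sketch fits in a comment. This assumes an uncompressed file at a made-up path; real WARCs are typically gzipped per record, so an importer would decompress first:

          use std::fs::File;
          use std::io::{BufRead, BufReader, Read};

          fn main() -> std::io::Result<()> {
              // "crawl.warc" is a placeholder path.
              let mut reader = BufReader::new(File::open("crawl.warc")?);
              let mut line = String::new();
              while reader.read_line(&mut line)? > 0 {
                  if line.starts_with("WARC/") {
                      // Read this record's headers up to the blank separator line.
                      let (mut uri, mut len) = (None, 0u64);
                      loop {
                          let mut hdr = String::new();
                          reader.read_line(&mut hdr)?;
                          let hdr = hdr.trim_end();
                          if hdr.is_empty() { break; }
                          if let Some(v) = hdr.strip_prefix("WARC-Target-URI:") {
                              uri = Some(v.trim().to_owned());
                          } else if let Some(v) = hdr.strip_prefix("Content-Length:") {
                              len = v.trim().parse().unwrap_or(0);
                          }
                      }
                      // Skip the record body; an importer would feed it to the indexer.
                      let mut body = (&mut reader).take(len);
                      std::io::copy(&mut body, &mut std::io::sink())?;
                      if let Some(u) = uri { println!("{u} ({len} bytes)"); }
                  }
                  line.clear();
              }
              Ok(())
          }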

      • a5huynh 3 years ago
        Thanks for the feedback! I’ll keep that in mind as this is built out. Fortunately the initial bootstrapping uses data from the Internet Archive, and the crawls afterwards only check for updates (at a reasonable rate), so the number of URLs actually being hit ends up much, much lower than you'd think.
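
        As a sketch of what "a reasonable rate" means in practice (names made up, not the actual code), the re-crawl queue can sit behind a per-domain gate like this:

            use std::collections::HashMap;
            use std::time::{Duration, Instant};

            // Per-domain politeness gate: a URL is only released for
            // fetching if enough time has passed since that domain was
            // last hit.
            struct CrawlGate {
                min_delay: Duration,
                last_hit: HashMap<String, Instant>,
            }

            impl CrawlGate {
                fn ready(&mut self, domain: &str) -> bool {
                    let now = Instant::now();
                    match self.last_hit.get(domain) {
                        Some(&t) if now.duration_since(t) < self.min_delay => false,
                        _ => {
                            self.last_hit.insert(domain.to_owned(), now);
                            true
                        }
                    }
                }
            }

            fn main() {
                let mut gate = CrawlGate {
                    min_delay: Duration::from_secs(2),
                    last_hit: HashMap::new(),
                };
                assert!(gate.ready("example.com"));
                assert!(!gate.ready("example.com")); // too soon, re-queue for later
            }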