What Happens When You Put a Database in the Browser?

106 points by self 1 year ago | 70 comments
  • xnorswap 1 year ago
    I don't understand these "DB in browser" products.

    If the data "belongs" to the server, why not send the query to the server and run it there?

    If the data "belongs" on the client, why have it in database form, particularly a "data-lake" structured db, at all?

    A lot of the benefits of such databases come from their ability to optimise queries in contexts where the data can't fit in memory (and possibly not even on a single disk/machine), plus additional durability and atomicity guarantees. If the data is small enough that sending it to a client is reasonable, then it's small enough to fit in memory, which means it'll be fast to query no matter how you go about it.

    The page says one advantage is "Ad-hoc queries on data lakes", but isn't that possible with the most basic form that simply sends a query to the database?

    What am I failing to understand about this category of products?

    • sktrdie 1 year ago
      I think a large benefit to "DB in browser" is about the whole "local-first" software movement. A good overview is written here https://sqlsync.dev/posts/stop-building-databases/

      The essence of the approach is that a large majority of FE apps have constructed quite complex caching layers to improve performance over "querying backend data" - take a look at things like Next.js or React Query. As the post above mentions, they're essentially rebuilding databases. So instead this approach just moves the db to the browser, along with a powerful syncing layer.

      I think it's an approach that deserves more attention, especially to improve DX, since we end up writing a whole lot of database-related logic on clients. Might as well just use an actual database on the client, then.

      • jerf 1 year ago
        To the other good points, I'd add that once people start making queries that aren't O(1) or O(n), modern computers and networks can move around enough data that it becomes practical for a remote client to use its own rather substantial power to answer questions over a gigabyte or two of data. And if enough people are doing this at once, running non-trivial queries, their combined power can easily exceed any reasonable amount of cloud compute you might be willing to deploy for the problem. I may be happy for dozens of clients to download some data and chew through some O(n log n) with mixed O(n^2) components on their own, whereas I would not care to provision enough cloud compute to handle it all.

        I think people forget that as cheap as cloud compute is, client compute is even cheaper. Generally it's already paid for. (Marginal electricity costs are lost in the noise of everything else you're supporting a client user with.) And while the days of doubling every 1.5 years may be gone, clients do continue to speed up, and they are especially speeding up in ways that databases can take advantage of (more cores, more RAM, more CPU cache). Moving compute to clients can be a valuable thing in some circumstances on its own.

        • SoftTalker 1 year ago
          This exact argument has been made since people had PCs on their desks (the 1990s, at least). We're still trying to figure out how to do it well.
          • jerf 1 year ago
            Yes, but as the numbers advance from "the remote desktop has like 128KB" to "the remote desktop can chew through gigabytes without much stress", the delta between O(n) and O(n^2) opens up a lot more. It is perhaps a bit counterintuitive, but as systems grow in capability that delta grows.
        • IMTDb 1 year ago
          > If the data "belongs" to the server, why not send the query to the server and run it there?

          We could, but if the data size is not that huge, sending it once to the client and then letting the client perform the queries can be desirable. The tool works without internet access, the latency is much better, all results are consistent, etc.

          > If the data "belongs" on the client, why have it in database form, particularly a "data-lake" structured db, at all?

          Just because it fits in memory doesn't mean that the shape of the data does not matter. A data structure optimised for whatever query/analysis you want to perform has a significant impact on how fast and efficiently you can perform those operations

          • jt2190 1 year ago
            This argument boils down to:

            - database engines should always run on a server, never on the user’s own machine.

            - local data is never larger than the database engine, and it’s better in all cases to move all of the local data to the server where the database engine runs.

            - local data can always be moved to another machine, regardless of sensitivity.

            But are these always true, all the time? It seems to me that there are many use cases (e.g. training local AI models) where these rules should not be so absolute.

            • jimberlage 1 year ago
              There is an increasing subset of people (think those that used to work in MS Access, Excel power users) who learned SQL in a business IT course or on the job. Their data isn't sized to need a full DB, but they do want to do analyses that use window functions and other things SQL makes more natural than Excel pivot tables or formulas. They may have to give reports to an equally technical boss who would like to play with the report and explore assumptions using SQL.

              The data size is typically not large; SQL and integrating with cloud-based spreadsheets is the selling point.
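
              For example, a rolling average is a one-liner with a window function but fiddly in a spreadsheet. A made-up illustration (assuming `conn` is whatever SQL connection the tool exposes):

                  // Illustrative only: rolling 3-month average via a window function.
                  const report = await conn.query(`
                    SELECT month, revenue,
                           avg(revenue) OVER (ORDER BY month ROWS 2 PRECEDING) AS rolling_avg
                    FROM sales
                  `);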

              • xnorswap 1 year ago
                Okay, so it's more about SQL than the database aspect?

                I can see there'd be demand for that, but I'm not convinced that porting a full-scale DB to WASM is the best way to achieve that goal.

                • jerf 1 year ago
                  Once you accept the idea that you're going to run SQL on the browser, the forces are going to inevitably pull you to having an entire DB engine in the browser anyhow. You're going to need the entire query parser, enough of the IO code to be able to read the DB (which is a lot of it), you're going to want the optimizer, you're going to need RPC to send the SQL and read the results... yes, there is certainly some stuff you can trim, but it's not like it's a horrible idea.

                  So even if someone, say, started a startup with the idea "I'm going to present an MVP of an in-browser DB engine people can use", in 5-10 years it'd be a nearly fully-fledged DB anyhow.

                  • jimberlage 1 year ago
                    There are also some data lakes (though I have not really worked with any) that are homegrown and consist of smallish parquet files in a blob store; they might have several databases looking at them at any given time, and an embedded browser DB provides a frontend that is easier to stand up and auth against in some ways.

                    There may not be a reporting DB available to send queries to, in the traditional sense.

                    • ncruces 1 year ago
                      I don't know about DuckDB, but SQLite's Wasm build is well under 1 MiB minified/gzipped.

                      There's a lot that could be trimmed, but at that point, why?

                  • giancarlostoro 1 year ago
                    I built a web application that became a progressive web app because it needed to work offline: it was used by a pilot during their work, and at high altitude internet is not guaranteed. We cached a bit in localStorage, and I'd loop making HEAD requests (there was no "am I online" API available...), and once we knew you were online we would push the locally cached changes up to the DB. In the meantime we still displayed the current data as you'd changed it. It worked out nicely.
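
                    The shape of it, from memory (a simplified sketch; the /ping and /api/changes endpoints and the localStorage key here are made-up stand-ins):

                        const PENDING = "pendingChanges";

                        // Queue an edit locally; works offline.
                        function queueChange(change: unknown): void {
                          const q = JSON.parse(localStorage.getItem(PENDING) ?? "[]");
                          q.push(change);
                          localStorage.setItem(PENDING, JSON.stringify(q));
                        }

                        // Cheap connectivity probe: a HEAD request to our own server.
                        async function isOnline(): Promise<boolean> {
                          try {
                            return (await fetch("/ping", { method: "HEAD", cache: "no-store" })).ok;
                          } catch {
                            return false;
                          }
                        }

                        // Periodically push queued changes once we're back online.
                        setInterval(async () => {
                          const q = JSON.parse(localStorage.getItem(PENDING) ?? "[]");
                          if (q.length === 0 || !(await isOnline())) return;
                          await fetch("/api/changes", {
                            method: "POST",
                            headers: { "Content-Type": "application/json" },
                            body: JSON.stringify(q),
                          });
                          localStorage.setItem(PENDING, "[]");
                        }, 10_000);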

                    There's a lot of static data that never changes; if you can request it once and cache it somewhere, it becomes really useful, and it means fewer requests hitting the server. Heck, Tumblr in the 2010s displayed a lot of blog metadata on the dashboard (follower count, like counts, etc.), and they had outages all the time. I started to see fewer and fewer outages once they stopped showing all that blog metadata on the dashboard, because then you didn't have hundreds of thousands (millions?) of users refreshing to see new content and querying across dozens of databases for likes, follows, etc. It was very common to refresh Tumblr frequently to see the latest content.

                    Imagine if they had cached this data instead and only updated it when a lastUpdated field's value changed.

                    • cduzz 1 year ago
                      I think the "db in browser" idea is interesting because it makes the user pay for the compute for many more of the typical DB tasks.

                      Any time you're sitting at an aggregation point you need to be wary of Moore's law; this includes load balancers, database systems, log aggregation, etc.

                      If your computers are getting N cheaper / faster per year, but you're getting M new clients and all your clients are also getting N cheaper / faster per year, you're going to be getting more traffic faster than you can scale, unless you figure out how to cheat. Cheating may be sharding your internal workloads, but it may also be "make someone else do the work."

                      • tlarkworthy 1 year ago
                        Plus, analytics in the browser is about 10000x more performant, and doesn't contend for a shared resource.
                        • lmeyerov 1 year ago
                          Right - so not a database, but a columnar analytics compute engine. Half of why we wrote arrow js was for pumping server arrow data to webgl, and the other half for powering arrow columnar compute libraries & packages like this.

                          For smooth sub-100ms interactivity, with data in a certain sweet-spot size range, it can be very nice!

                      • zX41ZdbW 1 year ago
                        I tried https://shell.duckdb.org/, but it was a very rough experience.

                        The "delete" button does not work. The "home" button inserts a whitespace. Pasting with "Ctrl+v" also does not work. Every keypress results in blinking, and there is a notable input lag.

                        When I tried a query

                            duckdb> SELECT * FROM 'https://clickhouse-public-datasets.s3.amazonaws.com/github_events/partitioned_json/*.gz'
                               ...> ;
                            Catalog Error: Table with name https://clickhouse-public-datasets.s3.amazonaws.com/github_events/partitioned_json/*.gz does not exist!
                            Did you mean "sqlite_master"?
                            LINE 1: SELECT * FROM 'https://clickhouse-public-datasets.s3....
                        
                        Suggesting the "sqlite_master" table is also misleading.
                        • niutech 1 year ago
                          How would DuckDB know which *.gz are in that folder? Is there a directory listing?
                        • jeroenhd 1 year ago
                          Why run a database in WASM when IndexedDB exists? Browsers already have a database built in, I don't see the need to download another one.
                          • paulgb 1 year ago
                            IndexedDB is a key/value store; DuckDB is a relational analytic database. You would use IndexedDB for something like application data, and DuckDB for (e.g.) a data-heavy dashboard or ad-hoc query tool.
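
                            A rough sketch of the difference (the DuckDB lines assume an already-initialized `conn` from @duckdb/duckdb-wasm; names and URLs are made up):

                                // IndexedDB: key/value storage for application state.
                                const open = indexedDB.open("app", 1);
                                open.onupgradeneeded = () => open.result.createObjectStore("kv");
                                open.onsuccess = () => {
                                  const tx = open.result.transaction("kv", "readwrite");
                                  tx.objectStore("kv").put({ theme: "dark" }, "settings");
                                };

                                // DuckDB: analytic SQL over columnar data.
                                const result = await conn.query(`
                                  SELECT region, avg(revenue) AS avg_rev
                                  FROM 'https://example.com/sales.parquet'
                                  GROUP BY region ORDER BY avg_rev DESC
                                `);
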
                            • niutech 1 year ago
                              You can leverage SQL on top of IndexedDB by using AlaSQL (https://github.com/AlaSQL/alasql) or absurd-sql (https://github.com/jlongster/absurd-sql).
                              • quotemstr 1 year ago
                                So what? We should be structuring a lot of databases as layers on top of transactional key value stores. We should be able to swap out the key value store without changing the higher level query engine.
                                • jampekka 1 year ago
                                  We maybe[1] should, but we are not. AFAIK there's no relational query engine for IndexedDB.

                                  Also, there are a lot more people who can get things done with SQL than using indexed KV-stores. KV stores tending to have horrible APIs doesn't help the situation.

                                  [1] E.g. SQLite started with a gdbm backend, but later rolled their own because of limitations caused by it.

                                  • paulgb 1 year ago
                                    When you build that, I will upvote the HN post to that too ;)
                                • Retr0id 1 year ago
                                  Per [0], surprisingly, you can beat IndexedDB performance by layering another DB engine on top. Discussed on HN: https://news.ycombinator.com/item?id=28156831

                                  [0] https://jlongster.com/future-sql-web

                                  • robmccoll 1 year ago
                                    If you use https://dexie.org/ on top of IndexedDB, you can turn it into a fairly useful document store with secondary indices, an improved query building interface, and reactive queries. It does exactly what you would expect if you built it yourself on top of IndexedDB, but saves a ton of boilerplate. They have a sync service to go with it, but in my usage I managed that myself.
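
                                    Roughly like this (a small sketch using Dexie's documented API; the schema is made up):

                                        import Dexie from "dexie";

                                        const db = new Dexie("friends-db");
                                        // "++id" = auto-increment primary key; "name, age" = secondary indices.
                                        db.version(1).stores({ friends: "++id, name, age" });

                                        await db.table("friends").add({ name: "Ada", age: 36 });
                                        // Indexed query with no manual IndexedDB cursor boilerplate.
                                        const adults = await db.table("friends")
                                          .where("age").aboveOrEqual(18).toArray();
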
                                    • pjc50 1 year ago
                                      IndexedDB should have been SQLite, but for various squishy reasons it isn't, it's something less good.
                                      • jampekka 1 year ago
                                        Because of the IndexedDB API?
                                        • tehbeard 1 year ago
                                          That and Apple breaking it in new and "exciting" ways...
                                        • isodev 1 year ago
                                          It feels very impractical indeed. There's also the size of the compiled Wasm binary to download. All this would be much better done on the server, and if really needed, users could be given a way to download results (ideally with their own preferred tool for fetching files).
                                          • jampekka 1 year ago
                                            A server sounds quite impractical if you could otherwise serve the application statically. Or offline.

                                            Also, having user data on a server causes problems with privacy etc.

                                            • isodev 1 year ago
                                              Well the data has to come from somewhere right? If the goal is to facilitate client-side (bring your own data) scenarios, I'd make a proper native desktop app and take full advantage of the system. A hybrid something running in the browser feels like a compromise between both solutions.
                                          • jt2190 1 year ago
                                            Maybe you’re thinking of WebSQL? https://caniuse.com/sql-storage
                                            • lmeyerov 1 year ago
                                              Columnar analytics are slow in non-columnar engines
                                            • Zambyte 1 year ago
                                                Another interesting option is PouchDB[0], which is a JavaScript implementation of the CouchDB[1] synchronization API. It allows you to achieve eventual consistency between a client with intermittent connectivity and a backend database (see the sketch after the links).

                                              [0] https://pouchdb.com/

                                              [1] https://couchdb.apache.org/
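
                                                A minimal sketch of that setup (using the documented PouchDB API; the remote URL is a placeholder):

                                                    import PouchDB from "pouchdb";

                                                    const local = new PouchDB("todos");
                                                    const remote = new PouchDB("https://example.com/couchdb/todos");

                                                    // Writes land locally first, even while offline...
                                                    await local.put({ _id: "todo:1", text: "buy milk", done: false });

                                                    // ...and live two-way replication converges when connectivity returns.
                                                    local.sync(remote, { live: true, retry: true });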

                                              • gregw2 1 year ago
                                                I am not 100% clear how this works...

                                                If you query a Parquet file from your lake via DuckDB-in-browser, does DuckDB run in WASM on the web client and pull the compressed parquet to your browser where it is decompressed? Or are you connecting some DuckDB on the web client to some DuckDB component on a server somewhere?

                                                I presume yes to the first and no to the second but just checking I have my mental model correct.

                                                • zX41ZdbW 1 year ago
                                                    I have an example of a database hosted on GitHub, which you can query anywhere using the ClickHouse local database engine: https://github.com/ClickHouse/web-tables-demo

                                                  When you do a query, it downloads only the required ranges of required columns from the server and does computations on your machine.

                                                    In the same way you can query any data source with ClickHouse, either local or remote; it's a strict superset of DuckDB in power and usability.

                                                  • 4b11b4 1 year ago
                                                    The data would be sent to the browser
                                                    • throwaway-blaze 1 year ago
                                                        Um, my data lake is measured in PB, not TB. How is that going to work exactly?
                                                      • undefuser 1 year ago
                                                          DuckDB has certain optimizations that allow it to read only parts of a Parquet file. It can also read remote files in a streaming fashion, so it does not have to wait for the entire file to download or hold a large amount of data in memory. Relevant documentation: https://duckdb.org/2021/06/25/querying-parquet.html
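
                                                          In the browser that looks roughly like this (a sketch following the @duckdb/duckdb-wasm quickstart; the Parquet URL is a placeholder):

                                                              import * as duckdb from "@duckdb/duckdb-wasm";

                                                              // Load a Wasm bundle and start the worker (per the library's docs).
                                                              const bundle = await duckdb.selectBundle(duckdb.getJsDelivrBundles());
                                                              const worker = new Worker(URL.createObjectURL(new Blob(
                                                                [`importScripts("${bundle.mainWorker}");`],
                                                                { type: "text/javascript" }
                                                              )));
                                                              const db = new duckdb.AsyncDuckDB(new duckdb.ConsoleLogger(), worker);
                                                              await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

                                                              // Only the needed byte ranges of the file are fetched over HTTP.
                                                              const conn = await db.connect();
                                                              const result = await conn.query(
                                                                "SELECT count(*) FROM 'https://example.com/data.parquet'"
                                                              );
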
                                                        • simonw 1 year ago
                                                            DuckDB can use the HTTP range header trick against Parquet files, which means a lot of analytical questions against multi-GB files can be answered by fetching only small portions of the overall data.

                                                            Here's a post about applying this same trick to SQLite from a few years ago: https://phiresky.github.io/blog/2021/hosting-sqlite-database...
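
                                                            The underlying mechanism is just an HTTP range request (sketch; the URL and byte count are made up):

                                                                // Fetch only the Parquet footer (metadata lives at the end of the file).
                                                                const res = await fetch("https://example.com/data.parquet", {
                                                                  headers: { Range: "bytes=-8192" }, // just the last 8 KiB
                                                                });
                                                                console.log(res.status); // 206 Partial Content if ranges are supported
                                                                const footer = await res.arrayBuffer();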

                                                          • 1egg0myegg0 1 year ago
                                                            Howdy! I work at MotherDuck and DuckDB Labs (part time as a blogger). At MotherDuck, we have both client side and server side compute! So the initial reduction from PB/TB to GB/MB can happen server side, and the results can be sliced and diced at top speed in your browser!
                                                      • threeseed 1 year ago
                                                        They didn't mention the lifecycle of the database.

                                                        Because if it's anything longer-lived than a week, then it could be used by marketers to evade Apple's ITP for retargeting.

                                                        Which would be a huge win for advertisers and a loss for privacy.

                                                        • paulgb 1 year ago
                                                          As far as I can tell they're not storing anything locally, they're pulling the data from Google Cloud Storage as you access it.

                                                          That said, a Wasm module doesn't have access to any storage facilities that the browser doesn't already expose (IndexedDB, OPFS, cookies, etc.) so even if it could be used by marketers, they would gain nothing by using DuckDB for that over just using the same underlying browser storage API.

                                                        • cryptonector 1 year ago
                                                          Supercookies?
                                                          • voidUpdate 1 year ago
                                                            So when I want to browse your website on my phone with a limited data plan, I have to download an entire database client and database, as well as any of your other huge JS libraries?
                                                            • wruza 1 year ago
                                                              I'd prefer that to page reloads that lose focus, scroll position, and any input between clicking a checkbox and receiving a response. Bonus points if I only have to download a db-diff next time, compared to downloading 5x the db's worth through POST /get-products?<filters>&page=<n> in every e-shopping session. Rarely does a client-facing business database exceed your average YouTube cat video in size (assuming pics are stored as URLs).
                                                            • localfirst 1 year ago
                                                              Ten years ago your concern might have been valid, but in 2024, given that most people in developed economies have ample 4g/5g data plans and generous RAM/storage on their mobile phones, it makes more sense to just send them the whole DB once.

                                                              This is the same spirit as Electron vs native. Of course native uses less RAM and is more efficient, but it's not noticeable by the end user (because they don't care how it's built). So we let developers write desktop apps in HTML/JS, and end users usually don't notice until they open up Task Manager.

                                                              We have abundance of data, memory, storage that continually drops in cost. It makes a lot of sense to simply download the DB once and then access it locally with optional syncing.

                                                              This is the future, because we can't count on mobile phones to continuously maintain connectivity, especially with limited data plans. Far better to simply allocate a % of your fixed data plan and decide which local-first apps to install on your phone.

                                                              • voidUpdate 1 year ago
                                                                I live in a 4g/5g country (with 5g disabled on my phone because it seems to perform worse than 4g), with a 2gb/month data plan. I've usually stayed inside my data budget, but I'm cutting it a little close this month and I've blown past it a couple of times when I've forgotten to switch back to wifi. Sure, I live in a country where I can buy infinite data, but I don't see the point when I rarely use more than 2gb, and a 2gb plan is cheaper
                                                              • jimberlage 1 year ago
                                                                I think their tool is designed for working professionals on laptops - seems like analysts doing traditional business insights reporting. Perhaps there's a world where that target audience is using their phones, but I doubt it's a use case they're too concerned with.

                                                                I would be worried about this approach for something that doesn't need to expose a SQL interface though! It's not the right fit for most webapps.

                                                                • afavour 1 year ago
                                                                  If it’s an app you use regularly that trade off could easily be worth it.
                                                                  • voidUpdate 1 year ago
                                                                    If it's an app, you're not in the browser anymore anyway, so you don't have to worry about this browser Wasm stuff. You can store data locally anyway and don't have to pull it from the server.
                                                                    • afavour 1 year ago
                                                                      > If it's an app, you're not in the browser anymore anyway

                                                                      Webapps would beg to differ

                                                                    • jampekka 1 year ago
                                                                      Also, all the code and data can easily be cached with e.g. Service Workers, so it will be downloaded only once.
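
                                                                      E.g. a cache-first Service Worker is only a few lines (sketch; the cache name is arbitrary):

                                                                          // sw.ts: serve the Wasm bundle and data files from cache after first load.
                                                                          self.addEventListener("fetch", (event: any) => {
                                                                            event.respondWith(
                                                                              caches.open("db-v1").then(async (cache) => {
                                                                                const hit = await cache.match(event.request);
                                                                                if (hit) return hit;
                                                                                const res = await fetch(event.request);
                                                                                if (event.request.method === "GET") cache.put(event.request, res.clone());
                                                                                return res;
                                                                              })
                                                                            );
                                                                          });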