Google Cloud Storage FUSE

149 points by mvolfik 2 years ago | 108 comments
  • ofek 2 years ago
    I do appreciate that Google is now officially supporting gcsfuse because it genuinely is a great project. However, their Kubernetes CSI driver seems to have in large part copied code from the one I and a co-maintainer have been working on for years:

    - https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver

    - https://github.com/ofek/csi-gcs

    Here is the initial commit: https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/c...

    Notice, for example, not just the code but also the associated files. The Dockerfile blatantly copies the one from my repo, down to the dual license I chose because I was very into Rust at the time. Or take a look at the deployment examples, which use Kustomize, which I like but which is very uncommon; most Kubernetes projects provide Helm charts instead.

    They were most certainly aware of the project because Google reached out to discuss potential collaboration, but after I replied they never responded back: https://imgur.com/a/KDuf9mj

    • SamuelAdams 2 years ago
      Your repository seems to have both an Apache and MIT license. What license are you distributing your code under?

      Edit: I see you said it’s dual licensed. From the look of it both allow Google or any other company to copy and reuse code, so what are you upset about?

      • warent 2 years ago
        I don't mean to be rude but yeah, this is exactly what AGPL was intended to combat. It's a lesson learned for these developers, and Google did nothing wrong or even unethical imo.

        A lot of people treat licensing emotionally (e.g. choosing WTFPL, or picking licenses that feel good or that they saw in another project), but businesses are very logical and will unfortunately exploit this.

        The irony is that Google probably would not have done this if the codebase just omitted a license entirely. When I worked there, they wouldn't allow OSS with no license.

        • judge2020 2 years ago
          > The irony is that Google probably would not have done this if the codebase just omitted a license entirely. When I worked there, they wouldn't allow OSS with no license.

          This is because a license is the only thing that makes it legal to use someone else's code. Code being publicly accessible doesn't mean it's free-as-in-freedom to use.

          • soraminazuki 2 years ago
            > The irony is that Google probably would not have done this if the codebase just omitted a license entirely.

            Yes, they would. Google's code appears not to include attribution to the OP, so either Google authored the code or violated the license. One would hope it's the former.

          • ofek 2 years ago
            Either, the choice is up to you.

            edit: as I say in a sibling comment, this act is of course legally allowed, but is bad practice

            • MuffinFlavored 2 years ago
              > but is bad practice

              Says who/why?

              "It's mean/I don't get a callout/the credit I deserve" is "bad practice"?

              Why aren't you honored that your product was good enough for Google to absorb and build off of? I'd be super proud.

              • Matl 2 years ago
                This is why free software and open source aren't the same. Free software is about this kind of fairness, among other things. The simplicity of open source does have its downsides.
                • aprdm 2 years ago
                  Why is it bad practice ... ?
                  • prpl 2 years ago
                    Unless "bad" has a legal definition, it's well within the rights you've granted them.

                    You can’t have a license that says one thing and then rely on implicit community norms to expect something to happen. (For one, you’re assuming the person is even aware of the community norms)

                • js2023 2 years ago
                  Hi Ofek,

                  I am a contributor who works on the Google Cloud Storage FUSE CSI Driver project. The project is partially inspired by your CSI implementation. Thank you so much for the contribution to the Kubernetes community. However, I would like to clarify a few things regarding your post.

                  The Cloud Storage FUSE CSI Driver project does not have “in large part copied code” from your implementation. The initial commit you referred to in the post was based on a fork of another open source project: https://github.com/kubernetes-sigs/gcp-filestore-csi-driver. If you compare the Google Cloud Storage FUSE CSI Driver repo with the Google Cloud Filestore CSI Driver repo, you will notice the obvious similarities, in terms of the code structure, the Dockerfile, the usage of Kustomize, and the way the CSI is implemented. Moreover, the design of the Google Cloud Storage FUSE CSI Driver included a proxy server, and then evolved to a sidecar container mode, which are all significantly different from your implementation.

                  As for the Dockerfile annotations you pointed out in the initial commit, I did follow the pattern in your repo because I thought it was the standard way to declare the copyright. However, it didn't take me too long to realize that the Dockerfile annotations are not required, so I removed them.

                  Thank you again for your contribution to the open source community. I have included your project link on the readme page. I take the copyright very seriously, so please feel free to directly create issues or PRs on the Cloud Storage FUSE CSI Driver GitHub project page if I missed any other copyright information.

                  • mox1 2 years ago
                    You licensed the code as MIT - https://github.com/ofek/csi-gcs/blob/master/LICENSE-MIT

                    Are you saying you have an issue with them copying your MIT licensed code?

                    • soraminazuki 2 years ago
                      If the GP is right, Google is violating the terms of the license. A quick search of the code reveals that Google's code doesn't include copyright headers with attribution to the GP. This could be stolen code.
                      • ofek 2 years ago
                        Yes, copying the code without following up to actually collaborate, or even forking to show attribution, is I think bad practice for such a large organization, or any entity for that matter.
                        • toyg 2 years ago
                          Just to clarify - the licenses you chose do not require any collaboration or "give back", they only require minimal attribution buried in some readme.

                          You can absolutely berate them for copying without attribution, but that's it; they don't owe you anything else.

                          • tantalor 2 years ago
                            If that's what you want, then update your license to require that.
                        • whuan 2 years ago
                          Are you accusing Google of having "in large part copied code" based on an old commit that isn't even used in this official launch? Do you have any evidence from their recent commits? At least I don't see that the two current repos are anywhere alike, except that you both implement the same interface. Also, they did reach out to you and you just didn't respond, so why are you complaining now?

                          It makes me sad that no one here cares whether your accusation is true. I'd expect you to provide more convincing evidence, but it looks like the accusation isn't even true. It's not fair to those contributors, man; I hope you can apologize.

                          • ofek 2 years ago
                            I'm not sure you thoroughly read what I wrote, but I did respond to them. This is not a false accusation, as you claim; you can check the contents of the repo in its current state.

                            Per the licenses they can copy but they must maintain attribution which has not been done.

                          • ofek 2 years ago
                            Update: attribution has been added to the readme file https://github.com/GoogleCloudPlatform/gcs-fuse-csi-driver/c...
                            • penciltwirler 2 years ago
                              Haha a lot of funny comments here. I think overall it's neither here nor there. You should be proud that the "elites" at Google copied your code ;)
                            • MontyCarloHall 2 years ago
                              I’ve experimented with using gcsfuse and its AWS equivalent, s3fs-fuse in production. At best, they are suited to niche applications; at worst, they are merely nice toys. The issue is that every file system operation is fundamentally an HTTP request, so the latency is several orders of magnitude higher than the equivalent disk operation.

                              For certain applications that consistently read limited subsets of the filesystem, this can be mitigated somewhat by the disk cache, but for applications that would thrash the cache, cloud buckets are simply not a good storage backend if you desire disk-like access.

                              What I would really like to see is a two-tier cache system: most recently accessed files are cached to RAM, with less recently accessed files spilling over to a disk-backed cache. That would open up a world of additional applications whose useful cache size exceeds practical RAM amounts.
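
The two-tier scheme described above can be sketched in a few lines (a toy illustration, not a gcsfuse feature; all names here are made up): an in-memory LRU whose coldest entries spill to a disk directory instead of being dropped, with a double miss signalling that the caller should fetch from the bucket.

```python
import os
import tempfile
from collections import OrderedDict

class TwoTierCache:
    """Toy RAM-first LRU cache whose cold entries spill to a disk directory."""

    def __init__(self, ram_capacity, spill_dir=None):
        self.ram_capacity = ram_capacity
        self.ram = OrderedDict()  # key -> bytes, most recently used last
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="tiercache-")

    def _disk_path(self, key):
        return os.path.join(self.spill_dir, key.replace("/", "_"))

    def put(self, key, data):
        self.ram[key] = data
        self.ram.move_to_end(key)
        while len(self.ram) > self.ram_capacity:
            cold_key, cold_data = self.ram.popitem(last=False)
            with open(self._disk_path(cold_key), "wb") as f:
                f.write(cold_data)  # spill the coldest entry to disk

    def get(self, key):
        if key in self.ram:
            self.ram.move_to_end(key)  # refresh recency
            return self.ram[key]
        path = self._disk_path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                data = f.read()
            os.remove(path)
            self.put(key, data)  # promote back into the RAM tier
            return data
        return None  # miss in both tiers: caller fetches from the bucket
```

A real implementation would also need invalidation when the underlying objects change, which is the hard part.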

                              • crazygringo 2 years ago
                                This seems overly pessimistic to me.

                                Sure you're not going to use this as a consumer in place of a local disk, nor are you going to use this as part of your web app.

                                But there are lots of situations in reporting, batch/cron jobs, data processing, and general file administration where it's incredibly easier to use the file system interface than to use an HTTP API via a cloud storage library. Which FUSE is a godsend for. The latency doesn't matter in these cases for one-off things or scripts that already take seconds/minutes/hours anyways.
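
As a concrete illustration of why the filesystem interface is so convenient for this kind of batch work (a sketch using a temporary directory standing in for a gcsfuse mount; the bucket and file names are hypothetical):

```python
import tempfile
from pathlib import Path

# Temporary directory standing in for a gcsfuse mount point; in real use
# this would be e.g. Path("/mnt/my-bucket") after running
# `gcsfuse my-bucket /mnt/my-bucket`.
mount = Path(tempfile.mkdtemp(prefix="fake-mount-"))
(mount / "reports").mkdir()
(mount / "reports" / "2023-01.csv").write_text("region,total\nus,10\neu,7\n")

# Once the bucket is mounted, a batch job is ordinary file code: no client
# library, no pagination loop, no auth plumbing inside the script itself.
total = 0
for csv_file in sorted(mount.glob("reports/*.csv")):
    for row in csv_file.read_text().splitlines()[1:]:  # skip the header row
        total += int(row.split(",")[1])

print(total)  # 17
```

The equivalent against the raw HTTP API would need a storage client, credentials, and an object-listing loop before any of the actual work starts.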

                                So no this isn't niche or a toy. It's a fantastic production tool for a lot of different common uses. It's not for everything but nothing is. Use the right tool for the job.

                                • retrocryptid 2 years ago
                                  In the old days, we had a system called NFS (Network File System) where, yes, you may decide to use only remote disks. There were several advantages apart from lowering the cost of disks, mainly that you could centrally manage boot images for a fleet of machines. Then we got the web and everyone seemed to assume you could do the same thing over the internet.

                                  I agree with you, I would prefer a local disk to one with 100+ msec of latency and local storage prices are at the point where the right answer is probably "just add local storage."

                                  But I watch with some sympathy the small army of sys-admins (something like 15-20 people) responsible for managing the 3000+ Macs our company uses and remember the 2 person staff which supported the 1500+ diskless workstations from my years at a sadly defunct mini-super-computer manufacturer. It was quite nice... you could go to any machine and log in and your desktop would follow you. I'm told doing the same thing with MSFT requires 10-20 people just to manage the AD hardware (though as a unix-fan, I hang out with other unix-fans who are notoriously rude to MSFT, so maybe it's only 5-10 people needed to manage the AD instance.)

                                  • aprdm 2 years ago
                                    Not old days. NFS is still widely used in the industry. In fact, some NFS systems for high-end compute farms cost millions of dollars, e.g. Isilon.
                                    • isanjay 2 years ago
                                      I still use NFS in my home.
                                    • MontyCarloHall 2 years ago
                                      Applications for which filesystem-like access is important (i.e. requiring lots of POSIX file I/O system calls, e.g. read(2)/write(2)/lseek(2)) but latency is unimportant seem pretty niche to me. If you don't need any of the POSIX syscalls, it's not that much more difficult to work with bucket URLs vs. file paths — the general format is the same, i.e. slash-delimited file/directory hierarchies.
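
To make the kind of POSIX file I/O in question concrete (this sketch uses a local stand-in file; over a gcsfuse mount the same syscalls are routed to the FUSE daemon, which typically has to satisfy each read with an HTTP request):

```python
import os
import tempfile

# Local stand-in file. Over a gcsfuse mount these same syscalls go to the
# FUSE daemon, which must satisfy each read with an HTTP request
# (typically a ranged GET such as `Range: bytes=4-6`), hence the latency.
fd, path = tempfile.mkstemp()
os.write(fd, b"0123456789")

os.lseek(fd, 4, os.SEEK_SET)  # lseek(2): position at byte 4
chunk = os.read(fd, 3)        # read(2): 3 bytes -> one ranged request
os.close(fd)
os.remove(path)

print(chunk)  # b'456'
```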
                                      • vrosas 2 years ago
                                        Not everything is a webserver. There's a lot of software out there that doesn't expect files to exist anywhere but on disk, and it's not worth fetching them all from cloud storage before you begin working on the data. It's easier just to GCSFuse a bucket to a VM and let the user do what they will. Works great for ad-hoc analysis of poorly structured or unstructured data.
                                      • ninkendo 2 years ago
                                        The problem is that such systems have a habit of growing in scope until they reach a point where you really do need the more optimal access patterns of using the real HTTP APIs, and the inefficiencies of emulating the full filesystem API will gradually start to bite you. Maybe you’re lucky enough that that won’t happen, but it’s important to understand it for the quick hack job it is, IMO.
                                        • lazide 2 years ago
                                          In most situations that time is years, decades, or ‘never’. Which is fine.

                                          Not everyone or everything scales faster than bandwidth and/or CPU does.

                                          • qwertox 2 years ago
                                            I agree. For example if you want to use Google's ASR (Automated Speech Recognition), if your file is longer than 1 minute in duration, you first need to upload it to a bucket, which is a lot of added complexity compared to a regular HTTP POST.

                                            Just copying the file to a mounted bucket would make this a lot easier.

                                            Then again, how does one get the metadata of the uploaded file?

                                            • vrosas 2 years ago
                                              Calling any software system "niche" is kind of hilarious, as if anything that isn't Postgres is a massive failure. It's not supposed to be a high-performance data cache.

                                              My company uses GCSFuse for ad-hoc analysis/visualization of large but poorly structured output from our lifesciences jobs and it works just fine for that.

                                            • thraxil 2 years ago
                                              Yep. I once inherited a system where the previous team had used GCSFuse to back the `/etc/letsencrypt` directory on a cluster of nginx webservers. It "worked" and may have been a reasonable approach at the time they built it, avoiding setting up a single "master" to handle HTTP-01 challenges (and it was before GCP's HTTPS LB could handle more than a handful of domains/certificates). The problem was that as the number of domains/certificates grew, nginx startup or config reload time got slower and slower, since nginx insists on stat-ing and reading every single file in that directory in the process. Eventually it started running into request throttling on the storage bucket. It's no fun when `nginx -s reload` takes two minutes and sometimes fails completely.
                                              • netheril96 2 years ago
                                                The previous team's biggest mistake was storing private keys unencrypted in the cloud, not the performance part.
                                                • thraxil 2 years ago
                                                  I mean... literally every VM running nginx or apache that I've ever seen has had the SSL certs just sitting on the filesystem in /etc/ssl or /etc/letsencrypt or similar. All of letsencrypt's documentation points people in that direction.
                                                  • oittaa 2 years ago
                                                    My understanding is that everything is encrypted by default in GCP. Though you need to manually configure encryption keys if you want to prevent Google ever having access to your data.
                                                • linsomniac 2 years ago
                                                  >What I would really like to see is a two-tier cache system

                                                   Is there any sort of Linux HSM (Hierarchical Storage Manager)? I haven't seen any and have been a bit surprised nothing has really developed there. An HSM can manage putting hot data in RAM or on SSDs, colder or larger data on spinning rust, and deep-frozen data on a tape silo or in cloud storage...

                                                   Some NAS devices and RAID cards support two-tier caching or data migration using SSDs, where hot or highly random data (usually identified by smaller write sizes) goes to the SSDs and can then migrate to the spinning discs.

                                                  I've done some "poor mans" version of this using LVM, where I can "pvmove" blocks of a logical volume between spinning discs and SSDs, which is pretty slick, but a very crude tool.

                                                  • folmar 2 years ago
                                                    CASTOR comes to mind for a start.

                                                     Take a look at the CERN paper https://iopscience.iop.org/article/10.1088/1742-6596/331/5/0... as they have a large use case.

                                                    • tadfisher 2 years ago
                                                       Not a general kernel facility that I know of. I use FS-Cache (cachefilesd) every day though; my Steam data directory lives on NFS, and I set it up with 100GB of LRU cache storage. This way I can avoid the "backup/restore" dance and have all my games installed, at the cost of waiting up to a few minutes to warm the cache for a new game.
                                                      • pdimitar 2 years ago
                                                        I don't know about a manager per se but `bcachefs` for Linux seems to do a good chunk of what you're after.
                                                      • markstos 2 years ago
                                                        I once evaluated using s3fuse for managing about 36 million images. The old storage model was on a filesystem so it was supposed to make a smooth transition to the cloud.

                                                         AWS Premium Support wisely advised me against it, not just because of latency but also because the abstraction makes /far/ more API calls than a native solution would.

                                                        After a bit of testing to confirm, I switched to using native API calls. That code was easy to write and the performance was great. I've been wary of cloud FUSE adapters ever since.

                                                        • lazide 2 years ago
                                                           FUSE adapters in general are not for production use, in my experience. They're great for one-off convenience use or basic admin scripts.
                                                        • ashishbijlani 2 years ago
                                                          I'm working on optimizing FUSE using eBPF (ExtFUSE [1]) and adding a caching layer exactly as you mentioned. Will post publicly when ready.

                                                          1. https://github.com/extfuse/extfuse

                                                          • vamega 2 years ago
                                                            Is work on this continuing (or restarting)? I had heard of this a few years ago, but thought the project was shelved.
                                                            • ashishbijlani 2 years ago
                                                              The project is active (just not merged in the kernel yet). Please DM me for questions.
                                                          • pjc50 2 years ago
                                                            > What I would really like to see is a two-tier cache system: most recently accessed files are cached to RAM, with less recently accessed files spilling over to a disk-backed cache. That would open up a world of additional applications whose useful cache size exceeds practical RAM amounts

                                                            This is really hard to get right if the origin cloud storage is anything other than immutable. Otherwise you're in for a world of cache invalidation and consistency pain.

                                                            I've gradually come round to the other opinion: there should be devices that sit on the PCIe/NVMe bus and provide a blob storage API rather than a block one, and there should be an operating system blob API that is similar to but not identical to the filesystem one.

                                                            • 8organicbits 2 years ago
                                                               Same experience. I remember opening a .docx in Word and watching it hang or stutter at different operations. I think you'd need very reliable, low-latency networking for this to be anything but a painful-to-use toy.

                                                               I'd be curious to see how it works running on EC2, especially with an S3 endpoint in the VPC. Although I still think you'd be better served by using S3 as an object store, given the option to build it right.

                                                              • renewiltord 2 years ago
                                                                 Catfs is not super production-ready (there are some small changes you need to make in inode handling), but you can do this. We have it on top of goofys. They both need a few changes to work under load, but what we do is quite standard:

                                                                1. Goofys for S3 FUSE

                                                                2. Catfs for local disk caching

                                                                3. Linux caches in memory

                                                                4. Mmap file means processes share it

                                                                5. One device then exports this over the network to other machines, each of which have an application layer disk cache.

                                                                6. Machines are linked via 10 GigE (we use SFP+).

                                                                Overall the goofys and catfs guy (kahing) wrote very performant software. Big fan.
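
Step 4 in the list above relies on the fact that file-backed mmap mappings share page-cache pages. A minimal sketch, with two mappings of the same file standing in for two processes, showing that there is one copy of the data:

```python
import mmap
import os
import tempfile

# Two independent mappings of the same file (as two cooperating processes
# would have): both are backed by the same page-cache pages, so the cached
# data exists once in RAM rather than once per process.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello from the cache layer")
os.close(fd)

with open(path, "r+b") as f1, open(path, "rb") as f2:
    m1 = mmap.mmap(f1.fileno(), 0, access=mmap.ACCESS_WRITE)
    m2 = mmap.mmap(f2.fileno(), 0, access=mmap.ACCESS_READ)
    m1[0:5] = b"HELLO"       # a write through one mapping...
    shared = bytes(m2[0:5])  # ...is visible through the other immediately
    m1.close()
    m2.close()
os.remove(path)

print(shared)  # b'HELLO'
```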

                                                                • yuliyp 2 years ago
                                                                  > most recently accessed files are cached to RAM, with less recently accessed files spilling over to a disk-backed cache

                                                                  Isn't this how most servers run normally? (parts of) files which are accessed are in page cache, the rest is on "disk"

                                                                  • ape4 2 years ago
                                                                     That page shows a `mkdir` is 3 JSON commands. I wonder if it's that many HTTP requests.
                                                                    • ArtWomb 2 years ago
                                                                      >>> every file system operation is fundamentally an HTTP request, so the latency is several orders of magnitude higher than the equivalent disk operation

                                                                      gcsfuse latency is ok as it embodies "infinite sync & persistence" ;)

                                                                      • tyingq 2 years ago
                                                                        Well, and there's no such thing as opening a file and modifying some small part of it. That's emulated with a full rewrite of the whole object.
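
A toy sketch of what that emulation implies (an in-memory dict standing in for a bucket; all names are made up): even a two-byte "edit" costs a full download and a full re-upload.

```python
# Toy in-memory "bucket". GCS objects are immutable, so a FUSE layer
# cannot patch bytes in place: a tiny write() has to be emulated as
# download -> modify -> re-upload of the whole object.
bucket = {}

def put_object(key, data):
    bucket[key] = bytes(data)  # whole-object write; no partial updates

def get_object(key):
    return bucket[key]

def patch_object(key, offset, patch):
    """Emulate an in-place write via read-modify-rewrite."""
    old = get_object(key)                                # full "download"
    new = old[:offset] + patch + old[offset + len(patch):]
    put_object(key, new)                                 # full "re-upload"
    return len(new)  # bytes rewritten: the whole object, not len(patch)

put_object("logs/app.txt", b"status=XX; rest of a large object")
rewritten = patch_object("logs/app.txt", 7, b"OK")
print(rewritten)  # 33 -- two changed bytes cost a 33-byte rewrite
```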
                                                                        • VikingCoder 2 years ago
                                                                          Uh, how does it perform from a Google Compute Engine Virtual Machine?

                                                                          If it performs well there, I could imagine that being pretty useful.

                                                                          • MontyCarloHall 2 years ago
                                                                             That is exactly where I tested it, and the latency was still abysmal (~1 second per file operation).

                                                                            I don’t even want to know how bad the latency would be outside of a cloud VM.

                                                                          • bitL 2 years ago
                                                                            Moreover, there is no SLA on those FUSE adapters so putting it into any part of production is too risky.
                                                                            • qsort 2 years ago
                                                                              My personal conspiracy theory: most "cloud services" are just... bad.

                                                                               VMs and disk space I understand completely; having machines on-prem is too much of a hassle and the price isn't that bad. But for stuff like this, managed services, and databases especially, you're just getting scammed.

                                                                            • nickcw 2 years ago
                                                                              As the author of rclone I thought I'd have a quick look through the docs to see what this is about.

                                                                              From reading the docs, it looks very similar to `rclone mount` with `--vfs-cache-mode off` (the default). The limitations are almost identical.

                                                                              * Metadata: Cloud Storage FUSE does not transfer object metadata when uploading files to Cloud Storage, with the exception of mtime and symlink targets. This means that you cannot set object metadata when you upload files using Cloud Storage FUSE. If you need to preserve object metadata, consider uploading files using gsutil, the JSON API, or the Google Cloud console.

                                                                              * Concurrency: Cloud Storage FUSE does not provide concurrency control for multiple writes to the same file. When multiple writes try to replace a file, the last write wins and all previous writes are lost. There is no merging, version control, or user notification of the subsequent overwrite.

                                                                              * Linking: Cloud Storage FUSE does not support hard links.

                                                                              * File locking and file patching: Cloud Storage FUSE does not support file locking or file patching. As such, you should not store version control system repositories in Cloud Storage FUSE mount points, as version control systems rely on file locking and patching. Additionally, you should not use Cloud Storage FUSE as a filer replacement.

                                                                              * Semantics: Semantics in Cloud Storage FUSE are different from semantics in a traditional file system. For example, metadata like last access time are not supported, and some metadata operations like directory renaming are not atomic. For a list of differences between Cloud Storage FUSE semantics and traditional file system semantics, see Semantics in the Cloud Storage FUSE GitHub documentation.

                                                                              * Overwriting in the middle of a file: Cloud Storage FUSE does not support overwriting in the middle of a file. Only sequential writes are supported. Access: Authorization for files is governed by Cloud Storage permissions. POSIX-style access control does not work.

                                                                              However rclone has `--vfs-cache-mode writes` which caches file writes to disk first to allow overwriting in the middle of a file and `--vfs-cache-mode full` to cache all objects on a LRU basis. They both make the file system a whole lot more POSIX compatible and most applications will run using `--vfs-cache-mode writes` unlike `--vfs-cache-mode off`.

                                                                              And of course rclone supports s3/azureblob/b2/r2/sftp/webdav/etc/etc also...

                                                                               I don't think it is possible to adapt something with cloud-storage semantics to a file system without caching to disk, unless you are willing to leave behind the 1:1 mapping between files seen in the mount and objects in the cloud storage.

                                                                              • milesward 2 years ago
                                                                                Please, listen to me: use this only in extremely limited cases where performance, stability, and cost efficiency are not paramount. An object store is not a file system no matter how hard you bludgeon it.
                                                                                • rippercushions 2 years ago
                                                                                  Is this the same gcsfuse that's been around for years, only now with official Google support?

                                                                                  https://github.com/GoogleCloudPlatform/gcsfuse

                                                                                  • scoobydoobydrew 2 years ago
                                                                                     Looking at the change descriptions, it looks like underlying changes were made to get here, like now using the Go client library. I'd expect a more stable product and better performance; it looks like the performance benchmarks under docs have been updated as well. Happy to finally see Google standing behind this, and the official CSI driver is really cool to see.
                                                                                    • throwdbaaway 2 years ago
                                                                                      Heh, my old laptop has a git clone of this from September 1st 2016.
                                                                                      • beastman82 2 years ago
                                                                                        yes

                                                                                        > export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`

                                                                                      • askvictor 2 years ago
                                                                                        Now for official Google Drive support on Linux...
                                                                                        • curt15 2 years ago
                                                                                          How do Googlers access Google Drive from their Linux workstations? Do they have an internal GDrive client?
                                                                                          • martius 2 years ago
                                                                                            Not that I know of, we have some virtual filesystems for specific things, but in general Drive is for shared docs, videos (recorded meetings/presentations) and things like this.

                                                                                            We don't use Drive to store other files. Actually, we don't really "store files", since almost everything we need is remote.

                                                                                            See for instance this discussion: https://news.ycombinator.com/item?id=13561096

                                                                                            • surajrmal 2 years ago
                                                                                              Through the web frontend. I'm not aware of any special FUSE clients, nor would one be particularly appealing. All files I store in it are for web-based applications (primarily GSuite). We have alternative Colossus-based file share mounts which we can use for "native" files. I personally use git and/or rsync to share files between my various corp devices (laptop, cloud VM, desktop) in addition to those other options.
                                                                                              • plaidfuji 2 years ago
                                                                                                I wonder the same, but I also wonder what the actual use case is for the Drive app on Linux. For me, Drive is mostly for syncing office docs (namely MS-office docs), PDFs and images among teams. That type of work doesn’t lend itself well to a Linux env anyway. And for programming-heavy sync tasks, a user will more likely use a remote Git repo for code and GCS for data. Does google even use MS office internally?
                                                                                                • KingOfCoders 2 years ago
                                                                                                  I write my articles in Markdown and would want to switch to a terminal-based Linux setup (Pi Zero, low power, e-paper, distraction-free) for this instead of a GUI.

                                                                                                  (I currently use Goland and Scrivener to write articles and books)

                                                                                                  • capableweb 2 years ago
                                                                                                    > Does google even use MS office internally

                                                                                                    That'd be weird, considering they have their own suite of office tools. Kind of like Microsoft using Google Cloud rather than Azure internally.

                                                                                                • jijji 2 years ago
                                                                                                  I've been using rclone [0] to do the same under linux for years, how is this different?

                                                                                                  [0] https://rclone.org

                                                                                                  • dallbee 2 years ago
                                                                                                    Unfortunately it's common to have a policy in place disallowing 3rd-party app api access to drive storage. This prevents apps like rclone from working, but the drive client works because it isn't 3rd-party.
                                                                                                  • ISL 2 years ago
                                                                                                    Can this be used to mount Drive under linux?
                                                                                                  • retrocryptid 2 years ago
                                                                                                    This has been a thing for a while; I remember using it (or something like it) several years ago. While it's great for random files you might want to place in the G-Cloud, what I really wanted was to access my google docs content from the Linux command line. And you can do that, it's just that they're in non-obvious, non-documented, frequently changing formats that will only ever be usable with Google Docs.

                                                                                                    But if you're using the google cloud like you might use Box.Net or DropBox, it seems fine for light usage.

                                                                                                    • manigandham 2 years ago
                                                                                                      Object storage is a higher-level abstraction than block-storage. FUSE and similar tech can do the job for basic requirements like read-only access by legacy applications but rarely works well for other scenarios.

                                                                                                      A more complex layer like https://objectivefs.com/ (based on the S3 API) would be more useful, although I would've expected the cloud providers to scale their own block-store/SANs backed with object-stores by now.

                                                                                                      • remram 2 years ago
                                                                                                        See also: JuiceFS: https://juicefs.com/

                                                                                                        Adds a DBMS or key-value store for metadata, making the filesystem much faster (POSIX, small overwrites don't have to replace a full object in the GCS/S3 backend).

                                                                                                        Almost certainly a better solution if you want to turn your object storage into a mountable filesystem, with the (big) caveat that you can't access the files directly in the bucket (they are not stored transparently).
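A very rough sketch of that split (all names hypothetical, nothing like JuiceFS's real on-disk formats): file metadata lives in a key-value store mapping paths to ordered chunk-object IDs, so a small overwrite only uploads one new chunk object instead of rewriting the whole file:

```python
import uuid

CHUNK = 4  # bytes per chunk; real systems use MiB-sized chunks

metadata = {}      # path -> ordered list of chunk object IDs (the "DBMS" side)
object_store = {}  # chunk ID -> bytes (the S3/GCS bucket side)

def write_file(path, data):
    """Store data as a sequence of chunk objects plus a metadata entry."""
    ids = []
    for i in range(0, len(data), CHUNK):
        cid = uuid.uuid4().hex
        object_store[cid] = data[i:i + CHUNK]
        ids.append(cid)
    metadata[path] = ids

def overwrite(path, offset, data):
    """Overwrite one chunk-aligned range by swapping in a new chunk object;
    every other chunk of the file is untouched."""
    assert offset % CHUNK == 0 and len(data) == CHUNK
    cid = uuid.uuid4().hex
    object_store[cid] = data
    metadata[path][offset // CHUNK] = cid

def read_file(path):
    """Reassemble the file from its chunk objects."""
    return b"".join(object_store[c] for c in metadata[path])

write_file("/a.txt", b"hello world!")
overwrite("/a.txt", 4, b"XXXX")  # touches 4 bytes, uploads one small object
print(read_file("/a.txt"))  # b'hellXXXXrld!'
```

The replaced chunk object is left behind as garbage (a real system would garbage-collect it), which is also why the objects in the bucket are no longer directly readable as files.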

                                                                                                      • jefftk 2 years ago
                                                                                                        Cloud Storage FUSE does not support overwriting in the middle of a file. Only sequential writes are supported.

                                                                                                        This seems like a big limitation?

                                                                                                        • 8organicbits 2 years ago
                                                                                          One challenge with writes in the middle is that they change the file hash. Cloud services typically expose the object hash, so changing any byte of a 1TB file would require a costly read of the whole object to compute the new hash.

                                                                                          You could split the file into smaller chunks and reassemble them at the application layer. That way you limit the cost of changing any byte to the chunk size.

                                                                                          That could also support inserting or removing a byte: you'd have a new chunk of DEFAULT_CHUNK_SIZE+1 (or -1). Split and merge chunks when they get too large or small.

                                                                                                          Of course at some point if you are using a file metaphor you want a real file system.
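A toy illustration of the per-chunk hashing idea (hypothetical names, deliberately tiny chunk size): each chunk is hashed independently, so an in-place edit only forces re-hashing the chunk it touches, not the whole file:

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny; real chunks would be MiB-scale

def split_chunks(data: bytes) -> list[bytes]:
    """Split a byte string into fixed-size chunks."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def chunk_hashes(chunks: list[bytes]) -> list[str]:
    """Hash each chunk independently so edits stay local."""
    return [hashlib.sha256(c).hexdigest() for c in chunks]

chunks = split_chunks(b"hello world!")
before = chunk_hashes(chunks)

# Overwrite one byte in the middle of the file (inside chunk index 1).
chunks[1] = chunks[1][:2] + b"W" + chunks[1][3:]
after = chunk_hashes(chunks)

changed = [i for i, (a, b) in enumerate(zip(before, after)) if a != b]
print(changed)  # [1]: only the edited chunk needs a new hash
```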

                                                                                                          • throwawaaarrgh 2 years ago
                                                                                                            pretty standard limitation of object storage services iirc
                                                                                                            • jefftk 2 years ago
                                                                                                              Doesn't this mean that most programs you might want to use with the FUSE API won't actually work? They'll do fine for a while, until they try to seek, and then they'll get an error?

                                                                                                              Or is there a large group of programs that only ever write sequentially?

                                                                                                              • jsnell 2 years ago
                                                                                                                I'd think non-appending writes are quite rare in practice, other than databases. Even when the application is logically overwriting data, in other kinds of programs it's almost always implemented as writing to a new file + an atomic rename, not in-place modification.
                                                                                                                • hawski 2 years ago
                                                                                                  Most programs either write a full file every time and replace the old file with a single move, or append to an existing file. Writing in the middle could happen in a program writing to some kind of archive or disk image. There is probably a whole group of programs that do this I'm not familiar with, but I'm pretty sure of my first sentence.
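The write-a-new-file-then-atomic-rename idiom described above can be sketched like this (a generic POSIX pattern, not anything gcsfuse-specific):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write data to a temp file in the same directory, then atomically
    replace the target. Readers see either the old file or the new one,
    never a partial write."""
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the data is durable before the rename
        os.replace(tmp_path, path)  # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on failure
        raise

atomic_write("example.txt", b"new contents")
print(open("example.txt", "rb").read())  # b'new contents'
```

Because the whole file is written sequentially and then renamed, this pattern fits a sequential-write-only backend, which is part of why the gcsfuse restriction bites fewer programs than you might expect.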
                                                                                                                  • 2 years ago
                                                                                                                    • throwawaaarrgh 2 years ago
                                                                                                      well yeah, but there are a lot of things FUSE makes easier: no need to implement a client library, no need to write some custom wrapper or rsync job to sync files to the bucket or the bucket to the local system, etc. It won't work for every app, but for the ones it does support it saves a ton of extra work and maintenance.
                                                                                                                  • scoobydoobydrew 2 years ago
                                                                                                    This works; there is nothing stopping it, but just like all cloud object storage it will trigger a complete rewrite of the object when the file is saved.
                                                                                                                  • iamjk 2 years ago
                                                                                                                    I mean I get why everyone wants everything to be fuse-compatible but some things just aren't meant to be done.
                                                                                                                    • ggambetta 2 years ago
                                                                                                                      "Your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should!"
                                                                                                                    • goodpoint 2 years ago
                                                                                                                      FUSE is really not suitable for this.
                                                                                                                      • trollied 2 years ago
                                                                                                                        Be aware that this is not free:

                                                                                                                        "Cloud Storage FUSE is available free of charge, but the storage, metadata, and network I/O it generates to and from Cloud Storage are charged like any other Cloud Storage interface. In other words, all data transfer and operations performed by Cloud Storage FUSE map to Cloud Storage transfers and operations, and are charged accordingly."

                                                                                                                        • rippercushions 2 years ago
                                                                                                                          Using FUSE doesn't cost you anything extra, but it doesn't make the underlying storage free.
                                                                                                                          • ElectricalUnion 2 years ago
                                                                                            You will be doing storage operations silently and in an unoptimized fashion, more so if the underlying FUSE filesystem is implemented naively.

                                                                                            For example, Cloud Storage never moves or renames your objects; it copies to a new object and deletes the original instead. This can end up costing quite a lot if your data is in a storage class other than "standard", because of minimum storage duration charges.
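A toy model of that cost (a pure in-memory simulation, not the real Cloud Storage client API): a flat-namespace store with no native rename has to issue one copy and one delete, both billable operations:

```python
class FakeObjectStore:
    """Minimal flat-namespace object store: rename is copy + delete."""

    def __init__(self):
        self.objects = {}
        self.ops = []  # every billable operation, in order

    def put(self, name, data):
        self.objects[name] = data
        self.ops.append("put")

    def rename(self, old, new):
        # No native rename: copy to the new key, then delete the old one.
        self.objects[new] = self.objects[old]
        self.ops.append("copy")
        del self.objects[old]
        self.ops.append("delete")

store = FakeObjectStore()
store.put("logs/2023-05-01", b"...")
store.rename("logs/2023-05-01", "archive/2023-05-01")
print(store.ops)  # ['put', 'copy', 'delete']: the rename alone billed two ops
```

With minimum storage duration classes, the delete half of a rename can additionally trigger an early-deletion charge on the original object, which is the "quite a lot" in the comment above.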