How the open source Caddy server uses Grafana Cloud for full-stack observability

36 points by m_sahaf 1 year ago | 6 comments
  • ttymck 1 year ago
    I'd have preferred more technical details. But it's interesting to get a sense of the rather "cowboy" (in a positive sense) approach of the Caddy maintainers to their "online" service offering (a build server).
    • m_sahaf 1 year ago
      > I'd have preferred more technical details.

      There isn't much technical details to it. Caddy exposes profiles by default[0], and Prometheus metrics are available as opt-in. We set up grafana-agent to collect profiles and metrics from Caddy, poked at the Grafana Cloud portal, studied the available data, and checked the charts for anomalies. Grafana Cloud made it easy for us to get started without having to build more infrastructure ourselves, which would also have taken energy that is better spent on the core of our project.

      [0] https://twitter.com/MohammedSahaf/status/1760415991513637137
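
For readers who want to see what "collecting metrics from Caddy" might look like at the lowest level, here is a minimal sketch. It is not the setup described above (that used grafana-agent); it only reads the Prometheus text format directly. It assumes the Caddy admin endpoint is on its default address, localhost:2019, and that metrics are served at /metrics there; the profiles index is assumed to be at /debug/pprof/ on the same endpoint. The names `ADMIN` and `parse_metrics` are made up for illustration.

```python
"""Sketch: read Prometheus-format metrics from a local Caddy admin endpoint."""
from urllib.request import urlopen

# Assumption: Caddy's admin API listens on its default address.
ADMIN = "http://localhost:2019"


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into {series: value}.

    Comment lines (# HELP / # TYPE) and blank lines are skipped, and an
    optional trailing timestamp is ignored. Simplification: a literal "}"
    inside a label value is not handled.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Label values may contain spaces, so split after the closing brace.
            series, _, rest = line.rpartition("}")
            series += "}"
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if fields:
            samples[series] = float(fields[0])
    return samples


if __name__ == "__main__":
    # Requires a running Caddy instance with the admin endpoint enabled.
    with urlopen(f"{ADMIN}/metrics", timeout=5) as resp:
        metrics = parse_metrics(resp.read().decode("utf-8"))
    for series, value in sorted(metrics.items()):
        print(series, value)
```

In practice a collector such as grafana-agent does this scraping on a schedule and ships the samples to a remote backend, so a script like this is only useful for poking at what the endpoint returns.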

      • starkparker 1 year ago
        The biggest problem in the observability space is people assuming that other people know what any of this entails.

        What "isn't much technical details" to you is required information for way more people than o11y wonks (not you, but more Grafana Cloud and some OTEL champions etc.) seem to believe, especially around cloud/microservice metrics.

        Prometheus and PromQL especially are opaque and hard to start using. The barrier to understanding how they can solve problems is higher than it looks from the inside, especially when you have account managers in your ear making sure _you_ know.

        • m_sahaf 1 year ago
          Fair enough. There was a learning curve that we had to overcome. I guess you can blame the curse of knowledge[0] for not making this part of the blog post, or the fact that I was more focused on the results delivered by the Grafana stack than on the process. I wonder if it may be something like math, where you have to practice enough for it to click.

          [0] https://en.wikipedia.org/wiki/Curse_of_knowledge

    • notfunny 1 year ago
      Is this a joke? You can't even use Caddy's metrics in production without a serious performance impact:

      https://github.com/caddyserver/caddy/issues/4644

      • m_sahaf 1 year ago
        We have it enabled in production. Checkmate.

        You're welcome to help through numerous means.