Platform teams need a delightfully different approach, not one that sucks less

17 points by akhayam 1 year ago | 13 comments
  • jamalarif 1 year ago
    Platform engineering has been a core theme within the cloud native community for the last couple of years. Thanks for highlighting the systemic scalability and complexity challenges in platform engineering today. The CVE for availability is an interesting solution and could be a breakthrough in taming endemic complexity and human bottlenecks.
    • loneqas 1 year ago
      I'm curious about how the platform team approaches continuous learning and knowledge acquisition. Do they rely solely on post-mortems and incidents, or do they also incorporate insights from training? Are there any benchmarks for success?

      Thank you so much for sharing the article; I really appreciate the presentation and the concise description of the platform.

      • akhayam 1 year ago
        Most are just focusing on incident response and reactively improving things. That's why a proactive discipline that prevents these issues from happening is sorely missed.

        Even when postmortems are done, the information stays in team silos because there is no way to share these learnings across teams and enterprises. Hence, everyone repeats each other's mistakes.

      • tahirrauf 1 year ago
        This is a profoundly insightful blog. The complexities of managing Kubernetes resonate deeply. It's interesting to see how these challenges manifest in real-world scenarios and the impact they have on team dynamics and innovation.
        • akhayam 1 year ago
          The platform team is such a new concept, and it is catching on like wildfire inside enterprises. However, the technology, people and process challenges faced by these teams are really not understood very well. In this blog, we tried to crystallize the biggest obstacles that we see platform teams facing.

          Thanks for your encouraging words and we are glad that the challenges and solutions resonate.

        • samee999 1 year ago
          Liked the concepts described here, which impact every single platform team. Do you see teams abandoning k8s because of the monotonic increase in its complexity as your infrastructure grows?
          • fawadkhaliq 1 year ago
            k8s complexity is a challenge at scale, but its growth seems likely. Reasons include strong community support, continuous innovation that regularly adds new capabilities, and overall standardization across various layers of the substrate, to name a few. It's not for everyone though. Teams with simpler needs might find k8s overkill and opt out for valid reasons. Overall, benefits + community support make it a go-to for many, despite the challenges.
          • abdullah-shah 1 year ago
            Burned by the shared responsibility model way more than we would have liked. AWS said back in 2011 that they would “take the muck away”. I guess we are back to owning the muck now :)
            • fawadkhaliq 1 year ago
              Couldn't agree more. As someone who used to work at AWS, I've seen it from both sides. AWS has valid reasons (business and technical) for not taking responsibility for all the layers on top. The missing piece is the operational knowledge AWS possesses but platform teams elsewhere lack access to. That's one reason to bring in a “trusted broker” to bridge this gap.
            • irvinzhan 1 year ago
              I thought we were the only ones that had these issues :)

              I wonder how other teams are solving the hiring vs automation challenges?

              • akhayam 1 year ago
                Mostly by hiring contractors, which is basically not working well for a lot of enterprises.
              • masood_ali 1 year ago
                Thank you for sharing insights into the burning issues of platform engineers. Cloud complexity is an old problem that requires new solutions. Standardization and knowledge sharing will save both time and effort. Good luck!
                • akhayam 1 year ago
                  Thanks! Indeed, sharing knowledge programmatically will save time and effort, _while_ improving the safety of these systems.