System Initiative: Second Wave DevOps
111 points by kelp 2 years ago | 79 comments
- chologrande 2 years agoAfter reading this post, I've browsed the site. I'm not sure how this is anything but significantly worse than the current model?
I've been around long enough to know that any "no code" style interface or GUI is typically the _problem_, not the solution. Regardless of the code they export, you end up with fat fingers, misclicks, forgotten UI paths to follow... Taking a software eng approach to shipping infra is a stable, known process that the infra team and the software teams can understand, no specialized GUI tool knowledge required.
I've been using the same basic terraform modules, jenkins pipelines, and infra architecture for nearly 7 years across multiple companies and numerous cloud deployments. It's not fancy but it justworks.jpg. Every time I re-use that code for a new deployment or account I save TONS of time.
Devops doesn't have to be hard. Infrastructure doesn't have to be complex. Deploying every day isn't _that_ difficult. KISS Method is key, especially when you're looking for speed. Using _less_ tools from the CNCF is better, and will let you move faster, not adding a new one.
- totallywrong 2 years ago> Devops doesn't have to be hard. Infrastructure doesn't have to be complex
That's simply not true for anything larger than a few services and a small dev team. The cloud is very complex to do right when you focus on security, performance, and scalability. And Terraform invariably devolves into a nightmare when you have a ton of resources with dependencies between them.
- chologrande 2 years agoI'm definitely not google scale, but we're global, in over 300 cities spanning ~30 countries. On an avg day we process well over 25k rps on multiple services. Simple architecture and IaC like terraform is exactly how we manage the dependencies. It's the solution, not the problem.
- dipperdottydoo 2 years agoYou think you have simple architecture when you’ve introduced Terraform to what is, based on your statistics, a two server use case. A PlayStation is capable of 25 kRPS and probably its data iops, too. Buy another one and you’re HA.
You’re trapped in the complexity of the method and think you’ve achieved nirvana. This comment reminds me of those demos when Hadoop was the rage, where people would do a $4 million Hadoop ETL on their laptop and shut up a room.
- midasuni 2 years ago25krps? As in requests per second? I.e one request every 3 seconds?
- cube2222 2 years agoEsp. as you start splitting up your statefiles!
However, I do think that this is mostly essential complexity, rather than accidental one. We're now building systems that are way more secure and/or scalable than before. Least possible network access and permissions everywhere already add a bunch of complexity. Pushing complexity from our code to managed cloud offerings does its part, too. But all of this can be tamed very well with modules and reusable components.
That said, if you're scaling Terraform, I do recommend you to check out the tools that have sprung up in the recent years to manage it. I'll personally recommend Spacelift[0] (see disclaimer). It can help you orchestrate your statefiles once you start having many of them (even tens or hundreds of statefiles in a single workflow are no problem) using stack dependencies, help team members self-serve through blueprints, automate all the things through OPA policies, and generally help you scale your Terraform usage to a larger team.
[0]: https://spacelift.io
Disclaimer: Software Engineering Team Lead at Spacelift, so take the recommendation with a fair grain of salt; I do legitimately think it's a great product though. If you'd like to reach out, feel free to do so through the website or the contact details in my profile.
- JohnMakin 2 years agoWell said. Click ops is the root of all evil, IME, unless you're a very small shop.
- jedberg 2 years agoClick ops is great for most shops, as long as it has advanced configs available and you have at least one expert on staff for when those configs are needed.
At Netflix our goal was always to build tools where the majority of devs just check into source control and click a few buttons, but could go as far as configuring kernel tunables if necessary (but also making that as unnecessary as possible).
- JohnMakin 2 years agoBut the infrastructure underneath the stuff the devs use was almost certainly not click ops’d. I’m fine clicking a deployment into prod, I am not fine making advanced cloud or build configurations via a UI
- jsiva 2 years agoText-based interfaces still have the same issue with being able to make a typo or tab completing without checking. Seems like the major advantage is that these text based tools are able to be versioned well through scripts. This might be fixed with a GUI version of autohotkey, turning these gui interactions into a script.
- chologrande 2 years agoMandated peer review, planned actions, and automated risk evaluation are part of our infra pipeline. This typically doesn't exist outside of software dev style pipeline.
- jsiva 2 years agoI was about to disagree but the automated risk evaluation (ARE) part definitely qualifies your whole statement. Going to go off on a tangent: how do we introduce automated risk evaluation to environments outside of software development. Implementing ARE as software is probably the most efficient method in terms of time and resources. But ideally (some) users of ARE should be able to improve on it. In the case of "traditional" engineering (civil/electrical/chemical etc.) there are engineers who specialize in numerical methods and can improve on ARE. But what about professions where software development skills are not as widespread (or seen as a legitimate contribution to the field). There are still probably going to be members of these professions with software development abilities but is there a point where other methods could be considered (i.e. electrical/mechanical methods) for ARE implementations.
- azanar 2 years agoI don't see the current model as a single model, but as several models interrelating.
In software, there are at least three models I can think of immediately: code, configuration and user data.
Why do I separate configuration? Isn't configuration just code or data? I don't think it is. It is data _about_ a particular system, as opposed to a particular user.
Why the distinction here? The code of a system can be designed, developed, and tested against a set of supported configurations. At that point, the system might only run under one configuration at a time, but can be trusted from a requirements perspective to operate under other configurations without needing to go through the whole software development lifecycle again.
Why not just store this in user data, then? Different requirements. Three off the top of my head: configuration data wants much better change management than most user data does. That management wants to be exportable and importable. It wants different access controls.
Historically, configuration data change management has been done in SCM, such as git. The reason why git isn't a big deal in development is because it is not a point of particularly high friction relative to the other parts of the software development lifecycle. It is a _much_ bigger point of friction in configuration changes.
Hence, three models.
We can argue about whether or not configuration changes _ought_ to go through the full cycle, because I am wrong to trust _any_ change to a system with anything less. My practical experience suggests that most of the time, the damage done is less than the cost of enforcing a strict lifecycle on everything.
- azanar 2 years agoTo make this concrete: terraform for me has been part code, part configuration.
I define a resource, and provide a whole set of knobs on that resource. That's the code part. I test that code against a variety of configurations, the same way I might unit test application code against a variety of app configurations. I also verify that changing knobs from one setting to another behaves. With automated testing, this actually isn't all that hard to do. Once I've verified things work right, I deploy.
At this point, I will default to trusting that things will work. This is the configuration part. Set these knobs to whatever permitted value you want, and the system will update behavior based on those new values. Most of the time, things like this work. That is good enough for me.
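The code/configuration split described above can be sketched in Python rather than Terraform. This is only an illustration, not the commenter's actual setup: the `KNOBS` table, the knob names, and the `validate` function are all hypothetical. The idea is that the set of permitted values is what gets tested once, and later configuration changes only choose among them.

```python
# Hypothetical knobs for one resource: each maps to its permitted values.
# The permitted set is what was tested; config changes only pick among them.
KNOBS = {
    "instance_size": {"small", "medium", "large"},
    "replicas": {1, 2, 3},
    "public": {True, False},
}


def validate(config):
    """Return a list of problems; an empty list means the config is permitted."""
    problems = []
    for name, value in config.items():
        if name not in KNOBS:
            problems.append(f"unknown knob: {name}")
        elif value not in KNOBS[name]:
            problems.append(f"{name}={value!r} is not a permitted value")
    return problems


if __name__ == "__main__":
    # A change that stays inside the tested envelope.
    print(validate({"instance_size": "medium", "replicas": 2}))
    # A change that leaves it and should go back through review.
    print(validate({"instance_size": "huge", "colour": "red"}))
```

A change that passes `validate` would be treated as configuration; anything it rejects would be a code change and go through the full lifecycle.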
- firesteelrain 2 years agoYou obviously have never worked in Aerospace or Safety critical systems
- pmoriarty 2 years ago"Deploying every day isn't _that_ difficult"
Just don't ever ask to roll back...
- esafak 2 years agoWhy? Keep build artifacts and deploy any build you like. Do you have a state or dependency problem?
- pmoriarty 2 years agoIt's quite common for one service to depend on another (or multiple others), on the network/firewall state, on configurations that might affect or be affected by other services, etc.
What looks simple when you're the king of your own little kingdom suddenly doesn't seem as simple when that fantasy meets the reality of sharing the world with others.
- anotherhue 2 years agoSame reason clocks shouldn't jump backwards, it breaks so many assumptions you end up with insanity. Do a revert commit so it's the old code in new clothing.
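The "revert commit" idea can be shown with a toy, append-only release log. This is a sketch under assumptions: `ReleaseLog` and its methods are hypothetical names, not part of any deploy tool. Going back to an old build creates a new, higher-numbered release instead of rewinding history, so version numbers behave like a clock that never jumps backwards.

```python
class ReleaseLog:
    """Append-only deploy history: version numbers only ever increase."""

    def __init__(self):
        self._releases = []  # list of (version, artifact) tuples

    def deploy(self, artifact):
        """Record a new release and return its version number."""
        version = len(self._releases) + 1
        self._releases.append((version, artifact))
        return version

    def revert_to(self, version):
        """Redeploy an old artifact as a *new* release instead of rewinding."""
        if not 1 <= version <= len(self._releases):
            raise ValueError(f"no such release: {version}")
        _, artifact = self._releases[version - 1]
        return self.deploy(artifact)

    def current(self):
        """Return the latest (version, artifact) pair."""
        return self._releases[-1]


if __name__ == "__main__":
    log = ReleaseLog()
    log.deploy("app-build-a")
    log.deploy("app-build-b")
    log.revert_to(1)  # old code, new version number
    print(log.current())
```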
- Mutlut 2 years agoGreat that your setup works; I personally hate terraform and try to avoid it.
k8s is also KISS, but it brings even more 'out of the box', like logging and monitoring. I would highly recommend taking a look; perhaps you'll like it.
Terraform's state management is bad, and a lot of people don't realize that secrets get stored in the state. Bootstrapping this securely already needs infrastructure like remote stores.
Jenkins is fine, I would say, but with argocd you actually gain real insight. Argocd is also IaC, and you can manage argocd through argocd.
The adoption of argocd in the platforms I have built is great. Developer teams love it, get used to it very fast, and don't need cluster access (in your case, VM access).
With k8s you also get zero downtime deployment, blue / green basically for free.
- aftbit 2 years ago>Doing “DevOps work” is unquestionably the worst part of building a modern application. It’s full of tiny papercuts, indignities we suffer in our toolchains, our feedback loops, and our software. It’s a city of brutalist buildings filled with sharp-edged couches pretending to be comfortable. Think of all the advances in how we interact with tools in other domains - then take a look at the way you build, deploy, and operate your software, at all the crazy gyrations you use to glue it all together - and ask yourself why you accept it.
I dunno, usually I find databases and migrations to be the hard part. At this point, I have enough examples of app deploys that I can have a new app up and running on a pair of VMs with a robust blue/green deploy and backups inside of an hour or two, with deploy by Github Actions responding to pushes to prod branch.
Even if you don't have my company's half-decade worth of example devops, you can do something easier, like a single instance on a Digital Ocean machine with deploy by "ssh -A server 'cd yourapp && git pull && sudo systemctl restart yourapp'". Sure, you'll have a few seconds of downtime, and you'll expose your SSH keys to anyone on that box for those few seconds, but if you know some Linux and nginx, you can get this working inside of an hour from scratch.
- hinkley 2 years agoFundamentally, I think the “why” is still the fact that DevEx started with devs, and DevOps started as a collaboration with Ops people, who already were not speaking the same language of robustness that we do. Which is a little weird, and probably part of the friction between the groups.
They historically took on the reliability role, if nobody else did, but they were implementing reliability on top of a house of cards, which is a kind of hypocrisy that makes even mediocre devs bristle. Don’t lecture me on robust software, boyo. Your tools are made of string cheese and staples.
- pmoriarty 2 years ago"They historically took on the reliability role, if nobody else did, but they were implementing reliability on top of a house of cards, which is a kind of hypocrisy that makes even mediocre devs bristle. Don’t lecture me on robust software, boyo. Your tools are made of string cheese and staples."
I don't know why you'd blame ops for the crappyness of the tools they have at their disposal.
Yes, Ansible, Salt, Puppet, and Chef are spaghetti-code inducing congealed messes of design. So are large collections of complex shell scripts.
So what's the alternative? What spherical cow of a configuration management tool from Platonic dev heaven shall be foisted on us this time?
I'm sure it'll be super clean and elegant this time, unlike the last thousand shitty tools they made.
And don't get me started on devs that think they're qualified to do ops when all they know is their language of choice (if even that) and have never thought about the network, security, capacity, redundancy, failover, reliability, hardware, backups, the rest of the company or other users.
- jrott 2 years agoNo no no Kubernetes or Serverless or ChatGPT is going to save us this time.
More seriously, it's always going to be complicated and annoying. It's really past time we started dealing with the fundamental complexity of everything we are trying to do with software.
- Spivak 2 years agoBecause you're not talking about that thing that ops people are talking about. We build reliable systems. You're talking about reliable software. Ops people come from the perspective that all software is inherently unreliable including your app, especially your app and have to work within those constraints.
Terraform and Ansible look like gyroscopes compared to the build process of any modern software stack. We offered our dev teams a whole ass pizza party every time they had 10 green builds (on main) in a row. In three years we've paid it out once.
- hinkley 2 years agoThe operational tools should be the most stable bits and instead they are janky as fuck and I’ve spent too much of my career smacking victim-blaming tennis balls back over the net. If you look at what Ansible replaces it’s a wonder production ever worked at all. If you have to baby your automation it’s not automation.
Ops people are not used to thinking in boundary conditions. Hell, devs forget half the time. That’s part of why people wanted to merge them in the first place. Get the right sorts of cynicism together in a room and make me something with a big green button an idiot can push while everyone is in a meeting.
- thezilch 2 years agoOh boy, a whole pizza!
- zsoltkacsandi 2 years agoMy experience is (as a former dev, current ops) that the problem isn’t that dev and ops people aren’t speaking the same language, nor the tooling or processes.
The problem comes from the management/business side. They hire devs and tell them, "ship features as fast as you can." They also hire ops guys and tell them, "I want this whole thing super reliable, we can’t afford a minute of downtime."
In my opinion this is why DevOps is mostly pointless. We are trying to fix with tooling, processes, new tech, and fancy roles the fact that business people don’t want to make compromises or choose between the pace of delivery and reliability.
- hinkley 2 years agoThat’s definitely part of the dynamic. My biggest regret with automated testing is that software used to be a triumvirate of Quality, Dev, and Management, and when dev was fucking around they had two teams hitting them, and when Management was out of control, they had everyone mad at them. Get rid of QA and it’s Us vs Them and that worked briefly at the dawn of Agile but they got wise.
Ops should replace QA at that table to rebalance the equation. But again, and as you illustrated, we have an adversarial relationship that takes a lot of across-the-aisle work to introduce sanity.
- bob1029 2 years ago> Even if you don't have my company's half-decade worth of example devops, you can do something easier,
If you are open to it, try configuring an azure function app to use GitHub in the deployment center. I heard actual gasps from certain team members when they saw it automatically push the GH action workflow file into master and kick off the job without any additional bullshit beyond the GH authentication ceremony and org/repo/branch selection.
- giovannibonetti 2 years ago> I dunno, usually I find databases and migrations to be the hard part.
For me, the following tools make that a joy:
* Postgres as the database, which is very predictable and extremely reliable;
* Migrations with Ruby on Rails, that have just the right balance between a convenient DSL and letting you write SQL when necessary;
* The strong_migrations gem that catches in development unsafe commands to run in production, and explains how to make them safe
- efxhoy 2 years agoWe run the exact same setup, in my two years at the company we’ve only had one migration related issue and that was due to different minor versions of postgres between CI and staging. Now if postgis could get those official ARM docker images pushed i’d have nothing left to complain about.
- oofnik 2 years agoI'm happy to see someone really trying to color outside the lines with deployment tooling. I think we've fallen into a number of paradigms for system operations that we know are kind of bad, but we tell ourselves about how much more awful it used to be to numb the pain. That sort of attitude is the real killer of innovation.
I say bring it on; more variance and more disruption in this space as people try new approaches might be what we need to get us out of the rut we've been stuck in for too long. No idea if it will work, but good luck to Adam and his team.
- rossmohax 2 years agoThis talk is a must-see to understand what SI tries to achieve.
- Mutlut 2 years agoI'm very curious about this as normally all tools i know still have a higher entry point than i realize.
My current setup is 'get a k8s cluster spun up and configured properly as fast and easily as possible' and then just use argocd. Argocd is by far the best tool I have been using in the last 15 years: it does exactly what it should do (syncing and showing me k8s insight vs my git repo), can manage itself through the same mechanism (IaC), and people of different backgrounds are very fast in using it.
This tool either might bridge the gap for people and potentially solve problems but i do have to say: argocd.
Even if you think you want to start small and just use kubectl: start with argocd.
- holoway 2 years agoIf you want to know more about some of the technical details, we wrote something up: https://www.systeminit.com/blog-five-breakthroughs
- ericand 2 years agoCheck out Adam's "What if infrastructure as code never existed" talk from a month or two ago. Great framing for SI and entertaining. https://www.youtube.com/watch?v=5lPa2U239C4
- deadeye 2 years agoPerhaps we don't deploy six times a day because we're responsible for something a little more important and delicate than a free photo sharing website.
- negus 2 years agoI guess GitHub is enough important and delicate. How often do they deploy?
- mailund 2 years ago> Things like using source control, shared observability, feature flags, dark launching, continuous integration, and continuous delivery are widely considered best practices.
I seriously want to know which places these are! I've been at 5 different companies, and I've never been at a place where people don't look at me like I'm speaking French when I suggest dark launching a feature or introducing feature toggles. I've yet to experience a place that actually integrates continuously, as opposed to merely having a CI pipeline without actually doing continuous integration.
- esafak 2 years agoIt's mainstream in Silicon Valley.
- mailund 2 years agoInteresting! I'm not in SV, but consulting in a European city that portrays itself as having a fairly advanced tech community. I've yet to encounter anyone on a team I've been on that is familiar with it
Didn't know the differences could be that huge.
- peteridah 2 years agoI feel you – I had heard a _lot_ about dark launching and never actually worked at a company that actually did it until I worked at Hashicorp. Now in my mind it seems mainstream. I truly am in a bubble.
- anotherhue 2 years agoWe could begin by accepting that operations work often deals with far more complex problems than feature work.
Yes, the tooling is bad, all tooling is bad, but hearing the same old 'Infrastructure should "just work"' trope is getting old. Such developers should stop grandstanding and roll up their sleeves. Learning about TCP isn't beneath you.
- Coryodaniel 2 years ago> Learning about TCP isn't beneath you.
It's literally beneath you as an application developer in the TCP/IP stack :budumptss:
- 2023throwawayy 2 years agoI applaud the effort here, but just looking at how messy the graph is to deploy one docker image doesn't exactly make me want to try this for anything with a level of complexity beyond that.
- holoway 2 years agoThis is a reasonable reaction! :) There are a lot of ways to make the visual interface scale - examples from things like Blender, Figma, etc. We believe we can use them to make it scale semantically as the complexity climbs.
- solatic 2 years agoThe tools aren't the problem, leadership and culture is. Tools can't fix problems with leadership and culture. "Executive buy-in" is conspicuously missing from the post. When companies are "doing DevOps" by wiring up manual deploys and manual approvals in GitHub Actions, it's not the tool that's at fault. When developers are applauded for deploying once a month, it's not the fault of the YAML engineers who were hired to build the pipeline that's used once per month, it's the fault of the executives putting their hands together and clapping.
No tool in the world is going to convince an executive to trust their people, to take risks with uptime and stability, and to break production as a necessary part of organizational learning. No, that requires executives to feel supported by other executives. Tools do not create collaboration and trust; people do.
- holoway 2 years agoHi! Adam Jacob here. Happy to answer questions!
- skywhopper 2 years agoWhat does the performance curve look like with larger numbers of resources? ie, How soon does it get bogged down? 1,000 resources? 10,000? 1,000,000?
- holoway 2 years agoIt’s too early to know with real data. Architecturally, it should scale - but you know it’s going to need optimization when we start getting into high numbers.
- trabant00 2 years agoAnother attempt by Dev to make Ops a commodity? I was there when Microsoft announced ~20 years ago they're doing away with sysadmins with a GUI. Then Suse tried as well if I remember correctly. The conferences erupted with anger and booing, I was shaking my head and laughing.
Then came the great YAML plague and we had to give up our title and general purpose languages in favor of silly names, templates and DSLs. But you still have to understand the OS, the hardware, and have real world experience with availability, redundancy, etc. So the new generation of "DevOps" who was raised directly on terraform and k8s failed miserably in achieving any results.
Anybody saw any junior Ops (DevOps, SRE, GitOps, wtfeverops) job openings in the last few years? No? I wonder why. The tools are better, no? It should be easier than ever to deploy. We have all this micro-service orchestration and all those beautiful public clouds. All the conferences and the marketing are saying it's a breeze. You don't have to worry about ha, replication, iops and so on, we'll put that on your bill thank you very much.
So here comes Dev again with a solution: click-click-drag Ops. Surely this will fix things, surely you can now hire right from the street and train to deploy. Or will my old admin ass get even pricier as the demand ever rises and the supply is dwindling? Stay tuned to find out.
- holoway 2 years agoNothing about SI is intended to make ops a commodity. This is a power tool built by Ops people.
- arpyzo 2 years agoI don't understand why the prevailing opinion is that it's acceptable for software development to be complex. It's simply the nature of the beast! Not infrastructure though. Infrastructure should be simple because...?
Perhaps the reason your organization is only deploying once a month is the same reason it takes it a month to make simple code changes, which is because you haven't hired sufficiently capable engineers, and not because you're missing some magic.
edited: grammatical error
- silok 2 years agoI think it looks pretty promising, as long as the deployment setup is not only configurable via the UI, but also can be specified declaratively with code and hierarchical data (eg json).
IaC is a really powerful concept, and system initiative does not need to be in conflict with that paradigm, just another layer of abstraction that still allows IaC.
The main issue is how to combine UI state and manually written state.
The worst thing you can do imo, is to use a common representation, eg the UI would try and edit your manually written declarations. That is just a recipe for disaster.
The answer to this type of mixed editing is a layer approach, eg what is being done in the USD format (https://openusd.org/release/index.html)
Each authoring "instance" has full control of its layer, and composition semantics define how the layers compose to the final declarative structure.
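A very reduced sketch of the layering idea, in Python. It is not how USD composes layers (USD's composition rules are much richer), and `compose` and `_merge` are hypothetical names. It only shows the property described above: each author owns one layer, nobody edits anyone else's layer, and a fixed rule decides the final structure. Here the rule is simply "later layers win, key by key".

```python
def _merge(base, override):
    """Return a new dict where `override` wins, recursing into nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = _merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def compose(layers):
    """Compose layers from weakest to strongest without mutating any of them."""
    result = {}
    for layer in layers:
        result = _merge(result, layer)
    return result


if __name__ == "__main__":
    # Hand-written declarations (one layer) and UI edits (another layer).
    code_layer = {"web": {"image": "app:1.4", "replicas": 2}}
    ui_layer = {"web": {"replicas": 5}}
    print(compose([code_layer, ui_layer]))
```

Because the UI writes only to its own layer, the hand-written declarations are never rewritten by the tool.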
- erilbeth 2 years agoIt is never about the tools. It is the people who create problems. Examples in my case:
* A manager whose dev team cannot meet any schedule starts to blame devops.
* A dev architect implements a brand new bleeding-edge tech stack and runs away; devops people have to wake up every night to fix poorly designed software.
* A boss who doesn't hire a sec team wants PCI DSS passed; guess who's gonna do it.
* A team lead who abuses you calls the company that hired you and tells them not to hire you, that you are useless.
I've been working as a devops engineer for 6 years. I'm done. I quit and never gonna do this job again. Good luck with creating more complexity with tools.
- rossmohax 2 years agoI like the appeal of the model being bidirectional. Also, modelling a sequence of actions is really not a solved problem in Terraform & Pulumi: canary change, check metrics, rollout to the rest of the region, check metrics, then all regions. If they solved it all while being "declarative" and high level, it can be the next tool of choice for me.
I am not worried about the UI representation of the model like many comments are; it is not the main point of this project as I understand it. The UI is just that, a representation; the same relationships might as well be coded in HCL or the like.
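The sequence mentioned above (canary, check metrics, rest of region, check metrics, all regions) can be written down as a small loop. This is a sketch under assumptions, not something Terraform, Pulumi, or System Initiative provides: `staged_rollout`, `apply_change`, and `metrics_ok` are hypothetical names, and the two callbacks stand in for whatever actually applies a change and reads the metrics.

```python
def staged_rollout(stages, apply_change, metrics_ok):
    """Apply a change stage by stage, stopping at the first unhealthy stage.

    Returns (completed_stages, failed_stage); failed_stage is None on success.
    """
    completed = []
    for stage in stages:
        apply_change(stage)
        if not metrics_ok(stage):
            return completed, stage
        completed.append(stage)
    return completed, None


if __name__ == "__main__":
    applied = []
    stages = ["canary", "rest-of-region", "all-regions"]
    # Pretend the metrics go bad once the change reaches the rest of the region.
    result = staged_rollout(
        stages,
        apply_change=applied.append,
        metrics_ok=lambda stage: stage != "rest-of-region",
    )
    print(result)
    print(applied)
```

The open question in the comment is how to express this ordering declaratively; the loop only makes the imperative version explicit.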
- esafak 2 years agoI think the new wave of devops is called platform engineering and developer experience.
- nathants 2 years agowhat makes infra hard is the large delta between what you think exists and what actually exists. the larger the delta, the worse everything gets.
the only mitigation to this is less. less tooling, less infra, less abstractions. you want that delta approaching zero as uptime goes to infinity.
i’m not sure how replacing walls of code with walls of yaml or walls of gui graphs changes anything at all.
i suppose it’s possible some paradigm leap in infra understandability is hiding in crazy nextgen ui/ux, but i’m not holding my breath.
the last leap i encountered was moving from the python sdk to the go sdk for manipulating aws. this was significant, but still more of a qol improvement than something fundamental to the solution space.
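The "delta between what you think exists and what actually exists" can be made concrete as a diff of two inventories. A minimal sketch, assuming both sides can be dumped as `{resource_id: attributes}` dicts; `drift` and the resource names are hypothetical and no cloud SDK is involved.

```python
def drift(expected, actual):
    """Compare two {resource_id: attributes} maps and report the delta."""
    return {
        # Declared but not found in reality.
        "missing": sorted(expected.keys() - actual.keys()),
        # Found in reality but never declared.
        "unmanaged": sorted(actual.keys() - expected.keys()),
        # Present on both sides with different attributes.
        "changed": sorted(
            rid
            for rid in expected.keys() & actual.keys()
            if expected[rid] != actual[rid]
        ),
    }


if __name__ == "__main__":
    expected = {
        "vm-1": {"size": "small"},
        "vm-2": {"size": "small"},
        "db-1": {"size": "large"},
    }
    actual = {
        "vm-1": {"size": "small"},
        "vm-2": {"size": "large"},
        "lb-9": {},
    }
    print(drift(expected, actual))
```

With less infrastructure and fewer layers, both dicts shrink and so does the room for the three lists to be non-empty.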
- nickstinemates 2 years agoThe core value proposition is really valuable and the demo video nailed it. Change propagation as requirements change is ripe for error.
Have been following progress, excited for where it goes.
- Graffur 2 years agoSo it is Flickr I have to thank for this devops nonsense
- convolvatron 2 years agoAt least a couple words about what that might look like?
- ericand 2 years agoThe homepage has a video demo: https://www.systeminit.com/