Why "Operate first" is important for your project

An interview with my colleagues Marcel Hild and Robert Bohne on new ways to apply Open Source principles to operating software.

Ingo: Hello Robert, hello Marcel! Nice to have you here! Today we have an interesting topic, “Operate first”, and I’m very curious what that really means. But first, Marcel and Robert, please introduce yourselves briefly.

Marcel: I’m working in the office of the CTO at Red Hat on the topic of machine learning and how it could help transform operations into a more scalable way of doing things, reusing the brains of our AI overlords that are yet to come.

Robert: I’m the specialist solution architect for OpenShift. My role is to engage with and educate customers and partners on all things OpenShift: how to operate OpenShift, how to work with it, and how to deploy your applications on it.

Ingo: So we are all coming from Red Hat. We are all doing open source. And this is traditionally mainly about the freedom of the code. But what is the problem with this? When I hear “Operate First”, are we going into operations? As far as I know, operations are typically done at the customer’s side, at the user’s side. Why do we need to be involved in operating?

Marcel: I think this is due to the fact that operating software is getting more and more complex. And these days it’s not so much about software features, but it’s about software availability. So that’s why people are going to the hyperscalers, to the cloud vendors: Because they are more interested in actually having the service available than operating the service themselves.

How can we build operational excellence right into the software?
Marcel Hild

And that introduces a problem in the end-to-end Open Source story: How can we build operational excellence right into the software? Open source did a great job at liberating software itself, so that it isn’t just proprietary vendors selling the software, but users who can contribute back and add software features themselves.

Traditionally in Open Source the focus lies on the code. How to operate the code is left to the user.

Now, how do we get people to contribute back into the operational excellence, into the operability of software? “Operate First” is trying to address this very problem.

If you look at open source software, you have this contributor funnel: Somebody is using the software, and maybe they have a problem with it and open up an issue. Maybe somebody is working on that issue. Maybe the person who opened the issue is actually working on it themselves. And so out of 100 users, you might have 10 users reporting back problems, and one user actually fixing a problem. This is creating a funnel of contributions.

With cloud software, we essentially dry out this funnel because contributing back to the operability of software stops at the API layer. Maybe you can file an issue, but what happens with that issue is behind the cloud vendor or behind the service layer of the operated service.

It’s not as easy to just compile something and try it out yourself locally, because setting up a service requires way more than just configuring and compiling a bunch of software components these days. If you look at Kubernetes for instance, or if you look at storage clusters, that’s nothing that you would just typically do on your laptop. It’s something where you have to have a certain set of machines, a certain scale.

In the cloud paradigm, knowing how to run the code is becoming more important than the code itself.

Opening up that scale for contributions, so that not just individuals but also companies can collaborate on solving a single problem instead of reinventing the wheel again and again: this is what “Operate first” tries to solve.

On a philosophical layer it is how we rethink and envision Open Source in the cloud era, where it’s more about availability and operational excellence than about the software features themselves.

Ingo: If I understand that correctly: We at Red Hat have a mantra which we traditionally call “Upstream first”. This means that when we develop code, we do it in the open source communities. We do not do it behind closed doors, and we open up our contributions as far as possible. So is “Operate first” a similar procedure, not on the code level, but for operations procedures?

Marcel: Yes. “Upstream first” basically means that every single line of code that Red Hat develops should be contributed back to the upstream project, not just to reduce the maintenance burden, but also to give it back to the community. And usually we try to develop upstream first and productize as a second step.

We develop in the upstream community so that everybody can have their voice. All customers can have their voices heard and contribute back, shape the future of Red Hat products and these software projects.

How do you translate that into a product as a service? Maybe it’s storage as a service, maybe it’s queuing as a service. Maybe it’s Apache Kafka as a service. Maybe it’s open data science as a service. How do you engage users contributing back? How will this service be shaped?

And I don’t think it stops with users. It’s also about architects. It’s about developers. It’s about other projects that want to integrate there because nowadays it’s not just a database and off you go or you have a web application that accesses the database. Applications these days are spanning over a lot of microservices. They are spanning over a lot of domains. Going back to that Lego brick example: You will always reset and re-architect for a certain use case that a customer has and you’re doing it by reusing those Lego bricks. The difficulty is more in: “How can I create something new out of these Lego bricks?” and not so much at improving the Lego bricks themselves.

So what if we also open up reference architectures: implementations of solutions that are physically accessible, not just slides and PDFs where somebody talks about a reference architecture, but something you can actually touch, with running code you can contribute to?

I think that levels up the possibilities for contributing back to that reference architecture. It’s also a place where customers come together with solution architects, with independent software vendors or with system integrators, working together in a cloud paradigm, or rather a hybrid cloud paradigm, because it’s not just cloud, but also on-premise on bare metal.

It means integrating such reference architectures and keeping them running for a longer period of time, so that customers and system integrators can touch them, contribute back, and have a living example to bootstrap the next setup at a customer. You don’t have to start from scratch. You can see how GitOps is actually implemented in such a scenario, and you can go back and examine how it was done for real.

Ingo: Okay. So that means I can come as an interested user and watch the real life operated environment, how things are working and can come and make my contributions? So is there already something I can see or is there a practical implementation already or is it a plan for the future?

Marcel: That’s a very good question, because at Red Hat we also started from this philosophical backdrop. More than a year ago we asked ourselves: how can we map contributing to software onto contributing to an implementation of software, that is, the operations of a software stack?

And we came to the conclusion that every cloud, every setup, every instance can “operate first” in the sense of a verb: being open, being inclusive, taking more feedback from developers, et cetera. But I think it’s also important that we have a reference implementation of such an “Operate first” community cloud.

So there is

https://operate-first.cloud

It’s a community-operated cloud environment, which right now consists of Kubernetes, in this case OpenShift on a bare metal installation running at Boston University.

We have another cluster running at Hetzner, set up with the help of the German solution architects, and we are operating clusters in AWS. So it’s already a multi-geo and hybrid cloud setup.

We have all the procedures to onboard projects. This is documented on GitHub and the logs and the metrics data are being captured and are made [publicly] available. And if you want to onboard your open source project or your workload into that cloud environment, you can do so by filing a ticket, then you go through the onboarding process and then you can see how we deploy namespaces, how we deploy workloads.

We use ArgoCD and other state-of-the-art GitOps procedures. And you can track all of this without actually signing up for anything, because it’s all available on GitHub. But if you want to contribute, it’s also easy: You basically create a pull request to the Git repositories that contain the operational configuration, and then bots will help you automate and deploy workloads.
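
The GitOps idea described here, where Git holds the desired configuration and automation brings the cluster in line with it, can be sketched roughly as follows. This is a minimal illustration and not the Operate First tooling or the ArgoCD API: the function `plan_sync`, the resource names and the configuration values are all hypothetical.

```python
# Hypothetical sketch of a GitOps-style reconcile step.
# Nothing here is taken from the Operate First repositories or from ArgoCD.

def plan_sync(desired, live):
    """Return the actions needed to make `live` match `desired`.

    Both arguments map a resource name to its configuration dict.
    `desired` stands for what is committed in Git, `live` for what
    currently runs in the cluster.
    """
    actions = []
    for name in sorted(desired):
        if name not in live:
            actions.append(("create", name))
        elif live[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(live):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Made-up example state: a merged pull request changed `desired`.
desired = {
    "namespace/demo": {"owner": "team-a"},
    "deployment/demo/web": {"replicas": 3},
}
live = {
    "deployment/demo/web": {"replicas": 2},
    "deployment/demo/old-job": {"replicas": 1},
}

for action, name in plan_sync(desired, live):
    print(action, name)
```

In this picture a contribution is a pull request that changes `desired`; once it is merged, the automation computes and applies the resulting actions, and the full history stays visible in Git.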

“It’s something for aspiring SREs who want to learn how to do SRE in a cloud-native context.”
Marcel Hild

Or if you want to resolve operational issues, the logs are there for introspection and you can contribute back. So it’s something for aspiring SREs (Site Reliability Engineers) who want to learn how to do SRE in a cloud-native context. But it’s also something for open source projects that want to run community workloads there, acquire knowledge on how to operate these services, and feed that back into the project itself.

And for solution architects, there’s this notion of “this is a place where we can showcase reference architectures and let them run for an extended period of time”. That reduces the amount of work needed to set up a reference architecture. And you can showcase it to a customer or to your fellow solution architects, because it’s out there running openly on the internet and not behind a company’s VPN.

Ingo: Yes. Great, thanks! First of all, I’m very interested in looking into this, and since I’m personally also working with the Hetzner environment, I’d like to know more here: Robert, you just started to work on doing “Operate first” on Hetzner clusters. Can you tell us what you did and why you did that? And how can I get involved if I want to learn more?

Robert: I did it because I wanted an “Operate first” cluster here in EMEA on a cheap infrastructure like Hetzner, and my colleague Christoph Görn reached out to me. They needed support to build a cluster in EMEA. So I basically joined the team and the GitHub organisation, created a repository and started working on it. We communicated via Slack to discuss some architecture topics: how to set up the cluster, how to use some APIs, how to use the API of the DNS server for “Operate first”, and so on.

“I basically joined the team and the GitHub organisation, created a repository and started working on it.”
Robert Bohne

Then I created a bunch of Ansible playbooks to create an OpenShift cluster on a Hetzner bare metal cluster, not based on virtual machines. We decided to go with a bare metal cluster and contributed those playbooks to the “Operate first” team. And now the team is onboarding the cluster, setting up the single sign on, connecting to the GitOps approach and rolling out all the configurations.

And now we have a cluster here in EMEA, and it was very nice to see how you can work together in this global team, with people from the US, the Czech Republic and Germany. This was very interesting for me, working more in a dev fashion although the topic is basically ops. Working with pull requests, with GitHub issues and discussing the issues was really new to me. It was interesting and I learned a lot, and I hope some people from “Operate first” learn things from me too, because I’m a strong ops guy. I grew up in the ops area in a real ITIL world, with incident tickets, change requests and problem tickets. Yeah, this is quite interesting and still ongoing, of course.

Marcel: Yes. And it’s exactly this learning aspect that is important here. Robert set up this cluster at Hetzner, and that’s not an easy task because they have some specialties in how you access the hardware, etc. But now it’s all documented on GitHub, so everybody can reuse it and either redo it and learn from it, or make it better and contribute back.

So we learned from Robert how to do it at Hetzner. And we are educating others on how to set it up in a GitOps fashion, with Ansible playbooks and all the good stuff, instead of doing it once for a demo, tearing the cluster down, and never looking at the documentation again.

It really lives on there because it’s actively maintained, just as you would do with day-2 operations at a customer site, because the customer is also not setting up a cluster just for demo purposes. They want to operate it for a longer time. And then the actual fun part begins.

“All this history is being tracked there in the open for further re-inspection”
Marcel Hild

How do we use this cluster for actual workloads and how do we scale this cluster? How do we upgrade this cluster? How do we maintain it basically? And all this history is being tracked there in the open for further re-inspection.

At some point my machine learning overlords will come and learn from it, and then we will have AIOps bots doing it for us.

Ingo: Great, thanks! There is definitely another aspect I really like about that, because I also have a strong infrastructure and operations background, and I see that operations people are often very good and smart, but not very good at documenting and transferring their knowledge to others. Does learning from dev help here?

Robert: Sometimes this is the same problem with dev guys. So they also do not document everything.

Ingo: So we learn from each other?

Marcel: But there’s another example: As a developer, and I’ve been a developer for a long time, I usually “google” something like a stack trace or an error message, and I end up on Stack Overflow, or on the documentation, or on an issue for that particular project. And I can dig into why this function is not behaving as it should, or how to architect something, basically how to write code.

The internet is full of help for people who write software, because we have been doing it in an open source fashion. Now, if I’m setting up or operating a particular environment, I might search for an error message, but that doesn’t really help me set up or run the environment, because the logs are not searchable, the incident issues are not searchable, and the metrics are not searchable.

So you won’t find an edge case or an outage properly documented or a post mortem document from an outage. These are not searchable because they are always proprietary to the operations platform.

Imagine you are operating an OpenShift cluster and you have an outage. You can already see that a node is going down, and you have these error messages showing up. Now imagine that you search for them on the internet and find a post-mortem document from the “Operate first” community, where they had a similar outage. You immediately know where to look, how to debug that outage or how to handle it.
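
What such a search over openly published post-mortems could look like can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names, the post-mortem titles and texts, and the error message are all made up for the example, and a real search engine would rank far more carefully than this simple word overlap.

```python
import re

# Hypothetical sketch: rank public post-mortem documents by how many
# words they share with an error message. All data below is made up.

def tokens(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_postmortems(error_message, postmortems):
    """Return titles of matching post-mortems, best match first.

    `postmortems` maps a document title to its body text. Documents
    that share no word with the error message are left out.
    """
    query = tokens(error_message)
    scored = []
    for title, body in postmortems.items():
        overlap = len(query & tokens(title + " " + body))
        if overlap:
            scored.append((overlap, title))
    # Highest overlap first; ties are broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored]


postmortems = {
    "Node NotReady after disk pressure":
        "kubelet reported DiskPressure, node went NotReady, pods were evicted",
    "Expired ingress certificate":
        "TLS handshake failed because the certificate expired",
}

matches = rank_postmortems(
    "node worker-3 NotReady: kubelet DiskPressure", postmortems
)
print(matches)
```

The point is less the ranking than the precondition: this only works when post-mortems, logs and incident tickets are public in the first place.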

I think that’s pure gold for people operating stuff. And, I mean, obviously we don’t want any outages, so hopefully this outage gets documented and the fix makes it back into the software itself, so that the outage is prevented in the first place.

Ingo: Very good. I think we are at the beginning of a new era of Open Source exploring new ways of operating software.

Marcel: I think it’s the first truly open cloud or production environment that I have seen. We have similar setups for the Fedora community and for the home automation community. There are already communities out there that operate things in a community fashion, but it’s still proprietary in the sense that it is closed to everyone except those who operate the environment. Actually creating an environment that is open by default in terms of configurations, logs, incident tickets and how it’s being set up, in an easy-to-consume fashion and in a cloud-native way, really at the boundaries of how we run things these days: I think this doesn’t exist today. That’s really novel in terms of reinventing or rethinking open source and reapplying it to operations in general.

“This is not just an idea, this simply works!”
Robert Bohne

Robert: And it combines the classical operations world, starting with ordering bare metal servers and putting them into the rack, with the GitOps, DevOps and SRE world that we move into at a certain point. We combine those two worlds, and it is very interesting to see that this works. So this is not just an idea, this simply works.

Marcel: And it’s a place for trying out things. It’s like Open Source software: nobody comes and says, this is the way we do it because we have a waterfall plan to implement stuff and we have these features to deliver. If you want to do something, if you want to try out something: Do it. It’s open, right? You can contribute and you have this primordial soup of ideas that challenge each other. I’m welcoming bare metal installs, cloud installs, OpenStack, OpenShift, Kubernetes and community operators. These are all welcome projects. They might solve the same problem with different implementations, but that creates competition, and competition creates the best solution for something.

Ingo: Marcel, Robert: Thank you very much for this engaging interview!
For the reader: If you want to engage with the “Operate First” team, look at https://operate-first.cloud and try to run your project there!
