10 Tips to Write a Kubernetes Operator

How to Get Started Writing a K8s Operator

Operators are the new buzz. There are plenty to choose from, and they can help you a lot with managing your services and taking care of your cluster.

However, there isn’t an operator for everything. Your organisation may have a custom use case for an operator, or you may be developing a service yourself where an operator model makes sense. Whichever is the case, there are many things to consider first. Here are some notes from our experience of getting started with operators:

1. Do you really need an operator?

There are many ways to get a service installed or to manage components in a cluster. Deciding to write your own operator is a big step. Aside from implementing all the functionality, you have to think about managing the operator itself, operating your operator and fixing bugs. The operator may be able to solve your problem, but you have also just added another component that you now own completely.

Alternatives can be:

Check whether there is already a good operator out there doing the job
Check whether your CI/CD pipeline can take care of the job in a good way

2. Be aware of the goals and objectives of the operator

Don’t write an operator that can do everything. An operator is supposed to take care of one concern only, e.g. “Manage the lifecycle of a Redis cluster for me”.

Also decide early on, based on the objectives, if your operator can be scoped to a namespace or if it needs to be cluster-scoped.

3. Make use of the operator framework

There is an Operator SDK written in Go that will help you a lot! Make yourself familiar with setting up and using the operator framework. Make it do something really simple initially and take it from there.

Some noteworthy features of the Operator SDK:

Watching Resources

With the framework you can easily watch resources and react to changes. This is usually how a reconciliation loop is triggered.

For example:


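import (
    "context"

    appsv1 "k8s.io/api/apps/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// DeploymentWatcher reacts to changes of Deployment objects.
type DeploymentWatcher struct {
    client.Client
}

//+kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;patch;update
func (r *DeploymentWatcher) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // req holds the namespace/name of the Deployment that changed.
    // Do your magic
    return ctrl.Result{}, nil
}

func (r *DeploymentWatcher) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&appsv1.Deployment{}).
        Complete(r)
}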

A very basic structure. With the correct RBAC in place, Reconcile() will be triggered whenever there is a change to any deployment in the cluster.

Reconcile

This is your main loop (or loops, there can be several). It is triggered by controllers, which in turn watch resources. During reconciliation you decide what should happen, or whether anything should happen at all. At the end of your reconciliation loop you’ll most likely change something in the cluster, handle errors and report back through metrics, logs and events.
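The decision step of a reconciliation can be sketched in plain Go without any Kubernetes dependency. The names below (State, Action, decide) are hypothetical and only illustrate the idea of comparing desired against observed state; a real Reconcile would read both through the client.

```go
package main

import "fmt"

// State is a hypothetical, simplified view of a workload.
type State struct {
	Exists   bool
	Replicas int
}

// Action is the single step the reconciler decides to take.
type Action string

const (
	ActionNone   Action = "none"
	ActionCreate Action = "create"
	ActionScale  Action = "scale"
	ActionDelete Action = "delete"
)

// decide compares desired and observed state and returns the next action.
// It has no side effects, so calling it again with the same input is safe.
func decide(desired, observed State) Action {
	switch {
	case desired.Exists && !observed.Exists:
		return ActionCreate
	case !desired.Exists && observed.Exists:
		return ActionDelete
	case desired.Exists && desired.Replicas != observed.Replicas:
		return ActionScale
	default:
		return ActionNone
	}
}

func main() {
	desired := State{Exists: true, Replicas: 3}
	observed := State{Exists: true, Replicas: 1}
	fmt.Println(decide(desired, observed)) // out of sync
	fmt.Println(decide(desired, desired))  // already in sync
}
```

Keeping the decision separate from the code that talks to the cluster makes the "should anything happen at all?" question easy to test on its own.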

Client

Of course there is also a client that interacts with the cluster API itself. It is your way to talk to the cluster, and you can do all the usual things with it, like list, get, update and delete. There are some helpful utilities around it, such as a retry mechanism that the framework can provide. The client also serves reads from a cache, which keeps you from spamming the actual cluster API.
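To illustrate what such a retry mechanism does, here is a minimal sketch using only the standard library. For update conflicts, client-go ships retry.RetryOnConflict in k8s.io/client-go/util/retry; the helper retryWithBackoff and the error errConflict below are hypothetical stand-ins, not the framework's implementation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errConflict stands in for a retryable API error (hypothetical).
var errConflict = errors.New("conflict: object was modified")

// retryWithBackoff calls fn until it succeeds, returns a non-retryable error,
// or the attempts are used up. The delay doubles after every failed attempt.
// attempts is assumed to be at least 1.
func retryWithBackoff(attempts int, delay time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = fn()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err // not retryable, give up immediately
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retryWithBackoff(5, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errConflict // simulate two conflicting updates
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

In a real operator the function passed in would typically re-read the object and apply the change again, so that every attempt works on the latest version.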

The framework itself is open source and has a very active community with many examples, guides and helpful members if you encounter any issues.

If Go is not your cup of tea, we actually have our own operator SDK based on Java.

4. Make it open source

Really. Do it.

Open sourcing your operator can only have advantages. Always develop the operator with third parties in mind who will use, adopt and extend it. So basically your audience will be “everyone”. You’ll also benefit from great feedback, bug reports, and pull requests from others.
If there is a feature that is proprietary, think about how it can be generalized and abstracted away so it can be made open source.

5. Decide early on what k8s version and distribution you want to support

Decide early if you want to support multiple distributions, and if yes, which ones (for example OpenShift). Keep distribution-specific code to a minimum while still being compliant with all of them. Spend some time early on to find a solution that works as universally as possible for all distributions. This will make your life a lot easier later on.

6. Assume things can, and will, go wrong

In distributed systems anything can happen. Input can be missing, incomplete or misconfigured. The operator may suddenly not be able to work at all because someone deleted its roles. Or the operator may depend on a previous step that didn’t work properly, cascading an issue down the line.

Other problems are not necessarily your operator’s fault. For example, what happens when the cluster API starts throttling or rejecting your requests? What happens when the API takes three times longer to respond than usual? (Hint: things may go wrong.)

Verify that your input and your actions are what they are supposed to be. Always assume things can go wrong and add appropriate error handling and recovery methods in your code. Nothing erodes trust in your operator faster than an operator doing something it’s not supposed to do.
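Input verification can be as simple as a function that checks the spec before the operator acts on it. The sketch below assumes a hypothetical, trimmed-down RedisClusterSpec with two fields; errors.Join needs Go 1.20 or newer.

```go
package main

import (
	"errors"
	"fmt"
)

// RedisClusterSpec is a hypothetical, trimmed-down custom resource spec.
type RedisClusterSpec struct {
	Replicas int
	Version  string
}

// validate collects every problem instead of stopping at the first one,
// so the user gets a complete report in a single status update or event.
// It returns nil when the spec is acceptable.
func validate(spec RedisClusterSpec) error {
	var errs []error
	if spec.Replicas < 1 {
		errs = append(errs, fmt.Errorf("replicas must be at least 1, got %d", spec.Replicas))
	}
	if spec.Version == "" {
		errs = append(errs, errors.New("version must not be empty"))
	}
	return errors.Join(errs...)
}

func main() {
	// An incomplete spec is rejected before anything in the cluster is touched.
	if err := validate(RedisClusterSpec{}); err != nil {
		fmt.Println("refusing to reconcile:")
		fmt.Println(err)
	}
	// A complete spec passes.
	fmt.Println(validate(RedisClusterSpec{Replicas: 3, Version: "7.2"}) == nil)
}
```

Rejecting bad input early and loudly is usually cheaper than recovering from a half-applied change later.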

7. Test, test, test

Write appropriate unit and e2e tests and integrate them into your CI. Luckily the operator framework can help you with that. It offers a test environment, EnvTest, that runs against a local API server (or a real cluster if you want) and supports different stages, test tables, evaluations etc. The most popular testing framework is the combination of Ginkgo and Gomega, which also integrates nicely with the Operator SDK.

Testing the workflow and behaviour of your operator is important to detect edge cases and take care of them. With new features coming in, the shape of your operator will change, and so will the behaviour. Having a grounded evaluation and affirmation through the tests will help you a lot. Especially if you work in a team.
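The test tables mentioned above boil down to a list of inputs and expected results that one loop checks row by row. The sketch below keeps that idea in a plain, self-contained program; needsScale and testCase are hypothetical names, and in a real project the loop would live in a _test.go file using the testing package or Ginkgo's table helpers.

```go
package main

import "fmt"

// needsScale is the hypothetical unit under test: it reports whether the
// observed replica count has to be changed to match the desired one.
func needsScale(desired, observed int) bool {
	return desired != observed
}

// testCase is one row of the test table.
type testCase struct {
	name              string
	desired, observed int
	want              bool
}

// runTable checks every row and returns the names of the failing cases.
func runTable(cases []testCase) []string {
	var failed []string
	for _, tc := range cases {
		if got := needsScale(tc.desired, tc.observed); got != tc.want {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	cases := []testCase{
		{name: "in sync", desired: 3, observed: 3, want: false},
		{name: "scale up", desired: 3, observed: 1, want: true},
		{name: "scale down", desired: 1, observed: 3, want: true},
	}
	fmt.Println(len(runTable(cases)), "failing cases")
}
```

Adding an edge case then means adding one row, which keeps the cost of covering new behaviour low as the operator grows.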

8. Simplify your CRD creation and deployment

Keep all of that stuff mostly automated. Kustomize or Helm come to mind. Having a solid way to get your operator into a cluster and working will help not only you during development but also anyone else adopting the operator.
(The people operating the operator will thank you!)

9. Permissions and Documentation

Operators are powerful; they often live and do their work on a cluster level. With great power comes great responsibility. Not only do you need to keep your RBAC permissions at a minimum, but you also need to describe the behavior of your operator in detail. Explain what the goal of your operator is (this goes back to point 2) and describe what it does and also what it does not do. It’s important to make the boundaries of the operator clear to third parties.

10. Follow the 12-Factor App principles

Operators are a perfect example of where the 12-Factor App principles can be easily applied. They’ll make your operator more resilient, stable and easier to work with. Always assume other people and third parties will make use of the operator. So it should integrate nicely wherever it is supported without too much customization.
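One of the twelve factors is to keep configuration in the environment. A minimal sketch of that idea follows; the Config fields and the variable names WATCH_NAMESPACE and MAX_CONCURRENT_RECONCILES are assumptions chosen for illustration, not settings the framework requires.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config holds hypothetical operator settings.
type Config struct {
	WatchNamespace string // empty means "all namespaces"
	MaxConcurrent  int
}

// loadConfig reads settings through the lookup function and falls back to
// defaults. Passing the lookup in makes the function testable without
// touching the real process environment.
func loadConfig(lookup func(string) (string, bool)) (Config, error) {
	cfg := Config{MaxConcurrent: 1}
	if v, ok := lookup("WATCH_NAMESPACE"); ok {
		cfg.WatchNamespace = v
	}
	if v, ok := lookup("MAX_CONCURRENT_RECONCILES"); ok {
		n, err := strconv.Atoi(v)
		if err != nil || n < 1 {
			return Config{}, fmt.Errorf("MAX_CONCURRENT_RECONCILES must be a positive integer, got %q", v)
		}
		cfg.MaxConcurrent = n
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.LookupEnv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```

With settings read this way, the same operator image can run in different clusters and be tuned through the deployment manifest alone.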

---

That is certainly not all. Another good resource is the operator best practices.

But I believe that the most important thing is this: Just get started and experiment and have fun!