WTF Is Policy as Code?

Identity is easy. Everybody has one, after all. Authentication? Possibly even less convoluted. Anyone using a smartphone authenticates dozens of times per day just to use it, and that’s even before involving remote services like those needed for banking, work, or social media. Perhaps it was this apparent simplicity that drew me into the world of identity systems some five or six years ago.

In the years prior to that, I had primarily been working in product or service development, where a common theme for us developers was to be handed some vague description of what needed to be done, and then spend days or weeks trying our best to do just that. Only when we finished would we discover that what we did wasn’t really what was being asked, and we’d need to spend a considerable amount of time refactoring what we had built to support these changed or additional requests. At the end of a sprint we would complain politely, and if we were lucky things would improve for a while before returning to the status quo some weeks later. 

You may relax though: this post isn’t going to be about how Agile, Scrum, or some other methodology solved this challenge. Instead, the focus is on how we arrived at an open-source solution, the Open Policy Agent (OPA), that met all of our goals for working with policy as code. But first, some background.

One day my team was assigned new responsibilities. The modern API we had been working on for a while was going to be handed over to a new team, and we were instead assigned the non-trivial task of building a much needed new identity platform—one that would replace the handful of legacy platforms that were currently trying to do this. Some of the requirements included having a single concept of a user ID and a platform in place for dealing with single sign-on (SSO) authentication across dozens of disparate products and systems. 

Of all the changes this meant for us, one of the more interesting ones was the introduction to working with standards. We had probably all worked with them to some extent, but it wasn’t until that point that we really got to work with them closely. Where we had previously dealt with vague requirements and loose specifications, we were now faced with all these documents that would say exactly what was right and what was wrong. The strictness of it all was, paradoxically enough, liberating.

For a few months we were all knee-deep in specifications and standards documents, learning how to use OAuth2 for obtaining access tokens, how OpenID Connect would provide a much needed identity layer for our applications, and how SCIM would simplify and standardise identity across disparate identity back-end stores. From what we learned, we were able to build an identity system that would serve the company for many years to come. 

The day of the first release eventually came, and miraculously things just seemed to work. People would log in via the front end, the identity system would issue tokens in the back end, and these tokens would then be transported across our products and services to identify the user or service behind the request. While we certainly had a backlog of improvements to work on for the next couple of years, the basic functionality described in all those standards was there.

The value of delegation

So, now what? With a well working system for authentication in place, and standards that clearly defined how to transport the resulting identity data across services and systems, where would we go from here? 

While we had successfully avoided it for the duration of the project, the big elephant in the room was clearly authorisation. With all these standards to help us answer who you are (ie authentication), there seemed to be almost nothing similar out there to tell us what you can do (ie authorisation). The little we could find in that space was ancient and felt like something that had sprung out of the era of the monoliths, far from the distributed, microservice-based, Cloud Native applications we had been building.

If there was one concept we valued over any other in the identity systems we worked with, it was that of delegation. Delegation is essentially an elegant way of saying “not my problem!”, and who wouldn’t love saying that when faced with something truly challenging? For authentication, this basically meant that services could delegate the responsibility for authentication entirely to the identity system. Did something like that exist for authorisation? 

While there were a few notable differences between delegating authentication versus delegating authorisation, none of them seemed insurmountable. One key difference was that while the result of authentication was an identity that could be used across systems, authorisation policies tended to be much more local to the services involved. Permissions granted to access system X would rightfully be dismissed by system Y. 

Policy as code? No thanks!

This locality requirement had led developers to put the authorisation policies of their services as close to the service as possible, and what could possibly be closer than in the code of the service itself? 

This coupling, however, came at a cost. Policy embedded in the code of the application could not be easily shared, or even easily discussed, across teams and products. Not even developers with sufficient knowledge of all the different programming languages and frameworks involved in an organisation would be able to point out policy-specific code from unrelated business logic, and even if somebody could, the level of effort involved would mean it never really happened. “Policy as code”—or rather, “policy in code”—was in that sense exactly what we were trying to avoid!

Having authorisation logic or organisational policy codified was however certainly better than keeping it in a PDF file, Word document, or Jira ticket. But if that policy code was mixed with application code and unrelated business logic, it most likely meant we’d need to have both—an “original” policy document and implementations of the same policy in code. Keeping them both in sync would be error prone, and over time they’d be bound to drift apart. 

Any programmer has probably seen comments in code that didn’t reflect what the code was actually doing. While mildly annoying in such context, these types of inaccuracies could be disastrous when applied to organisational policy, and even more so when policy is meant to mirror requirements enforced by national or international law.

What we’d really want here, and what we’d later discover was the real meaning of “policy as code”, was policy code that was indeed code, but code that could easily be kept separate and decoupled from application code and business logic. We just didn’t know where to look.

Delegated authorisation?

Coupling policy code to our apps had an even worse consequence: any change made in policy post deployment required the application code to be updated and then redeployed. While this might work OK for a monolithic application, it clearly posed a problem in a microservice-based architecture. What if there were hundreds, or even thousands, of services, among which it would make sense to have some organisational policy shared, while also allowing individual services to add their own policies on top?

One of the bigger challenges in any sufficiently complex system is going to be that of managing change. The more services and systems—or people and teams—involved, the higher the cost of the change. The delegated authentication model had escaped this by saying “changes are managed internally in the identity system”, and the microservice model had delegated this responsibility out to the individual teams managing the service. 

Distributed authorisation seemed to draw a contradictory line right between the two models, where we would find ourselves wanting some responsibilities centralised while simultaneously allowing services the independence that developers had come to expect from the microservice model. Certainly no easy feat!

The long road to OPA

The more we tried to imagine what a modern authorisation solution would look like, the further away it seemed. Eventually, we let go of the idea. Delegating the responsibility of solving authorisation to the development teams meant that even though this solution wasn’t great from a higher level perspective, it would not directly affect us working with the identity system, as that was considered a somewhat isolated domain. We had effectively managed to adopt the “not our problem!” approach that we had wished for our architecture. 

It wasn’t until the KubeCon conference in Barcelona a couple of years later that I would be reminded of this. Some of the talks there revolved around building powerful guardrails around the Kubernetes API by letting the Kubernetes admission controllers consult an external policy engine when deciding whether to allow a change of state in the cluster. The name of this policy engine was Open Policy Agent (OPA), and it was apparently quite a popular open-source project in the Kubernetes world. 

Now that I was spending about half of my time working in a Kubernetes-focused platform team, this seemed interesting. Returning from the conference, I started a small proof of concept to explore the capabilities of this policy engine for Kubernetes, only to find out that not only did it do that, but it seemed to offer an almost perfect solution for the authorisation problem I had struggled with some years before. 

Policy as code? Yes please!

OPA solved the “policy as code paradox” by inventing a policy language of its own: Rego. It initially seemed like this would add a lot of complexity, and quite some investment in time. That initial investment would soon pay off however, as policy—and to some extent, the complexity—could now be decoupled and isolated from unrelated business logic. 

Doing this for a single system would indeed have been expensive, but it became increasingly clear with each system added that much of the policy logic truly was common. In fact, this approach would not only decrease the total complexity of the system, but would also allow one to easily reason about policy across even large organisations spanning multiple teams, divisions, or geographic boundaries. 

Rego offered a declarative language where we would be able to describe policy like “allow user X access if member of group Y, or has role Z” without having to go into the details of how that should be done. As long as the data Rego got to work with was structured in a way that could be represented as JSON or YAML, policy authoring was a reasonable thing to do. Moreover, built-in functions like those verifying and decoding JSON web tokens made OPA the perfect match to the identity system already in place.
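To make the “member of group Y, or has role Z” rule concrete, here is a minimal sketch of the same decision written in Python rather than Rego. It is only an illustration of the logic over JSON-shaped input: the function name, the field names (`user`, `groups`, `roles`) and the example values are assumptions of mine, not anything defined by OPA.

```python
from typing import Any, Mapping


def allow(input_doc: Mapping[str, Any], required_group: str, required_role: str) -> bool:
    """Return True if the user is in the required group or holds the required role.

    The input layout ({"user": {"groups": [...], "roles": [...]}}) is a
    hypothetical shape chosen for this example.
    """
    user = input_doc.get("user", {})
    in_group = required_group in user.get("groups", [])
    has_role = required_role in user.get("roles", [])
    return in_group or has_role


# Hypothetical request data, as it might arrive decoded from JSON.
request = {"user": {"id": "alice", "groups": ["engineering"], "roles": []}}
decision = allow(request, required_group="engineering", required_role="admin")
```

The point is less the code itself than where it lives: the rule takes structured data in and hands a decision back, with no knowledge of the service asking the question. In Rego the same rule would be stated declaratively, outside the application entirely.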

Delegated authorisation

The next, and arguably bigger, problem OPA solved was exactly the “delegated authorisation” scenario we had been struggling with. The distribution model of OPA in microservice environments means that each instance of a service will have its own little instance of OPA running just alongside it. These instances are free to combine local policy and data with common policy code shared by multiple systems in the cluster, and distributed through one of the management APIs that OPA may use to allow centralised policy management, if it makes sense to do so.
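From the service’s point of view, delegating a decision to the OPA instance next to it can be as small as one HTTP call. The sketch below assumes OPA is listening on its default port on localhost and that a policy exposes a boolean rule at the path `httpapi/authz/allow`; that path, the input fields, and the function names are hypothetical and would differ in a real deployment. It uses OPA’s Data API, where a query is a POST to `/v1/data/<path>` with the input wrapped in an `input` key.

```python
import json
import urllib.request

OPA_URL = "http://localhost:8181"    # assumption: OPA sidecar on its default port
POLICY_PATH = "httpapi/authz/allow"  # hypothetical package and rule path


def build_query(user: str, method: str, path: str) -> urllib.request.Request:
    """Build the Data API request asking OPA for a decision on this input."""
    payload = {"input": {"user": user, "method": method, "path": path}}
    return urllib.request.Request(
        url=f"{OPA_URL}/v1/data/{POLICY_PATH}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_decision(body: bytes) -> bool:
    """Interpret OPA's response; a missing "result" (undefined rule) means deny."""
    return json.loads(body).get("result", False) is True


def is_allowed(user: str, method: str, path: str) -> bool:
    """Ask the local OPA instance whether the request should be allowed."""
    with urllib.request.urlopen(build_query(user, method, path), timeout=2) as resp:
        return parse_decision(resp.read())
```

A request handler would then call something like `is_allowed("alice", "GET", "/salaries/alice")` and enforce whatever comes back. Treating a missing `result` as a denial is a design choice in this sketch, so that an undefined or unloaded policy fails closed rather than open.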

As if we weren’t convinced already, the last little detail in the OPA package would amaze us even further. Not only did OPA solve much of the microservice authorisation problem we’d been struggling with but, as the Kubernetes admission controller use case demonstrated in Barcelona showed, the concept of decoupling policy from enforcement could be applied to pretty much any domain. Before long, we were investigating also using OPA for our Kubernetes clusters, our Infrastructure-as-Code projects, and our deployment pipeline.

High standards

While OPA itself is not a standard, it allows policy authors to tap into any of the prevalent mechanisms for dealing with authorisation (some of which have standards around them), such as access control lists (ACLs), as well as role- or attribute-based access control (RBAC/ABAC). You could use OPA to build your own policy framework entirely. It doesn’t even necessarily have to be about authorisation! As much as I had enjoyed working with standards before, I could certainly see how that model would be limiting more than it would be liberating in this context.

Joining the community

We had seen all we needed to see. While we certainly had some work ahead of us, OPA had proven to be exactly what we had envisioned—but failed to describe—some years before. OPA managed to perfectly manifest the whole notion of “policy as code” as no system we’d ever been close to had.

Had OPA not been open sourced, perhaps other contenders would have stepped up to the challenge. Instead we’ve seen both a vibrant community grow around OPA and a thriving ecosystem of adaptations, integrations, and plug-ins that just keeps getting bigger and better with time. This year counts as my third year as an active member of the OPA community, and I’m not going anywhere!

So, where can you start? My usual suggestion is to start simply. Identify a couple of informal policies in existing services or apps you’ve built and consider rewriting them in Rego and OPA. Make sure to keep a tab with the excellent OPA docs open as you explore Rego, and join the OPA Slack to see what others in the community are working on or might be struggling with. To really take a deep dive into Rego and take your skills to the next level, check out the Styra Academy.

Hopefully my story has helped shed some light on the type of problems solved by formalising policy as code, and the benefits of decoupling policy decisions from the business logic of your applications and services. Whether for infrastructure or authorisation, Kubernetes or build pipelines, OPA offers a unified way of working with policy that will only grow in importance with your organisation and tech stack.

Where does your policy-as-code journey start?
