Under Control: Why Governance Engineering is Coming to Cloud Native

The history (and prehistory) of Cloud Native has been characterised by the gradual encroachment of automation on more and more pieces of the software engineering lifecycle. In the dark ages before the year 2000, automation was limited to tools like make and shell scripts, and was typically directed towards building software artifacts, such as tar files or CD-ROMs.

As time went on, more tooling built up to support the automation of the software lifecycle. Git helped automate the management of software code, Jenkins helped automate and manage software build processes, Terraform the automation of infrastructure delivery (leveraging the Cloud’s development of APIs for infrastructure provisioning), Docker the encapsulation of software environments, and so on.

This diagram from our GitOps e-book shows the antecedents of GitOps, culminating in tools like FluxCD and ArgoCD that are centred around Kubernetes. Kubernetes itself might be considered the apotheosis of the Cloud Native philosophy, emphasising the importance of delivering software platforms as code that can be stored and managed in auditable source control.

What this diagram and history show is that software automation and code are gradually ‘eating the world’ of manual processes in the software lifecycle. This trend started with building software artifacts, and then moved on to areas such as configuration, storage, delivery, deployment, testing, security, and infrastructure provisioning. Governance and compliance look like the next area to be consumed. For reasons that will become obvious, this is likely to be seen first in the financial sector. However, there is no reason other sectors won’t face increasing compliance demands as technology becomes ever more pervasive in, and risky for, their operations, and as those companies seek to mitigate the risks to their business.

Governance, Risk, and Compliance

One area that has remained relatively impervious to automation is the Governance, Risk and Compliance (GRC) side of IT service management. While technology has increasingly automated the operation of the financial sector, little has changed in GRC, except perhaps the amount of money spent (up) and the number of people involved (up). As a result, GRC in finance has scaled poorly. It still relies on regular, manual, point-in-time checks of whether controls are in place and working. Container Solutions has written about this challenge previously.

So far, engineers have shown little interest in tackling this problem. Perhaps this is because controls are seen as stifling rather than enabling. It may also be because ‘managing risk’ via controls is seen as less intellectually challenging than security problems such as supply chain management or vulnerability detection.

However, that may change soon.

Regulatory Focus

Regulators, like everyone else in industry, are subject to trends, and these trends can be shaped by events. In the financial sector, the biggest event this century was the 2008 global financial crisis, which precipitated a marked shift in regulatory frameworks and priorities. This shift brought focus to financial risk, both systemic and to the consumer. Banks and other financial institutions were required to hold more capital in reserve to guard against potential losses, and new measures were implemented to detect and prevent fraud and financial misconduct. The overall aim was to reduce risk and increase resilience in the financial positions that banks took.

While systemic risk never goes away, after 15 years this focus on financial regulation is receding, and there is increasingly a focus on operational risk and resilience (as opposed to financial).

The Digital Operational Resilience Act

In this context, the EU has put forward the Digital Operational Resilience Act (DORA). The act seeks to “enable and support the potential of digital finance in terms of innovation and competition while mitigating the risks arising from it”. The initial memo on the act covers various aspects of IT risk management, centring on risk management and reporting. The regulatory wind is blowing in a similar direction in the UK, where a resilience policy paper preceded the enactment of the Financial Services and Markets Act 2023.

The published documents on DORA are somewhat vague and generic. The act is slated to apply from January 17, 2025, but the technical standards are expected to be published ‘in tranches from January 17, 2024’, so the detail isn’t known yet. This leaves little time for financial institutions to allocate budget for the changes they need to make to become compliant, especially as banking budgets are still typically set on a yearly cadence.

Some ‘reading between the lines’ of the DORA documents might make their intentions clearer, however. It seems that regulators have been frustrated by the opacity around risk and around risk-related incidents. This would explain the strong focus in the act on standards for ‘incident reporting, reducing administrative burdens and strengthen[ing] supervisory effectiveness’. To put it plainly, regulators are going to expect compliance and audit functions to be able to report on their controls regularly, efficiently and clearly.

There is also an emphasis on system testing, again suggesting that regulators are keen to see a reduction in operational risk through better deployment and monitoring of new technology and systems.

Of particular interest to us at Container Solutions are these quotes from the act:

‘financial entities shall establish, maintain and review, with due consideration to their size, business and risk profiles, a sound and comprehensive digital operational resilience testing programme as an integral part of the ICT risk management framework’

‘financial entities need to have in place comprehensive capabilities enabling a strong and effective ICT risk management, alongside specific mechanisms and policies for ICT-related incident reporting, testing of ICT systems, controls and processes, as well as for managing ICT third-party risk.’

These quotes point to an increasing focus on GRC from the regulators, specifically on the testing and reporting of risk and ICT-related incidents. In short, less box-ticking of controls and more real-time observability.

Controlling Controls

One of the frustrations we think regulators have is the lack of visibility into the effectiveness of controls. At the moment, audits of controls take place on a cadence measured in years. They are carried out ‘by hand’ by auditors whose job is to seek out evidence that controls are being adhered to and are effective.

At Container Solutions, we know this problem is ripe for automation and standardisation, just as other areas of software (such as CI/CD) have been over the last ten years. The industry has taken the first steps toward this: FINOS has announced its Common Cloud Controls project. This project seeks to ‘develop a unified set of cybersecurity, resiliency, and compliance controls for common services across the major cloud service providers (CSPs)’.

However, this deals only with shared standards for describing controls, not with implementing them. For example, a control description might be ‘S3 buckets must not be available across the Internet’. Control implementations come in three classes: preventative, detective, and reactive. For this control, the implementations might be:

A CSP-level guardrail such as S3 Block Public Access or an AWS service control policy, the AWS counterparts to products such as Azure Policy (preventative), or
Attempting to connect to each S3 bucket in turn across the Internet and reporting any that allow access (detective), or
Deleting each S3 bucket that is detected as being open to the Internet (reactive)
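To make the split between a control’s description and its implementations concrete, here is a minimal sketch in Python. It assumes a hypothetical in-memory inventory of buckets in place of real cloud API calls. The names `Control`, `Bucket`, `preventative`, `detective` and `reactive`, and the identifier `S3-NO-PUBLIC`, are illustrative only. They are not taken from FINOS Common Cloud Controls or from any Container Solutions tool.

```python
from __future__ import annotations

from dataclasses import dataclass


# A control *description*: what must be true, independent of how it is enforced.
# Field names and the identifier below are hypothetical.
@dataclass(frozen=True)
class Control:
    control_id: str
    description: str


# Hypothetical in-memory stand-in for one cloud account's S3 inventory.
# A real implementation would query the cloud provider's API instead.
@dataclass(frozen=True)
class Bucket:
    name: str
    public: bool


NO_PUBLIC_BUCKETS = Control(
    control_id="S3-NO-PUBLIC",
    description="S3 buckets must not be available across the Internet",
)


class PolicyViolation(Exception):
    """Raised when the preventative implementation rejects a change."""


def preventative(requested: Bucket) -> Bucket:
    """Reject a non-compliant bucket before it is created."""
    if requested.public:
        raise PolicyViolation(
            f"{NO_PUBLIC_BUCKETS.control_id}: refusing public bucket {requested.name!r}"
        )
    return requested


def detective(inventory: list[Bucket]) -> list[str]:
    """Report (but do not change) every bucket that breaches the control."""
    return [b.name for b in inventory if b.public]


def reactive(inventory: list[Bucket]) -> list[Bucket]:
    """Remediate after the fact: return the inventory with offending buckets
    removed, standing in for actually deleting them."""
    return [b for b in inventory if not b.public]


if __name__ == "__main__":
    inventory = [Bucket("logs", public=False), Bucket("website-assets", public=True)]
    print(detective(inventory))                   # ['website-assets']
    print([b.name for b in reactive(inventory)])  # ['logs']
```

The design point is that one `Control` can have several implementations. Each implementation can be swapped or combined without touching the description that auditors and regulators agree on.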

We’re Working On It

At Container Solutions, we’re working on an open source solution that seeks to help companies automate their controls and auditing efforts. Just as with other Cloud Native transformation efforts, we see this as precipitating a step change not just in the pace of auditing, but also in the effectiveness, transparency, and scalability of auditing processes.

Efficiency and Speed: Automation can drastically reduce the time spent on manual audit tasks, allowing for quicker assessments and responses. This enables companies to conduct continuous audits with fewer resources.

Accuracy and Consistency: Automated controls reduce the risk of human error and ensure consistency in the application of audit rules, resulting in more accurate and reliable audit results.

Transparency: Automation provides real-time visibility into audit processes and outcomes, making it easier for stakeholders to monitor compliance, auditors to demonstrate compliance, and regulators to evaluate the results.

Scalability: As companies grow, so do their auditing needs. Automated controls can be easily scaled up to match the pace of growth, reducing the need for additional auditing resources.

Adaptability: In the fast-paced cloud environment, compliance requirements are constantly evolving. Automated controls can be more easily updated to reflect new regulations and standards, keeping companies agile in the face of change.

We can therefore begin to see a rebalancing of the equation on the GRC side. Disproportionate costs and infrequent manual checks are replaced by automated, real-time observability, which creates the opportunity to act immediately and significantly mitigate risk.
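As an illustration of what replacing point-in-time checks with continuous observability could look like, here is a minimal Python sketch of continuous control monitoring. It assumes a hypothetical snapshot, mapping bucket name to whether it is public, is available on every run. Each run of a detective check appends a timestamped evidence record to a log, which an auditor could query instead of gathering evidence by hand. The `Evidence` schema, the functions `record` and `compliance_rate`, and the control identifier are all hypothetical, not part of any existing tool.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Evidence:
    """One timestamped result of running a detective check (hypothetical schema)."""
    control_id: str
    checked_at: str                      # ISO 8601, UTC
    passed: bool
    failing_resources: tuple[str, ...]


def check_no_public_buckets(snapshot: dict[str, bool]) -> list[str]:
    """Detective check: names of buckets marked public in the snapshot."""
    return sorted(name for name, is_public in snapshot.items() if is_public)


def record(control_id: str, snapshot: dict[str, bool], now: datetime,
           log: list[Evidence]) -> Evidence:
    """Run the check and append the result to an append-only evidence log."""
    failing = check_no_public_buckets(snapshot)
    evidence = Evidence(control_id, now.isoformat(), not failing, tuple(failing))
    log.append(evidence)
    return evidence


def compliance_rate(log: list[Evidence], control_id: str) -> float:
    """Fraction of recorded runs in which the control passed."""
    runs = [e for e in log if e.control_id == control_id]
    if not runs:
        raise ValueError(f"no evidence recorded for {control_id}")
    return sum(e.passed for e in runs) / len(runs)


if __name__ == "__main__":
    log: list[Evidence] = []
    start = datetime(2024, 1, 17, 9, 0, tzinfo=timezone.utc)
    snapshots = [
        {"logs": False},
        {"logs": False, "tmp-export": True},  # a public bucket appears...
        {"logs": False},                      # ...and is later remediated
    ]
    # Hourly runs; a yearly point-in-time audit would likely miss the middle state.
    for hour, snapshot in enumerate(snapshots):
        record("S3-NO-PUBLIC", snapshot, start + timedelta(hours=hour), log)
    for evidence in log:
        print(json.dumps(asdict(evidence)))   # one JSON line per run
    print(compliance_rate(log, "S3-NO-PUBLIC"))
```

Emitting one JSON line per run is a deliberately simple choice. Such a log is easy to ship to whatever reporting or observability stack an organisation already uses.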

Stay tuned for more updates on this effort, as we seek to make Governance Engineering a first-class citizen in the Cloud Native landscape.
