Site Reliability Engineering, or SRE, is an engineering practice formalised and named by Google. It has helped many organisations maintain their platforms and ensure application performance and reliability, while leaving their application developers or platform engineers free to innovate and respond to fast-evolving business needs.
I’ve written an e-book on that topic, and we have been addressing the basics of SRE in our blog. We think it’s an important part of a modern, fast-responding, Cloud Native system.
But what is CRE? That’s something that’s related to SRE, but goes a step further.
To recap: In SRE, responsibility for an organisation’s platform is split between two teams:
- A product team, which focuses on delivery of the business value, application, or service (including innovation).
- A reliability team, which focuses on both maintaining and improving the platform itself.
We’ve taken that concept and built on it. In CRE (Customer Reliability Engineering, also inspired by a service at Google), the product team is made up of our customer’s engineers, while the reliability team is made up of Container Solutions engineers.
The advantage this offers is that it brings more experience and the latest best practices into the customer’s organisation.
Container Solutions helps the customer’s development teams both to define the technical and organisational requirements and to meet them. To that end, we can offer training, architectural consulting, SLO/SLI workshops, and engineering in the form of co-implementation with Container Solutions (CS) engineers.
This setup also frees the in-house engineers who make up the product team to focus on innovation—in other words, to focus on not only what customers demand now, but what they will demand in the near future.
Because these engineers are not obligated to take on pager duty in their off hours, they are able to spend time learning new languages and technologies, building their skills and becoming more valuable to their company. And because they’re not getting burned out by being on-call so often, they might be more likely to stay longer with their employer.
How Our Team Works
CRE is what you get when you take the principles and lessons of SRE and apply them towards customers.
The Container Solutions team deeply inspects the key elements of a customer’s critical production application—code, design, implementation, and operational procedures. We take what we find and put the application (and associated teams) through a strict Production Readiness Review. At the end of that process, we tell you, ‘Here are the reliability gaps in your system. Here is your error budget. If you want more uptime (nines), here are the changes you need to make’.
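To make the link between 'nines' and an error budget concrete, here is a minimal sketch of the arithmetic. It assumes a simple time-based availability SLO measured over a 30-day window. The function names `error_budget` and `remaining_budget` are ours, invented for illustration, and this is not necessarily how a given Production Readiness Review calculates the figure.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the downtime allowed within `window` for an availability SLO.

    `slo` is a fraction such as 0.999 (three nines). This assumes a
    time-based SLO; request-based SLOs count failed requests instead.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return window * (1.0 - slo)


def remaining_budget(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Return the budget left after `downtime` has already been spent.

    A negative result means the SLO for this window has been missed.
    """
    return error_budget(slo, window) - downtime


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed measurement window
    for slo in (0.99, 0.999, 0.9999):
        budget = error_budget(slo, window)
        print(f"SLO {slo:.2%}: {budget.total_seconds() / 60:.1f} minutes of allowed downtime")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why asking for more uptime usually means changing the system rather than just watching it more closely.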
We also build common system monitoring, so that we have mutually agreed-upon telemetry for paging and tickets. It’s often a lot of hard work on the customer’s part to get past our Production Readiness Review, but in exchange for the effort, customers can expect the following:
- Shared paging. When their pagers go off, so do ours.
- Auto-creation and escalation of Priority 1 tickets
- Container Solutions’ participation in customer war rooms (because, despite everyone’s best efforts, bad things will inevitably happen)
- A CS-reviewed design and production system
For a system that follows this assessment and these principles, the cost of outsourcing support will be only a fraction of the long-term cost of running an application or system that has not passed this review.
Our CRE service is designed to be delivered remotely—a necessity for companies still operating under COVID-19 restrictions.
To learn more about how CRE works, check out our service page, where you can download a detailed brochure and request more information.
You can find all of our information about SRE and CRE in one place. Click here.