The field of site reliability engineering originated at Google with Ben Treynor Sloss, who founded a site reliability team after joining the company in 2003. The practice has since spread across most organisations that require high uptime and availability, with SRE being among the 25 fastest-growing job titles according to LinkedIn. Indeed, Container Solutions’ own Cloud Native Operations service is modelled after it. So while you are unlikely to have the same scaling requirements as Google (few do), it’s always good to revisit their more evolved SRE outlook. For this, Container Solutions spoke to Jennifer Petoff, director of site reliability engineering education at Google, and Dave Stanke, developer advocate at Google Cloud Platform.
So much of the SRE space is centred on tooling, but there is a more pressing need to fill first: core skills and technical skills.
For Petoff, the most important characteristics of a good SRE are:
1. An innate curiosity to figure out how things work.
2. A willingness to engage in principled questioning of the status quo.
3. A dislike for toil and doing the same manual task over and over again.
SREs also need empathy and good communication skills. As Stanke reminded us, “SREs often find themselves in contentious negotiations between stakeholders—product owners, developers, datacenter ops—each of whom brings impassioned advocacy for the value that their role provides.” Plus, in the event of an incident, SREs are frequently the bearers of bad news, making that gentle bedside manner essential.
From a more technical standpoint, Petoff refers to SREs as force multipliers that “apply software engineering skills to make their jobs easier and enable services to scale without needing to scale the team at the same rate”.
But while they still spend too much time on the unexpected (outages, postmortems and being on call), SREs struggle to find their place, and the space to plan for and prevent the unexpected or even the unknown.
To Embed or Not to Embed, That Is the Question
Like cybersecurity, site reliability engineering often comes across as an outside influence that exists only to cause bottlenecks. In both cases, nobody thinks these jobs are unimportant; they are simply the jobs that seem to slow down release velocity. This could be because SRE, as at Google, is often housed in its own team instead of being embedded within software development or cross-silo development teams. But SREs should still be able to effect meaningful change even without being seated next to developers.
Petoff offers what she calls the ‘SLO handshake’ as a way to build trust between SREs, developers and other stakeholders. Service level objectives, which put the user at the forefront, she argues, create a natural balance between feature velocity and reliability. “It makes it easier to get everyone on the same page and rowing in the same direction. This is the opposite of a bottleneck and removes friction from the relationship”, she said.
The SLO helps translate for all stakeholders that reliability is a first-class feature of the product. Treating it as an essential feature gives more agency to SREs.
“What is important is that reliability has a strong voice at the table when there may be a temptation to trade off reliability in favour of new feature launches, despite not having enough error budget to proceed,” Petoff said.
She emphasises that, while SRE is a balance (or sometimes a rift) between reliability and feature velocity, reliability is a feature unto itself. At Google, it is even seen as a first-class feature. “If your product isn’t accessible to users or causes them frustration, all those other shiny product features don’t matter. I think positioning reliability as a feature can help SREs drive change even if they aren’t embedded on a development team.”
So long as you embed reliability across teams, SREs don’t necessarily have to be embedded alongside it.
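The error budget Petoff mentions comes down to simple arithmetic, and it can help to see it written out. What follows is a minimal sketch, assuming a request-based SLO (for example, 99.9% of requests succeed over some window). The function names and the rule "ship only while budget remains" are our own illustrative assumptions, not a description of Google's tooling.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been breached.
    Assumes a request-based SLO such as slo_target=0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def can_ship_feature(slo_target: float,
                     total_requests: int,
                     failed_requests: int) -> bool:
    """Hypothetical launch gate: allow risky launches only while budget remains."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) > 0.0


if __name__ == "__main__":
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.0%}")
    print("OK to launch:", can_ship_feature(0.999, 1_000_000, 250))
```

Because both sides agree on the target up front, the question of whether to launch becomes a shared number rather than a negotiation.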
Of course, as her colleague Stanke explained, SRE teams may become embedded on product teams, or even become de facto product owners, once a service matures. “As a service matures and layers are built on top, it often increasingly operates as a platform”, which slows new feature development. “At this point, SRE teams may become its service owners, with product teams—and their SREs—as downstream customers.”
Such embedded SREs must “balance the immediate operational needs of their host teams with the long term objectives for operational excellence across the whole engineering organisation. Forward deployed SREs are specialists akin to diplomats who must carefully initiate and facilitate strategic agreements across teams and with engineering leadership.”
However, whether or not SREs sit on cross-functional teams, the 2021 Accelerate State of DevOps Report highlighted that SRE is highly compatible with DevOps and contributes to achieving positive DevOps outcomes. Stanke refers to site reliability engineering as a common language and philosophy that should permeate whole organisations.
“All of the specialists on a cross-functional team—software devs, product managers, UX designers, even roles like marketing and legal—should understand reliability principles and collaborate to prioritise reliability as a feature, with full consideration of user benefits and risks,” he said. This means the SRE influence should remain, regardless of whether they are embedded on a product team.
But that means not loading your SRE team with so many operations and on-call responsibilities that they are kept from fulfilling their purpose: future-focused reliability.
As Petoff put it, “SREs must have time to make tomorrow better than today.” They must feel empowered to work on what they believe will improve the reliability of a service and, through automation, reduce the toil required to run it. That in turn unlocks innovation in your engineering teams.
Measuring Site Reliability Engineering
Another problem, according to the 2021 SRE Report by Catchpoint, which reflected on almost 300 SRE interviews, is that it’s challenging to measure the impact of the practice. The report zeroed in on what it identified as three key SRE tenets, asking whether each was increasing or decreasing:
- Toil reduction—those manual, repetitive activities
- Time spent on call
- The dev versus ops divide
Most organisations just aren’t sure. For most, there still isn’t a way to make baseline comparisons. As we know, the foundation of business agility is that you cannot improve what you aren’t measuring. But then if the SRE job description is too broad, there’s simply too much to measure. Yes, the SRE mantra is to automate everything, but before that happens, how do you know where to start?
Based on the survey results, SREs are spending the most time, in order of priority, on:
- Responding to incidents or outages
- Postmortem analysis and write-ups
- On-call rotation
- Developing applications or capabilities
- Experimenting or receiving training to expand knowledge and skills
- Authoring business processes, rules and best practices
- Performing audits
The three most prominent activities do nothing to reduce toil, time on call, or the dev versus ops divide. If the main purpose should be to solve for complexity, SREs are frankly doing too much manual ops work. Google limits that work to no more than half of an SRE’s time, but how many organisations have the scale, scope and budget of Google? Not every company can just hire more SREs.
Catchpoint posits that the only way to achieve this 50/50 split between reliability engineering and ops work is for site reliability engineering to shift left, influencing the design and development phase earlier. That way, SREs should catch reliability issues sooner, when they are cheaper and less complex to fix.
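If you have no baseline at all, a rough first measurement is the share of SRE time that goes to operational work, compared against the 50 percent cap described above. The sketch below assumes you already track hours per activity. The activity labels, the choice of which ones count as operational, and the function names are all illustrative assumptions on our part.

```python
# Assumption: these three activities, the top of the survey list, are the
# ones treated as operational work. Adjust the set to your own categories.
OPERATIONAL_ACTIVITIES = {"incident response", "postmortems", "on-call"}


def operational_share(hours_by_activity: dict[str, float]) -> float:
    """Return the fraction of tracked hours spent on operational work."""
    total = sum(hours_by_activity.values())
    if total <= 0:
        raise ValueError("no hours recorded")
    ops = sum(hours for activity, hours in hours_by_activity.items()
              if activity in OPERATIONAL_ACTIVITIES)
    return ops / total


def exceeds_ops_cap(hours_by_activity: dict[str, float], cap: float = 0.5) -> bool:
    """True when operational work takes more than `cap` of tracked time."""
    return operational_share(hours_by_activity) > cap


if __name__ == "__main__":
    week = {
        "incident response": 12,
        "postmortems": 6,
        "on-call": 10,
        "development": 8,
        "training": 4,
    }
    print(f"Operational share: {operational_share(week):.0%}")
    print("Over the cap:", exceeds_ops_cap(week))
```

Tracked week over week, even a crude number like this gives you the baseline comparison that most teams in the survey were missing.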
DORA metrics are a particularly good place to start measuring. Through six years of research, the DevOps Research and Assessment (DORA) team has identified four key metrics that indicate the performance of a software development team:
- Lead time
- Deployment frequency
- Mean time to restore (MTTR)
- Change fail percentage
Moreover, DORA offers a benchmark to identify Elite, High, Medium and Low performing teams, to help you figure out where your team fits in.
These metrics are described in detail in the book “Accelerate” by Nicole Forsgren et al., a book we strongly recommend at Container Solutions. You can also learn how they are being applied at eBay on our new “Hacking the org” podcast.
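To make the four metrics concrete, here is a minimal sketch of how they might be computed from your own deployment records. The `Deployment` record, its field names, and the choice of median for lead time and mean for restore time are our assumptions for illustration. DORA's research collects these through surveys, and the thresholds for each performance tier are not reproduced here.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False                    # did the change degrade service?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def _hours(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def dora_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the four key metrics over a reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    lead_times = [_hours(d.committed_at, d.deployed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [_hours(d.deployed_at, d.restored_at)
                for d in failures if d.restored_at is not None]
    return {
        "lead_time_hours": median(lead_times),
        "deployments_per_day": len(deployments) / period_days,
        "change_fail_percentage": 100 * len(failures) / len(deployments),
        # None when no failed deployment has been restored in the period.
        "mean_time_to_restore_hours": mean(restores) if restores else None,
    }


if __name__ == "__main__":
    history = [
        Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 11)),
        Deployment(datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 13),
                   failed=True, restored_at=datetime(2024, 1, 2, 14)),
    ]
    print(dora_metrics(history, period_days=7))
```

Lead time is measured here from commit to production deployment; if your pipeline defines it differently, change `_hours` inputs accordingly.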
Where is your site reliability team stationed? And how are you supporting them to succeed?