
How eBPF enables Cloud Native Innovation and Performance

Many organisations today have adopted Cloud Native to enable the development and delivery of novel products and services. What was once a breakthrough competitive advantage adopted only by elite teams has become the norm for digital natives and enterprises alike.

The growing workloads on container platforms have created more complexity for the teams operating them. They face the challenge of a dual transformation: keep innovating to provide application developers with novel capabilities whilst keeping the platform at optimal performance. However, the Linux kernel, which underpins so much of what we build on, takes a long time to change.

The breakthrough solution lies in what Barbara Liskov called the “power of abstraction”. The brilliant Turing Award winner, whose career inspired so much modern thinking around distributed computing, highlighted the role of abstractions in “finding the right interface for a system as well as finding an effective design for a system implementation”.

Liskov has been proven right many times over. We are now at a juncture where new abstractions—and eBPF specifically—are driving the evolution of Cloud Native system design in powerful new ways. These new abstractions are unlocking the next wave of Cloud Native innovation and performance.

Cloud Native Challenges: Complexity and Scale

Whilst there are multiple approaches, from a software standpoint Cloud Native often embraces an application design where a single kernel becomes the common denominator managing many workloads and services. In this case, Cloud Native shifts the scale and scope from a few VMs to many containers, with higher per-node container density for efficient resource use and shorter container lifetimes. Containers also draw their addresses from dynamic IP pools, resulting in high IP churn.

The challenges don’t end there.

Once the clusters are stood up, there are 'Day 2' challenges such as observability, security, multi-cluster and multi-cloud operations, and compliance. Those challenges fall under the scrutiny of the CFO and the COO. You don't just move to a Cloud Native environment; you have to operate it so that it is seen not merely as a cost centre but as a technology function that provides competitive advantage and continuous improvement.

Once you have a Cloud Native environment set up, you will face integration requirements with external workloads (e.g., through more predictable IP addresses via service abstractions or egress gateways). You will also have to deal with the gradual migration toward IPv6-only clusters for better IPAM flexibility, and NAT46/64 for interaction with legacy workloads. Multiple clusters will need to be connected on- and off-prem in a scalable manner, with topology-aware routing, traffic encryption, and much more.

These problems are only going to grow larger as more and more diverse workloads are deployed to the platform. Gartner estimates that by 2025 over 95% of new digital workloads will be deployed on Cloud Native platforms, up from 30% in 2021.

Limitations of the Linux Kernel Building Blocks

The Linux kernel is the most common foundation for solving these challenges. But Cloud Native needs newer abstractions than are currently available in the Linux kernel, because many of its building blocks, like cgroups (CPU, memory handling), namespaces (net, mount, pid), SELinux, seccomp, netfilter, netlink, AppArmor, auditd, and perf, were designed more than 10 years ago.

These tools don't always work well together, and some are inflexible, allowing only global policies rather than per-container ones. They have no awareness of Pods or any higher-level service abstractions, and many rely on iptables for networking.

As a platform team, if you want to provide developer tools for a Cloud Native environment, you can be stuck in this box: the kernel's existing primitives can't express Cloud Native concepts efficiently. Platform teams can struggle to deliver the features and performance their applications need, making their work look like a cost centre rather than a value creator.

eBPF: Building Abstractions for the Cloud Native World

eBPF is a revolutionary technology that allows us to dynamically program the kernel in a safe, performant, and scalable way. It is used to safely and efficiently extend the Cloud Native capabilities of the kernel without requiring changes to kernel source code or loading kernel modules.

eBPF:

  • Hooks anywhere in the kernel to modify functionality and customise its behaviour without changing the kernel's source
  • Verifies programs before they run, ensuring safe execution and preventing kernel crashes or other instability
  • Achieves near-native execution speed with a JIT compiler
  • Allows OS capabilities to be added at runtime without workload disruption or node reboots
  • Shifts the context from user space in Kubernetes into the Linux kernel

These capabilities allow us to safely abstract the Linux kernel and make it ready for the Cloud Native world.
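To make this concrete, here is a minimal kernel-side sketch of what such a program looks like (assuming libbpf's `bpf/bpf_helpers.h` header is available; the map and function names are illustrative, not from any particular project). It attaches to a network device's XDP hook, counts packets into a map that user space can read, and passes every packet on unchanged:

```c
// Minimal kernel-side eBPF sketch (illustrative). Requires libbpf headers.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// A single-slot array map shared with user space.
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} pkt_count SEC(".maps");

SEC("xdp")
int count_packets(struct xdp_md *ctx)
{
    __u32 key = 0;
    __u64 *count = bpf_map_lookup_elem(&pkt_count, &key);

    // The verifier rejects the program unless the lookup result is
    // NULL-checked before being dereferenced -- this is the "safety belt".
    if (count)
        __sync_fetch_and_add(count, 1);

    return XDP_PASS; // hand the packet on to the normal network stack
}

char _license[] SEC("license") = "GPL";
```

A program like this is compiled with `clang -target bpf` and attached via a loader such as bpftool or libbpf; it runs inside the kernel rather than as a standalone process, and can be swapped at runtime without rebooting the node.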

eBPF Abstractions for the Cloud Native (R)evolution

Next, let's dive into some of the ways the eBPF abstraction is helping Cloud Native platforms evolve, increasing performance and unlocking innovation.

#1--eBPF Speeds Up Kernel Innovation
Adding a new feature or functionality to the Linux kernel is a long process. In the typical patch lifecycle, teams need to develop a patch, get it merged upstream, and then wait until major distributions ship a release containing it. Users typically stick to LTS kernels; Ubuntu, for example, is typically on a 2-year cadence. So innovation with the traditional model requires kernel modules or building your own kernels, leaving most of the community out. And there is minimal to no feedback loop from developers to users.

eBPF managed to break this long cycle by decoupling from kernel releases. For example, eBPF programs in Cilium can be upgraded on the fly while the kernel keeps running, and they work across a wide range of kernel releases. This allows teams to add new Cloud Native functionality, like agentless observability or scalable networking, years before it would otherwise be possible.

#2--eBPF Extends the Kernel With a Safety-Belt On
New features can increase functionality, but they also bring new risks. Developing and testing kernel code costs much more than eBPF code with the same functionality. The eBPF verifier ensures that the code won't crash the kernel. Portability for eBPF programs across kernel versions is achieved with CO-RE (Compile Once, Run Everywhere), kconfigs, and BPF type information. The eBPF flavour of the C language is also a safer choice for kernel programming. All of this makes it safer to add new functionality to the kernel than patching it directly or using a kernel module.

#3--eBPF Allows for Short Production Feedback Loops
Traditional feedback loops required patching the in-house kernel, gradually rolling the kernel out to the fleet to deploy the change, starting to experiment, collecting data, and feeding the results back into the development cycle. It was a long and fragile cycle in which nodes needed to restart and drain their traffic, making it impossible to move quickly, especially in dynamic Cloud Native environments. eBPF decouples from the kernel and allows atomic program updates on the fly, drastically shortening this feedback loop.

#4--eBPF Moves Data Processing Closer to the Source, Reducing Resource Consumption
Traditional virtualised networking functions, such as load balancers and firewalls, operate at the packet level. Every packet needs to be inspected, modified, or dropped, which is computationally expensive for the kernel. eBPF reframes the problem by moving processing as close to the event source as possible: to per-socket hooks, per-cgroup hooks, and XDP, for example. This allows the migration from dedicated boxes to generic worker nodes and results in significant resource cost savings.

#5--eBPF Enables Lower Traffic Latency
Using eBPF for forwarding, many unneeded parts of the networking stack can be bypassed, drastically improving networking performance. For example, with eBPF, Cilium was able to implement a bandwidth manager that reduced p99 latency by 4.2x. It also helped enable BIG TCP and a veth driver replacement that lets containers achieve host networking speeds.

#6--eBPF Delivers Efficient Data Processing
eBPF reduces the kernel's feature creep that slows down data processing by keeping the fast path to a minimum: custom Cloud Native use cases don't need to become part of the kernel, they simply become additional eBPF building blocks that can be leveraged across use cases. For example, by decoupling helpers and maps from entry points, Cilium is able to provide a faster and more customisable kube-proxy replacement in eBPF that continues to scale where iptables falls short.
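As a toy illustration of that scaling argument (a user-space C sketch under simplifying assumptions, not Cilium's actual code; all names are made up), compare an iptables-style linear walk over service rules with a keyed lookup table standing in for an eBPF hash map:

```c
// User-space sketch: why keyed map lookups scale where rule chains don't.
#include <assert.h>
#include <stddef.h>

#define N_SERVICES 1024

typedef struct { unsigned short port; int backend; } rule_t;

static rule_t rules[N_SERVICES];   // iptables-style ordered rule list
static int backend_map[65536];     // stand-in for an eBPF hash/array map
                                   // (0 is reserved to mean "no entry")

static void add_service(size_t i, unsigned short port, int backend) {
    rules[i] = (rule_t){ port, backend };
    backend_map[port] = backend;
}

// O(n): walk every rule until one matches, as an iptables chain does.
static int lookup_linear(unsigned short port) {
    for (size_t i = 0; i < N_SERVICES; i++)
        if (rules[i].port == port)
            return rules[i].backend;
    return -1;
}

// O(1): a single keyed lookup, as an eBPF map lookup does.
static int lookup_map(unsigned short port) {
    return backend_map[port] ? backend_map[port] : -1;
}
```

With a handful of services the difference is invisible; with tens of thousands of rules, every packet pays the linear cost in the iptables model, while the map lookup stays constant-time.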

#7--eBPF Facilitates Low-Overhead Deep Visibility Into the System
With the churn in Cloud Native workloads, it can be difficult to find and debug issues. eBPF collectors make it possible to build low-overhead, fleet-wide tracing and observability platforms. Instead of modifying application code or adding sidecars, eBPF allows zero-instrumentation observability. Production issues can be troubleshot on the fly, safely, via bpftrace, with significantly richer visibility, programmability, and ease of use than old-style perf.

#8--eBPF Creates Secure Identity Abstractions for Policy Enforcement
In Cloud Native environments, eBPF allows you to abstract away from high Pod IP churn towards longer-lasting identities. IPs are close to meaningless when everything is centred around Pod labels and Pod lifetimes are generally short for ephemeral workloads. By understanding the context of the process in the kernel, eBPF helps abstract from the IP to a more durable identity. With a secure identity abstraction for workloads, Cilium was able to build features like egress gateways for short-lived Pods and mTLS.
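The core idea can be sketched in a few lines of user-space C (illustrative only, not Cilium's implementation): derive a numeric identity from a workload's label set, so that every Pod with the same labels shares the same identity no matter how often its IP changes. This sketch assumes the caller passes labels in a canonical (e.g., sorted) order and uses a simple FNV-1a hash as the stand-in derivation:

```c
// User-space sketch: a label set maps to a stable numeric identity.
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// FNV-1a over the concatenated labels (';'-separated). Assumes the
// caller provides labels in a canonical order.
static uint32_t identity_for_labels(const char *labels[], size_t n) {
    uint32_t h = 2166136261u;          // FNV offset basis
    for (size_t i = 0; i < n; i++) {
        for (const char *p = labels[i]; *p; p++) {
            h ^= (uint8_t)*p;
            h *= 16777619u;            // FNV prime
        }
        h ^= (uint8_t)';';             // separator between labels
        h *= 16777619u;
    }
    return h;
}
```

Policy can then be enforced against that identity in the datapath, rather than against an IP address that may belong to a different Pod a minute later.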

eBPF For Performance and Innovation in a Cloud Native World

Cloud Native transformed how leading technology teams build and deliver applications, making platform engineering the new normal. Now, attention shifts to optimising performance, scalability, and security for production workloads while allowing for continuous innovation. Many of the Linux kernel building blocks supporting these workloads are more than a decade old. eBPF allows us to dynamically program the kernel to implement changes safely and effectively. It creates new kernel building blocks, unlocks innovation, and drastically improves platform performance. Together, these allow already successful Cloud Native platform teams to stay ahead of their competition.

Isovalent Frankfurt
