What is SRE?
Site Reliability Engineering (SRE) is a discipline that focuses on redundancy, resiliency, and performance of software engineering, application infrastructure and operations with a tip of the hat to Toyota's six sigma practices. The main goals are to create scalable and highly reliable software systems. We start to do this by measuring and monitoring performance in 9's (the number of 9's as a measure of percentage. Think of AWS object storage in S3 boasting a lofty 99.999999999% object durability and 99.99% availability. That means one in 100 billion objects stored on S3 will have a loss of data issue and object storage will be accessible for all but 53 min a year.
Understanding the Basics
The core idea of SRE is to minimize the impact of outages on users. SREs use creative problem solving coupled with analytics to attack problems in software engineering, system administration and operations in a way that failures happen less frequently but also are less noticeable. This includes automating routine tasks, creating systems to scale automatically, redundancy for failovers, and managing system performance.
A great leader here is Netflix. You may not realize but they pioneered many of the concepts of SRE. They created an entire suite of tools, called Simian Army, set to disrupt and cause outages across their development environments. They knew, with the scale of their operation, failures would happen so instead of trying to prevent outages, they tried to control the chaos. Chaos Monkey, the first of these tools, was designed to randomly terminating instances, so connections to the instance would fail. Engineers at Netflix had to cope and program through these challenges handling all sorts of failure scenarios. Ultimately, the winner here was the end user. While watching your show, you would have no idea there may have been multiple failures on the back end.
Additionally, SREs focus on measuring and optimizing system reliability. They establish Service Level Objectives (SLOs being the target goal) and Service Level Indicators (SLIs being the current measure) to quantify reliability in meaningful ways and make data-driven decisions. Reliability in SRE is not about ensuring that systems never fail, but about ensuring that they fail in a way that has minimal impact on users.
Beginnings of Site Reliability Engineering
SRE originated at Google in the early 2000s. Ben Treynor, who is credited with inventing the term, was tasked with making Google's already large-scale sites more reliable, efficient, and scalable. He approached the problem by applying a software engineering mindset to operational challenges. This led to the creation of a new role that combined the skills of both a software engineer and a systems administrator, laying the foundation for what would become known as Site Reliability Engineering.
Since its inception at Google, SRE has evolved and spread throughout the tech industry. It has become a key component in managing large-scale, cloud-based systems. The principles of SRE have been adopted and adapted by various organizations, each adding their own practices and tools to suit their specific needs.
As the field has grown, so has the understanding of what SRE entails. It has expanded beyond operations into areas such as building more reliable systems, improving incident management, and creating blameless postmortem cultures. This evolution reflects the broader shift in the tech industry towards DevOps practices, where development and operations teams collaborate more closely.
SRE is a vital and evolving discipline that addresses the complex challenges of running large-scale, reliable systems. Its roots in Google's innovative approach to system reliability have spread across the tech ecosystem to most major tech operations like AWS, Microsoft and Netflix, leading to new ways of thinking about infrastructure, operations, and software development. As technology continues to advance, SRE will likely continue to evolve, adapting to new challenges and technologies.