SRE keeps digging to prevent problems | Google Cloud Blog

Steve McGhee

Solutions Architect, Google Cloud

March 13, 2020


At Google, our teams follow site reliability engineering (SRE) practices to help keep systems healthy and users productive. There is a phrase we often use on our SRE teams: “At Google scale, million-to-one chances happen all the time.” This illustrates the massive complexity of the system that powers Google Search, Gmail, Ads, Cloud, Android, Maps, and many more. That type of scale creates complex, emergent modes of failure that aren’t seen elsewhere. Thus, SREs within Google have become adept at developing systems to track failures deep into the many layers of our infrastructure. Not every failure can be automatically detected, so investigative tools, techniques, and most importantly, attitude are essential. Rare, unexpected chains of events happen often. Some have visible impact, but most don’t.

At Google scale, million-to-one chances happen all the time.


This was illustrated in a recent incident that Google users would likely not have noticed. We consider these types of failures “within error budget” events. They are expected, accepted, and engineered into the design criteria of our systems. However, they still get tracked down to make sure they aren’t forgotten and accumulated into technical debt—we use them to prevent this class of failures across a range of systems, not just the one that had the problem. This incident serves as a good example of tracking down a problem once initial symptoms were mitigated, finding underlying causes and preventing it from happening again—without users noticing. This level of rigor and responsibility is what underlies the SRE approach to running systems in production. 
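An error budget is the amount of unreliability a service-level objective (SLO) allows; as long as failures stay within it, they are tracked and learned from rather than treated as emergencies. As a rough illustration of the arithmetic (the numbers below are made up, not figures from this incident), here is a short Python sketch:

# Illustrative error-budget arithmetic; the numbers are hypothetical, not from this incident.
slo = 0.999                    # target: 99.9% of requests succeed over the window
total_requests = 50_000_000    # requests served during the SLO window
failed_requests = 12_000       # requests that returned errors

error_budget = (1 - slo) * total_requests    # failures the SLO tolerates: 50,000
budget_consumed = failed_requests / error_budget

print(f"error budget: {error_budget:,.0f} failed requests")
print(f"budget consumed: {budget_consumed:.0%}")    # 24% -- still a "within error budget" event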

Digging deep for a problem’s roots

In this event, an SRE on the traffic and load balancing team was alerted that some GFEs (Google front ends) in Google’s edge network, which statelessly cache frequently accessed content, were producing an abnormally high number of errors. The on-call SRE was paged. They immediately removed (“drained”) the machines from serving, thus eliminating the errors that might result in a degraded state for customers. This ability to rapidly mitigate an incident in this way is a core competency within Google SRE. Because we have confidence in our capacity models, we know that we have redundant resources to allow for this mitigation at any time.
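Google’s internal drain tooling isn’t public, but the idea is simple: stop sending new traffic to a backend while leaving the machine running so it can still be inspected. The minimal, self-contained Python sketch below illustrates that concept; the ServingPool class and hostnames are invented for illustration and are not a real Google or GCP API:

# Minimal sketch of the "drain" idea: take backends out of rotation without
# shutting them down, so they remain available for debugging.
# Illustrative only; this is not Google's internal tooling.
from dataclasses import dataclass, field

@dataclass
class ServingPool:
    backends: set[str]
    drained: set[str] = field(default_factory=set)

    def drain(self, backend: str) -> None:
        """Stop sending new traffic to a backend, but keep the machine up."""
        self.drained.add(backend)

    def healthy_backends(self) -> set[str]:
        return self.backends - self.drained

pool = ServingPool(backends={"gfe-rack7-01", "gfe-rack7-02", "gfe-rack9-01"})
pool.drain("gfe-rack7-01")    # misbehaving machines stop receiving requests
pool.drain("gfe-rack7-02")
print(pool.healthy_backends())    # {'gfe-rack9-01'} keeps serving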

At this point, our SRE had mitigated the issue with the drain, but they weren’t done yet. Based on similar past incidents, they knew this type of error is often caused by a transient network problem. After finding evidence of packet loss isolated to a single rack of machines, our SRE got in touch with the edge networking team, which identified correlated BGP flapping on the router in the affected rack. However, the nature of the flaps pointed to a problem with the machines rather than the router, narrowing the investigation to a particular machine or set of machines.
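As a rough sketch of how that kind of correlation can be surfaced, the snippet below counts BGP session flaps per peer from syslog-style router messages. The log format and addresses are invented for illustration; real router logs vary by vendor:

# Hedged sketch: count BGP session flaps per peer from syslog-style lines.
# The message format below is invented; real formats differ by router vendor.
import re
from collections import Counter

FLAP_RE = re.compile(r"BGP peer (?P<peer>\S+) changed state .* to (Idle|Active)")

def count_flaps(log_lines):
    flaps = Counter()
    for line in log_lines:
        m = FLAP_RE.search(line)
        if m:
            flaps[m.group("peer")] += 1
    return flaps

sample = [
    "router7: BGP peer 10.0.7.11 changed state from Established to Idle",
    "router7: BGP peer 10.0.7.11 changed state from Idle to Established",
    "router7: BGP peer 10.0.7.11 changed state from Established to Active",
]
print(count_flaps(sample))    # Counter({'10.0.7.11': 2}) -- repeated flaps on one peer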

Further investigation uncovered kernel messages in the GFE machines’ base system log. These errors indicated CPU throttling:

MMM DD HH:mm:ss xxxxxxx kernel: [3220998.149713] CPU16: Package temperature above threshold, cpu clock throttled (total events = 1596886)
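On a typical Linux host, messages like this appear in the kernel log. A small sketch along these lines (assuming a conventional /var/log/kern.log path, which varies by distribution) can tally throttling events per CPU package:

# Sketch: tally "temperature above threshold" throttling messages per CPU.
# Assumes a conventional /var/log/kern.log path; the path and exact message
# text vary by distribution and kernel version.
import re
from collections import Counter

THROTTLE_RE = re.compile(r"(CPU\d+): Package temperature above threshold")

def throttle_events(path="/var/log/kern.log"):
    counts = Counter()
    with open(path, errors="replace") as f:
        for line in f:
            m = THROTTLE_RE.search(line)
            if m:
                counts[m.group(1)] += 1
    return counts

if __name__ == "__main__":
    for cpu, n in throttle_events().most_common():
        print(f"{cpu}: {n} throttling events")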

The process on the machine responsible for BGP announcements showed higher-than-usual CPU usage, which perfectly correlated with both the onset of the errors and the CPU throttling. This confirmed the theory that the throttling was significant enough to be impactful and measurable by Google’s monitoring system.
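Confirming that kind of relationship often comes down to comparing time series from the monitoring system. The sketch below computes a simple Pearson correlation between two aligned series; the sample values are made up for illustration, not data from this incident:

# Sketch: correlate two monitoring time series, e.g. the BGP daemon's CPU usage
# and the GFE error rate, sampled at the same timestamps. Values are invented.
import statistics

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

bgp_cpu_pct = [12, 13, 14, 55, 61, 63, 58, 15]          # per-minute samples
error_rate  = [0.1, 0.1, 0.2, 4.8, 5.3, 5.0, 4.6, 0.2]  # per-minute error %

print(f"correlation: {pearson(bgp_cpu_pct, error_rate):.2f}")   # close to 1.0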

The wheels (casters) supporting the rack had been crushed under the weight of the fully loaded rack. The rack had then physically tilted forward, disrupting the flow of liquid coolant and causing some CPUs to heat up to the point of being throttled.


This level of rigorous analysis and persistence is a great example of incident response built on deep and broad monitoring, and of the culture of responsibility that keeps Google running 24×7.

Google Cloud customers often ask how SRE can work in a hybrid, on-prem, or multi-cloud environment. SRE practices apply across teams within an organization and across multiple environments, and they help teams work together during incidents like this one, from traffic management to data center hardware operations.

Find out more about the SRE approach to running systems and how your team can adopt SRE best practices.
