As technologists we have a narrow view of The System we're operating to serve our customers. We think about and discuss in great detail all the technical components: network cables, routers, wireless access points, racks, blades, chassis, power supplies, hypervisors, virtual machines, containers, orchestration, load balancers, all of the above in cloud providers, domain names, TLS certificates, etc. Most of us think of ourselves as outside of the system, but we and our teams and groups and divisions are, in fact, the most essential source of resilience in The System. This explains why observability has become so vital: improvements to observability enable us humans to better apply our resilience.
The system: a tangled graph of software and hardware components below The Line of Representation, including all the tools used to deploy, monitor, and secure those components; above that line are Engineers, Managers, Salespeople, Accountants, Executives and their mental models of what exists below the line.
The thing we've built is already beyond comprehension. No individual knows the whole thing. In fact, no team knows the whole thing. And none of us can actually put our hands on the system. Network and hardware teams can get pretty close, but even they cannot touch the VMs and containers running within. There are real physical states of electrical circuits down at the bottom. This isn't imaginary. But our only access to all those components is through the line of representation: all the tools and levers we buy or build for ourselves.
During an incident, some element in the line of representation draws our attention to something going wrong below the line. Some operators will see one thing going wrong. Others may see several things going wrong. Still others may be completely unaware that anything is out of the ordinary.
The essential thing to understand: the more we automate, the harder it is to operate. The more we add to the system (features, components, staff), the more opportunities there are for collisions and surprise.
Teams are part of the system.
Teams adapt when the components fall down.
Resilience happens because the humans learn.
We learn and improve our skills in operating the system. We introduce changes to the components and procedures and communication channels and training programs.
The things below the line would grind to a halt, likely within a week or two, if all the humans above the line shut down their laptops and phones.