High-hazard activities rely on rules, procedures and standards to specify safe ways of operating systems. These procedures are written by system designers in collaboration with safety experts and attempt to anticipate anomalous situations. However, regulations and procedures for work in complex systems are always incomplete, and sharp-end staff must sometimes deviate from the task as planned. These deviations may be required by circumstances that were not anticipated by the procedure’s authors, requiring frontline workers to develop workarounds. Other deviations are due to workers developing shortcuts and local optimizations which reduce their workload or improve productivity [Dekker 2011].
Over time, this phenomenon leads to “the slow steady uncoupling of practice from written procedure” [Snook 2000]. Behaviour that is acquired in practice and is seen to “work” becomes “legitimized through unremarkable repetition” (“it didn’t lead to an accident, so it must be OK”).
Pioneering safety researcher Jens Rasmussen identified a similar phenomenon which he called “drift to danger” [Rasmussen 1997], the:
“systemic migration of organizational behavior toward accident under the influence of pressure toward cost-effectiveness in an aggressive, competing environment”
Rasmussen illustrated the competing priorities and constraints that affect sociotechnical systems, as shown below. Any large system is subjected to multiple pressures: operations must be profitable, they must be safe, and workers’ workload must be feasible. Actors experiment within the space of possibilities formed by these constraints:
- if the system reduces output too much, it will fail economically and be shut down
- if the system workload increases too far, the burden on workers and equipment will be too great
- if the system moves in the direction of increasing risk, accidents will occur
All organizations are affected by different pressures and adaptive processes, which compete for attention, and lead to migration, often towards situations with higher levels of risk.
The competitive environment encourages decision-makers to focus on short-term financial success and survival, rather than on long-term criteria (including safety).
Workers will seek to maximize the efficiency of their work, with a gradient in the direction of reduced workload (in particular if they are encouraged by a “cheaper, faster, better” organizational goal).
These pressures push work to migrate towards the limits of acceptable (safe) performance. Accidents occur when the system’s activity crosses that boundary into unsafe territory.
Note that the boundary of safe performance is rarely easy to define before the accident, and it can move over time due to changes in the organization.
This “drift into failure” tends to be a slow process, with multiple steps which occur over an extended period. Each step is usually small enough to go unnoticed, and no significant problems become apparent until it’s too late. The drift is not caused by people’s evil desires to generate accidents, nor by a lack of attention or knowledge; it is a natural phenomenon that affects all types of systems.
The acceptable (safe) performance boundary can move over time, in response to outside events. This is a feature of adaptive systems (note that pretty much all large systems are adaptive, because with an evolving environment, any system which does not adapt will eventually fail and die). For instance, major accidents tend to lead to a push that (temporarily) increases safety margins, and the introduction of new technologies can either increase or decrease the safety margin.
The drift into failure model highlights the importance of a system’s history in explaining why people work as they do, what they believe is important for safety, and which pressures can progressively erode safety. The model helps us to see safety as a control problem, where the underlying dynamics are very slow but also very powerful, and difficult to manage.
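The migration dynamics described above can be sketched as a biased random walk. This is only a toy illustration, not Rasmussen's own formalization: the function name `steps_until_boundary`, the notion of a single numeric "safety margin", and all parameter values are assumptions made for the example. Each step is a small local adaptation with random variation, plus a slight constant push towards lower cost and workload.

```python
import random


def steps_until_boundary(start_margin=10.0, pressure=0.01, noise=0.05,
                         max_steps=10_000, seed=0):
    """Biased random walk of a hypothetical 'safety margin'.

    Returns the step at which the margin first reaches zero (the boundary of
    safe performance), or None if it is not reached within max_steps.
    All names and parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    margin = start_margin
    for step in range(1, max_steps + 1):
        # One small local adaptation: random variation in either direction,
        # plus a slight constant push towards lower cost and workload.
        margin += rng.gauss(0.0, noise) - pressure
        if margin <= 0.0:
            return step
    return None


if __name__ == "__main__":
    # Compare how quickly the boundary is reached under different pressures.
    for pressure in (0.0, 0.005, 0.01, 0.02):
        print(pressure, steps_until_boundary(pressure=pressure))
```

In this sketch every individual step is tiny compared with the starting margin, so no single step looks alarming, yet a persistent pressure eventually carries the walk to the boundary. Setting `pressure=0.0` leaves only the random variation, which makes the role of the gradient easier to see.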
A number of significant accidents illustrate the concepts of drift into danger and normalization (or institutionalization) of deviance:
- The Challenger space shuttle accident in 1986 was caused, at a technical level, by the failure of O-rings in the solid fuel booster rockets. The investigation into the accident showed that these O-rings regularly sustained damage (erosion of the joint material) which exceeded the level that was planned for during rocket design, so engineers tracked the O-ring damage and the occasional failures (damage occurred in fourteen of twenty-four prior flights). Since the failures did not escalate to produce an accident, a feeling grew that they were not dangerous, and managers approved “criticality 1 waivers” despite the design goal of zero joint failures. The launch day was unusually cold, leading to worse than usual performance for the O-rings, and eventually to their complete failure. A detailed analysis of the organizational culture at NASA, undertaken by sociologist Diane Vaughan after the accident, showed that people within NASA became so accustomed to an unplanned behaviour that they no longer considered it deviant, even though it far exceeded their own rules for elementary safety. This is the primary case study for Vaughan’s development of the concept of normalization of deviance [Vaughan 1996].
- The Columbia space shuttle accident in 2003 was caused by foam breaking off the external fuel tank and hitting the shuttle during launch, damaging its thermal protection system. The shuttle burned up during re-entry into the earth’s atmosphere. Analysis of the accident by the Columbia Accident Investigation Board [CAIB 2003] showed that previous flights had also been affected by foam loss, without leading to catastrophic consequences. The investigation suggested that NASA had suffered from the same type of drift towards danger as prior to the Challenger accident 17 years earlier. Foam loss incidents were viewed as a maintenance issue, and not as a flight safety issue, despite the fact that foam loss was not an acceptable scenario according to shuttle design, and despite regular damage.
- A 1994 friendly fire accident, documented by Scott Snook in his book Friendly Fire [Snook 2000], in which two U.S. Air Force F-15 fighter jets patrolling the no-fly zone over northern Iraq in the aftermath of the Gulf War shot down two U.S. Army UH-60 Black Hawk helicopters. Following an operational error made by the helicopter pilots, a partial failure of IFF (identification friend or foe) equipment, and poor communication with air traffic controllers on a military AWACS aircraft, the F-15 pilots misidentified the two helicopters as Iraqi Hinds and fired missiles that killed all 26 peacekeepers aboard. The error was made more likely by poor communication between the fighter pilots, poor coordination between the different U.S. forces present in the zone, and the dilution of responsibility across the different actors in the system.
- The crash of cruise ship Costa Concordia, a 4800 person capacity modern passenger liner, on an island close to the coast of Italy in 2012 (32 killed) after the ship captain intentionally deviated from the standard course, taking the ship very close to shore in a manœuvre known as a “ship’s salute”. The directors of the cruise company “not only tolerated, but promoted and publicised the risky ship salutes off the island of Giglio and other tourist sites as a convenient, effective marketing tool”, according to a criminal suit filed in the case. These salutes, greeting inhabitants with the ship’s foghorn, were commonplace, and the mayor of Giglio wrote to a captain of a Costa vessel to thank him for the “unequalled spectacle”, which had become an “indispensable tradition”.
Factors that contribute to practical drift and the normalization of deviance:
- Production pressure or cost reductions overriding safety concerns, with an increasing tolerance for
  - shortcuts or “optimizations” that allow increased performance
  - the “temporary” violation of safety rules during periods of high workload
  - the circumvention of safety barriers
- The absence of periodic reassessments of operational procedures to align them with system evolutions and the usual practices of sharp-end workers (involving a risk assessment when changes are made)
- Excessively long and complex operational procedures. This is often caused by gradual accretion of extra checks and safeguards each time an incident analysis has identified a possible source of failure (“oh, we’ll ask the operators to check that the pressure isn’t above usual at this stage”), in particular when the underlying reason for the extra check is not explained to frontline staff.
- Organizational barriers which prevent effective communication of critical safety information and stifle professional differences of opinion
- The appearance of informal chains of command and decision-making processes that operate outside of the organization’s rules
- Confusion between reliability and safety, including reliance on past success as a substitute for sound engineering practices (“it worked last time, so even if it’s not quite compatible with our standards, it’s good enough”)
- A “limit ourselves to compliance” mentality, in which only those safety innovations that are mandated by the regulator are implemented
- Insufficient oversight by the regulator, or regulators with insufficient authority to enforce change in certain areas
- A tendency to give more weight to operational ease, comfort and performance than to the restrictions which are often required for safe operation
Criticism of the notion
[Dekker 2004, 133] states:
“Maintaining safety outcomes may be preceded by as many procedural deviations as accidents are.”
According to this view, deviations from procedure (in particular when the procedures are poorly designed) may be necessary to cope with unusual conditions and specific characteristics of the working environment; they do not necessarily indicate that safety margins have been eroded. Dekker emphasizes the importance of hindsight bias: when you undertake an investigation to find the factors that contributed to an accident, it’s easy to unearth deviations from procedure and decide that they are causally related to the accident, even though the same deviations may have been commonplace in day-to-day work without leading to bad outcomes.
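The hindsight-bias argument can be made concrete with a small base-rate calculation. The sketch below is an illustration only: the function `deviation_rates` is a hypothetical helper, and the counts passed to it are invented for the example rather than taken from any real system. It compares how often a deviation precedes an accident with how often one precedes an uneventful outcome.

```python
def deviation_rates(ok_with_dev, ok_without_dev, acc_with_dev, acc_without_dev):
    """Compare how often deviations precede accidents and normal outcomes.

    The four arguments are counts of operations, split by outcome (ok or
    accident) and by whether a procedural deviation occurred.
    """
    return {
        # What an accident investigation sees when looking backwards.
        "p_dev_given_accident": acc_with_dev / (acc_with_dev + acc_without_dev),
        # What is rarely investigated: deviations before uneventful outcomes.
        "p_dev_given_ok": ok_with_dev / (ok_with_dev + ok_without_dev),
        # How predictive a deviation actually is of an accident.
        "p_accident_given_dev": acc_with_dev / (acc_with_dev + ok_with_dev),
    }


# Invented counts, for illustration only (not data from any real system).
rates = deviation_rates(ok_with_dev=7000, ok_without_dev=3000,
                        acc_with_dev=7, acc_without_dev=3)

if __name__ == "__main__":
    for name, value in rates.items():
        print(f"{name}: {value:.4f}")
```

With these assumed counts, deviations precede 70% of accidents but also 70% of uneventful operations, and only about one deviation in a thousand is followed by an accident. An investigator who looks only at the accidents would see deviations in most of them without being able to tell that they are equally common elsewhere.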
Note however that procedural violations are only one component of drift to danger, which also refers to changes in people’s perceptions of risk, their priorities, decision-making, and interactions with other people and other organizations.
Similarly, [Hollnagel 2009] suggests that:
“Performance variability may introduce a drift in the situation, but it is normally a drift to success, a gradual learning by people and social structures of how to handle the uncertainty, rather than a drift to failure.”
Interest over time
The figure below shows the frequency of the phrases “practical drift”, “drift into failure” and “normalization of deviance” in printed documents over the last few decades.
Rasmussen’s migration model illustrates that small optimizations and adaptations can accumulate over time, taking the system far from its initial design parameters. Unless this “practical drift” is counterbalanced by operations staff who are alert to the possibility and the dangers of the normalization of deviance, by the safety function, or by an effective regulator, systems are likely to drift towards catastrophe.
CAIB. 2003. “Report of the Columbia accident investigation board.” NASA. https://www.nasa.gov/columbia/caib/html/start.html.
Hollnagel, Erik. 2009. “The four cornerstones of resilience engineering.” In Resilience Engineering Perspectives, Volume 2: Preparation and Restoration, edited by Christopher P. Nemeth, Erik Hollnagel, and Sidney Dekker, 117–34. Ashgate.