Fail-Safe Operations Failing to Be Safe

It’s comforting to think that we can just throw resources at a problem to eliminate it. But, alas, that doesn’t seem to be the case. There are some important reasons why fail-safe organizations, well, fail to be safer...


The more you hang around large institutional investors – especially those with big in-house operations – the more you realize how important operational risk management has become for them. The new hires in these places will spend their first few days filling out compliance information and listening to speeches on ‘segregation of duties’, ‘accountability’, ‘checks and balances’, ‘fail-safe functions’ and ‘redundant operations’. Why? Two big reasons:

1) Legitimacy: Many of these funds exist because politicians see them as “more professional” than traditional government employees in the area of finance. A big mistake by one of these funds could delegitimize it in the eyes of the politicians and result in its dismantling.

2) Scale: The bigger you are, the larger a “small” screw-up becomes. A 1% loss due to an internal, operational mistake doesn’t look all that bad if you’re managing my money (blip). But it sure would look bad if you were managing Norway’s sovereign wealth (hundreds of billions of dollars).

These two factors combined are at the heart of why Norway’s NBIM – and other funds around the world – focuses so much attention on operational risk management. And I single out NBIM here because it has been quite transparent about its work to minimize the number of “unwanted events” taking place within the organization. Indeed, NBIM’s annual report discusses the ‘fail-safe functions’ it has in place and all the redundant organizational components the fund is building to reduce these errors in the future.

But all this focus on “redundancy” comes with a big health warning: it doesn’t always work, and, in fact, it can make your organization less reliable! To understand some of the problems with building ‘fail-safe’ institutional investors, I thought I’d direct you to a (really fun) paper by Scott D. Sagan entitled “The Problem of Redundancy Problem: Why More Nuclear Security Forces May Produce Less Nuclear Security”. Here’s a blurb:

“The use of redundancy in its many forms is a common strategy used to make more reliable systems out of inherently imperfect parts. Redundancy theory in engineering demonstrates how even unreliable components, if independent and connected in a parallel manner, can lead to rapid increases in overall system reliability. A large number of social scientists and security analysts have, therefore, called for the widespread use of redundancy as one of the necessary requirements of ‘high reliability organizations’.”
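To make that engineering claim concrete, here’s a minimal sketch (my own illustration, not from Sagan’s paper) of how parallel redundancy drives up system reliability when the components really are independent:

```python
# Textbook parallel-redundancy arithmetic: if n independent components each
# work with probability r, the system works as long as at least one of them
# does, so system reliability is 1 - (1 - r)**n, which climbs quickly with n.

def parallel_reliability(r: float, n: int) -> float:
    """Probability that at least one of n independent components works."""
    return 1 - (1 - r) ** n

for n in range(1, 5):
    print(f"{n} component(s) at 90% reliability -> system {parallel_reliability(0.9, n):.4%}")
# 1 -> 90.0000%, 2 -> 99.0000%, 3 -> 99.9000%, 4 -> 99.9900%
```

Note that the whole argument hinges on the word “independent” – which is exactly where Sagan says it breaks down.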

That sounds good, right? Hold on, there’s more...


“The article uncovers the dark side of redundancy by focusing on how efforts to improve nuclear security can inadvertently backfire, increasing the risks they are designed to reduce...”

It’s comforting to think that we can just throw resources at the problem to eliminate it. But, alas, that doesn’t seem to be the case, as Sagan illustrates through some lovely examples and cases. In general, there are three reasons fail-safe organizations, well, fail to be safer:

1) Complexity: By adding to the complexity of an organization – through additional fail-safe systems and protocols – the designers create what theorists call “hidden failure modes” (and what Donald Rumsfeld called “unknown unknowns”). Sagan offers two examples of this type of problem:

- Adding two more engines to an aircraft in order to minimize the risk of all engines failing, without realizing that mitigating this risk has created a new, far greater risk: that an exploding engine (of which there are now twice as many) downs the plane (see the sketch after this list).
- Adding more guards to protect a person, while increasing the likelihood that one of those guards is an infiltrator.

2) Social Shirking: Most ‘fail-safe’ concepts are taken directly from engineering to resolve mechanical problems; they are often implemented on airplanes or in nuclear facilities. But a mechanical concept will always struggle to translate into social contexts with an array of agency issues, as people – unlike parts – are aware of each other and alter their behavior accordingly. Research has shown that when a single individual sees a crime, they will call the police 75% of the time. However, every additional witness at a crime scene decreases the likelihood that any individual reports the crime by 15%.

3) Risk Taking: Research also shows that as our surroundings become “safer”, we become much more prone to risk taking. Think of it as a type of moral hazard. It’s for this reason that the introduction of helmets in skiing (apparently) hasn’t had a big impact on ski-related injuries – people just go steeper, bigger, higher, faster...and get hurt. In other words, as you get comfortable in your safety, you take on more risk and push your organization to the edge (even if it is “safer”).
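Here is a rough back-of-the-envelope sketch of the complexity point. The numbers are purely illustrative assumptions of mine, not Sagan’s: adding engines shrinks the chance that all of them quit, but it multiplies the chance that at least one fails catastrophically and downs the plane on its own.

```python
# Illustrative (assumed) per-flight probabilities, not real aviation data:
# p_fail    – an engine simply stops working
# p_explode – an engine fails catastrophically, downing the plane regardless
#             of how many other engines are still running

p_fail = 1e-3
p_explode = 1e-5

for n in (2, 4):
    all_engines_fail = p_fail ** n                  # shrinks as engines are added
    any_engine_explodes = 1 - (1 - p_explode) ** n  # grows as engines are added
    print(f"{n} engines: P(all fail) = {all_engines_fail:.2e}, "
          f"P(catastrophic explosion) = {any_engine_explodes:.2e}")
# 2 engines: P(all fail) ≈ 1e-06, P(explosion) ≈ 2e-05
# 4 engines: P(all fail) ≈ 1e-12, P(explosion) ≈ 4e-05
```

The redundant design wins decisively on the risk it was built to address, while quietly losing on the hidden failure mode it created – which is precisely Sagan’s worry about piling fail-safe functions onto an organization.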

What to do about it? I think we need to be careful about how far we take redundancy in the design of institutional investors. It’s not clear to me that the benefits outweigh the costs. (And I’d be grateful if you have any data that either confirms or refutes the concepts above.) And the costs are quite high – not only does redundancy fail to dramatically improve the organization’s safety, it also adds a level of bureaucracy and rigidity that will inevitably damage returns over the long term.
