
The Resilience Dilemma

Posted: 15th October 2024
By: Levi Gundert

Editor's note: The following blog post originally appeared on Levi Gundert's Substack page.


“Rizz” was Oxford’s 2023 word of the year. For the uninitiated, rizz roughly translates to charisma (if you have a young teenager at home, ask them for a definition of “rizzler” and enjoy the confused expression resulting from your breach of their world). However, in business, we should co-opt “riz” as shorthand for resilience, particularly for those in the security industry. Riz represents the ability to withstand attacks, minimize damage, and quickly recover, ensuring business continuity. In security, riz requires consuming intelligence, inspecting existing controls, and addressing identified gaps (where appropriate).

Complexity and resilience have a nuanced relationship — resilience often increases with complexity up to a point, but too much complexity erodes riz, making systems fragile and harder to defend. While simple solutions (following the KISS principle) can lead to resilience, maximum resilience in security often lies somewhere in the middle of the complexity spectrum.


This chart and subsequent charts were quickly produced by hex.tech using Python code generated from ChatGPT.
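The relationship described above, where resilience rises with complexity up to a point and then falls, can be sketched as a toy inverted-U curve. This is only an illustration and not the code behind the original chart: the 0-to-1 scales, the quadratic shape, and the names `resilience` and `peak` are all assumptions made for the sketch.

```python
def resilience(complexity: float, peak: float = 0.5) -> float:
    """Toy inverted-U: score is 1.0 at `peak` and falls off on both sides.

    `complexity` and `peak` are on an assumed 0..1 scale. The quadratic
    shape is an assumption for illustration, not a measured relationship.
    """
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must be between 0 and 1")
    if not 0.0 < peak < 1.0:
        raise ValueError("peak must be strictly between 0 and 1")
    # Normalize by the larger distance from the peak to either edge,
    # so the farthest edge of the spectrum scores 0.
    widest = max(peak, 1.0 - peak)
    return 1.0 - ((complexity - peak) / widest) ** 2


# Compare a very simple, a mid-complexity, and a very complex setup.
for c in (0.1, 0.5, 0.9):
    print(f"complexity={c:.1f} -> resilience={resilience(c):.2f}")
```

Moving `peak` left or right is one way to express a belief that a given environment tolerates less or more complexity before riz starts to erode.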

Real-World Examples: Complexity and Resilience in Action

Two semi-recent events, a global IT outage (low complexity) and an attempted assassination (high complexity), offer an opportunity to dissect how complexity and resilience influence risk management and how we should assess solutions that maximize resilience. The challenge is measuring riz and giving security professionals confidence that they have comprehensively reviewed risk scenarios. The Cone of Plausibility and PESTLE are intelligence frameworks that provide scaffolding for scenario exploration and help assess how complexity impacts resilience in different contexts.

[1] Global IT Outages: Low-Complexity Scenarios

When a faulty EDR update triggered significant Windows malfunctions, the pain was immediately felt, and the social media armchair pundits were quick to throw stones. However, these errors can happen anywhere; that’s the frailty of human brains mixed with complex technology systems.

Instead of analyzing “fault,” smart organizations looked for new learnings. Of course, resilience against future scenarios was at the top of the conversation list. It’s natural to consider expanding beyond a single point of failure (or vendor) for any function, security or otherwise, but this is where scenarios and associated costs are helpful.

Recorded Future’s CIO, Josh Jones, estimates that running multiple EDR products for a company of 1,000 employees likely creates somewhere in the ballpark of 15% more overhead (software maintenance, updates, knowledge bases, etc.). For a large enterprise with 100K endpoints, 30% additional resources are likely required to maintain multiple EDR products across different segments of the endpoint population. Then, there are decisions about which business units (or which percentage of business units) should run a different EDR client.
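To make those percentages concrete, here is a minimal arithmetic sketch. Only the 15% and 30% figures come from the estimates above. The baseline costs are hypothetical placeholders, and applying the percentage to a single-vendor annual EDR operating cost is an assumption, since the estimate does not say which baseline it is measured against.

```python
# Overhead estimates quoted in the text.
OVERHEAD_1K_EMPLOYEES = 0.15    # company of roughly 1,000 employees
OVERHEAD_100K_ENDPOINTS = 0.30  # enterprise with roughly 100K endpoints


def extra_cost(baseline_cost: float, overhead: float) -> float:
    """Return the added annual cost of running multiple EDR products.

    `baseline_cost` is the assumed annual cost of operating one EDR product.
    """
    if baseline_cost < 0 or overhead < 0:
        raise ValueError("baseline_cost and overhead must be non-negative")
    return baseline_cost * overhead


# Hypothetical baselines, chosen only to make the arithmetic visible.
scenarios = {
    "1,000 employees": (200_000.0, OVERHEAD_1K_EMPLOYEES),
    "100K endpoints": (10_000_000.0, OVERHEAD_100K_ENDPOINTS),
}

for name, (baseline, overhead) in scenarios.items():
    added = extra_cost(baseline, overhead)
    print(f"{name}: baseline {baseline:,.0f}, added {added:,.0f}, "
          f"total {baseline + added:,.0f}")
```

Putting a number on the added cost, even a rough one, gives the complexity side of the trade-off something to be weighed against the resilience gained.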

The chart below approximates where a single EDR vendor scores in simplicity. The same concept could apply to any segment of a computing stack. The theory is that additional diversity in vendor relationships could partially mitigate a similar future operating system outage. Naturally, additional resilience may also result in greater complexity.

In security, organizations can measure resilience by recognizing solution complexity as an essential cost before investing resources.


A visual approximation of an operational technology scenario with low complexity and resilience.

[2] U.S. Secret Service (USSS) Physical Protection: High-Complexity Scenarios

When the news of the failed assassination attempt on former President Trump (and the death of an innocent bystander) started lighting up people’s phones, friends and family were quick to text me, asking how such a scenario was possible.

Drawing from my experience as a former U.S. Secret Service (USSS) Agent, I can provide some small insights into the complexity of protective operations. The USSS is a dual-mission agency, responsible for both protection and investigations. This dual mandate and limited resources create a complex operational environment. A comparison of USSS funding and FBI funding highlights a significant resource gap (for context, USSS special agents total ~4k compared to FBI special agents, who number ~14k).

Limited resources become detrimental given the demands of criminal investigations (e.g., fraud and cybercrime), added to continuous protection mandates (POTUS, VPOTUS, former POTUSes, visiting foreign dignitaries, national events, etc.), which require significant capital and human assets. The USSS is successful in its mission because it excels at planning and logistics. The last resort, “bodyguarding,” generally only occurs due to suboptimal riz. Unlike downtime in digital systems, which we generally accept will happen, the risk tolerance for an adverse event in a protective mission is roughly zero.

This is where former POTUS Trump’s Pennsylvania rally becomes relevant. To address agent staffing demands, the USSS regularly partners with local, state, and federal law enforcement to protect outer perimeters during events. In addition to boosting numbers, these partnerships bring additional expertise, such as in-depth knowledge of the local environment, high-risk tactical units (SWAT), canine handlers, EOD experts, etc., that benefit the protective mission. However, coordinating technology (e.g., communications) is one area where complexity becomes challenging.

A command post often has to relay communications between agencies that may use different equipment, encryption protocols, etc. This dynamic creates delays and, at the risk of speculation, may have contributed to the hesitation of USSS sniper teams attempting to assess the nature of a suspect on a roof where other agencies were actively working.

While USSS advance agents (for both trips and individual sites) are incredibly detail-oriented, even the pre-event briefings with combined agencies can create confusion, as there are different standard operating procedures (e.g., rules of engagement, arresting authority, etc.). The physical security environment becomes inherently more complex as additional agencies and departments are added to the mix. Too much complexity reduces riz.


A visual approximation of a physical security scenario with high complexity and lower resilience.

Using the Tools: Cone of Plausibility and PESTLE


Mental scaffolding helps create confidence in evaluating potential outcomes—routine events and black swans—to ultimately assess trade-offs in complexity and resilience. In security, resilience cannot be measured solely by internal metrics. As complexity grows, so too does the need for structured thinking. This is where the Cone of Plausibility and PESTLE excel as frameworks, helping professionals make informed decisions that ultimately safeguard business continuity.

PESTLE (Political, Economic, Sociological, Technological, Legal, and Environmental) helps broaden the scope of analysis, ensuring that organizations don’t just focus on internal factors but also account for the broader context that influences resilience.

The Cone of Plausibility is used to assess a range of future scenarios by staging different possible outcomes. SecAlliance has a helpful write-up (with examples) on how PESTLE-M (+Military) feeds the cone in cyber threat intelligence (CTI).

The scenario designations “wildcard, plausible, and baseline” are helpful labels that decision-makers can consistently anticipate. Additionally, these scenarios lack conclusions, thus avoiding leading the witness and allowing consumers to formulate their own derivative risk implications.
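One way to keep these labels consistent is to record each scenario with its PESTLE factor and its cone designation. The sketch below is a hypothetical data structure, not something taken from the SecAlliance write-up: the class names, the `group_by_label` helper, and the example scenarios are all illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Pestle(Enum):
    POLITICAL = "Political"
    ECONOMIC = "Economic"
    SOCIOLOGICAL = "Sociological"
    TECHNOLOGICAL = "Technological"
    LEGAL = "Legal"
    ENVIRONMENTAL = "Environmental"


class ConeLabel(Enum):
    BASELINE = "baseline"
    PLAUSIBLE = "plausible"
    WILDCARD = "wildcard"


@dataclass(frozen=True)
class Scenario:
    # Deliberately no "conclusion" field: the scenario describes what
    # might happen and leaves the risk implications to the consumer.
    description: str
    factor: Pestle
    label: ConeLabel


def group_by_label(scenarios: list[Scenario]) -> dict[ConeLabel, list[Scenario]]:
    """Bucket scenarios by cone designation, keeping empty buckets visible."""
    grouped: dict[ConeLabel, list[Scenario]] = {label: [] for label in ConeLabel}
    for scenario in scenarios:
        grouped[scenario.label].append(scenario)
    return grouped


# Hypothetical example entries, for illustration only.
examples = [
    Scenario("Vendor updates continue to ship without incident",
             Pestle.TECHNOLOGICAL, ConeLabel.BASELINE),
    Scenario("Another faulty endpoint update causes a multi-day outage",
             Pestle.TECHNOLOGICAL, ConeLabel.PLAUSIBLE),
    Scenario("New regulation mandates vendor diversity for endpoints",
             Pestle.LEGAL, ConeLabel.WILDCARD),
]

for label, items in group_by_label(examples).items():
    print(label.value, [s.description for s in items])
```

Keeping empty buckets in the output makes it obvious when an exercise has produced no wildcard scenarios at all, which is itself a useful prompt for more exploration.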

Western business implications from accelerating geopolitical tensions with China present an interesting cone exercise (twenty-year time horizon). Specifically, the cyber scenarios involve potential solutions with significant trade-offs in complexity and resilience.

Risk Management

There are no right answers in complexity/resilience trade-offs. Organizations need better processes to evaluate equities and soberly manage risks. Sometimes, the riz juice isn’t worth the squeeze, but everyone should understand the choices.

A special thank you to Rosie Hampton, Dylan Davis, and Megan Keeling for a lovely walk and discussion of these topics.
