Stuart McKay, CyberCX’s Senior Manager of Security Operations in our NSW Security Operations Centre (SOC), recently presented at CyberCon on quantifying success in a modern SOC and why existing measurements are no longer enough. This blog is based on Stuart’s presentation, which was created in collaboration with Fraser Metcalf, Principal Solution Lead for Cyber Defence.
In the New South Wales SOC at CyberCX we have a team of around 30 analysts delivering both an internal SOC function and a commercial SOC function for clients. So, it goes without saying that optimising outcomes and measuring success when running a SOC is never far from our minds.
As anyone familiar with Goodhart’s law will tell you, when a measurement becomes the target, it stops being a good measure. When running a Security Operations team who look after critical infrastructure and global businesses, the last thing you want is to incentivise the wrong behaviours in your team. But how can we measure the team in a way which will drive the right behaviours?
Considering the unique nature of SOC work, it is important to look at how we measure and drive a SOC team. After all, we want the people we are entrusting with our digital protection incentivised to pay attention to their work. But we also want to balance those incentives so that they can keep up with the pace of potential cyber security incidents in a modern business and not focus on one element to the exclusion of all others.
Consider what incentives we create by measuring in the SOC. If we measure our team on the mean time to triage (that is, the amount of time they take to investigate a potential security incident), we may incentivise them to spend the minimum possible time on an investigation. Conversely, if we measure only the accuracy of our determinations during an investigation, our teams could end up spending too long making sure all the data lines up perfectly. Indeed, these measurements, as well as other industry measurements such as Mean time to Detect (MttD) and Mean time to Respond (MttR), can incentivise behaviours contrary to their intention.
The purpose of the SOC is a balance between many possible measurements. We do not want our triage to be too quick, nor too slow. We need it to be just right. The same can be said for any number of other components of the functions of your SOC team, from variety of incidents to proactive threat hunting and improving detection capabilities. So how do we walk our team along this narrow tightrope without driving them off one side?
This can be addressed through a principle of ‘strategic opacity in incentive provision’ [1], which is a complicated way of saying we could solve this problem by measuring randomly from a larger set of measurements which incentivise contradictory behaviours. By doing this, an analyst cannot optimise for a specific KPI (in fact they are incentivised not to), and so we can drive towards the actual goal of the SOC: to protect the organisations we look after.
What does this look like in practice?
In practice, this could mean creating a suite of measures and their opposing ‘counter measures’.
These could then all be pooled into a bucket of measurements and sampled at random on an ongoing basis to get a sense of how the team are doing. Think of it like a regular and random temperature check.
Here is a snapshot of what these measures and their opposing counter measures could look like, along with the overall goal you might want to achieve:
| Goal | Measure | Counter measure |
|---|---|---|
| Balance between speed and accuracy in the triage of incidents. | Number of incidents triaged per day. This measures the number of incidents an analyst closes in a shift. | Accuracy of incident classification. This measures the accuracy of the determinations made by the analyst. |
| Analysts gaining both breadth and depth of experience in handling incidents. | Number of similar incidents triaged. This measures the number of similar incidents handled by an analyst (for example phishing incidents). | Number of different incidents triaged. This measures the variety of incident types handled by an analyst. |
| Analysts able both to write a detection and to proactively investigate potential threats. | Number of detections written. This measures the number of detections an analyst contributes to the overall codebase. | Number of proactive threat hunts. This measures the number of times an analyst proactively investigates something possibly suspicious. |
Randomly selecting which of these measures you look at each week or month reduces the risk of metrics being influenced or manipulated by the team. It is important to note that while this is designed to be deliberately transparent, the SOC team should all be brought on the journey of understanding the approach to measuring the outcomes of the SOC, lest this approach feel arbitrary and opaque.
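To make the idea concrete, here is a minimal sketch of how a pool of measures and counter measures could be sampled for each review period. It is an illustration only, not a description of any tooling we run: the names `MeasurePair`, `POOL`, `all_measures` and `pick_measures`, the shortened goal wording, and the choice of two measures per period are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MeasurePair:
    """A goal with a measure and the counter measure that opposes it."""
    goal: str
    measure: str
    counter_measure: str


# Hypothetical pool built from the table above; goal text is abbreviated.
POOL = [
    MeasurePair(
        "Balance between speed and accuracy in triage",
        "Number of incidents triaged per day",
        "Accuracy of incident classification",
    ),
    MeasurePair(
        "Breadth and depth of incident experience",
        "Number of similar incidents triaged",
        "Number of different incidents triaged",
    ),
    MeasurePair(
        "Detection writing and proactive investigation",
        "Number of detections written",
        "Number of proactive threat hunts",
    ),
]


def all_measures(pool):
    """Flatten every pair into one bucket of individual measures."""
    return [m for pair in pool for m in (pair.measure, pair.counter_measure)]


def pick_measures(pool, k, rng):
    """Pick k distinct measures at random for one review period."""
    return rng.sample(all_measures(pool), k)


if __name__ == "__main__":
    # Left unseeded so the selection cannot be predicted in advance.
    rng = random.Random()
    for week in range(1, 5):
        print(f"Week {week}: {pick_measures(POOL, 2, rng)}")
```

Because any measure or its counter measure may come up in a given period, optimising for one of them at the expense of the other carries a risk for the analyst, which is the behaviour the approach is trying to encourage.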
Our experience has been that people, enabled by technology to defend against cyber threats, are at the centre of the modern SOC. A people-centric approach means driving the right behaviours, as they will have a multiplying impact on other facets of the SOC.
Getting the balance right means you can reap the benefits of greater client outcomes, increased team technical proficiency, continuous improvement of capabilities, and a better overall performance culture.