CyberCX Hack Report: Insights from a year of offensive security testing

Finding the ‘floor’ in AI cyber risk


Oliver Smith, CyberCX Intelligence Lead Analyst [1]  21 February 2024

 

In November 2023, the Australian government signed the Bletchley Declaration, committing to manage the risk posed by artificial intelligence (AI) alongside 27 other countries and the European Union. Australia’s commitment is likely to spur legislative intervention in the domestic AI landscape in 2024 and beyond. But the speed of technological development coupled with the unrealised capability of current-generation technology leaves lawmakers with a uniquely difficult calibration challenge. Organisations too are faced with a nebulous AI cyber security risk, further clouded by uncertainty about the shape and effectiveness of AI safety regulation to come.

Irrespective of future regulation and technological developments, the AI technologies that exist in 2024 represent the low-water mark of AI risk to cyber security in Australia. To demonstrate this ‘floor’ of risk, we extrapolated from observed threat actor tendencies to develop a malicious AI capability. We used only current generation, open source models[2] operated locally on consumer-grade hardware.[3] That is, technology that cannot realistically be clawed back from the public domain. The system we created – an automated voice phishing engine – embodies the floor of AI cyber security risk in its domain for organisations and individuals and represents already ceded territory for lawmakers.

 

Background

Through 2023, we observed threat actors adopting malicious generative AI tools centred around code generation, including WormGPT, Evil-GPT and WolfGPT. While these tools have loomed large in industry dialogue, our collection indicates they produce low quality outputs and retain substantial knowledge barriers around crafting prompts and integrating generated code. Based on our observations of malicious AI use to date, we assess that current generation AI tools are most effective for tasks like social engineering and information operations due to the lower technical knowledge burden and reduced requirement for precision.

Threat actors already operate automated social engineering tools – for example, illicit One Time Password (OTP) services can be used to make automatic voice phishing phone calls to victims to social engineer multifactor authentication (MFA) credentials.

 

Figure 1 – Example illicit OTP Bot voice phishing service operated via Telegram.


Our hypothetical system builds on this concept to make phishing phone calls that target MFA credentials for enterprise account access. A threat actor could use such a system as a second stage following credential compromise, with minimal human oversight.

 

Generation phase

We start by developing a social engineering pretext for making the call – impersonating IT staff performing an authentication migration that requires staff members to provide an MFA code. We create a generic prompt for a local Large Language Model (LLM) acting as the phishing agent.[4] We populate placeholders with tailored information about the target organisation discovered through reconnaissance.
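As a rough illustration, the sketch below shows how such a prompt could be assembled and passed to the local model using the llama-cpp-python bindings noted in [3] and [4]. The placeholder names, reconnaissance values and model file path are assumptions for illustration only, not the actual prompt shown in Figure 2.

```python
# Minimal sketch only: placeholder names, reconnaissance values and the model file
# are illustrative, not the prompt or configuration used in the research.
from llama_cpp import Llama  # Python bindings for Llama.cpp

PHISHING_AGENT_TEMPLATE = (
    "You are an IT support officer at {org_name}. You are calling a staff member "
    "about a planned migration of the {mfa_platform} authentication platform. "
    "Stay in character and reply with short, natural spoken sentences only."
)

# Placeholders populated with reconnaissance about the target organisation.
prompt = PHISHING_AGENT_TEMPLATE.format(
    org_name="Example Corp",
    mfa_platform="Microsoft Authenticator",
)

# Quantised Mistral-7B-OpenOrca running locally via Llama.cpp.
llm = Llama(model_path="models/mistral-7b-openorca.Q4_K_M.gguf", n_ctx=2048)
opening_line = llm(prompt, max_tokens=96)["choices"][0]["text"]
```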

 

Figure 2 – Example phishing agent prompt.


We simulate victim responses using a second LLM agent. The victim prompt contains emotion and response style placeholders that we alter between interactions to inject randomness into the victim agent’s behaviour. This randomness helps us simulate more of the possible conversation space.
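A minimal sketch of how those placeholders could be re-rolled between simulated calls is shown below; the emotion and style lists and the template wording are illustrative only, not the prompts shown in Figure 3.

```python
import random

# Illustrative emotion and response-style values; the real lists were broader.
EMOTIONS = ["relaxed", "rushed", "suspicious", "confused", "distracted"]
STYLES = ["terse one-word answers", "chatty, digressive replies", "frequent clarifying questions"]

VICTIM_AGENT_TEMPLATE = (
    "You are a staff member at {org_name} answering an unexpected phone call. "
    "You are feeling {emotion} and you respond with {style}. "
    "Reply only with what you would say aloud."
)

def victim_prompt(org_name: str) -> str:
    # Re-rolled between simulated calls to vary the victim agent's behaviour.
    return VICTIM_AGENT_TEMPLATE.format(
        org_name=org_name,
        emotion=random.choice(EMOTIONS),
        style=random.choice(STYLES),
    )
```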

 

Figure 3 – Example victim agent prompts.


The phishing agent and the victim agent interact with one another repeatedly to generate text scripts simulating individual phone calls. We split the scripts into pairs of victim agent lines and phishing agent responses and use a text embedding model[5] to represent each victim line as a semantic vector. Comparing two semantic vectors lets us mathematically evaluate how close the input texts are in meaning. We insert the vectors into a database, forming a bank of all the ways the victim agent responded to the phishing scenario. We cull unsuccessful interactions using a script-supervisor agent, and drop vectors that are very close in meaning to one another as redundant.
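The embedding and redundancy-culling step might look something like the sketch below, using the instructor embedding model from [5]. The instruction text, the in-memory vector store and the 0.95 similarity cut-off are assumptions for illustration; a real system would likely use a dedicated vector database.

```python
import numpy as np
from InstructorEmbedding import INSTRUCTOR  # wrapper for hkunlp/instructor-large

embedder = INSTRUCTOR("hkunlp/instructor-large")
INSTRUCTION = "Represent the phone call utterance for retrieval:"  # illustrative instruction

def embed(lines: list[str]) -> np.ndarray:
    vectors = embedder.encode([[INSTRUCTION, line] for line in lines])
    # Normalise so a dot product between rows equals cosine similarity.
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def deduplicate(victim_lines: list[str], threshold: float = 0.95) -> list[int]:
    """Return indices of victim lines to keep, dropping near-duplicates in meaning."""
    vectors = embed(victim_lines)
    kept: list[int] = []
    for i, vec in enumerate(vectors):
        if all(float(vec @ vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```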

 

Figure 4 – Simulated conversation space being populated, projected onto three dimensions using T-distributed Stochastic Neighbour Embedding (t-SNE).

 

We now need to create voice lines for each of the phishing agent lines in our dataset. We used the Bark text-to-voice model,[6] which can produce very natural-sounding audio but with the drawback of extreme variance in quality. We use voice cloning techniques[7] to give the generated voice a passable local accent using a 15-second sample of a donor voice. A threat actor could impersonate the voice of an actual staff member at the target company if a short voice sample could be sourced.
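A minimal sketch of the synthesis step is below, assuming a cloned-voice history prompt (.npz) has already been produced with the fork in [7]; the file names and example line are illustrative only.

```python
import numpy as np
from scipy.io import wavfile
from bark import SAMPLE_RATE, generate_audio, preload_models

preload_models()  # download and cache Bark's text, coarse and fine models

# History prompt (.npz) produced separately with the voice-cloning fork from a
# ~15 second donor sample; the file name is illustrative.
CLONED_VOICE = "voices/au_donor.npz"

def synthesise(line: str, out_path: str) -> None:
    audio = generate_audio(line, history_prompt=CLONED_VOICE)  # float32 audio at 24 kHz
    wavfile.write(out_path, SAMPLE_RATE, (audio * 32767).astype(np.int16))

synthesise("Thanks, I'll just confirm that on my end now.", "clip_0001.wav")
```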


Figure 5 – Voice line produced by the text-to-voice model natively and with Australian accent cloning.


The text-to-voice model occasionally produces either random content not matching the input or poor-quality audio. To supervise the content of voice lines, we use a voice-to-text model to transcribe generated audio back into text,[8] and compare semantic vectors of the input and output text to ensure a close match.[9] To supervise the audio quality, we prepared a dataset of clips categorised as passing or failing through human review.[10] We used the dataset to train a classical convolutional neural network (CNN)[11] on mel spectrograms of the audio clips, which we then used to reject samples with audio defects or an inconsistent voice.
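The content check might look something like the sketch below, which uses the standard Transformers Whisper pipeline in place of the optimised implementation in [8] and reuses the embedding model from [5]. The instruction text is an assumption for illustration; the 0.9 cosine threshold follows note [9].

```python
import numpy as np
from transformers import pipeline
from InstructorEmbedding import INSTRUCTOR

# Plain Transformers Whisper pipeline (the research used an optimised Whisper
# implementation) plus the same embedding model used for the conversation bank.
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
embedder = INSTRUCTOR("hkunlp/instructor-large")
INSTRUCTION = "Represent the spoken sentence for comparison:"  # illustrative

def content_matches(intended_text: str, wav_path: str, threshold: float = 0.9) -> bool:
    """Reject clips where the generated audio drifted from the input text."""
    heard_text = transcriber(wav_path)["text"]
    vectors = embedder.encode([[INSTRUCTION, intended_text], [INSTRUCTION, heard_text]])
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return float(vectors[0] @ vectors[1]) > threshold
```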

 

Figure 6 – Generative Adversarial Network supervising voice clip generation.


We spent less than 100 hours of compute time to generate more than 1000 simulated conversations and associated validated audio clips. Many of these clips were wholly generic and could be reused against different targets to save generation time on a subsequent attack.

 

Execution phase

To execute a phishing attempt, we start by calling the victim. A threat actor could use number spoofing to match the impersonated company’s actual help desk number. When the victim picks up, we enter a behaviour loop that repeats the following:

 

Figure 7 – Execution loop.

 

The most computationally intensive aspect of this loop is transcribing the victim’s voice to text, which had an average processing time of around 5-10% of the input audio length. We also incurred a 0.8 second delay waiting for the victim to stop speaking.[12] The vector database and filesystem lookups were near instantaneous, taking approximately a tenth of a second combined. Altogether, the system produces a reasonable facsimile of intelligent human speech and can deliver convincing contextual information.
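Conceptually, one turn of the loop could be sketched as below. Telephony integration and call audio routing are out of scope; the simpleaudio playback is a stand-in, and bank_vectors and bank_clips are assumed to have been built during the generation phase.

```python
import numpy as np
import simpleaudio
import speech_recognition as sr
from transformers import pipeline
from InstructorEmbedding import INSTRUCTOR

transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
embedder = INSTRUCTOR("hkunlp/instructor-large")
INSTRUCTION = "Represent the phone call utterance for retrieval:"  # illustrative

def run_turn(recognizer: sr.Recognizer, source: sr.AudioSource,
             bank_vectors: np.ndarray, bank_clips: list[str]) -> None:
    """One pass of the execution loop against a pre-generated response bank."""
    recognizer.pause_threshold = 0.8           # ~0.8 s of silence ends the victim's turn
    audio = recognizer.listen(source)          # block until the victim stops speaking
    heard = transcriber(audio.get_wav_data())["text"]   # the most expensive step
    vector = embedder.encode([[INSTRUCTION, heard]])[0]
    vector = vector / np.linalg.norm(vector)
    nearest = int(np.argmax(bank_vectors @ vector))     # closest simulated victim line
    clip = simpleaudio.WaveObject.from_wave_file(bank_clips[nearest])
    clip.play().wait_done()                    # play the matching pre-generated reply
```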

 

Figure 8 – Example favourable interaction with the proof-of-concept system.


With a negligible setup and operating cost, this system can make calls constantly and perpetually. A threat actor using such a system gains access to scale that was once the domain of a multi-member criminal conspiracy without incurring the associated increase in risk of discovery by law enforcement.

 

Implications

 

Defensive advice

For individuals:
For organisations:

 


 

[1] This blog is based on research originally presented at the ISACA Canberra Conference in November 2023.

[2] An ‘open source’, ‘open weight’ or ‘local’ model is a model that has been released into the public domain such that it can be copied and operated on a user’s own hardware. An example of an open source model is Meta’s Llama family, which can be downloaded, operated and preserved on individual devices. These models can be contrasted with closed source models like OpenAI’s GPT-3 and later models, which must be accessed through a web service, leaving open the potential for post-release performance and safety changes.

[3] We operated the system demonstrated through this blog on an NVIDIA GeForce RTX 2070 GPU with 8GB of VRAM. Inference was performed with Llama.cpp (https://github.com/ggerganov/llama.cpp) and Transformers (https://github.com/huggingface/transformers).

[4] We used the Mistral-7B-OpenOrca LLM (https://huggingface.co/Open-Orca/Mistral-7B-OpenOrca) in a quantized form (https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-GGUF).

[5] We used the hkunlp/instructor-large instructor embedding model (https://huggingface.co/hkunlp/instructor-large).

[6] We used Suno AI’s Bark model (https://github.com/suno-ai/bark).

[7] We used Serp AI’s Bark voice clone fork (https://github.com/serp-ai/bark-with-voice-clone).

[8] We used Vaibhavs10’s implementation (https://github.com/Vaibhavs10/insanely-fast-whisper) of OpenAI’s Whisper model (https://huggingface.co/openai/whisper-large-v3).

[9] We used a threshold of > 0.9 cosine similarity between the vectors.

[10] We used several datasets identifying specific undesirable audio features to reduce dataset requirements.

[11] We used PyTorch to train and run inference on the CNN binary classification model (https://github.com/pytorch/pytorch).

[12] We detected pauses in speech using the Python Speech Recognition library (https://github.com/Uberi/speech_recognition#readme).

[13] https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-in-2023-generative-ais-breakout-year
