Source author record

Ilia Shumailov

Ilia Shumailov appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Cryptography and Security cs.CY Artificial Intelligence Computation and Language Computer Science and Game Theory Computer Vision Human-Computer Interaction Networking and Internet Architecture Social and Information Networks Sound

Catalog footprint

What is connected

14works

11topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Exploring the limits of strong membership inference attacks on large language models

State-of-the-art membership inference attacks (MIAs) typically require training many reference models, making it difficult to scale these attacks to large pre-trained language models (LLMs). As a result, prior research has either relied on weaker attacks that avoid training references (e.g., fine-tuning attacks), or on stronger attacks applied to small models and datasets. However, weaker attacks have been shown to be brittle and insights from strong attacks in simplified settings do not translate to today's LLMs. These challenges prompt an important question: are the limitations observed in prior work due to attack design choices, or are MIAs fundamentally ineffective on LLMs? We address this question by scaling LiRA--one of the strongest MIAs--to GPT-2 architectures ranging from 10M to 1B parameters, training references on over 20B tokens from the C4 dataset. Our results advance the understanding of MIAs on LLMs in four key ways. While (1) strong MIAs can succeed on pre-trained LLMs, (2) their effectiveness, remains limited (e.g., AUC<0.7) in practical settings. (3) Even when strong MIAs achieve better-than-random AUC, aggregate metrics can conceal substantial per-sample MIA decision instability: due to training randomness, many decisions are so unstable that they are statistically indistinguishable from a coin flip. Finally, (4) the relationship between MIA success and related LLM privacy metrics is not as straightforward as prior work has suggested.

preprint2026arXiv

Quantamination: Dynamic Quantization Leaks Your Data Across the Batch

Dynamic quantization emerged as a practical approach to increase the utilization and efficiency of the machine learning serving flow. Unlike static quantization, which applies quantization offline, dynamic quantization operates on tensors at run-time, adapting its parameters to the actual input data. Today's mainstream machine learning frameworks, including ML compilers and inference engines, frequently recommend dynamic quantization as an initial step for optimizing model serving. This is because dynamic quantization can significantly reduce memory usage and computational load, leading to faster token generation and improved model serving efficiency without substantial loss in model accuracy. In this paper, we reveal a critical vulnerability in dynamic quantization: an adversary can exploit such quantization strategy to steal sensitive user data placed in the same batch as the adversary's input. Our analysis demonstrates that dynamic quantization, when improperly implemented or configured, can create side channels that expose information about other inputs within the same batch. We call this phenomenon Quantamination, describing contamination from quantization. Specifically, we show that at least 4 of the most popular ML frameworks in use today either default to or can use configurations that leak data across the batch boundary. This data leakage, in theory, allows attackers to partially or even fully recover other users' batched input data, representing a serious privacy risk for existing ML serving frameworks.

preprint2026arXiv

Soft Instruction De-escalation Defense

Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when dealing with untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control)-a simple yet effective iterative prompt sanitization loop designed for tool-augmented LLM agents. Our method repeatedly inspects incoming data for instructions that could compromise agent behavior. If such content is found, the malicious content is rewritten, masked, or removed, and the result is re-evaluated. The process continues until the input is clean or a maximum iteration limit is reached; if imperative instruction-like content remains, the agent halts to ensure security. By allowing multiple passes, our approach acknowledges that individual rewrites may fail but enables the system to catch and correct missed injections in later steps. Although immediately useful, worst-case analysis shows that SIC is not infallible; strong adversary can still get a 15% ASR by embedding non-imperative workflows. This nonetheless raises the bar.

preprint2025arXiv

Iterative Deployment Improves Planning Skills in LLMs

We show that iterative deployment of large language models (LLMs), each fine-tuned on data carefully curated by users from the previous models' deployment, can significantly change the properties of the resultant models. By testing this mechanism on various planning domains, we observe substantial improvements in planning skills, with later models displaying emergent generalization by discovering much longer plans than the initial models. We then provide theoretical analysis showing that iterative deployment effectively implements reinforcement learning (RL) training in the outer-loop (i.e. not as part of intentional model training), with an implicit reward function. The connection to RL has two important implications: first, for the field of AI safety, as the reward function entailed by repeated deployment is not defined explicitly, and could have unexpected implications to the properties of future model deployments. Second, the mechanism highlighted here can be viewed as an alternative training regime to explicit RL, relying on data curation rather than explicit rewards.

preprint2022arXiv

Architectural Backdoors in Neural Networks

Machine learning is vulnerable to adversarial manipulation. Previous literature has demonstrated that at the training stage attackers can manipulate data and data sampling procedures to control model behaviour. A common attack goal is to plant backdoors i.e. force the victim model to learn to recognise a trigger known only by the adversary. In this paper, we introduce a new class of backdoor attacks that hide inside model architectures i.e. in the inductive bias of the functions used to train. These backdoors are simple to implement, for instance by publishing open-source code for a backdoored model architecture that others will reuse unknowingly. We demonstrate that model architectural backdoors represent a real threat and, unlike other approaches, can survive a complete re-training from scratch. We formalise the main construction principles behind architectural backdoors, such as a link between the input and the output, and describe some possible protections against them. We evaluate our attacks on computer vision benchmarks of different scales and demonstrate the underlying vulnerability is pervasive in a variety of training settings.

preprint2022arXiv

Efficient Adversarial Training With Data Pruning

Neural networks are susceptible to adversarial examples-small input perturbations that cause models to fail. Adversarial training is one of the solutions that stops adversarial examples; models are exposed to attacks during training and learn to be resilient to them. Yet, such a procedure is currently expensive-it takes a long time to produce and train models with adversarial samples, and, what is worse, it occasionally fails. In this paper we demonstrate data pruning-a method for increasing adversarial training efficiency through data sub-sampling.We empirically show that data pruning leads to improvements in convergence and reliability of adversarial training, albeit with different levels of utility degradation. For example, we observe that using random sub-sampling of CIFAR10 to drop 40% of data, we lose 8% adversarial accuracy against the strongest attackers, while by using only 20% of data we lose 14% adversarial accuracy and reduce runtime by a factor of 3. Interestingly, we discover that in some settings data pruning brings benefits from both worlds-it both improves adversarial accuracy and training time.

preprint2022arXiv

Model Architecture Adaption for Bayesian Neural Networks

Bayesian Neural Networks (BNNs) offer a mathematically grounded framework to quantify the uncertainty of model predictions but come with a prohibitive computation cost for both training and inference. In this work, we show a novel network architecture search (NAS) that optimizes BNNs for both accuracy and uncertainty while having a reduced inference latency. Different from canonical NAS that optimizes solely for in-distribution likelihood, the proposed scheme searches for the uncertainty performance using both in- and out-of-distribution data. Our method is able to search for the correct placement of Bayesian layer(s) in a network. In our experiments, the searched models show comparable uncertainty quantification ability and accuracy compared to the state-of-the-art (deep ensemble). In addition, the searched models use only a fraction of the runtime compared to many popular BNN baselines, reducing the inference runtime cost by $2.98 \times$ and $2.92 \times$ respectively on the CIFAR10 dataset when compared to MCDropout and deep ensemble.

preprint2022arXiv

Rethinking Image-Scaling Attacks: The Interplay Between Vulnerabilities in Machine Learning Systems

As real-world images come in varying sizes, the machine learning model is part of a larger system that includes an upstream image scaling algorithm. In this paper, we investigate the interplay between vulnerabilities of the image scaling procedure and machine learning models in the decision-based black-box setting. We propose a novel sampling strategy to make a black-box attack exploit vulnerabilities in scaling algorithms, scaling defenses, and the final machine learning model in an end-to-end manner. Based on this scaling-aware attack, we reveal that most existing scaling defenses are ineffective under threat from downstream models. Moreover, we empirically observe that standard black-box attacks can significantly improve their performance by exploiting the vulnerable scaling procedure. We further demonstrate this problem on a commercial Image Analysis API with decision-based black-box attacks.

preprint2021arXiv

On Attribution of Deepfakes

Progress in generative modelling, especially generative adversarial networks, have made it possible to efficiently synthesize and alter media at scale. Malicious individuals now rely on these machine-generated media, or deepfakes, to manipulate social discourse. In order to ensure media authenticity, existing research is focused on deepfake detection. Yet, the adversarial nature of frameworks used for generative modeling suggests that progress towards detecting deepfakes will enable more realistic deepfake generation. Therefore, it comes at no surprise that developers of generative models are under the scrutiny of stakeholders dealing with misinformation campaigns. At the same time, generative models have a lot of positive applications. As such, there is a clear need to develop tools that ensure the transparent use of generative modeling, while minimizing the harm caused by malicious applications. Our technique optimizes over the source of entropy of each generative model to probabilistically attribute a deepfake to one of the models. We evaluate our method on the seminal example of face synthesis, demonstrating that our approach achieves 97.62% attribution accuracy, and is less sensitive to perturbations and adversarial examples. We discuss the ethical implications of our work, identify where our technique can be used, and highlight that a more meaningful legislative framework is required for a more transparent and ethical use of generative modeling. Finally, we argue that model developers should be capable of claiming plausible deniability and propose a second framework to do so -- this allows a model developer to produce evidence that they did not produce media that they are being accused of having produced.

preprint2020arXiv

BatNet: Data transmission between smartphones over ultrasound

In this paper, we present BatNet, a data transmission mechanism using ultrasound signals over the built-in speakers and microphones of smartphones. Using phase shift keying with an 8-point constellation and frequencies between 20--24kHz, it can transmit data at over 600bit/s up to 6m. The target application is a censorship-resistant mesh network. We also evaluated it for Covid contact tracing but concluded that in this application ultrasonic communications do not appear to offer enough advantage over Bluetooth Low Energy to be worth further development.

preprint2020arXiv

Snitches Get Stitches: On The Difficulty of Whistleblowing

One of the most critical security protocol problems for humans is when you are betraying a trust, perhaps for some higher purpose, and the world can turn against you if you're caught. In this short paper, we report on efforts to enable whistleblowers to leak sensitive documents to journalists more safely. Following a survey of cases where whistleblowers were discovered due to operational or technological issues, we propose a game-theoretic model capturing the power dynamics involved in whistleblowing. We find that the whistleblower is often at the mercy of motivations and abilities of others. We identify specific areas where technology may be used to mitigate the whistleblower's risk. However we warn against technical solutionism: the main constraints are often institutional.

preprint2020arXiv

Tendrils of Crime: Visualizing the Diffusion of Stolen Bitcoins

The first six months of 2018 saw cryptocurrency thefts of $761 million, and the technology is also the latest and greatest tool for money laundering. This increase in crime has caused both researchers and law enforcement to look for ways to trace criminal proceeds. Although tracing algorithms have improved recently, they still yield an enormous amount of data of which very few datapoints are relevant or interesting to investigators, let alone ordinary bitcoin owners interested in provenance. In this work we describe efforts to visualize relevant data on a blockchain. To accomplish this we come up with a graphical model to represent the stolen coins and then implement this using a variety of visualization techniques.

preprint2020arXiv

To compress or not to compress: Understanding the Interactions between Adversarial Attacks and Neural Network Compression

As deep neural networks (DNNs) become widely used, pruned and quantised models are becoming ubiquitous on edge devices; such compressed DNNs are popular for lowering computational requirements. Meanwhile, recent studies show that adversarial samples can be effective at making DNNs misclassify. We, therefore, investigate the extent to which adversarial samples are transferable between uncompressed and compressed DNNs. We find that adversarial samples remain transferable for both pruned and quantised models. For pruning, the adversarial samples generated from heavily pruned models remain effective on uncompressed models. For quantisation, we find the transferability of adversarial samples is highly sensitive to integer precision.

preprint2020arXiv

Towards Certifiable Adversarial Sample Detection

Convolutional Neural Networks (CNNs) are deployed in more and more classification systems, but adversarial samples can be maliciously crafted to trick them, and are becoming a real threat. There have been various proposals to improve CNNs' adversarial robustness but these all suffer performance penalties or other limitations. In this paper, we provide a new approach in the form of a certifiable adversarial detection scheme, the Certifiable Taboo Trap (CTT). The system can provide certifiable guarantees of detection of adversarial inputs for certain $l_{\infty}$ sizes on a reasonable assumption, namely that the training data have the same distribution as the test data. We develop and evaluate several versions of CTT with a range of defense capabilities, training overheads and certifiability on adversarial samples. Against adversaries with various $l_p$ norms, CTT outperforms existing defense methods that focus purely on improving network robustness. We show that CTT has small false positive rates on clean test data, minimal compute overheads when deployed, and can support complex security policies.

Ilia Shumailov

What is connected

Connect this record

See the researcher in context

Building this map preview

14 published item(s)

Exploring the limits of strong membership inference attacks on large language models

Quantamination: Dynamic Quantization Leaks Your Data Across the Batch

Soft Instruction De-escalation Defense

Iterative Deployment Improves Planning Skills in LLMs

Architectural Backdoors in Neural Networks

Efficient Adversarial Training With Data Pruning

Model Architecture Adaption for Bayesian Neural Networks

Rethinking Image-Scaling Attacks: The Interplay Between Vulnerabilities in Machine Learning Systems

On Attribution of Deepfakes

BatNet: Data transmission between smartphones over ultrasound

Snitches Get Stitches: On The Difficulty of Whistleblowing

Tendrils of Crime: Visualizing the Diffusion of Stolen Bitcoins

To compress or not to compress: Understanding the Interactions between Adversarial Attacks and Neural Network Compression

Towards Certifiable Adversarial Sample Detection