Researcher profile

Nigel Shadbolt

Nigel Shadbolt contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
12works
0followers
7topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

12 published item(s)

preprint2026arXiv

Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone

Alignment evaluation in machine learning has largely become evaluation of models. Influential benchmarks score model outputs under fixed inputs, such as truthfulness, instruction following, or pairwise preference, and these scores are often used to support claims about deployed alignment. This paper argues that deployment-relevant alignment cannot be inferred from model-level evaluation alone. Alignment claims should instead be indexed to the level at which evidence is collected: model-level, response-level, interaction-level, or deployment-level. Two studies support this position. First, a structured audit of eleven alignment benchmarks, extended to a sixteen-benchmark corpus, dual-coded against an eight-dimension rubric with Cohen's kappa = 0.87, finds that user-facing verification support is absent across every benchmark examined, while process steerability is nearly absent. The few interactional benchmarks identified, including tau-bench, CURATe, Rifts, and Common Ground, remain fragmented in coverage, and benchmark construction rather than data source determines what is measured. Second, a blinded cross-model stress test using 180 transcripts across three frontier models and four scaffolds finds that the same verification scaffold raises one model's verification support to ceiling while leaving another categorically unchanged. This shows that scaffold efficacy is model-dependent and that the gap identified by the audit cannot be closed at the model level alone. We propose a system-level evaluation agenda: alignment profiles instead of single scores, fixed-scaffolding protocols for comparable interactional evaluation, and reporting templates that make the inferential distance between evaluation evidence and deployment claims explicit.

preprint2026arXiv

From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement

Pluralistic alignment is typically operationalised as preference aggregation: producing responses that span (Overton), steer toward (Steerable), or proportionally represent (Distributional) diverse human values. We argue that aggregation alone is an incomplete primitive for deployed pluralistic alignment. Under genuine value pluralism, the failure mode of contemporary RLHF-trained assistants is not insufficient coverage but sycophantic consensus: a learned tendency to agree with, validate, and minimise friction with the immediate interlocutor. Because deployed AI systems now mediate consequential deliberation across health, civic life, labour, and governance, the collapse of disagreement at the interaction layer is not a narrow technical concern but a structural failure with distributive consequences. We reframe pluralistic alignment around three conversational mechanisms drawn from Grice's maxims: scoping (acknowledging the limits of one's perspective), signalling (surfacing value-conflict rather than smoothing it over), and repair (revising one's position on principled grounds, not on user pressure). We formalise a metric, the Pluralistic Repair Score (PRS), distinguishing principled revision from capitulation, and present a small-scale empirical illustration on two frontier RLHF-trained models (Claude Sonnet 4.5, N=198; GPT-4o, N=100) showing that, for both, agreement-following coexists with low repair-quality on contested-value prompts. PRS measures an interactional precondition for pluralism (visible disagreement; principled revision) rather than pluralism in full; we discuss the difference, take seriously the reflexive question of whose "principled" counts, and argue that pluralism is most decisively made or unmade at the deployment-governance layer: interfaces, preference-data pipelines, and audit infrastructure.

preprint2026arXiv

NeurIPS Should Require Reproducibility Standards for Frontier AI Safety Claims

Frontier AI safety claims - published assertions that a highly capable general-purpose model is below a threshold of concern, adequately mitigated, or suitable for release - increasingly shape model deployment, governance, and public trust. Yet the artefacts needed to evaluate them are routinely withheld, producing an evidential inversion: the most consequential claims in AI safety are often the least reproducible. This position paper argues that NeurIPS should require reproducibility standards for papers making such claims, treating non-reproducibility not as a transparency preference but as an evaluation-methodology failure. The 2026 International AI Safety Report [Bengio et al., 2026] concludes that reliable pre-deployment safety testing has become harder to conduct and that models now distinguish test from deployment contexts; the 2025 Foundation Model Transparency Index [Wan et al., 2025] reports a sector-average transparency score of 40/100 with no major developer adequately disclosing train-test overlap; contemporaneous measurement-theory work shows that attack-success-rate comparisons across systems are often founded on low-validity measurements [Chouldechova et al., 2025]. We propose a three-tier disclosure framework, distinguishing public, controlled, and claim-restricted disclosure, paired with a mandatory claim inventory, scope statements, and a phased implementation path with graduated sanctions. The framework treats secrecy and openness as endpoints of a spectrum, with controlled review (via a federated colloquium of qualified secure-review hosts) covering claims whose artefacts cannot be released publicly, and right-scaling claims whose artefacts cannot be reviewed even confidentially. The standard the community applies to its most consequential claims should be at least as high as the standard it applies to its least.

preprint2026arXiv

The Evaluation Differential: When Frontier AI Models Recognise They Are Being Tested

Recent published evidence from frontier laboratories shows that contemporary AI models can recognise evaluation contexts, latently represent them, and behave differently under those contexts than under deployment-continuous conditions. Anthropic's BrowseComp incident, the Natural Language Autoencoder findings on SWE-bench Verified and destructive-coding evaluations, and the OpenAI / Apollo anti-scheming work all document instances of this phenomenon. We argue that these findings create a claim-validity problem for safety conclusions drawn from frontier evaluations. We introduce the Evaluation Differential (ED), a conditional divergence in a target behavioural property between recognised-evaluation and deployment-continuous contexts, define a normalised effect-size form (nED) for cross-property comparison, and prove that marginal evaluation scores cannot identify ED. We develop a typology of safety claims (ED-stable, ED-degraded, ED-inverted, ED-undetermined) by their warrant-status under documented divergence, and specify TRACE (Test-Recognition Audit for Claim Evaluation), an audit protocol that wraps existing evaluation infrastructure and produces restricted claims rather than capability scores. We apply the framework retrospectively to three publicly documented evaluation incidents and discuss governance implications for system cards, conformity assessment, and the international network of AI safety and security institutes. TRACE does not eliminate adversarial adaptation; it disciplines the claims drawn from evaluation evidence by making explicit the conditions under which that evidence was produced.

preprint2022arXiv

Backdoors Stuck At The Frontdoor: Multi-Agent Backdoor Attacks That Backfire

Malicious agents in collaborative learning and outsourced data collection threaten the training of clean models. Backdoor attacks, where an attacker poisons a model during training to successfully achieve targeted misclassification, are a major concern to train-time robustness. In this paper, we investigate a multi-agent backdoor attack scenario, where multiple attackers attempt to backdoor a victim model simultaneously. A consistent backfiring phenomenon is observed across a wide range of games, where agents suffer from a low collective attack success rate. We examine different modes of backdoor attack configurations, non-cooperation / cooperation, joint distribution shifts, and game setups to return an equilibrium attack success rate at the lower bound. The results motivate the re-evaluation of backdoor defense research for practical environments.

preprint2022arXiv

Goodbye Tracking? Impact of iOS App Tracking Transparency and Privacy Labels

Tracking is a highly privacy-invasive data collection practice that has been ubiquitous in mobile apps for many years due to its role in supporting advertising-based revenue models. In response, Apple introduced two significant changes with iOS 14: App Tracking Transparency (ATT), a mandatory opt-in system for enabling tracking on iOS, and Privacy Nutrition Labels, which disclose what kinds of data each app processes. So far, the impact of these changes on individual privacy and control has not been well understood. This paper addresses this gap by analysing two versions of 1,759 iOS apps from the UK App Store: one version from before iOS 14 and one that has been updated to comply with the new rules. We find that Apple's new policies, as promised, prevent the collection of the Identifier for Advertisers (IDFA), an identifier for cross-app tracking. Smaller data brokers that engage in invasive data practices will now face higher challenges in tracking users - a positive development for privacy. However, the number of tracking libraries has roughly stayed the same in the studied apps. Many apps still collect device information that can be used to track users at a group level (cohort tracking) or identify individuals probabilistically (fingerprinting). We find real-world evidence of apps computing and agreeing on a fingerprinting-derived identifier through the use of server-side code, thereby violating Apple's policies. We find that Apple itself engages in some forms of tracking and exempts invasive data practices like first-party tracking and credit scoring. We also find that the new Privacy Nutrition Labels are sometimes inaccurate and misleading. Overall, our findings suggest that, while tracking individual users is more difficult now, the changes reinforce existing market power of gatekeeper companies with access to large troves of first-party data and motivate a countermovement.

preprint2022arXiv

GreaseVision: Rewriting the Rules of the Interface

Digital harms can manifest across any interface. Key problems in addressing these harms include the high individuality of harms and the fast-changing nature of digital systems. As a result, we still lack a systematic approach to study harms and produce interventions for end-users. We put forward GreaseVision, a new framework that enables end-users to collaboratively develop interventions against harms in software using a no-code approach and recent advances in few-shot machine learning. The contribution of the framework and tool allow individual end-users to study their usage history and create personalized interventions. Our contribution also enables researchers to study the distribution of harms and interventions at scale.

preprint2022arXiv

Hiding Behind Backdoors: Self-Obfuscation Against Generative Models

Attack vectors that compromise machine learning pipelines in the physical world have been demonstrated in recent research, from perturbations to architectural components. Building on this work, we illustrate the self-obfuscation attack: attackers target a pre-processing model in the system, and poison the training set of generative models to obfuscate a specific class during inference. Our contribution is to describe, implement and evaluate a generalized attack, in the hope of raising awareness regarding the challenge of architectural robustness within the machine learning community.

preprint2022arXiv

Interpolating Compressed Parameter Subspaces

Inspired by recent work on neural subspaces and mode connectivity, we revisit parameter subspace sampling for shifted and/or interpolatable input distributions (instead of a single, unshifted distribution). We enforce a compressed geometric structure upon a set of trained parameters mapped to a set of train-time distributions, denoting the resulting subspaces as Compressed Parameter Subspaces (CPS). We show the success and failure modes of the types of shifted distributions whose optimal parameters reside in the CPS. We find that ensembling point-estimates within a CPS can yield a high average accuracy across a range of test-time distributions, including backdoor, adversarial, permutation, stylization and rotation perturbations. We also find that the CPS can contain low-loss point-estimates for various task shifts (albeit interpolated, perturbed, unseen or non-identical coarse labels). We further demonstrate this property in a continual learning setting with CIFAR100.

preprint2021arXiv

Exploring Design and Governance Challenges in the Development of Privacy-Preserving Computation

Homomorphic encryption, secure multi-party computation, and differential privacy are part of an emerging class of Privacy Enhancing Technologies which share a common promise: to preserve privacy whilst also obtaining the benefits of computational analysis. Due to their relative novelty, complexity, and opacity, these technologies provoke a variety of novel questions for design and governance. We interviewed researchers, developers, industry leaders, policymakers, and designers involved in their deployment to explore motivations, expectations, perceived opportunities and barriers to adoption. This provided insight into several pertinent challenges facing the adoption of these technologies, including: how they might make a nebulous concept like privacy computationally tractable; how to make them more usable by developers; and how they could be explained and made accountable to stakeholders and wider society. We conclude with implications for the development, deployment, and responsible governance of these privacy-preserving computation techniques.

preprint2020arXiv

'I Just Want to Hack Myself to Not Get Distracted': Evaluating Design Interventions for Self-Control on Facebook

Beyond being the world's largest social network, Facebook is for many also one of its greatest sources of digital distraction. For students, problematic use has been associated with negative effects on academic achievement and general wellbeing. To understand what strategies could help users regain control, we investigated how simple interventions to the Facebook UI affect behaviour and perceived control. We assigned 58 university students to one of three interventions: goal reminders, removed newsfeed, or white background (control). We logged use for 6 weeks, applied interventions in the middle weeks, and administered fortnightly surveys. Both goal reminders and removed newsfeed helped participants stay on task and avoid distraction. However, goal reminders were often annoying, and removing the newsfeed made some fear missing out on information. Our findings point to future interventions such as controls for adjusting types and amount of available information, and flexible blocking which matches individual definitions of 'distraction'.

preprint2020arXiv

Strangers in the Room: Unpacking Perceptions of 'Smartness' and Related Ethical Concerns in the Home

The increasingly widespread use of 'smart' devices has raised multifarious ethical concerns regarding their use in domestic spaces. Previous work examining such ethical dimensions has typically either involved empirical studies of concerns raised by specific devices and use contexts, or alternatively expounded on abstract concepts like autonomy, privacy or trust in relation to 'smart homes' in general. This paper attempts to bridge these approaches by asking what features of smart devices users consider as rendering them 'smart' and how these relate to ethical concerns. Through a multimethod investigation including surveys with smart device users (n=120) and semi-structured interviews (n=15), we identify and describe eight types of smartness and explore how they engender a variety of ethical concerns including privacy, autonomy, and disruption of the social order. We argue that this middle ground, between concerns arising from particular devices and more abstract ethical concepts, can better anticipate potential ethical concerns regarding smart devices.