Source author record

Pascal Berrang

Pascal Berrang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Cryptography and Security · Machine Learning

Catalog footprint

What is connected

3 works

2 topics

4 close collaborators

Preprint · 2026 · arXiv

Beyond Red-Teaming: Formal Guarantees of LLM Guardrail Classifiers

Guardrail Classifiers defend production language models against harmful behavior, but although results seem promising in testing, they provide no formal guarantees. Providing formal guarantees for such models is hard because "harmful behavior" has no natural specification in a discrete input space, and the standard epsilon-ball properties used in other domains do not carry semantic meaning. We close this gap by shifting verification from the discrete input space to the classifier's pre-activation space, where we define a harmful region as a convex shape enclosing the representations of known harmful prompts. Because the sigmoid classification head is monotonic, certifying the worst-case point is sufficient to certify the entire region, yielding a closed-form soundness proof without approximation in O(d) time. To formally evaluate these classifiers, we propose two constructions of such regions: SVD-aligned hyper-rectangles, which yield exact SAT/UNSAT certificates, and Gaussian Mixture Models, which yield probabilistic certificates over semantically coherent clusters. Applying this framework to three author-trained Guardrail Classifiers on the toxicity domain, every hyper-rectangle configuration returns SAT, exposing verifiable safety holes across all classifiers, despite seemingly high empirical metrics. Probabilistic GMM certificates also expose a divergent structural stability in how these models represent harm. While GPT-2 and Llama-3.1-8B maintain robust coverage of 90% and 80% across varying boundaries, BERT's safety guarantees prove uniquely volatile. This 'coverage collapse' to 55% at the optimal threshold reveals a sparsely populated safety margin in BERT, which only achieves full coverage by adopting an extremely conservative pessimistic threshold. Combined, these approaches provide new insights into how effective Guardrail Classifiers really are, beyond traditional red-teaming.
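The worst-case argument in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a linear classification head (a weight vector `w` and bias `b`) followed by a sigmoid, and an axis-aligned box in pre-activation space standing in for the SVD-aligned hyper-rectangle. The function names, the 0.5 threshold, and the reading of SAT as "a point inside the harmful region is scored as safe" are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def stable_sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def certify_box(
    w: List[float],
    b: float,
    lo: List[float],
    hi: List[float],
    threshold: float = 0.5,  # hypothetical decision threshold
) -> Tuple[str, List[float], float]:
    """Check whether every point in the box [lo, hi] is scored as harmful.

    Assumes score(z) = sigmoid(w . z + b). Because the sigmoid is monotonic,
    the minimum score over the box is reached where the logit is minimal,
    and a linear logit is minimized coordinate by coordinate: take lo[i]
    when w[i] >= 0 and hi[i] otherwise. This is a single O(d) pass.
    """
    worst = [l if wi >= 0 else h for wi, l, h in zip(w, lo, hi)]
    min_logit = b + sum(wi * zi for wi, zi in zip(w, worst))
    min_score = stable_sigmoid(min_logit)
    # "UNSAT": no point in the region falls below the threshold.
    # "SAT": the worst-case point is a concrete counterexample.
    status = "UNSAT" if min_score >= threshold else "SAT"
    return status, worst, min_score


if __name__ == "__main__":
    # Toy two-dimensional region; all numbers are made up.
    status, point, score = certify_box(
        w=[1.0, -2.0], b=0.5, lo=[0.0, 0.0], hi=[1.0, 1.0]
    )
    print(status, point, round(score, 3))
```

In this toy case the worst-case point is `[0.0, 1.0]`, where the logit is -1.5, so the check reports a counterexample rather than a certificate. An SVD-aligned region would presumably require rotating `w` into the same basis before applying the same coordinate-wise rule; that step is omitted here.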

Preprint · 2023 · arXiv

Accountable Javascript Code Delivery

The internet is a major distribution platform for web applications, but there are no effective transparency and audit mechanisms in place for the web. Due to the ephemeral nature of web applications, a client visiting a website has no guarantee that the code it receives today is the same as yesterday, or the same as other visitors receive. Despite advances in web security, it is thus challenging to audit web applications before they are rendered in the browser. We propose Accountable JS, a browser extension and opt-in protocol for accountable delivery of active content on a web page. We prototype our protocol, formally model its security properties with the Tamarin Prover, and evaluate its compatibility and performance impact with case studies including WhatsApp Web, AdSense and Nimiq. Accountability is beginning to be deployed at scale, with Meta's recent announcement of Code Verify available to all 2 billion WhatsApp users, but there has been little formal analysis of such protocols. We formally model Code Verify using the Tamarin Prover and compare its properties to our Accountable JS protocol. We also compare the performance impacts of Code Verify and the Accountable JS extension on WhatsApp Web.

Preprint · 2016 · arXiv

From Closed-world Enforcement to Open-world Assessment of Privacy

In this paper, we develop a user-centric privacy framework for quantitatively assessing the exposure of personal information in open settings. Our formalization addresses key challenges posed by such open settings, such as the unstructured dissemination of heterogeneous information and the necessity of user- and context-dependent privacy requirements. We propose a new definition of information sensitivity derived from our formalization of privacy requirements, and, as a sanity check, show that hard non-disclosure guarantees are impossible to achieve in open settings. After that, we provide an instantiation of our framework to address the identity disclosure problem, leading to the novel notion of d-convergence. d-convergence is based on the indistinguishability of entities, and it bounds the likelihood with which an adversary successfully links two profiles of the same user across online communities. Finally, we provide a large-scale evaluation of our framework on a collection of 15 million comments collected from the online social network Reddit. Our evaluation validates the notion of d-convergence for assessing the linkability of entities in our data set and provides deeper insights into the data set's structure.