Source author record

James Cheney

James Cheney appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Programming Languages Logic in Computer Science Databases Cryptography and Security Artificial Intelligence Distributed, Parallel, and Cluster Computing Software Engineering

Catalog footprint

What is connected

23works

7topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

A Formalization of SQL with Nulls

SQL is the world's most popular declarative language, forming the basis of the multi-billion-dollar database industry. Although SQL has been standardized, the full standard is based on ambiguous natural language rather than formal specification. Commercial SQL implementations interpret the standard in different ways, so that, given the same input data, the same query can yield different results depending on the SQL system it is run on. Even for a particular system, mechanically checked formalization of all widely-used features of SQL remains an open problem. The lack of a well-understood formal semantics makes it very difficult to validate the soundness of database implementations. Although formal semantics for fragments of SQL were designed in the past, they usually did not support set and bag operations, lateral joins, nested subqueries, and, crucially, null values. Null values complicate SQL's semantics in profound ways analogous to null pointers or side-effects in other programming languages. Since certain SQL queries are equivalent in the absence of null values, but produce different results when applied to tables containing incomplete data, semantics which ignore null values are able to prove query equivalences that are unsound in realistic databases. A formal semantics of SQL supporting all the aforementioned features was only proposed recently. In this paper, we report about our mechanization of SQL semantics covering set/bag operations, lateral joins, nested subqueries, and nulls, written in the Coq proof assistant, and describe the validation of key metatheoretic properties. Additionally, we are able to use the same framework to formalize the semantics of a flat relational calculus (with null values), and show a certified translation of its normal forms into SQL.

preprint2022arXiv

Constraint-based type inference for FreezeML

FreezeML is a new approach to first-class polymorphic type inference that employs term annotations to control when and how polymorphic types are instantiated and generalised. It conservatively extends Hindley-Milner type inference and was first presented as an extension to Algorithm W. More modern type inference techniques such as HM(X) and OutsideIn($X$) employ constraints to support features such as type classes, type families, rows, and other extensions. We take the first step towards modernising FreezeML by presenting a constraint-based type inference algorithm. We introduce a new constraint language, inspired by the Pottier/Rémy presentation of HM(X), in order to allow FreezeML type inference problems to be expressed as constraints. We present a deterministic stack machine for solving FreezeML constraints and prove its termination and correctness.

preprint2022arXiv

Nominal Matching Logic

We introduce Nominal Matching Logic (NML) as an extension of Matching Logic with names and binding following the Gabbay-Pitts nominal approach. Matching logic is the foundation of the $\mathbb{K}$ framework, used to specify programming languages and automatically derive associated tools (compilers, debuggers, model checkers, program verifiers). Matching logic does not include a primitive notion of name binding, though binding operators can be represented via an encoding that internalises the graph of a function from bound names to expressions containing bound names. This approach is sufficient to represent computations involving binding operators, but has not been reconciled with support for inductive reasoning over syntax with binding (e.g., reasoning over $λ$-terms). Nominal logic is a formal system for reasoning about names and binding, which provides well-behaved and powerful principles for inductive reasoning over syntax with binding, and NML inherits these principles. We discuss design alternatives for the syntax and the semantics of NML, prove meta-theoretical properties and give examples to illustrate its expressive power. In particular, we show how induction principles for $λ$-terms ($α$-structural induction) can be defined and used to prove standard properties of the $λ$-calculus.

preprint2020arXiv

A Polymorphic RPC Calculus

The RPC calculus is a simple semantic foundation for multi-tier programming languages such as Links in which located functions can be written for the client-server model. Subsequently, the typed RPC calculus is designed to capture the location information of functions by types and to drive location type-directed slicing compilations. However, the use of locations is currently limited to monomorphic ones, which is one of the gaps to overcome to put into practice the theory of RPC calculi for client-server model. This paper proposes a polymorphic RPC calculus to allow programmers to write succinct multi-tier programs using polymorphic location constructs. Then the polymorphic multi-tier programs can be automatically translated into programs only containing location constants amenable to the existing slicing compilation methods. We formulate a type system for the polymorphic RPC calculus, and prove its type soundness. Also, we design a monomorphization translation together with proofs on its type and semantic correctness for the translation.

preprint2020arXiv

Categorical anomaly detection in heterogeneous data using minimum description length clustering

Fast and effective unsupervised anomaly detection algorithms have been proposed for categorical data based on the minimum description length (MDL) principle. However, they can be ineffective when detecting anomalies in heterogeneous datasets representing a mixture of different sources, such as security scenarios in which system and user processes have distinct behavior patterns. We propose a meta-algorithm for enhancing any MDL-based anomaly detection model to deal with heterogeneous data by fitting a mixture model to the data, via a variant of k-means clustering. Our experimental results show that using a discrete mixture model provides competitive performance relative to two previous anomaly detection algorithms, while mixtures of more sophisticated models yield further gains, on both synthetic datasets and realistic datasets from a security scenario.

preprint2020arXiv

FreezeML: Complete and Easy Type Inference for First-Class Polymorphism

ML is remarkable in providing statically typed polymorphism without the programmer ever having to write any type annotations. The cost of this parsimony is that the programmer is limited to a form of polymorphism in which quantifiers can occur only at the outermost level of a type and type variables can be instantiated only with monomorphic types. Type inference for unrestricted System F-style polymorphism is undecidable in general. Nevertheless, the literature abounds with a range of proposals to bridge the gap between ML and System F. We put forth a new proposal, FreezeML, a conservative extension of ML with two new features. First, let- and lambda-binders may be annotated with arbitrary System F types. Second, variable occurrences may be frozen, explicitly disabling instantiation. FreezeML is equipped with type-preserving translations back and forth between System F and admits a type inference algorithm, an extension of algorithm W, that is sound and complete and which yields principal types.

preprint2019arXiv

A baseline for unsupervised advanced persistent threat detection in system-level provenance

Advanced persistent threats (APT) are stealthy, sophisticated, and unpredictable cyberattacks that can steal intellectual property, damage critical infrastructure, or cause millions of dollars in damage. Detecting APTs by monitoring system-level activity is difficult because manually inspecting the high volume of normal system activity is overwhelming for security analysts. We evaluate the effectiveness of unsupervised batch and streaming anomaly detection algorithms over multiple gigabytes of provenance traces recorded on four different operating systems to determine whether they can detect realistic APT-like attacks reliably and efficiently. This report is the first detailed study of the effectiveness of generic unsupervised anomaly detection techniques in this setting.

preprint2016arXiv

Advances in Property-Based Testing for $α$Prolog

$α$Check is a light-weight property-based testing tool built on top of $α$Prolog, a logic programming language based on nominal logic. $α$Prolog is particularly suited to the validation of the meta-theory of formal systems, for example correctness of compiler translations involving name-binding, alpha-equivalence and capture-avoiding substitution. In this paper we describe an alternative to the negation elimination algorithm underlying $α$Check that substantially improves its effectiveness. To substantiate this claim we compare the checker performances w.r.t. two of its main competitors in the logical framework niche, namely the QuickCheck/Nitpick combination offered by Isabelle/HOL and the random testing facility in PLT-Redex.

preprint2016arXiv

Causally consistent dynamic slicing

We offer a lattice-theoretic account of dynamic slicing for π-calculus, building on prior work in the sequential setting. For any run of a concurrent program, we exhibit a Galois connection relating forward slices of the start configuration to backward slices of the end configuration. We prove that, up to lattice isomorphism, the same Galois connection arises for any causally equivalent execution, allowing an efficient concurrent implementation of slicing via a standard interleaving semantics. Our approach has been formalised in the dependently-typed language Agda.

preprint2015arXiv

Proof-relevant pi-calculus

Formalising the pi-calculus is an illuminating test of the expressiveness of logical frameworks and mechanised metatheory systems, because of the presence of name binding, labelled transitions with name extrusion, bisimulation, and structural congruence. Formalisations have been undertaken in a variety of systems, primarily focusing on well-studied (and challenging) properties such as the theory of process bisimulation. We present a formalisation in Agda that instead explores the theory of concurrent transitions, residuation, and causal equivalence of traces, which has not previously been formalised for the pi-calculus. Our formalisation employs de Bruijn indices and dependently-typed syntax, and aligns the "proved transitions" proposed by Boudol and Castellani in the context of CCS with the proof terms naturally present in Agda's representation of the labelled transition relation. Our main contributions are proofs of the "diamond lemma" for residuation of concurrent transitions and a formal definition of equivalence of traces up to permutation of transitions.

preprint2015arXiv

YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts

Scientific workflow management systems offer features for composing complex computational pipelines from modular building blocks, for executing the resulting automated workflows, and for recording the provenance of data products resulting from workflow runs. Despite the advantages such features provide, many automated workflows continue to be implemented and executed outside of scientific workflow systems due to the convenience and familiarity of scripting languages (such as Perl, Python, R, and MATLAB), and to the high productivity many scientists experience when using these languages. YesWorkflow is a set of software tools that aim to provide such users of scripting languages with many of the benefits of scientific workflow systems. YesWorkflow requires neither the use of a workflow engine nor the overhead of adapting code to run effectively in such a system. Instead, YesWorkflow enables scientists to annotate existing scripts with special comments that reveal the computational modules and dataflows otherwise implicit in these scripts. YesWorkflow tools extract and analyze these comments, represent the scripts in terms of entities based on the typical scientific workflow model, and provide graphical renderings of this workflow-like view of the scripts. Future versions of YesWorkflow also will allow the prospective provenance of the data products of these scripts to be queried in ways similar to those available to users of scientific workflow systems.

preprint2014arXiv

A Core Calculus for Provenance

Provenance is an increasing concern due to the ongoing revolution in sharing and processing scientific data on the Web and in other computer systems. It is proposed that many computer systems will need to become provenance-aware in order to provide satisfactory accountability, reproducibility, and trust for scientific or other high-value data. To date, there is not a consensus concerning appropriate formal models or security properties for provenance. In previous work, we introduced a formal framework for provenance security and proposed formal definitions of properties called disclosure and obfuscation. In this article, we study refined notions of positive and negative disclosure and obfuscation in a concrete setting, that of a general-purpose programing language. Previous models of provenance have focused on special-purpose languages such as workflows and database queries. We consider a higher-order, functional language with sums, products, and recursive types and functions, and equip it with a tracing semantics in which traces themselves can be replayed as computations. We present an annotation-propagation framework that supports many provenance views over traces, including standard forms of provenance studied previously. We investigate some relationships among provenance views and develop some partial solutions to the disclosure and obfuscation problems, including correct algorithms for disclosure and positive obfuscation based on trace slicing.

preprint2014arXiv

An Analytical Survey of Provenance Sanitization

Security is likely becoming a critical factor in the future adoption of provenance technology, because of the risk of inadvertent disclosure of sensitive information. In this survey paper we review the state of the art in secure provenance, considering mechanisms for controlling access, and the extent to which these mechanisms preserve provenance integrity. We examine seven systems or approaches, comparing features and identifying areas for future work.

preprint2014arXiv

Database Queries that Explain their Work

Provenance for database queries or scientific workflows is often motivated as providing explanation, increasing understanding of the underlying data sources and processes used to compute the query, and reproducibility, the capability to recompute the results on different inputs, possibly specialized to a part of the output. Many provenance systems claim to provide such capabilities; however, most lack formal definitions or guarantees of these properties, while others provide formal guarantees only for relatively limited classes of changes. Building on recent work on provenance traces and slicing for functional programming languages, we introduce a detailed tracing model of provenance for multiset-valued Nested Relational Calculus, define trace slicing algorithms that extract subtraces needed to explain or recompute specific parts of the output, and define query slicing and differencing techniques that support explanation. We state and prove correctness properties for these techniques and present a proof-of-concept implementation in Haskell.

preprint2014arXiv

Dynamic Provenance for SPARQL Update

While the Semantic Web currently can exhibit provenance information by using the W3C PROV standards, there is a "missing link" in connecting PROV to storing and querying for dynamic changes to RDF graphs using SPARQL. Solving this problem would be required for such clear use-cases as the creation of version control systems for RDF. While some provenance models and annotation techniques for storing and querying provenance data originally developed with databases or workflows in mind transfer readily to RDF and SPARQL, these techniques do not readily adapt to describing changes in dynamic RDF datasets over time. In this paper we explore how to adapt the dynamic copy-paste provenance model of Buneman et al. [2] to RDF datasets that change over time in response to SPARQL updates, how to represent the resulting provenance records themselves as RDF in a manner compatible with W3C PROV, and how the provenance information can be defined by reinterpreting SPARQL updates. The primary contribution of this paper is a semantic framework that enables the semantics of SPARQL Update to be used as the basis for a 'cut-and-paste' provenance model in a principled manner.

preprint2014arXiv

Effective Quotation: relating approaches to language-integrated query

Language-integrated query techniques have been explored in a number of different language designs. We consider two different, type-safe approaches employed by Links and F#. Both approaches provide rich dynamic query generation capabilities, and thus amount to a form of heterogeneous staged computation, but to date there has been no formal investigation of their relative expressiveness. We present two core calculi Eff and Quot, respectively capturing the essential aspects of language-integrated querying using effects in Links and quotation in LINQ. We show via translations from Eff to Quot and back that the two approaches are equivalent in expressiveness. Based on the translation from Eff to Quot, we extend a simple Links compiler to handle queries.

preprint2014arXiv

Query shredding: Efficient relational evaluation of queries over nested multisets (extended version)

Nested relational query languages have been explored extensively, and underlie industrial language-integrated query systems such as Microsoft's LINQ. However, relational databases do not natively support nested collections in query results. This can lead to major performance problems: if programmers write queries that yield nested results, then such systems typically either fail or generate a large number of queries. We present a new approach to query shredding, which converts a query returning nested data to a fixed number of SQL queries. Our approach, in contrast to prior work, handles multiset semantics, and generates an idiomatic SQL:1999 query directly from a normal form for nested queries. We provide a detailed description of our translation and present experiments showing that it offers comparable or better performance than a recent alternative approach on a range of examples.

preprint2013arXiv

A simple sequent calculus for nominal logic

Nominal logic is a variant of first-order logic that provides support for reasoning about bound names in abstract syntax. A key feature of nominal logic is the new-quantifier, which quantifies over fresh names (names not appearing in any values considered so far). Previous attempts have been made to develop convenient rules for reasoning with the new-quantifier, but we argue that none of these attempts is completely satisfactory. In this article we develop a new sequent calculus for nominal logic in which the rules for the new- quantifier are much simpler than in previous attempts. We also prove several structural and metatheoretic properties, including cut-elimination, consistency, and equivalence to Pitts' axiomatization of nominal logic.

preprint2013arXiv

Static Enforceability of XPath-Based Access Control Policies

We consider the problem of extending XML databases with fine-grained, high-level access control policies specified using XPath expressions. Most prior work checks individual updates dynamically, which is expensive (requiring worst-case execution time proportional to the size of the database). On the other hand, static enforcement can be performed without accessing the database but may be incomplete, in the sense that it may forbid accesses that dynamic enforcement would allow. We introduce topological characterizations of XPath fragments in order to study the problem of determining when an access control policy can be enforced statically without loss of precision. We introduce the notion of fair policies that are statically enforceable, and study the complexity of determining fairness and of static enforcement itself.

preprint2012arXiv

A dependent nominal type theory

Nominal abstract syntax is an approach to representing names and binding pioneered by Gabbay and Pitts. So far nominal techniques have mostly been studied using classical logic or model theory, not type theory. Nominal extensions to simple, dependent and ML-like polymorphic languages have been studied, but decidability and normalization results have only been established for simple nominal type theories. We present a LF-style dependent type theory extended with name-abstraction types, prove soundness and decidability of beta-eta-equivalence checking, discuss adequacy and canonical forms via an example, and discuss extensions such as dependently-typed recursion and induction principles.

preprint2010arXiv

Causality and the semantics of provenance

Provenance, or information about the sources, derivation, custody or history of data, has been studied recently in a number of contexts, including databases, scientific workflows and the Semantic Web. Many provenance mechanisms have been developed, motivated by informal notions such as influence, dependence, explanation and causality. However, there has been little study of whether these mechanisms formally satisfy appropriate policies or even how to formalize relevant motivating concepts such as causality. We contend that mathematical models of these concepts are needed to justify and compare provenance techniques. In this paper we review a theory of causality based on structural models that has been developed in artificial intelligence, and describe work in progress on a causal semantics for provenance graphs.

preprint2010arXiv

Causality and the Semantics of Provenance

preprint2010arXiv

Mechanizing the Metatheory of LF

LF is a dependent type theory in which many other formal systems can be conveniently embedded. However, correct use of LF relies on nontrivial metatheoretic developments such as proofs of correctness of decision procedures for LF's judgments. Although detailed informal proofs of these properties have been published, they have not been formally verified in a theorem prover. We have formalized these properties within Isabelle/HOL using the Nominal Datatype Package, closely following a recent article by Harper and Pfenning. In the process, we identified and resolved a gap in one of the proofs and a small number of minor lacunae in others. We also formally derive a version of the type checking algorithm from which Isabelle/HOL can generate executable code. Besides its intrinsic interest, our formalization provides a foundation for studying the adequacy of LF encodings, the correctness of Twelf-style metatheoretic reasoning, and the metatheory of extensions to LF.

James Cheney

What is connected

Connect this record

See the researcher in context

Building this map preview

23 published item(s)

A Formalization of SQL with Nulls

Constraint-based type inference for FreezeML

Nominal Matching Logic

A Polymorphic RPC Calculus

Categorical anomaly detection in heterogeneous data using minimum description length clustering

FreezeML: Complete and Easy Type Inference for First-Class Polymorphism

A baseline for unsupervised advanced persistent threat detection in system-level provenance

Advances in Property-Based Testing for $α$Prolog

Causally consistent dynamic slicing

Proof-relevant pi-calculus

YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts

A Core Calculus for Provenance

An Analytical Survey of Provenance Sanitization

Database Queries that Explain their Work

Dynamic Provenance for SPARQL Update

Effective Quotation: relating approaches to language-integrated query

Query shredding: Efficient relational evaluation of queries over nested multisets (extended version)

A simple sequent calculus for nominal logic

Static Enforceability of XPath-Based Access Control Policies

A dependent nominal type theory

Causality and the semantics of provenance

Causality and the Semantics of Provenance

Mechanizing the Metatheory of LF