Source author record

Paolo Missier

Paolo Missier appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Databases, Machine Learning, Artificial Intelligence, Computation and Language, Cryptography and Security, cs.CY, Social and Information Networks, Software Engineering

Catalog footprint

What is connected

12 works

8 topics

4 close collaborators


preprint, 2022, arXiv

Benchmark time series data sets for PyTorch -- the torchtime package

The development of models for Electronic Health Record data is an area of active research featuring a small number of public benchmark data sets. Researchers typically write custom data processing code but this hinders reproducibility and can introduce errors. The Python package torchtime provides reproducible implementations of commonly used PhysioNet and UEA & UCR time series classification repository data sets for PyTorch. Features are provided for working with irregularly sampled and partially observed time series of unequal length. It aims to simplify access to PhysioNet data and enable fair comparisons of models in this exciting area of research.

preprint, 2022, arXiv

Technologies for Trustworthy Machine Learning: A Survey in a Socio-Technical Context

Concerns about the societal impact of AI-based services and systems have encouraged governments and other organisations around the world to propose AI policy frameworks to address fairness, accountability, transparency and related topics. To achieve the objectives of these frameworks, the data and software engineers who build machine-learning systems require knowledge about a variety of relevant supporting tools and techniques. In this paper we provide an overview of technologies that support building trustworthy machine learning systems, i.e., systems whose properties justify that people place trust in them. We argue that four categories of system properties are instrumental in achieving the policy objectives, namely fairness, explainability, auditability and safety & security (FEAS). We discuss how these properties need to be considered across all stages of the machine learning life cycle, from data collection through run-time model inference. As a consequence, we survey in this paper the main technologies with respect to all four of the FEAS properties, for data-centric as well as model-centric stages of the machine learning system life cycle. We conclude with an identification of open research problems, with a particular focus on the connection between trustworthy machine learning technologies and their implications for individuals and society.

preprint, 2020, arXiv

Towards Learning Instantiated Logical Rules from Knowledge Graphs

Efficiently inducing high-level interpretable regularities from knowledge graphs (KGs) is an essential yet challenging task that benefits many downstream applications. In this work, we present GPFL, a probabilistic rule learner optimized to mine instantiated first-order logic rules from KGs. Instantiated rules contain constants extracted from KGs. Compared to abstract rules that contain no constants, instantiated rules are capable of explaining and expressing concepts in more detail. GPFL utilizes a novel two-stage rule generation mechanism that first generalizes extracted paths into templates that are acyclic abstract rules until a certain degree of template saturation is achieved, then specializes the generated templates into instantiated rules. Unlike existing works that ground every mined instantiated rule for evaluation, GPFL shares groundings between structurally similar rules for collective evaluation. Moreover, we reveal the presence of overfitting rules, their impact on the predictive performance, and the effectiveness of a simple validation method filtering out overfitting rules. Through extensive experiments on public benchmark datasets, we show that GPFL (1) significantly reduces the runtime of evaluating instantiated rules; (2) discovers many more high-quality instantiated rules than existing works; (3) improves the predictive performance of learned rules by removing overfitting rules via validation; (4) is competitive on the knowledge graph completion task compared to state-of-the-art baselines.

preprint, 2016, arXiv

Preserving the value of large scale data analytics over time through selective re-computation

A pervasive problem in Data Science is that the knowledge generated by possibly expensive analytics processes is subject to decay over time, as the data used to compute it drifts, the algorithms used in the processes are improved, and the external knowledge embodied by reference datasets used in the computation evolves. Deciding when such knowledge outcomes should be refreshed, following a sequence of data change events, requires problem-specific functions to quantify their value and its decay over time, as well as models for estimating the cost of their re-computation. What makes this problem challenging is the ambition to develop a decision support system for informing data analytics re-computation decisions over time, that is both generic and customisable. With the help of a case study from genomics, in this vision paper we offer an initial formalisation of this problem, highlight research challenges, and outline a possible approach based on the collection and analysis of metadata from a history of past computations.

preprint, 2016, arXiv

TAPER: query-aware, partition-enhancement for large, heterogenous, graphs

Graph partitioning has long been seen as a viable approach to address Graph DBMS scalability. A partitioning, however, may introduce extra query processing latency unless it is sensitive to a specific query workload, and optimised to minimise inter-partition traversals for that workload. It should also be possible to incrementally adjust the partitioning in reaction to changes in the graph topology, the query workload, or both. Because of their complexity, current partitioning algorithms fall short of one or both of these requirements, as they are designed for offline use and as one-off operations. The TAPER system aims to address both requirements, whilst leveraging existing partitioning algorithms. TAPER takes any given initial partitioning as a starting point, and iteratively adjusts it by swapping chosen vertices across partitions, heuristically reducing the probability of inter-partition traversals for a given workload of pattern matching queries. Iterations are inexpensive thanks to time and space optimisations in the underlying support data structures. We evaluate TAPER on two different large test graphs and over realistic query workloads. Our results indicate that, given a hash-based partitioning, TAPER reduces the number of inter-partition traversals by around 80%; given an unweighted METIS partitioning, by around 30%. These reductions are achieved within 8 iterations and with the additional advantage of being workload-aware and usable online.
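The idea of refining an existing partitioning against a workload can be illustrated with a small sketch. This is not TAPER's algorithm: it is a simplified greedy pass that moves (rather than swaps) vertices and ignores partition balance. The function names and the edge weights, which stand in for how often the workload traverses each edge, are assumptions made for illustration.

```python
from collections import defaultdict

def cut_weight(edges, part):
    """Sum the weights of edges whose endpoints sit in different partitions."""
    return sum(w for (u, v), w in edges.items() if part[u] != part[v])

def refine_once(edges, part):
    """One greedy pass: move each vertex to the partition it is most strongly tied to."""
    part = dict(part)  # work on a copy so the input assignment is left untouched
    for vertex in sorted(part):
        # Total edge weight linking this vertex to each partition.
        pull = defaultdict(float)
        for (u, v), w in edges.items():
            if u == vertex:
                pull[part[v]] += w
            elif v == vertex:
                pull[part[u]] += w
        if not pull:
            continue
        best = max(sorted(pull), key=lambda p: pull[p])
        # Move only when another partition pulls strictly harder than the current one.
        if pull[best] > pull[part[vertex]]:
            part[vertex] = best
    return part

# Hypothetical workload: weights approximate traversal frequency per edge.
edges = {("a", "b"): 5, ("b", "c"): 1, ("c", "d"): 5}
start = {"a": 0, "b": 1, "c": 1, "d": 0}
refined = refine_once(edges, start)
```

In this toy case the frequently traversed edges end up inside a partition and only the rarely traversed edge remains cut. A real system would also need to keep partitions balanced, which this sketch does not attempt.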

preprint, 2016, arXiv

The data, they are a-changin'

The cost of deriving actionable knowledge from large datasets has been decreasing thanks to a convergence of positive factors: low cost data generation, inexpensively scalable storage and processing infrastructure (cloud), software frameworks and tools for massively distributed data processing, and parallelisable data analytics algorithms. One observation that is often overlooked, however, is that none of these elements is immutable; they all evolve over time. This suggests that the value of such derivative knowledge may decay over time, unless it is preserved by reacting to those changes. Our broad research goal is to develop models, methods, and tools for selectively reacting to changes by balancing costs and benefits, i.e. through complete or partial re-computation of some of the underlying processes. In this paper we present an initial model for reasoning about change and re-computations, and show how analysis of detailed provenance of derived knowledge informs re-computation decisions. We illustrate the main ideas through a real-world case study in genomics, namely on the interpretation of human variants in support of genetic diagnosis.

preprint, 2016, arXiv

The lifecycle of provenance metadata and its associated challenges and opportunities

This chapter outlines some of the challenges and opportunities associated with adopting provenance principles and standards in a variety of disciplines, including data publication and reuse, and information sciences.

preprint, 2016, arXiv

Tracking Dengue Epidemics using Twitter Content Classification and Topic Modelling

Detecting and preventing outbreaks of mosquito-borne diseases such as Dengue and Zika in Brazil and other tropical regions has long been a priority for governments in affected areas. Streaming social media content, such as Twitter, is increasingly being used for health vigilance applications such as flu detection. However, previous work has not addressed the complexity of drastic seasonal changes on Twitter content across multiple epidemic outbreaks. In order to address this gap, this paper contrasts two complementary approaches to detecting Twitter content that is relevant for Dengue outbreak detection, namely supervised classification and unsupervised clustering using topic modelling. Each approach has benefits and shortcomings. Our classifier achieves a prediction accuracy of about 80% based on a small training set of about 1,000 instances, but the need for manual annotation makes it hard to track seasonal changes in the nature of the epidemics, such as the emergence of new types of virus in certain geographical locations. In contrast, LDA-based topic modelling scales well, generating cohesive and well-separated clusters from larger samples. However, while clusters can be easily regenerated following changes in epidemics, this approach makes it hard to clearly segregate relevant tweets into well-defined clusters.

preprint, 2015, arXiv

YesWorkflow: A User-Oriented, Language-Independent Tool for Recovering Workflow Information from Scripts

Scientific workflow management systems offer features for composing complex computational pipelines from modular building blocks, for executing the resulting automated workflows, and for recording the provenance of data products resulting from workflow runs. Despite the advantages such features provide, many automated workflows continue to be implemented and executed outside of scientific workflow systems due to the convenience and familiarity of scripting languages (such as Perl, Python, R, and MATLAB), and to the high productivity many scientists experience when using these languages. YesWorkflow is a set of software tools that aim to provide such users of scripting languages with many of the benefits of scientific workflow systems. YesWorkflow requires neither the use of a workflow engine nor the overhead of adapting code to run effectively in such a system. Instead, YesWorkflow enables scientists to annotate existing scripts with special comments that reveal the computational modules and dataflows otherwise implicit in these scripts. YesWorkflow tools extract and analyze these comments, represent the scripts in terms of entities based on the typical scientific workflow model, and provide graphical renderings of this workflow-like view of the scripts. Future versions of YesWorkflow will also allow the prospective provenance of the data products of these scripts to be queried in ways similar to those available to users of scientific workflow systems.
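A rough sketch of what extracting such comment annotations might look like is given below. It is not the YesWorkflow implementation. The tag names (`@begin`, `@end`, `@in`, `@out`) and the `extract_blocks` helper are assumptions chosen for illustration, and the real tool's annotation vocabulary should be checked against its own documentation.

```python
import re

# Assumed tag vocabulary for this sketch only.
TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)")

def extract_blocks(script):
    """Return {block_name: {"in": [...], "out": [...]}} from annotated comments."""
    blocks, stack = {}, []
    for line in script.splitlines():
        match = TAG.search(line)
        if not match:
            continue  # ordinary code or an unannotated comment
        tag, name = match.groups()
        if tag == "begin":
            blocks[name] = {"in": [], "out": []}
            stack.append(name)
        elif tag == "end":
            if stack:
                stack.pop()
        elif stack:
            # @in / @out attach to the innermost open block.
            blocks[stack[-1]][tag].append(name)
    return blocks

# Hypothetical annotated script; the code itself is never executed by the parser.
SCRIPT = """
# @begin clean_data
# @in raw_table
# @out clean_table
clean_table = [row for row in raw_table if row]
# @end clean_data
"""

blocks = extract_blocks(SCRIPT)
```

Because only comments are read, the same approach works for any scripting language that uses `#` comments. Other comment styles would need a different pattern.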

preprint, 2014, arXiv

ProvAbs: model, policy, and tooling for abstracting PROV graphs

Provenance metadata can be valuable in data sharing settings, where it can be used to help data consumers form judgements regarding the reliability of the data produced by third parties. However, some parts of provenance may be sensitive, requiring access control, or they may need to be simplified for the intended audience. Both these issues can be addressed by a single mechanism for creating abstractions over provenance, coupled with a policy model to drive the abstraction. Such a mechanism, which we refer to as abstraction by grouping, simultaneously achieves partial disclosure of provenance, and facilitates its consumption. In this paper we introduce a formal foundation for this type of abstraction, grounded in the W3C PROV model; describe the associated policy model; and briefly present its implementation, the ProvAbs tool for interactive experimentation with policies and abstractions.
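As a loose illustration of grouping, the sketch below collapses a chosen set of nodes in a plain directed graph into one abstract node. It is not the ProvAbs method: it ignores PROV node types, relation types and validity constraints, and the function name and node labels are hypothetical.

```python
def abstract_by_grouping(edges, group, label):
    """Collapse every node in `group` into a single node named `label`."""
    def rename(node):
        return label if node in group else node

    collapsed = set()
    for source, target in edges:
        new_edge = (rename(source), rename(target))
        if new_edge[0] != new_edge[1]:  # drop edges hidden inside the group
            collapsed.add(new_edge)
    return collapsed

# Hypothetical derivation chain: each edge points from a product to what it was derived from.
edges = {
    ("report", "analysis"),
    ("analysis", "patient_record"),
    ("patient_record", "lab_result"),
    ("lab_result", "sample"),
}
public_view = abstract_by_grouping(edges, {"patient_record", "lab_result"}, "sensitive")
```

The resulting view still shows that the report ultimately depends on the sample, while the two grouped nodes and the edge between them are no longer visible.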

preprint, 2014, arXiv

Provenance and data differencing for workflow reproducibility analysis

One of the foundations of science is that researchers must publish the methodology used to achieve their results so that others can attempt to reproduce them. This has the added benefit of allowing methods to be adopted and adapted for other purposes. In the field of e-Science, services, often choreographed through workflows, process data to generate results. The reproduction of results is often not straightforward as the computational objects may not be made available or may have been updated since the results were generated. For example, services are often updated to fix bugs or improve algorithms. This paper addresses these problems in three ways. Firstly, it introduces a new framework to clarify the range of meanings of "reproducibility". Secondly, it describes a new algorithm, PDIFF, that uses a comparison of workflow provenance traces to determine whether an experiment has been reproduced; the main innovation is that if this is not the case then the specific point(s) of divergence are identified through graph analysis, assisting any researcher wishing to understand those differences. One key feature is support for user-defined, semantic data comparison operators. Finally, the paper describes an implementation of PDIFF that leverages the power of the e-Science Central platform which enacts workflows in the cloud. As well as automatically generating a provenance trace for consumption by PDIFF, the platform supports the storage and re-use of old versions of workflows, data and services; the paper shows how this can be powerfully exploited in order to achieve reproduction and re-use.
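The general idea of locating a divergence by walking two traces backwards from the final output can be sketched as follows. This is not the PDIFF algorithm itself. The trace layout (a dictionary per data item with `value`, `service` and `inputs` keys), the function name, and the assumption that both traces share the same shape are all made up for illustration. The `same` parameter stands in for a user-defined comparison operator.

```python
def divergence_points(trace_a, trace_b, item, same=lambda x, y: x == y):
    """Walk two traces back from `item`; return the earliest items that explain a difference."""
    a, b = trace_a[item], trace_b[item]
    if same(a["value"], b["value"]):
        return []        # this branch was reproduced
    if a["service"] != b["service"]:
        return [item]    # a different service version produced this item
    upstream = []
    for parent in a["inputs"]:
        for found in divergence_points(trace_a, trace_b, parent, same):
            if found not in upstream:
                upstream.append(found)
    # If no input differs, the difference originates at this item.
    return upstream or [item]

# Hypothetical traces from two runs; the second run used an updated cleaning service.
run_1 = {
    "raw": {"value": 10, "service": "upload", "inputs": []},
    "clean": {"value": 10.0, "service": "clean-v1", "inputs": ["raw"]},
    "result": {"value": 20.0, "service": "scale-v1", "inputs": ["clean"]},
}
run_2 = {
    "raw": {"value": 10, "service": "upload", "inputs": []},
    "clean": {"value": 11.0, "service": "clean-v2", "inputs": ["raw"]},
    "result": {"value": 22.0, "service": "scale-v1", "inputs": ["clean"]},
}

strict = divergence_points(run_1, run_2, "result")
tolerant = divergence_points(run_1, run_2, "result", same=lambda x, y: abs(x - y) <= 5)
```

With exact comparison the walk stops at the item produced by the updated service. With a tolerance-based operator the two results count as equivalent, so no divergence is reported.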

preprint, 2014, arXiv

ProvGen: generating synthetic PROV graphs with predictable structure

This paper introduces provGen, a generator aimed at producing large synthetic provenance graphs with predictable properties and of arbitrary size. Synthetic provenance graphs serve two main purposes. Firstly, they provide a variety of controlled workloads that can be used to test storage and query capabilities of provenance management systems at scale. Secondly, they provide challenging testbeds for experimenting with graph algorithms for provenance analytics, an area of increasing research interest. provGen produces PROV graphs and stores them in a graph DBMS (Neo4J). A key feature is to let users control the relationship makeup and topological features of the graph, by providing a seed provenance pattern along with a set of constraints, expressed using a custom Domain Specific Language. We also propose a simple method for evaluating the quality of the generated graphs, by measuring how realistically they simulate the structure of real-world patterns.
