Source author record

Matthew Roughan

Matthew Roughan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher. Unclaimed source record.

Cryptography and Security; Information Theory; math.IT; Information Retrieval; Methodology; Social and Information Networks; Applications; Computation; Data Structures and Algorithms; Databases; math.ST; Networking and Internet Architecture; physics.data-an; physics.soc-ph; Statistics Theory

Catalog footprint

What is connected

13 works

15 topics

4 close collaborators


preprint, 2023, arXiv

The entropy rate of Linear Additive Markov Processes

This work derives a theoretical value for the entropy of a Linear Additive Markov Process (LAMP), an expressive model able to generate sequences with a given autocorrelation structure. While a first-order Markov Chain model generates new values by conditioning on the current state, the LAMP model takes the transition state from the sequence's history according to some distribution which does not have to be bounded. The LAMP model captures complex relationships and long-range dependencies in data with similar expressibility to a higher-order Markov process. While a higher-order Markov process has a polynomial parameter space, a LAMP model is characterised only by a probability distribution and the transition matrix of an underlying first-order Markov Chain. We prove that the theoretical entropy rate of a LAMP is equivalent to the theoretical entropy rate of the underlying first-order Markov Chain. This surprising result is explained by the randomness introduced by the random process which selects the LAMP transitioning state, and provides a tool to model complex dependencies in data while retaining useful theoretical results. We use the LAMP model to estimate the entropy rate of the LastFM, BrightKite, Wikispeedia and Reuters-21578 datasets. We compare estimates calculated using frequency probability estimates, a first-order Markov model and the LAMP model, and consider two approaches to ensuring the transition matrix is irreducible. In most cases the LAMP entropy rates are lower than those of the alternatives, suggesting that the LAMP model is better at accommodating structural dependencies in the processes.
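Since the abstract states that a LAMP's entropy rate equals that of its underlying first-order Markov chain, the textbook formula for that chain, $H = -\sum_i \pi_i \sum_j P_{ij} \log_2 P_{ij}$ with $\pi$ the stationary distribution, is the quantity of interest. The sketch below is an illustration of that standard formula only, not the authors' code; the function names and the toy transition matrix are assumptions made for the example.

```python
import math

def stationary_distribution(P, iterations=500):
    """Approximate the stationary distribution by power iteration.

    Assumes P is row-stochastic, irreducible and aperiodic, so the
    iteration converges to a unique distribution.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def entropy_rate_bits(P):
    """Entropy rate (bits per step) of a first-order Markov chain."""
    pi = stationary_distribution(P)
    return -sum(
        pi[i] * p * math.log2(p)
        for i, row in enumerate(P)
        for p in row
        if p > 0  # zero-probability transitions contribute nothing
    )

# Hypothetical two-state chain, used purely as an example.
P = [[0.9, 0.1],
     [0.5, 0.5]]
print(round(entropy_rate_bits(P), 4))  # 0.5575
```

For this toy chain the stationary distribution is (5/6, 1/6), so the rate is a weighted average of the per-row entropies.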

preprint, 2022, arXiv

#IStandWithPutin versus #IStandWithUkraine: The interaction of bots and humans in discussion of the Russia/Ukraine war

The 2022 Russian invasion of Ukraine emphasises the role social media plays in modern-day warfare, with conflict occurring in both the physical and information environments. There is a large body of work on identifying malicious cyber-activity, but less focusing on the effect this activity has on the overall conversation, especially with regard to the Russia/Ukraine Conflict. Here, we employ a variety of techniques including information theoretic measures, sentiment and linguistic analysis, and time series techniques to understand how bot activity influences wider online discourse. By aggregating account groups we find significant information flows from bot-like accounts to non-bot accounts with behaviour differing between sides. Pro-Russian non-bot accounts are most influential overall, with information flows to a variety of other account groups. No significant outward flows exist from pro-Ukrainian non-bot accounts, with significant flows from pro-Ukrainian bot accounts into pro-Ukrainian non-bot accounts. We find that bot activity drives an increase in conversations surrounding angst (with $p = 2.450 \times 10^{-4}$) as well as those surrounding work/governance (with $p = 3.803 \times 10^{-18}$). Bot activity also shows a significant relationship with non-bot sentiment (with $p = 3.76 \times 10^{-4}$), where we find the relationship holds in both directions. This work extends and combines existing techniques to quantify how bots are influencing people in the online conversation around the Russia/Ukraine invasion. It opens up avenues for researchers to understand quantitatively how these malicious campaigns operate, and what makes them impactful.

preprint, 2022, arXiv

Boolean Expressions in Firewall Analysis

Firewall policies are an important line of defence in cybersecurity, specifying which packets are allowed to pass through a network and which are not. These firewall policies are made up of a list of interacting rules. In practice, a firewall can consist of hundreds or thousands of rules, which can be very difficult for a human to configure correctly. One proposed solution is to model firewall policies as Boolean expressions and use existing computer programs such as SAT solvers to verify that the firewall satisfies certain conditions. This paper takes an in-depth look at the Boolean expressions that represent firewall policies. We present an algorithm that translates a list of firewall rules into a Boolean expression in conjunctive normal form (CNF) or disjunctive normal form (DNF). We also place an upper bound on the size of the CNF and DNF that is polynomial in the number of rules in the firewall policy. This shows that the combinatorial explosion suggested by past results when converting from a Boolean expression in CNF to one in DNF does not occur in the context of firewall analysis.
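To make the idea of a rule list as a Boolean expression concrete, here is a minimal sketch under two assumptions that the abstract does not state: rules are evaluated first-match-wins, and the default action is deny. Under those assumptions a packet is accepted exactly when some accept rule matches and no earlier rule does. The rule set, field names and function names are hypothetical, and this is not the paper's translation algorithm.

```python
from itertools import product

# Hypothetical toy policy: (match predicate, action), evaluated in order.
RULES = [
    (lambda p: p["port"] == 22 and p["src"] == "lan", "accept"),
    (lambda p: p["port"] == 22, "deny"),
    (lambda p: p["port"] in (80, 443), "accept"),
]

def first_match(rules, packet, default="deny"):
    """Procedural semantics: the first matching rule decides."""
    for match, action in rules:
        if match(packet):
            return action
    return default

def accepts_boolean(rules, packet):
    """Boolean form: OR over accept rules i of (m_i AND NOT m_1 ... NOT m_{i-1})."""
    m = [match(packet) for match, _ in rules]
    return any(
        m[i] and not any(m[:i])
        for i, (_, action) in enumerate(rules)
        if action == "accept"
    )

# Check that both views agree on every packet in a small, made-up domain.
packets = [{"port": port, "src": src}
           for port, src in product([22, 80, 443, 8080], ["lan", "wan"])]
agree = all(
    (first_match(RULES, p) == "accept") == accepts_boolean(RULES, p)
    for p in packets
)
print(agree)  # True
```

Each disjunct here is a conjunction of match conditions, which is why a disjunctive form falls out naturally from an ordered rule list.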

preprint, 2020, arXiv

Likelihood-based inference for modelling packet transit from thinned flow summaries

The substantial growth of network traffic speed and volume presents practical challenges to network data analysis. Packet thinning and flow aggregation protocols such as NetFlow reduce the size of datasets by providing structured data summaries, but conversely this impedes statistical inference. Methods which aim to model patterns of traffic propagation typically do not incorporate the packet thinning and summarisation process into the analysis, and are often simplistic, e.g. method-of-moments. As a result, they can be of limited practical use. We introduce a likelihood-based analysis which fully incorporates packet thinning and NetFlow summarisation into the analysis. As a result, inferences can be made for models on the level of individual packets while only observing thinned flow summary information. We establish consistency of the resulting maximum likelihood estimator, derive bounds on the volume of traffic which should be observed to achieve required levels of estimator accuracy, and identify an ideal family of models. The robust performance of the estimator is examined through simulated analyses and an application on a publicly available trace dataset containing over 36 million packets over a 1-minute period.

preprint, 2020, arXiv

Simulating Name-like Vectors for Testing Large-scale Entity Resolution

Accurate and efficient entity resolution (ER) has been a problem in data analysis and data mining projects for decades. In our work, we are interested in developing ER methods to handle big data. Good public datasets are restricted in this area and usually small in size. Simulation is one technique for generating datasets for testing. Existing simulation tools have problems of complexity, scalability and limitations of resampling. We address these problems by introducing a better way of simulating testing data for big data ER. Our proposed simulation model is simple, inexpensive and fast. We focus on avoiding the detail-level simulation of records using a simple vector representation. In this paper, we will discuss how to simulate simple vectors that approximate the properties of names (commonly used as identification keys).

preprint, 2019, arXiv

Verifying and Monitoring IoTs Network Behavior using MUD Profiles

IoT devices are increasingly being implicated in cyber-attacks, raising community concern about the risks they pose to critical infrastructure, corporations, and citizens. In order to reduce this risk, the IETF is pushing IoT vendors to develop formal specifications of the intended purpose of their IoT devices, in the form of a Manufacturer Usage Description (MUD), so that their network behavior in any operating environment can be locked down and verified rigorously. This paper aims to assist IoT manufacturers in developing and verifying MUD profiles, while also helping adopters of these devices to ensure they are compatible with their organizational policies and to track devices' network behavior based on their MUD profile. Our first contribution is to develop a tool that takes the traffic trace of an arbitrary IoT device as input and automatically generates the MUD profile for it. We contribute our tool as open source, apply it to 28 consumer IoT devices, and highlight insights and challenges encountered in the process. Our second contribution is to apply a formal semantic framework that not only validates a given MUD profile for consistency, but also checks its compatibility with a given organizational policy. We apply our framework to representative organizations and selected devices, to demonstrate how MUD can reduce the effort needed for IoT acceptance testing. Finally, we show how operators can dynamically identify IoT devices using known MUD profiles and monitor their behavioral changes on their network.

preprint, 2018, arXiv

Clear as MUD: Generating, Validating and Applying IoT Behaviorial Profiles (Technical Report)

IoT devices are increasingly being implicated in cyber-attacks, driving community concern about the risks they pose to critical infrastructure, corporations, and citizens. In order to reduce this risk, the IETF is pushing IoT vendors to develop formal specifications of the intended purpose of their IoT devices, in the form of a Manufacturer Usage Description (MUD), so that their network behavior in any operating environment can be locked down and verified rigorously. This paper aims to assist IoT manufacturers in developing and verifying MUD profiles, while also helping adopters of these devices to ensure they are compatible with their organizational policies. Our first contribution is to develop a tool that takes the traffic trace of an arbitrary IoT device as input and automatically generates a MUD profile for it. We contribute our tool as open source, apply it to 28 consumer IoT devices, and highlight insights and challenges encountered in the process. Our second contribution is to apply a formal semantic framework that not only validates a given MUD profile for consistency, but also checks its compatibility with a given organizational policy. Finally, we apply our framework to representative organizations and selected devices, to demonstrate how MUD can reduce the effort needed for IoT acceptance testing.

preprint, 2016, arXiv

The Mathematical Foundations for Mapping Policies to Network Devices (Technical Report)

A common requirement in policy specification languages is the ability to map policies to the underlying network devices. Doing so, in a provably correct way, is important in a security policy context, so administrators can be confident of the level of protection provided by the policies for their networks. Existing policy languages allow policy composition but lack formal semantics to allocate policy to network devices. Our research tackles this from first principles: we ask how network policies can be described at a high-level, independent of firewall-vendor and network minutiae. We identify the algebraic requirements of the policy mapping process and propose semantic foundations to formally verify if a policy is implemented by the correct set of policy-arbiters. We show the value of our proposed algebras in maintaining concise network-device configurations by applying them to real-world networks.

preprint, 2015, arXiv

All networks look the same to me: Testing for homogeneity in networks

How can researchers test for heterogeneity in the local structure of a network? In this paper, we present a framework that utilizes random sampling to give subgraphs which are then used in a goodness of fit test to test for heterogeneity. We illustrate how to use the goodness of fit test for an analytically derived distribution as well as an empirical distribution. To demonstrate our framework, we consider the simple case of testing for edge probability heterogeneity. We examine the significance level, power and computation time for this case with appropriate examples. Finally we outline how to apply our framework to other heterogeneity problems.

preprint, 2015, arXiv

Estimating the Parameters of the Waxman Random Graph

The Waxman random graph is a generalisation of the simple Erdős-Rényi or Gilbert random graph. It is useful for modelling physical networks where the increased cost of longer links means they are less likely to be built, and thus less numerous than shorter links. The model has been in continuous use for over two decades with many attempts to select parameters which match real networks. In most cases the parameters have been arbitrarily selected, but there are a few cases where they have been calculated using a formal estimator. However, the performance of the estimator was not evaluated in any of these cases. This paper presents both the first evaluation of formal estimators for the parameters of these graphs, and a new Maximum Likelihood Estimator with $O(n)$ computational time complexity that requires only link lengths as input.

preprint, 2015, arXiv

Fast Generation of Spatially Embedded Random Networks

Spatially Embedded Random Networks such as the Waxman random graph have been used in a variety of settings for synthesizing networks. However, little thought has been put into fast generation of these networks. Existing techniques are $O(n^2)$ where $n$ is the number of nodes in the graph. In this paper we present an $O(n + e)$ algorithm, where $e$ is the number of edges.
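For context, the $O(n^2)$ baseline that the abstract refers to can be sketched as a loop over every node pair. The example below assumes one common Waxman parametrisation, in which nodes are placed uniformly in the unit square and a link of length $d$ is created with probability $q e^{-s d}$; the parameter names `q` and `s` and the function name are assumptions for illustration. It shows the naive approach only and is not the paper's $O(n + e)$ algorithm.

```python
import math
import random

def waxman_graph(n, q, s, seed=None):
    """Naive O(n^2) Waxman-style generator (illustrative baseline).

    n: number of nodes, placed uniformly at random in the unit square
    q: link probability at distance zero, in [0, 1] (assumed name)
    s: decay rate with distance, s >= 0 (assumed name)
    """
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):  # visit each unordered pair once
            d = math.dist(points[i], points[j])
            if rng.random() < q * math.exp(-s * d):
                edges.append((i, j))
    return points, edges

points, edges = waxman_graph(n=50, q=0.4, s=3.0, seed=1)
print(len(points), len(edges))
```

Setting `s = 0` removes the distance dependence, which recovers the Gilbert random graph with link probability `q`.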

preprint, 2015, arXiv

Unravelling Graph-Exchange File Formats

A graph is used to represent data in which the relationships between the objects in the data are at least as important as the objects themselves. Over the last two decades nearly a hundred file formats have been proposed or used to provide portable access to such data. This paper seeks to review these formats, and provide some insight to both reduce the ongoing creation of unnecessary formats, and guide the development of new formats where needed.

preprint, 2013, arXiv

Hidden Markov Model Identifiability via Tensors

The prevalence of hidden Markov models (HMMs) in various applications of statistical signal processing and communications is a testament to the power and flexibility of the model. In this paper, we link the identifiability problem with tensor decomposition, in particular, the Canonical Polyadic decomposition. Using recent results in deriving uniqueness conditions for tensor decomposition, we are able to provide a necessary and sufficient condition for the identification of the parameters of discrete time finite alphabet HMMs. This result resolves a long standing open problem regarding the derivation of a necessary and sufficient condition for uniquely identifying an HMM. We then further extend recent preliminary work on the identification of HMMs with multiple observers by deriving necessary and sufficient conditions for identifiability in this setting.
