Trust snapshot

Quick read

Trust 21 - Emerging
31 works
0 followers
16 topics
4 close collaborators

Published work

31 published items

preprint, 2015, arXiv

Big Data Scaling through Metric Mapping: Exploiting the Remarkable Simplicity of Very High Dimensional Spaces using Correspondence Analysis

We present new findings in regard to data analysis in very high dimensional spaces. We use dimensionalities up to around one million. A particular benefit of Correspondence Analysis is its suitability for carrying out an orthonormal mapping, or scaling, of power law distributed data. Power law distributed data are found in many domains. Correspondence factor analysis provides a latent semantic or principal axes mapping. Our experiments use data from digital chemistry and finance, and other statistically generated data.

preprint, 2015, arXiv

Correspondence Factor Analysis of Big Data Sets: A Case Study of 30 Million Words; and Contrasting Analytics using Apache Solr and Correspondence Analysis in R

We consider a large number of text data sets. These are cooking recipes. Term distribution and other distributional properties of the data are investigated. Our aim is to look at various analytical approaches which allow for mining of information on both high and low detail scales. Metric space embedding is fundamental to our interest in the semantic properties of this data. We consider the projection of all data into analyses of aggregated versions of the data. We contrast that with projection of aggregated versions of the data into analyses of all the data. Analogously for the term set, we look at analysis of selected terms. We also look at inherent term associations such as between singular and plural. In addition to our use of Correspondence Analysis in R, for latent semantic space mapping, we also use Apache Solr. Setting up the Solr server and carrying out querying is described. A further novelty is that querying is supported in Solr based on the principal factor plane mapping of all the data. This uses a bounding box query, based on factor projections.

preprint, 2014, arXiv

Visualizing and Quantifying Impact and Effect in Twitter Narrative using Geometric Data Analysis

We use geometric multivariate data analysis which has been termed a methodology for both the visualization and verbalization of data. The general objectives are data mining and knowledge discovery. In the first case study, we use the narrative surrounding very highly profiled tweets, and thus a Twitter event of significance and importance. In the second case study, we use eight carefully planned Twitter campaigns relating to environmental issues. The aim of these campaigns was to increase environmental awareness and behaviour. Unlike current marketing, political and other communication campaigns using Twitter, we develop an innovative approach to measuring behavioural change. We show also how we can assess statistical significance of social media behaviour.

preprint, 2013, arXiv

A History of Cluster Analysis Using the Classification Society's Bibliography Over Four Decades

The Classification Literature Automated Search Service, an annual bibliography based on citation of one or more of a set of around 80 book or journal publications, ran from 1972 to 2012. We analyze here the years 1994 to 2011. The Classification Society's Service, as it was termed, has been produced by the Classification Society. In earlier decades it was distributed as a diskette or CD with the Journal of Classification. Among our findings are the following: an enormous increase in scholarly production post approximately 2000; a very major increase in quantity, coupled with work in different disciplines, from approximately 2004; and a major shift also from cluster analysis in earlier times having mathematics and psychology as disciplines of the journals published in, and affiliations of authors, contrasted with, in more recent times, a "centre of gravity" in management and engineering.

preprint, 2013, arXiv

Computational Properties of Fiction Writing and Collaborative Work

From the earliest days of computing, there have been tools to help shape narrative. Spell-checking, word counts, and readability analysis give today's novelists tools that Dickens, Austen, and Shakespeare could only have dreamt of. However, such tools have focused on the word or phrase level. In the last decade, research focus has shifted to support for collaborative editing of documents. This work considers more sophisticated attempts to visualise the semantics, pace and rhythm within a narrative through data mining. We describe real life applications in two related domains.

preprint, 2013, arXiv

Ultrametric Component Analysis with Application to Analysis of Text and of Emotion

We review the theory and practice of determining what parts of a data set are ultrametric. It is assumed that the data set, to begin with, is endowed with a metric, and we include discussion of how this can be brought about if a dissimilarity, only, holds. The basis for part of the metric-endowed data set being ultrametric is to consider triplets of the observables (vectors). We develop a novel consensus of hierarchical clusterings. We do this in order to have a framework (including visualization and supporting interpretation) for the parts of the data that are determined to be ultrametric. Furthermore a major objective is to determine locally ultrametric relationships as opposed to non-local ultrametric relationships. As part of this work, we also study a particular property of our ultrametricity coefficient, namely, it being a function of the difference of angles of the base angles of the isosceles triangle. This work is completed by a review of related work, on consensus hierarchies, and of a major new application, namely quantifying and interpreting the emotional content of narrative.

preprint, 2012, arXiv

The Future of Search and Discovery in Big Data Analytics: Ultrametric Information Spaces

Consider observation data, comprised of n observation vectors with values on a set of attributes. This gives us n points in attribute space. Having data structured as a tree, implied by having our observations embedded in an ultrametric topology, offers great advantage for proximity searching. If we have preprocessed data through such an embedding, then an observation's nearest neighbor is found in constant computational time, i.e. O(1) time. A further powerful approach is discussed in this work: the inducing of a hierarchy, and hence a tree, in linear computational time, i.e. O(n) time for n observations. It is with such a basis for proximity search and best match that we can address the burgeoning problems of processing very large, and possibly also very high dimensional, data sets.

preprint, 2012, arXiv

Ultrametric Model of Mind, I: Review

We mathematically model Ignacio Matte Blanco's principles of symmetric and asymmetric being through use of an ultrametric topology. We use for this the highly regarded 1975 book of this Chilean psychiatrist and psychoanalyst (born 1908, died 1995). Such an ultrametric model corresponds to hierarchical clustering in the empirical data, e.g. text. We show how an ultrametric topology can be used as a mathematical model for the structure of the logic that reflects or expresses Matte Blanco's symmetric being, and hence of the reasoning and thought processes involved in conscious reasoning or in reasoning that is lacking, perhaps entirely, in consciousness or awareness of itself. In a companion paper we study how symmetric (in the sense of Matte Blanco's) reasoning can be demarcated in a context of symmetric and asymmetric reasoning provided by narrative text.

preprint, 2012, arXiv

Ultrametric Model of Mind, II: Application to Text Content Analysis

In a companion paper, Murtagh (2012), we discussed how Matte Blanco's work linked the unrepressed unconscious (in the human) to symmetric logic and thought processes. We showed how ultrametric topology provides a most useful representational and computational framework for this. Now we look at the extent to which we can find ultrametricity in text. We use coherent and meaningful collections of nearly 1000 texts to show how we can measure inherent ultrametricity. On the basis of our findings we hypothesize that inherent ultrametricity is a basis for further exploring unconscious thought processes.

preprint, 2011, arXiv

Current Trends in Evolving Specialization in UK Universities

There are very significant changes taking place in the university sector and in related higher education institutes in many parts of the world. In this work we look at financial data from 2010 and 2011 from the UK higher education sector. Situating ourselves to begin with in the context of teaching versus research in universities, we look at the data in order to explore the new divergence between the broad agendas of teaching and research in universities. The innovation agenda has become at least equal to the research and teaching objectives of universities. From the financial data, published in the Times Higher Education weekly newspaper, we explore the interesting contrast, and very opposite orientations, in specialization of universities in the UK. We find a polarity in specialism that goes considerably beyond the usual one of research-led elite versus more teaching-oriented new universities. Instead we point to the role of medical/bioscience research income in the former, and economic and business sectoral niche player roles in the latter.

preprint, 2011, arXiv

Fast, Linear Time Hierarchical Clustering using the Baire Metric

The Baire metric induces an ultrametric on a dataset and is of linear computational complexity, contrasted with the standard quadratic time agglomerative hierarchical clustering algorithm. In this work we evaluate empirically this new approach to hierarchical clustering. We compare hierarchical clustering based on the Baire metric with (i) agglomerative hierarchical clustering, in terms of algorithm properties; (ii) generalized ultrametrics, in terms of definition; and (iii) fast clustering through k-means partitioning, in terms of quality of results. For the latter, we carry out an in-depth astronomical study. We apply the Baire distance to spectrometric and photometric redshifts from the Sloan Digital Sky Survey using, in this work, about half a million astronomical objects. We want to know how well the (more costly to determine) spectrometric redshifts can predict the (more easily obtained) photometric redshifts, i.e. we seek to regress the spectrometric on the photometric redshifts, and we use clusterwise regression for this.
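
As an illustration of the idea in this abstract, the following is a minimal Python sketch of a longest-common-prefix (Baire-style) distance and of a single-pass grouping by shared prefix. It assumes values are supplied as equal-length digit strings (for example, the decimal digits of a redshift), and the function names baire_distance and prefix_buckets are hypothetical, chosen for this example only. It is not the paper's implementation.

```python
from collections import defaultdict


def baire_distance(x: str, y: str, base: int = 10) -> float:
    """Longest-common-prefix distance between two equal-length digit strings.

    Assumption for this sketch: the distance is base**(-k), where k is the
    number of leading digits that x and y share, and 0.0 if they are identical.
    """
    if len(x) != len(y):
        raise ValueError("digit strings must have the same length")
    k = 0
    for a, b in zip(x, y):
        if a != b:
            break
        k += 1
    return 0.0 if k == len(x) else float(base) ** -k


def prefix_buckets(values, depth):
    """Group digit strings by their first `depth` digits in one pass.

    Each bucket is one cluster at the given level of the prefix hierarchy,
    so the cost is linear in the number of values.
    """
    buckets = defaultdict(list)
    for v in values:
        buckets[v[:depth]].append(v)
    return dict(buckets)


if __name__ == "__main__":
    data = ["1234", "1299", "1301"]
    print(baire_distance("1234", "1299"))
    print(prefix_buckets(data, 2))
```

Because the distance depends only on the shared prefix, any three values satisfy the strong triangle inequality d(x, z) <= max(d(x, y), d(y, z)), which is what makes the induced structure a hierarchy.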

preprint, 2011, arXiv

Fast, Linear Time, m-Adic Hierarchical Clustering for Search and Retrieval using the Baire Metric, with linkages to Generalized Ultrametrics, Hashing, Formal Concept Analysis, and Precision of Data Measurement

We describe many vantage points on the Baire metric and its use in clustering data, or its use in preprocessing and structuring data in order to support search and retrieval operations. In some cases, we proceed directly to clusters and do not directly determine the distances. We show how a hierarchical clustering can be read directly from one pass through the data. We offer insights also on practical implications of precision of data measurement. As a mechanism for treating multidimensional data, including very high dimensional data, we use random projections.

preprint, 2011, arXiv

Methods of Hierarchical Clustering

We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps, and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally we describe a recently developed very efficient (linear time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm.

preprint, 2011, arXiv

Ward's Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm

The Ward error sum of squares hierarchical clustering method has been very widely used since its first description by Ward in a 1963 publication. It has also been generalized in various ways. However there are different interpretations in the literature and there are different implementations of the Ward agglomerative algorithm in commonly used software systems, including differing expressions of the agglomerative criterion. Our survey work and case studies will be useful for all those involved in developing software for data analysis using Ward's hierarchical clustering method.
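
For readers unfamiliar with the criterion this abstract refers to, here is a small Python sketch of the standard Ward merge cost: the increase in the error sum of squares when two clusters A and B are merged equals |A||B| / (|A| + |B|) times the squared Euclidean distance between their centroids. The helper names centroid, ess and ward_cost are hypothetical, and the sketch does not reproduce any particular software package's implementation.

```python
def centroid(cluster):
    """Mean vector of a non-empty list of equal-length points."""
    n = len(cluster)
    return [sum(p[j] for p in cluster) / n for j in range(len(cluster[0]))]


def ess(cluster):
    """Error sum of squares: squared distances of points to their centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)


def ward_cost(a, b):
    """Increase in total error sum of squares caused by merging a and b."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


if __name__ == "__main__":
    a = [[0.0, 0.0], [2.0, 0.0]]
    b = [[5.0, 0.0]]
    # The closed-form cost should agree with the direct difference in ESS.
    print(ward_cost(a, b), ess(a + b) - ess(a) - ess(b))
```

Whether an implementation works with squared or unsquared distances changes the heights it reports, which is one way the differing expressions of the criterion mentioned in the abstract can arise.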

preprint, 2010, arXiv

Hierarchical Clustering for Finding Symmetries and Other Patterns in Massive, High Dimensional Datasets

Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. "Structure" can be understood as symmetry and a range of symmetries are expressed by hierarchy. Such symmetries directly point to invariants that pinpoint intrinsic properties of the data and of the background empirical domain of interest. We review many aspects of hierarchy here, including ultrametric topology, generalized ultrametric, linkages with lattices and other discrete algebraic structures and with p-adic number representations. By focusing on symmetries in data we have a powerful means of structuring and analyzing massive, high dimensional data stores. We illustrate the power of hierarchical clustering in case studies in chemistry and finance, and we provide pointers to other published case studies.

preprint, 2010, arXiv

New Methods of Analysis of Narrative and Semantics in Support of Interactivity

Our work has focused on support for film or television scriptwriting. Since this involves potentially varied story-lines, we note the implicit or latent support for interactivity. Furthermore the film, television, games, publishing and other sectors are converging, so that cross-over and re-use of one form of product in another of these sectors is ever more common. Technically our work has been largely based on mathematical algorithms for data clustering and display. Operationally, we also discuss how our algorithms can support collective, distributed problem-solving.

preprint, 2010, arXiv

Segmentation and Nodal Points in Narrative: Study of Multiple Variations of a Ballad

The Lady Maisry ballads afford us a framework within which to segment a storyline into its major components. Segments and as a consequence nodal points are discussed for nine different variants of the Lady Maisry story of a (young) woman being burnt to death by her family, on account of her becoming pregnant by a foreign personage. We motivate the importance of nodal points in textual and literary analysis. We show too how the openings of the nine variants can be analyzed comparatively, and also the conclusions of the ballads.

preprint, 2010, arXiv

Ultrametric and Generalized Ultrametric in Computational Logic and in Data Analysis

Following a review of metric, ultrametric and generalized ultrametric, we review their application in data analysis. We show how they allow us to explore both geometry and topology of information, starting with measured data. Some themes are then developed based on the use of metric, ultrametric and generalized ultrametric in logic. In particular we study approximation chains in an ultrametric or generalized ultrametric context. Our aim in this work is to extend the scope of data analysis by facilitating reasoning based on the data analysis; and to show how quantitative and qualitative data analysis can be incorporated into logic programming.

preprint, 2009, arXiv

Open Access, Intellectual Property, and How Biotechnology Becomes a New Software Science

Innovation is slowing greatly in the pharmaceutical sector. It is considered here how part of the problem is due to overly limiting intellectual property relations in the sector. On the other hand, computing and software in particular are characterized by great richness of intellectual property frameworks. Could the intellectual property ecosystem of computing come to the aid of the biosciences and life sciences? We look at how the answer might well be yes, by looking at (i) the extent to which a drug mirrors a software program, and (ii) what is to be gleaned from trends in research publishing in the life and biosciences.

preprint, 2009, arXiv

Scale-Based Gaussian Coverings: Combining Intra and Inter Mixture Models in Image Segmentation

By a "covering" we mean a Gaussian mixture model fit to observed data. Approximations of the Bayes factor can be availed of to judge model fit to the data within a given Gaussian mixture model. Between families of Gaussian mixture models, we propose the Rényi quadratic entropy as an excellent and tractable model comparison framework. We exemplify this using the segmentation of an MRI image volume, based (1) on a direct Gaussian mixture model applied to the marginal distribution function, and (2) Gaussian model fit through k-means applied to the 4D multivalued image volume furnished by the wavelet transform. Visual preference for one model over another is not immediate. The Rényi quadratic entropy allows us to show clearly that one of these modelings is superior to the other.

preprint, 2009, arXiv

Symmetry in Data Mining and Analysis: A Unifying View based on Hierarchy

Data analysis and data mining are concerned with unsupervised pattern finding and structure determination in data sets. The data sets themselves are explicitly linked as a form of representation to an observational or otherwise empirical domain of interest. "Structure" has long been understood as symmetry which can take many forms with respect to any transformation, including point, translational, rotational, and many others. Beginning with the role of number theory in expressing data, we show how we can naturally proceed to hierarchical structures. We show how this both encapsulates traditional paradigms in data analysis, and also opens up new perspectives towards issues that are on the order of the day, including data mining of massive, high dimensional, heterogeneous data sets. Linkages with other fields are also discussed including computational logic and symbolic dynamics. The structures in data surveyed here are based on hierarchy, represented as p-adic numbers or an ultrametric topology.

preprint, 2009, arXiv

Ultrametric Wavelet Regression of Multivariate Time Series: Application to Colombian Conflict Analysis

We first pursue the study of how hierarchy provides a well-adapted tool for the analysis of change. Then, using a time sequence-constrained hierarchical clustering, we develop the practical aspects of a new approach to wavelet regression. This provides a new way to link hierarchical relationships in a multivariate time series data set with external signals. Violence data from the Colombian conflict in the years 1990 to 2004 is used throughout. We conclude with some proposals for further study on the relationship between social violence and market forces, viz. between the Colombian conflict and the US narcotics market.

preprint, 2008, arXiv

From Data to the p-Adic or Ultrametric Model

We model anomaly and change in data by embedding the data in an ultrametric space. Taking our initial data as cross-tabulation counts (or other input data formats), Correspondence Analysis allows us to endow the information space with a Euclidean metric. We then model anomaly or change by an induced ultrametric. The induced ultrametric that we are particularly interested in takes a sequential - e.g. temporal - ordering of the data into account. We apply this work to the flow of narrative expressed in the film script of the Casablanca movie; and to the evolution between 1988 and 2004 of the Colombian social conflict and violence.

preprint, 2008, arXiv

The Correspondence Analysis Platform for Uncovering Deep Structure in Data and Information

We study two aspects of information semantics: (i) the collection of all relationships, (ii) tracking and spotting anomaly and change. The first is implemented by endowing all relevant information spaces with a Euclidean metric in a common projected space. The second is modelled by an induced ultrametric. A very general way to achieve a Euclidean embedding of different information spaces based on cross-tabulation counts (and from other input data formats) is provided by Correspondence Analysis. From there, the induced ultrametric that we are particularly interested in takes a sequential - e.g. temporal - ordering of the data into account. We employ such a perspective to look at narrative, "the flow of thought and the flow of language" (Chafe). In application to policy decision making, we show how we can focus analysis in a small number of dimensions.

preprint, 2008, arXiv

The Remarkable Simplicity of Very High Dimensional Data: Application of Model-Based Clustering

An ultrametric topology formalizes the notion of hierarchical structure. An ultrametric embedding, referred to here as ultrametricity, is implied by a hierarchical embedding. Such hierarchical structure can be global in the data set, or local. By quantifying extent or degree of ultrametricity in a data set, we show that ultrametricity becomes pervasive as dimensionality and/or spatial sparsity increases. This leads us to assert that very high dimensional data are of simple structure. We exemplify this finding through a range of simulated data cases. We discuss also application to very high frequency time series segmentation and modeling.
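
One simple way to probe the claim in this abstract is to count how many point triplets form a triangle whose two longest sides are nearly equal, which is the condition imposed by the strong triangle inequality. The Python sketch below does this with Euclidean distances and a relative tolerance. The function names, the tolerance rel_tol and the uniform random test data are assumptions made for this example, and this side-length test is a simplification, not the ultrametricity coefficient used by the author.

```python
import itertools
import math
import random


def ultrametric_fraction(points, rel_tol=0.05):
    """Fraction of triplets whose two largest pairwise distances nearly agree."""
    total = 0
    hits = 0
    for a, b, c in itertools.combinations(points, 3):
        d = sorted((math.dist(a, b), math.dist(a, c), math.dist(b, c)))
        total += 1
        # Ultrametric triangles are isosceles with a small base, so the two
        # longest sides d[1] and d[2] must be (approximately) equal.
        if d[2] - d[1] <= rel_tol * d[2]:
            hits += 1
    return hits / total if total else 0.0


def random_points(n, dim, seed=0):
    """n points drawn uniformly from the unit hypercube (illustrative data)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(n)]


if __name__ == "__main__":
    for dim in (2, 20, 200, 2000):
        frac = ultrametric_fraction(random_points(20, dim))
        print(dim, round(frac, 3))
```

Under these assumptions one would expect the reported fraction to grow with the dimensionality, in line with the abstract's statement that ultrametricity becomes pervasive in very high dimensional spaces.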

preprint, 2008, arXiv

The Structure of Narrative: the Case of Film Scripts

We analyze the style and structure of story narrative using the case of film scripts. The practical importance of this is noted, especially the need to have support tools for television movie writing. We use the Casablanca film script, and scripts from six episodes of CSI (Crime Scene Investigation). For analysis of style and structure, we quantify various central perspectives discussed in McKee's book, "Story: Substance, Structure, Style, and the Principles of Screenwriting". Film scripts offer a useful point of departure for exploration of the analysis of more general narratives. Our methodology, using Correspondence Analysis, and hierarchical clustering, is innovative in a range of areas that we discuss. In particular this work is groundbreaking in taking the qualitative analysis of McKee and grounding this analysis in a quantitative and algorithmic framework.

preprint, 2008, arXiv

Wavelet and Curvelet Moments for Image Classification: Application to Aggregate Mixture Grading

We show the potential for classifying images of mixtures of aggregate, based themselves on varying, albeit well-defined, sizes and shapes, in order to provide a far more effective approach compared to the classification of individual sizes and shapes. While a dominant (additive, stationary) Gaussian noise component in image data will ensure that wavelet coefficients are of Gaussian distribution, long tailed distributions (symptomatic, for example, of extreme values) may well hold in practice for wavelet coefficients. Energy (2nd order moment) has often been used for image characterization for image content-based retrieval, and higher order moments may be important also, not least for capturing long tailed distributional behavior. In this work, we assess 2nd, 3rd and 4th order moments of multiresolution transform -- wavelet and curvelet transform -- coefficients as features. As analysis methodology, taking account of image types, multiresolution transforms, and moments of coefficients in the scales or bands, we use correspondence analysis as well as k-nearest neighbors supervised classification.

preprint, 2007, arXiv

On Ultrametric Algorithmic Information

How best to quantify the information of an object, whether natural or artifact, is a problem of wide interest. A related problem is the computability of an object. We present practical examples of a new way to address this problem. By giving an appropriate representation to our objects, based on a hierarchical coding of information, we exemplify how it is remarkably easy to compute complex objects. Our algorithmic complexity is related to the length of the class of objects, rather than to the length of the object.

preprint, 2007, arXiv

The Haar Wavelet Transform of a Dendrogram

We describe a new wavelet transform, for use on hierarchies or binary rooted trees. The theoretical framework of this approach to data analysis is described. Case studies are used to further exemplify this approach. A first set of application studies deals with data array smoothing, or filtering. A second set of application studies relates to hierarchical tree condensation. Finally, a third study explores the wavelet decomposition, and the reproducibility of data sets such as text, including a new perspective on the generation or computability of such data objects.

preprint, 2007, arXiv

Ultrametric embedding: application to data fingerprinting and to fast data clustering

We begin with pervasive ultrametricity due to high dimensionality and/or spatial sparsity. How extent or degree of ultrametricity can be quantified leads us to the discussion of varied practical cases when ultrametricity can be partially or locally present in data. We show how the ultrametricity can be assessed in text or document collections, and in time series signals. An aspect of importance here is that to draw benefit from this perspective the data may need to be recoded. Such data recoding can also be powerful in proximity searching, as we will show, where the data is embedded globally and not locally in an ultrametric space.