Researcher profile

J. D. Knowles

J. D. Knowles contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 15 - Baseline
3works
0followers
3topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

3 published item(s)

preprint2016arXiv

Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach

Improving survey specifications are causing an exponential rise in pulsar candidate numbers and data volumes. We study the candidate filters used to mitigate these problems during the past fifty years. We find that some existing methods such as applying constraints on the total number of candidates collected per observation, may have detrimental effects on the success of pulsar searches. Those methods immune to such effects are found to be ill-equipped to deal with the problems associated with increasing data volumes and candidate numbers, motivating the development of new approaches. We therefore present a new method designed for on-line operation. It selects promising candidates using a purpose-built tree-based machine learning classifier, the Gaussian Hellinger Very Fast Decision Tree (GH-VFDT), and a new set of features for describing candidates. The features have been chosen so as to i) maximise the separation between candidates arising from noise and those of probable astrophysical origin, and ii) be as survey-independent as possible. Using these features our new approach can process millions of candidates in seconds (~1 million every 15 seconds), with high levels of pulsar recall (90%+). This technique is therefore applicable to the large volumes of data expected to be produced by the Square Kilometre Array (SKA). Use of this approach has assisted in the discovery of 20 new pulsars in data obtained during the LOFAR Tied-Array All-Sky Survey (LOTAAS).

preprint2014arXiv

Hellinger Distance Trees for Imbalanced Streams

Classifiers trained on data sets possessing an imbalanced class distribution are known to exhibit poor generalisation performance. This is known as the imbalanced learning problem. The problem becomes particularly acute when we consider incremental classifiers operating on imbalanced data streams, especially when the learning objective is rare class identification. As accuracy may provide a misleading impression of performance on imbalanced data, existing stream classifiers based on accuracy can suffer poor minority class performance on imbalanced streams, with the result being low minority class recall rates. In this paper we address this deficiency by proposing the use of the Hellinger distance measure, as a very fast decision tree split criterion. We demonstrate that by using Hellinger a statistically significant improvement in recall rates on imbalanced data streams can be achieved, with an acceptable increase in the false positive rate.

preprint2013arXiv

A Study on Classification in Imbalanced and Partially-Labelled Data Streams

The domain of radio astronomy is currently facing significant computational challenges, foremost amongst which are those posed by the development of the world's largest radio telescope, the Square Kilometre Array (SKA). Preliminary specifications for this instrument suggest that the final design will incorporate between 2000 and 3000 individual 15 metre receiving dishes, which together can be expected to produce a data rate of many TB/s. Given such a high data rate, it becomes crucial to consider how this information will be processed and stored to maximise its scientific utility. In this paper, we consider one possible data processing scenario for the SKA, for the purposes of an all-sky pulsar survey. In particular we treat the selection of promising signals from the SKA processing pipeline as a data stream classification problem. We consider the feasibility of classifying signals that arrive via an unlabelled and heavily class imbalanced data stream, using currently available algorithms and frameworks. Our results indicate that existing stream learners exhibit unacceptably low recall on real astronomical data when used in standard configuration; however, good false positive performance and comparable accuracy to static learners, suggests they have definite potential as an on-line solution to this particular big data challenge.