Source author record

Nicholas M. Ball

Nicholas M. Ball appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Topics: astro-ph.IM, astro-ph.CO, astro-ph, Machine Learning

Catalog footprint: 7 works, 4 topics, 4 close collaborators

preprint | 2013 | arXiv

CANFAR+Skytree: A Cloud Computing and Data Mining System for Astronomy

At the Canadian Astronomy Data Centre, we have combined our cloud computing system, CANFAR, with the world's most advanced machine learning software, Skytree, to create the world's first cloud computing system for data mining in astronomy. CANFAR provides a generic environment for the storage and processing of large datasets, removing the requirement to set up and maintain a computing system when implementing an extensive undertaking such as a survey pipeline. 500 processor cores and several hundred terabytes of persistent storage are currently available to users. The storage is implemented via the International Virtual Observatory Alliance's VOSpace protocol, and is accessible both interactively, and to all processing jobs. The user interacts with CANFAR by utilizing virtual machines, which appear to them as equivalent to a desktop. Each machine is replicated as desired to perform large-scale parallel processing. Such an arrangement enables the user to immediately install and run the same astronomy code that they already utilize, in the same way as on a desktop. In addition, unlike many cloud systems, batch job scheduling is handled for the user on multiple virtual machines by the Condor job queueing system. Skytree is installed and run just as any other software on the system, and thus acts as a library of command line data mining functions that can be integrated into one's wider analysis. Thus we have created a generic environment for large-scale analysis by data mining, in the same way that CANFAR itself has done for storage and processing. Because Skytree scales to large data in linear runtime, this allows the full sophistication of the huge fields of data mining and machine learning to be applied to the hundreds of millions of objects that make up current large datasets. We demonstrate the utility of the CANFAR+Skytree system by showing science results obtained. [Abridged]

preprint | 2013 | arXiv

Focus Demo: CANFAR+Skytree: A Cloud Computing and Data Mining System for Astronomy

This is a companion Focus Demonstration article to the CANFAR+Skytree poster (Ball 2012), demonstrating the usage of the Skytree machine learning software on the Canadian Advanced Network for Astronomical Research (CANFAR) cloud computing system. CANFAR+Skytree is the world's first cloud computing system for data mining in astronomy.

preprint | 2011 | arXiv

Discussion on "Techniques for Massive-Data Machine Learning in Astronomy" by A. Gray

Astronomy is increasingly encountering two fundamental truths: (1) The field is faced with the task of extracting useful information from extremely large, complex, and high-dimensional datasets; (2) The techniques of astroinformatics and astrostatistics are the only way to make this tractable, and bring the required level of sophistication to the analysis. Thus, an approach which provides these tools in a way that scales to these datasets is not just desirable, it is vital. The expertise required spans not just astronomy, but also computer science, statistics, and informatics. As a computer scientist and expert in machine learning, Alex's contribution of expertise and a large number of fast algorithms designed to scale to large datasets is extremely welcome. We focus in this discussion on the questions raised by the practical application of these algorithms to real astronomical datasets. That is, what is needed to maximally leverage their potential to improve the science return? This is not a trivial task. While computing and statistical expertise are required, so is astronomical expertise. Precedent has shown that, to date, the collaborations most productive in producing astronomical science results (e.g., the Sloan Digital Sky Survey) have either involved astronomers expert in computer science and/or statistics, or astronomers involved in close, long-term collaborations with experts in those fields. This does not mean that the astronomers are giving the most important input, but simply that their input is crucial in guiding the effort in the most fruitful directions, and coping with the issues raised by real data. Thus, the tools must be usable and understandable by those whose primary expertise is not computing or statistics, even though they may have quite extensive knowledge of those fields.

preprint | 2011 | arXiv

Utilizing Astroinformatics to Maximize the Science Return of the Next Generation Virgo Cluster Survey

The Next Generation Virgo Cluster Survey is a 104 square degree survey of the Virgo Cluster, carried out using the MegaPrime camera of the Canada-France-Hawaii Telescope, from semesters 2009A-2012A. The survey will provide coverage of this nearby dense environment in the universe to unprecedented depth, providing profound insights into galaxy formation and evolution, including definitive measurements of the properties of galaxies in a dense environment in the local universe, such as the luminosity function. The limiting magnitude of the survey is g_AB = 25.7 (10 sigma point source), and the 2 sigma surface brightness limit is g_AB ~ 29 mag arcsec^-2. The data volume of the survey (approximately 50 terabytes of images), while large by contemporary astronomical standards, is not intractable. This renders the survey amenable to the methods of astroinformatics. The enormous dynamic range of objects, from the giant elliptical galaxy M87 at M(B) = -21.6, to the faintest dwarf ellipticals at M(B) ~ -6, combined with photometry in 5 broad bands (u* g' r' i' z'), and unprecedented depth revealing many previously unseen structures, creates new challenges in object detection and classification. We present results from ongoing work on the survey, including photometric redshifts, Virgo cluster membership, and the implementation of fast data mining algorithms on the infrastructure of the Canadian Astronomy Data Centre, as part of the Canadian Advanced Network for Astronomical Research (CANFAR).
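
As a rough worked illustration of the dynamic range quoted in this abstract, the standard magnitude-to-luminosity relation can be applied to the two absolute magnitudes given. This is a back-of-the-envelope calculation, not a figure taken from the paper, and the faint-end value M(B) ~ -6 is itself approximate.

```latex
\Delta M = M_{\mathrm{dwarf}} - M_{\mathrm{M87}} \approx -6 - (-21.6) = 15.6
\qquad
\frac{L_{\mathrm{M87}}}{L_{\mathrm{dwarf}}} = 10^{0.4\,\Delta M} \approx 10^{6.24} \approx 1.7 \times 10^{6}
```

Under these numbers, the brightest and faintest galaxies in the survey differ in B-band luminosity by a factor of order a million.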

preprint | 2010 | arXiv

Data Mining and Machine Learning in Astronomy

We review the current state of data mining and machine learning in astronomy. 'Data Mining' can have a somewhat mixed connotation from the point of view of a researcher in this field. If used correctly, it can be a powerful approach, holding the potential to fully exploit the exponentially increasing amount of available data, promising great scientific advance. However, if misused, it can be little more than the black-box application of complex computing algorithms that may give little physical insight, and provide questionable results. Here, we give an overview of the entire data mining process, from data collection through to the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines, applications from a broad range of astronomy, emphasizing those where data mining techniques directly resulted in improved science, and important current and future directions, including probability density functions, parallel algorithms, petascale computing, and the time domain. We conclude that, so long as one carefully selects an appropriate algorithm, and is guided by the astronomical problem at hand, data mining can be very much the powerful tool, and not the questionable black box.

preprint | 2009 | arXiv

Incorporating Photometric Redshift Probability Density Information into Real-Space Clustering Measurements

The use of photometric redshifts in cosmology is increasing. Often, however, these photo-zs are treated like spectroscopic observations, in that the peak of the photometric redshift, rather than the full probability density function (PDF), is used. This overlooks useful information inherent in the full PDF. We introduce a new real-space estimator for one of the most used cosmological statistics, the 2-point correlation function, that weights by the PDF of individual photometric objects in a manner that is optimal when Poisson statistics dominate. As our estimator does not bin based on the PDF peak, it substantially enhances the clustering signal by usefully incorporating information from all photometric objects that overlap the redshift bin of interest. As a real-world application, we measure QSO clustering in the Sloan Digital Sky Survey (SDSS). We find that our simplest binned estimator improves the clustering signal by a factor equivalent to increasing the survey size by a factor of 2-3. We also introduce a new implementation that fully weights between pairs of objects in constructing the cross-correlation and find that this pair-weighted estimator improves clustering signal in a manner equivalent to increasing the survey size by a factor of 4-5. Our technique uses spectroscopic data to anchor the distance scale and it will be particularly useful where spectroscopic data (e.g., from BOSS) overlaps deeper photometry (e.g., from Pan-STARRS, DES or the LSST). We additionally provide simple, informative expressions to determine when our estimator will be competitive with the autocorrelation of spectroscopic objects. Although we use QSOs as an example population, our estimator can and should be applied to any clustering estimate that uses photometric objects.
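
The contrast between binning on the PDF peak and weighting by the full PDF can be sketched in a few lines. The following is an illustrative toy, not the estimator derived in the paper: it assumes Gaussian photo-z PDFs, flat 2D positions, and a simple weighted histogram of spectroscopic-photometric pair separations. The function names `bin_weight` and `weighted_pair_counts` and all data values are hypothetical.

```python
import math
import numpy as np

def bin_weight(mu, sigma, z_lo, z_hi):
    """Probability that a Gaussian photo-z PDF N(mu, sigma) falls in [z_lo, z_hi].

    Assumption: the PDF is Gaussian; real photo-z PDFs are often not.
    """
    def cdf(z):
        return 0.5 * (1.0 + math.erf((z - mu) / (sigma * math.sqrt(2.0))))
    return cdf(z_hi) - cdf(z_lo)

def weighted_pair_counts(spec_xy, phot_xy, phot_w, edges):
    """Histogram spec-phot separations, weighting each pair by the
    photometric object's probability of lying in the redshift bin."""
    sep = np.linalg.norm(spec_xy[:, None, :] - phot_xy[None, :, :], axis=2)
    w = np.broadcast_to(phot_w, sep.shape)
    counts, _ = np.histogram(sep.ravel(), bins=edges, weights=w.ravel())
    return counts

# Hypothetical photometric objects: PDF peak and width for each.
z_peak = np.array([1.20, 1.48, 1.60, 2.50])
z_sigma = np.array([0.05, 0.20, 0.20, 0.20])
z_lo, z_hi = 1.0, 1.5

# PDF weighting: every object contributes in proportion to its overlap.
weights = np.array([bin_weight(m, s, z_lo, z_hi)
                    for m, s in zip(z_peak, z_sigma)])
# Peak binning: an object is either fully in the bin or fully out.
in_bin_by_peak = (z_peak >= z_lo) & (z_peak < z_hi)

# Synthetic positions standing in for a real catalog.
rng = np.random.default_rng(1)
spec_xy = rng.uniform(0, 10, size=(50, 2))
phot_xy = rng.uniform(0, 10, size=(4, 2))
edges = np.linspace(0, 5, 6)

counts_pdf = weighted_pair_counts(spec_xy, phot_xy, weights, edges)
counts_peak = weighted_pair_counts(spec_xy, phot_xy,
                                   in_bin_by_peak.astype(float), edges)
print("PDF weights:", np.round(weights, 3))
print("peak-binned:", counts_peak)
print("PDF-weighted:", np.round(counts_pdf, 2))
```

In this toy, the third object (peak at z = 1.60) is discarded by peak binning but still contributes a fractional weight under PDF weighting, which is the kind of information the abstract describes recovering.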

preprint | 2007 | arXiv

Robust Machine Learning Applied to Astronomical Datasets II: Quantifying Photometric Redshifts for Quasars Using Instance-Based Learning

We apply instance-based machine learning in the form of a k-nearest neighbor algorithm to the task of estimating photometric redshifts for 55,746 objects spectroscopically classified as quasars in the Fifth Data Release of the Sloan Digital Sky Survey. We compare the results obtained to those from an empirical color-redshift relation (CZR). In contrast to previously published results using CZRs, we find that the instance-based photometric redshifts are assigned with no regions of catastrophic failure. Remaining outliers are simply scattered about the ideal relation, in a similar manner to the pattern seen in the optical for normal galaxies at redshifts z < ~1. The instance-based algorithm is trained on a representative sample of the data and pseudo-blind-tested on the remaining unseen data. The variance between the photometric and spectroscopic redshifts is sigma^2 = 0.123 +/- 0.002 (compared to sigma^2 = 0.265 +/- 0.006 for the CZR), and 54.9 +/- 0.7%, 73.3 +/- 0.6%, and 80.7 +/- 0.3% of the objects are within delta z < 0.1, 0.2, and 0.3 respectively. We also match our sample to the Second Data Release of the Galaxy Evolution Explorer legacy data and the resulting 7,642 objects show a further improvement, giving a variance of sigma^2 = 0.054 +/- 0.005, and 70.8 +/- 1.2%, 85.8 +/- 1.0%, and 90.8 +/- 0.7% of objects within delta z < 0.1, 0.2, and 0.3. We show that the improvement is indeed due to the extra information provided by GALEX, by training on the same dataset using purely SDSS photometry, which has a variance of sigma^2 = 0.090 +/- 0.007. Each set of results represents a realistic standard for application to further datasets for which the spectra are representative.
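
The general shape of a k-nearest-neighbor photometric redshift estimate, and of the summary statistics quoted in this abstract, can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the brute-force neighbor search, the choice of k = 5, the unweighted mean of neighbor redshifts, the definition of sigma^2 as the mean squared redshift difference, and the names `knn_photoz` and `photoz_metrics` are all assumptions made for the example.

```python
import numpy as np

def knn_photoz(train_colors, train_z, test_colors, k=5):
    """Estimate each test object's redshift as the mean spectroscopic
    redshift of its k nearest training objects in color space."""
    # Pairwise squared Euclidean distances, shape (n_test, n_train).
    d2 = ((test_colors[:, None, :] - train_colors[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k smallest distances per test object (unordered).
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
    return train_z[nearest].mean(axis=1)

def photoz_metrics(z_phot, z_spec, thresholds=(0.1, 0.2, 0.3)):
    """Return (sigma^2, fractions with |delta z| below each threshold).

    Assumption: sigma^2 is taken as the mean squared difference
    between photometric and spectroscopic redshifts.
    """
    dz = np.asarray(z_phot) - np.asarray(z_spec)
    sigma2 = float(np.mean(dz ** 2))
    fractions = [float(np.mean(np.abs(dz) < t)) for t in thresholds]
    return sigma2, fractions

# Synthetic stand-in for a quasar sample: four "colors" that depend
# smoothly on redshift, plus noise. Not real SDSS or GALEX data.
rng = np.random.default_rng(0)
z_all = rng.uniform(0.0, 3.0, size=2000)
colors_all = np.column_stack(
    [np.sin(z_all), np.cos(z_all), 0.5 * z_all, z_all ** 2 / 9.0]
)
colors_all += rng.normal(0.0, 0.05, size=colors_all.shape)

# Train on one part of the sample, test on the unseen remainder.
train, test = slice(0, 1500), slice(1500, 2000)
z_phot = knn_photoz(colors_all[train], z_all[train], colors_all[test], k=5)
sigma2, fractions = photoz_metrics(z_phot, z_all[test])
print(f"sigma^2 = {sigma2:.4f}")
print("fractions within 0.1, 0.2, 0.3:", fractions)
```

The train/test split mirrors the abstract's description of training on a representative sample and testing on the remaining unseen data; the numbers printed here describe only the synthetic sample and are not comparable to the published results.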