Source author record

Michael Kirby

Michael Kirby appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Computer Vision; Computational Geometry; hep-ex; Machine Learning; math.OC; physics.comp-ph; Distributed, Parallel, and Cluster Computing; math.AT; physics.data-an; Computational Engineering, Finance, and Science; math.DS; math.GT; math.MG; math.PR; physics.ins-det; Quantitative Methods

Catalog footprint

What is connected

15 works

16 topics

4 close collaborators

Preprint, 2025, arXiv

A Granular Grassmannian Clustering Framework via the Schubert Variety of Best Fit

In many classification and clustering tasks, it is useful to compute a geometric representative for a dataset or a cluster, such as a mean or median. When datasets are represented by subspaces, these representatives become points on the Grassmann or flag manifold, with distances induced by their geometry, often via principal angles. We introduce a subspace clustering algorithm that replaces subspace means with a trainable prototype defined as a Schubert Variety of Best Fit (SVBF), a subspace that comes as close as possible to intersecting each cluster member in at least one fixed direction. Integrated into the Linde-Buzo-Gray (LBG) pipeline, this SVBF-LBG scheme yields improved cluster purity on synthetic, image, spectral, and video action data, while retaining the mathematical structure required for downstream analysis.
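As an illustration of the distances "via principal angles" mentioned in this abstract, the following is a minimal sketch, assuming NumPy and assuming each subspace is given as a full-column-rank matrix whose columns span it. The function names are hypothetical and are not taken from the paper; the code shows only the standard geodesic and chordal distances, not the SVBF prototype itself.

```python
import numpy as np

def orthonormal_basis(A):
    """Orthonormal basis for the column span of A (assumes full column rank)."""
    Q, _ = np.linalg.qr(A)
    return Q

def principal_angles(A, B):
    """Principal angles (radians, ascending) between span(A) and span(B)."""
    Qa, Qb = orthonormal_basis(A), orthonormal_basis(B)
    # The singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    # Clip to guard against round-off pushing a cosine slightly above 1.
    return np.arccos(np.clip(cosines, 0.0, 1.0))

def geodesic_distance(A, B):
    """Geodesic distance: Euclidean norm of the vector of principal angles."""
    return float(np.linalg.norm(principal_angles(A, B)))

def chordal_distance(A, B):
    """Chordal distance: Euclidean norm of the sines of the principal angles."""
    return float(np.linalg.norm(np.sin(principal_angles(A, B))))
```

For example, the planes span{e1, e2} and span{e1, e3} in R^3 share one direction and are orthogonal in the other, so their principal angles are 0 and pi/2.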

Preprint, 2022, arXiv

Portability: A Necessary Approach for Future Scientific Software

Today's world of scientific software for High Energy Physics (HEP) is powered by x86 code, while the future will be much more reliant on accelerators like GPUs and FPGAs. The portable parallelization strategies (PPS) project of the High Energy Physics Center for Computational Excellence (HEP/CCE) is investigating portability techniques that will allow an algorithm to be coded once and executed on a variety of hardware products from many vendors, especially accelerators. We think that without these solutions the scientific success of our experiments and endeavors is in danger, because software development could become expert-driven and costly just to run on the available hardware infrastructure. We think the best solution for the community would be an extension to the C++ standard with a very low entry bar for users, supporting all hardware forms and vendors. We are very far from that ideal, though. We argue that in the future, as a community, we need to request and work on portability solutions and strive to reach this ideal.

Preprint, 2022, arXiv

Sparse Centroid-Encoder: A Nonlinear Model for Feature Selection

Autoencoders have been widely used as a nonlinear tool for data dimensionality reduction. While autoencoders don't utilize the label information, Centroid-Encoders (CE) \cite{ghosh2022supervised} use the class label in their learning process. In this study, we propose a sparse optimization using the Centroid-Encoder architecture to determine a minimal set of features that discriminate between two or more classes. The resulting algorithm, Sparse Centroid-Encoder (SCE), extracts discriminatory features in groups using a sparsity-inducing $\ell_1$-norm while mapping a point to its class centroid. One key attribute of SCE is that it can extract informative features from multi-modal data sets, i.e., data sets whose classes appear to have multiple clusters. The algorithm is applied to a wide variety of real-world data sets, including single-cell data, high-dimensional biological data, image data, speech data, and accelerometer sensor data. We compared our method to various state-of-the-art feature selection techniques, including supervised Concrete Autoencoders (SCAE), Feature Selection Network (FsNet), deep feature selection (DFS), Stochastic Gate (STG), and LassoNet. We empirically showed that SCE features often produced better classification accuracy than other methods on a sequestered test set.

Preprint, 2022, arXiv

The Flag Median and FlagIRLS

Finding prototypes (e.g., mean and median) for a dataset is central to a number of common machine learning algorithms. Subspaces have been shown to provide useful, robust representations for datasets of images, videos and more. Since subspaces correspond to points on a Grassmann manifold, one is led to consider the idea of a subspace prototype for a Grassmann-valued dataset. While a number of different subspace prototypes have been described, the calculation of some of these prototypes has proven to be computationally expensive, while other prototypes are affected by outliers and produce highly imperfect clustering on noisy data. This work proposes a new subspace prototype, the flag median, and introduces the FlagIRLS algorithm for its calculation. We provide evidence that the flag median is robust to outliers and can be used effectively in algorithms like Linde-Buzo-Gray (LBG) to produce improved clusterings on Grassmannians. Numerical experiments include a synthetic dataset, the MNIST handwritten digits dataset, the Mind's Eye video dataset and the UCF YouTube action dataset. The flag median is compared to the other leading algorithms for computing prototypes on the Grassmannian, namely the $\ell_2$-median and the flag mean. We find that using FlagIRLS to compute the flag median converges in $4$ iterations on a synthetic dataset. We also see that Grassmannian LBG with a codebook size of $20$ and using the flag median produces at least a $10\%$ improvement in cluster purity over Grassmannian LBG using the flag mean or $\ell_2$-median on the Mind's Eye dataset.
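To make the idea of an outlier-robust subspace prototype concrete, here is a rough sketch of a generic iteratively reweighted least squares (IRLS) loop that minimizes the sum of chordal distances to a set of subspaces. It assumes NumPy, assumes every subspace is given as a matrix with orthonormal columns, and takes the weighted flag mean to be the leading left singular vectors of the weighted, concatenated bases. All function names are hypothetical, and the sketch is not claimed to reproduce the FlagIRLS algorithm of the paper in its details.

```python
import numpy as np

def weighted_flag_mean(subspaces, weights, r):
    """Leading r left singular vectors of the weighted, concatenated bases."""
    stacked = np.hstack([np.sqrt(w) * X for X, w in zip(subspaces, weights)])
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return U[:, :r]

def chordal_to_prototype(Y, X):
    """Chordal distance between span(Y) (dimension r) and span(X) (dimension k >= r)."""
    r = Y.shape[1]
    # The sum of squared sines of the principal angles equals r - ||X^T Y||_F^2.
    return np.sqrt(max(r - np.linalg.norm(X.T @ Y, "fro") ** 2, 0.0))

def irls_median(subspaces, r, n_iter=20, eps=1e-8):
    """Weiszfeld-style IRLS: reweight each subspace by 1 / (distance to prototype)."""
    weights = np.ones(len(subspaces))
    Y = weighted_flag_mean(subspaces, weights, r)  # start from the unweighted mean
    for _ in range(n_iter):
        dists = np.array([chordal_to_prototype(Y, X) for X in subspaces])
        weights = 1.0 / np.maximum(dists, eps)  # eps avoids division by zero
        Y = weighted_flag_mean(subspaces, weights, r)
    return Y
```

Because distant subspaces receive small weights, a single stray subspace has little influence on the result, which is the kind of robustness the abstract attributes to a median-type prototype.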

Preprint, 2020, arXiv

Supervised Dimensionality Reduction and Visualization using Centroid-encoder

Visualizing high-dimensional data is an essential task in Data Science and Machine Learning. The Centroid-Encoder (CE) method is similar to the autoencoder but incorporates label information to keep objects of a class close together in the reduced visualization space. CE exploits nonlinearity and labels to encode high variance in low dimensions while capturing the global structure of the data. We present a detailed analysis of the method using a wide variety of data sets and compare it with other supervised dimension reduction techniques, including NCA, nonlinear NCA, t-distributed NCA, t-distributed MCML, supervised UMAP, supervised PCA, Colored Maximum Variance Unfolding, supervised Isomap, Parametric Embedding, supervised Neighbor Retrieval Visualizer, and Multiple Relational Embedding. We empirically show that centroid-encoder outperforms most of these techniques. We also show that when the data variance is spread across multiple modalities, centroid-encoder extracts a significant amount of information from the data in low-dimensional space. This key feature establishes its value as a tool for data visualization.

Preprint, 2020, arXiv

The flag manifold as a tool for analyzing and comparing data sets

The shape and orientation of data clouds reflect variability in observations that can confound pattern recognition systems. Subspace methods, utilizing Grassmann manifolds, have been a great aid in dealing with such variability. However, this usefulness begins to falter when the data cloud contains sufficiently many outliers corresponding to stray elements from another class or when the number of data points is larger than the number of features. We illustrate how nested subspace methods, utilizing flag manifolds, can help to deal with such additional confounding factors. Flag manifolds, which are parameter spaces for nested subspaces, are a natural geometric generalization of Grassmann manifolds. To make practical comparisons on a flag manifold, algorithms are proposed for determining the distances between points $[A], [B]$ on a flag manifold, where $A$ and $B$ are arbitrary orthogonal matrix representatives for $[A]$ and $[B]$, and for determining the initial direction of the minimal-length geodesics between them. The approach is illustrated in the context of (hyper)spectral imagery, showing the impact of ambient dimension, sample dimension, and flag structure.

Preprint, 2019, arXiv

A fractal dimension for measures via persistent homology

We use persistent homology in order to define a family of fractal dimensions, denoted $\mathrm{dim}_{\mathrm{PH}}^i(μ)$ for each homological dimension $i\ge 0$, assigned to a probability measure $μ$ on a metric space. The case of $0$-dimensional homology ($i=0$) relates to work by J. Michael Steele (1988) studying the total length of a minimal spanning tree on a random sampling of points. Indeed, if $μ$ is supported on a compact subset of Euclidean space $\mathbb{R}^m$ for $m\ge2$, then Steele's work implies that $\mathrm{dim}_{\mathrm{PH}}^0(μ)=m$ if the absolutely continuous part of $μ$ has positive mass, and otherwise $\mathrm{dim}_{\mathrm{PH}}^0(μ)<m$. Experiments suggest that similar results may be true for higher-dimensional homology $0<i<m$, though this is an open question. Our fractal dimension is defined by considering a limit, as the number of points $n$ goes to infinity, of the total sum of the $i$-dimensional persistent homology interval lengths for $n$ random points selected from $μ$ in an i.i.d. fashion. To some measures $μ$, we are able to assign a finer invariant, a curve measuring the limiting distribution of persistent homology interval lengths as the number of points goes to infinity. We prove this limiting curve exists in the case of $0$-dimensional homology when $μ$ is the uniform distribution over the unit interval, and conjecture that it exists when $μ$ is the rescaled probability measure for a compact set in Euclidean space with positive Lebesgue measure.
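The $i=0$ case can be explored numerically with a small sketch. It rests on two assumptions that are not spelled out in the abstract: that the sum of the finite $0$-dimensional interval lengths equals the total minimal spanning tree (MST) length up to a constant factor fixed by the filtration convention, and that this length scales like $n^{(d-1)/d}$, so that a log-log slope $s$ gives the estimate $d = 1/(1-s)$. The function names and sampling choices below are hypothetical, and the estimator is only a finite-sample heuristic, not the limit used in the paper's definition.

```python
import math
import random

def mst_total_length(points):
    """Total edge length of the Euclidean minimum spanning tree (Prim, O(n^2))."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known connection of each point to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total

def estimate_dimension(sampler, sizes=(50, 100, 200, 400), trials=4, seed=0):
    """Fit log(mean MST length) against log(n) and return 1 / (1 - slope)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for n in sizes:
        lengths = [mst_total_length([sampler(rng) for _ in range(n)])
                   for _ in range(trials)]
        xs.append(math.log(n))
        ys.append(math.log(sum(lengths) / trials))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return 1.0 / (1.0 - slope)

def uniform_square(rng):
    """One point drawn uniformly from the unit square."""
    return (rng.random(), rng.random())

def uniform_interval(rng):
    """One point drawn uniformly from the unit interval."""
    return (rng.random(),)
```

Calling `estimate_dimension(uniform_square)` should give a value near 2 and `estimate_dimension(uniform_interval)` a value near 1, although estimates at these small sample sizes are noisy and affected by boundary effects.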

Preprint, 2018, arXiv

A Roadmap for HEP Software and Computing R&D for the 2020s

Particle physics has an ambitious and broad experimental programme for the coming decades. This programme requires large investments in detector hardware, either to build new facilities and experiments, or to upgrade existing ones. Similarly, it requires commensurate investment in the R&D of software to acquire, manage, process, and analyse the sheer amounts of data to be recorded. In planning for the HL-LHC in particular, it is critical that all of the collaborating stakeholders agree on the software goals and priorities, and that the efforts complement each other. In this spirit, this white paper describes the R&D activities required to prepare for this software upgrade.

Preprint, 2016, arXiv

Persistent Homology on Grassmann Manifolds for Analysis of Hyperspectral Movies

The existence of characteristic structure, or shape, in complex data sets has been recognized as increasingly important for mathematical data analysis. This realization has motivated the development of new tools such as persistent homology for exploring topological invariants, or features, in large data sets. In this paper we apply persistent homology to the characterization of gas plumes in time dependent sequences of hyperspectral cubes, i.e. the analysis of 4-way arrays. We investigate hyperspectral movies of Long-Wavelength Infrared data monitoring an experimental release of chemical simulant into the air. Our approach models regions of interest within the hyperspectral data cubes as points on the real Grassmann manifold $G(k, n)$ (whose points parameterize the $k$-dimensional subspaces of $\mathbb{R}^n$), contrasting our approach with the more standard framework in Euclidean space. An advantage of this approach is that it allows a sequence of time slices in a hyperspectral movie to be collapsed to a sequence of points in such a way that some of the key structure within and between the slices is encoded by the points on the Grassmann manifold. This motivates the search for topological features, associated with the evolution of the frames of a hyperspectral movie, within the corresponding points on the Grassmann manifold. The proposed mathematical model affords the processing of large data sets while retaining valuable discriminatory information. In this paper, we discuss how embedding our data in the Grassmann manifold, together with topological data analysis, captures dynamical events that occur as the chemical plume is released and evolves.

Preprint, 2016, arXiv

Stratifying High Dimensional Data Based on Proximity to the Convex Hull Boundary

The convex hull of a set of points, $C$, serves to expose extremal properties of $C$ and can help identify elements in $C$ of high interest. For many problems, particularly in the presence of noise, the true vertex set (and facets) may be difficult to determine. One solution is to expand the list of high interest candidates to points lying near the boundary of the convex hull. We propose a quadratic program for the purpose of stratifying points in a data cloud based on proximity to the boundary of the convex hull. For each data point, a quadratic program is solved to determine an associated weight vector. We show that the weight vector encodes geometric information concerning the point's relationship to the boundary of the convex hull. The computation of the weight vectors can be carried out in parallel, and for a fixed number of points and fixed neighborhood size, the overall computational complexity of the algorithm grows linearly with dimension. As a consequence, meaningful computations can be completed on reasonably large, high dimensional data sets.

Preprint, 2015, arXiv

Classification of Hyperspectral Imagery on Embedded Grassmannians

We propose an approach for capturing the signal variability in hyperspectral imagery using the framework of the Grassmann manifold. Labeled points from each class are sampled and used to form abstract points on the Grassmannian. The resulting points on the Grassmannian have representations as orthonormal matrices and as such do not reside in Euclidean space in the usual sense. There are a variety of metrics which allow us to determine distance matrices that can be used to realize the Grassmannian as an embedding in Euclidean space. We illustrate that we can achieve an approximately isometric embedding of the Grassmann manifold using the chordal metric, while this is not the case with geodesic distances. However, non-isometric embeddings generated by using a pseudometric on the Grassmannian lead to the best classification results. We observe that as the dimension of the Grassmannian grows, the accuracy of the classification grows to 100% on two illustrative examples. We also observe a decrease in classification rates if the dimension of the points on the Grassmannian is too large for the dimension of the Euclidean space. We use sparse support vector machines to perform additional model reduction. The resulting classifier selects a subset of dimensions of the embedding without loss in classification performance.

Preprint, 2015, arXiv

HEP-FCE Working Group on Libraries and Tools

This is a report from the Libraries and Tools Working Group of the High Energy Physics Forum for Computational Excellence. It presents the vision of the working group for how the HEP software community may organize and be supported in order to more efficiently share and develop common software libraries and tools across the world's diverse set of HEP experiments. It gives prioritized recommendations for achieving this goal and provides a survey of a select number of areas in the current HEP software library and tools landscape. The survey identifies aspects which support this goal and areas with opportunities for improvements. The survey covers event processing software frameworks, software development, data management, workflow and workload management, geometry information management and conditions databases.

Preprint, 2015, arXiv

High Energy Physics Forum for Computational Excellence: Working Group Reports (I. Applications Software II. Software Libraries and Tools III. Systems)

Computing plays an essential role in all aspects of high energy physics. As computational technology evolves rapidly in new directions, and data throughput and volume continue to follow a steep trend-line, it is important for the HEP community to develop an effective response to a series of expected challenges. In order to help shape the desired response, the HEP Forum for Computational Excellence (HEP-FCE) initiated a roadmap planning activity with two key overlapping drivers -- 1) software effectiveness, and 2) infrastructure and expertise advancement. The HEP-FCE formed three working groups, 1) Applications Software, 2) Software Libraries and Tools, and 3) Systems (including systems software), to provide an overview of the current status of HEP computing and to present findings and opportunities for the desired HEP computational roadmap. The final versions of the reports are combined in this document, and are presented along with introductory material.

Preprint, 2015, arXiv

Outcome prediction in mathematical models of immune response to infection

Clinicians need to predict patient outcomes with high accuracy as early as possible after disease inception. In this manuscript, we show that patient-to-patient variability sets a fundamental limit on outcome prediction accuracy for a general class of mathematical models for the immune response to infection. However, accuracy can be increased at the expense of delayed prognosis. We investigate several systems of ordinary differential equations (ODEs) that model the host immune response to a pathogen load. Advantages of systems of ODEs for investigating the immune response to infection include the ability to collect data on large numbers of 'virtual patients', each with a given set of model parameters, and obtain many time points during the course of the infection. We implement patient-to-patient variability $v$ in the ODE models by randomly selecting the model parameters from Gaussian distributions with variance $v$ that are centered on physiological values. We use logistic regression with one-versus-all classification to predict the discrete steady-state outcomes of the system. We find that the prediction algorithm achieves near $100\%$ accuracy for $v=0$, and the accuracy decreases with increasing $v$ for all ODE models studied. The fact that multiple steady-state outcomes can be obtained for a given initial condition, i.e. the basins of attraction overlap in the space of initial conditions, limits the prediction accuracy for $v>0$. Increasing the elapsed time of the variables used to train and test the classifier increases the prediction accuracy, while adding explicit external noise to the ODE models decreases the prediction accuracy. Our results quantify the competition between early prognosis and high prediction accuracy that is frequently encountered by clinicians.
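The "virtual patient" construction can be illustrated with a deliberately simple sketch. The one-variable model $dP/dt = gP(1-P) - cP$, the nominal values of the growth rate $g$ and clearance rate $c$, the outcome threshold, and the clamping of rates at zero are all hypothetical choices made here for illustration; they are not the ODE systems or parameters studied in the paper. The only element taken from the abstract is that each parameter is drawn from a Gaussian with variance $v$ (so standard deviation $\sqrt{v}$) centered on a nominal value.

```python
import math
import random

def simulate(growth, clearance, p0=0.1, t_end=50.0, dt=0.01):
    """Forward-Euler integration of the toy model dP/dt = g*P*(1-P) - c*P."""
    p = p0
    for _ in range(int(t_end / dt)):
        p += dt * (growth * p * (1.0 - p) - clearance * p)
    return p

def virtual_patients(n, v, growth0=1.0, clearance0=0.8, seed=0):
    """Label n virtual patients whose parameters have variance v around nominal values."""
    rng = random.Random(seed)
    sigma = math.sqrt(v)  # a variance of v means a standard deviation of sqrt(v)
    outcomes = []
    for _ in range(n):
        # Assumption of this sketch: rates are clamped at zero to stay meaningful.
        g = max(0.0, rng.gauss(growth0, sigma))
        c = max(0.0, rng.gauss(clearance0, sigma))
        final_load = simulate(g, c)
        outcomes.append("persistent" if final_load > 0.01 else "cleared")
    return outcomes
```

With `v=0` every virtual patient shares the same parameters and therefore the same outcome, whereas for `v>0` the same initial pathogen load can lead to either outcome, which is the overlap of basins that the abstract identifies as the limit on prediction accuracy.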

Preprint, 2012, arXiv

Locally Linear Embedding Clustering Algorithm for Natural Imagery

The ability to characterize the color content of natural imagery is an important application of image processing. The pixel by pixel coloring of images may be viewed naturally as points in color space, and the inherent structure and distribution of these points affords a quantization, through clustering, of the color information in the image. In this paper, we present a novel topologically driven clustering algorithm that permits segmentation of the color features in a digital image. The algorithm blends Locally Linear Embedding (LLE) and vector quantization by mapping color information to a lower dimensional space, identifying distinct color regions, and classifying pixels together based on both a proximity measure and color content. It is observed that these techniques permit a significant reduction in color resolution while maintaining the visually important features of images.
