Source author record

James P. Long

James P. Long appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Applications, astro-ph.IM, Methodology, cond-mat.mtrl-sci, Machine Learning, physics.optics, astro-ph.SR, cond-mat.mes-hall, math.ST, Quantitative Methods, Statistics Theory

Catalog footprint

What is connected

11 works

11 topics

4 close collaborators


preprint 2025 arXiv

Causal Discovery with Mixed Latent Confounding via Precision Decomposition

We study causal discovery from observational data in linear Gaussian systems affected by mixed latent confounding, where some unobserved factors act broadly across many variables while others influence only small subsets. This setting is common in practice and poses a challenge for existing methods: differentiable and score-based DAG learners can misinterpret global latent effects as causal edges, while latent-variable graphical models recover only undirected structure. We propose DCL-DECOR, a modular, precision-led pipeline that separates these roles. The method first isolates pervasive latent effects by decomposing the observed precision matrix into a structured component and a low-rank component. The structured component corresponds to the conditional distribution after accounting for pervasive confounders and retains only local dependence induced by the causal graph and localized confounding. A correlated-noise DAG learner is then applied to this deconfounded representation to recover directed edges while modeling remaining structured error correlations, followed by a simple reconciliation step to enforce bow-freeness. We provide identifiability results that characterize the recoverable causal target under mixed confounding and show how the overall problem reduces to well-studied subproblems with modular guarantees. Synthetic experiments that vary the strength and dimensionality of pervasive confounding demonstrate consistent improvements in directed edge recovery over applying correlated-noise DAG learning directly to the confounded data.
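The decomposition step in this abstract rests on a standard Gaussian identity: when latent variables are marginalized out, the precision matrix of the observed variables equals the conditional precision given the latents minus a term whose rank is at most the number of latents. The sketch below checks that identity numerically. It is not DCL-DECOR itself; the matrix sizes and entries (`K_oo`, `K_oh`, `K_hh`) are illustrative assumptions, and no estimation from data is attempted.

```python
import numpy as np

p, r = 6, 2  # observed variables, pervasive latent factors (illustrative sizes)

# Precision of the observed block given the latents: sparse (tridiagonal here).
K_oo = 3.0 * np.eye(p) + 0.5 * (np.eye(p, k=1) + np.eye(p, k=-1))
# Each latent factor touches every observed variable (pervasive confounding).
K_oh = 0.3 * np.column_stack([np.ones(p), np.tile([1.0, -1.0], p // 2)])
K_hh = 2.0 * np.eye(r)

# Joint precision over (observed, latent), then marginalize out the latents.
K = np.block([[K_oo, K_oh], [K_oh.T, K_hh]])
Sigma_obs = np.linalg.inv(K)[:p, :p]
Theta_obs = np.linalg.inv(Sigma_obs)

# Schur complement: observed precision = structured part minus low-rank part.
L = K_oh @ np.linalg.inv(K_hh) @ K_oh.T
decomposition_holds = np.allclose(Theta_obs, K_oo - L)
low_rank = np.linalg.matrix_rank(L)
```

Here `K_oo` plays the role of the structured component (the precision of the observed variables conditional on the latents) and `L` the low-rank component contributed by the pervasive factors.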

preprint 2022 arXiv

Causal Models, Prediction, and Extrapolation in Cell Line Perturbation Experiments

In cell line perturbation experiments, a collection of cells is perturbed with external agents (e.g. drugs) and responses such as protein expression are measured. Due to cost constraints, only a small fraction of all possible perturbations can be tested in vitro. This has led to the development of computational (in silico) models which can predict cellular responses to perturbations. Perturbations with clinically interesting predicted responses can be prioritized for in vitro testing. In this work, we compare causal and non-causal regression models for perturbation response prediction in a melanoma cancer cell line. The current best performing method on this data set is Cellbox, which models how proteins causally affect each other using a system of ordinary differential equations (ODEs). We derive a closed form solution to the Cellbox system of ODEs in the linear case. These analytic results facilitate comparison of Cellbox to regression approaches. We show that causal models such as Cellbox, while requiring more assumptions, enable extrapolation in ways that non-causal regression models cannot. For example, causal models can predict responses for never-before-tested drugs. We illustrate these strengths and weaknesses in simulations. In an application to the melanoma cell line data, we find that regression models outperform the Cellbox causal model.

preprint 2020 arXiv

A Flexible Procedure for Mixture Proportion Estimation in Positive-Unlabeled Learning

Positive-unlabeled (PU) learning considers two samples, a positive set P with observations from only one class and an unlabeled set U with observations from two classes. The goal is to classify observations in U. Class mixture proportion estimation (MPE) in U is a key step in PU learning. Blanchard et al. [2010] showed that MPE in PU learning is a generalization of the problem of estimating the proportion of true null hypotheses in multiple testing problems. Motivated by this idea, we propose reducing the problem to one dimension via construction of a probabilistic classifier trained on the P and U data sets followed by application of a one-dimensional mixture proportion method from the multiple testing literature to the observation class probabilities. The flexibility of this framework lies in the freedom to choose the classifier and the one-dimensional MPE method. We prove consistency of two mixture proportion estimators using bounds from empirical process theory, develop tuning-parameter-free implementations, and demonstrate that they have competitive performance on simulated waveform data and a protein signaling problem.

preprint 2016 arXiv

A Study of Functional Depths

Functional depth is used for ranking functional observations from most outlying to most typical. The ranks produced by functional depth have been proposed as the basis for functional classifiers, rank tests, and data visualization procedures. Many of the proposed functional depths are invariant to domain permutation, an unusual property for a functional data analysis procedure. Essentially these depths treat functional data as if it were multivariate data. In this work, we compare the performance of several existing functional depths to a simple adaptation of an existing multivariate depth notion, $L^\infty$ depth ($L^{\infty}D$). On simulated and real data, we show $L^{\infty}D$ has performance comparable or superior to several existing notions of functional depth. In addition, we review how depth functions are evaluated and propose some improvements. In particular, we show that empirical depth function asymptotics can be misleading and instead propose a new method, the rank-rank plot, for evaluating empirical depth rank stability.
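As a rough illustration of what an $L^\infty$ depth might look like on discretized curves, the sketch below assumes the usual $L^p$-depth form from the multivariate literature, depth $= 1 / (1 + \text{mean sup-norm distance to the sample})$. The abstract does not state the formula, so this form, the function name `linf_depth`, and the toy curves are assumptions rather than the paper's definition.

```python
import numpy as np

def linf_depth(curves):
    """Empirical L-infinity depth of each curve relative to the whole sample.

    curves: array of shape (n_curves, n_grid_points), all on a common grid.
    Assumed form: 1 / (1 + average sup-norm distance to the sample curves).
    """
    curves = np.asarray(curves, dtype=float)
    # Pairwise sup-norm distances between curves, shape (n_curves, n_curves).
    dist = np.abs(curves[:, None, :] - curves[None, :, :]).max(axis=2)
    return 1.0 / (1.0 + dist.mean(axis=1))

# Three constant toy curves at levels 0, 1 and 5 on a two-point grid.
toy = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
depths = linf_depth(toy)
deepest = int(np.argmax(depths))  # index of the most typical curve
```

Larger values mark more typical curves, so sorting by `depths` in increasing order ranks curves from most outlying to most typical. Because the sup norm ignores the ordering of grid points, this sketch is also unchanged by permuting the domain.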

preprint 2015 arXiv

A Multiband Generalization of the Analysis of Variance Period Estimation Algorithm and the Effect of Inter-band Observing Cadence on Period Recovery Rate

We present a new method of extending the single band Analysis of Variance period estimation algorithm to multiple bands. Using SDSS Stripe 82 RR Lyrae, we show that when there are few observations per band and the observations are non-simultaneous, period recovery rates improve by up to approximately 60%. We also investigate the effect of inter-band observing cadence on period recovery rates. We find that using non-simultaneous observation times between bands is ideal for the multiband method, and using simultaneous multiband data is only marginally better than using single band data. These results will be particularly useful in planning observing cadences for wide-field astronomical imaging surveys such as LSST. They also have the potential to improve the extraction of transient data from surveys with few ($\lesssim 30$) observations per band across several bands, such as the Dark Energy Survey.

preprint 2015 arXiv

Photoinduced tunability of the Reststrahlen band in 4H-SiC

Materials with a negative dielectric permittivity (e.g. metals) display high reflectance and can be shaped into nanoscale optical resonators exhibiting extreme mode confinement, a central theme of nanophotonics. However, the ability to actively tune these effects remains elusive. By photoexciting free carriers in 4H-SiC, we induce dramatic changes in reflectance near the "Reststrahlen band" where the permittivity is negative due to charge oscillations of the polar optical phonons in the mid-infrared. We infer carrier-induced changes in the permittivity required for useful tunability (~40 cm$^{-1}$) in nanoscale resonators, providing a direct avenue towards the realization of actively tunable nanophotonic devices in the mid-infrared to terahertz spectral range.

preprint 2014 arXiv

Estimating a Common Period for a Set of Irregularly Sampled Functions with Applications to Periodic Variable Star Data

We consider the estimation of a common period for a set of functions sampled at irregular intervals. The problem arises in astronomy, where the functions represent a star's brightness observed over time through different photometric filters. While current methods can estimate periods accurately provided that the brightness is well-sampled in at least one filter, there are no existing methods that can provide accurate estimates when no brightness function is well-sampled. In this paper we introduce two new methods for period estimation when brightnesses are poorly-sampled in all filters. The first, multiband generalized Lomb-Scargle (MGLS), extends the frequently used Lomb-Scargle method in a way that naïvely combines information across filters. The second, penalized generalized Lomb-Scargle (PGLS), builds on the first by more intelligently borrowing strength across filters. Specifically, we incorporate constraints on the phases and amplitudes across the different functions using a non-convex penalized likelihood function. We develop a fast algorithm to optimize the penalized likelihood by combining block coordinate descent with the majorization-minimization (MM) principle. We illustrate our methods on synthetic and real astronomy data. Both advance the state-of-the-art in period estimation; however, PGLS significantly outperforms MGLS when all functions are extremely poorly-sampled.
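One plausible reading of combining information across filters in the simplest way is to fit an independent sinusoid (with its own mean, amplitude and phase) in each band at a shared trial frequency, sum the residual sums of squares over bands, and pick the frequency that minimizes the total. The sketch below does this on noise-free synthetic data. It is an assumption about the general idea, not the paper's MGLS or PGLS implementation; `multiband_rss`, `estimate_frequency`, the frequency grid and the simulated bands are all hypothetical.

```python
import numpy as np

def multiband_rss(freq, bands):
    """Total residual sum of squares of per-band sinusoid fits at one frequency."""
    total = 0.0
    for t, y in bands:
        phase = 2.0 * np.pi * freq * t
        # Design matrix: intercept plus one harmonic, fitted separately per band.
        X = np.column_stack([np.ones_like(t), np.cos(phase), np.sin(phase)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        total += np.sum((y - X @ coef) ** 2)
    return total

def estimate_frequency(bands, freqs):
    """Grid search for the shared frequency with the smallest total RSS."""
    rss = np.array([multiband_rss(f, bands) for f in freqs])
    return freqs[np.argmin(rss)]

# Two sparsely and irregularly sampled bands sharing one frequency.
rng = np.random.default_rng(0)
true_freq = 0.5
bands = []
for amp, shift in [(1.0, 0.0), (0.6, 0.4)]:
    t = np.sort(rng.uniform(0.0, 20.0, size=15))
    y = 10.0 + amp * np.sin(2.0 * np.pi * true_freq * t + shift)
    bands.append((t, y))

freqs = np.linspace(0.1, 2.0, 191)  # trial frequencies, step 0.01
best = estimate_frequency(bands, freqs)
```

The estimated period is `1 / best`. A penalized variant along the lines the abstract describes would additionally tie the per-band amplitudes and phases together instead of fitting them independently.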

preprint 2014 arXiv

Kernel Density Estimation with Berkson Error

Given a sample $\{X_i\}_{i=1}^n$ from $f_X$, we construct kernel density estimators for $f_Y$, the convolution of $f_X$ with a known error density $f_ε$. This problem is known as density estimation with Berkson error and has applications in epidemiology and astronomy. Little is understood about bandwidth selection for Berkson density estimation. We compare three approaches to selecting the bandwidth both asymptotically, using large sample approximations to the MISE, and at finite samples, using simulations. Our results highlight the relationship between the structure of the error $f_ε$ and the optimal bandwidth. In particular, the results demonstrate the importance of smoothing when the error term $f_ε$ is concentrated near 0. We propose a data-driven bandwidth estimator and test its performance on NO$_2$ exposure data.
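To make the estimator concrete, the sketch below assumes a Gaussian kernel with bandwidth `h` and a Gaussian error density with standard deviation `sigma`. Under those assumptions, convolving the kernel estimate of $f_X$ with $f_ε$ has a closed form: an equal-weight mixture of normals centered at the $X_i$ with variance $h^2 + \sigma^2$. The function name `berkson_kde` and the sample values are hypothetical, and the bandwidth selectors compared in the paper are not reproduced.

```python
import numpy as np

def berkson_kde(y, x, h, sigma):
    """Estimate f_Y at points y, assuming Gaussian kernel and Gaussian error.

    x: sample from f_X; h: kernel bandwidth; sigma: known error std. dev.
    The result is the Gaussian KDE of f_X convolved with N(0, sigma^2).
    """
    y = np.atleast_1d(np.asarray(y, dtype=float))
    x = np.asarray(x, dtype=float)
    var = h ** 2 + sigma ** 2  # variances add under convolution of normals
    z = y[:, None] - x[None, :]
    dens = np.exp(-0.5 * z ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    return dens.mean(axis=1)

sample = np.array([0.0, 1.0])
grid = np.linspace(-10.0, 11.0, 4001)
density = berkson_kde(grid, sample, h=0.3, sigma=0.5)
total_mass = density.sum() * (grid[1] - grid[0])  # Riemann-sum check, close to 1
```

Setting `h=0` still gives a smooth estimate because the error density alone spreads each observation, which is one way to see why the amount of smoothing needed depends on how concentrated $f_ε$ is.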

preprint 2013 arXiv

Electronic Hybridization of Large-Area Stacked Graphene Films

Direct, tunable coupling between individually assembled graphene layers is a next step towards designer two-dimensional (2D) crystal systems, with relevance for fundamental studies and technological applications. Here we describe the fabrication and characterization of large-area (> cm$^2$), coupled bilayer graphene on SiO$_2$/Si substrates. Stacking two graphene films leads to direct electronic interactions between layers, where the resulting film properties are determined by the local twist angle. Polycrystalline bilayer films have a "stained-glass window" appearance explained by the emergence of a narrow absorption band in the visible spectrum that depends on twist angle. Direct measurement of layer orientation via electron diffraction, together with Raman and optical spectroscopy, confirms the persistence of clean interfaces over large areas. Finally, we demonstrate that interlayer coupling can be reversibly turned off through chemical modification, enabling optical-based chemical detection schemes. Together, these results suggest that individual 2D crystals can be individually assembled to form electronically coupled systems suitable for large-scale applications.

preprint 2012 arXiv

Optimizing Automated Classification of Periodic Variable Stars in New Synoptic Surveys

Efficient and automated classification of periodic variable stars is becoming increasingly important as the scale of astronomical surveys grows. Several recent papers have used methods from machine learning and statistics to construct classifiers on databases of labeled, multi-epoch sources with the intention of using these classifiers to automatically infer the classes of unlabeled sources from new surveys. However, the same source observed with two different synoptic surveys will generally yield different derived metrics (features) from the light curve. Since such features are used in classifiers, this survey-dependent mismatch in feature space will typically lead to degraded classifier performance. In this paper we show how and why feature distributions change using OGLE and Hipparcos light curves. To overcome survey systematics, we apply a method, noisification, which attempts to empirically match distributions of features between the labeled sources used to construct the classifier and the unlabeled sources we wish to classify. Results from simulated and real-world light curves show that noisification can significantly improve classifier performance. In a three-class problem using light curves from Hipparcos and OGLE, noisification reduces the classifier error rate from 27.0% to 7.0%. We recommend that noisification be used for upcoming surveys such as Gaia and LSST and describe some of the promises and challenges of applying noisification to these surveys.

preprint 2011 arXiv

Active Learning to Overcome Sample Selection Bias: Application to Photometric Variable Star Classification

Despite the great promise of machine-learning algorithms to classify and predict astrophysical parameters for the vast numbers of astrophysical sources and transients observed in large-scale surveys, the peculiarities of the training data often manifest as strongly biased predictions on the data of interest. Typically, training sets are derived from historical surveys of brighter, more nearby objects than those from more extensive, deeper surveys (testing data). This sample selection bias can cause catastrophic errors in predictions on the testing data because a) standard assumptions for machine-learned model selection procedures break down and b) dense regions of testing space might be completely devoid of training data. We explore possible remedies to sample selection bias, including importance weighting (IW), co-training (CT), and active learning (AL). We argue that AL, in which the data whose inclusion in the training set would most improve predictions on the testing set are queried for manual follow-up, is an effective approach and is appropriate for many astronomical applications. For a variable star classification problem on a well-studied set of stars from Hipparcos and OGLE, AL is the optimal method in terms of error rate on the testing data, beating the off-the-shelf classifier by 3.4% and the other proposed methods by at least 3.0%. To aid with manual labeling of variable stars, we developed a web interface which allows for easy light curve visualization and querying of external databases. Finally, we apply active learning to classify variable stars in the ASAS survey, finding dramatic improvement in our agreement with the ACVS catalog, from 65.5% to 79.5%, and a significant increase in the classifier's average confidence for the testing set, from 14.6% to 42.9%, after a few AL iterations.