Source author record

Ben Johnson

Ben Johnson appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Machine Learning, Artificial Intelligence, Applications, astro-ph.CO, astro-ph.EP, astro-ph.GA, Biological Physics, Digital Libraries, Information Retrieval, Methodology, Software Engineering

Catalog footprint

What is connected

7 works

11 topics

4 close collaborators

preprint, 2026, arXiv

Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production

Academic research tends to focus on new models for document understanding, leaving a wide gap in the literature between model definition and running models at production scale. To close that gap, we present a microservice architecture that encapsulates pipelines of multiple models for classification, optical character recognition (OCR), and large language model (LLM) structured field extraction, and we report our experience running this pipeline on thousands of multi-page documents per hour. We describe our primary design decisions, including hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, use of asynchronous processing for the many IO-bound operations in the pipeline, and an independent, horizontal scaling strategy. Using batch profiling, we identified two surprising qualitative findings that shape production deployments: OCR, not language-model parsing, dominates end-to-end latency, and the system saturates at a concurrency determined by shared GPU-inference capacity rather than worker count. Our goal is to provide practitioners with concrete architectural patterns for building document understanding systems that work beyond the benchmark, effectively operationalizing models in production.
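The saturation finding in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `run_pipeline`, `gpu_slots`, and the simulated `ocr` call are hypothetical names, and a short sleep stands in for GPU-bound inference. The sketch assumes shared inference capacity behaves like a fixed number of slots, modeled here with an `asyncio.Semaphore`.

```python
import asyncio


async def run_pipeline(num_docs: int, num_workers: int, gpu_slots: int) -> int:
    """Process num_docs documents and return the peak number of
    simultaneous (simulated) GPU inference calls."""
    # Hypothetical stand-in for shared GPU-inference capacity.
    gpu = asyncio.Semaphore(gpu_slots)
    queue: asyncio.Queue = asyncio.Queue()
    for doc_id in range(num_docs):
        queue.put_nowait(doc_id)

    in_flight = 0
    peak = 0

    async def ocr(doc_id: int) -> str:
        nonlocal in_flight, peak
        async with gpu:  # workers block here once all slots are taken
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.001)  # simulated GPU-bound OCR call
            in_flight -= 1
        return f"text-{doc_id}"

    async def worker() -> None:
        # CPU-bound orchestration side: pull documents until the queue is empty.
        while True:
            try:
                doc_id = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            await ocr(doc_id)

    await asyncio.gather(*(worker() for _ in range(num_workers)))
    return peak


if __name__ == "__main__":
    for workers in (1, 4, 16):
        peak = asyncio.run(run_pipeline(num_docs=32, num_workers=workers, gpu_slots=2))
        print(f"workers={workers:2d} peak concurrent GPU calls={peak}")
```

In this toy model, adding workers beyond the number of GPU slots does not raise the number of inference calls in flight, which is one plausible reading of why concurrency is bounded by shared inference capacity rather than worker count.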

preprint, 2026, arXiv

Query-efficient model evaluation using cached responses

Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. For modern evaluation frameworks, generating and evaluating a response for all queries can be prohibitively expensive. In practice, responses from previously-evaluated models are often cached -- creating a potential opportunity to use this additional information to decrease the number of queries required to accurately evaluate a new model. In this paper, we introduce an approach for predicting benchmark performance that leverages cached model responses based on the Data Kernel Perspective Space (DKPS), a method for quantifying the relationship between models in the black-box setting. Theoretically, we show that DKPS-based methods are query-efficient under certain conditions. Empirically, we demonstrate that DKPS-based methods achieve the same mean absolute error as baselines with a substantially decreased query budget. We conclude by proposing an offline method for selecting a set of queries that maximizes the goodness-of-fit on reference models, improving prediction accuracy over random query selection.

preprint, 2020, arXiv

COVID-19 Kaggle Literature Organization

The world has faced the devastating outbreak of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2), or COVID-19, in 2020. Research in the subject matter was fast-tracked to such a point that scientists were struggling to keep up with new findings. With this increase in the scientific literature, there arose a need for organizing those documents. We describe an approach to organize and visualize the scientific literature on or related to COVID-19 using machine learning techniques so that papers on similar topics are grouped together. By doing so, the navigation of topics and related papers is simplified. We implemented this approach using the widely recognized CORD-19 dataset to present a publicly available proof of concept.

preprint, 2016, arXiv

Single-Molecule Imaging of Nav1.6 on the Surface of Hippocampal Neurons Reveals Somatic Nanoclusters

Voltage-gated sodium (Na$_\mathrm{v}$) channels are responsible for the depolarizing phase of the action potential in most nerve cells, and Na$_\mathrm{v}$ channel localization to the axon initial segment is vital to action potential initiation. Na$_\mathrm{v}$ channels in the soma play a role in the transfer of axonal output information to the rest of the neuron and in synaptic plasticity, although little is known about Na$_\mathrm{v}$ channel localization and dynamics within this neuronal compartment. This study uses single-particle tracking and photoactivation localization microscopy to analyze cell-surface Na$_\mathrm{v}$1.6 within the soma of cultured hippocampal neurons. Mean-square displacement analysis of individual trajectories indicated that half of the somatic Na$_\mathrm{v}$1.6 channels localized to stable nanoclusters $\sim$230 nm in diameter. Strikingly, these domains were stabilized at specific sites on the cell membrane for >30 min, notably via an ankyrin-independent mechanism, indicating that the means by which Na$_\mathrm{v}$1.6 nanoclusters are maintained in the soma is biologically different from axonal localization. Nonclustered Na$_\mathrm{v}$1.6 channels showed anomalous diffusion, as determined by mean-square-displacement analysis. High-density single-particle tracking of Na$_\mathrm{v}$ channels labeled with photoactivatable fluorophores in combination with Bayesian inference analysis was employed to characterize the surface nanoclusters. A subpopulation of mobile Na$_\mathrm{v}$1.6 was observed to be transiently trapped in the nanoclusters. Somatic Na$_\mathrm{v}$1.6 nanoclusters represent a new, to our knowledge, type of Na$_\mathrm{v}$ channel localization, and are hypothesized to be sites of localized channel regulation.
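For readers unfamiliar with the mean-square displacement (MSD) analysis mentioned in this abstract, the following is a minimal sketch of a time-averaged MSD for a single 2D trajectory. It is a generic textbook formulation, not the authors' analysis code; the function name `time_averaged_msd` and the toy tracks are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def time_averaged_msd(track: Sequence[Point], max_lag: int) -> List[float]:
    """Return the time-averaged MSD for lags 1..max_lag (in frames).

    For each lag, average the squared displacement over all pairs of
    positions in the track that are separated by that many frames.
    """
    if max_lag >= len(track):
        raise ValueError("max_lag must be smaller than the track length")
    msd = []
    for lag in range(1, max_lag + 1):
        squared = [
            (x2 - x1) ** 2 + (y2 - y1) ** 2
            for (x1, y1), (x2, y2) in zip(track, track[lag:])
        ]
        msd.append(sum(squared) / len(squared))
    return msd


if __name__ == "__main__":
    # Toy tracks (hypothetical data): an immobile particle and one moving
    # one unit per frame along x.
    immobile = [(0.0, 0.0)] * 10
    directed = [(float(i), 0.0) for i in range(10)]
    print(time_averaged_msd(immobile, 3))
    print(time_averaged_msd(directed, 3))
```

As a general rule of thumb, an MSD curve that grows linearly with lag indicates free diffusion, one that grows more slowly than linearly indicates anomalous subdiffusion, and one that plateaus suggests confinement to a small domain.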

preprint, 2015, arXiv

Bounds for maximum likelihood regular and non-regular DoA estimation in $K$-distributed noise

We consider the problem of estimating the direction of arrival of a signal embedded in $K$-distributed noise, when secondary data which contains noise only are assumed to be available. Based upon a recent formula of the Fisher information matrix (FIM) for complex elliptically distributed data, we provide a simple expression of the FIM with the two data sets framework. In the specific case of $K$-distributed noise, we show that, under certain conditions, the FIM for the deterministic part of the model can be unbounded, while the FIM for the covariance part of the model is always bounded. In the general case of elliptical distributions, we provide a sufficient condition for unboundedness of the FIM. Accurate approximations of the FIM for $K$-distributed noise are also derived when it is bounded. Additionally, the maximum likelihood estimator of the signal DoA and an approximated version are derived, assuming known covariance matrix: the latter is then estimated from secondary data using a conventional regularization technique. When the FIM is unbounded, an analysis of the estimators reveals a rate of convergence much faster than the usual $T^{-1}$. Simulations illustrate the different behaviors of the estimators, depending on the FIM being bounded or not.

preprint, 2015, arXiv

The Nitrogen Budget of Earth

We comprehensively compile and review N content in geologic materials to calculate a new N budget for Earth. Analyses of rocks and minerals, in conjunction with N-Ar geochemistry, demonstrate that the Bulk Silicate Earth (BSE) contains $\sim7\pm4$ times present atmospheric N ($4\times10^{18}$ kg N, PAN), with $27\pm16\times10^{18}$ kg N. Comparison to chondritic composition, after subtracting N sequestered into the core, yields a consistent result, with BSE N between $17\pm13\times10^{18}$ kg and $31\pm24\times10^{18}$ kg N. In the chondritic comparison we calculate a N mass in Earth's core ($180\pm110$ to $300\pm180\times10^{18}$ kg) and discuss the Moon as a proxy for the early mantle. Significantly, we find the majority of the planetary budget of N is in the solid Earth. The N estimate herein precludes the need for a "missing N" reservoir. Nitrogen-Ar systematics in mantle rocks and basalts identify two mantle reservoirs: MORB-source like (MSL) and high-N. High-N mantle is composed of young, N-rich material subducted from the surface and is identified in OIB and some xenoliths. In contrast, MSL appears to be made of old material, though a component of subducted material is evident in this reservoir as well. Using our new budget, we calculate a $\delta^{15}$N value for BSE plus atmosphere of $\sim2$‰. This value should be used when discussing bulk Earth N isotope evolution. Additionally, our work indicates that all surface N could pass through the mantle over Earth history, and the mantle may act as a long-term sink for N. Since N acts as a tracer of exchange between the atmosphere, oceans, and mantle over time, clarifying its distribution in the Earth is critical for evolutionary models concerned with Earth system evolution. We suggest that N be viewed in the same vein as carbon: it has a fast, biologically mediated cycle which connects it to a slow, tectonically-controlled geologic cycle.
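As a quick consistency check on the figures quoted in this abstract (a reader's arithmetic, not a calculation taken from the paper), dividing the BSE estimate by the present atmospheric N mass reproduces the stated ratio. The symbols $N_\mathrm{BSE}$ and $N_\mathrm{PAN}$ are shorthand introduced here, and the PAN value is treated as exact so that the uncertainty scales by the same factor:

$$
\frac{N_\mathrm{BSE}}{N_\mathrm{PAN}} = \frac{(27 \pm 16)\times10^{18}\ \mathrm{kg}}{4\times10^{18}\ \mathrm{kg}} = 6.75 \pm 4 \approx 7 \pm 4
$$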

preprint, 2010, arXiv

Mid-Infrared Spectral Indicators of Star-Formation and AGN Activity in Normal Galaxies

We investigate the use of mid-infrared PAH bands, continuum and emission lines as probes of star-formation and AGN activity in a sample of 100 'normal' and local (z~0.1) galaxies. The MIR spectra were obtained with the Spitzer IRS as part of the Spitzer-SDSS-GALEX Spectroscopic Survey (SSGSS), which includes multi-wavelength photometry from the UV to the FIR and optical spectroscopy. The spectra were decomposed using PAHFIT (Smith et al. 2007), which we find to yield PAH equivalent widths (EW) up to ~30 times larger than the commonly used spline methods. Based on correlations between PAH, continuum and emission line properties and optically derived physical properties (gas phase metallicity, radiation field hardness), we revisit the diagnostic diagram relating PAH EWs and [NeII]/[OIV] and find it more efficient at distinguishing weak AGNs from star-forming galaxies than when spline decompositions are used. The luminosities of individual MIR components (PAH, continuum, Ne and molecular hydrogen lines) are found to be tightly correlated to the total IR luminosity and can be used to estimate dust attenuation in the UV and in H$\alpha$ lines based on energy balance arguments.
