Source author record

Bill Howe

Bill Howe appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computer Vision Databases Distributed, Parallel, and Cluster Computing Artificial Intelligence Computation and Language cs.CY Digital Libraries Information Retrieval Mathematical Software Programming Languages Social and Information Networks

Catalog footprint

What is connected

6works

11topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also induces overthinking: models often generate low-yield reasoning that increases inference cost and latency. This inefficiency is especially problematic in low-data fine-tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large-scale teacher distillation or heavy test-time control. To address this, we propose STOP (Structured On-policy Pruning), an on-policy algorithm for analyzing and pruning long-form reasoning traces. STOP constructs self-distilled traces from the model. Then it maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning-tree construction. On top of this interface, we introduce ECN (Earliest Correct Node), which retains the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post-solution reasoning while preserving semantic continuity. Experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B across GSM8K, Math 500, and AIME 2024 show that STOP reduces generated tokens by 19.4-42.4% while largely preserving accuracy in low-data fine-tuning. Beyond efficiency, our analyses show that STOP induces much smaller distributional shift than teacher-guided pruning, improves the structural efficiency of generated reasoning, and reallocates reasoning effort away from redundant verification and backtracking toward more productive exploration.

preprint2023arXiv

Adapting to Skew: Imputing Spatiotemporal Urban Data with 3D Partial Convolutions and Biased Masking

We adapt image inpainting techniques to impute large, irregular missing regions in urban settings characterized by sparsity, variance in both space and time, and anomalous events. Missing regions in urban data can be caused by sensor or software failures, data quality issues, interference from weather events, incomplete data collection, or varying data use regulations; any missing data can render the entire dataset unusable for downstream applications. To ensure coverage and utility, we adapt computer vision techniques for image inpainting to operate on 3D histograms (2D space + 1D time) commonly used for data exchange in urban settings. Adapting these techniques to the spatiotemporal setting requires handling skew: urban data tend to follow population density patterns (small dense regions surrounded by large sparse areas); these patterns can dominate the learning process and fool the model into ignoring local or transient effects. To combat skew, we 1) train simultaneously in space and time, and 2) focus attention on dense regions by biasing the masks used for training to the skew in the data. We evaluate the core model and these two extensions using the NYC taxi data and the NYC bikeshare data, simulating different conditions for missing data. We show that the core model is effective qualitatively and quantitatively, and that biased masking during training reduces error in a variety of scenarios. We also articulate a tradeoff in varying the number of timesteps per training sample: too few timesteps and the model ignores transient events; too many timesteps and the model is slow to train with limited performance gain.

preprint2016arXiv

From NoSQL Accumulo to NewSQL Graphulo: Design and Utility of Graph Algorithms inside a BigTable Database

Google BigTable's scale-out design for distributed key-value storage inspired a generation of NoSQL databases. Recently the NewSQL paradigm emerged in response to analytic workloads that demand distributed computation local to data storage. Many such analytics take the form of graph algorithms, a trend that motivated the GraphBLAS initiative to standardize a set of matrix math kernels for building graph algorithms. In this article we show how it is possible to implement the GraphBLAS kernels in a BigTable database by presenting the design of Graphulo, a library for executing graph algorithms inside the Apache Accumulo database. We detail the Graphulo implementation of two graph algorithms and conduct experiments comparing their performance to two main-memory matrix math systems. Our results shed insight into the conditions that determine when executing a graph algorithm is faster inside a database versus an external system---in short, that memory requirements and relative I/O are critical factors.

preprint2016arXiv

Lara: A Key-Value Algebra underlying Arrays and Relations

Data processing systems roughly group into families such as relational, array, graph, and key-value. Many data processing tasks exceed the capabilities of any one family, require data stored across families, or run faster when partitioned onto multiple families. Discovering ways to execute computation among multiple available systems, let alone discovering an optimal execution plan, is challenging given semantic differences between disparate families of systems. In this paper we introduce a new algebra, Lara, which underlies and unifies algebras representing the families above in order to facilitate translation between systems. We describe the operations and objects of Lara---union, join, and ext on associative tables---and show her properties and equivalences to other algebras. Multi-system optimization has a bright future, in which we proffer Lara for the role of universal connector.

preprint2016arXiv

Viziometrics: Analyzing Visual Information in the Scientific Literature

Scientific results are communicated visually in the literature through diagrams, visualizations, and photographs. These information-dense objects have been largely ignored in bibliometrics and scientometrics studies when compared to citations and text. In this paper, we use techniques from computer vision and machine learning to classify more than 8 million figures from PubMed into 5 figure types and study the resulting patterns of visual information as they relate to impact. We find that the distribution of figures and figure types in the literature has remained relatively constant over time, but can vary widely across field and topic. Remarkably, we find a significant correlation between scientific impact and the use of visual information, where higher impact papers tend to include more diagrams, and to a lesser extent more plots and photographs. To explore these results and other ways of extracting this visual information, we have built a visual browser to illustrate the concept and explore design alternatives for supporting viziometric analysis and organizing visual information. We use these results to articulate a new research agenda -- viziometrics -- to study the organization and presentation of visual information in the scientific literature.

preprint2010arXiv

Astronomy in the Cloud: Using MapReduce for Image Coaddition

In the coming decade, astronomical surveys of the sky will generate tens of terabytes of images and detect hundreds of millions of sources every night. The study of these sources will involve computation challenges such as anomaly detection and classification, and moving object tracking. Since such studies benefit from the highest quality data, methods such as image coaddition (stacking) will be a critical preprocessing step prior to scientific investigation. With a requirement that these images be analyzed on a nightly basis to identify moving sources or transient objects, these data streams present many computational challenges. Given the quantity of data involved, the computational load of these problems can only be addressed by distributing the workload over a large number of nodes. However, the high data throughput demanded by these applications may present scalability challenges for certain storage architectures. One scalable data-processing method that has emerged in recent years is MapReduce, and in this paper we focus on its popular open-source implementation called Hadoop. In the Hadoop framework, the data is partitioned among storage attached directly to worker nodes, and the processing workload is scheduled in parallel on the nodes that contain the required input data. A further motivation for using Hadoop is that it allows us to exploit cloud computing resources, e.g., Amazon's EC2. We report on our experience implementing a scalable image-processing pipeline for the SDSS imaging database using Hadoop. This multi-terabyte imaging dataset provides a good testbed for algorithm development since its scope and structure approximate future surveys. First, we describe MapReduce and how we adapted image coaddition to the MapReduce framework. Then we describe a number of optimizations to our basic approach and report experimental results comparing their performance.

Bill Howe

What is connected

Connect this record

See the researcher in context

Building this map preview

6 published item(s)

STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

Adapting to Skew: Imputing Spatiotemporal Urban Data with 3D Partial Convolutions and Biased Masking

From NoSQL Accumulo to NewSQL Graphulo: Design and Utility of Graph Algorithms inside a BigTable Database

Lara: A Key-Value Algebra underlying Arrays and Relations

Viziometrics: Analyzing Visual Information in the Scientific Literature

Astronomy in the Cloud: Using MapReduce for Image Coaddition