Source author record

Bobbie Chern

Bobbie Chern appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Information Theory math.IT math.CO math.PR Artificial Intelligence Machine Learning Methodology

Catalog footprint

What is connected

6 works

7 topics

4 close collaborators

preprint, 2022, arXiv

Adaptive Sampling Strategies to Construct Equitable Training Datasets

In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities, often performing worse for members of traditionally underserved groups. One factor contributing to these performance gaps is a lack of representation in the data the models are trained on. It is often unclear, however, how to operationalize representativeness in specific applications. Here we formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem. We consider a setting where a model builder must decide how to allocate a fixed data collection budget to gather training data from different subgroups. We then frame dataset creation as a constrained optimization problem, in which one maximizes a function of group-specific performance metrics based on (estimated) group-specific learning rates and costs per sample. This flexible approach incorporates preferences of model-builders and other stakeholders, as well as the statistical properties of the learning task. When data collection decisions are made sequentially, we show that under certain conditions this optimization problem can be efficiently solved even without prior knowledge of the learning rates. To illustrate our approach, we conduct a simulation study of polygenic risk scores on synthetic genomic data -- an application domain that often suffers from non-representative data collection. We find that our adaptive sampling strategy outperforms several common data collection heuristics, including equal and proportional sampling, demonstrating the value of strategic dataset design for building equitable models.
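The budget-allocation idea in this abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes each group's error follows a power-law learning curve a_g * n^(-b_g) with known parameters, uses a plain sum of group errors as the objective, and spends the budget greedily. The function name `allocate_budget` and all parameters are hypothetical. With unequal costs the greedy rule is only a heuristic.

```python
def allocate_budget(a, b, cost, budget, n0=1):
    """Greedy split of a data collection budget across groups.

    Assumption (not from the paper): group g has error a[g] * n ** (-b[g])
    after n samples, and the objective is the sum of group errors.
    Each group starts with n0 samples that are treated as already paid for.
    """
    n = [n0] * len(a)
    remaining = float(budget)

    def err(g, k):
        return a[g] * k ** (-b[g])

    while True:
        best, best_gain = None, 0.0
        for g in range(len(a)):
            if cost[g] <= remaining:
                # Error reduction per unit cost from one more sample of group g.
                gain = (err(g, n[g]) - err(g, n[g] + 1)) / cost[g]
                if gain > best_gain:
                    best, best_gain = g, gain
        if best is None:  # nothing affordable, or no further improvement
            break
        n[best] += 1
        remaining -= cost[best]
    return n


if __name__ == "__main__":
    # Group 0 starts with a higher error level, so it receives more samples.
    print(allocate_budget([4.0, 1.0], [0.5, 0.5], [1.0, 1.0], budget=10))
```

In the sequential setting the abstract describes, the learning-curve parameters would be re-estimated as data arrives; this sketch keeps them fixed for simplicity.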

preprint, 2015, arXiv

Central Limit Theorems for some Set Partition Statistics

We prove the conjectured limiting normality for the number of crossings of a uniformly chosen set partition of [n] = {1,2,...,n}. The arguments use a novel stochastic representation and are also used to prove central limit theorems for the dimension index and the number of levels.
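For readers unfamiliar with the crossing statistic, here is a small sketch that counts crossings. It assumes the usual arc-diagram convention: consecutive elements of each block are joined by an arc, and two arcs (i, j) and (k, l) cross when i < k < j < l. The function names are illustrative.

```python
from itertools import combinations


def arcs(partition):
    """Arcs joining consecutive elements within each block."""
    out = []
    for block in partition:
        s = sorted(block)
        out.extend(zip(s, s[1:]))
    return out


def crossings(partition):
    """Count pairs of arcs (i, j), (k, l) with i < k < j < l."""
    ordered = sorted(arcs(partition))  # sorted by left endpoint
    return sum(
        1
        for (i, j), (k, l) in combinations(ordered, 2)
        if i < k < j < l
    )


if __name__ == "__main__":
    print(crossings([[1, 3], [2, 4]]))  # the two arcs cross
    print(crossings([[1, 4], [2, 3]]))  # nested arcs do not cross
```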

preprint, 2014, arXiv

On feedback in Gaussian multi-hop networks

The study of feedback has been mostly limited to single-hop communication settings. In this paper, we consider Gaussian networks where sources and destinations can communicate with the help of intermediate relays over multiple hops. We assume that links in the network can be bidirected, providing opportunities for feedback. We ask the following question: can the information transfer in both directions of a link be critical to maximizing the end-to-end communication rates in the network? Equivalently, could one of the directions in each bidirected link (and more generally at least one of the links forming a cycle) be shut down and the capacity of the network still be approximately maintained? We show that in any arbitrary Gaussian network with bidirected edges and cycles and unicast traffic, we can always identify a directed acyclic subnetwork that approximately maintains the capacity of the original network. For Gaussian networks with multiple-access and broadcast traffic, an acyclic subnetwork is sufficient to achieve every rate point in the capacity region of the original network; however, there may not be a single acyclic subnetwork that maintains the whole capacity region. For networks with multicast and multiple unicast traffic, on the other hand, bidirected information flow across certain links can be critically needed to maximize the end-to-end capacity region. These results can be regarded as generalizations of the conclusions regarding the usefulness of feedback in various single-hop Gaussian settings and can provide opportunities for simplifying operation in Gaussian multi-hop networks.

preprint, 2013, arXiv

Achieving the Capacity of the N-Relay Gaussian Diamond Network Within log N Bits

We consider the N-relay Gaussian diamond network where a source node communicates to a destination node via N parallel relays through a cascade of a Gaussian broadcast (BC) and a multiple access (MAC) channel. Introduced in 2000 by Schein and Gallager, the capacity of this relay network is unknown in general. The best currently available capacity approximation, independent of the coefficients and the SNRs of the constituent channels, is within an additive gap of 1.3 N bits, which follows from the recent capacity approximations for general Gaussian relay networks with arbitrary topology. In this paper, we approximate the capacity of this network within 2 log N bits. We show that two strategies can be used to achieve the information-theoretic cutset upper bound on the capacity of the network up to an additive gap of O(log N) bits, independent of the channel configurations and the SNRs. The first of these strategies is simple partial decode-and-forward. Here, the source node uses a superposition codebook to broadcast independent messages to the relays at appropriately chosen rates; each relay decodes its intended message and then forwards it to the destination over the MAC channel. A similar performance can also be achieved with compress-and-forward type strategies (such as quantize-map-and-forward and noisy network coding) that provide the 1.3 N-bit approximation for general Gaussian networks, but only if the relays quantize their observed signals at a resolution inversely proportional to the number of relay nodes N. This suggests that the rule of thumb in the current literature, which is to quantize the received signals at the noise level, can be highly suboptimal.

preprint, 2013, arXiv

Closed expressions for averages of set partition statistics

In studying the enumerative theory of super characters of the group of upper triangular matrices over a finite field, we found that the moments (mean, variance and higher moments) of novel statistics on set partitions have simple closed expressions as linear combinations of shifted Bell numbers. It is shown here that families of other statistics have similar moments. The coefficients in the linear combinations are polynomials in $n$. This allows exact enumeration of the moments for small $n$ to determine exact formulae for all $n$.
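As a concrete instance of a moment written as a linear combination of shifted Bell numbers, the sketch below uses a classical statistic, the number of blocks, rather than the novel statistics of the paper. Summed over all set partitions of [n], the number of blocks equals B_{n+1} - B_n, so the mean is (B_{n+1} - B_n) / B_n. The code checks this by brute-force enumeration for small n; all function names are illustrative.

```python
from fractions import Fraction


def bell_numbers(m):
    """Return [B_0, ..., B_m], computed with the Bell triangle."""
    row, bells = [1], [1]
    for _ in range(m):
        new = [row[-1]]
        for x in row:
            new.append(new[-1] + x)
        row = new
        bells.append(row[0])
    return bells


def set_partitions(n):
    """Yield every partition of {1, ..., n} as a list of blocks."""
    if n == 0:
        yield []
        return
    for p in set_partitions(n - 1):
        # Put n into each existing block in turn, or into a new block.
        for i in range(len(p)):
            yield p[:i] + [p[i] + [n]] + p[i + 1:]
        yield p + [[n]]


def mean_blocks_closed_form(n):
    """Mean number of blocks, written with shifted Bell numbers."""
    bells = bell_numbers(n + 1)
    return Fraction(bells[n + 1] - bells[n], bells[n])


def mean_blocks_brute_force(n):
    """Mean number of blocks by direct enumeration (small n only)."""
    sizes = [len(p) for p in set_partitions(n)]
    return Fraction(sum(sizes), len(sizes))


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, mean_blocks_closed_form(n), mean_blocks_brute_force(n))
```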

preprint, 2012, arXiv

Reference Based Genome Compression

DNA sequencing technology has advanced to a point where storage is becoming the central bottleneck in the acquisition and mining of more data. Large amounts of data are vital for genomics research, and generic compression tools, while viable, cannot offer the same savings as approaches tuned to inherent biological properties. We propose an algorithm to compress a target genome given a known reference genome. The proposed algorithm first generates a mapping from the reference to the target genome, and then compresses this mapping with an entropy coder. As an illustration of the performance: applying our algorithm to James Watson's genome with hg18 as a reference, we are able to reduce the 2991 megabyte (MB) genome down to 6.99 MB, while Gzip compresses it to 834.8 MB.
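The two-stage idea (build a mapping from reference to target, then compress the mapping) can be sketched at toy scale. This is not the paper's algorithm: it assumes Python's `difflib.SequenceMatcher` as the mapping step and `zlib` (DEFLATE) as a stand-in for the entropy coder, and it is far too slow for real genomes. The function names and the "C"/"I" operation format are hypothetical.

```python
import difflib
import json
import zlib


def compress(reference, target):
    """Encode target as copy/insert operations against reference, then deflate."""
    matcher = difflib.SequenceMatcher(None, reference, target, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Copy a run from the reference: start offset and length.
            ops.append(["C", i1, i2 - i1])
        elif tag in ("replace", "insert"):
            # Bases with no match in the reference are stored literally.
            ops.append(["I", target[j1:j2]])
        # "delete" needs no entry: the reference run is simply skipped.
    return zlib.compress(json.dumps(ops).encode("ascii"), 9)


def decompress(reference, blob):
    """Rebuild the target from the reference and the compressed mapping."""
    ops = json.loads(zlib.decompress(blob).decode("ascii"))
    parts = []
    for op in ops:
        if op[0] == "C":
            parts.append(reference[op[1]:op[1] + op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


if __name__ == "__main__":
    reference = "ACGTACGTTAGC" * 20
    target = reference[:50] + "G" + reference[51:120] + "TTT" + reference[120:]
    blob = compress(reference, target)
    print(len(target), len(blob), decompress(reference, blob) == target)
```

The decoder needs the same reference sequence as the encoder, which is what makes the stored mapping small when target and reference are similar.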