Source author record

Christopher Jung

Christopher Jung appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Artificial Intelligence Computer Science and Game Theory Digital Libraries econ.EM econ.TH Multiagent Systems physics.comp-ph physics.soc-ph Quantitative Methods

Catalog footprint

What is connected

8works

10topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Convex Dataset Valuation for Post-Training

Improving LLM performance on downstream tasks sometimes requires leveraging auxiliary datasets during post-training. In practice, however, developers face constraints on compute, labeling, and licensing costs that preclude using all available data, necessitating principled dataset-level selection. These constraints are increasingly shaped by dataset marketplaces, where data acquisition is governed by budgets and negotiation. We study dataset valuation as a subset selection problem during LLM post-training. Our goal is to identify and weight auxiliary datasets so as to maximize target task performance given constrained budgets. We first show that commonly used gradient alignment scores provide a reasonable yet incomplete valuation signal, as they ignore redundancy among datasets. To address this, we propose a scalable convex dataset-level valuation method based on kernel mean matching (KMM) in gradient space, which jointly accounts for alignment with the target task and redundancy across auxiliary datasets. Through extensive experiments across diverse post-training settings and tasks, we show that our approach consistently outperforms existing valuation baselines, achieving stronger performance with low computational overhead. Our results position dataset valuation as a practical decision tool for post-training data selection in market-constrained large language model settings. The code is available at https://github.com/uiuctml/convex_data_valuation.

preprint2022arXiv

Getting more with less? Why repowering onshore wind farms does not always lead to more wind power generation -- a German case study

The best wind locations are nowadays often occupied by old, less efficient and relatively small wind turbines. Many of them will soon reach the end of their operating lifetime, or lose financial support. Therefore, repowering comes to the fore. However, social acceptance and land use restrictions have been under constant change since the initial expansions, which makes less area available for new turbines, even on existing sites. For the example of Germany, this study assesses the repowering potential for onshore wind energy in high detail, on the basis of regionally differentiated land eligibility criteria. The results show that under the given regional criteria, repowering will decrease both operating capacity and annual energy yield by roughly 40\,\% compared to the status quo. This is because around half of the wind turbines are currently located in restricted areas, given newly enacted exclusion criteria. Sensitivity analyses on the exclusion criteria show that the minimum distance to discontinuous urban fabric is the most sensitive criterion in determining the number of turbines that can be repowered. As regulations on this can vary substantially across different regions, the location-specific methodology chosen here can assess the repowering potential more realistically than existing approaches.

preprint2022arXiv

Metric-Free Individual Fairness in Online Learning

We study an online learning problem subject to the constraint of individual fairness, which requires that similar individuals are treated similarly. Unlike prior work on individual fairness, we do not assume the similarity measure among individuals is known, nor do we assume that such measure takes a certain parametric form. Instead, we leverage the existence of an auditor who detects fairness violations without enunciating the quantitative measure. In each round, the auditor examines the learner's decisions and attempts to identify a pair of individuals that are treated unfairly by the learner. We provide a general reduction framework that reduces online classification in our model to standard online classification, which allows us to leverage existing online learning algorithms to achieve sub-linear regret and number of fairness violations. Surprisingly, in the stochastic setting where the data are drawn independently from a distribution, we are also able to establish PAC-style fairness and accuracy generalization guarantees (Rothblum and Yona [2018]), despite only having access to a very restricted form of fairness feedback. Our fairness generalization bound qualitatively matches the uniform convergence bound of Rothblum and Yona [2018], while also providing a meaningful accuracy generalization guarantee. Our results resolve an open question by Gillen et al. [2018] by showing that online learning under an unknown individual fairness constraint is possible even without assuming a strong parametric form of the underlying similarity measure.

preprint2022arXiv

Practical Adversarial Multivalid Conformal Prediction

We give a simple, generic conformal prediction method for sequential prediction that achieves target empirical coverage guarantees against adversarially chosen data. It is computationally lightweight -- comparable to split conformal prediction -- but does not require having a held-out validation set, and so all data can be used for training models from which to derive a conformal score. It gives stronger than marginal coverage guarantees in two ways. First, it gives threshold calibrated prediction sets that have correct empirical coverage even conditional on the threshold used to form the prediction set from the conformal score. Second, the user can specify an arbitrary collection of subsets of the feature space -- possibly intersecting -- and the coverage guarantees also hold conditional on membership in each of these subsets. We call our algorithm MVP, short for MultiValid Prediction. We give both theory and an extensive set of empirical evaluations.

preprint2022arXiv

Quantifying the Burden of Exploration and the Unfairness of Free Riding

We consider the multi-armed bandit setting with a twist. Rather than having just one decision maker deciding which arm to pull in each round, we have $n$ different decision makers (agents). In the simple stochastic setting, we show that a "free-riding" agent observing another "self-reliant" agent can achieve just $O(1)$ regret, as opposed to the regret lower bound of $Ω(\log t)$ when one decision maker is playing in isolation. This result holds whenever the self-reliant agent's strategy satisfies either one of two assumptions: (1) each arm is pulled at least $γ\ln t$ times in expectation for a constant $γ$ that we compute, or (2) the self-reliant agent achieves $o(t)$ realized regret with high probability. Both of these assumptions are satisfied by standard zero-regret algorithms. Under the second assumption, we further show that the free rider only needs to observe the number of times each arm is pulled by the self-reliant agent, and not the rewards realized. In the linear contextual setting, each arm has a distribution over parameter vectors, each agent has a context vector, and the reward realized when an agent pulls an arm is the inner product of that agent's context vector with a parameter vector sampled from the pulled arm's distribution. We show that the free rider can achieve $O(1)$ regret in this setting whenever the free rider's context is a small (in $L_2$-norm) linear combination of other agents' contexts and all other agents pull each arm $Ω(\log t)$ times with high probability. Again, this condition on the self-reliant players is satisfied by standard zero-regret algorithms like UCB. We also prove a number of lower bounds.

preprint2020arXiv

Fair Prediction with Endogenous Behavior

There is increasing regulatory interest in whether machine learning algorithms deployed in consequential domains (e.g. in criminal justice) treat different demographic groups "fairly." However, there are several proposed notions of fairness, typically mutually incompatible. Using criminal justice as an example, we study a model in which society chooses an incarceration rule. Agents of different demographic groups differ in their outside options (e.g. opportunity for legal employment) and decide whether to commit crimes. We show that equalizing type I and type II errors across groups is consistent with the goal of minimizing the overall crime rate; other popular notions of fairness are not.

preprint2020arXiv

Moment Multicalibration for Uncertainty Estimation

We show how to achieve the notion of "multicalibration" from Hébert-Johnson et al. [2018] not just for means, but also for variances and other higher moments. Informally, it means that we can find regression functions which, given a data point, can make point predictions not just for the expectation of its label, but for higher moments of its label distribution as well-and those predictions match the true distribution quantities when averaged not just over the population as a whole, but also when averaged over an enormous number of finely defined subgroups. It yields a principled way to estimate the uncertainty of predictions on many different subgroups-and to diagnose potential sources of unfairness in the predictive power of features across subgroups. As an application, we show that our moment estimates can be used to derive marginal prediction intervals that are simultaneously valid as averaged over all of the (sufficiently large) subgroups for which moment multicalibration has been obtained.

preprint2012arXiv

Data Life Cycle Labs, A New Concept to Support Data-Intensive Science

In many sciences the increasing amounts of data are reaching the limit of established data handling and processing. With four large research centers of the German Helmholtz association the Large Scale Data Management and Analysis (LSDMA) project supports an initial set of scientific projects, initiatives and instruments to organize and efficiently analyze the increasing amount of data produced in modern science. LSDMA bridges the gap between data production and data analysis using a novel approach by combining specific community support and generic, cross community development. In the Data Life Cycle Labs (DLCL) experts from the data domain work closely with scientific groups of selected research domains in joint R&D where community-specific data life cycles are iteratively optimized, data and meta-data formats are defined and standardized, simple access and use is established as well as data and scientific insights are preserved in long-term and open accessible archives.

Christopher Jung

What is connected

Connect this record

See the researcher in context

Building this map preview

8 published item(s)

Convex Dataset Valuation for Post-Training

Getting more with less? Why repowering onshore wind farms does not always lead to more wind power generation -- a German case study

Metric-Free Individual Fairness in Online Learning

Practical Adversarial Multivalid Conformal Prediction

Quantifying the Burden of Exploration and the Unfairness of Free Riding

Fair Prediction with Endogenous Behavior

Moment Multicalibration for Uncertainty Estimation

Data Life Cycle Labs, A New Concept to Support Data-Intensive Science