Researcher profile

Daniel Kang

Daniel Kang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
9works
0followers
6topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

9 published item(s)

preprint2023arXiv

Identifying and Mitigating the Security Risks of Generative AI

Every major technical invention resurfaces the dual-use dilemma -- the new technology has the potential to be used for good as well as for harm. Generative AI (GenAI) techniques, such as large language models (LLMs) and diffusion models, have shown remarkable capabilities (e.g., in-context learning, code-completion, and text-to-image generation and editing). However, GenAI can be used just as well by attackers to generate new attacks and increase the velocity and efficacy of existing attacks. This paper reports the findings of a workshop held at Google (co-organized by Stanford University and the University of Wisconsin-Madison) on the dual-use dilemma posed by GenAI. This paper is not meant to be comprehensive, but is rather an attempt to synthesize some of the interesting findings from the workshop. We discuss short-term and long-term goals for the community on this topic. We hope this paper provides both a launching point for a discussion on this important topic as well as interesting problems that the research community can work to address.

preprint2022arXiv

Approximate Selection with Guarantees using Proxies

Due to the falling costs of data acquisition and storage, researchers and industry analysts often want to find all instances of rare events in large datasets. For instance, scientists can cheaply capture thousands of hours of video, but are limited by the need to manually inspect long videos to identify relevant objects and events. To reduce this cost, recent work proposes to use cheap proxy models, such as image classifiers, to identify an approximate set of data points satisfying a data selection filter. Unfortunately, this recent work does not provide the statistical accuracy guarantees necessary in scientific and production settings. In this work, we introduce novel algorithms for approximate selection queries with statistical accuracy guarantees. Namely, given a limited number of exact identifications from an oracle, often a human or an expensive machine learning model, our algorithms meet a minimum precision or recall target with high probability. In contrast, existing approaches can catastrophically fail in satisfying these recall and precision targets. We show that our algorithms can improve query result quality by up to 30x for both the precision and recall targets in both real and synthetic datasets.

preprint2022arXiv

Finding Label and Model Errors in Perception Data With Learned Observation Assertions

ML is being deployed in complex, real-world scenarios where errors have impactful consequences. In these systems, thorough testing of the ML pipelines is critical. A key component in ML deployment pipelines is the curation of labeled training data. Common practice in the ML literature assumes that labels are the ground truth. However, in our experience in a large autonomous vehicle development center, we have found that vendors can often provide erroneous labels, which can lead to downstream safety risks in trained models. To address these issues, we propose a new abstraction, learned observation assertions, and implement it in a system called Fixy. Fixy leverages existing organizational resources, such as existing (possibly noisy) labeled datasets or previously trained ML models, to learn a probabilistic model for finding errors in human- or model-generated labels. Given user-provided features and these existing resources, Fixy learns feature distributions that specify likely and unlikely values (e.g., that a speed of 30mph is likely but 300mph is unlikely). It then uses these feature distributions to score labels for potential errors. We show that FIxy can automatically rank potential errors in real datasets with up to 2$\times$ higher precision compared to recent work on model assertions and standard techniques such as uncertainty sampling.

preprint2022arXiv

Semantic Indexes for Machine Learning-based Queries over Unstructured Data

Unstructured data (e.g., video or text) is now commonly queried by using computationally expensive deep neural networks or human labelers to produce structured information, e.g., object types and positions in video. To accelerate queries, many recent systems (e.g., BlazeIt, NoScope, Tahoma, SUPG, etc.) train a query-specific proxy model to approximate a large target labelers (i.e., these expensive neural networks or human labelers). These models return proxy scores that are then used in query processing algorithms. Unfortunately, proxy models usually have to be trained per query and require large amounts of annotations from the target labelers. In this work, we develop an index (trainable semantic index, TASTI) that simultaneously removes the need for per-query proxies and is more efficient to construct than prior indexes. TASTI accomplishes this by leveraging semantic similarity across records in a given dataset. Specifically, it produces embeddings for each record such that records with close embeddings have similar target labeler outputs. TASTI then generates high-quality proxy scores via embeddings without needing to train a per-query proxy. These scores can be used in existing proxy-based query processing algorithms (e.g., for aggregation, selection, etc.). We theoretically analyze TASTI and show that a low embedding training error guarantees downstream query accuracy for a natural class of queries. We evaluate TASTI on five video, text, and speech datasets, and three query types. We show that TASTI's indexes can be 10$\times$ less expensive to construct than generating annotations for current proxy-based methods, and accelerate queries by up to 24$\times$.

preprint2020arXiv

Improved Natural Language Generation via Loss Truncation

Neural language models are usually trained to match the distributional properties of a large-scale corpus by minimizing the log loss. While straightforward to optimize, this approach forces the model to reproduce all variations in the dataset, including noisy and invalid references (e.g., misannotation and hallucinated facts). Worse, the commonly used log loss is overly sensitive to such phenomena and even a small fraction of noisy data can degrade performance. In this work, we show that the distinguishability of the models and reference serves as a principled and robust alternative for handling invalid references. To optimize distinguishability, we propose loss truncation, which adaptively removes high loss examples during training. We show this is as easy to optimize as log loss and tightly bounds distinguishability under noise. Empirically, we demonstrate that loss truncation outperforms existing baselines on distinguishability on a summarization task, and show that samples generated by the loss truncation model have factual accuracy ratings that exceed those of baselines and match human references.

preprint2020arXiv

Jointly Optimizing Preprocessing and Inference for DNN-based Visual Analytics

While deep neural networks (DNNs) are an increasingly popular way to query large corpora of data, their significant runtime remains an active area of research. As a result, researchers have proposed systems and optimizations to reduce these costs by allowing users to trade off accuracy and speed. In this work, we examine end-to-end DNN execution in visual analytics systems on modern accelerators. Through a novel measurement study, we show that the preprocessing of data (e.g., decoding, resizing) can be the bottleneck in many visual analytics systems on modern hardware. To address the bottleneck of preprocessing, we introduce two optimizations for end-to-end visual analytics systems. First, we introduce novel methods of achieving accuracy and throughput trade-offs by using natively present, low-resolution visual data. Second, we develop a runtime engine for efficient visual DNN inference. This runtime engine a) efficiently pipelines preprocessing and DNN execution for inference, b) places preprocessing operations on the CPU or GPU in a hardware- and input-aware manner, and c) efficiently manages memory and threading for high throughput execution. We implement these optimizations in a novel system, Smol, and evaluate Smol on eight visual datasets. We show that its optimizations can achieve up to 5.9x end-to-end throughput improvements at a fixed accuracy over recent work in visual analytics.

preprint2020arXiv

MLPerf Training Benchmark

Machine learning (ML) needs industry-standard performance benchmarks to support design and competitive evaluation of the many emerging software and hardware solutions for ML. But ML training presents three unique benchmarking challenges absent from other domains: optimizations that improve training throughput can increase the time to solution, training is stochastic and time to solution exhibits high variance, and software and hardware systems are so diverse that fair benchmarking with the same binary, code, and even hyperparameters is difficult. We therefore present MLPerf, an ML benchmark that overcomes these challenges. Our analysis quantitatively evaluates MLPerf's efficacy at driving performance and scalability improvements across two rounds of results from multiple vendors.

preprint2020arXiv

Model Assertions for Monitoring and Improving ML Models

ML models are increasingly deployed in settings with real world interactions such as vehicles, but unfortunately, these models can fail in systematic ways. To prevent errors, ML engineering teams monitor and continuously improve these models. We propose a new abstraction, model assertions, that adapts the classical use of program assertions as a way to monitor and improve ML models. Model assertions are arbitrary functions over a model's input and output that indicate when errors may be occurring, e.g., a function that triggers if an object rapidly changes its class in a video. We propose methods of using model assertions at all stages of ML system deployment, including runtime monitoring, validating labels, and continuously improving ML models. For runtime monitoring, we show that model assertions can find high confidence errors, where a model returns the wrong output with high confidence, which uncertainty-based monitoring techniques would not detect. For training, we propose two methods of using model assertions. First, we propose a bandit-based active learning algorithm that can sample from data flagged by assertions and show that it can reduce labeling costs by up to 40% over traditional uncertainty-based methods. Second, we propose an API for generating "consistency assertions" (e.g., the class change example) and weak labels for inputs where the consistency assertions fail, and show that these weak labels can improve relative model quality by up to 46%. We evaluate model assertions on four real-world tasks with video, LIDAR, and ECG data.

preprint2020arXiv

Willump: A Statistically-Aware End-to-end Optimizer for Machine Learning Inference

Systems for ML inference are widely deployed today, but they typically optimize ML inference workloads using techniques designed for conventional data serving workloads and miss critical opportunities to leverage the statistical nature of ML. In this paper, we present Willump, an optimizer for ML inference that introduces two statistically-motivated optimizations targeting ML applications whose performance bottleneck is feature computation. First, Willump automatically cascades feature computation for classification queries: Willump classifies most data inputs using only high-value, low-cost features selected through empirical observations of ML model performance, improving query performance by up to 5x without statistically significant accuracy loss. Second, Willump accurately approximates ML top-K queries, discarding low-scoring inputs with an automatically constructed approximate model and then ranking the remainder with a more powerful model, improving query performance by up to 10x with minimal accuracy loss. Willump automatically tunes these optimizations' parameters to maximize query performance while meeting an accuracy target. Moreover, Willump complements these statistical optimizations with compiler optimizations to automatically generate fast inference code for ML applications. We show that Willump improves the end-to-end performance of real-world ML inference pipelines curated from major data science competitions by up to 16x without statistically significant loss of accuracy.