Source author record

Stefan Roth

Stefan Roth appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computer Vision Machine Learning Robotics Artificial Intelligence eess.SP Information Theory math.IT physics.ins-det Computation and Language Cryptography and Security hep-ex math.PR math.ST Networking and Internet Architecture Statistics Theory

Catalog footprint

What is connected

27works

15topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Approximation-based Threshold Optimization from Single Antenna to Massive SIMO Authentication

In a wireless sensor network, data from various sensors are gathered to estimate the system-state of the process system. However, adversaries aim at distorting the system-state estimate, for which they may infiltrate sensors or position additional devices in the environment. To authenticate the received process values, the integrity of the measurements from different sensors can be evaluated jointly with the temporal integrity of channel measurements from each sensor. For this purpose, we design a security protocol, in which Kalman filters are used to predict the system-state and the channel-state values, and the received data are authenticated by a hypothesis test. We theoretically analyze the adversarial success probability and the reliability rate obtained in the hypothesis test in two ways, based on a chi-square approximation and on a Gaussian approximation. The two approximations are exact for small and large data vectors, respectively. The Gaussian approximation is suitable for analyzing massive single-input multiple-output (SIMO) setups. To obtain additional insights, the approximation is further adapted for the case of channel hardening, which occurs in massive SIMO fading channels. As adversaries always look for the weakest point of a system, a time-constant security level is required. To provide such a service, the approximations are used to propose time-varying threshold values for the hypothesis test, which approximately attain a constant security level. Numerical results show that a constant security level can only be achieved by a time-varying threshold choice, while a constant threshold value leads to a time-varying security level.

preprint2022arXiv

Diverse Image Captioning with Grounded Style

Stylized image captioning as presented in prior work aims to generate captions that reflect characteristics beyond a factual description of the scene composition, such as sentiments. Such prior work relies on given sentiment identifiers, which are used to express a certain global style in the caption, e.g. positive or negative, however without taking into account the stylistic content of the visual scene. To address this shortcoming, we first analyze the limitations of current stylized captioning datasets and propose COCO attribute-based augmentations to obtain varied stylized captions from COCO annotations. Furthermore, we encode the stylized information in the latent space of a Variational Autoencoder; specifically, we leverage extracted image attributes to explicitly structure its sequential latent space according to different localized style characteristics. Our experiments on the Senticap and COCO datasets show the ability of our approach to generate accurate captions with diversity in styles that are grounded in the image.

preprint2022arXiv

IRShield: A Countermeasure Against Adversarial Physical-Layer Wireless Sensing

Wireless radio channels are known to contain information about the surrounding propagation environment, which can be extracted using established wireless sensing methods. Thus, today's ubiquitous wireless devices are attractive targets for passive eavesdroppers to launch reconnaissance attacks. In particular, by overhearing standard communication signals, eavesdroppers obtain estimations of wireless channels which can give away sensitive information about indoor environments. For instance, by applying simple statistical methods, adversaries can infer human motion from wireless channel observations, allowing to remotely monitor premises of victims. In this work, building on the advent of intelligent reflecting surfaces (IRSs), we propose IRShield as a novel countermeasure against adversarial wireless sensing. IRShield is designed as a plug-and-play privacy-preserving extension to existing wireless networks. At the core of IRShield, we design an IRS configuration algorithm to obfuscate wireless channels. We validate the effectiveness with extensive experimental evaluations. In a state-of-the-art human motion detection attack using off-the-shelf Wi-Fi devices, IRShield lowered detection rates to 5% or less.

preprint2022arXiv

xGQA: Cross-Lingual Visual Question Answering

Recent advances in multimodal vision and language modeling have predominantly focused on the English language, mostly due to the lack of multilingual multimodal datasets to steer modeling efforts. In this work, we address this gap and provide xGQA, a new multilingual evaluation benchmark for the visual question answering task. We extend the established English GQA dataset to 7 typologically diverse languages, enabling us to detect and explore crucial challenges in cross-lingual visual question answering. We further propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual, and -- vice versa -- multilingual models to become multimodal. Our proposed methods outperform current state-of-the-art multilingual multimodal models (e.g., M3P) in zero-shot cross-lingual settings, but the accuracy remains low across the board; a performance drop of around 38 accuracy points in target languages showcases the difficulty of zero-shot cross-lingual transfer for this task. Our results suggest that simple cross-lingual transfer of multimodal models yields latent multilingual multimodal misalignment, calling for more sophisticated methods for vision and multilingual language modeling.

preprint2020arXiv

A solution to a linear integral equation with an application to statistics of infinitely divisible moving averages

For a stationary moving average random field, a non-parametric low frequency estimator of the Lévy density of its infinitely divisible independently scattered integrator measure is given. The plug-in estimate is based on the solution $w$ of the linear integral equation $v(x) = \int_{\mathbb{R}^d} g(s) w(h(s)x)ds$, where $g,h:\mathbb{R}^d \rightarrow \mathbb{R}$ are given measurable functions and $v$ is a (weighted) $L^2$-function on $\mathbb{R}$. We investigate conditions for the existence and uniqueness of this solution and give $L^2$-error bounds for the resulting estimates. An application to pure jump moving averages and a simulation study round off the paper.

preprint2020arXiv

Driving Style Encoder: Situational Reward Adaptation for General-Purpose Planning in Automated Driving

General-purpose planning algorithms for automated driving combine mission, behavior, and local motion planning. Such planning algorithms map features of the environment and driving kinematics into complex reward functions. To achieve this, planning experts often rely on linear reward functions. The specification and tuning of these reward functions is a tedious process and requires significant experience. Moreover, a manually designed linear reward function does not generalize across different driving situations. In this work, we propose a deep learning approach based on inverse reinforcement learning that generates situation-dependent reward functions. Our neural network provides a mapping between features and actions of sampled driving policies of a model-predictive control-based planner and predicts reward functions for upcoming planning cycles. In our evaluation, we compare the driving style of reward functions predicted by our deep network against clustered and linear reward functions. Our proposed deep learning approach outperforms clustered linear reward functions and is at par with linear reward functions with a-priori knowledge about the situation.

preprint2020arXiv

Driving with Style: Inverse Reinforcement Learning in General-Purpose Planning for Automated Driving

Behavior and motion planning play an important role in automated driving. Traditionally, behavior planners instruct local motion planners with predefined behaviors. Due to the high scene complexity in urban environments, unpredictable situations may occur in which behavior planners fail to match predefined behavior templates. Recently, general-purpose planners have been introduced, combining behavior and local motion planning. These general-purpose planners allow behavior-aware motion planning given a single reward function. However, two challenges arise: First, this function has to map a complex feature space into rewards. Second, the reward function has to be manually tuned by an expert. Manually tuning this reward function becomes a tedious task. In this paper, we propose an approach that relies on human driving demonstrations to automatically tune reward functions. This study offers important insights into the driving style optimization of general-purpose planners with maximum entropy inverse reinforcement learning. We evaluate our approach based on the expected value difference between learned and demonstrated policies. Furthermore, we compare the similarity of human driven trajectories with optimal policies of our planner under learned and expert-tuned reward functions. Our experiments show that we are able to learn reward functions exceeding the level of manual expert tuning without prior domain knowledge.

preprint2020arXiv

Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings

Learned joint representations of images and text form the backbone of several important cross-domain tasks such as image captioning. Prior work mostly maps both domains into a common latent representation in a purely supervised fashion. This is rather restrictive, however, as the two domains follow distinct generative processes. Therefore, we propose a novel semi-supervised framework, which models shared information between domains and domain-specific information separately. The information shared between the domains is aligned with an invertible neural network. Our model integrates normalizing flow-based priors for the domain-specific information, which allows us to learn diverse many-to-many mappings between the two domains. We demonstrate the effectiveness of our model on diverse tasks, including image captioning and text-to-image synthesis.

preprint2020arXiv

LR-CNN: Local-aware Region CNN for Vehicle Detection in Aerial Imagery

State-of-the-art object detection approaches such as Fast/Faster R-CNN, SSD, or YOLO have difficulties detecting dense, small targets with arbitrary orientation in large aerial images. The main reason is that using interpolation to align RoI features can result in a lack of accuracy or even loss of location information. We present the Local-aware Region Convolutional Neural Network (LR-CNN), a novel two-stage approach for vehicle detection in aerial imagery. We enhance translation invariance to detect dense vehicles and address the boundary quantization issue amongst dense vehicles by aggregating the high-precision RoIs' features. Moreover, we resample high-level semantic pooled features, making them regain location information from the features of a shallower convolutional block. This strengthens the local feature invariance for the resampled features and enables detecting vehicles in an arbitrary orientation. The local feature invariance enhances the learning ability of the focal loss function, and the focal loss further helps to focus on the hard examples. Taken together, our method better addresses the challenges of aerial imagery. We evaluate our approach on several challenging datasets (VEDAI, DOTA), demonstrating a significant improvement over state-of-the-art methods. We demonstrate the good generalization ability of our approach on the DLR 3K dataset.

preprint2020arXiv

MOT20: A benchmark for multi object tracking in crowded scenes

Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for research. The benchmark for Multiple Object Tracking, MOTChallenge, was launched with the goal to establish a standardized evaluation of multiple object tracking methods. The challenge focuses on multiple people tracking, since pedestrians are well studied in the tracking community, and precise tracking and detection has high practical relevance. Since the first release, MOT15, MOT16, and MOT17 have tremendously contributed to the community by introducing a clean dataset and precise framework to benchmark multi-object trackers. In this paper, we present our MOT20benchmark, consisting of 8 new sequences depicting very crowded challenging scenes. The benchmark was presented first at the 4thBMTT MOT Challenge Workshop at the Computer Vision and Pattern Recognition Conference (CVPR) 2019, and gives to chance to evaluate state-of-the-art methods for multiple object tracking when handling extremely crowded scenarios.

preprint2020arXiv

Normalizing Flows with Multi-Scale Autoregressive Priors

Flow-based generative models are an important class of exact inference models that admit efficient inference and sampling for image synthesis. Owing to the efficiency constraints on the design of the flow layers, e.g. split coupling flow layers in which approximately half the pixels do not undergo further transformations, they have limited expressiveness for modeling long-range data dependencies compared to autoregressive models that rely on conditional pixel-wise generation. In this work, we improve the representational power of flow-based models by introducing channel-wise dependencies in their latent space through multi-scale autoregressive priors (mAR). Our mAR prior for models with split coupling flow layers (mAR-SCF) can better capture dependencies in complex multimodal data. The resulting model achieves state-of-the-art density estimation results on MNIST, CIFAR-10, and ImageNet. Furthermore, we show that mAR-SCF allows for improved image generation quality, with gains in FID and Inception scores compared to state-of-the-art flow-based models.

preprint2020arXiv

Optical Flow Estimation in the Deep Learning Age

Akin to many subareas of computer vision, the recent advances in deep learning have also significantly influenced the literature on optical flow. Previously, the literature had been dominated by classical energy-based models, which formulate optical flow estimation as an energy minimization problem. However, as the practical benefits of Convolutional Neural Networks (CNNs) over conventional methods have become apparent in numerous areas of computer vision and beyond, they have also seen increased adoption in the context of motion estimation to the point where the current state of the art in terms of accuracy is set by CNN approaches. We first review this transition as well as the developments from early work to the current state of CNNs for optical flow estimation. Alongside, we discuss some of their technical details and compare them to recapitulate which technical contribution led to the most significant accuracy improvements. Then we provide an overview of the various optical flow approaches introduced in the deep learning age, including those based on alternative learning paradigms (e.g., unsupervised and semi-supervised methods) as well as the extension to the multi-frame case, which is able to yield further accuracy improvements.

preprint2020arXiv

Probabilistic Pixel-Adaptive Refinement Networks

Encoder-decoder networks have found widespread use in various dense prediction tasks. However, the strong reduction of spatial resolution in the encoder leads to a loss of location information as well as boundary artifacts. To address this, image-adaptive post-processing methods have shown beneficial by leveraging the high-resolution input image(s) as guidance data. We extend such approaches by considering an important orthogonal source of information: the network's confidence in its own predictions. We introduce probabilistic pixel-adaptive convolutions (PPACs), which not only depend on image guidance data for filtering, but also respect the reliability of per-pixel predictions. As such, PPACs allow for image-adaptive smoothing and simultaneously propagating pixels of high confidence into less reliable regions, while respecting object boundaries. We demonstrate their utility in refinement networks for optical flow and semantic segmentation, where PPACs lead to a clear reduction in boundary artifacts. Moreover, our proposed refinement step is able to substantially improve the accuracy on various widely used benchmarks.

preprint2020arXiv

Remote Short Blocklength Process Monitoring: Trade-off Between Resolution and Data Freshness

In cyber-physical systems, as in 5G and beyond, multiple physical processes require timely online monitoring at a remote device. There, the received information is used to estimate current and future process values. When transmitting the process data over a communication channel, source-channel coding is used in order to reduce data errors. During transmission, a high data resolution is helpful to capture the value of the process variables precisely. However, this typically comes with long transmission delays reducing the utilizability of the data, since the estimation quality gets reduced over time. In this paper, the trade-off between having recent data and precise measurements is captured for a Gauss-Markov process. An Age-of-Information (AoI) metric is used to assess data timeliness, while mean square error (MSE) is used to assess the precision of the predicted process values. AoI appears inherently within the MSE expressions, yet it can be relatively easier to optimize. Our goal is to minimize a time-averaged version of both metrics. We follow a short blocklength source-channel coding approach, and optimize the parameters of the codes being used in order to describe an achievability region between MSE and AoI.

preprint2020arXiv

Self-Supervised Monocular Scene Flow Estimation

Scene flow estimation has been receiving increasing attention for 3D environment perception. Monocular scene flow estimation -- obtaining 3D structure and 3D motion from two temporally consecutive images -- is a highly ill-posed problem, and practical solutions are lacking to date. We propose a novel monocular scene flow method that yields competitive accuracy and real-time performance. By taking an inverse problem view, we design a single convolutional neural network (CNN) that successfully estimates depth and 3D motion simultaneously from a classical optical flow cost volume. We adopt self-supervised learning with 3D loss functions and occlusion reasoning to leverage unlabeled data. We validate our design choices, including the proxy loss and augmentation setup. Our model achieves state-of-the-art accuracy among unsupervised/self-supervised learning approaches to monocular scene flow, and yields competitive results for the optical flow and monocular depth estimation sub-tasks. Semi-supervised fine-tuning further improves the accuracy and yields promising results in real-time.

preprint2020arXiv

Single-Stage Semantic Segmentation from Image Labels

Recent years have seen a rapid growth in new approaches improving the accuracy of semantic segmentation in a weakly supervised setting, i.e. with only image-level labels available for training. However, this has come at the cost of increased model complexity and sophisticated multi-stage training procedures. This is in contrast to earlier work that used only a single stage $-$ training one segmentation network on image labels $-$ which was abandoned due to inferior segmentation accuracy. In this work, we first define three desirable properties of a weakly supervised method: local consistency, semantic fidelity, and completeness. Using these properties as guidelines, we then develop a segmentation-based network model and a self-supervised training scheme to train for semantic masks from image-level annotations in a single stage. We show that despite its simplicity, our method achieves results that are competitive with significantly more complex pipelines, substantially outperforming earlier single-stage methods.

preprint2016arXiv

A Time Projection Chamber with GEM-Based Readout

For the International Large Detector concept at the planned International Linear Collider, the use of time projection chambers (TPC) with micro-pattern gas detector readout as the main tracking detector is investigated. In this paper, results from a prototype TPC, placed in a 1 T solenoidal field and read out with three independent GEM-based readout modules, are reported. The TPC was exposed to a 6 GeV electron beam at the DESY II synchrotron. The efficiency for reconstructing hits, the measurement of the drift velocity, the space point resolution and the control of field inhomogeneities are presented.

preprint2016arXiv

Joint Optical Flow and Temporally Consistent Semantic Segmentation

The importance and demands of visual scene understanding have been steadily increasing along with the active development of autonomous systems. Consequently, there has been a large amount of research dedicated to semantic segmentation and dense motion estimation. In this paper, we propose a method for jointly estimating optical flow and temporally consistent semantic segmentation, which closely connects these two problem domains and leverages each other. Semantic segmentation provides information on plausible physical motion to its associated pixels, and accurate pixel-level temporal correspondences enhance the accuracy of semantic segmentation in the temporal domain. We demonstrate the benefits of our approach on the KITTI benchmark, where we observe performance gains for flow and segmentation. We achieve state-of-the-art optical flow results, and outperform all published algorithms by a large margin on challenging, but crucial dynamic objects.

preprint2016arXiv

MOT16: A Benchmark for Multi-Object Tracking

Standardized benchmarks are crucial for the majority of computer vision applications. Although leaderboards and ranking tables should not be over-claimed, benchmarks often provide the most objective measure of performance and are therefore important guides for reseach. Recently, a new benchmark for Multiple Object Tracking, MOTChallenge, was launched with the goal of collecting existing and new data and creating a framework for the standardized evaluation of multiple object tracking methods. The first release of the benchmark focuses on multiple people tracking, since pedestrians are by far the most studied object in the tracking community. This paper accompanies a new release of the MOTChallenge benchmark. Unlike the initial release, all videos of MOT16 have been carefully annotated following a consistent protocol. Moreover, it not only offers a significant increase in the number of labeled boxes, but also provides multiple object classes beside pedestrians and the level of visibility for every single object of interest.

preprint2016arXiv

Parametric Object Motion from Blur

Motion blur can adversely affect a number of vision tasks, hence it is generally considered a nuisance. We instead treat motion blur as a useful signal that allows to compute the motion of objects from a single image. Drawing on the success of joint segmentation and parametric motion models in the context of optical flow estimation, we propose a parametric object motion model combined with a segmentation mask to exploit localized, non-uniform motion blur. Our parametric image formation model is differentiable w.r.t. the motion parameters, which enables us to generalize marginal-likelihood techniques from uniform blind deblurring to localized, non-uniform blur. A two-stage pipeline, first in derivative space and then in image space, allows to estimate both parametric object motion as well as a motion segmentation from a single image alone. Our experiments demonstrate its ability to cope with very challenging cases of object motion blur.

preprint2016arXiv

Playing for Data: Ground Truth from Computer Games

Recent progress in computer vision has been driven by high-capacity models trained on large datasets. Unfortunately, creating large datasets with pixel-level labels has been extremely costly due to the amount of human effort required. In this paper, we present an approach to rapidly creating pixel-accurate semantic label maps for images extracted from modern computer games. Although the source code and the internal operation of commercial games are inaccessible, we show that associations between image patches can be reconstructed from the communication between the game and the graphics hardware. This enables rapid propagation of semantic labels within and across images synthesized by the game, with no access to the source code or the content. We validate the presented approach by producing dense pixel-level semantic annotations for 25 thousand images synthesized by a photorealistic open-world computer game. Experiments on semantic segmentation datasets show that using the acquired data to supplement real-world images significantly increases accuracy and that the acquired data enables reducing the amount of hand-labeled real-world data: models trained with game data and just 1/3 of the CamVid training set outperform models trained on the complete CamVid training set.

preprint2016arXiv

Stereo Video Deblurring

Videos acquired in low-light conditions often exhibit motion blur, which depends on the motion of the objects relative to the camera. This is not only visually unpleasing, but can hamper further processing. With this paper we are the first to show how the availability of stereo video can aid the challenging video deblurring task. We leverage 3D scene flow, which can be estimated robustly even under adverse conditions. We go beyond simply determining the object motion in two ways: First, we show how a piecewise rigid 3D scene flow representation allows to induce accurate blur kernels via local homographies. Second, we exploit the estimated motion boundaries of the 3D scene flow to mitigate ringing artifacts using an iterative weighting scheme. Being aware of 3D object motion, our approach can deal robustly with an arbitrary number of independently moving objects. We demonstrate its benefit over state-of-the-art video deblurring using quantitative and qualitative experiments on rendered scenes and real videos.

preprint2016arXiv

The Cityscapes Dataset for Semantic Urban Scene Understanding

Visual understanding of complex urban street scenes is an enabling factor for a wide range of applications. Object detection has benefited enormously from large-scale datasets, especially in the context of deep learning. For semantic urban scene understanding, however, no current dataset adequately captures the complexity of real-world urban scenes. To address this, we introduce Cityscapes, a benchmark suite and large-scale dataset to train and test approaches for pixel-level and instance-level semantic labeling. Cityscapes is comprised of a large, diverse set of stereo video sequences recorded in streets from 50 different cities. 5000 of these images have high quality pixel-level annotations; 20000 additional images have coarse annotations to enable methods that leverage large volumes of weakly-labeled data. Crucially, our effort exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity. Our accompanying empirical study provides an in-depth analysis of the dataset characteristics, as well as a performance evaluation of several state-of-the-art approaches based on our benchmark.

preprint2015arXiv

MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking

In the recent past, the computer vision community has developed centralized benchmarks for the performance evaluation of a variety of tasks, including generic object and pedestrian detection, 3D reconstruction, optical flow, single-object short-term tracking, and stereo estimation. Despite potential pitfalls of such benchmarks, they have proved to be extremely helpful to advance the state of the art in the respective area. Interestingly, there has been rather limited work on the standardization of quantitative benchmarks for multiple target tracking. One of the few exceptions is the well-known PETS dataset, targeted primarily at surveillance applications. Despite being widely used, it is often applied inconsistently, for example involving using different subsets of the available data, different ways of training the models, or differing evaluation scripts. This paper describes our work toward a novel multiple object tracking benchmark aimed to address such issues. We discuss the challenges of creating such a framework, collecting existing and new data, gathering state-of-the-art methods to be tested on the datasets, and finally creating a unified evaluation system. With MOTChallenge we aim to pave the way toward a unified evaluation framework for a more meaningful quantification of multi-target tracking.

preprint2014arXiv

Cascades of Regression Tree Fields for Image Restoration

Conditional random fields (CRFs) are popular discriminative models for computer vision and have been successfully applied in the domain of image restoration, especially to image denoising. For image deblurring, however, discriminative approaches have been mostly lacking. We posit two reasons for this: First, the blur kernel is often only known at test time, requiring any discriminative approach to cope with considerable variability. Second, given this variability it is quite difficult to construct suitable features for discriminative prediction. To address these challenges we first show a connection between common half-quadratic inference for generative image priors and Gaussian CRFs. Based on this analysis, we then propose a cascade model for image restoration that consists of a Gaussian CRF at each stage. Each stage of our cascade is semi-parametric, i.e. it depends on the instance-specific parameters of the restoration problem, such as the blur kernel. We train our model by loss minimization with synthetically generated training data. Our experiments show that when applied to non-blind image deblurring, the proposed approach is efficient and yields state-of-the-art restoration quality on images corrupted with synthetic and real blur. Moreover, we demonstrate its suitability for image denoising, where we achieve competitive results for grayscale and color images.

preprint2013arXiv

Afterpulse Measurements of R7081 Photomultipliers for the Double Chooz Experiment

We present the results of afterpulse measurements performed as qualification test for 473 inner detector photomultipliers of the Double Chooz experiment. The measurements include the determination of a total afterpulse occurrence probability as well as an average time distribution of these pulses. Additionally, more detailed measurements with different light sources and simultaneous charge and timing measurements were performed with a few photomultipliers to allow a more detailed understanding of the effect. The results of all measurements are presented and discussed.

preprint2013arXiv

Extrapolation of Stationary Random Fields

We introduce basic statistical methods for the extrapolation of stationary random fields. For square integrable fields, we set out basics of the kriging extrapolation techniques. For (non--Gaussian) stable fields, which are known to be heavy tailed, we describe further extrapolation methods and discuss their properties. Two of them can be seen as direct generalizations of kriging.

Stefan Roth

What is connected

Connect this record

See the researcher in context

Building this map preview

27 published item(s)

Approximation-based Threshold Optimization from Single Antenna to Massive SIMO Authentication

Diverse Image Captioning with Grounded Style

IRShield: A Countermeasure Against Adversarial Physical-Layer Wireless Sensing

xGQA: Cross-Lingual Visual Question Answering

A solution to a linear integral equation with an application to statistics of infinitely divisible moving averages

Driving Style Encoder: Situational Reward Adaptation for General-Purpose Planning in Automated Driving

Driving with Style: Inverse Reinforcement Learning in General-Purpose Planning for Automated Driving

Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings

LR-CNN: Local-aware Region CNN for Vehicle Detection in Aerial Imagery

MOT20: A benchmark for multi object tracking in crowded scenes

Normalizing Flows with Multi-Scale Autoregressive Priors

Optical Flow Estimation in the Deep Learning Age

Probabilistic Pixel-Adaptive Refinement Networks

Remote Short Blocklength Process Monitoring: Trade-off Between Resolution and Data Freshness

Self-Supervised Monocular Scene Flow Estimation

Single-Stage Semantic Segmentation from Image Labels

A Time Projection Chamber with GEM-Based Readout

Joint Optical Flow and Temporally Consistent Semantic Segmentation

MOT16: A Benchmark for Multi-Object Tracking

Parametric Object Motion from Blur

Playing for Data: Ground Truth from Computer Games

Stereo Video Deblurring

The Cityscapes Dataset for Semantic Urban Scene Understanding

MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking

Cascades of Regression Tree Fields for Image Restoration

Afterpulse Measurements of R7081 Photomultipliers for the Double Chooz Experiment

Extrapolation of Stationary Random Fields