Researcher profile

Sven Behnke

Sven Behnke contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
33works
0followers
11topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

33 published item(s)

preprint2026arXiv

Dream-MPC: Gradient-Based Model Predictive Control with Latent Imagination

State-of-the-art model-based Reinforcement Learning (RL) approaches either use gradient-free, population-based methods for planning, learned policy networks, or a combination of policy networks and planning. Hybrid approaches that combine Model Predictive Control (MPC) with a learned model and a policy prior to leverage the advantages of both paradigms have shown promising results. However, these approaches typically rely on gradient-free optimization methods, which can be computationally expensive for high-dimensional control tasks. While gradient-based methods are a promising alternative, recent works have empirically shown that gradient-based methods often perform worse than their gradient-free counterparts. We propose Dream-MPC, a novel approach that generates few candidate trajectories from a rolled-out policy and optimizes each trajectory by gradient ascent using a learned world model, uncertainty regularization and amortization of optimization iterations over time by reusing previously optimized actions. Our results on 24 continuous control tasks show that Dream-MPC can significantly improve the performance of the underlying policy and can outperform gradient-free MPC and state-of-the-art baselines. We will open source our code and more at https://dream-mpc.github.io.

preprint2026arXiv

Slot-MPC: Goal-Conditioned Model Predictive Control with Object-Centric Representations

Predictive world models enable agents to model scene dynamics and reason about the consequences of their actions. Inspired by human perception, object-centric world models capture scene dynamics using object-level representations, which can be used for downstream applications such as action planning. However, most object-centric world models and reinforcement learning (RL) approaches learn reactive policies that are fixed at inference time, limiting generalization to novel situations. We propose Slot-MPC, an object-centric world modeling framework that enables planning through Model Predictive Control (MPC). Slot-MPC leverages vision encoders to learn slot-based representations, which encode individual objects in the scene, and uses these structured representations to learn an action-conditioned object-centric dynamics model. At inference time, the learned dynamics model enables action planning via MPC, allowing agents to adapt to previously unseen situations. Since the learned world model is differentiable, we can use gradient-based MPC to directly optimize actions, which is computationally more efficient than relying on gradient-free, sampling-based MPC methods. Experiments on simulated robotic manipulation tasks show that Slot-MPC improves both task performance and planning efficiency compared to non-object-centric world model baselines. In the considered offline setting with limited state-action coverage, we find that gradient-based MPC performs better than gradient-free, sampling-based MPC. Our results demonstrate that explicitly structured, object-centric representations provide a strong inductive bias for controllable and generalizable decision-making. Code and additional results are available at https://slot-mpc.github.io.

preprint2023arXiv

Bimanual Telemanipulation with Force and Haptic Feedback through an Anthropomorphic Avatar System

Robotic teleoperation is a key technology for a wide variety of applications. It allows sending robots instead of humans in remote, possibly dangerous locations while still using the human brain with its enormous knowledge and creativity, especially for solving unexpected problems. A main challenge in teleoperation consists of providing enough feedback to the human operator for situation awareness and thus create full immersion, as well as offering the operator suitable control interfaces to achieve efficient and robust task fulfillment. We present a bimanual telemanipulation system consisting of an anthropomorphic avatar robot and an operator station providing force and haptic feedback to the human operator. The avatar arms are controlled in Cartesian space with a direct mapping of the operator movements. The measured forces and torques on the avatar side are haptically displayed to the operator. We developed a predictive avatar model for limit avoidance which runs on the operator side, ensuring low latency. The system was successfully evaluated during the ANA Avatar XPRIZE competition semifinals. In addition, we performed in lab experiments and carried out a small user study with mostly untrained operators.

preprint2022arXiv

3D Semantic Scene Perception using Distributed Smart Edge Sensors

We present a system for 3D semantic scene perception consisting of a network of distributed smart edge sensors. The sensor nodes are based on an embedded CNN inference accelerator and RGB-D and thermal cameras. Efficient vision CNN models for object detection, semantic segmentation, and human pose estimation run on-device in real time. 2D human keypoint estimations, augmented with the RGB-D depth estimate, as well as semantically annotated point clouds are streamed from the sensors to a central backend, where multiple viewpoints are fused into an allocentric 3D semantic scene model. As the image interpretation is computed locally, only semantic information is sent over the network. The raw images remain on the sensor boards, significantly reducing the required bandwidth, and mitigating privacy risks for the observed persons. We evaluate the proposed system in challenging real-world multi-person scenes in our lab. The proposed perception system provides a complete scene view containing semantically annotated 3D geometry and estimates 3D poses of multiple persons in real time.

preprint2022arXiv

A Study on the Ambiguity in Human Annotation of German Oral History Interviews for Perceived Emotion Recognition and Sentiment Analysis

For research in audiovisual interview archives often it is not only of interest what is said but also how. Sentiment analysis and emotion recognition can help capture, categorize and make these different facets searchable. In particular, for oral history archives, such indexing technologies can be of great interest. These technologies can help understand the role of emotions in historical remembering. However, humans often perceive sentiments and emotions ambiguously and subjectively. Moreover, oral history interviews have multi-layered levels of complex, sometimes contradictory, sometimes very subtle facets of emotions. Therefore, the question arises of the chance machines and humans have capturing and assigning these into predefined categories. This paper investigates the ambiguity in human perception of emotions and sentiment in German oral history interviews and the impact on machine learning systems. Our experiments reveal substantial differences in human perception for different emotions. Furthermore, we report from ongoing machine learning experiments with different modalities. We show that the human perceptual ambiguity and other challenges, such as class imbalance and lack of training data, currently limit the opportunities of these technologies for oral history archives. Nonetheless, our work uncovers promising observations and possibilities for further research.

preprint2022arXiv

Abstract Flow for Temporal Semantic Segmentation on the Permutohedral Lattice

Semantic segmentation is a core ability required by autonomous agents, as being able to distinguish which parts of the scene belong to which object class is crucial for navigation and interaction with the environment. Approaches which use only one time-step of data cannot distinguish between moving objects nor can they benefit from temporal integration. In this work, we extend a backbone LatticeNet to process temporal point cloud data. Additionally, we take inspiration from optical flow methods and propose a new module called Abstract Flow which allows the network to match parts of the scene with similar abstract features and gather the information temporally. We obtain state-of-the-art results on the SemanticKITTI dataset that contains LiDAR scans from real urban environments. We share the PyTorch implementation of TemporalLatticeNet at https://github.com/AIS-Bonn/temporal_latticenet .

preprint2022arXiv

Autonomous Flight in Unknown GNSS-denied Environments for Disaster Examination

Micro aerial vehicles (MAVs) have high potential for information gathering tasks to support situation awareness in search and rescue scenarios. Manually controlling MAVs in such scenarios requires experienced pilots and is error-prone, especially in stressful situations of real emergencies. The conditions of disaster scenarios are also challenging for autonomous MAV systems. The environment is usually not known in advance and GNSS might not always be available. We present a system for autonomous MAV flights in unknown environments which does not rely on global positioning systems. The method is evaluated in multiple search and rescue scenarios and allows for safe autonomous flights, even when transitioning between indoor and outdoor areas.

preprint2022arXiv

ConvPoseCNN2: Prediction and Refinement of Dense 6D Object Poses

Object pose estimation is a key perceptual capability in robotics. We propose a fully-convolutional extension of the PoseCNN method, which densely predicts object translations and orientations. This has several advantages such as improving the spatial resolution of the orientation predictions -- useful in highly-cluttered arrangements, significant reduction in parameters by avoiding full connectivity, and fast inference. We propose and discuss several aggregation methods for dense orientation predictions that can be applied as a post-processing step, such as averaging and clustering techniques. We demonstrate that our method achieves the same accuracy as PoseCNN on the challenging YCB-Video dataset and provide a detailed ablation study of several variants of our method. Finally, we demonstrate that the model can be further improved by inserting an iterative refinement module into the middle of the network, which enforces consistency of the prediction.

preprint2022arXiv

Direct Centroidal Control for Balanced Humanoid Locomotion

We present an integrated approach to locomotion and balancing of humanoid robots based on direct centroidal control. Our method uses a five-mass description of a humanoid. It generates whole-body motions from desired foot trajectories and centroidal parameters of the robot. A set of simplified models is used to formulate general and intuitive control laws, which are then applied in real-time for estimating and regulating the center of mass position and orientation of the multibody's principal axes of inertia. The combination of proposed algorithms produces a stretched-leg gait with naturally looking upper body motions. As only a 6-axis IMU and joint encoders are necessary for the implementation, the portability between robots is high. Our method has been experimentally verified using an igus Humanoid Open Platform, demonstrating whole-body locomotion and push rejection capabilities.

preprint2022arXiv

FaDIV-Syn: Fast Depth-Independent View Synthesis using Soft Masks and Implicit Blending

Novel view synthesis is required in many robotic applications, such as VR teleoperation and scene reconstruction. Existing methods are often too slow for these contexts, cannot handle dynamic scenes, and are limited by their explicit depth estimation stage, where incorrect depth predictions can lead to large projection errors. Our proposed method runs in real time on live streaming data and avoids explicit depth estimation by efficiently warping input images into the target frame for a range of assumed depth planes. The resulting plane sweep volume (PSV) is directly fed into our network, which first estimates soft PSV masks in a self-supervised manner, and then directly produces the novel output view. This improves efficiency and performance on transparent, reflective, thin, and feature-less scene parts. FaDIV-Syn can perform both interpolation and extrapolation tasks at 540p in real-time and outperforms state-of-the-art extrapolation methods on the large-scale RealEstate10k dataset. We thoroughly evaluate ablations, such as removing the Soft-Masking network, training from fewer examples as well as generalization to higher resolutions and stronger depth discretization. Our implementation is available.

preprint2022arXiv

Human and Automatic Speech Recognition Performance on German Oral History Interviews

Automatic speech recognition systems have accomplished remarkable improvements in transcription accuracy in recent years. On some domains, models now achieve near-human performance. However, transcription performance on oral history has not yet reached human accuracy. In the present work, we investigate how large this gap between human and machine transcription still is. For this purpose, we analyze and compare transcriptions of three humans on a new oral history data set. We estimate a human word error rate of 8.7% for recent German oral history interviews with clean acoustic conditions. For comparison with recent machine transcription accuracy, we present experiments on the adaptation of an acoustic model achieving near-human performance on broadcast speech. We investigate the influence of different adaptation data on robustness and generalization for clean and noisy oral history interviews. We optimize our acoustic models by 5 to 8% relative for this task and achieve 23.9% WER on noisy and 15.6% word error rate on clean oral history interviews.

preprint2022arXiv

Neural Strands: Learning Hair Geometry and Appearance from Multi-View Images

We present Neural Strands, a novel learning framework for modeling accurate hair geometry and appearance from multi-view image inputs. The learned hair model can be rendered in real-time from any viewpoint with high-fidelity view-dependent effects. Our model achieves intuitive shape and style control unlike volumetric counterparts. To enable these properties, we propose a novel hair representation based on a neural scalp texture that encodes the geometry and appearance of individual strands at each texel location. Furthermore, we introduce a novel neural rendering framework based on rasterization of the learned hair strands. Our neural rendering is strand-accurate and anti-aliased, making the rendering view-consistent and photorealistic. Combining appearance with a multi-view geometric prior, we enable, for the first time, the joint learning of appearance and explicit hair geometry from a multi-view setup. We demonstrate the efficacy of our approach in terms of fidelity and efficiency for various hairstyles.

preprint2022arXiv

NeuralMVS: Bridging Multi-View Stereo and Novel View Synthesis

Multi-View Stereo (MVS) is a core task in 3D computer vision. With the surge of novel deep learning methods, learned MVS has surpassed the accuracy of classical approaches, but still relies on building a memory intensive dense cost volume. Novel View Synthesis (NVS) is a parallel line of research and has recently seen an increase in popularity with Neural Radiance Field (NeRF) models, which optimize a per scene radiance field. However, NeRF methods do not generalize to novel scenes and are slow to train and test. We propose to bridge the gap between these two methodologies with a novel network that can recover 3D scene geometry as a distance function, together with high-resolution color images. Our method uses only a sparse set of images as input and can generalize well to novel scenes. Additionally, we propose a coarse-to-fine sphere tracing approach in order to significantly increase speed. We show on various datasets that our method reaches comparable accuracy to per-scene optimized methods while being able to generalize and running significantly faster. We provide the source code at https://github.com/AIS-Bonn/neural_mvs

preprint2022arXiv

Predicting Physical Object Properties from Video

We present a novel approach to estimating physical properties of objects from video. Our approach consists of a physics engine and a correction estimator. Starting from the initial observed state, object behavior is simulated forward in time. Based on the simulated and observed behavior, the correction estimator then determines refined physical parameters for each object. The method can be iterated for increased precision. Our approach is generic, as it allows for the use of an arbitrary - not necessarily differentiable - physics engine and correction estimator. For the latter, we evaluate both gradient-free hyperparameter optimization and a deep convolutional neural network. We demonstrate faster and more robust convergence of the learned method in several simulated 2D scenarios focusing on bin situations.

preprint2022arXiv

Predictive Angular Potential Field-based Obstacle Avoidance for Dynamic UAV Flights

In recent years, unmanned aerial vehicles (UAVs) are used for numerous inspection and video capture tasks. Manually controlling UAVs in the vicinity of obstacles is challenging, however, and poses a high risk of collisions. Even for autonomous flight, global navigation planning might be too slow to react to newly perceived obstacles. Disturbances such as wind might lead to deviations from the planned trajectories. In this work, we present a fast predictive obstacle avoidance method that does not depend on higher-level localization or mapping and maintains the dynamic flight capabilities of UAVs. It directly operates on LiDAR range images in real time and adjusts the current flight direction by computing angular potential fields within the range image. The velocity magnitude is subsequently determined based on a trajectory prediction and time-to-contact estimation. Our method is evaluated using Hardware-in-the-Loop simulations. It keeps the UAV at a safe distance to obstacles, while allowing higher flight velocities than previous reactive obstacle avoidance methods that directly operate on sensor data.

preprint2022arXiv

Real-Robot Deep Reinforcement Learning: Improving Trajectory Tracking of Flexible-Joint Manipulator with Reference Correction

Flexible-joint manipulators are governed by complex nonlinear dynamics, defining a challenging control problem. In this work, we propose an approach to learn an outer-loop joint trajectory tracking controller with deep reinforcement learning. The controller represented by a stochastic policy is learned in under two hours directly on the real robot. This is achieved through bounded reference correction actions and use of a model-free off-policy learning method. In addition, an informed policy initialization is proposed, where the agent is pre-trained in a learned simulation. We test our approach on the 7 DOF manipulator of a Baxter robot. We demonstrate that the proposed method is capable of consistent learning across multiple runs when applied directly on the real robot. Our method yields a policy which significantly improves the trajectory tracking accuracy in comparison to the vendor-provided controller, generalizing to an unseen payload.

preprint2022arXiv

Target Chase, Wall Building, and Fire Fighting: Autonomous UAVs of Team NimbRo at MBZIRC 2020

The Mohamed Bin Zayed International Robotics Challenge (MBZIRC) 2020 posed diverse challenges for unmanned aerial vehicles (UAVs). We present our four tailored UAVs, specifically developed for individual aerial-robot tasks of MBZIRC, including custom hardware- and software components. In Challenge 1, a target UAV is pursued using a high-efficiency, onboard object detection pipeline to capture a ball from the target UAV. A second UAV uses a similar detection method to find and pop balloons scattered throughout the arena. For Challenge 2, we demonstrate a larger UAV capable of autonomous aerial manipulation: Bricks are found and tracked from camera images. Subsequently, they are approached, picked, transported, and placed on a wall. Finally, in Challenge 3, our UAV autonomously finds fires using LiDAR and thermal cameras. It extinguishes the fires with an onboard fire extinguisher. While every robot features task-specific subsystems, all UAVs rely on a standard software stack developed for this particular and future competitions. We present our mostly open-source software solutions, including tools for system configuration, monitoring, robust wireless communication, high-level control, and agile trajectory generation. For solving the MBZIRC 2020 tasks, we advanced the state of the art in multiple research areas like machine vision and trajectory generation. We present our scientific contributions that constitute the foundation for our algorithms and systems and analyze the results from the MBZIRC competition 2020 in Abu Dhabi, where our systems reached second place in the Grand Challenge. Furthermore, we discuss lessons learned from our participation in this complex robotic challenge.

preprint2022arXiv

Two-step Planning of Dynamic UAV Trajectories using Iterative $δ$-Spaces

UAV trajectory planning is often done in a two-step approach, where a low-dimensional path is refined to a dynamic trajectory. The resulting trajectories are only locally optimal, however. On the other hand, direct planning in higher-dimensional state spaces generates globally optimal solutions but is time-consuming and thus infeasible for time-constrained applications. To address this issue, we propose $δ$-Spaces, a pruned high-dimensional state space representation for trajectory refinement. It does not only contain the area around a single lower-dimensional path but consists of the union of multiple near-optimal paths. Thus, it is less prone to local minima. Furthermore, we propose an anytime algorithm using $δ$-Spaces of increasing sizes. We compare our method against state-of-the-art search-based trajectory planning methods and evaluate it in 2D and 3D environments to generate second-order and third-order UAV trajectories.

preprint2022arXiv

Unsupervised Image Decomposition with Phase-Correlation Networks

The ability to decompose scenes into their object components is a desired property for autonomous agents, allowing them to reason and act in their surroundings. Recently, different methods have been proposed to learn object-centric representations from data in an unsupervised manner. These methods often rely on latent representations learned by deep neural networks, hence requiring high computational costs and large amounts of curated data. Such models are also difficult to interpret. To address these challenges, we propose the Phase-Correlation Decomposition Network (PCDNet), a novel model that decomposes a scene into its object components, which are represented as transformed versions of a set of learned object prototypes. The core building block in PCDNet is the Phase-Correlation Cell (PC Cell), which exploits the frequency-domain representation of the images in order to estimate the transformation between an object prototype and its transformed version in the image. In our experiments, we show how PCDNet outperforms state-of-the-art methods for unsupervised object discovery and segmentation on simple benchmark datasets and on more challenging data, while using a small number of learnable parameters and being fully interpretable.

preprint2022arXiv

YOLOPose: Transformer-based Multi-Object 6D Pose Estimation using Keypoint Regression

6D object pose estimation is a crucial prerequisite for autonomous robot manipulation applications. The state-of-the-art models for pose estimation are convolutional neural network (CNN)-based. Lately, Transformers, an architecture originally proposed for natural language processing, is achieving state-of-the-art results in many computer vision tasks as well. Equipped with the multi-head self-attention mechanism, Transformers enable simple single-stage end-to-end architectures for learning object detection and 6D object pose estimation jointly. In this work, we propose YOLOPose (short form for You Only Look Once Pose estimation), a Transformer-based multi-object 6D pose estimation method based on keypoint regression. In contrast to the standard heatmaps for predicting keypoints in an image, we directly regress the keypoints. Additionally, we employ a learnable orientation estimation module to predict the orientation from the keypoints. Along with a separate translation estimation module, our model is end-to-end differentiable. Our method is suitable for real-time applications and achieves results comparable to state-of-the-art methods.

preprint2021arXiv

Bipedal Humanoid Hardware Design: A Technology Review

Purpose of Review: As new technological advancements are made, humanoid robots that utilise them are being designed and manufactured. For optimal design choices, a broad overview with insight on the advantages and disadvantages of available technologies is necessary. This article intends to provide an analysis on the established approaches and contrast them with emerging ones. Recent Findings: A clear shift in the recent design features of humanoid robots is developing, which is supported by literature. As humanoid robots are meant to leave laboratories and traverse the world, compliance and more efficient locomotion is necessary. The limitations of highly rigid actuation are being tackled by different research groups in unique ways. Some focus on modifying the kinematic structure, while others change the actuation scheme. With new manufacturing capabilities, previously impossible designs are becoming feasible. Summary: A comprehensive review on the technologies crucial for bipedal humanoid robots was performed. Different mechanical concepts have been discussed, along with the advancements in actuation, sensing and manufacturing. The paper is supplemented with a list of the recently developed platforms along with a selection of their specifications.

preprint2021arXiv

Search-based Planning of Dynamic MAV Trajectories Using Local Multiresolution State Lattices

Search-based methods that use motion primitives can incorporate the system's dynamics into the planning and thus generate dynamically feasible MAV trajectories that are globally optimal. However, searching high-dimensional state lattices is computationally expensive. Local multiresolution is a commonly used method to accelerate spatial path planning. While paths within the vicinity of the robot are represented at high resolution, the representation gets coarser for more distant parts. In this work, we apply the concept of local multiresolution to high-dimensional state lattices that include velocities and accelerations. Experiments show that our proposed approach significantly reduces planning times. Thus, it increases the applicability to large dynamic environments, where frequent replanning is necessary.

preprint2020arXiv

3D U-Net for Segmentation of Plant Root MRI Images in Super-Resolution

Magnetic resonance imaging (MRI) enables plant scientists to non-invasively study root system development and root-soil interaction. Challenging recording conditions, such as low resolution and a high level of noise hamper the performance of traditional root extraction algorithms, though. We propose to increase signal-to-noise ratio and resolution by segmenting the scanned volumes into root and soil in super-resolution using a 3D U-Net. Tests on real data show that the trained network is capable to detect most roots successfully and even finds roots that were missed by human annotators. Our experiments show that the segmentation performance can be further improved with modifications of the loss function.

preprint2020arXiv

A VR System for Immersive Teleoperation and Live Exploration with a Mobile Robot

Applications like disaster management and industrial inspection often require experts to enter contaminated places. To circumvent the need for physical presence, it is desirable to generate a fully immersive individual live teleoperation experience. However, standard video-based approaches suffer from a limited degree of immersion and situation awareness due to the restriction to the camera view, which impacts the navigation. In this paper, we present a novel VR-based practical system for immersive robot teleoperation and scene exploration. While being operated through the scene, a robot captures RGB-D data that is streamed to a SLAM-based live multi-client telepresence system. Here, a global 3D model of the already captured scene parts is reconstructed and streamed to the individual remote user clients where the rendering for e.g. head-mounted display devices (HMDs) is performed. We introduce a novel lightweight robot client component which transmits robot-specific data and enables a quick integration into existing robotic systems. This way, in contrast to first-person exploration systems, the operators can explore and navigate in the remote site completely independent of the current position and view of the capturing robot, complementing traditional input devices for teleoperation. We provide a proof-of-concept implementation and demonstrate the capabilities as well as the performance of our system regarding interactive object measurements and bandwidth-efficient data streaming and visualization. Furthermore, we show its benefits over purely video-based teleoperation in a user study revealing a higher degree of situation awareness and a more precise navigation in challenging environments.

preprint2020arXiv

Beyond Photometric Consistency: Gradient-based Dissimilarity for Improving Visual Odometry and Stereo Matching

Pose estimation and map building are central ingredients of autonomous robots and typically rely on the registration of sensor data. In this paper, we investigate a new metric for registering images that builds upon on the idea of the photometric error. Our approach combines a gradient orientation-based metric with a magnitude-dependent scaling term. We integrate both into stereo estimation as well as visual odometry systems and show clear benefits for typical disparity and direct image registration tasks when using our proposed metric. Our experimental evaluation indicats that our metric leads to more robust and more accurate estimates of the scene depth as well as camera trajectory. Thus, the metric improves camera pose estimation and in turn the mapping capabilities of mobile robots. We believe that a series of existing visual odometry and visual SLAM systems can benefit from the findings reported in this paper.

preprint2020arXiv

Category-Level 3D Non-Rigid Registration from Single-View RGB Images

In this paper, we propose a novel approach to solve the 3D non-rigid registration problem from RGB images using Convolutional Neural Networks (CNNs). Our objective is to find a deformation field (typically used for transferring knowledge between instances, e.g., grasping skills) that warps a given 3D canonical model into a novel instance observed by a single-view RGB image. This is done by training a CNN that infers a deformation field for the visible parts of the canonical model and by employing a learned shape (latent) space for inferring the deformations of the occluded parts. As result of the registration, the observed model is reconstructed. Because our method does not need depth information, it can register objects that are typically hard to perceive with RGB-D sensors, e.g. with transparent or shiny surfaces. Even without depth data, our approach outperforms the Coherent Point Drift (CPD) registration method for the evaluated object categories.

preprint2020arXiv

Fast Whole-Body Motion Control of Humanoid Robots with Inertia Constraints

We introduce a new, analytical method for generating whole-body motions for humanoid robots, which approximate the desired Composite Rigid Body (CRB) inertia. Our approach uses a reduced five mass model, where four of the masses are attributed to the limbs and one is used for the trunk. This compact formulation allows for finding an analytical solution that combines the kinematics with mass distribution and inertial properties of a humanoid robot. The positioning of the masses in Cartesian space is then directly used to obtain joint angles with relations based on simple geometry. Motions are achieved through the time evolution of poses generated through the desired foot positioning and CRB inertia properties. As a result, we achieve short computation times in the order of tens of microseconds. This makes the method suited for applications with limited computation resources, or leaving them to be spent on higher-layer tasks such as model predictive control. The approach is evaluated by performing a dynamic kicking motion with an igus Humanoid Open Platform robot.

preprint2020arXiv

LatticeNet: Fast Point Cloud Segmentation Using Permutohedral Lattices

Deep convolutional neural networks (CNNs) have shown outstanding performance in the task of semantically segmenting images. However, applying the same methods on 3D data still poses challenges due to the heavy memory requirements and the lack of structured data. Here, we propose LatticeNet, a novel approach for 3D semantic segmentation, which takes as input raw point clouds. A PointNet describes the local geometry which we embed into a sparse permutohedral lattice. The lattice allows for fast convolutions while keeping a low memory footprint. Further, we introduce DeformSlice, a novel learned data-dependent interpolation for projecting lattice features back onto the point cloud. We present results of 3D segmentation on various datasets where our method achieves state-of-the-art performance.

preprint2020arXiv

Motion Segmentation using Frequency Domain Transformer Networks

Self-supervised prediction is a powerful mechanism to learn representations that capture the underlying structure of the data. Despite recent progress, the self-supervised video prediction task is still challenging. One of the critical factors that make the task hard is motion segmentation, which is segmenting individual objects and the background and estimating their motion separately. In video prediction, the shape, appearance, and transformation of each object should be understood only by predicting the next frame in pixel space. To address this task, we propose a novel end-to-end learnable architecture that predicts the next frame by modeling foreground and background separately while simultaneously estimating and predicting the foreground motion using Frequency Domain Transformer Networks. Experimental evaluations show that this yields interpretable representations and that our approach can outperform some widely used video prediction methods like Video Ladder Network and Predictive Gated Pyramids on synthetic data.

preprint2020arXiv

Multi-Staged Cross-Lingual Acoustic Model Adaption for Robust Speech Recognition in Real-World Applications -- A Case Study on German Oral History Interviews

While recent automatic speech recognition systems achieve remarkable performance when large amounts of adequate, high quality annotated speech data is used for training, the same systems often only achieve an unsatisfactory result for tasks in domains that greatly deviate from the conditions represented by the training data. For many real-world applications, there is a lack of sufficient data that can be directly used for training robust speech recognition systems. To address this issue, we propose and investigate an approach that performs a robust acoustic model adaption to a target domain in a cross-lingual, multi-staged manner. Our approach enables the exploitation of large-scale training data from other domains in both the same and other languages. We evaluate our approach using the challenging task of German oral history interviews, where we achieve a relative reduction of the word error rate by more than 30% compared to a model trained from scratch only on the target domain, and 6-7% relative compared to a model trained robustly on 1000 hours of same-language out-of-domain training data.

preprint2020arXiv

NAT: Noise-Aware Training for Robust Neural Sequence Labeling

Sequence labeling systems should perform reliably not only under ideal conditions but also with corrupted inputs - as these systems often process user-generated text or follow an error-prone upstream component. To this end, we formulate the noisy sequence labeling problem, where the input may undergo an unknown noising process and propose two Noise-Aware Training (NAT) objectives that improve robustness of sequence labeling performed on perturbed input: Our data augmentation method trains a neural model using a mixture of clean and noisy samples, whereas our stability training algorithm encourages the model to create a noise-invariant latent representation. We employ a vanilla noise model at training time. For evaluation, we use both the original data and its variants perturbed with real OCR errors and misspellings. Extensive experiments on English and German named entity recognition benchmarks confirmed that NAT consistently improved robustness of popular sequence labeling models, preserving accuracy on the original input. We make our code and data publicly available for the research community.

preprint2020arXiv

Stillleben: Realistic Scene Synthesis for Deep Learning in Robotics

Training data is the key ingredient for deep learning approaches, but difficult to obtain for the specialized domains often encountered in robotics. We describe a synthesis pipeline capable of producing training data for cluttered scene perception tasks such as semantic segmentation, object detection, and correspondence or pose estimation. Our approach arranges object meshes in physically realistic, dense scenes using physics simulation. The arranged scenes are rendered using high-quality rasterization with randomized appearance and material parameters. Noise and other transformations introduced by the camera sensors are simulated. Our pipeline can be run online during training of a deep neural network, yielding applications in life-long learning and in iterative render-and-compare approaches. We demonstrate the usability by learning semantic segmentation on the challenging YCB-Video dataset without actually using any training frames, where our method achieves performance comparable to a conventionally trained model. Additionally, we show successful application in a real-world regrasping system.

preprint2020arXiv

Visual Descriptor Learning from Monocular Video

Correspondence estimation is one of the most widely researched and yet only partially solved area of computer vision with many applications in tracking, mapping, recognition of objects and environment. In this paper, we propose a novel way to estimate dense correspondence on an RGB image where visual descriptors are learned from video examples by training a fully convolutional network. Most deep learning methods solve this by training the network with a large set of expensive labeled data or perform labeling through strong 3D generative models using RGB-D videos. Our method learns from RGB videos using contrastive loss, where relative labeling is estimated from optical flow. We demonstrate the functionality in a quantitative analysis on rendered videos, where ground truth information is available. Not only does the method perform well on test data with the same background, it also generalizes to situations with a new background. The descriptors learned are unique and the representations determined by the network are global. We further show the applicability of the method to real-world videos.