Source author record

Markus Vincze

Markus Vincze appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher

Unclaimed source record

Robotics, Computer Vision, Machine Learning

Catalog footprint

What is connected

16 works

3 topics

4 close collaborators

preprint, 2026, arXiv

OSCAR: Open-Set CAD Retrieval from a Language Prompt and a Single Image

6D object pose estimation plays a crucial role in scene understanding for applications such as robotics and augmented reality. To support the needs of ever-changing object sets in such contexts, modern zero-shot object pose estimators were developed to not require object-specific training but only rely on CAD models. Such models are hard to obtain once deployed, and a continuously changing and growing set of objects makes it harder to reliably identify the instance model of interest. To address this challenge, we introduce Open-Set CAD Retrieval from a Language Prompt and a Single Image (OSCAR), a novel training-free method that retrieves a matching object model from an unlabeled 3D object database. During onboarding, OSCAR generates multi-view renderings of database models and annotates them with descriptive captions using an image captioning model. At inference, GroundedSAM detects the queried object in the input image, and multi-modal embeddings are computed for both the Region-of-Interest and the database captions. OSCAR employs a two-stage retrieval: text-based filtering using CLIP identifies candidate models, followed by image-based refinement using DINOv2 to select the most visually similar object. In our experiments we demonstrate that OSCAR outperforms all state-of-the-art methods on the cross-domain 3D model retrieval benchmark MI3DOR. Furthermore, we demonstrate OSCAR's direct applicability in automating object model sourcing for 6D object pose estimation. We propose using the most similar object model for pose estimation if the exact instance is not available and show that OSCAR achieves an average precision of 90.48% during object retrieval on the YCB-V object dataset. Moreover, we demonstrate that the most similar object model can be utilized for pose estimation using Megapose, achieving better results than a reconstruction-based approach.
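The two-stage retrieval described in this abstract (text-based filtering, then image-based refinement) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `top_k` parameter, the dictionary layout and the toy embedding values are all assumptions made for illustration, and real embeddings would come from models such as CLIP and DINOv2 rather than hand-written lists.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def two_stage_retrieve(query_text, query_image, database, top_k=2):
    """Stage 1: keep the top_k models by text-embedding similarity.
    Stage 2: among those, return the id with the highest image similarity."""
    by_text = sorted(database,
                     key=lambda m: cosine(query_text, m["text"]),
                     reverse=True)
    candidates = by_text[:top_k]
    best = max(candidates, key=lambda m: cosine(query_image, m["image"]))
    return best["id"]

# Toy embeddings (hypothetical values, chosen only to make the stages visible).
database = [
    {"id": "mug_a", "text": [1.0, 0.0], "image": [0.0, 1.0]},
    {"id": "mug_b", "text": [0.9, 0.1], "image": [1.0, 0.0]},
    {"id": "drill", "text": [0.0, 1.0], "image": [1.0, 0.0]},
]
result = two_stage_retrieve([1.0, 0.0], [1.0, 0.1], database)
print(result)
```

In this toy database, "drill" matches the query image as well as "mug_b" does, but it is discarded by the text filter first, which is the point of filtering before refining.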

preprint, 2022, arXiv

BURG-Toolkit: Robot Grasping Experiments in Simulation and the Real World

This paper presents BURG-Toolkit, a set of open-source tools for Benchmarking and Understanding Robotic Grasping. Our tools allow researchers to: (1) create virtual scenes for generating training data and performing grasping in simulation; (2) recreate the scene by arranging the corresponding objects accurately in the physical world for real robot experiments, supporting an analysis of the sim-to-real gap; and (3) share the scenes with other researchers to foster comparability and reproducibility of experimental results. We explain how to use our tools by describing some potential use cases. We further provide proof-of-concept experimental results quantifying the sim-to-real gap for robot grasping in some example scenes. The tools are available at: https://mrudorfer.github.io/burg-toolkit/

preprint, 2022, arXiv

COPE: End-to-end trainable Constant Runtime Object Pose Estimation

State-of-the-art object pose estimation handles multiple instances in a test image by using multi-model formulations: detection as a first stage and then separately trained networks per object for 2D-3D geometric correspondence prediction as a second stage. Poses are subsequently estimated using the Perspective-n-Points algorithm at runtime. Unfortunately, multi-model formulations are slow and do not scale well with the number of object instances involved. Recent approaches show that direct 6D object pose estimation is feasible when derived from the aforementioned geometric correspondences. We present an approach that learns an intermediate geometric representation of multiple objects to directly regress 6D poses of all instances in a test image. The inherent end-to-end trainability overcomes the requirement of separately processing individual object instances. By calculating the mutual Intersection-over-Unions, pose hypotheses are clustered into distinct instances, which achieves negligible runtime overhead with respect to the number of object instances. Results on multiple challenging standard datasets show that the pose estimation performance is superior to single-model state-of-the-art approaches despite being more than ~35 times faster. We additionally provide an analysis showing real-time applicability (>24 fps) for images where more than 90 object instances are present. Further results show the advantage of supervising geometric-correspondence-based object pose estimation with the 6D pose.
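The abstract states that pose hypotheses are clustered into instances using mutual Intersection-over-Union. A minimal sketch of that idea follows, under stated assumptions: hypotheses are represented here by axis-aligned 2D boxes `(x1, y1, x2, y2)`, the grouping is a simple greedy pass, and the `threshold` value is hypothetical. The abstract does not specify these details, so this should not be read as COPE's actual procedure.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def cluster_by_iou(boxes, threshold=0.5):
    """Greedily group box indices: a box joins the first cluster whose
    first member overlaps it with IoU >= threshold, else starts a new one."""
    clusters = []
    for i, box in enumerate(boxes):
        for cluster in clusters:
            if iou(box, boxes[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Three hypotheses around one object and one hypothesis elsewhere.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 9, 10)]
clusters = cluster_by_iou(boxes)
print(clusters)
```

Each resulting cluster would then correspond to one object instance, whose hypotheses could be merged into a single pose estimate.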

preprint, 2022, arXiv

SporeAgent: Reinforced Scene-level Plausibility for Object Pose Refinement

Observational noise, inaccurate segmentation and ambiguity due to symmetry and occlusion lead to inaccurate object pose estimates. While depth- and RGB-based pose refinement approaches increase the accuracy of the resulting pose estimates, they are susceptible to ambiguity in the observation as they consider visual alignment. We propose to leverage the fact that we often observe static, rigid scenes. Thus, the objects therein need to be under physically plausible poses. We show that considering plausibility reduces ambiguity and, in consequence, allows poses to be more accurately predicted in cluttered environments. To this end, we extend a recent RL-based registration approach towards iterative refinement of object poses. Experiments on the LINEMOD and YCB-VIDEO datasets demonstrate the state-of-the-art performance of our depth-based refinement approach.

preprint, 2020, arXiv

DGCM-Net: Dense Geometrical Correspondence Matching Network for Incremental Experience-based Robotic Grasping

This article presents a method for grasping novel objects by learning from experience. Successful attempts are remembered and then used to guide future grasps such that more reliable grasping is achieved over time. To generalise the learned experience to unseen objects, we introduce the dense geometric correspondence matching network (DGCM-Net). This applies metric learning to encode objects with similar geometry nearby in feature space. Retrieving relevant experience for an unseen object is thus a nearest neighbour search with the encoded feature maps. DGCM-Net also reconstructs 3D-3D correspondences using the view-dependent normalised object coordinate space to transform grasp configurations from retrieved samples to unseen objects. In comparison to baseline methods, our approach achieves an equivalent grasp success rate. However, the baselines are significantly improved when fusing the knowledge from experience with their grasp proposal strategy. Offline experiments with a grasping dataset highlight the capability to generalise within and between object classes as well as to improve success rate over time from increasing experience. Lastly, by learning task-relevant grasps, our approach can prioritise grasps that enable the functional use of objects.

preprint, 2020, arXiv

Robot Perception of Static and Dynamic Objects with an Autonomous Floor Scrubber

This paper presents the perception system of a new professional cleaning robot for large public places. The proposed system is based on multiple sensors including 3D and 2D lidar, two RGB-D cameras and a stereo camera. The two lidars together with an RGB-D camera are used for dynamic object (human) detection and tracking, while the second RGB-D and stereo camera are used for detection of static objects (dirt and ground objects). A learning and reasoning module for spatial-temporal representation of the environment based on the perception pipeline is also introduced. Furthermore, a new dataset collected with the robot in several public places, including a supermarket, a warehouse and an airport, is released. Baseline results on this dataset for further research and comparison are provided. The proposed system has been fully implemented into the Robot Operating System (ROS) with high modularity, also publicly available to the community.

preprint, 2020, arXiv

Unsupervised Domain Adaptation through Inter-modal Rotation for RGB-D Object Recognition

Unsupervised Domain Adaptation (DA) exploits the supervision of a label-rich source dataset to make predictions on an unlabeled target dataset by aligning the two data distributions. In robotics, DA is used to take advantage of automatically generated synthetic data that come with "free" annotation to make effective predictions on real data. However, existing DA methods are not designed to cope with the multi-modal nature of RGB-D data, which are widely used in robotic vision. We propose a novel RGB-D DA method that reduces the synthetic-to-real domain shift by exploiting the inter-modal relation between the RGB and depth image. Our method consists of training a convolutional neural network to solve, in addition to the main recognition task, the pretext task of predicting the relative rotation between the RGB and depth image. To evaluate our method and encourage further research in this area, we define two benchmark datasets for object categorization and instance recognition. With extensive experiments, we show the benefits of leveraging the inter-modal relations for RGB-D DA.

preprint, 2020, arXiv

VeREFINE: Integrating Object Pose Verification with Physics-guided Iterative Refinement

Accurate and robust object pose estimation for robotics applications requires verification and refinement steps. In this work, we propose to integrate hypotheses verification with object pose refinement guided by physics simulation. This allows the physical plausibility of individual object pose estimates and the stability of the estimated scene to be considered in a unified optimization. The proposed method is able to adapt to scenes of multiple objects and efficiently focuses on refining the most promising object poses in multi-hypotheses scenarios. We call this integrated approach VeREFINE and evaluate it on three datasets with varying scene complexity. The generality of the approach is shown by using three state-of-the-art pose estimators and three baseline refiners. Results show improvements over all baselines and on all datasets. Furthermore, our approach is applied in real-world grasping experiments and outperforms competing methods in terms of grasp success rate. Code is publicly available at github.com/dornik/verefine.

preprint, 2019, arXiv

In-pipe Robotic System for Pipe-joint Rehabilitation in Fresh Water Pipes

The robot's objective is to rehabilitate the pipe joints of fresh water supply systems by crawling into water canals and applying a restoration material to repair the pipes. The robot's structure consists of six wheeled legs, three at the front separated by 120° and three at the back in the same configuration, supporting the structure along the centre of the pipe. In this configuration the robot is able to clean and seal with a rotating tool, similar to a cylindrical robot, covering the entire 3D in-pipe space.

preprint, 2019, arXiv

Pix2Pose: Pixel-Wise Coordinate Regression of Objects for 6D Pose Estimation

Estimating the 6D pose of objects using only RGB images remains challenging because of problems such as occlusion and symmetries. It is also difficult to construct 3D models with precise texture without expert knowledge or specialized scanning devices. To address these problems, we propose a novel pose estimation method, Pix2Pose, that predicts the 3D coordinates of each object pixel without textured models. An auto-encoder architecture is designed to estimate the 3D coordinates and expected errors per pixel. These pixel-wise predictions are then used in multiple stages to form 2D-3D correspondences to directly compute poses with the PnP algorithm with RANSAC iterations. Our method is robust to occlusion by leveraging recent achievements in generative adversarial training to precisely recover occluded parts. Furthermore, a novel loss function, the transformer loss, is proposed to handle symmetric objects by guiding predictions to the closest symmetric pose. Evaluations on three different benchmark datasets containing symmetric and occluded objects show our method outperforms the state of the art using only RGB images.

preprint, 2016, arXiv

Help, Anyone? A User Study For Modeling Robotic Behavior To Mitigate Malfunctions With The Help Of The User

Service robots for the domestic environment are intended to autonomously provide support for their users. However, state-of-the-art robots still often get stuck in failure situations, leading to breakdowns in the interaction flow from which the robot cannot recover alone. We performed a multi-user Wizard-of-Oz experiment in which we manipulated the robot's behavior in such a way that it appeared unexpected and malfunctioning, and asked participants to help the robot in order to restore the interaction flow. We examined how participants reacted to the robot's error and its subsequent request for help, and how it changed their perception of the robot with respect to perceived intelligence, likability, and task contribution. As the interaction scenario we used a game of building Lego models performed by user dyads. In total, 38 participants interacted with the robot and helped in malfunctioning situations. We report two major findings: (1) in user dyads, the user who gave the last command, followed by the user who is closer, is more likely to help; and (2) malfunctions that can be actively fixed by the user seem not to negatively impact perceived intelligence and likability ratings. This work offers insights into how far user support can be a strategy for domestic service robots to recover from repeating malfunctions.

preprint, 2015, arXiv

Object Modelling with a Handheld RGB-D Camera

This work presents a flexible system to reconstruct 3D models of objects captured with an RGB-D sensor. A major advantage of the method is that our reconstruction pipeline allows the user to acquire a full 3D model of the object. This is achieved by acquiring several partial 3D models in different sessions that are automatically merged together to reconstruct a full model. In addition, the 3D models acquired by our system can be directly used by state-of-the-art object instance recognition and object tracking modules, providing object-perception capabilities for different applications, such as human-object interaction analysis or robot grasping. The system does not impose constraints on the appearance of objects (textured, untextured) or on the modelling setup (moving camera with static object or a turn-table setup). The proposed reconstruction system has been used to model a large number of objects, resulting in metrically accurate and visually appealing 3D models.

preprint, 2015, arXiv

Using Dimension Reduction to Improve the Classification of High-dimensional Data

In this work we show that the classification performance of high-dimensional structural MRI data with only a small set of training examples is improved by the usage of dimension reduction methods. We assessed two different dimension reduction variants: feature selection by ANOVA F-test and feature transformation by PCA. On the reduced datasets, we applied common learning algorithms using 5-fold cross-validation. Training, tuning of the hyperparameters, and the performance evaluation of the classifiers were conducted using two different performance measures: accuracy and area under the Receiver Operating Characteristic curve (AUC). Our hypothesis is supported by experimental results.
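Feature selection by ANOVA F-test, one of the two variants named in this abstract, can be sketched as follows. This is an illustrative sketch only: it computes the standard one-way ANOVA F-statistic per feature in plain Python and keeps the highest-scoring features. The function names, the `n_keep` parameter and the toy data are assumptions and are not taken from the paper.

```python
def f_statistic(values, labels):
    """One-way ANOVA F-statistic of a single feature across class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more samples than classes")
    grand_mean = sum(values) / n
    means = {y: sum(g) / len(g) for y, g in groups.items()}
    # Between-group and within-group sums of squares.
    ssb = sum(len(g) * (means[y] - grand_mean) ** 2 for y, g in groups.items())
    ssw = sum((v - means[y]) ** 2 for y, g in groups.items() for v in g)
    if ssw == 0:
        return float("inf") if ssb > 0 else 0.0
    return (ssb / (k - 1)) / (ssw / (n - k))

def select_top_features(rows, labels, n_keep):
    """Return the (sorted) column indices of the n_keep highest-F features."""
    n_features = len(rows[0])
    scores = [f_statistic([r[j] for r in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])

# Toy data (hypothetical): feature 0 separates the two classes clearly.
rows = [[1.0, 5.0, 0.2], [1.2, 4.0, 0.1], [3.0, 5.5, 0.3], [3.2, 4.5, 0.2]]
labels = [0, 0, 1, 1]
kept = select_top_features(rows, labels, n_keep=1)
print(kept)
```

A classifier would then be trained on only the kept columns, which is what reduces the dimensionality relative to the number of training examples.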

preprint, 2015, arXiv

Where to look first? Behaviour control for fetch-and-carry missions of service robots

This paper presents the behaviour control of a service robot for intelligent object search in a domestic environment. A major challenge in service robotics is to enable fetch-and-carry missions that are satisfying for the user in terms of efficiency and human-oriented perception. The proposed behaviour controller provides an informed intelligent search based on a semantic segmentation framework for indoor scenes and integrates it with object recognition and grasping. Instead of manually annotating search positions in the environment, the framework automatically suggests likely locations to search for an object based on contextual information, e.g. next to tables and shelves. In a preliminary set of experiments we demonstrate that this behaviour control is as efficient as using manually annotated locations. Moreover, we argue that our approach will reduce the intensity of labour associated with programming fetch-and-carry tasks for service robots and that it will be perceived as more human-oriented.

preprint, 2014, arXiv

Find my mug: Efficient object search with a mobile robot using semantic segmentation

In this paper, we propose an efficient semantic segmentation framework for indoor scenes, tailored to the application on a mobile robot. Semantic segmentation can help robots to gain a reasonable understanding of their environment, but to reach this goal, the algorithms not only need to be accurate, but also fast and robust. Therefore, we developed an optimized 3D point cloud processing framework based on a Randomized Decision Forest, achieving competitive results at sufficiently high frame rates. We evaluate the capabilities of our method on the popular NYU depth dataset and our own data and demonstrate its feasibility by deploying it on a mobile service robot, for which we could optimize an object search procedure using our results.

preprint, 2013, arXiv

Visual Room-Awareness for Humanoid Robot Self-Localization

Humanoid robots without internal sensors such as a compass tend to lose their orientation after a fall. Furthermore, re-initialisation is often ambiguous due to symmetric man-made environments. The room-awareness module proposed here is inspired by the results of psychological experiments and improves existing self-localization strategies by mapping and matching the visual background with colour histograms. The matching algorithm uses a particle filter to generate hypotheses of the viewing directions independent of the self-localization algorithm and generates confidence values for various possible poses. The robot's behaviour controller uses those confidence values to steer the self-localization algorithm towards the most likely pose and prevents the algorithm from getting stuck in local minima. Experiments with a symmetric Standard Platform League RoboCup playing field with a simulated and a real humanoid NAO robot show the significant improvement of the system.
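The idea of matching the visual background with colour histograms to obtain a confidence per viewing direction can be sketched as below. The sketch covers only the histogram comparison, not the particle filter mentioned in the abstract. Histogram intersection is used as the similarity measure purely as an assumption, since the abstract does not name one, and the function names, direction keys and bin counts are hypothetical.

```python
def normalise(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = sum(hist)
    return [h / total for h in hist] if total else list(hist)

def intersection(h1, h2):
    """Histogram intersection of two normalised histograms, in [0, 1]."""
    return sum(min(a, b) for a, b in zip(normalise(h1), normalise(h2)))

def best_direction(observed, background_map):
    """Score every mapped viewing direction against the observed histogram
    and return the best direction together with its score as a confidence."""
    scores = {d: intersection(observed, h) for d, h in background_map.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical background map: viewing direction in degrees -> colour histogram.
background_map = {
    0: [8, 1, 1],
    90: [1, 8, 1],
    180: [1, 1, 8],
    270: [4, 4, 2],
}
direction, confidence = best_direction([7, 2, 1], background_map)
print(direction, round(confidence, 2))
```

On a colour-symmetric playing field, the two ends of the field look alike but the room background behind them usually does not, so a score of this kind can help disambiguate poses that the field markings alone cannot.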
