Source author record

Matthias Grundmann

Matthias Grundmann appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Computer Vision, Machine Learning, Cryptography and Security, eess.AS, Networking and Internet Architecture, Sound

Catalog footprint

What is connected

11 works

6 topics

4 close collaborators

Preprint, 2024, arXiv

StreamVC: Real-Time Low-Latency Voice Conversion

We present StreamVC, a streaming voice conversion solution that preserves the content and prosody of any source speech while matching the voice timbre from any target speech. Unlike previous approaches, StreamVC produces the resulting waveform at low latency from the input signal even on a mobile platform, making it applicable to real-time communication scenarios like calls and video conferencing, and addressing use cases such as voice anonymization in these scenarios. Our design leverages the architecture and training strategy of the SoundStream neural audio codec for lightweight high-quality speech synthesis. We demonstrate the feasibility of learning soft speech units causally, as well as the effectiveness of supplying whitened fundamental frequency information to improve pitch stability without leaking the source timbre information.
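As a rough illustration of what "whitened" fundamental frequency could mean here, the sketch below normalizes a pitch track to zero mean and unit variance per utterance. This is an assumption made for illustration, not the StreamVC implementation: the function name `whiten_f0`, the use of linear Hz values, and the convention that unvoiced frames are marked with 0 are all hypothetical.

```python
import math


def whiten_f0(f0_hz, eps=1e-8):
    """Normalize voiced F0 values to zero mean and unit variance.

    Assumption: frames with f0 <= 0 are unvoiced and are returned as 0.0.
    Statistics are computed over the voiced frames of one utterance only.
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        return [0.0 for _ in f0_hz]
    mean = sum(voiced) / len(voiced)
    var = sum((f - mean) ** 2 for f in voiced) / len(voiced)
    std = math.sqrt(var)
    return [(f - mean) / (std + eps) if f > 0 else 0.0 for f in f0_hz]


# Two hypothetical speakers with the same contour shape, one an octave higher.
low_voice = whiten_f0([100.0, 0.0, 110.0, 120.0])
high_voice = whiten_f0([200.0, 0.0, 220.0, 240.0])
```

Under this normalization both contours map to practically the same values, so the shape of the pitch movement survives while the speaker's absolute pitch register, one cue to the source voice, is removed.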

Preprint, 2022, arXiv

BlazePose GHUM Holistic: Real-time 3D Human Landmarks and Pose Estimation

We present BlazePose GHUM Holistic, a lightweight neural network pipeline for 3D human body landmarks and pose estimation, specifically tailored to real-time on-device inference. BlazePose GHUM Holistic enables motion capture from a single RGB image including avatar control, fitness tracking and AR/VR effects. Our main contributions include i) a novel method for 3D ground truth data acquisition, ii) updated 3D body tracking with additional hand landmarks and iii) full body pose estimation from a monocular image.

Preprint, 2022, arXiv

Efficient Heterogeneous Video Segmentation at the Edge

We introduce an efficient video segmentation system for resource-limited edge devices leveraging heterogeneous compute. Specifically, we design network models by searching across multiple dimensions of specifications for the neural architectures and operations on top of already light-weight backbones, targeting commercially available edge inference engines. We further analyze and optimize the heterogeneous data flows in our systems across the CPU, the GPU and the NPU. Our approach has empirically factored well into our real-time AR system, enabling remarkably higher accuracy with quadrupled effective resolutions, yet at much shorter end-to-end latency, much higher frame rate, and even lower power consumption on edge platforms.

Preprint, 2021, arXiv

On the Estimation of the Number of Unreachable Peers in the Bitcoin P2P Network by Observation of Peer Announcements

Bitcoin is based on a P2P network that is used to propagate transactions and blocks. While the P2P network design intends to hide the topology of the P2P network, information about the topology is required to understand the network from a scientific point of view. Thus, there is a natural tension between the 'desire' for unobservability on the one hand, and for observability on the other hand. On a middle ground, one would at least be interested in some statistical features of the Bitcoin network, such as the number of peers that participate in the propagation of transactions and blocks. This number is composed of the number of reachable peers that accept incoming connections and unreachable peers that do not accept incoming connections. While the number of reachable peers can be measured, it is inherently difficult to determine the number of unreachable peers. Thus, the number of unreachable peers can only be estimated based on some indicators. In this paper, we first define our understanding of unreachable peers and then propose the PAL (Passive Announcement Listening) method, which gives an estimate of the number of unreachable peers by observing ADDR messages that announce active IP addresses in the network. The PAL method allows for detecting unreachable peers that indicate that they provide services useful to the P2P network. In conjunction with previous methods, the PAL method can help to get a better estimate of the number of unreachable peers. We use the PAL method to analyze data from a long-term measurement of the Bitcoin P2P network that gives insights into the development of the number of unreachable peers over five years from 2015 to 2020. Results show that about 31,000 unreachable peers providing useful services were active per day at the end of the year 2020. An empirical validation indicates that the approach finds about 50% of unreachable peers that provide useful services.

Preprint, 2020, arXiv

Attention Mesh: High-fidelity Face Mesh Prediction in Real-time

We present Attention Mesh, a lightweight architecture for 3D face mesh prediction that uses attention to semantically meaningful regions. Our neural network is designed for real-time on-device inference and runs at over 50 FPS on a Pixel 2 phone. Our solution enables applications like AR makeup, eye tracking and AR puppeteering that rely on highly accurate landmarks for eye and lip regions. Our main contribution is a unified network architecture that achieves the same accuracy on facial landmarks as a multi-stage cascaded approach, while being 30 percent faster.

Preprint, 2020, arXiv

BlazePose: On-device Real-time Body Pose tracking

We present BlazePose, a lightweight convolutional neural network architecture for human pose estimation that is tailored for real-time inference on mobile devices. During inference, the network produces 33 body keypoints for a single person and runs at over 30 frames per second on a Pixel 2 phone. This makes it particularly suited to real-time use cases like fitness tracking and sign language recognition. Our main contributions include a novel body pose tracking solution and a lightweight body pose estimation neural network that uses both heatmaps and regression to keypoint coordinates.

Preprint, 2020, arXiv

Instant 3D Object Tracking with Applications in Augmented Reality

Tracking object poses in 3D is a crucial building block for Augmented Reality applications. We propose an instant motion tracking system that tracks an object's pose in space (represented by its 3D bounding box) in real-time on mobile devices. Our system does not require any prior sensory calibration or initialization to function. We employ a deep neural network to detect objects and estimate their initial 3D pose. Then the estimated pose is tracked using a robust planar tracker. Our tracker is capable of performing relative-scale 9-DoF tracking in real-time on mobile devices. By efficiently combining the use of CPU and GPU, we achieve 26+ FPS performance on mobile devices.

Preprint, 2020, arXiv

MediaPipe Hands: On-device Real-time Hand Tracking

We present a real-time on-device hand tracking pipeline that predicts a hand skeleton from a single RGB camera for AR/VR applications. The pipeline consists of two models: 1) a palm detector, 2) a hand landmark model. It's implemented via MediaPipe, a framework for building cross-platform ML solutions. The proposed model and pipeline architecture demonstrate real-time inference speed on mobile GPUs and high prediction quality. MediaPipe Hands is open sourced at https://mediapipe.dev.

Preprint, 2020, arXiv

MobilePose: Real-Time Pose Estimation for Unseen Objects with Weak Shape Supervision

In this paper, we address the problem of detecting unseen objects from RGB images and estimating their poses in 3D. We propose two mobile-friendly networks: MobilePose-Base and MobilePose-Shape. The former is used when there is only pose supervision, and the latter is for the case when shape supervision is available, even a weak one. We revisit shape features used in previous methods, including segmentation and coordinate map. We explain when and why pixel-level shape supervision can improve pose estimation. Consequently, we add shape prediction as an intermediate layer in MobilePose-Shape, and let the network learn pose from shape. Our models are trained on mixed real and synthetic data, with weak and noisy shape supervision. They are ultra-lightweight and can run in real-time on modern mobile devices (e.g. 36 FPS on Galaxy S20). Compared with previous single-shot solutions, our method has higher accuracy, while using a significantly smaller model (2-3% in model size or number of parameters).

Preprint, 2015, arXiv

Finding Temporally Consistent Occlusion Boundaries in Videos using Geometric Context

We present an algorithm for finding temporally consistent occlusion boundaries in videos to support segmentation of dynamic scenes. We learn occlusion boundaries in a pairwise Markov random field (MRF) framework. We first estimate the probability of a spatio-temporal edge being an occlusion boundary by using appearance, flow, and geometric features. Next, we enforce occlusion boundary continuity in an MRF model by learning pairwise occlusion probabilities using a random forest. Then, we temporally smooth boundaries to remove temporal inconsistencies in occlusion boundary estimation. Our proposed framework provides an efficient approach for finding temporally consistent occlusion boundaries in video by utilizing causality, redundancy in videos, and semantic layout of the scene. We have developed a dataset with fully annotated ground-truth occlusion boundaries of over 30 videos (5000 frames). This dataset is used to evaluate temporal occlusion boundaries and provides a much needed baseline for future studies. We perform experiments to demonstrate the role of scene layout and temporal information for occlusion reasoning in dynamic scenes.

Preprint, 2015, arXiv

Geometric Context from Videos

We present a novel algorithm for estimating the broad 3D geometric structure of outdoor video scenes. Leveraging spatio-temporal video segmentation, we decompose a dynamic scene captured by a video into geometric classes, based on predictions made by region-classifiers that are trained on appearance and motion features. By examining the homogeneity of the prediction, we combine predictions across multiple segmentation hierarchy levels alleviating the need to determine the granularity a priori. We built a novel, extensive dataset on geometric context of video to evaluate our method, consisting of over 100 ground-truth annotated outdoor videos with over 20,000 frames. To further scale beyond this dataset, we propose a semi-supervised learning framework to expand the pool of labeled data with high confidence predictions obtained from unlabeled data. Our system produces an accurate prediction of geometric context of video achieving 96% accuracy across main geometric classes.
