Researcher profile

Matthias Grundmann

Matthias Grundmann contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
9works
0followers
6topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

9 published item(s)

preprint2024arXiv

StreamVC: Real-Time Low-Latency Voice Conversion

We present StreamVC, a streaming voice conversion solution that preserves the content and prosody of any source speech while matching the voice timbre from any target speech. Unlike previous approaches, StreamVC produces the resulting waveform at low latency from the input signal even on a mobile platform, making it applicable to real-time communication scenarios like calls and video conferencing, and addressing use cases such as voice anonymization in these scenarios. Our design leverages the architecture and training strategy of the SoundStream neural audio codec for lightweight high-quality speech synthesis. We demonstrate the feasibility of learning soft speech units causally, as well as the effectiveness of supplying whitened fundamental frequency information to improve pitch stability without leaking the source timbre information.

preprint2022arXiv

BlazePose GHUM Holistic: Real-time 3D Human Landmarks and Pose Estimation

We present BlazePose GHUM Holistic, a lightweight neural network pipeline for 3D human body landmarks and pose estimation, specifically tailored to real-time on-device inference. BlazePose GHUM Holistic enables motion capture from a single RGB image including avatar control, fitness tracking and AR/VR effects. Our main contributions include i) a novel method for 3D ground truth data acquisition, ii) updated 3D body tracking with additional hand landmarks and iii) full body pose estimation from a monocular image.

preprint2022arXiv

Efficient Heterogeneous Video Segmentation at the Edge

We introduce an efficient video segmentation system for resource-limited edge devices leveraging heterogeneous compute. Specifically, we design network models by searching across multiple dimensions of specifications for the neural architectures and operations on top of already light-weight backbones, targeting commercially available edge inference engines. We further analyze and optimize the heterogeneous data flows in our systems across the CPU, the GPU and the NPU. Our approach has empirically factored well into our real-time AR system, enabling remarkably higher accuracy with quadrupled effective resolutions, yet at much shorter end-to-end latency, much higher frame rate, and even lower power consumption on edge platforms.

preprint2021arXiv

On the Estimation of the Number of Unreachable Peers in the Bitcoin P2P Network by Observation of Peer Announcements

Bitcoin is based on a P2P network that is used to propagate transactions and blocks. While the P2P network design intends to hide the topology of the P2P network, information about the topology is required to understand the network from a scientific point of view. Thus, there is a natural tension between the 'desire' for unobservability on the one hand, and for observability on the other hand. On a middle ground, one would at least be interested on some statistical features of the Bitcoin network like the number of peers that participate in the propagation of transactions and blocks. This number is composed of the number of reachable peers that accept incoming connections and unreachable peers that do not accept incoming connections. While the number of reachable peers can be measured, it is inherently difficult to determine the number of unreachable peers. Thus, the number of unreachable peers can only be estimated based on some indicators. In this paper, we first define our understanding of unreachable peers and then propose the PAL (Passive Announcement Listening) method which gives an estimate of the number of unreachable peers by observing ADDR messages that announce active IP addresses in the network. The PAL method allows for detecting unreachable peers that indicate that they provide services useful to the P2P network. In conjunction with previous methods, the PAL method can help to get a better estimate of the number of unreachable peers. We use the PAL method to analyze data from a long-term measurement of the Bitcoin P2P network that gives insights into the development of the number of unreachable peers over five years from 2015 to 2020. Results show that about 31,000 unreachable peers providing useful services were active per day at the end of the year 2020. An empirical validation indicates that the approach finds about 50 % of unreachable peers that provide useful services.

preprint2020arXiv

Attention Mesh: High-fidelity Face Mesh Prediction in Real-time

We present Attention Mesh, a lightweight architecture for 3D face mesh prediction that uses attention to semantically meaningful regions. Our neural network is designed for real-time on-device inference and runs at over 50 FPS on a Pixel 2 phone. Our solution enables applications like AR makeup, eye tracking and AR puppeteering that rely on highly accurate landmarks for eye and lips regions. Our main contribution is a unified network architecture that achieves the same accuracy on facial landmarks as a multi-stage cascaded approach, while being 30 percent faster.

preprint2020arXiv

BlazePose: On-device Real-time Body Pose tracking

We present BlazePose, a lightweight convolutional neural network architecture for human pose estimation that is tailored for real-time inference on mobile devices. During inference, the network produces 33 body keypoints for a single person and runs at over 30 frames per second on a Pixel 2 phone. This makes it particularly suited to real-time use cases like fitness tracking and sign language recognition. Our main contributions include a novel body pose tracking solution and a lightweight body pose estimation neural network that uses both heatmaps and regression to keypoint coordinates.

preprint2020arXiv

Instant 3D Object Tracking with Applications in Augmented Reality

Tracking object poses in 3D is a crucial building block for Augmented Reality applications. We propose an instant motion tracking system that tracks an object's pose in space (represented by its 3D bounding box) in real-time on mobile devices. Our system does not require any prior sensory calibration or initialization to function. We employ a deep neural network to detect objects and estimate their initial 3D pose. Then the estimated pose is tracked using a robust planar tracker. Our tracker is capable of performing relative-scale 9-DoF tracking in real-time on mobile devices. By combining use of CPU and GPU efficiently, we achieve 26-FPS+ performance on mobile devices.

preprint2020arXiv

MediaPipe Hands: On-device Real-time Hand Tracking

We present a real-time on-device hand tracking pipeline that predicts hand skeleton from single RGB camera for AR/VR applications. The pipeline consists of two models: 1) a palm detector, 2) a hand landmark model. It's implemented via MediaPipe, a framework for building cross-platform ML solutions. The proposed model and pipeline architecture demonstrates real-time inference speed on mobile GPUs and high prediction quality. MediaPipe Hands is open sourced at https://mediapipe.dev.

preprint2020arXiv

MobilePose: Real-Time Pose Estimation for Unseen Objects with Weak Shape Supervision

In this paper, we address the problem of detecting unseen objects from RGB images and estimating their poses in 3D. We propose two mobile friendly networks: MobilePose-Base and MobilePose-Shape. The former is used when there is only pose supervision, and the latter is for the case when shape supervision is available, even a weak one. We revisit shape features used in previous methods, including segmentation and coordinate map. We explain when and why pixel-level shape supervision can improve pose estimation. Consequently, we add shape prediction as an intermediate layer in the MobilePose-Shape, and let the network learn pose from shape. Our models are trained on mixed real and synthetic data, with weak and noisy shape supervision. They are ultra lightweight that can run in real-time on modern mobile devices (e.g. 36 FPS on Galaxy S20). Comparing with previous single-shot solutions, our method has higher accuracy, while using a significantly smaller model (2~3% in model size or number of parameters).