Source author record

Ruohan Gao

Ruohan Gao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Topics: Computer Vision, Sound, eess.AS, Machine Learning, Robotics, Artificial Intelligence, physics.geo-ph

Catalog footprint

What is connected

6 works

7 topics

4 close collaborators

Preprint, 2022, arXiv

ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer

Objects play a crucial role in our everyday activities. Though multisensory object-centric learning has shown great potential lately, the modeling of objects in prior work is rather unrealistic. ObjectFolder 1.0 is a recent dataset that introduces 100 virtualized objects with visual, acoustic, and tactile sensory data. However, the dataset is small in scale and the multisensory data is of limited quality, hampering generalization to real-world scenarios. We present ObjectFolder 2.0, a large-scale, multisensory dataset of common household objects in the form of implicit neural representations that significantly enhances ObjectFolder 1.0 in three aspects. First, our dataset is 10 times larger in the number of objects and orders of magnitude faster in rendering time. Second, we significantly improve the multisensory rendering quality for all three modalities. Third, we show that models learned from virtual objects in our dataset successfully transfer to their real-world counterparts in three challenging tasks: object scale estimation, contact localization, and shape reconstruction. ObjectFolder 2.0 offers a new path and testbed for multisensory learning in computer vision and robotics. The dataset is available at https://github.com/rhgao/ObjectFolder.

Preprint, 2021, arXiv

Learning to Set Waypoints for Audio-Visual Navigation

In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations. We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements: 1) waypoints that are dynamically set and learned end-to-end within the navigation policy, and 2) an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves. Both new ideas capitalize on the synergy of audio and visual data for revealing the geometry of an unmapped space. We demonstrate our approach on two challenging datasets of real-world 3D scenes, Replica and Matterport3D. Our model improves the state of the art by a substantial margin, and our experiments reveal that learning the links between sights, sounds, and space is essential for audio-visual navigation. Project: http://vision.cs.utexas.edu/projects/audio_visual_waypoints.

Preprint, 2020, arXiv

JULOC: A Local 3-D Refined Crust Model for the Geoneutrino Measurement at JUNO

Geothermal energy is key to driving plate tectonics and the interior thermodynamics of the Earth. The surface heat flux, as measured in boreholes, provides limited insight into the relative contributions of primordial versus radiogenic sources to the heat budget of the mantle. Geoneutrinos, electron antineutrinos produced by the radioactive decay of heat-producing elements, are unique probes that bring direct information about the amount and distribution of heat-producing elements in the crust and mantle. Cosmochemical, geochemical, and geodynamic compositional models of the Bulk Silicate Earth (BSE) each predict different mantle neutrino fluxes, and can therefore be distinguished by the direct measurement of geoneutrinos. The 20 kton detector of the Jiangmen Underground Neutrino Observatory (JUNO), currently under construction in Guangdong Province (China), is expected to provide an exciting opportunity to obtain a high-statistics measurement, which will produce sufficient data to address several key questions of geological importance. To test different compositional models of the mantle, it is important to first obtain an accurate estimate of the crustal geoneutrino flux based on a three-dimensional (3-D) crust model. This paper presents a 3-D crust model covering a 10° × 10° surface grid surrounding the JUNO detector and extending in depth down to the Moho discontinuity, based on geological, geophysical, and geochemical properties. The 3-D model distinguishes the volumes of the different geological layers and gives the corresponding Th and U abundances. We also present our predicted local contribution to the total geoneutrino flux and the corresponding radiogenic heat.

Preprint, 2020, arXiv

Listen to Look: Action Recognition by Previewing Audio

In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an ImgAud2Vid framework that hallucinates clip-level features by distilling from lighter modalities (a single frame and its accompanying audio), reducing short-term temporal redundancy for efficient clip-level recognition. Second, building on ImgAud2Vid, we further propose ImgAud-Skimming, an attention-based long short-term memory network that iteratively selects useful moments in untrimmed videos, reducing long-term temporal redundancy for efficient video-level recognition. Extensive experiments on four action recognition datasets demonstrate that our method achieves state-of-the-art results in terms of both recognition accuracy and speed.

Preprint, 2020, arXiv

VisualEchoes: Spatial Image Representation Learning through Echolocation

Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world. We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. First, we capture echo responses in photo-realistic 3D indoor scene environments. Then we propose a novel interaction-based representation learning framework that learns useful visual features via echolocation. We show that the learned image features are useful for multiple downstream vision tasks requiring spatial reasoning (monocular depth estimation, surface normal estimation, and visual navigation), with results comparable to or even better than heavily supervised pre-training. Our work opens a new path for representation learning for embodied agents, where supervision comes from interacting with the physical world.

Preprint, 2016, arXiv

Object-Centric Representation Learning from Unlabeled Videos

Supervised (pre-)training currently yields state-of-the-art performance for representation learning for visual recognition, yet it comes at the cost of (1) intensive manual annotations and (2) an inherent restriction in the scope of data relevant for learning. In this work, we explore unsupervised feature learning from unlabeled video. We introduce a novel object-centric approach to temporal coherence that encourages similar representations to be learned for object-like regions segmented from nearby frames. Our framework relies on a Siamese-triplet network to train a deep convolutional neural network (CNN) representation. Compared to existing temporal coherence methods, our idea has the advantage of lightweight preprocessing of the unlabeled video (no tracking required) while still being able to extract object-level regions from which to learn invariances. Furthermore, as we show in results on several standard datasets, our method typically achieves substantial accuracy gains over competing unsupervised methods for image classification and retrieval tasks.
