Source author record

Walterio Mayol-Cuevas

Walterio Mayol-Cuevas appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Computer Vision, eess.SP, Emerging Technologies, Robotics

Catalog footprint

What is connected

6 works

4 topics

4 close collaborators

preprint, 2022, arXiv

On-Sensor Binarized Fully Convolutional Neural Network with A Pixel Processor Array

This work presents a method to implement fully convolutional neural networks (FCNs) on Pixel Processor Array (PPA) sensors, and demonstrates coarse segmentation and object localisation tasks. We design and train binarized FCNs with both binary weights and binary activations, using batchnorm, group convolution, and learnable thresholds for binarization, producing networks small enough to be embedded on the focal plane of the PPA, with limited local memory resources, and using parallel elementary add/subtract, shifting, and bit operations only. We demonstrate the first implementation of an FCN on a PPA device, performing three convolution layers entirely in the pixel-level processors. We use this architecture to demonstrate inference generating heat maps for object segmentation and localisation at over 280 FPS using the SCAMP-5 PPA vision chip.
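To make the "add/subtract only" constraint concrete, here is a minimal sketch of a convolution with +1/-1 weights followed by threshold binarization. It is an illustration under stated assumptions, not the authors' SCAMP-5 implementation: the function names `binarize` and `binary_conv2d` are hypothetical, the threshold is a fixed number rather than a learned one, and the loops run sequentially whereas a PPA would apply each step to all pixels in parallel.

```python
def binarize(values, threshold=0):
    """Map each value to +1 or -1 by comparing it against a threshold.

    Assumption: the threshold is a fixed number here; the abstract
    describes learnable thresholds, which this sketch does not train.
    """
    return [[1 if v >= threshold else -1 for v in row] for row in values]


def binary_conv2d(image, kernel):
    """Valid 2D cross-correlation with a +1/-1 kernel.

    Because every weight is +1 or -1, each multiply-accumulate reduces
    to adding or subtracting the pixel value.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    pixel = image[i + di][j + dj]
                    if kernel[di][dj] > 0:
                        acc += pixel  # weight +1: add
                    else:
                        acc -= pixel  # weight -1: subtract
            out[i][j] = acc
    return out


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kernel = [[1, 1], [1, -1]]
    response = binary_conv2d(image, kernel)
    print(response)
    print(binarize(response, threshold=5))
```

Binarizing the response map again keeps the next layer's inputs in the same +1/-1 form, which is what lets several such layers be chained with the same elementary operations.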

preprint, 2020, arXiv

Action Modifiers: Learning from Adverbs in Instructional Videos

We present a method to learn a representation for adverbs from instructional videos using weak supervision from the accompanying narrations. Key to our method is the fact that the visual representation of the adverb is highly dependent on the action to which it applies, although the same adverb will modify multiple actions in a similar way. For instance, while 'spread quickly' and 'mix quickly' will look dissimilar, we can learn a common representation that allows us to recognize both, among other actions. We formulate this as an embedding problem, and use scaled dot-product attention to learn from weakly-supervised video narrations. We jointly learn adverbs as invertible transformations operating on the embedding space, so as to add or remove the effect of the adverb. As there is no prior work on weakly supervised learning from adverbs, we gather paired action-adverb annotations from a subset of the HowTo100M dataset for 6 adverbs: quickly/slowly, finely/coarsely, and partially/completely. Our method outperforms all baselines for video-to-adverb retrieval with a performance of 0.719 mAP. We also demonstrate our model's ability to attend to the relevant video parts in order to determine the adverb for a given action.
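Scaled dot-product attention, which the abstract names, can be sketched in a few lines. This is the generic operation only, not the paper's model: treating an action embedding as the query and per-segment video features as keys and values is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(query, keys, values):
    """Return (pooled vector, attention weights).

    Each score is dot(query, key) / sqrt(d), where d is the query
    dimension; the weights are the softmax of the scores, and the
    pooled vector is the weighted sum of the value vectors.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    pooled = [
        sum(w * v[t] for w, v in zip(weights, values))
        for t in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    # Hypothetical inputs: one action query and two video segments.
    action_query = [1.0, 0.0]
    segment_keys = [[4.0, 0.0], [0.0, 4.0]]
    segment_values = [[1.0, 0.0], [0.0, 1.0]]
    pooled, weights = scaled_dot_product_attention(
        action_query, segment_keys, segment_values
    )
    print(weights, pooled)
```

The weights indicate which segments contribute most to the pooled representation, which corresponds to the abstract's point about attending to the relevant video parts.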

preprint, 2020, arXiv

Fully Embedding Fast Convolutional Networks on Pixel Processor Arrays

We present a novel method of CNN inference for pixel processor array (PPA) vision sensors, designed to take advantage of their massive parallelism and analog compute capabilities. PPA sensors consist of an array of processing elements (PEs), with each PE capable of light capture, data storage and computation, allowing various computer vision processing to be executed directly upon the sensor device. The key idea behind our approach is storing network weights "in-pixel" within the PEs of the PPA sensor itself to allow various computations, such as multiple different image convolutions, to be carried out in parallel. Our approach can perform convolutional layers, max pooling, ReLU, and a final fully connected layer entirely upon the PPA sensor, while leaving no untapped computational resources. This is in contrast to previous works that only use sensor-level processing to sequentially compute image convolutions, and must transfer data to an external digital processor to complete the computation. We demonstrate our approach on the SCAMP-5 vision system, performing inference of an MNIST digit classification network at over 3000 frames per second and over 93% classification accuracy. This is the first work demonstrating CNN inference conducted entirely upon the processor array of a PPA vision sensor device, requiring no external processing.

preprint, 2016, arXiv

SEMBED: Semantic Embedding of Egocentric Action Videos

We present SEMBED, an approach for embedding an egocentric object interaction video in a semantic-visual graph to estimate the probability distribution over its potential semantic labels. When object interactions are annotated using an unbounded choice of verbs, we embrace the wealth and ambiguity of these labels by capturing the semantic relationships as well as the visual similarities over motion and appearance features. We show how SEMBED can interpret a challenging dataset of 1225 freely annotated egocentric videos, outperforming SVM classification by more than 5%.

preprint, 2016, arXiv

Towards an objective evaluation of underactuated gripper designs

In this paper we explore state-of-the-art underactuated, compliant robot gripper designs by looking at their performance on a generic grasping task. Starting from a state-of-the-art open gripper design, we propose design modifications and, importantly, evaluate all designs on a grasping experiment involving a selection of objects, resulting in 3600 object-gripper interactions. Focusing on a design's generic performance rather than on planned grasping, we explore the influence of object shape, pose and orientation relative to the gripper and its finger number and configuration. Using open-loop grasps we achieved up to 75% success rate over our trials. The results indicate that, under motion constraints and uncertainties and without involving grasp planning, a 2-fingered underactuated compliant hand outperforms higher multi-fingered configurations. To our knowledge this is the first extended objective comparison of various multi-fingered underactuated hand designs under generic grasping conditions.

preprint, 2016, arXiv

You-Do, I-Learn: Unsupervised Multi-User egocentric Approach Towards Video-Based Guidance

This paper presents an unsupervised approach towards automatically extracting video-based guidance on object usage, from egocentric video and wearable gaze tracking, collected from multiple users while performing tasks. The approach i) discovers task-relevant objects, ii) builds a model for each, iii) distinguishes different ways in which each discovered object has been used and iv) discovers the dependencies between object interactions. The work investigates using appearance, position, motion and attention, and presents results using each and a combination of relevant features. Moreover, an online scalable approach is presented and is compared to offline results. The paper proposes a method for selecting a suitable video guide to be displayed to a novice user indicating how to use an object, purely triggered by the user's gaze. The potential assistive mode can also recommend an object to be used next based on the learnt sequence of object interactions. The approach was tested on a variety of daily tasks such as initialising a printer, preparing a coffee and setting up a gym machine.
