Source author record

Marius Leordeanu

Marius Leordeanu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computer Vision · Machine Learning · Computation and Language · eess.IV · Neural and Evolutionary Computing · physics.soc-ph · Populations and Evolution

Catalog footprint

What is connected

13 works

7 topics

4 close collaborators

preprint · 2025 · arXiv

Learning from Random Subspace Exploration: Generalized Test-Time Augmentation with Self-supervised Distillation

We introduce Generalized Test-Time Augmentation (GTTA), a highly effective method for improving the performance of a trained model, which, unlike other existing Test-Time Augmentation approaches from the literature, is general enough to be used off-the-shelf for many vision and non-vision tasks, such as classification, regression, image segmentation and object detection. By applying a new general data transformation that randomly perturbs the PCA subspace projection of a test input multiple times, GTTA creates valid augmented samples from the data distribution with high diversity, properties that we show theoretically to be essential for a Test-Time Augmentation method to be effective. Different from other existing methods, we also propose a final self-supervised learning stage in which the ensemble output, acting as an unsupervised teacher, is used to train the initial single student model, thus significantly reducing the test-time computational cost. Our comparisons to strong TTA approaches and SoTA models on various well-known vision and non-vision datasets and tasks, such as image classification and segmentation, pneumonia detection, speech recognition and house price prediction, validate the generality of the proposed GTTA. Furthermore, we also prove its effectiveness on the more specific real-world task of salmon segmentation and detection in low-visibility underwater videos, for which we introduce DeepSalmon, the largest dataset of its kind in the literature.
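The subspace perturbation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes Gaussian noise scaled by each component's standard deviation, keeps the part of the input outside the subspace unchanged, and averages the model outputs. The names `fit_pca`, `gtta_predict`, `n_aug` and `noise_scale` are hypothetical.

```python
import numpy as np

def fit_pca(X, k):
    """Return the data mean, the top-k principal directions and their std."""
    mean = X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal directions.
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    std = s[:k] / np.sqrt(max(len(X) - 1, 1))
    return mean, Vt[:k], std

def gtta_predict(model, x, mean, components, std,
                 n_aug=16, noise_scale=0.5, seed=0):
    """Average model outputs over randomly perturbed PCA projections of x."""
    rng = np.random.default_rng(seed)
    z = (x - mean) @ components.T            # coordinates in the subspace
    # Assumption: the part of x outside the subspace is left untouched.
    residual = x - (mean + z @ components)
    noise = rng.normal(size=(n_aug, len(z))) * std * noise_scale
    augmented = mean + (z + noise) @ components + residual
    outputs = np.stack([model(a) for a in augmented])
    return outputs.mean(axis=0), augmented
```

In the distillation stage mentioned in the abstract, the averaged output would play the role of the teacher target for the single student model; that training loop is not shown here.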

preprint · 2022 · arXiv

A regime switching on Covid19 analysis and prediction in Romania

In this paper we propose a three-stage analysis of the evolution of Covid19 in Romania. There are two main issues when it comes to pandemic prediction. The first is that the reported numbers of infected and recovered people are unreliable, whereas the number of deaths is more accurate. The second is that many factors affected the evolution of the pandemic. The first stage is based on the classical SIR model, which we fit using a neural network. This provides a first set of daily parameters. In the second stage we propose a refinement of the SIR model in which we separate the deceased into a distinct category (SIRD). By using the first estimate and a grid search, we give a daily estimation of the parameters. The third stage is used to define a notion of turning points (local extremes) for the parameters. We call a regime the time between these points. We outline a general way, based on time-varying parameters of SIRD, to make predictions.
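A small sketch may help make the SIRD model with daily parameters and the notion of turning points concrete. It is an illustration under stated assumptions, not the paper's procedure: it uses a plain Euler step with the total population in the denominator, and it detects turning points as strict local extremes. The function names and the parameter names `beta`, `gamma` and `mu` (infection, recovery and death rates) are hypothetical.

```python
def sird_step(state, beta, gamma, mu, dt=1.0):
    """Advance (S, I, R, D) by one Euler step of the SIRD equations."""
    S, I, R, D = state
    N = S + I + R + D                     # assumption: total population is constant
    new_infections = beta * S * I / N * dt
    recoveries = gamma * I * dt
    deaths = mu * I * dt
    return (S - new_infections,
            I + new_infections - recoveries - deaths,
            R + recoveries,
            D + deaths)

def simulate(state, daily_params):
    """Run the model with one (beta, gamma, mu) triple per day."""
    trajectory = [state]
    for beta, gamma, mu in daily_params:
        state = sird_step(state, beta, gamma, mu)
        trajectory.append(state)
    return trajectory

def turning_points(values):
    """Indices of strict local extremes in a daily parameter series."""
    return [i for i in range(1, len(values) - 1)
            if (values[i] - values[i - 1]) * (values[i + 1] - values[i]) < 0]
```

Under this reading, a regime is the run of days between two consecutive indices returned by `turning_points`.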

preprint · 2020 · arXiv

A 3D Convolutional Approach to Spectral Object Segmentation in Space and Time

We formulate object segmentation in video as a graph partitioning problem in space and time, in which nodes are pixels and their relations form local neighborhoods. We claim that the strongest cluster in this pixel-level graph represents the salient object segmentation. We compute the main cluster using a novel and fast 3D filtering technique that finds the spectral clustering solution, namely the principal eigenvector of the graph's adjacency matrix, without building the matrix explicitly, which would be intractable. Our method is based on the power iteration for finding the principal eigenvector of a matrix, which we prove is equivalent to performing a specific set of 3D convolutions in the space-time feature volume. This allows us to avoid creating the matrix and to have a fast parallel implementation on GPU. We show that our method is much faster than classical power iteration applied directly on the adjacency matrix. Different from other works, ours is dedicated to preserving object consistency in space and time at the level of pixels. For that, it requires powerful pixel-wise features at the frame level. This makes it well suited to incorporating the output of a backbone network or other methods and quickly improving on their solutions without supervision. In experiments, we obtain consistent improvement, with the same set of hyper-parameters, over the top state-of-the-art methods on the DAVIS-2016 dataset, in both unsupervised and semi-supervised tasks. We also achieve top results on the well-known SegTrackv2 dataset.
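The equivalence between power iteration and 3D filtering can be sketched for a deliberately simplified affinity. The example below assumes the adjacency matrix is M[i, j] = f_i * f_j for pixels j inside a cubic space-time neighborhood of pixel i (including i itself), where f is a single non-negative feature per pixel. This affinity is an assumption chosen for illustration, not the one used in the paper, and `box_sum_3d` and `spectral_saliency` are hypothetical names.

```python
import numpy as np

def box_sum_3d(v, r=1):
    """Sum of v over a (2r+1)^3 neighborhood, zero-padded at the borders."""
    p = np.pad(v, r)
    T, H, W = v.shape
    out = np.zeros_like(v, dtype=float)
    for dt in range(2 * r + 1):
        for dy in range(2 * r + 1):
            for dx in range(2 * r + 1):
                out += p[dt:dt + T, dy:dy + H, dx:dx + W]
    return out

def spectral_saliency(features, r=1, n_iter=30):
    """Power iteration for M[i, j] = f_i * f_j over local neighbors.

    The matrix M is never built: multiplying by M is the same as
    filtering (features * x) with a 3D box and rescaling by features.
    """
    x = np.ones_like(features, dtype=float)
    for _ in range(n_iter):
        x = features * box_sum_3d(features * x, r)   # one multiplication by M
        x /= np.linalg.norm(x)                       # keep unit norm
    return x
```

For a volume with n pixels the explicit matrix has n^2 entries, while each filtering pass touches every pixel only (2r+1)^3 times, which is what makes the matrix-free form practical.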

preprint · 2020 · arXiv

In Search of Life: Learning from Synthetic Data to Detect Vital Signs in Videos

Automatically detecting vital signs in videos, such as the estimation of heart and respiration rates, is a challenging research problem in computer vision with important applications in the medical field. One of the key difficulties in tackling this task is the lack of sufficient supervised training data, which severely limits the use of powerful deep neural networks. In this paper we address this limitation through a novel deep learning approach, in which a recurrent deep neural network is trained to detect vital signs in the infrared thermal domain from purely synthetic data. What is most surprising is that our novel method for synthetic training data generation is general, relatively simple and uses almost no prior medical domain knowledge. Moreover, our system, which is trained in a purely automatic manner and needs no human annotation, also learns to predict the respiration or heart intensity signal for each moment in time and to detect the region of interest that is most relevant for the given task, e.g. the nose area in the case of respiration. We test the effectiveness of our proposed system on the recent LCAS dataset and obtain state-of-the-art results.

preprint · 2019 · arXiv

Recurrent Space-time Graph Neural Networks

Learning in the space-time domain remains a very challenging problem in machine learning and computer vision. Current computational models for understanding spatio-temporal visual data are heavily rooted in the classical single-image based paradigm. It is not yet well understood how to integrate information in space and time into a single, general model. We propose a neural graph model, recurrent in space and time, suitable for capturing both the local appearance and the complex higher-level interactions of different entities and objects within the changing world scene. Nodes and edges in our graph have dedicated neural networks for processing information. Nodes operate over features extracted from local parts in space and time and previous memory states. Edges process messages between connected nodes at different locations and spatial scales or between past and present time. Messages are passed iteratively in order to transmit information globally and establish long range interactions. Our model is general and could learn to recognize a variety of high level spatio-temporal concepts and be applied to different learning tasks. We demonstrate, through extensive experiments and ablation studies, that our model outperforms strong baselines and top published methods on recognizing complex activities in video. Moreover, we obtain state-of-the-art performance on the challenging Something-Something human-object interaction dataset.

preprint · 2018 · arXiv

Mining for meaning: from vision to language through multiple networks consensus

Describing visual data in natural language is a very challenging task, at the intersection of computer vision, natural language processing and machine learning. Language goes well beyond the description of physical objects and their interactions and can convey the same abstract idea in many ways. It is both about content at the highest semantic level and about fluent form. Here we propose an approach to describe videos in natural language by reaching a consensus among multiple encoder-decoder networks. Such a consensual linguistic description, which shares common properties with a larger group, has a better chance of conveying the correct meaning. We propose and train several network architectures and use different types of image, audio and video features. Each model produces its own description of the input video and the best one is chosen through an efficient, two-phase consensus process. We demonstrate the strength of our approach by obtaining state-of-the-art results on the challenging MSR-VTT dataset.
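The idea of choosing the description that agrees most with the others can be sketched as follows. This is a single-pass stand-in, not the two-phase process from the paper, whose details the abstract does not give. It assumes word-level Jaccard overlap as the similarity measure; `jaccard` and `consensus_caption` are hypothetical names.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two sentences."""
    A, B = set(a.lower().split()), set(b.lower().split())
    return len(A & B) / len(A | B) if A | B else 0.0

def consensus_caption(captions):
    """Return the caption with the highest mean similarity to all others."""
    def support(i):
        others = [c for j, c in enumerate(captions) if j != i]
        return sum(jaccard(captions[i], c) for c in others) / max(len(others), 1)
    return captions[max(range(len(captions)), key=support)]

# Each string stands for the output of one captioning model.
candidates = [
    "a man is playing a guitar",
    "a man plays guitar",
    "a person is playing a guitar",
    "a dog runs in a field",
]
best = consensus_caption(candidates)
```

The outlier caption about the dog shares almost no words with the rest, so it receives low support and cannot be selected.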

preprint · 2016 · arXiv

Aerial image geolocalization from recognition and matching of roads and intersections

Aerial image analysis at a semantic level is important in many applications with strong potential impact in industry and consumer use, such as automated mapping, urban planning, real estate and environment monitoring, or disaster relief. The problem is enjoying great interest in computer vision and remote sensing, due to increased computing power and improvements in automated image understanding algorithms. In this paper we address the task of automatic geolocalization of aerial images from recognition and matching of roads and intersections. Our proposed method is a novel contribution to the literature that could enable many applications of aerial image analysis when GPS data is not available. We offer a complete pipeline for geolocalization, from the detection of roads and intersections, to the identification of the enclosing geographic region by matching detected intersections to previously learned, manually labeled ones, followed by accurate geometric alignment between the detected roads and the manually labeled maps. We test on a novel dataset with aerial images of two European cities and use the publicly available OpenStreetMap project for collecting ground-truth road annotations. We show in extensive experiments that our approach produces highly accurate localizations in the challenging case when we train on images from one city and test on the other, and the quality of the aerial images is relatively poor. We also show that the alignment between detected roads and pre-stored manual annotations can be effectively used for improving the quality of the road detection results.

preprint · 2016 · arXiv

Dual Local-Global Contextual Pathways for Recognition in Aerial Imagery

Visual context is important in object recognition and it is still an open problem in computer vision. Along with the advent of deep convolutional neural networks (CNN), using contextual information with such systems starts to receive attention in the literature. At the same time, aerial imagery is gaining momentum. While advances in deep learning make good progress in aerial image analysis, this problem still poses many great challenges. Aerial images are often taken under poor lighting conditions and contain low resolution objects, many times occluded by trees or taller buildings. In this domain, in particular, visual context could be of great help, but there are still very few papers that consider context in aerial image understanding. Here we introduce context as a complementary way of recognizing objects. We propose a dual-stream deep neural network model that processes information along two independent pathways, one for local and another for global visual reasoning. The two are later combined in the final layers of processing. Our model learns to combine local object appearance as well as information from the larger scene at the same time and in a complementary way, such that together they form a powerful classifier. We test our dual-stream network on the task of segmentation of buildings and roads in aerial images and obtain state-of-the-art results on the Massachusetts Buildings Dataset. We also introduce two new datasets, for buildings and road segmentation, respectively, and study the relative importance of local appearance vs. the larger scene, as well as their performance in combination. While our local-global model could also be useful in general recognition tasks, we clearly demonstrate the effectiveness of visual context in conjunction with deep nets for aerial image understanding.

preprint · 2015 · arXiv

Labeling the Features Not the Samples: Efficient Video Classification with Minimal Supervision

Feature selection is essential for effective visual recognition. We propose an efficient joint classifier learning and feature selection method that discovers sparse, compact representations of input features from a vast sea of candidates, with an almost unsupervised formulation. Our method requires only the following knowledge, which we call the feature sign: whether or not a particular feature has on average stronger values over positive samples than over negatives. We show how this can be estimated using as few as a single labeled training sample per class. Then, using these feature signs, we extend an initial supervised learning problem into an (almost) unsupervised clustering formulation that can incorporate new data without requiring ground truth labels. Our method works both as a feature selection mechanism and as a fully competitive classifier. It has important properties, low computational cost and excellent accuracy, especially in difficult cases of very limited training data. We experiment on large-scale recognition in video and show superior speed and performance to established feature selection approaches such as AdaBoost, Lasso, greedy forward-backward selection, and powerful classifiers such as SVM.
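The feature sign defined in this abstract is straightforward to estimate from a few labeled samples, as the sketch below shows. The flipping step is an assumption added for illustration: it supposes features are scaled to [0, 1] and mirrors the negatively signed ones so that larger values always favor the positive class. The names `feature_signs` and `flip_features` are hypothetical, and the later clustering stage is not shown.

```python
import numpy as np

def feature_signs(X_pos, X_neg):
    """+1 where a feature is on average stronger over positives, else -1."""
    return np.where(X_pos.mean(axis=0) > X_neg.mean(axis=0), 1, -1)

def flip_features(X, signs):
    """Mirror negatively signed features (assumes values lie in [0, 1])."""
    return np.where(signs > 0, X, 1.0 - X)

# One labeled sample per class is enough to produce an estimate.
X_pos = np.array([[0.9, 0.1, 0.5]])
X_neg = np.array([[0.2, 0.8, 0.5]])
signs = feature_signs(X_pos, X_neg)
flipped = flip_features(X_pos, signs)
```

Ties, such as the third feature here, default to -1 in this sketch; with more labeled samples the estimate becomes less sensitive to such cases.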

preprint · 2015 · arXiv

Stories in the Eye: Contextual Visual Interactions for Efficient Video to Language Translation

Integrating higher level visual and linguistic interpretations is at the heart of human intelligence. While automatic visual category recognition in images is approaching human performance, high-level understanding in the dynamic spatiotemporal domain of videos and its translation into natural language are still far from being solved. While most works on vision-to-text translation use pre-learned or pre-established computational linguistic models, in this paper we present an approach that uses vision alone to efficiently learn how to translate video content into language. We discover, in simple form, the story played by the main actors, while using only visual cues for representing objects and their interactions. Our method learns, in a hierarchical manner, higher-level representations for recognizing the subjects, actions and objects involved, their relevant contextual background and their interactions with one another over time. We have a three-stage approach: first we take into consideration features of the individual entities at the local level of appearance, then we consider the relationship between these objects and actions and their video background, and third, we consider their spatiotemporal relations as inputs to classifiers at the highest level of interpretation. Thus, our approach finds a coherent linguistic description of videos in the form of a subject, verb and object based on the role they play in the overall visual story learned directly from training data, without using a known language model. We test the efficiency of our approach on a large-scale dataset containing YouTube clips taken in the wild and demonstrate state-of-the-art performance, often superior to current approaches that use more complex, pre-learned linguistic knowledge.

preprint · 2014 · arXiv

Features in Concert: Discriminative Feature Selection meets Unsupervised Clustering

Feature selection is an essential problem in computer vision, important for category learning and recognition. Along with the rapid development of a wide variety of visual features and classifiers, there is a growing need for efficient feature selection and combination methods, to construct powerful classifiers for more complex and higher-level recognition tasks. We propose an algorithm that efficiently discovers sparse, compact representations of input features or classifiers, from a vast sea of candidates, with important optimality properties, low computational cost and excellent accuracy in practice. Different from boosting, we start with a discriminant linear classification formulation that encourages sparse solutions. Then we obtain an equivalent unsupervised clustering problem that jointly discovers ensembles of diverse features. They are independently valuable but even more powerful when united in a cluster of classifiers. We evaluate our method on the task of large-scale recognition in video and show that it significantly outperforms classical selection approaches, such as AdaBoost and greedy forward-backward selection, and powerful classifiers such as SVMs, in speed of training and performance, especially in the case of limited training data.

preprint · 2014 · arXiv

Thoughts on a Recursive Classifier Graph: a Multiclass Network for Deep Object Recognition

We propose a general multi-class visual recognition model, termed the Classifier Graph, which aims to generalize and integrate ideas from many of today's successful hierarchical recognition approaches. Our graph-based model has the advantage of enabling rich interactions between classes from different levels of interpretation and abstraction. The proposed multi-class system is efficiently learned using step by step updates. The structure consists of simple logistic linear layers with inputs from features that are automatically selected from a large pool. Each newly learned classifier becomes a potential new feature. Thus, our feature pool can consist both of initial manually designed features as well as learned classifiers from previous steps (graph nodes), each copied many times at different scales and locations. In this manner we can learn and grow both a deep, complex graph of classifiers and a rich pool of features at different levels of abstraction and interpretation. Our proposed graph of classifiers becomes a multi-class system with a recursive structure, suitable for deep detection and recognition of several classes simultaneously.

preprint · 2012 · arXiv

Generalized Boundaries from Multiple Image Interpretations

Boundary detection is essential for a variety of computer vision tasks such as segmentation and recognition. In this paper we propose a unified formulation and a novel algorithm that are applicable to the detection of different types of boundaries, such as intensity edges, occlusion boundaries or object category specific boundaries. Our formulation leads to a simple method with state-of-the-art performance and significantly lower computational cost than existing methods. We evaluate our algorithm on different types of boundaries, from low-level boundaries extracted in natural images, to occlusion boundaries obtained using motion cues and RGB-D cameras, to boundaries from soft-segmentation. We also propose a novel method for figure/ground soft-segmentation that can be used in conjunction with our boundary detection method and improve its accuracy at almost no extra computational cost.
