Source author record

Svebor Karaman

Svebor Karaman appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computer Vision Multimedia Machine Learning

Catalog footprint

What is connected

9works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2020arXiv

Bridging Knowledge Graphs to Generate Scene Graphs

Scene graphs are powerful representations that parse images into their abstract semantic elements, i.e., objects and their interactions, which facilitates visual comprehension and explainable reasoning. On the other hand, commonsense knowledge graphs are rich repositories that encode how the world is structured, and how general concepts interact. In this paper, we present a unified formulation of these two constructs, where a scene graph is seen as an image-conditioned instantiation of a commonsense knowledge graph. Based on this new perspective, we re-formulate scene graph generation as the inference of a bridge between the scene and commonsense graphs, where each entity or predicate instance in the scene graph has to be linked to its corresponding entity or predicate class in the commonsense graph. To this end, we propose a novel graph-based neural network that iteratively propagates information between the two graphs, as well as within each of them, while gradually refining their bridge in each iteration. Our Graph Bridging Network, GB-Net, successively infers edges and nodes, allowing to simultaneously exploit and refine the rich, heterogeneous structure of the interconnected scene and commonsense graphs. Through extensive experimentation, we showcase the superior accuracy of GB-Net compared to the most recent methods, resulting in a new state of the art. We publicly release the source code of our method.

preprint2020arXiv

Weakly Supervised Visual Semantic Parsing

Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images, enabling deep understanding of visual content, with many applications such as visual reasoning and image retrieval. Nevertheless, existing SGG methods require millions of manually annotated bounding boxes for training, and are computationally inefficient, as they exhaustively process all pairs of object proposals to detect predicates. In this paper, we address those two limitations by first proposing a generalized formulation of SGG, namely Visual Semantic Parsing, which disentangles entity and predicate recognition, and enables sub-quadratic performance. Then we propose the Visual Semantic Parsing Network, VSPNet, based on a dynamic, attention-based, bipartite message passing framework that jointly infers graph nodes and edges through an iterative process. Additionally, we propose the first graph-based weakly supervised learning framework, based on a novel graph alignment algorithm, which enables training without bounding box annotations. Through extensive experiments, we show that VSPNet outperforms weakly supervised baselines significantly and approaches fully supervised performance, while being several times faster. We publicly release the source code of our method.

preprint2016arXiv

Deep Image Set Hashing

In applications involving matching of image sets, the information from multiple images must be effectively exploited to represent each set. State-of-the-art methods use probabilistic distribution or subspace to model a set and use specific distance measure to compare two sets. These methods are slow to compute and not compact to use in a large scale scenario. Learning-based hashing is often used in large scale image retrieval as they provide a compact representation of each sample and the Hamming distance can be used to efficiently compare two samples. However, most hashing methods encode each image separately and discard knowledge that multiple images in the same set represent the same object or person. We investigate the set hashing problem by combining both set representation and hashing in a single deep neural network. An image set is first passed to a CNN module to extract image features, then these features are aggregated using two types of set feature to capture both set specific and database-wide distribution information. The computed set feature is then fed into a multilayer perceptron to learn a compact binary embedding. Triplet loss is used to train the network by forming set similarity relations using class labels. We extensively evaluate our approach on datasets used for image matching and show highly competitive performance compared to state-of-the-art methods.

preprint2015arXiv

A Multi-Camera Image Processing and Visualization System for Train Safety Assessment

In this paper we present a machine vision system to efficiently monitor, analyze and present visual data acquired with a railway overhead gantry equipped with multiple cameras. This solution aims to improve the safety of daily life railway transportation in a two- fold manner: (1) by providing automatic algorithms that can process large imagery of trains (2) by helping train operators to keep attention on any possible malfunction. The system is designed with the latest cutting edge, high-rate visible and thermal cameras that ob- serve a train passing under an railway overhead gantry. The machine vision system is composed of three principal modules: (1) an automatic wagon identification system, recognizing the wagon ID according to the UIC classification of railway coaches; (2) a temperature monitoring system; (3) a system for the detection, localization and visualization of the pantograph of the train. These three machine vision modules process batch trains sequences and their resulting analysis are presented to an operator using a multitouch user interface. We detail all technical aspects of our multi-camera portal: the hardware requirements, the software developed to deal with the high-frame rate cameras and ensure reliable acquisition, the algorithms proposed to solve each computer vision task, and the multitouch interaction and visualization interface. We evaluate each component of our system on a dataset recorded in an ad-hoc railway test-bed, showing the potential of our proposed portal for train safety assessment.

preprint2014arXiv

Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia

This paper presents a method for indexing activities of daily living in videos obtained from wearable cameras. In the context of dementia diagnosis by doctors, the videos are recorded at patients' houses and later visualized by the medical practitioners. The videos may last up to two hours, therefore a tool for an efficient navigation in terms of activities of interest is crucial for the doctors. The specific recording mode provides video data which are really difficult, being a single sequence shot where strong motion and sharp lighting changes often appear. Our work introduces an automatic motion based segmentation of the video and a video structuring approach in terms of activities by a hierarchical two-level Hidden Markov Model. We define our description space over motion and visual characteristics of video and audio channels. Experiments on real data obtained from the recording at home of several patients show the difficulty of the task and the promising results of our approach.

preprint2014arXiv

Nested Graph Words for Object Recognition

In this paper, we propose a new, scalable approach for the task of object based image search or object recognition. Despite the very large literature existing on the scalability issues in CBIR in the sense of retrieval approaches, the scalability of media and scalability of features remain an issue. In our work we tackle the problem of scalability and structural organization of features. The proposed features are nested local graphs built upon sets of SURF feature points with Delaunay triangulation. A Bag-of-Visual-Words (BoVW) framework is applied on these graphs, giving birth to a Bag-of-Graph-Words representation. The nested nature of the descriptors consists in scaling from trivial Delaunay graphs - isolated feature points - by increasing the number of nodes layer by layer up to graphs with maximal number of nodes. For each layer of graphs its proper visual dictionary is built. The experiments conducted on the SIVAL data set reveal that the graph features at different layers exhibit complementary performances on the same content. The nested approach, the combination of all existing layers, yields significant improvement of the object recognition performance compared to single level approaches.

preprint2011arXiv

Activities of Daily Living Indexing by Hierarchical HMM for Dementia Diagnostics

This paper presents a method for indexing human ac- tivities in videos captured from a wearable camera being worn by patients, for studies of progression of the dementia diseases. Our method aims to produce indexes to facilitate the navigation throughout the individual video recordings, which could help doctors search for early signs of the dis- ease in the activities of daily living. The recorded videos have strong motion and sharp lighting changes, inducing noise for the analysis. The proposed approach is based on a two steps analysis. First, we propose a new approach to segment this type of video, based on apparent motion. Each segment is characterized by two original motion de- scriptors, as well as color, and audio descriptors. Second, a Hidden-Markov Model formulation is used to merge the multimodal audio and video features, and classify the test segments. Experiments show the good properties of the ap- proach on real data.

preprint2011arXiv

Multi-Layer Local Graph Words for Object Recognition

In this paper, we propose a new multi-layer structural approach for the task of object based image retrieval. In our work we tackle the problem of structural organization of local features. The structural features we propose are nested multi-layered local graphs built upon sets of SURF feature points with Delaunay triangulation. A Bag-of-Visual-Words (BoVW) framework is applied on these graphs, giving birth to a Bag-of-Graph-Words representation. The multi-layer nature of the descriptors consists in scaling from trivial Delaunay graphs - isolated feature points - by increasing the number of nodes layer by layer up to graphs with maximal number of nodes. For each layer of graphs its own visual dictionary is built. The experiments conducted on the SIVAL and Caltech-101 data sets reveal that the graph features at different layers exhibit complementary performances on the same content and perform better than baseline BoVW approach. The combination of all existing layers, yields significant improvement of the object recognition performance compared to single level approaches.

preprint2010arXiv

Human Daily Activities Indexing in Videos from Wearable Cameras for Monitoring of Patients with Dementia Diseases

Our research focuses on analysing human activities according to a known behaviorist scenario, in case of noisy and high dimensional collected data. The data come from the monitoring of patients with dementia diseases by wearable cameras. We define a structural model of video recordings based on a Hidden Markov Model. New spatio-temporal features, color features and localization features are proposed as observations. First results in recognition of activities are promising.

Svebor Karaman

What is connected

Connect this record

See the researcher in context

Building this map preview

9 published item(s)

Bridging Knowledge Graphs to Generate Scene Graphs

Weakly Supervised Visual Semantic Parsing

Deep Image Set Hashing

A Multi-Camera Image Processing and Visualization System for Train Safety Assessment

Hierarchical Hidden Markov Model in Detecting Activities of Daily Living in Wearable Videos for Studies of Dementia

Nested Graph Words for Object Recognition

Activities of Daily Living Indexing by Hierarchical HMM for Dementia Diagnostics

Multi-Layer Local Graph Words for Object Recognition

Human Daily Activities Indexing in Videos from Wearable Cameras for Monitoring of Patients with Dementia Diseases