Source author record

Georgiana Ifrim

Georgiana Ifrim appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Computation and Language eess.SP Artificial Intelligence Computational Complexity Information Retrieval physics.soc-ph Social and Information Networks

Catalog footprint

What is connected

10works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Automated Mobility Context Detection with Inertial Signals

Remote monitoring of motor functions is a powerful approach for health assessment, especially among the elderly population or among subjects affected by pathologies that negatively impact their walking capabilities. This is further supported by the continuous development of wearable sensor devices, which are getting progressively smaller, cheaper, and more energy efficient. The external environment and mobility context have an impact on walking performance, hence one of the biggest challenges when remotely analysing gait episodes is the ability to detect the context within which those episodes occurred. The primary goal of this paper is the investigation of context detection for remote monitoring of daily motor functions. We aim to understand whether inertial signals sampled with wearable accelerometers, provide reliable information to classify gait-related activities as either indoor or outdoor. We explore two different approaches to this task: (1) using gait descriptors and features extracted from the input inertial signals sampled during walking episodes, together with classic machine learning algorithms, and (2) treating the input inertial signals as time series data and leveraging end-to-end state-of-the-art time series classifiers. We directly compare the two approaches through a set of experiments based on data collected from 9 healthy individuals. Our results indicate that the indoor/outdoor context can be successfully derived from inertial data streams. We also observe that time series classification models achieve better accuracy than any other feature-based models, while preserving efficiency and ease of use.

preprint2022arXiv

Efficient Unsupervised Sentence Compression by Fine-tuning Transformers with Reinforcement Learning

Sentence compression reduces the length of text by removing non-essential content while preserving important facts and grammaticality. Unsupervised objective driven methods for sentence compression can be used to create customized models without the need for ground-truth training data, while allowing flexibility in the objective function(s) that are used for learning and inference. Recent unsupervised sentence compression approaches use custom objectives to guide discrete search; however, guided search is expensive at inference time. In this work, we explore the use of reinforcement learning to train effective sentence compression models that are also fast when generating predictions. In particular, we cast the task as binary sequence labelling and fine-tune a pre-trained transformer using a simple policy gradient approach. Our approach outperforms other unsupervised models while also being more efficient at inference time.

preprint2022arXiv

MrSQM: Fast Time Series Classification with Symbolic Representations

Symbolic representations of time series have proven to be effective for time series classification, with many recent approaches including SAX-VSM, BOSS, WEASEL, and MrSEQL. The key idea is to transform numerical time series to symbolic representations in the time or frequency domain, i.e., sequences of symbols, and then extract features from these sequences. While achieving high accuracy, existing symbolic classifiers are computationally expensive. In this paper we present MrSQM, a new time series classifier which uses multiple symbolic representations and efficient sequence mining, to extract important time series features. We study four feature selection approaches on symbolic sequences, ranging from fully supervised, to unsupervised and hybrids. We propose a new approach for optimal supervised symbolic feature selection in all-subsequence space, by adapting a Chi-squared bound developed for discriminative pattern mining, to time series. Our extensive experiments on 112 datasets of the UEA/UCR benchmark demonstrate that MrSQM can quickly extract useful features and learn accurate classifiers with the classic logistic regression algorithm. Interestingly, we find that a very simple and fast feature selection strategy can be highly effective as compared with more sophisticated and expensive methods. MrSQM advances the state-of-the-art for symbolic time series classifiers and it is an effective method to achieve high accuracy, with fast runtime.

preprint2020arXiv

A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal

Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries and has important applications in story clustering for newsfeeds, presentation of search results, and timeline generation. However, there is a lack of datasets that realistically address such use cases at a scale large enough for training supervised models for this task. This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters. We build this dataset by leveraging the Wikipedia Current Events Portal (WCEP), which provides concise and neutral human-written summaries of news events, with links to external source articles. We also automatically extend these source articles by looking for related articles in the Common Crawl archive. We provide a quantitative analysis of the dataset and empirical results for several state-of-the-art MDS techniques.

preprint2020arXiv

Background Knowledge Injection for Interpretable Sequence Classification

Sequence classification is the supervised learning task of building models that predict class labels of unseen sequences of symbols. Although accuracy is paramount, in certain scenarios interpretability is a must. Unfortunately, such trade-off is often hard to achieve since we lack human-independent interpretability metrics. We introduce a novel sequence learning algorithm, that combines (i) linear classifiers - which are known to strike a good balance between predictive power and interpretability, and (ii) background knowledge embeddings. We extend the classic subsequence feature space with groups of symbols which are generated by background knowledge injected via word or graph embeddings, and use this new feature space to learn a linear classifier. We also present a new measure to evaluate the interpretability of a set of symbolic features based on the symbol embeddings. Experiments on human activity recognition from wearables and amino acid sequence classification show that our classification approach preserves predictive power, while delivering more interpretable models.

preprint2020arXiv

Examining the State-of-the-Art in News Timeline Summarization

Previous work on automatic news timeline summarization (TLS) leaves an unclear picture about how this task can generally be approached and how well it is currently solved. This is mostly due to the focus on individual subtasks, such as date selection and date summarization, and to the previous lack of appropriate evaluation metrics for the full TLS task. In this paper, we compare different TLS strategies using appropriate evaluation frameworks, and propose a simple and effective combination of methods that improves over the state-of-the-art on all tested benchmarks. For a more robust evaluation, we also present a new TLS dataset, which is larger and spans longer time periods than previous datasets.

preprint2020arXiv

Interpretable Time Series Classification using Linear Models and Multi-resolution Multi-domain Symbolic Representations

The time series classification literature has expanded rapidly over the last decade, with many new classification approaches published each year. Prior research has mostly focused on improving the accuracy and efficiency of classifiers, with interpretability being somewhat neglected. This aspect of classifiers has become critical for many application domains and the introduction of the EU GDPR legislation in 2018 is likely to further emphasize the importance of interpretable learning algorithms. Currently, state-of-the-art classification accuracy is achieved with very complex models based on large ensembles (COTE) or deep neural networks (FCN). These approaches are not efficient with regard to either time or space, are difficult to interpret and cannot be applied to variable-length time series, requiring pre-processing of the original series to a set fixed-length. In this paper we propose new time series classification algorithms to address these gaps. Our approach is based on symbolic representations of time series, efficient sequence mining algorithms and linear classification models. Our linear models are as accurate as deep learning models but are more efficient regarding running time and memory, can work with variable-length time series and can be interpreted by highlighting the discriminative symbolic features on the original time series. We show that our multi-resolution multi-domain linear classifier (mtSS-SEQL+LR) achieves a similar accuracy to the state-of-the-art COTE ensemble, and to recent deep learning methods (FCN, ResNet), but uses a fraction of the time and memory required by either COTE or deep models. To further analyse the interpretability of our classifier, we present a case study on a human motion dataset collected by the authors. We release all the results, source code and data to encourage reproducibility.

preprint2015arXiv

A Machine Learning Approach to Predicting the Smoothed Complexity of Sorting Algorithms

Smoothed analysis is a framework for analyzing the complexity of an algorithm, acting as a bridge between average and worst-case behaviour. For example, Quicksort and the Simplex algorithm are widely used in practical applications, despite their heavy worst-case complexity. Smoothed complexity aims to better characterize such algorithms. Existing theoretical bounds for the smoothed complexity of sorting algorithms are still quite weak. Furthermore, empirically computing the smoothed complexity via its original definition is computationally infeasible, even for modest input sizes. In this paper, we focus on accurately predicting the smoothed complexity of sorting algorithms, using machine learning techniques. We propose two regression models that take into account various properties of sorting algorithms and some of the known theoretical results in smoothed analysis to improve prediction quality. We show experimental results for predicting the smoothed complexity of Quicksort, Mergesort, and optimized Bubblesort for large input sizes, therefore filling the gap between known theoretical and empirical results.

preprint2014arXiv

Be In The Know: Connecting News Articles to Relevant Twitter Conversations

In the era of data-driven journalism, data analytics can deliver tools to support journalists in connecting to new and developing news stories, e.g., as echoed in micro-blogs such as Twitter, the new citizen-driven media. In this paper, we propose a framework for tracking and automatically connecting news articles to Twitter conversations as captured by Twitter hashtags. For example, such a system could alert journalists about news that get a lot of Twitter reaction, so that they can investigate those conversations for new developments in the story, promote their article to a set of interested consumers, or discover general sentiment towards the story. Mapping articles to appropriate hashtags is nevertheless very challenging, due to different language styles used in articles versus tweets, the streaming aspect of news and tweets, as well as the user behavior when marking certain tweet-terms as hashtags. As a case-study, we continuously track the RSS feeds of Irish Times news articles and a focused Twitter stream over a two months period, and present a system that assigns hashtags to each article, based on its Twitter echo. We propose a machine learning approach for classifying and ranking article-hashtag pairs. Our empirical study shows that our system delivers high precision for this task.

preprint2010arXiv

Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space

We present a framework for discriminative sequence classification where the learner works directly in the high dimensional predictor space of all subsequences in the training set. This is possible by employing a new coordinate-descent algorithm coupled with bounding the magnitude of the gradient for selecting discriminative subsequences fast. We characterize the loss functions for which our generic learning algorithm can be applied and present concrete implementations for logistic regression (binomial log-likelihood loss) and support vector machines (squared hinge loss). Application of our algorithm to protein remote homology detection and remote fold recognition results in performance comparable to that of state-of-the-art methods (e.g., kernel support vector machines). Unlike state-of-the-art classifiers, the resulting classification models are simply lists of weighted discriminative subsequences and can thus be interpreted and related to the biological problem.

Georgiana Ifrim

What is connected

Connect this record

See the researcher in context

Building this map preview

10 published item(s)

Automated Mobility Context Detection with Inertial Signals

Efficient Unsupervised Sentence Compression by Fine-tuning Transformers with Reinforcement Learning

MrSQM: Fast Time Series Classification with Symbolic Representations

A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal

Background Knowledge Injection for Interpretable Sequence Classification

Examining the State-of-the-Art in News Timeline Summarization

Interpretable Time Series Classification using Linear Models and Multi-resolution Multi-domain Symbolic Representations

A Machine Learning Approach to Predicting the Smoothed Complexity of Sorting Algorithms

Be In The Know: Connecting News Articles to Relevant Twitter Conversations

Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space