Source author record

João Gama

João Gama appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Artificial Intelligence Applications Machine Learning Information Retrieval Computation and Language Distributed, Parallel, and Cluster Computing q-fin.GN q-fin.RM

Catalog footprint

What is connected

9works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

A Benchmark dataset for predictive maintenance

The paper describes the MetroPT data set, an outcome of a eXplainable Predictive Maintenance (XPM) project with an urban metro public transportation service in Porto, Portugal. The data was collected in 2022 that aimed to evaluate machine learning methods for online anomaly detection and failure prediction. By capturing several analogic sensor signals (pressure, temperature, current consumption), digital signals (control signals, discrete signals), and GPS information (latitude, longitude, and speed), we provide a dataset that can be easily used to evaluate online machine learning methods. This dataset contains some interesting characteristics and can be a good benchmark for predictive maintenance models.

preprint2022arXiv

Contextualization for the Organization of Text Documents Streams

There has been a significant effort by the research community to address the problem of providing methods to organize documentation with the help of information Retrieval methods. In this report paper, we present several experiments with some stream analysis methods to explore streams of text documents. We use only dynamic algorithms to explore, analyze, and organize the flux of text documents. This document shows a case study with developed architectures of a Text Document Stream Organization, using incremental algorithms like Incremental TextRank, and IS-TFIDF. Both these algorithms are based on the assumption that the mapping of text documents and their document-term matrix in lower-dimensional evolving networks provides faster processing when compared to batch algorithms. With this architecture, and by using FastText Embedding to retrieve similarity between documents, we compare methods with large text datasets and ground truth evaluation of clustering capacities. The datasets used were Reuters and COVID-19 emotions. The results provide a new view for the contextualization of similarity when approaching flux of documents organization tasks, based on the similarity between documents in the flux, and by using mentioned algorithms.

preprint2022arXiv

Federated Anomaly Detection over Distributed Data Streams

Sharing of telecommunication network data, for example, even at high aggregation levels, is nowadays highly restricted due to privacy legislation and regulations and other important ethical concerns. It leads to scattering data across institutions, regions, and states, inhibiting the usage of AI methods that could otherwise take advantage of data at scale. It creates the need to build a platform to control such data, build models or perform calculations. In this work, we propose an approach to building the bridge among anomaly detection, federated learning, and data streams. The overarching goal of the work is to detect anomalies in a federated environment over distributed data streams. This work complements the state-of-the-art by adapting the data stream algorithms in a federated learning setting for anomaly detection and by delivering a robust framework and demonstrating the practical feasibility in a real-world distributed deployment scenario.

preprint2021arXiv

How can I choose an explainer? An Application-grounded Evaluation of Post-hoc Explanations

There have been several research works proposing new Explainable AI (XAI) methods designed to generate model explanations having specific properties, or desiderata, such as fidelity, robustness, or human-interpretability. However, explanations are seldom evaluated based on their true practical impact on decision-making tasks. Without that assessment, explanations might be chosen that, in fact, hurt the overall performance of the combined system of ML model + end-users. This study aims to bridge this gap by proposing XAI Test, an application-grounded evaluation methodology tailored to isolate the impact of providing the end-user with different levels of information. We conducted an experiment following XAI Test to evaluate three popular post-hoc explanation methods -- LIME, SHAP, and TreeInterpreter -- on a real-world fraud detection task, with real data, a deployed ML model, and fraud analysts. During the experiment, we gradually increased the information provided to the fraud analysts in three stages: Data Only, i.e., just transaction data without access to model score nor explanations, Data + ML Model Score, and Data + ML Model Score + Explanations. Using strong statistical analysis, we show that, in general, these popular explainers have a worse impact than desired. Some of the conclusion highlights include: i) showing Data Only results in the highest decision accuracy and the slowest decision time among all variants tested, ii) all the explainers improve accuracy over the Data + ML Model Score variant but still result in lower accuracy when compared with Data Only; iii) LIME was the least preferred by users, probably due to its substantially lower variability of explanations from case to case.

preprint2015arXiv

Evaluation of recommender systems in streaming environments

Evaluation of recommender systems is typically done with finite datasets. This means that conventional evaluation methodologies are only applicable in offline experiments, where data and models are stationary. However, in real world systems, user feedback is continuously generated, at unpredictable rates. Given this setting, one important issue is how to evaluate algorithms in such a streaming data environment. In this paper we propose a prequential evaluation protocol for recommender systems, suitable for streaming data environments, but also applicable in stationary settings. Using this protocol we are able to monitor the evolution of algorithms' accuracy over time. Furthermore, we are able to perform reliable comparative assessments of algorithms by computing significance tests over a sliding window. We argue that besides being suitable for streaming data, prequential evaluation allows the detection of phenomena that would otherwise remain unnoticed in the evaluation of both offline and online recommender systems.

preprint2014arXiv

A two-stage model for dealing with temporal degradation of credit scoring

This work is attached to the BRICS 2013 competition. We propose a two-stage model for dealing with the temporal degradation of credit scoring models. This methodology produced motivating results in a 1-year horizon. We anticipate that it can be extended to other applications of risk assessment with great success. Future extensions should cover predictions in larger time frames and consider lagged periods. This methodology can be further improved if more information about the economic cycles is integrated in the forecasting of default.

preprint2014arXiv

EigenEvent: An Algorithm for Event Detection from Complex Data Streams in Syndromic Surveillance

Syndromic surveillance systems continuously monitor multiple pre-diagnostic daily streams of indicators from different regions with the aim of early detection of disease outbreaks. The main objective of these systems is to detect outbreaks hours or days before the clinical and laboratory confirmation. The type of data that is being generated via these systems is usually multivariate and seasonal with spatial and temporal dimensions. The algorithm What's Strange About Recent Events (WSARE) is the state-of-the-art method for such problems. It exhaustively searches for contrast sets in the multivariate data and signals an alarm when find statistically significant rules. This bottom-up approach presents a much lower detection delay comparing the existing top-down approaches. However, WSARE is very sensitive to the small-scale changes and subsequently comes with a relatively high rate of false alarms. We propose a new approach called EigenEvent that is neither fully top-down nor bottom-up. In this method, we instead of top-down or bottom-up search, track changes in data correlation structure via eigenspace techniques. This new methodology enables us to detect both overall changes (via eigenvalue) and dimension-level changes (via eigenvectors). Experimental results on hundred sets of benchmark data reveals that EigenEvent presents a better overall performance comparing state-of-the-art, in particular in terms of the false alarm rate.

preprint2014arXiv

Eigenspace Method for Spatiotemporal Hotspot Detection

Hotspot detection aims at identifying subgroups in the observations that are unexpected, with respect to the some baseline information. For instance, in disease surveillance, the purpose is to detect sub-regions in spatiotemporal space, where the count of reported diseases (e.g. Cancer) is higher than expected, with respect to the population. The state-of-the-art method for this kind of problem is the Space-Time Scan Statistics (STScan), which exhaustively search the whole space through a sliding window looking for significant spatiotemporal clusters. STScan makes some restrictive assumptions about the distribution of data, the shape of the hotspots and the quality of data, which can be unrealistic for some nontraditional data sources. A novel methodology called EigenSpot is proposed where instead of an exhaustive search over the space, tracks the changes in a space-time correlation structure. Not only does the new approach presents much more computational efficiency, but also makes no assumption about the data distribution, hotspot shape or the data quality. The principal idea is that with the joint combination of abnormal elements in the principal spatial and the temporal singular vectors, the location of hotspots in the spatiotemporal space can be approximated. A comprehensive experimental evaluation, both on simulated and real data sets reveals the effectiveness of the proposed method.

preprint2014arXiv

Event and Anomaly Detection Using Tucker3 Decomposition

Failure detection in telecommunication networks is a vital task. So far, several supervised and unsupervised solutions have been provided for discovering failures in such networks. Among them unsupervised approaches has attracted more attention since no label data is required. Often, network devices are not able to provide information about the type of failure. In such cases the type of failure is not known in advance and the unsupervised setting is more appropriate for diagnosis. Among unsupervised approaches, Principal Component Analysis (PCA) is a well-known solution which has been widely used in the anomaly detection literature and can be applied to matrix data (e.g. Users-Features). However, one of the important properties of network data is their temporal sequential nature. So considering the interaction of dimensions over a third dimension, such as time, may provide us better insights into the nature of network failures. In this paper we demonstrate the power of three-way analysis to detect events and anomalies in time-evolving network data.

João Gama

What is connected

Connect this record

See the researcher in context

Building this map preview

9 published item(s)

A Benchmark dataset for predictive maintenance

Contextualization for the Organization of Text Documents Streams

Federated Anomaly Detection over Distributed Data Streams

How can I choose an explainer? An Application-grounded Evaluation of Post-hoc Explanations

Evaluation of recommender systems in streaming environments

A two-stage model for dealing with temporal degradation of credit scoring

EigenEvent: An Algorithm for Event Detection from Complex Data Streams in Syndromic Surveillance

Eigenspace Method for Spatiotemporal Hotspot Detection

Event and Anomaly Detection Using Tucker3 Decomposition