Source author record

Torben Bach Pedersen

Torben Bach Pedersen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Databases, Artificial Intelligence, Machine Learning, Social and Information Networks

Catalog footprint

What is connected

8 works

4 topics

4close collaborators

preprint, 2026, arXiv

Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective

This position paper argues that reliable AI requires infrastructure for human validation of implicit knowledge. AI learns from both explicit knowledge (papers, documentation, structured databases) and implicit knowledge (reasoning patterns, debugging processes, intermediate steps). Implicit knowledge remains unexternalized because documentation cost exceeds perceived value -- yet AI learns from it indiscriminately, acquiring both beneficial patterns and harmful biases. Current reliability methods can only verify explicit knowledge against sources, creating a fundamental gap: the most valuable AI capabilities (reasoning, judgment, intuition) are precisely those we cannot verify. We propose Knowledge Objects (KOs) -- structured artifacts that externalize implicit knowledge into forms humans can inspect, verify, and endorse. KOs transform verification economics: what was previously too costly to verify becomes feasible, enabling accumulated human validation to improve reliability over time.

preprint, 2023, arXiv

Mining Seasonal Temporal Patterns in Time Series

Very large time series are increasingly available from an ever wider range of IoT-enabled sensors, and significant insights can be obtained by mining temporal patterns from them. A useful type of pattern found in many real-world applications exhibits periodic occurrences and is thus called a seasonal temporal pattern (STP). Compared to regular patterns, mining seasonal temporal patterns is more challenging since traditional measures such as support and confidence do not capture the seasonality characteristics. Further, the anti-monotonicity property does not hold for STPs, which results in an exponential search space. This paper presents our Frequent Seasonal Temporal Pattern Mining from Time Series (FreqSTPfTS) solution providing: (1) The first solution for seasonal temporal pattern mining (STPM) from time series that can mine STPs at different data granularities. (2) The STPM algorithm that uses efficient data structures and two pruning techniques to reduce the search space and speed up the mining process. (3) An approximate version of STPM that uses mutual information, a measure of data correlation, to prune unpromising time series from the search space. (4) An extensive experimental evaluation showing that STPM outperforms the baseline in runtime and memory consumption, and can scale to big datasets. The approximate STPM is up to an order of magnitude faster and less memory-consuming than the baseline, while maintaining high accuracy.

preprint, 2022, arXiv

A Comparative Study on Unsupervised Anomaly Detection for Time Series: Experiments and Analysis

The continued digitization of societal processes translates into a proliferation of time series data that cover applications such as fraud detection, intrusion detection, and energy management, where anomaly detection is often essential to enable reliability and safety. Many recent studies target anomaly detection for time series data. Indeed, the area of time series anomaly detection is characterized by diverse data, methods, and evaluation strategies, and comparisons in existing studies consider only part of this diversity, which makes it difficult to select the best method for a particular problem setting. To address this shortcoming, we introduce taxonomies for data, methods, and evaluation strategies, provide a comprehensive overview of unsupervised time series anomaly detection using the taxonomies, and systematically evaluate and compare state-of-the-art traditional as well as deep learning techniques. In the empirical study using nine publicly available datasets, we apply the most commonly used performance evaluation metrics to typical methods under a fair implementation standard. Based on the structuring offered by the taxonomies, we report on empirical studies and provide guidelines, in the form of comparative tables, for choosing the methods most suitable for particular application settings. Finally, we propose research directions for this dynamic field.

preprint, 2022, arXiv

A Unified Approach for Multi-Scale Synchronous Correlation Search in Big Time Series -- Full Version

The wide deployment of IoT sensors has enabled the collection of very big time series across different domains, on which advanced analytics can be performed to find unknown relationships, most importantly the correlations between them. However, current approaches for correlation search on time series are limited to a single temporal scale and simple types of relations, and cannot handle noise effectively. This paper presents the integrated SYnchronous COrrelation Search (iSYCOS) framework to find multi-scale correlations in big time series. Specifically, iSYCOS integrates top-down and bottom-up approaches into a single auto-configured framework capable of efficiently extracting complex window-based correlations from big time series using mutual information (MI). Moreover, iSYCOS includes a novel MI-based theory to identify noise in the data, which is used to perform pruning to improve iSYCOS performance. In addition, we design a distributed version of iSYCOS that can scale out in a Spark cluster to handle big time series. Our extensive experimental evaluation on synthetic and real-world datasets shows that iSYCOS can auto-configure on a given dataset to find complex multi-scale correlations. The pruning and optimisations can improve iSYCOS performance by up to an order of magnitude, and the distributed iSYCOS can scale out linearly on a computing cluster.
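For readers unfamiliar with mutual information as a correlation measure, the following is a minimal sketch of a histogram-based MI estimate between two equally long time series windows. It is not the iSYCOS algorithm; the function names, the equal-width binning, and the default of 8 bins are assumptions made only for illustration.

```python
import math
from collections import Counter


def discretize(series, bins=8):
    """Map each value to an equal-width bin index in [0, bins - 1]."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / bins or 1.0  # constant series: fall back to one bin
    return [min(int((v - lo) / width), bins - 1) for v in series]


def mutual_information(x, y, bins=8):
    """Plug-in MI estimate (in nats) between two equally long windows."""
    if len(x) != len(y):
        raise ValueError("windows must have the same length")
    bx, by = discretize(x, bins), discretize(y, bins)
    n = len(bx)
    joint = Counter(zip(bx, by))
    count_x, count_y = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), c in joint.items():
        # p(i,j) * log(p(i,j) / (p(i) * p(j))), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[i] * count_y[j]))
    return mi
```

A window pair whose values move together yields a higher estimate than a pair of unrelated windows, which is the property a window-based correlation search relies on.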

preprint, 2022, arXiv

Finding Representative Sampling Subsets in Sensor Graphs using Time Series Similarities

With the increasing use of IoT-enabled sensors, it is important to have effective methods for querying the sensors. For example, in a dense network of battery-driven temperature sensors, it is often possible to query (sample) just a subset of the sensors at any given time, since the values of the non-sampled sensors can be estimated from the sampled values. If we can divide the set of sensors into disjoint so-called representative sampling subsets that each represent the other sensors sufficiently well, we can alternate the sampling between the sampling subsets and thus increase battery life significantly. In this paper, we formulate the problem of finding representative sampling subsets as a graph problem on a so-called sensor graph with the sensors as nodes. Our proposed solution, SubGraphSample, consists of two phases. In Phase-I, we create edges in the sensor graph based on the similarities between the time series of sensor values, analyzing six different techniques based on proven time series similarity metrics. In Phase-II, we propose two new techniques and extend four existing ones to find the maximal number of representative sampling subsets. Finally, we propose AutoSubGraphSample, which auto-selects the best technique for Phase-I and Phase-II for a given dataset. Our extensive experimental evaluation shows that our approach can yield significant battery life improvements within realistic error bounds.
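The idea of splitting a sensor graph into disjoint subsets that each represent all sensors can be sketched as repeated greedy dominating-set extraction. This is not SubGraphSample or any of the techniques the paper evaluates; the function names and the greedy heuristic are assumptions for illustration. The sketch assumes the similarity edges are already given as an adjacency dictionary, and treats a sensor as "represented" when it is in the subset or adjacent to a member of it.

```python
def greedy_dominating_set(adj, candidates):
    """Pick nodes from `candidates` until every node in `adj` is covered.

    Returns the chosen set, or None if the candidates cannot cover the graph.
    """
    uncovered = set(adj)
    available = set(candidates)
    chosen = set()
    while uncovered:
        # Take the candidate that covers the most still-uncovered nodes.
        best = max(
            sorted(available),
            key=lambda v: len(({v} | adj[v]) & uncovered),
            default=None,
        )
        if best is None or not (({best} | adj[best]) & uncovered):
            return None
        chosen.add(best)
        available.discard(best)
        uncovered -= {best} | adj[best]
    return chosen


def disjoint_sampling_subsets(adj):
    """Greedily extract disjoint subsets that each dominate the whole graph."""
    remaining = set(adj)
    subsets = []
    while True:
        subset = greedy_dominating_set(adj, remaining)
        if not subset:  # None (no cover left) or empty graph
            break
        subsets.append(subset)
        remaining -= subset
    return subsets
```

On a three-sensor path 0-1-2, for example, the sketch returns two subsets, so sampling could alternate between sensor 1 alone and sensors 0 and 2 together. Because the heuristic is greedy, it does not guarantee the maximal number of subsets.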

preprint, 2021, arXiv

The Forgotten Document-Oriented Database Management Systems: An Overview and Benchmark of Native XML DODBMSes in Comparison with JSON DODBMSes

In the current context of Big Data, a multitude of new NoSQL solutions for storing, managing, and extracting information and patterns from semi-structured data have been proposed and implemented. These solutions were developed to relieve the issue of rigid data structures present in relational databases, by introducing semi-structured and flexible schema design. As current data generated by different sources and devices, especially from IoT sensors and actuators, use either XML or JSON format, depending on the application, database technologies that store and query semi-structured data in XML format are needed. Thus, Native XML Databases, which were initially designed to manipulate XML data using standardized querying languages, i.e., XQuery and XPath, were rebranded as NoSQL Document-Oriented Database Systems. Currently, the majority of these solutions have been replaced with the more modern JSON-based Database Management Systems. However, we believe that XML-based solutions can still deliver performance in executing complex queries on heterogeneous collections. Unfortunately, current research lacks a clear comparison of the scalability and performance of database technologies that store and query documents in XML versus the more modern JSON format. Moreover, to the best of our knowledge, there are no Big Data-compliant benchmarks for such database technologies. In this paper, we present a comparison of selected Document-Oriented Database Systems that either use the XML format to encode documents, i.e., BaseX, eXist-db, and Sedna, or the JSON format, i.e., MongoDB, CouchDB, and Couchbase. To underline the performance differences, we also propose a benchmark that uses a heterogeneous complex schema on a large DBLP corpus.

preprint, 2020, arXiv

Multi-Source Spatial Entity Linkage

Besides the traditional cartographic data sources, spatial information can also be derived from location-based sources. However, even though different location-based sources refer to the same physical world, each one has only partial coverage of the spatial entities, describes them with different attributes, and sometimes provides contradicting information. Hence, we introduce the spatial entity linkage problem, which finds which pairs of spatial entities belong to the same physical spatial entity. Our proposed solution (QuadSky) starts with a time-efficient spatial blocking technique (QuadFlex), compares pairwise the spatial entities in the same block, ranks the pairs using Pareto optimality with the SkyRank algorithm, and finally, classifies the pairs with our novel SkyEx-* family of algorithms that yield 0.85 precision and 0.85 recall for a manually labeled dataset of 1,500 pairs and 0.87 precision and 0.6 recall for a semi-manually labeled dataset of 777,452 pairs. Moreover, we provide a theoretical guarantee and formalize the SkyEx-FES algorithm that explores only 27% of the skylines without any loss in F-measure. Furthermore, our fully unsupervised algorithm SkyEx-D approximates the optimal result with an F-measure loss of just 0.01. Finally, QuadSky provides the best trade-off between precision and recall, and the best F-measure compared to the existing baselines and clustering techniques, and approximates the results of supervised learning solutions.
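The abstract reports precision and recall and refers to F-measure without stating its values. As a worked illustration, and assuming the balanced F1 variant (harmonic mean of precision and recall), the two reported precision/recall pairs correspond to roughly 0.85 and 0.71. The helper name `f_measure` and the `beta` parameter are illustrative and not taken from the paper.

```python
def f_measure(precision, recall, beta=1.0):
    """F-beta score; beta=1.0 gives the balanced F1 (an assumption here)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision/recall pairs as reported in the abstract.
manual = f_measure(0.85, 0.85)        # manually labeled, 1,500 pairs
semi_manual = f_measure(0.87, 0.6)    # semi-manually labeled, 777,452 pairs
print(round(manual, 2), round(semi_manual, 2))
```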

preprint, 2020, arXiv

Multidimensional Enrichment of Spatial RDF Data for SOLAP -- Full Version

Large volumes of spatial data and multidimensional data are being published on the Semantic Web, which has led to new opportunities for advanced analysis, such as Spatial Online Analytical Processing (SOLAP). The RDF Data Cube (QB) and QB4OLAP vocabularies have been widely used for annotating and publishing statistical and multidimensional RDF data. Although such statistical data sets might have spatial information, such as coordinates, the lack of spatial semantics and spatial multidimensional concepts in QB4OLAP and QB prevents users from employing SOLAP queries over spatial data using SPARQL. The QB4SOLAP vocabulary, on the other hand, fully supports annotating spatial and multidimensional data on the Semantic Web and enables users to query endpoints with SOLAP operators in SPARQL. To bridge the gap between QB/QB4OLAP and QB4SOLAP, we propose an RDF2SOLAP enrichment model that automatically annotates spatial multidimensional concepts with QB4SOLAP and in doing so enables SOLAP on existing QB and QB4OLAP data on the Semantic Web. Furthermore, we present and evaluate a wide range of enrichment algorithms and apply them on a non-trivial real-world use case involving governmental open data with complex geometry types.
