Source author record

Nikos Mamoulis

Nikos Mamoulis appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Databases Distributed, Parallel, and Cluster Computing Machine Learning Cryptography and Security Data Structures and Algorithms Social and Information Networks

Catalog footprint

What is connected

15works

6topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

HINT: A Hierarchical Index for Intervals in Main Memory

Indexing intervals is a fundamental problem, finding a wide range of applications. Recent work on managing large collections of intervals in main memory focused on overlap joins and temporal aggregation problems. In this paper, we propose novel and efficient in-memory indexing techniques for intervals, with a focus on interval range queries, which are a basic component of many search and analysis tasks. First, we propose an optimized version of a single-level (flat) domain-partitioning approach, which may have large space requirements due to excessive replication. Then, we propose a hierarchical partitioning approach, which assigns each interval to at most two partitions per level and has controlled space requirements. Novel elements of our techniques include the division of the intervals at each partition into groups based on whether they begin inside or before the partition boundaries, reducing the information stored at each partition to the absolutely necessary, and the effective handling of data sparsity and skew. Experimental results on real and synthetic interval sets of different characteristics show that our approaches are typically one order of magnitude faster than the state-of-the-art.

preprint2022arXiv

Three-dimensional Geospatial Interlinking with JedAI-spatial

Geospatial data constitutes a considerable part of (Semantic) Web data, but so far, its sources are inadequately interlinked in the Linked Open Data cloud. Geospatial Interlinking aims to cover this gap by associating geometries with topological relations like those of the Dimensionally Extended 9-Intersection Model. Due to its quadratic time complexity, various algorithms aim to carry out Geospatial Interlinking efficiently. We present JedAI-spatial, a novel, open-source system that organizes these algorithms according to three dimensions: (i) Space Tiling, which determines the approach that reduces the search space, (ii) Budget-awareness, which distinguishes interlinking algorithms into batch and progressive ones, and (iii) Execution mode, which discerns between serial algorithms, running on a single CPU-core, and parallel ones, running on top of Apache Spark. We analytically describe JedAI-spatial's architecture and capabilities and perform thorough experiments to provide interesting insights about the relative performance of its algorithms.

preprint2021arXiv

A Two-level Spatial In-Memory Index

Very large volumes of spatial data increasingly become available and demand effective management. While there has been decades of research on spatial data management, few works consider the current state of commodity hardware, having relatively large memory and the ability of parallel multi-core processing. In this work, we re-consider the design of spatial indexing under this new reality. Specifically, we propose a main-memory indexing approach for objects with spatial extent, which is based on a classic regular space partitioning into disjoint tiles. The novelty of our index is that the contents of each tile are further partitioned into four classes. This second-level partitioning not only reduces the number of comparisons required to compute the results, but also avoids the generation and elimination of duplicate results, which is an inherent problem of spatial indexes based on disjoint space partitioning. The spatial partitions defined by our indexing scheme are totally independent, facilitating effortless parallel evaluation, as no synchronization or communication between the partitions is necessary. We show how our index can be used to efficiently process spatial range queries and drastically reduce the cost of the refinement step of the queries. In addition, we study the efficient processing of numerous range queries in batch and in parallel. Extensive experiments on real datasets confirm the efficiency of our approaches.

preprint2021arXiv

Leveraging Meta-path Contexts for Classification in Heterogeneous Information Networks

A heterogeneous information network (HIN) has as vertices objects of different types and as edges the relations between objects, which are also of various types. We study the problem of classifying objects in HINs. Most existing methods perform poorly when given scarce labeled objects as training sets, and methods that improve classification accuracy under such scenarios are often computationally expensive. To address these problems, we propose ConCH, a graph neural network model. ConCH formulates the classification problem as a multi-task learning problem that combines semi-supervised learning with self-supervised learning to learn from both labeled and unlabeled data. ConCH employs meta-paths, which are sequences of object types that capture semantic relationships between objects. ConCH co-derives object embeddings and context embeddings via graph convolution. It also uses the attention mechanism to fuse such embeddings. We conduct extensive experiments to evaluate the performance of ConCH against other 15 classification methods. Our results show that ConCH is an effective and efficient method for HIN classification.

preprint2020arXiv

Discovering Closed and Maximal Embedded Patterns from Large Tree Data

We address the problem of summarizing embedded tree patterns extracted from large data trees. We do so by defining and mining closed and maximal embedded unordered tree patterns from a single large data tree. We design an embedded frequent pattern mining algorithm extended with a local closedness checking technique. This algorithm is called {\em closedEmbTM-prune} as it eagerly eliminates non-closed patterns. To mitigate the generation of intermediate patterns, we devise pattern search space pruning rules to proactively detect and prune branches in the pattern search space which do not correspond to closed patterns. The pruning rules are accommodated into the extended embedded pattern miner to produce a new algorithm, called {\em closedEmbTM-prune}, for mining all the closed and maximal embedded frequent patterns from large data trees. Our extensive experiments on synthetic and real large-tree datasets demonstrate that, on dense datasets, {\em closedEmbTM-prune} not only generates a complete closed and maximal pattern set which is substantially smaller than that generated by the embedded pattern miner, but also runs much faster with negligible overhead on pattern pruning.

preprint2020arXiv

Fast and Secure Distributed Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) has been successfully applied in several data mining tasks. Recently, there is an increasing interest in the acceleration of NMF, due to its high cost on large matrices. On the other hand, the privacy issue of NMF over federated data is worthy of attention, since NMF is prevalently applied in image and text analysis which may involve leveraging privacy data (e.g, medical image and record) across several parties (e.g., hospitals). In this paper, we study the acceleration and security problems of distributed NMF. Firstly, we propose a distributed sketched alternating nonnegative least squares (DSANLS) framework for NMF, which utilizes a matrix sketching technique to reduce the size of nonnegative least squares subproblems with a convergence guarantee. For the second problem, we show that DSANLS with modification can be adapted to the security setting, but only for one or limited iterations. Consequently, we propose four efficient distributed NMF methods in both synchronous and asynchronous settings with a security guarantee. We conduct extensive experiments on several real datasets to show the superiority of our proposed methods. The implementation of our methods is available at https://github.com/qianyuqiu79/DSANLS.

preprint2020arXiv

Flow Computation in Temporal Interaction Networks

Temporal interaction networks capture the history of activities between entities along a timeline. At each interaction, some quantity of data (money, information, kbytes, etc.) flows from one vertex of the network to another. Flow-based analysis can reveal important information. For instance, financial intelligent units (FIUs) are interested in finding subgraphs in transactions networks with significant flow of money transfers. In this paper, we introduce the flow computation problem in an interaction network or a subgraph thereof. We propose and study two models of flow computation, one based on a greedy flow transfer assumption and one that finds the maximum possible flow. We show that the greedy flow computation problem can be easily solved by a single scan of the interactions in time order. For the harder maximum flow problem, we propose graph precomputation and simplification approaches that can greatly reduce its complexity in practice. As an application of flow computation, we formulate and solve the problem of flow pattern search, where, given a graph pattern, the objective is to find its instances and their flows in a large interaction network. We evaluate our algorithms using real datasets. The results show that the techniques proposed in this paper can greatly reduce the cost of flow computation and pattern enumeration.

preprint2019arXiv

Parallel In-Memory Evaluation of Spatial Joins

The spatial join is a popular operation in spatial database systems and its evaluation is a well-studied problem. As main memories become bigger and faster and commodity hardware supports parallel processing, there is a need to revamp classic join algorithms which have been designed for I/O-bound processing. In view of this, we study the in-memory and parallel evaluation of spatial joins, by re-designing a classic partitioning-based algorithm to consider alternative approaches for space partitioning. Our study shows that, compared to a straightforward implementation of the algorithm, our tuning can improve performance significantly. We also show how to select appropriate partitioning parameters based on data statistics, in order to tune the algorithm for the given join inputs. Our parallel implementation scales gracefully with the number of threads reducing the cost of the join to at most one second even for join inputs with tens of millions of rectangles.

preprint2016arXiv

Set Containment Join Revisited

Given two collections of set objects $R$ and $S$, the $R \bowtie_{\subseteq} S$ set containment join returns all object pairs $(r, s) \in R \times S$ such that $r \subseteq s$. Besides being a basic operator in all modern data management systems with a wide range of applications, the join can be used to evaluate complex SQL queries based on relational division and as a module of data mining algorithms. The state-of-the-art algorithm for set containment joins (PRETTI) builds an inverted index on the right-hand collection $S$ and a prefix tree on the left-hand collection $R$ that groups set objects with common prefixes and thus, avoids redundant processing. In this paper, we present a framework which improves PRETTI in two directions. First, we limit the prefix tree construction by proposing an adaptive methodology based on a cost model; this way, we can greatly reduce the space and time cost of the join. Second, we partition the objects of each collection based on their first contained item, assuming that the set objects are internally sorted. We show that we can process the partitions and evaluate the join while building the prefix tree and the inverted index progressively. This allows us to significantly reduce not only the join cost, but also the maximum memory requirements during the join. An experimental evaluation using both real and synthetic datasets shows that our framework outperforms PRETTI by a wide margin.

preprint2015arXiv

Adaptive Partitioning for Very Large RDF Data

Distributed RDF systems partition data across multiple computer nodes (workers). Some systems perform cheap hash partitioning, which may result in expensive query evaluation, while others apply heuristics aiming at minimizing inter-node communication during query evaluation. This requires an expensive data preprocessing phase, leading to high startup costs for very large RDF knowledge bases. Apriori knowledge of the query workload has also been used to create partitions, which however are static and do not adapt to workload changes; hence, inter-node communication cannot be consistently avoided for queries that are not favored by the initial data partitioning. In this paper, we propose AdHash, a distributed RDF system, which addresses the shortcomings of previous work. First, AdHash applies lightweight partitioning on the initial data, that distributes triples by hashing on their subjects; this renders its startup overhead low. At the same time, the locality-aware query optimizer of AdHash takes full advantage of the partitioning to (i)support the fully parallel processing of join patterns on subjects and (ii) minimize data communication for general queries by applying hash distribution of intermediate results instead of broadcasting, wherever possible. Second, AdHash monitors the data access patterns and dynamically redistributes and replicates the instances of the most frequent ones among workers. As a result, the communication cost for future queries is drastically reduced or even eliminated. To control replication, AdHash implements an eviction policy for the redistributed patterns. Our experiments with synthetic and real data verify that AdHash (i) starts faster than all existing systems, (ii) processes thousands of queries before other systems become online, and (iii) gracefully adapts to the query load, being able to evaluate queries on billion-scale RDF data in sub-seconds.

preprint2014arXiv

Probabilistic Nearest Neighbor Queries on Uncertain Moving Object Trajectories

Nearest neighbor (NN) queries in trajectory databases have received significant attention in the past, due to their application in spatio-temporal data analysis. Recent work has considered the realistic case where the trajectories are uncertain; however, only simple uncertainty models have been proposed, which do not allow for accurate probabilistic search. In this paper, we fill this gap by addressing probabilistic nearest neighbor queries in databases with uncertain trajectories modeled by stochastic processes, specifically the Markov chain model. We study three nearest neighbor query semantics that take as input a query state or trajectory $q$ and a time interval. For some queries, we show that no polynomial time solution can be found. For problems that can be solved in PTIME, we present exact query evaluation algorithms, while for the general case, we propose a sophisticated sampling approach, which uses Bayesian inference to guarantee that sampled trajectories conform to the observation data stored in the database. This sampling approach can be used in Monte-Carlo based approximation solutions. We include an extensive experimental study to support our theoretical results.

preprint2012arXiv

Privacy Preservation by Disassociation

In this work, we focus on protection against identity disclosure in the publication of sparse multidimensional data. Existing multidimensional anonymization techniquesa) protect the privacy of users either by altering the set of quasi-identifiers of the original data (e.g., by generalization or suppression) or by adding noise (e.g., using differential privacy) and/or (b) assume a clear distinction between sensitive and non-sensitive information and sever the possible linkage. In many real world applications the above techniques are not applicable. For instance, consider web search query logs. Suppressing or generalizing anonymization methods would remove the most valuable information in the dataset: the original query terms. Additionally, web search query logs contain millions of query terms which cannot be categorized as sensitive or non-sensitive since a term may be sensitive for a user and non-sensitive for another. Motivated by this observation, we propose an anonymization technique termed disassociation that preserves the original terms but hides the fact that two or more different terms appear in the same record. We protect the users' privacy by disassociating record terms that participate in identifying combinations. This way the adversary cannot associate with high probability a record with a rare combination of terms. To the best of our knowledge, our proposal is the first to employ such a technique to provide protection against identity disclosure. We propose an anonymization algorithm based on our approach and evaluate its performance on real and synthetic datasets, comparing it against other state-of-the-art methods based on generalization and differential privacy.

preprint2011arXiv

A Novel Probabilistic Pruning Approach to Speed Up Similarity Queries in Uncertain Databases

In this paper, we propose a novel, effective and efficient probabilistic pruning criterion for probabilistic similarity queries on uncertain data. Our approach supports a general uncertainty model using continuous probabilistic density functions to describe the (possibly correlated) uncertain attributes of objects. In a nutshell, the problem to be solved is to compute the PDF of the random variable denoted by the probabilistic domination count: Given an uncertain database object B, an uncertain reference object R and a set D of uncertain database objects in a multi-dimensional space, the probabilistic domination count denotes the number of uncertain objects in D that are closer to R than B. This domination count can be used to answer a wide range of probabilistic similarity queries. Specifically, we propose a novel geometric pruning filter and introduce an iterative filter-refinement strategy for conservatively and progressively estimating the probabilistic domination count in an efficient way while keeping correctness according to the possible world semantics. In an experimental evaluation, we show that our proposed technique allows to acquire tight probability bounds for the probabilistic domination count quickly, even for large uncertain databases.

preprint2011arXiv

Inverse Queries For Multidimensional Spaces

Traditional spatial queries return, for a given query object $q$, all database objects that satisfy a given predicate, such as epsilon range and $k$-nearest neighbors. This paper defines and studies {\em inverse} spatial queries, which, given a subset of database objects $Q$ and a query predicate, return all objects which, if used as query objects with the predicate, contain $Q$ in their result. We first show a straightforward solution for answering inverse spatial queries for any query predicate. Then, we propose a filter-and-refinement framework that can be used to improve efficiency. We show how to apply this framework on a variety of inverse queries, using appropriate space pruning strategies. In particular, we propose solutions for inverse epsilon range queries, inverse $k$-nearest neighbor queries, and inverse skyline queries. Our experiments show that our framework is significantly more efficient than naive approaches.

preprint2011arXiv

Size-l Object Summaries for Relational Keyword Search

A previously proposed keyword search paradigm produces, as a query result, a ranked list of Object Summaries (OSs). An OS is a tree structure of related tuples that summarizes all data held in a relational database about a particular Data Subject (DS). However, some of these OSs are very large in size and therefore unfriendly to users that initially prefer synoptic information before proceeding to more comprehensive information about a particular DS. In this paper, we investigate the effective and efficient retrieval of concise and informative OSs. We argue that a good size-l OS should be a stand-alone and meaningful synopsis of the most important information about the particular DS. More precisely, we define a size-l OS as a partial OS composed of l important tuples. We propose three algorithms for the efficient generation of size-l OSs (in addition to the optimal approach which requires exponential time). Experimental evaluation on DBLP and TPC-H databases verifies the effectiveness and efficiency of our approach.

Nikos Mamoulis

What is connected

Connect this record

See the researcher in context

Building this map preview

15 published item(s)

HINT: A Hierarchical Index for Intervals in Main Memory

Three-dimensional Geospatial Interlinking with JedAI-spatial

A Two-level Spatial In-Memory Index

Leveraging Meta-path Contexts for Classification in Heterogeneous Information Networks

Discovering Closed and Maximal Embedded Patterns from Large Tree Data

Fast and Secure Distributed Nonnegative Matrix Factorization

Flow Computation in Temporal Interaction Networks

Parallel In-Memory Evaluation of Spatial Joins

Set Containment Join Revisited

Adaptive Partitioning for Very Large RDF Data

Probabilistic Nearest Neighbor Queries on Uncertain Moving Object Trajectories

Privacy Preservation by Disassociation

A Novel Probabilistic Pruning Approach to Speed Up Similarity Queries in Uncertain Databases

Inverse Queries For Multidimensional Spaces

Size-l Object Summaries for Relational Keyword Search