Source author record

Nikolaj Tatti

Nikolaj Tatti appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Data Structures and Algorithms Machine Learning Databases Artificial Intelligence Human-Computer Interaction Social and Information Networks

Catalog footprint

What is connected

11works

6topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Approximating splits for decision trees quickly in sparse data streams

Decision trees are one of the most popular classifiers in the machine learning literature. While the most common decision tree learning algorithms treat data as a batch, numerous algorithms have been proposed to construct decision trees from a data stream. A standard training strategy involves augmenting the current tree by changing a leaf node into a split. Here we typically maintain counters in each leaf which allow us to determine the optimal split, and whether the split should be done. In this paper we focus on how to speed up the search for the optimal split when dealing with sparse binary features and a binary class. We focus on finding splits that have the approximately optimal information gain or Gini index. In both cases finding the optimal split can be done in $O(d)$ time, where $d$ is the number of features. We propose an algorithm that yields $(1 + α)$ approximation when using conditional entropy in amortized $O(α^{-1}(1 + m\log d) \log \log n)$ time, where $m$ is the number of 1s in a data point, and $n$ is the number of data points. Similarly, for Gini index, we achieve $(1 + α)$ approximation in amortized $O(α^{-1} + m \log d)$ time. Our approach is beneficial for sparse data where $m \ll d$. In our experiments we find almost-optimal splits efficiently, faster than the baseline, overperforming the theoretical approximation guarantees.

preprint2023arXiv

Ranking with submodular functions on the fly

Maximizing submodular functions have been studied extensively for a wide range of subset-selection problems. However, much less attention has been given to the role of submodularity in sequence-selection and ranking problems. A recently-introduced framework, named \emph{maximum submodular ranking} (MSR), tackles a family of ranking problems that arise naturally when resources are shared among multiple demands with different budgets. For example, the MSR framework can be used to rank web pages for multiple user intents. In this paper, we extend the MSR framework in the streaming setting. In particular, we consider two different streaming models and we propose practical approximation algorithms. In the first streaming model, called \emph{function arriving}, we assume that submodular functions (demands) arrive continuously in a stream, while in the second model, called \emph{item arriving}, we assume that items (resources) arrive continuously. Furthermore, we study the MSR problem with additional constraints on the output sequence, such as a matroid constraint that can ensure fair exposure among items from different groups. These extensions significantly broaden the range of problems that can be captured by the MSR framework. On the practical side, we develop several novel applications based on the MSR formulation, and empirically evaluate the performance of the proposed~methods.

preprint2022arXiv

Coresets remembered and items forgotten: submodular maximization with deletions

In recent years we have witnessed an increase on the development of methods for submodular optimization, which have been motivated by the wide applicability of submodular functions in real-world data-science problems. In this paper, we contribute to this line of work by considering the problem of robust submodular maximization against unexpected deletions, which may occur due to privacy issues or user preferences. Specifically, we consider the minimum number of items an algorithm has to remember, in order to achieve a non-trivial approximation guarantee against adversarial deletion of up to $d$ items. We refer to the set of items that an algorithm has to keep before adversarial deletions as a deletion-robust coreset. Our theoretical contributions are two-fold. First, we propose a single-pass streaming algorithm that yields a $(1-2ε)/(4p)$-approximation for maximizing a non-decreasing submodular function under a general p-matroid constraint and requires a coreset of size $k + d/ε$, where $k$ is the maximum size of a feasible solution. To the best of our knowledge, this is the first work to achieve an (asymptotically) optimal coreset, as no constant-factor approximation is possible with a coreset of size sublinear in $d$. Second, we devise an effective offline algorithm that guarantees stronger approximation ratios with a coreset of size $O(d \log(k)/ε)$. We also demonstrate the superior empirical performance of the proposed algorithms in real-life applications.

preprint2022arXiv

Ranking with submodular functions on a budget

Submodular maximization has been the backbone of many important machine-learning problems, and has applications to viral marketing, diversification, sensor placement, and more. However, the study of maximizing submodular functions has mainly been restricted in the context of selecting a set of items. On the other hand, many real-world applications require a solution that is a ranking over a set of items. The problem of ranking in the context of submodular function maximization has been considered before, but to a much lesser extent than item-selection formulations. In this paper, we explore a novel formulation for ranking items with submodular valuations and budget constraints. We refer to this problem as max-submodular ranking (MSR). In more detail, given a set of items and a set of non-decreasing submodular functions, where each function is associated with a budget, we aim to find a ranking of the set of items that maximizes the sum of values achieved by all functions under the budget constraints. For the MSR problem with cardinality- and knapsack-type budget constraints we propose practical algorithms with approximation guarantees. In addition, we perform an empirical evaluation, which demonstrates the superior performance of the proposed algorithms against strong baselines.

preprint2020arXiv

Decomposable Families of Itemsets

The problem of selecting a small, yet high quality subset of patterns from a larger collection of itemsets has recently attracted lot of research. Here we discuss an approach to this problem using the notion of decomposable families of itemsets. Such itemset families define a probabilistic model for the data from which the original collection of itemsets has been derived from. Furthermore, they induce a special tree structure, called a junction tree, familiar from the theory of Markov Random Fields. The method has several advantages. The junction trees provide an intuitive representation of the mining results. From the computational point of view, the model provides leverage for problems that could be intractable using the entire collection of itemsets. We provide an efficient algorithm to build decomposable itemset families, and give an application example with frequency bound querying using the model. Empirical results show that our algorithm yields high quality results.

preprint2020arXiv

Density-friendly Graph Decomposition

Decomposing a graph into a hierarchical structure via $k$-core analysis is a standard operation in any modern graph-mining toolkit. $k$-core decomposition is a simple and efficient method that allows to analyze a graph beyond its mere degree distribution. More specifically, it is used to identify areas in the graph of increasing centrality and connectedness, and it allows to reveal the structural organization of the graph. Despite the fact that $k$-core analysis relies on vertex degrees, $k$-cores do not satisfy a certain, rather natural, density property. Simply put, the most central $k$-core is not necessarily the densest subgraph. This inconsistency between $k$-cores and graph density provides the basis of our study. We start by defining what it means for a subgraph to be locally-dense, and we show that our definition entails a nested chain decomposition of the graph, similar to the one given by $k$-cores, but in this case the components are arranged in order of increasing density. We show that such a locally-dense decomposition for a graph $G=(V,E)$ can be computed in polynomial time. The running time of the exact decomposition algorithm is $O(|V|^2|E|)$ but is significantly faster in practice. In addition, we develop a linear-time algorithm that provides a factor-2 approximation to the optimal locally-dense decomposition. Furthermore, we show that the $k$-core decomposition is also a factor-2 approximation, however, as demonstrated by our experimental evaluation, in practice $k$-cores have different structure than locally-dense subgraphs, and as predicted by the theory, $k$-cores are not always well-aligned with graph density.

preprint2020arXiv

Tell Me Something I Don't Know: Randomization Strategies for Iterative Data Mining

There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure. In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.

preprint2016arXiv

A note on adjusting $R^2$ for using with cross-validation

We show how to adjust the coefficient of determination ($R^2$) when used for measuring predictive accuracy via leave-one-out cross-validation.

preprint2016arXiv

Generating Realistic Synthetic Population Datasets

Modern studies of societal phenomena rely on the availability of large datasets capturing attributes and activities of synthetic, city-level, populations. For instance, in epidemiology, synthetic population datasets are necessary to study disease propagation and intervention measures before implementation. In social science, synthetic population datasets are needed to understand how policy decisions might affect preferences and behaviors of individuals. In public health, synthetic population datasets are necessary to capture diagnostic and procedural characteristics of patient records without violating confidentialities of individuals. To generate such datasets over a large set of categorical variables, we propose the use of the maximum entropy principle to formalize a generative model such that in a statistically well-founded way we can optimally utilize given prior information about the data, and are unbiased otherwise. An efficient inference algorithm is designed to estimate the maximum entropy model, and we demonstrate how our approach is adept at estimating underlying data distributions. We evaluate this approach against both simulated data and on US census datasets, and demonstrate its feasibility using an epidemic simulation application.

preprint2016arXiv

Interactive and Iterative Discovery of Entity Network Subgraphs

Graph mining to extract interesting components has been studied in various guises, e.g., communities, dense subgraphs, cliques. However, most existing works are based on notions of frequency and connectivity and do not capture subjective interestingness from a user's viewpoint. Furthermore, existing approaches to mine graphs are not interactive and cannot incorporate user feedbacks in any natural manner. In this paper, we address these gaps by proposing a graph maximum entropy model to discover surprising connected subgraph patterns from entity graphs. This model is embedded in an interactive visualization framework to enable human-in-the-loop, model-guided data exploration. Using case studies on real datasets, we demonstrate how interactions between users and the maximum entropy model lead to faster and explainable conclusions.

preprint2015arXiv

Interactive Discovery of Coordinated Relationship Chains with Maximum Entropy Models

Modern visual analytic tools promote human-in-the-loop analysis but are limited in their ability to direct the user toward interesting and promising directions of study. This problem is especially acute when the analysis task is exploratory in nature, e.g., the discovery of potentially coordinated relationships in massive text datasets. Such tasks are very common in domains like intelligence analysis and security forensics where the goal is to uncover surprising coalitions bridging multiple types of relations. We introduce new maximum entropy models to discover surprising chains of relationships leveraging count data about entity occurrences in documents. These models are embedded in a visual analytic system called MERCER that treats relationship bundles as first class objects and directs the user toward promising lines of inquiry. We demonstrate how user input can judiciously direct analysis toward valid conclusions whereas a purely algorithmic approach could be led astray. Experimental results on both synthetic and real datasets from the intelligence community are presented.

Nikolaj Tatti

What is connected

Connect this record

See the researcher in context

Building this map preview

11 published item(s)

Approximating splits for decision trees quickly in sparse data streams

Ranking with submodular functions on the fly

Coresets remembered and items forgotten: submodular maximization with deletions

Ranking with submodular functions on a budget

Decomposable Families of Itemsets

Density-friendly Graph Decomposition

Tell Me Something I Don't Know: Randomization Strategies for Iterative Data Mining

A note on adjusting $R^2$ for using with cross-validation

Generating Realistic Synthetic Population Datasets

Interactive and Iterative Discovery of Entity Network Subgraphs

Interactive Discovery of Coordinated Relationship Chains with Maximum Entropy Models