Source author record

Kirill Smirnov

Kirill Smirnov appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Databases Artificial Intelligence Data Structures and Algorithms Machine Learning Performance

Catalog footprint

What is connected

3works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2023arXiv

Desbordante: from benchmarking suite to high-performance science-intensive data profiler (preprint)

Pioneering data profiling systems such as Metanome and OpenClean brought public attention to science-intensive data profiling. This type of profiling aims to extract complex patterns (primitives) such as functional dependencies, data constraints, association rules, and others. However, these tools are research prototypes rather than production-ready systems. The following work presents Desbordante - a high-performance science-intensive data profiler with open source code. Unlike similar systems, it is built with emphasis on industrial application in a multi-user environment. It is efficient, resilient to crashes, and scalable. Its efficiency is ensured by implementing discovery algorithms in C++, resilience is achieved by extensive use of containerization, and scalability is based on replication of containers. Desbordante aims to open industrial-grade primitive discovery to a broader public, focusing on domain experts who are not IT professionals. Aside from the discovery of various primitives, Desbordante offers primitive validation, which not only reports whether a given instance of primitive holds or not, but also points out what prevents it from holding via the use of special screens. Next, Desbordante supports pipelines - ready-to-use functionality implemented using the discovered primitives, for example, typo detection. We provide built-in pipelines, and the users can construct their own via provided Python bindings. Unlike other profilers, Desbordante works not only with tabular data, but with graph and transactional data as well. In this paper, we present Desbordante, the vision behind it and its use-cases. To provide a more in-depth perspective, we discuss its current state, architecture, and design decisions it is built on. Additionally, we outline our future plans.

preprint2022arXiv

Implementing the Comparison-Based External Sort

In the age of big data, sorting is an indispensable operation for DBMSes and similar systems. Having data sorted can help produce query plans with significantly lower run times. It also can provide other benefits like having non-blocking operators which will produce data steadily (without bursts), or operators with reduced memory footprint. Sorting may be required on any step of query processing, i.e., be it source data or intermediate results. At the same time, the data to be sorted may not fit into main memory. In this case, an external sort operator, which writes intermediate results to disk, should be used. In this paper we consider an external sort operator of the comparison-based sort type. We discuss its implementation and describe related design decisions. Our aim is to study the impact on performance of a data structure used on the merge step. For this, we have experimentally evaluated three data structures implemented inside a DBMS. Results have shown that it is worthwhile to make an effort to implement an efficient data structure for run merging, even on modern commodity computers which are usually disk-bound. Moreover, we demonstrated that using a loser tree is a more efficient approach than both the naive approach and the heap-based one.

preprint2020arXiv

Extending Databases to Support Data Manipulation with Functional Dependencies: a Vision Paper

In the current paper, we propose to fuse together stored data (tables) and their functional dependencies (FDs) inside a DBMS. We aim to make FDs first-class citizens: objects which can be queried and used to query data. Our idea is to allow analysts to explore both data and functional dependencies using the database interface. For example, an analyst may be interested in such tasks as: "find all rows which prevent a given functional dependency from holding", "for a given table, find all functional dependencies that involve a given attribute", "project all attributes that functionally determine a specified attribute". For this purpose, we propose: (1) an SQL-based query language for querying a collection of functional dependencies (2) an extension of the SQL SELECT clause for supporting FD-based predicates, including approximate ones (3) a special data structure intended for containing mined FDs and acting as a mediator between user queries and underlying data. We describe the proposed extensions, demonstrate their use-cases, and finally, discuss implementation details and their impact on query processing.