Source author record

Lukasz Golab

Lukasz Golab appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Databases; Computation and Language; Distributed, Parallel, and Cluster Computing; Machine Learning; Social and Information Networks; Artificial Intelligence

Catalog footprint

What is connected

14 works

6 topics

4 close collaborators

preprint, 2026, arXiv

Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

Recursive retraining of generative models poses a critical representation challenge: when synthetic outputs are curated based on a fixed reward signal, the model tends to collapse onto a narrow set of outputs that over-optimize that objective. Prior work suggests that such collapse is unavoidable without adding real data into the mix. We revisit this conclusion from an alignment perspective and show that collapse can be mitigated through curation based on multiple reward functions. We formalize the dynamics of recursive training under heterogeneous preferences and prove that, under certain conditions, the model converges to a stable distribution that allocates probability mass across competing high-reward regions. The limiting distribution preserves diversity and provably satisfies a weighted Nash bargaining solution, offering a formal interpretation of value aggregation in synthetic retraining loops.
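For readers unfamiliar with the term, the textbook form of a weighted Nash bargaining solution is written out below. The symbols are assumptions introduced here for illustration and are not taken from the paper: $\mathcal{F}$ is a feasible set of outcomes, $u_i$ is the utility (here, reward) of party $i$, $d_i$ is its disagreement value, and $w_i \ge 0$ is its bargaining weight. The paper's exact formulation may differ.

```latex
\[
x^{\star} \in \arg\max_{x \in \mathcal{F},\; u_i(x) > d_i}
  \prod_{i=1}^{n} \bigl(u_i(x) - d_i\bigr)^{w_i}
\quad\Longleftrightarrow\quad
x^{\star} \in \arg\max_{x \in \mathcal{F},\; u_i(x) > d_i}
  \sum_{i=1}^{n} w_i \log\bigl(u_i(x) - d_i\bigr)
\]
```

The logarithmic form shows why such a solution tends to spread mass across parties: any $u_i(x)$ approaching $d_i$ drives the objective toward $-\infty$, so no party with positive weight is ignored.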

preprint, 2026, arXiv

RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems

This paper demonstrates RUBEN, an interactive tool for discovering minimal rules to explain the outputs of retrieval-augmented large language models (LLMs) in data-driven applications. We leverage novel pruning strategies to efficiently identify a minimal set of rules that subsume all others. We further demonstrate novel applications of these rules for LLM safety, specifically to test the resiliency of safety training and effectiveness of adversarial prompt injections.

preprint, 2023, arXiv

Predicting Hateful Discussions on Reddit using Graph Transformer Networks and Communal Context

We propose a system to predict harmful discussions on social media platforms. Our solution uses contextual deep language models and proposes the novel idea of integrating state-of-the-art Graph Transformer Networks to analyze all conversations that follow an initial post. This framework also supports adapting to future comments as the conversation unfolds. In addition, we study whether a community-specific analysis of hate speech leads to more effective detection of hateful discussions. We evaluate our approach on 333,487 Reddit discussions from various communities. We find that community-specific modeling improves performance two-fold and that models which capture wider-discussion context improve accuracy by 28% (35% for the most hateful content) compared to limited context models.

preprint, 2022, arXiv

Discovery and Contextual Data Cleaning with Ontology Functional Dependencies

Functional Dependencies (FDs) define attribute relationships based on syntactic equality, and, when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We explore dependency-based data cleaning with Ontology Functional Dependencies (OFDs), which express semantic attribute relationships such as synonyms and is-a hierarchies defined by an ontology. We study the theoretical foundations for OFDs, including sound and complete axioms and a linear-time inference procedure. We then propose an algorithm for discovering OFDs (exact ones and ones that hold with some exceptions) from data that uses the axioms to prune the search space. Towards enabling OFDs as data quality rules in practice, we study the problem of finding minimal repairs to a relation and ontology with respect to a set of OFDs. We demonstrate the effectiveness of our techniques on real datasets, and show that OFDs can significantly reduce the number of false positive errors in data cleaning techniques that rely on traditional FDs.
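The false-positive problem described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the `SYNONYMS` table, the `fd_violations` helper and the sample rows are all hypothetical, and the sketch simplifies an ontology to a flat map from each value to one canonical concept.

```python
from collections import defaultdict

# Hypothetical toy ontology: each surface value maps to one canonical concept.
SYNONYMS = {
    "USA": "United States",
    "America": "United States",
    "United States": "United States",
    "Canada": "Canada",
}

def fd_violations(rows, lhs, rhs, canon=lambda v: v):
    """Return the LHS values whose rows disagree on RHS after applying canon."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(canon(row[rhs]))
    return sorted(key for key, values in groups.items() if len(values) > 1)

rows = [
    {"country_code": "US", "country": "USA"},
    {"country_code": "US", "country": "United States"},
    {"country_code": "CA", "country": "Canada"},
    {"country_code": "CA", "country": "Mexico"},
]

# Syntactic check: "USA" vs "United States" is flagged as an error.
syntactic = fd_violations(rows, "country_code", "country")
# Synonym-aware check: only the genuine conflict for "CA" remains.
semantic = fd_violations(
    rows, "country_code", "country", canon=lambda v: SYNONYMS.get(v, v)
)
print(syntactic, semantic)
```

Under the plain FD check both country codes are reported, while the synonym-aware variant reports only the pair that disagrees in meaning.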

preprint, 2022, arXiv

Real-Time LSM-Trees for HTAP Workloads

Real-time analytics systems employ hybrid data layouts in which data are stored in different formats throughout their lifecycle. Recent data are stored in a row-oriented format to serve OLTP workloads and support high insert rates, while older data are transformed to a column-oriented format for OLAP access patterns. We observe that a Log-Structured Merge (LSM) Tree is a natural fit for a lifecycle-aware storage engine due to its high write throughput and level-oriented structure, in which records propagate from one level to the next over time. To build a lifecycle-aware storage engine using an LSM-Tree, we make a crucial modification to allow different data layouts in different levels, ranging from purely row-oriented to purely column-oriented, leading to a Real-Time LSM-Tree. We give a cost model and an algorithm to design a Real-Time LSM-Tree that is suitable for a given workload, followed by an experimental evaluation of LASER - a prototype implementation of our idea built on top of the RocksDB key-value store.
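The idea of giving different levels different layouts can be pictured with a toy two-level store. This is only a sketch under strong assumptions and is unrelated to the LASER code: `ToyLifecycleStore` is a hypothetical class, all rows are assumed to share the same attributes, and sorted runs, keys, merging and persistence are omitted entirely.

```python
class ToyLifecycleStore:
    """Two-level toy: level 0 holds recent rows, level 1 holds older data by column."""

    def __init__(self, l0_capacity=2):
        self.l0_capacity = l0_capacity
        self.level0 = []  # row-oriented: list of dicts, cheap to append to
        self.level1 = {}  # column-oriented: attribute -> list of values

    def insert(self, row):
        self.level0.append(row)
        if len(self.level0) > self.l0_capacity:
            self.compact()

    def compact(self):
        # Move every level-0 row into the columnar level, oldest first.
        for row in self.level0:
            for attr, value in row.items():
                self.level1.setdefault(attr, []).append(value)
        self.level0 = []

    def column_sum(self, attr):
        # Analytical read: scan one column, then the few recent rows.
        older = sum(self.level1.get(attr, []))
        recent = sum(row[attr] for row in self.level0)
        return older + recent

store = ToyLifecycleStore(l0_capacity=2)
for i, amount in enumerate([10, 20, 30, 40]):
    store.insert({"id": i, "amount": amount})
total = store.column_sum("amount")
print(total)
```

Inserts touch only the row-oriented level, while an aggregate over one attribute reads a single column of the older data plus a short list of recent rows.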

preprint, 2021, arXiv

Efficient Discovery of Approximate Order Dependencies

Order dependencies (ODs) capture relationships between ordered domains of attributes. Approximate ODs (AODs) capture such relationships even when there exist exceptions in the data. During automated discovery of ODs, validation is the process of verifying whether an OD holds. We present an algorithm for validating approximate ODs with significantly improved runtime performance over existing methods for AODs, and prove that it is correct and has optimal runtime. By replacing the validation step in a leading algorithm for approximate OD discovery with ours, we achieve orders-of-magnitude improvements in performance.

preprint, 2020, arXiv

Consentio: Managing Consent to Data Access using Permissioned Blockchains

The increasing amount of personal data is raising serious issues in the context of privacy, security, and data ownership. Entities whose data are being collected can benefit from mechanisms to manage the parties that can access their data and to audit who has accessed their data. Consent management systems address these issues. We present Consentio, a scalable consent management system based on the Hyperledger Fabric permissioned blockchain. The data management challenge we address is to ensure high throughput and low latency of endorsing data access requests and granting or revoking consent. Experimental results show that our system can handle as many as 6,000 access requests per second, allowing it to scale to very large deployments.

preprint, 2020, arXiv

Iterative Edit-Based Unsupervised Sentence Simplification

We present a novel iterative, edit-based approach to unsupervised sentence simplification. Our model is guided by a scoring function involving fluency, simplicity, and meaning preservation. Then, we iteratively perform word and phrase-level edits on the complex sentence. Compared with previous approaches, our model does not require a parallel training set, but is more controllable and interpretable. Experiments on Newsela and WikiLarge datasets show that our approach is nearly as effective as state-of-the-art supervised approaches.

preprint, 2020, arXiv

XOX Fabric: A hybrid approach to blockchain transaction execution

Performance and scalability are major concerns for blockchains: permissionless systems are typically limited by slow proof of X consensus algorithms and sequential post-order transaction execution on every node of the network. By introducing a small amount of trust in their participants, permissioned blockchain systems such as Hyperledger Fabric can benefit from more efficient consensus algorithms and make use of parallel pre-order execution on a subset of network nodes. Fabric, in particular, has been shown to handle tens of thousands of transactions per second. However, this performance is only achievable for contention-free transaction workloads. If many transactions compete for a small set of hot keys in the world state, the effective throughput drops drastically. We therefore propose XOX: a novel two-pronged transaction execution approach that both minimizes invalid transactions in the Fabric blockchain and maximizes concurrent execution. Our approach additionally prevents unintentional denial of service attacks by clients re-submitting conflicting transactions. Even under fully contentious workloads, XOX can handle more than 3000 transactions per second, all of which would be discarded by regular Fabric.

preprint, 2016, arXiv

Authority-based Team Discovery in Social Networks

Given a social network of experts, we address the problem of discovering a team of experts that collectively holds a set of skills required to complete a given project. Most prior work ranks possible solutions by communication cost, represented by edge weights in the expert network. Our contribution is to take experts' authority into account, represented by node weights. We formulate several problems that combine communication cost and authority, we prove that they are NP-hard, and we propose and experimentally evaluate greedy algorithms to solve them.
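To give a feel for what a greedy heuristic in this setting might look like, here is a generic set-cover style sketch. It is not one of the paper's algorithms: `greedy_team` and the expert data are hypothetical, and the sketch ignores communication cost entirely, using authority only to break ties between experts who cover equally many missing skills.

```python
def greedy_team(experts, required):
    """Pick experts until all required skills are covered; None if impossible."""
    uncovered, team = set(required), []
    while uncovered:
        # Prefer more newly covered skills, then higher authority.
        name = max(
            experts,
            key=lambda n: (len(experts[n]["skills"] & uncovered),
                           experts[n]["authority"]),
        )
        gained = experts[name]["skills"] & uncovered
        if not gained:
            return None  # no remaining expert holds any missing skill
        team.append(name)
        uncovered -= gained
    return team

experts = {
    "a": {"skills": {"db", "ml"}, "authority": 5},
    "b": {"skills": {"db", "nlp"}, "authority": 9},
    "c": {"skills": {"ml"}, "authority": 7},
}
team = greedy_team(experts, {"db", "ml", "nlp"})
print(team)
```

A fuller treatment would also score each candidate by the edge weights connecting it to the experts already chosen.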

preprint, 2016, arXiv

Effective and Complete Discovery of Order Dependencies via Set-based Axiomatization

Integrity constraints (ICs) provide a valuable tool for expressing and enforcing application semantics. However, formulating constraints manually requires domain expertise, is prone to human errors, and may be excessively time consuming, especially on large datasets. Hence, proposals for automatic discovery have been made for some classes of ICs, such as functional dependencies (FDs), and recently, order dependencies (ODs). ODs properly subsume FDs, as they can additionally express business rules involving order; e.g., an employee never has a higher salary while paying lower taxes compared with another employee. We address the limitations of prior work on OD discovery, which has factorial complexity in the number of attributes, is incomplete (i.e., it does not discover valid ODs that cannot be inferred from the ones found) and is not concise (i.e., it can result in "redundant" discovery and overly large discovery sets). We improve significantly on complexity, offer completeness, and define a compact canonical form. This is based on a novel polynomial mapping to a canonical form for ODs, and a sound and complete set of axioms (inference rules) for canonical ODs. This allows us to develop an efficient set-containment, lattice-driven OD discovery algorithm that uses the inference rules to prune the search space. Our algorithm has exponential worst-case time complexity in the number of attributes and linear complexity in the number of tuples. We prove that it produces a complete, minimal set of ODs (i.e., minimal with regard to the canonical representation). Finally, using real and synthetic datasets, we experimentally show orders-of-magnitude performance improvements over the current state-of-the-art algorithm and demonstrate the effectiveness of our techniques.
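The salary and tax rule quoted above can be checked directly on a small table. The sketch below is a naive pairwise validation for illustration only, not the discovery algorithm from the paper; the `swaps` helper and the employee rows are hypothetical.

```python
from itertools import combinations

def swaps(rows, a, b):
    """Return index pairs where one row is higher on a but lower on b."""
    found = []
    for (i, s), (j, t) in combinations(enumerate(rows), 2):
        # Opposite signs mean the two attributes order this pair differently.
        if (s[a] - t[a]) * (s[b] - t[b]) < 0:
            found.append((i, j))
    return found

employees = [
    {"name": "ann", "salary": 50000, "tax": 9000},
    {"name": "bob", "salary": 60000, "tax": 12000},
    {"name": "cy", "salary": 70000, "tax": 11000},
]
violations = swaps(employees, "salary", "tax")
print(violations)
```

Here the third employee earns more than the second but pays less tax, so that pair is reported. The pairwise loop is quadratic in the number of tuples and serves only to make the semantics concrete.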

preprint, 2016, arXiv

Effective Keyword Search in Graphs

In a node-labeled graph, keyword search finds subtrees of the graph whose nodes contain all of the query keywords. This provides a way to query graph databases that neither requires mastery of a query language such as SPARQL, nor a deep knowledge of the database schema. Previous work ranks answer trees using combinations of structural and content-based metrics, such as path lengths between keywords or relevance of the labels in the answer tree to the query keywords. We propose two new ways to rank keyword search results over graphs. The first takes node importance into account while the second is a bi-objective optimization of edge weights and node importance. Since both of these problems are NP-hard, we propose greedy algorithms to solve them, and experimentally verify their effectiveness and efficiency on a real dataset.

preprint, 2013, arXiv

Distributed Data Placement via Graph Partitioning

With the widespread use of shared-nothing clusters of servers, there has been a proliferation of distributed object stores that offer high availability, reliability and enhanced performance for MapReduce-style workloads. However, relational workloads cannot always be evaluated efficiently using MapReduce without extensive data migrations, which cause network congestion and reduced query throughput. We study the problem of computing data placement strategies that minimize the data communication costs incurred by typical relational query workloads in a distributed setting. Our main contribution is a reduction of the data placement problem to the well-studied problem of Graph Partitioning, which is NP-hard but for which efficient approximation algorithms exist. The novelty and significance of this result lie in representing the communication cost exactly and using standard graphs instead of hypergraphs, which were used in prior work on data placement that optimized for different objectives (not communication cost). We study several practical extensions of the problem: with load balancing, with replication, with materialized views, and with complex query plans consisting of sequences of intermediate operations that may be computed on different servers. We provide integer linear programs (IPs) that may be used with any IP solver to find an optimal data placement. For the no-replication case, we use publicly available graph partitioning libraries (e.g., METIS) to efficiently compute nearly-optimal solutions. For the versions with replication, we introduce two heuristics that utilize the Graph Partitioning solution of the no-replication case. Using the TPC-DS workload, it may take an IP solver weeks to compute an optimal data placement, whereas our reduction produces nearly-optimal solutions in seconds.
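The link between placement and graph partitioning can be pictured on a tiny example. The sketch below is an assumption-laden toy, not the paper's reduction: vertices stand for hypothetical data objects, each edge weight stands for the data that must be shipped if its two endpoints land on different servers, and `best_bipartition` simply enumerates every balanced two-server split instead of calling a partitioning library.

```python
from itertools import combinations

def best_bipartition(nodes, weights):
    """Exhaustively find the balanced 2-way split with the smallest cut weight."""
    best_cut, best_side = None, None
    half = len(nodes) // 2
    for side in combinations(nodes, half):
        left = set(side)
        # An edge is cut when its endpoints are placed on different servers.
        cut = sum(w for (u, v), w in weights.items()
                  if (u in left) != (v in left))
        if best_cut is None or cut < best_cut:
            best_cut, best_side = cut, left
    return best_cut, best_side

# Hypothetical objects and communication weights, for illustration only.
nodes = ["orders", "lineitem", "customer", "nation"]
weights = {
    ("orders", "lineitem"): 10,
    ("orders", "customer"): 3,
    ("customer", "nation"): 8,
    ("lineitem", "nation"): 1,
}
cut, left = best_bipartition(nodes, weights)
print(cut, sorted(left))
```

Exhaustive search is feasible only for a handful of objects, which is why approximate partitioners matter at realistic scale.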

preprint, 2012, arXiv

On the Relative Trust between Inconsistent Data and Inaccurate Constraints

Functional dependencies (FDs) specify the intended data semantics while violations of FDs indicate deviation from these semantics. In this paper, we study a data cleaning problem in which the FDs may not be completely correct, e.g., due to data evolution or incomplete knowledge of the data semantics. We argue that the notion of relative trust is a crucial aspect of this problem: if the FDs are outdated, we should modify them to fit the data, but if we suspect that there are problems with the data, we should modify the data to fit the FDs. In practice, it is usually unclear how much to trust the data versus the FDs. To address this problem, we propose an algorithm for generating non-redundant solutions (i.e., simultaneous modifications of the data and the FDs) corresponding to various levels of relative trust. This can help users determine the best way to modify their data and/or FDs to achieve consistency.
