Researcher profile

Lukasz Golab

Lukasz Golab contributes to research discovery and scholarly infrastructure.


Trust snapshot

Quick read

Trust 21 (Emerging), Verification L1, Unclaimed author
9 works
0 followers
6 topics
4 close collaborators

Published work

9 published items

Preprint, 2026, arXiv

Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

Recursive retraining of generative models poses a critical representation challenge: when synthetic outputs are curated based on a fixed reward signal, the model tends to collapse onto a narrow set of outputs that over-optimize that objective. Prior work suggests that such collapse is unavoidable without adding real data into the mix. We revisit this conclusion from an alignment perspective and show that collapse can be mitigated through curation based on multiple reward functions. We formalize the dynamics of recursive training under heterogeneous preferences and prove that, under certain conditions, the model converges to a stable distribution that allocates probability mass across competing high-reward regions. The limiting distribution preserves diversity and provably satisfies a weighted Nash bargaining solution, offering a formal interpretation of value aggregation in synthetic retraining loops.

Preprint, 2026, arXiv

RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems

This paper demonstrates RUBEN, an interactive tool for discovering minimal rules to explain the outputs of retrieval-augmented large language models (LLMs) in data-driven applications. We leverage novel pruning strategies to efficiently identify a minimal set of rules that subsume all others. We further demonstrate novel applications of these rules for LLM safety, specifically to test the resiliency of safety training and effectiveness of adversarial prompt injections.

Preprint, 2023, arXiv

Predicting Hateful Discussions on Reddit using Graph Transformer Networks and Communal Context

We propose a system to predict harmful discussions on social media platforms. Our solution uses contextual deep language models and proposes the novel idea of integrating state-of-the-art Graph Transformer Networks to analyze all conversations that follow an initial post. This framework also supports adapting to future comments as the conversation unfolds. In addition, we study whether a community-specific analysis of hate speech leads to more effective detection of hateful discussions. We evaluate our approach on 333,487 Reddit discussions from various communities. We find that community-specific modeling improves performance two-fold and that models which capture wider-discussion context improve accuracy by 28% (35% for the most hateful content) compared to limited context models.

Preprint, 2022, arXiv

Discovery and Contextual Data Cleaning with Ontology Functional Dependencies

Functional Dependencies (FDs) define attribute relationships based on syntactic equality, and, when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We explore dependency-based data cleaning with Ontology Functional Dependencies (OFDs), which express semantic attribute relationships such as synonyms and is-a hierarchies defined by an ontology. We study the theoretical foundations for OFDs, including sound and complete axioms and a linear-time inference procedure. We then propose an algorithm for discovering OFDs (exact ones and ones that hold with some exceptions) from data that uses the axioms to prune the search space. Towards enabling OFDs as data quality rules in practice, we study the problem of finding minimal repairs to a relation and ontology with respect to a set of OFDs. We demonstrate the effectiveness of our techniques on real datasets, and show that OFDs can significantly reduce the number of false positive errors in data cleaning techniques that rely on traditional FDs.
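
To make the contrast between syntactic and semantic equality concrete, here is a minimal sketch of a dependency check. It is not the paper's algorithm. The function `fd_violations`, the `senses` mapping (value to a set of concept identifiers), and the sample rows are all hypothetical. It assumes that a group of right-hand-side values counts as synonymous when the values share at least one concept.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs, senses=None):
    """Return the sorted LHS values whose RHS values disagree.

    senses is a hypothetical ontology lookup: value -> set of concept ids.
    With senses=None the check is a plain FD (syntactic equality).
    """
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(row[rhs])

    violations = []
    for key, values in groups.items():
        if senses is None:
            # Plain FD: every row in the group must carry the same string.
            ok = len(values) == 1
        else:
            # Synonym-style check (assumption): the values must share a concept.
            # A value missing from the ontology only matches itself.
            common = set.intersection(*(senses.get(v, {v}) for v in values))
            ok = bool(common)
        if not ok:
            violations.append(key)
    return sorted(violations)


rows = [
    {"code": "US", "country": "United States"},
    {"code": "US", "country": "USA"},
    {"code": "CA", "country": "Canada"},
    {"code": "CA", "country": "Mexico"},
]
senses = {
    "United States": {"c_us"},
    "USA": {"c_us"},
    "Canada": {"c_ca"},
    "Mexico": {"c_mx"},
}

plain = fd_violations(rows, "code", "country")
semantic = fd_violations(rows, "code", "country", senses)
```

In this toy data the plain check flags both `US` and `CA`, while the ontology-aware check flags only `CA`, because "United States" and "USA" share a concept. This is the kind of false positive the abstract describes.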

Preprint, 2022, arXiv

Real-Time LSM-Trees for HTAP Workloads

Real-time analytics systems employ hybrid data layouts in which data are stored in different formats throughout their lifecycle. Recent data are stored in a row-oriented format to serve OLTP workloads and support high insert rates, while older data are transformed to a column-oriented format for OLAP access patterns. We observe that a Log-Structured Merge (LSM) Tree is a natural fit for a lifecycle-aware storage engine due to its high write throughput and level-oriented structure, in which records propagate from one level to the next over time. To build a lifecycle-aware storage engine using an LSM-Tree, we make a crucial modification to allow different data layouts in different levels, ranging from purely row-oriented to purely column-oriented, leading to a Real-Time LSM-Tree. We give a cost model and an algorithm to design a Real-Time LSM-Tree that is suitable for a given workload, followed by an experimental evaluation of LASER - a prototype implementation of our idea built on top of the RocksDB key-value store.

Preprint, 2021, arXiv

Efficient Discovery of Approximate Order Dependencies

Order dependencies (ODs) capture relationships between ordered domains of attributes. Approximate ODs (AODs) capture such relationships even when there exist exceptions in the data. During automated discovery of ODs, validation is the process of verifying whether an OD holds. We present an algorithm for validating approximate ODs with significantly improved runtime performance over existing methods for AODs, and prove that it is correct and has optimal runtime. By replacing the validation step in a leading algorithm for approximate OD discovery with ours, we achieve orders-of-magnitude improvements in performance.
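
As an illustration of what validating an approximate OD can involve, the sketch below counts the minimum number of tuples to remove so that one attribute is non-decreasing when the rows are sorted by another. It is a textbook longest non-decreasing subsequence computation, not the algorithm from the paper. The function name `min_removals_for_order`, the simplifying assumption that the ordering attribute has distinct values, and the sample data are all hypothetical.

```python
from bisect import bisect_right


def min_removals_for_order(rows):
    """rows: list of (x, y) pairs with distinct x values (assumption).

    Returns the minimum number of pairs to delete so that y is
    non-decreasing when the remaining pairs are sorted by x.
    """
    ys = [y for _, y in sorted(rows)]

    # tails[k] is the smallest possible last value of a non-decreasing
    # subsequence of length k + 1 seen so far.
    tails = []
    for y in ys:
        i = bisect_right(tails, y)
        if i == len(tails):
            tails.append(y)
        else:
            tails[i] = y

    # Everything outside the longest non-decreasing subsequence must go.
    return len(ys) - len(tails)


rows = [(1, 10), (2, 20), (3, 15), (4, 30), (5, 25)]
removals = min_removals_for_order(rows)
error_ratio = removals / len(rows)
```

Dividing the removal count by the number of rows gives an error ratio that can be compared against a tolerance threshold to decide whether the approximate dependency is accepted.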

Preprint, 2020, arXiv

Consentio: Managing Consent to Data Access using Permissioned Blockchains

The increasing amount of personal data is raising serious issues in the context of privacy, security, and data ownership. Entities whose data are being collected can benefit from mechanisms to manage the parties that can access their data and to audit who has accessed their data. Consent management systems address these issues. We present Consentio, a scalable consent management system based on the Hyperledger Fabric permissioned blockchain. The data management challenge we address is to ensure high throughput and low latency of endorsing data access requests and granting or revoking consent. Experimental results show that our system can handle as many as 6,000 access requests per second, allowing it to scale to very large deployments.

Preprint, 2020, arXiv

Iterative Edit-Based Unsupervised Sentence Simplification

We present a novel iterative, edit-based approach to unsupervised sentence simplification. Our model is guided by a scoring function involving fluency, simplicity, and meaning preservation, and it iteratively performs word- and phrase-level edits on the complex sentence. Compared with previous approaches, our model does not require a parallel training set and is more controllable and interpretable. Experiments on Newsela and WikiLarge datasets show that our approach is nearly as effective as state-of-the-art supervised approaches.

Preprint, 2020, arXiv

XOX Fabric: A hybrid approach to blockchain transaction execution

Performance and scalability are major concerns for blockchains: permissionless systems are typically limited by slow proof of X consensus algorithms and sequential post-order transaction execution on every node of the network. By introducing a small amount of trust in their participants, permissioned blockchain systems such as Hyperledger Fabric can benefit from more efficient consensus algorithms and make use of parallel pre-order execution on a subset of network nodes. Fabric, in particular, has been shown to handle tens of thousands of transactions per second. However, this performance is only achievable for contention-free transaction workloads. If many transactions compete for a small set of hot keys in the world state, the effective throughput drops drastically. We therefore propose XOX: a novel two-pronged transaction execution approach that both minimizes invalid transactions in the Fabric blockchain and maximizes concurrent execution. Our approach additionally prevents unintentional denial of service attacks by clients re-submitting conflicting transactions. Even under fully contentious workloads, XOX can handle more than 3000 transactions per second, all of which would be discarded by regular Fabric.