Source author record

Davood Rafiei

Davood Rafiei appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Databases · Computation and Language

Catalog footprint

What is connected

6 works

2 topics

4 close collaborators

preprint · 2026 · arXiv

FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents

Text-to-SQL over large analytical databases requires navigating complex schemas, resolving ambiguous queries, and grounding decisions in actual data. Most current systems follow a fixed pipeline where schema elements are retrieved once upfront and the database is only revisited for post-hoc repair, limiting recovery from early mistakes. We present FlexSQL, a text-to-SQL agent whose core design principle is flexible database interaction: the agent can explore schema structure, inspect data values, and run verification queries at any point during reasoning. FlexSQL generates diverse execution plans to cover multiple query interpretations, implements each plan in either SQL or Python depending on the task, and uses a two-tiered repair mechanism that can backtrack from code-level errors to plan-level revisions. On Spider2-Snow, using gpt-oss-120b, FlexSQL achieves a 65.4% score, outperforming strong open-source baselines that use stronger, larger models such as gpt-o3 and DeepSeek-R1. When integrated into a general-purpose coding agent (as skills in Claude Code), our approach yields over 10% relative improvement on Spider2-Snow. Further analysis shows that flexible exploration and flexible execution jointly contribute to the effectiveness of our approach, highlighting flexibility as a key design principle. Our code is available at: https://github.com/StringNLPLAB/FlexSQL
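The three interaction types named in the abstract (exploring schema structure, inspecting data values, running a verification query) can be illustrated with a minimal sketch. This is not FlexSQL's code: it uses Python's standard `sqlite3` module as a stand-in for the Snowflake databases behind Spider2-Snow, and the helper names `explore_schema` and `sample_values`, the `orders` table and its rows are all hypothetical.

```python
import sqlite3

def explore_schema(conn):
    """Return {table: [column names]} for every table (hypothetical helper)."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    schema = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        cols = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [col[1] for col in cols]
    return schema

def sample_values(conn, table, column, limit=5):
    """Return a few distinct values of one column (hypothetical helper)."""
    rows = conn.execute(
        f'SELECT DISTINCT "{column}" FROM "{table}" ORDER BY 1 LIMIT ?', (limit,))
    return [row[0] for row in rows]

# Illustrative in-memory database; the table and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "shipped"), (2, "SHIPPED"), (3, "pending")])

schema = explore_schema(conn)                       # step 1: schema structure
statuses = sample_values(conn, "orders", "status")  # step 2: actual data values

# Step 3: a verification query. Inspecting the values showed mixed casing,
# so the filter is made case-insensitive before it is trusted.
shipped_count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE LOWER(status) = 'shipped'").fetchone()[0]
```

In this toy case, inspecting the values before writing the final filter reveals the mixed casing that a case-sensitive `status = 'shipped'` predicate would miss.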

preprint · 2022 · arXiv

Efficiently Transforming Tables for Joinability

Data from different sources rarely conform to a single formatting even if they describe the same set of entities, and this raises concerns when data from multiple sources must be joined or cross-referenced. Such a formatting mismatch is unavoidable when data is gathered from various public and third-party sources. Commercial database systems are not able to perform the join when there exist differences in data representation or formatting, and manual reformatting is both time consuming and error-prone. We study the problem of efficiently joining textual data under the condition that the join columns are not formatted the same and cannot be equi-joined, but they become joinable under some transformations. The problem is challenging simply because the number of possible transformations explodes with both the length of the input and the number of rows, even if each transformation is formed using very few basic units. We show that an efficient algorithm can be developed based on the common characteristics of the joined columns, and develop one such algorithm over a rich set of basic operations that can be composed to form transformations. We compare both the coverage and the running time of our algorithm to a state-of-the-art approach, and show that our algorithm covers every transformation that is covered in the state-of-the-art approach but is a few orders of magnitude faster, as evaluated on various real and synthetic data.
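To make the idea of a transformation composed from a few basic units concrete, here is a small sketch. It assumes a hypothetical set of units (`split_part`, `substr`, `literal`, `concat`) and made-up name strings; it is not the operation set or the search algorithm from the paper. It only shows how one composed transformation turns a non-joinable column into an equi-joinable one.

```python
# Hypothetical basic units; each returns a function from string to string.
def split_part(sep, index):
    """Split on `sep` and keep the piece at `index`, trimmed."""
    return lambda s: s.split(sep)[index].strip()

def substr(start, end):
    """Keep the characters in [start, end)."""
    return lambda s: s[start:end]

def literal(text):
    """Ignore the input and emit a constant string."""
    return lambda s: text

def compose(first, second):
    """Apply `first`, then `second`."""
    return lambda s: second(first(s))

def concat(*units):
    """Apply every unit to the same input and concatenate the results."""
    return lambda s: "".join(unit(s) for unit in units)

# One composed transformation: "Last, First" -> "F. Last".
to_short = concat(
    compose(split_part(",", 1), substr(0, 1)),  # first initial
    literal(". "),
    split_part(",", 0),                         # last name
)

# Illustrative columns that cannot be equi-joined as they stand.
left = ["Smith, Alice", "Jones, Robert"]
right = {"R. Jones", "A. Smith", "C. Brown"}

# After transforming the left column, an ordinary equality join works.
joined = [(row, to_short(row)) for row in left if to_short(row) in right]
```

Even with only four unit types, the number of possible compositions grows quickly with input length, which is the blow-up the abstract refers to.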

preprint · 2022 · arXiv

Tracking Where Events Take Place: Reverse Spatial Term Queries on Streaming Data

A large volume of content generated by online users is geo-tagged and this provides a rich source for querying in various location-based services. An important class of queries within such services involves the association between content and locations. In this paper, we study two types of queries on streaming geo-tagged data: 1) "Top-k reverse frequent spatial queries", where given a term, the goal is to find the top k locations where the term is frequent, and 2) "Term frequency spatial queries", where the goal is to find the expected frequency of a term in a given location. To efficiently support these queries in a streaming setting, we model terms as events and explore a probabilistic model of geographical distribution that allows us to estimate the frequency of terms in locations that are not kept in a stream sketch or summary. We study the back-and-forth relationship between the efficiency of queries, the efficiency of updates and the accuracy of the results and identify some sweet spots where both efficient and effective algorithms can be developed. We demonstrate that our method can be extended to support multi-term queries. To evaluate the efficiency of our algorithms, we conduct experiments on a relatively large collection of both geo-tagged tweets and geo-tagged Flickr photos. The evaluation reveals that our proposed method achieves a high accuracy when only a limited amount of memory is given. Compared to a recent baseline, the query time is improved by 2-3 orders of magnitude without much loss in accuracy, and the update time can be further improved by at least an order of magnitude under some term distributions or update strategies.
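The two query types can be pinned down with an exact, unbounded-memory baseline. The sketch below is not the paper's method, which works from a bounded stream summary plus a probabilistic model; it keeps a full counter per term, and the class name `TermLocationCounts` and the sample stream are hypothetical.

```python
from collections import Counter

class TermLocationCounts:
    """Exact per-term location counts over a stream (illustrative baseline)."""

    def __init__(self):
        self._counts = {}  # term -> Counter mapping location -> occurrences

    def update(self, location, terms):
        """Process one geo-tagged item from the stream."""
        for term in terms:
            self._counts.setdefault(term, Counter())[location] += 1

    def top_k_locations(self, term, k):
        """Top-k reverse frequent spatial query: where is `term` most frequent?"""
        return self._counts.get(term, Counter()).most_common(k)

    def term_frequency(self, term, location):
        """Term frequency spatial query: how often was `term` seen at `location`?"""
        return self._counts.get(term, Counter())[location]

# Made-up stream of (location, terms) items.
stream = [
    ("Edmonton", ["hockey", "snow"]),
    ("Calgary", ["hockey"]),
    ("Edmonton", ["hockey"]),
    ("Vancouver", ["rain"]),
    ("Edmonton", ["hockey"]),
]

index = TermLocationCounts()
for location, terms in stream:
    index.update(location, terms)

top_hockey = index.top_k_locations("hockey", 2)
snow_in_edmonton = index.term_frequency("snow", "Edmonton")
rain_in_edmonton = index.term_frequency("rain", "Edmonton")
```

The baseline needs memory proportional to the number of distinct (term, location) pairs. That cost is what motivates answering the same queries from a limited-memory summary and estimating counts for locations the summary has dropped.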

preprint · 2020 · arXiv

Efficient Error-tolerant Search on Knowledge Graphs

Edge-labeled graphs are widely used to describe relationships between entities in a database. Given a query subgraph that represents an example of what the user is searching for, we study the problem of efficiently searching for similar subgraphs in a large data graph, where the similarity is defined in terms of the well-known graph edit distance. We call these queries "error-tolerant exemplar queries" since matches are allowed despite small variations in the graph structure and the labels. The problem in its general case is computationally intractable, but efficient solutions are reachable for labeled graphs under well-behaved distribution of the labels, commonly found in knowledge graphs. We propose two efficient exact algorithms, based on a filtering-and-verification framework, for finding subgraphs in a large data graph that are isomorphic to a query graph under some edit operations. Our filtering scheme, which uses the neighbourhood structure around a node and the presence or absence of paths, significantly reduces the number of candidates that are passed to the verification stage. Moreover, we analyze the costs of our algorithms and the conditions under which one algorithm is expected to outperform the other. Our analysis identifies some of the variables that affect the cost, including the number and the selectivity of query edge labels and the degree of nodes in the data graph, and characterizes their relationships. We empirically evaluate the effectiveness of our filtering schemes and queries, the efficiency of our algorithms and the reliability of our cost models on real datasets.
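The filtering half of a filtering-and-verification framework can be illustrated with a very simple neighbourhood test. This is a sketch under stated assumptions, not the filtering scheme from the paper: it assumes each node is summarized by the multiset of labels on its incident edges, and the function names and the tiny graph are hypothetical. The rationale is that every query edge label missing around a candidate node forces at least one edit, so a candidate missing more labels than the edit budget cannot match and can be pruned before the expensive verification step.

```python
from collections import Counter

def missing_labels(query_labels, data_labels):
    """Count query edge labels (as a multiset) not available at the data node."""
    # Counter subtraction keeps only positive counts, i.e. the unmet labels.
    return sum((Counter(query_labels) - Counter(data_labels)).values())

def filter_candidates(query_labels, data_nodes, max_edits):
    """Keep data nodes whose neighbourhood could match within `max_edits` edits."""
    return [node for node, labels in data_nodes.items()
            if missing_labels(query_labels, labels) <= max_edits]

# Labels on the edges incident to one query node (made-up example).
query_node_labels = ["bornIn", "worksAt", "worksAt"]

# Labels on the edges incident to each data node (made-up example).
data_nodes = {
    "a": ["bornIn", "worksAt", "worksAt", "livesIn"],  # nothing missing
    "b": ["bornIn", "worksAt"],                        # one worksAt missing
    "c": ["livesIn"],                                  # all three missing
}

candidates = filter_candidates(query_node_labels, data_nodes, max_edits=1)
```

The test is only a necessary condition: surviving candidates still have to go through verification, but nodes such as `c` never reach it.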

preprint · 2019 · arXiv

Similarity Join and Similarity Self-Join Size Estimation in a Streaming Environment

We study the problem of similarity self-join and similarity join size estimation in a streaming setting where the goal is to estimate, in one scan of the input and with sublinear space in the input size, the number of record pairs that have a similarity within a given threshold. The problem has many applications in data cleaning and query plan generation, where the cost of a similarity join may be estimated before actually doing the join. On unary input where two records either match or don't match, the problem becomes join and self-join size estimation for which one-pass algorithms are readily available. Our work addresses the problem for d-ary input, for d >= 1, where the degree of similarity can vary from 1 to d. We show that our proposed algorithm gives an accurate estimate and scales well with the input size. We provide error bounds and time and space costs, and conduct an extensive experimental evaluation of our algorithm, comparing its estimation accuracy to a few competitors, including some multi-pass algorithms. Our results show that given the same space, the proposed algorithm has an order of magnitude less error for a large range of similarity thresholds.
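The quantity being estimated can be made concrete with a brute-force computation. The sketch below is not the streaming algorithm from the paper: it stores all records and compares every pair, which is exactly what a one-pass, sublinear-space estimator avoids. It assumes similarity is the number of attribute positions on which two d-ary records agree, and that the self-join size counts unordered pairs of distinct records; the records are made up.

```python
from itertools import combinations

def similarity(r, s):
    """Number of attribute positions on which two d-ary records agree."""
    return sum(1 for a, b in zip(r, s) if a == b)

def self_join_size(records, threshold):
    """Exact count of unordered record pairs with similarity >= threshold."""
    return sum(1 for r, s in combinations(records, 2)
               if similarity(r, s) >= threshold)

# Four illustrative records with d = 3 attributes each.
records = [
    ("a", "x", "1"),
    ("a", "x", "2"),
    ("a", "y", "2"),
    ("b", "y", "2"),
]

# Join size for each threshold from 1 to d.
sizes = {t: self_join_size(records, t) for t in (1, 2, 3)}
```

The exact computation takes quadratic time and linear space in the number of records, which is why an estimate obtained in one scan is attractive for query planning.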

preprint · 2012 · arXiv

Efficient Indexing and Querying over Syntactically Annotated Trees

Natural language text corpora are often available as sets of syntactically parsed trees. A wide range of expressive tree queries are possible over such parsed trees that open a new avenue in searching over natural language text. They not only allow for querying roles and relationships within sentences, but also improve search effectiveness compared to flat keyword queries. One major drawback of current systems supporting querying over parsed text is the performance of evaluating queries over large data. In this paper we propose a novel indexing scheme over unique subtrees as index keys. We also propose a novel root-split coding scheme that stores subtree structural information only partially, thus reducing index size and improving querying performance. Our extensive set of experiments shows that root-split coding reduces the index size of any interval coding which stores individual node numbers by 50% to 80%, depending on the sizes of subtrees indexed. Moreover, we show that our index using root-split coding outperforms previous approaches by at least an order of magnitude in terms of the response time of queries.