Source author record

Rahul Shah

Rahul Shah appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Data Structures and Algorithms Databases math.CO math.CT math.QA

Catalog footprint

What is connected

6 works

5 topics

4 close collaborators

preprint · 2016 · arXiv

Parameterized Pattern Matching -- Succinctly

We consider the $Parameterized$ $Pattern$ $Matching$ problem, where a pattern $P$ matches some location in a text $\mathsf{T}$ iff there is a one-to-one correspondence between the alphabet symbols of the pattern and those of the text. More specifically, assume that the text $\mathsf{T}$ contains $n$ characters from a static alphabet $Σ_s$ and a parameterized alphabet $Σ_p$, where $Σ_s \cap Σ_p = \varnothing$ and $|Σ_s \cup Σ_p| = σ$. A pattern $P$ matches a substring $S$ of $\mathsf{T}$ iff the static characters match exactly and there exists a one-to-one function that renames the parameterized characters in $S$ to those in $P$. The previous indexing solution [Baker, STOC 1993], known as the $Parameterized$ $Suffix$ $Tree$, requires $Θ(n \log n)$ bits of space and can find all $occ$ occurrences of $P$ in $\mathcal{O}(|P| \log σ + occ)$ time. In this paper, we present the first succinct index that occupies $n \log σ + \mathcal{O}(n)$ bits and answers queries in $\mathcal{O}((|P| + occ \cdot \log n) \log σ \log \log σ)$ time. We also present a compact index that occupies $\mathcal{O}(n \log σ)$ bits and answers queries in $\mathcal{O}(|P| \log σ + occ \cdot \log n)$ time. Furthermore, the techniques are extended to obtain the first succinct representation of the index of Shibuya for $Structural$ $Matching$ [SWAT, 2000], and of Idury and Schäffer for $Parameterized$ $Dictionary$ $Matching$ [CPM, 1994].
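To make the matching rule in this abstract concrete, here is a minimal brute-force sketch in Python. It is not the succinct index from the paper. It relies on the standard "prev" encoding associated with Baker's parameterized suffix tree: each parameterized symbol is replaced by the distance to its previous occurrence (0 for a first occurrence), and two strings parameterized-match exactly when their encodings are equal. The function names `prev_encode` and `p_match_positions` are hypothetical, chosen only for this illustration.

```python
def prev_encode(s, static):
    """Return the prev encoding of s.

    Symbols in `static` are kept as-is. Every other (parameterized) symbol
    is replaced by the distance to its previous occurrence in s, or 0 if
    this is its first occurrence.
    """
    last_seen = {}
    encoded = []
    for i, c in enumerate(s):
        if c in static:
            encoded.append(c)
        else:
            encoded.append(i - last_seen[c] if c in last_seen else 0)
            last_seen[c] = i
    return encoded


def p_match_positions(text, pattern, static):
    """Naive scan: re-encode every window of the text and compare.

    This takes O(n * |P|) time and is only meant to illustrate the
    definition, not the index structures discussed in the abstract.
    """
    m = len(pattern)
    target = prev_encode(pattern, static)
    return [
        i
        for i in range(len(text) - m + 1)
        if prev_encode(text[i:i + m], static) == target
    ]


if __name__ == "__main__":
    # With no static symbols, "xyx" matches "aba" and "cdc" under renaming.
    print(p_match_positions("abacdc", "xyx", set()))
    # With "+" static, "x+x" matches only "a+a".
    print(p_match_positions("a+a+b", "x+x", {"+"}))
```

Each window must be re-encoded from scratch because a "previous occurrence" that falls before the window start no longer counts; handling this without re-encoding is one of the difficulties an index for this problem has to address.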

preprint · 2015 · arXiv

Probabilistic Threshold Indexing for Uncertain Strings

Strings form a fundamental data type in computer systems. String searching has been extensively studied since the inception of computer science. Increasingly many applications have to deal with imprecise strings or strings with fuzzy information in them. String matching becomes a probabilistic event when a string contains uncertainty, i.e., each position of the string can have different probable characters, each with an associated probability of occurrence. Such uncertain strings are prevalent in various applications such as biological sequence data, event monitoring and automatic ECG annotations. We explore the problem of indexing uncertain strings to support efficient string searching. In this paper we consider two basic problems of string searching, namely substring searching and string listing. In substring searching, the task is to find the occurrences of a deterministic string in an uncertain string. We formulate the string listing problem for uncertain strings, where the objective is to output all the strings from a collection of strings that contain a probable occurrence of a deterministic query string. Indexing solutions for both of these problems are significantly more challenging for uncertain strings than for deterministic strings. Given a construction-time probability value $τ$, our indexes can be constructed in linear space and support queries in near-optimal time for arbitrary values of the probability threshold parameter greater than $τ$. To the best of our knowledge, this is the first indexing solution for searching in uncertain strings that achieves strong theoretical bounds and supports arbitrary values of the probability threshold parameter. We also propose an approximate substring search index that can answer substring search queries with an additive error in optimal time. We conduct experiments to evaluate the performance of our indexes.
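The substring searching problem described above can be illustrated with a small brute-force scan in Python. This is a sketch under assumptions that the abstract does not spell out: positions are treated as independent, so the probability of an occurrence is the product of the per-position character probabilities, and an occurrence is reported when that product is at least the threshold. The representation (a list of dictionaries) and the name `probable_occurrences` are hypothetical. The indexes in the paper avoid this linear scan.

```python
def probable_occurrences(uncertain, pattern, tau):
    """Return (position, probability) pairs where `pattern` occurs in the
    uncertain string with probability at least `tau`.

    `uncertain` is a list with one dict per position, mapping each possible
    character to its probability. Positions are assumed independent.
    """
    m = len(pattern)
    hits = []
    for i in range(len(uncertain) - m + 1):
        prob = 1.0
        for j, c in enumerate(pattern):
            prob *= uncertain[i + j].get(c, 0.0)
            if prob < tau:
                # The product can only shrink, so this window cannot qualify.
                break
        else:
            hits.append((i, prob))
    return hits


if __name__ == "__main__":
    s = [
        {"a": 0.5, "b": 0.5},
        {"a": 1.0},
        {"a": 0.25, "b": 0.75},
    ]
    print(probable_occurrences(s, "ab", 0.5))
    print(probable_occurrences(s, "aa", 0.3))
```

The early exit works because multiplying by a probability never increases the running product, which is the same monotonicity that threshold-based pruning relies on in general.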

preprint · 2012 · arXiv

On Optimal Top-K String Retrieval

Let $\mathcal{D} = \{d_1, d_2, d_3, \ldots, d_D\}$ be a given set of $D$ (string) documents of total length $n$. The top-$k$ document retrieval problem is to index $\mathcal{D}$ such that when a pattern $P$ of length $p$ and a parameter $k$ come as a query, the index returns the $k$ most relevant documents to the pattern $P$. Hon et al. \cite{HSV09} gave the first linear space framework to solve this problem in $O(p + k\log k)$ time. This was improved by Navarro and Nekrich \cite{NN12} to $O(p + k)$. These results are powerful enough to support arbitrary relevance functions like frequency, proximity, PageRank, etc. In many applications like desktop or email search, the data resides on disk and hence disk-bound indexes are needed. Despite continued progress on this problem in terms of theoretical, practical and compression aspects, any non-trivial bounds in the external memory model have so far been elusive. Internal memory (or RAM) solutions to this problem decompose the problem into $O(p)$ subproblems and thus incur the additive factor of $O(p)$. In external memory, these approaches lead to $O(p)$ I/Os instead of the optimal $O(p/B)$ I/O term, where $B$ is the block size. We re-interpret the problem independent of $p$, as interval stabbing with priority over a tree-shaped structure. This leads us to a linear space index in external memory supporting top-$k$ queries (with unsorted outputs) in near-optimal $O(p/B + \log_B n + \log^{(h)} n + k/B)$ I/Os for any constant $h$, where $\log^{(1)} n = \log n$ and $\log^{(h)} n = \log(\log^{(h-1)} n)$. We then obtain an $O(n\log^* n)$ space index with optimal $O(p/B + \log_B n + k/B)$ I/Os.

preprint · 2012 · arXiv

Towards an Optimal Space-and-Query-Time Index for Top-k Document Retrieval

Let $\mathcal{D} = \{d_1, d_2, \ldots, d_D\}$ be a given set of $D$ string documents of total length $n$. Our task is to index $\mathcal{D}$ such that the $k$ most relevant documents for an online query pattern $P$ of length $p$ can be retrieved efficiently. We propose an index of size $|CSA| + n\log D(2+o(1))$ bits and $O(t_s(p) + k\log\log n + poly\log\log n)$ query time for the basic relevance metric term-frequency, where $|CSA|$ is the size (in bits) of a compressed full-text index of $\mathcal{D}$, with $O(t_s(p))$ time for searching a pattern of length $p$. We further reduce the space to $|CSA| + n\log D(1+o(1))$ bits; however, the query time then becomes $O(t_s(p) + k(\log σ \log\log n)^{1+ε} + poly\log\log n)$, where $σ$ is the alphabet size and $ε > 0$ is any constant.
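As a point of reference for what such an index computes, the following Python sketch answers a top-$k$ term-frequency query by brute force: it counts the (possibly overlapping) occurrences of the pattern in every document and keeps the $k$ highest counts. It is a baseline that scans all documents on every query, not the compressed index proposed in the paper. The function names and the tie-breaking rule (lower document index first) are assumptions made for this illustration.

```python
import heapq


def count_occurrences(doc, pattern):
    """Count occurrences of `pattern` in `doc`, allowing overlaps."""
    count = 0
    start = doc.find(pattern)
    while start != -1:
        count += 1
        start = doc.find(pattern, start + 1)
    return count


def top_k_by_term_frequency(docs, pattern, k):
    """Return up to k (document index, term frequency) pairs, highest
    frequency first. Documents that do not contain the pattern are omitted;
    ties are broken in favour of the lower document index.
    """
    scored = []
    for i, doc in enumerate(docs):
        tf = count_occurrences(doc, pattern)
        if tf > 0:
            scored.append((tf, i))
    best = heapq.nlargest(k, scored, key=lambda item: (item[0], -item[1]))
    return [(i, tf) for tf, i in best]


if __name__ == "__main__":
    docs = ["banana", "bandana", "cabana", "xyz"]
    print(top_k_by_term_frequency(docs, "ana", 2))
```

The query time here grows with the total document length $n$, whereas the bounds quoted in the abstract depend on $p$ and $k$, plus lower-order terms in $n$ and $σ$.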

preprint · 2010 · arXiv

Fully Dynamic Data Structure for Top-k Queries on Uncertain Data

Top-$k$ queries allow end-users to focus on the most important (top-$k$) answers amongst those which satisfy the query. In traditional databases, a user-defined score function assigns a score value to each tuple and a top-$k$ query returns the $k$ tuples with the highest scores. In an uncertain database, the top-$k$ answer depends not only on the scores but also on the membership probabilities of tuples. Several top-$k$ definitions covering different aspects of score-probability interplay have been proposed in the recent past \cite{R10,R4,R2,R8}. Most of the existing work in this research field is focused on developing efficient algorithms for answering top-$k$ queries on static uncertain data. Any change (insertion or deletion of a tuple, or a change in the membership probability or score of a tuple) in the underlying data forces re-computation of query answers. Such re-computations are not practical considering the dynamic nature of data in many applications. In this paper, we propose a fully dynamic data structure that uses the ranking function $PRF^e(α)$ proposed by Li et al. \cite{R8} under the generally adopted model of $x$-relations \cite{R11}. $PRF^e$ can effectively approximate various other top-$k$ definitions on uncertain data based on the value of parameter $α$. An $x$-relation consists of a number of $x$-tuples, where an $x$-tuple is a set of mutually exclusive tuples (up to a constant number) called alternatives. Each $x$-tuple in a relation randomly instantiates into one tuple from its alternatives. For an uncertain relation with $N$ tuples, our structure answers top-$k$ queries in $O(k\log N)$ time, handles an update in $O(\log N)$ time and takes $O(N)$ space. Finally, we evaluate the practical efficiency of our structure on both synthetic and real data.

preprint · 2009 · arXiv

Visibility graphs and deformations of associahedra

The associahedron is a convex polytope whose face poset is based on nonintersecting diagonals of a convex polygon. In this paper, given an arbitrary simple polygon P, we construct a polytopal complex analogous to the associahedron based on convex diagonalizations of P. We describe topological properties of this complex and provide realizations based on secondary polytopes. Moreover, using the visibility graph of P, a deformation space of polygons is created which encapsulates substructures of the associahedron.