Researcher profile

Michael J. Carey

Michael J. Carey contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
9 works
0 followers
2 topics
4 close collaborators

Published work

9 published items

Preprint, 2022, arXiv

JEDI: These aren't the JSON documents you're looking for... (Extended Version*)

The JavaScript Object Notation (JSON) is a popular data format used in document stores to natively support semi-structured data. In this paper, we address the problem of JSON similarity lookup queries: given a query document and a distance threshold $τ$, retrieve all JSON documents that are within $τ$ from the query document. Due to its recursive definition, JSON data are naturally represented as trees. Different from other hierarchical formats such as XML, JSON supports both ordered and unordered sibling collections within a single document. This feature poses a new challenge to the tree model and distance computation. We propose JSON tree, a lossless tree representation of JSON documents, and define the JSON Edit Distance (JEDI), the first edit-based distance measure for JSON documents. We develop an algorithm, called QuickJEDI, for computing JEDI by leveraging a new technique to prune expensive sibling matchings. It outperforms a baseline algorithm by an order of magnitude in runtime. To boost the performance of JSON similarity queries, we introduce an index called JSIM and a highly effective upper bound based on tree sorting. Our algorithm for the upper bound runs in $O(n τ)$ time and $O(n + τ\log n)$ space, which substantially improves the previous best bound of $O(n^2)$ time and $O(n \log n)$ space (where $n$ is the tree size). Our experimental evaluation shows that our solution scales to databases with millions of documents and JSON trees with tens of thousands of nodes.
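The mix of ordered and unordered sibling collections described above can be illustrated with a small sketch. It is not the paper's JSON tree definition: the `Node` class, the four node kinds, and the `to_tree` and `size` helpers are hypothetical names chosen for illustration. The only point it makes is that array children keep their order while object members do not, so a tree distance has to treat the two kinds of sibling lists differently.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; the kinds below are assumptions, not the paper's."""
    kind: str                      # "object", "array", "key", or "literal"
    label: object = None           # key name or literal value
    children: list = field(default_factory=list)

    @property
    def ordered(self):
        # Only the children of an array have a meaningful sibling order.
        return self.kind == "array"


def to_tree(value):
    """Convert a parsed JSON value into a tree of Node objects."""
    if isinstance(value, dict):
        members = [Node("key", k, [to_tree(v)]) for k, v in value.items()]
        return Node("object", None, members)
    if isinstance(value, list):
        return Node("array", None, [to_tree(v) for v in value])
    return Node("literal", value)


def size(node):
    """Number of nodes in the tree (the n used in the complexity bounds)."""
    return 1 + sum(size(child) for child in node.children)


doc = json.loads('{"name": "Ann", "tags": ["a", "b"]}')
tree = to_tree(doc)
```

In this sketch the root object is an unordered collection of two key nodes, and the array under `tags` is an ordered collection of two literals.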

Preprint, 2021, arXiv

Bridging BAD Islands: Declarative Data Sharing at Scale

In many Big Data applications today, information needs to be actively shared between systems managed by different organizations. To enable sharing Big Data at scale, developers would have to create dedicated server programs and glue together multiple Big Data systems for scalability. Developing and managing such glued data sharing services requires a significant amount of work from developers. In our prior work, we developed a Big Active Data (BAD) system for enabling Big Data subscriptions and analytics with millions of subscribers. Based on that, we introduce a new mechanism for enabling the sharing of Big Data at scale declaratively so that developers can easily create and provide data sharing services using declarative statements and can benefit from an underlying scalable infrastructure. We show our implementation on top of the BAD system, explain the data sharing data flow among multiple systems, and present a prototype system with experimental results.

Preprint, 2021, arXiv

PolyFrame: A Retargetable Query-based Approach to Scaling DataFrames (Extended Version)

In the last few years, the field of data science has been growing rapidly as various businesses have adopted statistical and machine learning techniques to empower their decision making and applications. Scaling data analysis, possibly including the application of custom machine learning models, to large volumes of data requires the utilization of distributed frameworks. This can lead to serious technical challenges for data analysts and reduce their productivity. AFrame, a Python data analytics library, is implemented as a layer on top of Apache AsterixDB, addressing these issues by incorporating the data scientists' development environment and transparently scaling out the evaluation of analytical operations through a Big Data management system. While AFrame is able to leverage data management facilities (e.g., indexes and query optimization) and allows users to interact with a very large volume of data, the initial version only generated SQL++ queries and only operated against Apache AsterixDB. In this work, we describe a new design that retargets AFrame's incremental query formation to other query-based database systems as well, making it more flexible for deployment against other data management systems with composable query languages.
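A rough sketch of what incremental query formation could look like: each DataFrame-style operation wraps the query built so far as a subquery, and nothing is evaluated until the final query string is sent to a database. The `LazyFrame` class, the `GENERIC_SQL` rule table, and the method names are hypothetical and are not AFrame's or PolyFrame's API. The rule table stands in for the idea of retargeting, since swapping in a different set of templates would produce queries for a different language.

```python
# Hypothetical rewrite rules for a generic SQL dialect (an assumption for
# illustration; a different backend would supply its own templates).
GENERIC_SQL = {
    "filter": "SELECT * FROM ({sub}) AS t WHERE {cond}",
    "select": "SELECT {cols} FROM ({sub}) AS t",
    "limit": "SELECT * FROM ({sub}) AS t LIMIT {n}",
}


class LazyFrame:
    """Hypothetical query-backed frame: operations compose query strings."""

    def __init__(self, query, rules=GENERIC_SQL):
        self.query = query
        self.rules = rules

    def filter(self, cond):
        # Wrap the current query as a subquery and add a predicate.
        q = self.rules["filter"].format(sub=self.query, cond=cond)
        return LazyFrame(q, self.rules)

    def select(self, *columns):
        # Wrap the current query and project the requested columns.
        q = self.rules["select"].format(sub=self.query, cols=", ".join(columns))
        return LazyFrame(q, self.rules)

    def head(self, n):
        # Wrap the current query and limit the number of rows.
        q = self.rules["limit"].format(sub=self.query, n=int(n))
        return LazyFrame(q, self.rules)


users = LazyFrame("SELECT * FROM users")
result = users.filter("age > 30").select("name").head(5)
```

Here `result.query` holds a single nested query; the composed string is left for the target system's optimizer to flatten.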

Preprint, 2020, arXiv

An IDEA: An Ingestion Framework for Data Enrichment in AsterixDB

Big Data today is being generated at an unprecedented rate from various sources such as sensors, applications, and devices, and it often needs to be enriched based on other reference information to support complex analytical queries. Depending on the use case, the enrichment operations can be compiled code, declarative queries, or machine learning models with different complexities. For enrichments that will be frequently used in the future, it can be advantageous to push their computation into the ingestion pipeline so that they can be stored (and queried) together with the data. In some cases, the referenced information may change over time, so the ingestion pipeline should be able to adapt to such changes to guarantee the currency and/or correctness of the enrichment results. In this paper, we present a new data ingestion framework that supports data ingestion at scale, enrichments requiring complex operations, and adaptiveness to reference data changes. We explain how this framework has been built on top of Apache AsterixDB and investigate its performance at scale under various workloads.

Preprint, 2020, arXiv

An LSM-based Tuple Compaction Framework for Apache AsterixDB (Extended Version)

Document database systems store self-describing semi-structured records, such as JSON, "as-is" without requiring the users to pre-define a schema. This provides users with the flexibility to change the structure of incoming records without worrying about taking the system offline or hindering the performance of currently running queries. However, the flexibility of such systems does not come for free. The large amount of redundancy in the records can introduce an unnecessary storage overhead and impact query performance. Our focus in this paper is to address the storage overhead issue by introducing a tuple compactor framework that infers and extracts the schema from self-describing semi-structured records during the data ingestion. As many prominent document stores, such as MongoDB and Couchbase, adopt Log Structured Merge (LSM) trees in their storage engines, our framework exploits LSM lifecycle events to piggyback the schema inference and extraction operations. We have implemented and empirically evaluated our approach to measure its impact on storage, data ingestion, and query performance in the context of Apache AsterixDB.
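To make the redundancy argument concrete, here is a minimal sketch of inferring a schema from a batch of self-describing records and then storing each record with small integer field ids instead of repeated field names. It is an illustration only, not the tuple compactor's algorithm: `infer_schema`, `compact`, and the id assignment are hypothetical, and the LSM lifecycle hooks mentioned in the abstract are not modeled.

```python
def infer_schema(records):
    """Collect, for every field name, the set of value type names observed."""
    schema = {}
    for record in records:
        for name, value in record.items():
            schema.setdefault(name, set()).add(type(value).__name__)
    return schema


def compact(record, field_ids):
    """Replace each field name with its integer id from the inferred schema."""
    return {field_ids[name]: value for name, value in record.items()}


records = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob", "age": 30},
    {"id": "3"},                      # same field, different type
]

schema = infer_schema(records)
# Assign ids in sorted name order so the mapping is deterministic.
field_ids = {name: i for i, name in enumerate(sorted(schema))}
compacted = [compact(r, field_ids) for r in records]
```

The field names are now stored once in `field_ids` rather than in every record, which is the kind of saving the abstract attributes to extracting the schema.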

Preprint, 2020, arXiv

BAD to the Bone: Big Active Data at its Core

Virtually all of today's Big Data systems are passive in nature, responding to queries posted by their users. Instead, we are working to shift Big Data platforms from passive to active. In our view, a Big Active Data (BAD) system should continuously and reliably capture Big Data while enabling timely and automatic delivery of relevant information to a large pool of interested users, as well as supporting retrospective analyses of historical information. While various scalable streaming query engines have been created, their active behavior is limited to a (relatively) small window of the incoming data. To this end we have created a BAD platform that combines ideas and capabilities from both Big Data and Active Data (e.g., Publish/Subscribe, Streaming Engines). It supports complex subscriptions that consider not only newly arrived items but also their relationships to past, stored data. Further, it can provide actionable notifications by enriching the subscription results with other useful data. Our platform extends an existing open-source Big Data Management System, Apache AsterixDB, with an active toolkit. The toolkit contains features to rapidly ingest semistructured data, share execution pipelines among users, manage scaled user data subscriptions, and actively monitor the state of the data to produce individualized information for each user. This paper describes the features and design of our current BAD data platform and demonstrates its ability to scale without sacrificing query capabilities or result individualization.

Preprint, 2020, arXiv

Breaking Down Memory Walls: Adaptive Memory Management in LSM-based Storage Systems (Extended Version)

Log-Structured Merge-trees (LSM-trees) have been widely used in modern NoSQL systems. Due to their out-of-place update design, LSM-trees have introduced memory walls among the memory components of multiple LSM-trees and between the write memory and the buffer cache. Optimal memory allocation among these regions is non-trivial because it is highly workload-dependent. Existing LSM-tree implementations instead adopt static memory allocation schemes due to their simplicity and robustness, sacrificing performance. In this paper, we attempt to break down these memory walls in LSM-based storage systems. We first present a memory management architecture that enables adaptive memory management. We then present a partitioned memory component structure with new flush policies to better exploit the write memory to minimize the write cost. To break down the memory wall between the write memory and the buffer cache, we further introduce a memory tuner that tunes the memory allocation between these two regions. We have conducted extensive experiments in the context of Apache AsterixDB using the YCSB and TPC-C benchmarks and we present the results here.

Preprint, 2020, arXiv

On Performance Stability in LSM-based Storage Systems (Extended Version)

The Log-Structured Merge-Tree (LSM-tree) has been widely adopted for use in modern NoSQL systems for its superior write performance. Despite the popularity of LSM-trees, they have been criticized for suffering from write stalls and large performance variances due to the inherent mismatch between their fast in-memory writes and slow background I/O operations. In this paper, we use a simple yet effective two-phase experimental approach to evaluate write stalls for various LSM-tree designs. We further explore the design choices of LSM merge schedulers to minimize write stalls given an I/O bandwidth budget. We have conducted extensive experiments in the context of the Apache AsterixDB system and we present the results here.

Preprint, 2020, arXiv

Subscribing to Big Data at Scale

Today, data is being actively generated by a variety of devices, services, and applications. Such data is important not only for the information that it contains, but also for its relationships to other data and to interested users. Most existing Big Data systems focus on passively answering queries from users, rather than actively collecting data, processing it, and serving it to users. To satisfy both passive and active requests at scale, users need either to heavily customize an existing passive Big Data system or to glue multiple systems together. Either choice would require significant effort from users and incur additional overhead. In this paper, we present the BAD (Big Active Data) system, which is designed to preserve the merits of passive Big Data systems and introduce new features for actively serving Big Data to users at scale. We show the design and implementation of the BAD system, demonstrate how BAD facilitates providing both passive and active data services, investigate the BAD system's performance at scale, and illustrate the complexities that would result from instead providing BAD-like services with a "glued" system.