Source author record

Raul Castro Fernandez

Raul Castro Fernandez appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Databases, Machine Learning

Catalog footprint

What is connected

5 works

2 topics

4 close collaborators

preprint, 2026, arXiv

Enabling Personal Dataflow Sovereignty via Bolt-on Data Escrow

The digital economy is powered by a continuous and massive exchange of personal data. Individuals provide data to platforms in return for services, from social networking and search to health monitoring, entertainment, and access to LLMs. This exchange has created immense value, but it has also established a fundamental asymmetry of power: individuals possess only coarse-grained control over data access rather than fine-grained control over its purpose of use, creating a gap where data can be repurposed for undisclosed uses, e.g., platforms selling the data to data brokers, which results in a critical loss of personal data sovereignty. This paper reframes this socio-technical challenge as a dataflow management problem. We propose a bolt-on data escrow architecture through delegated computation. In our model, instead of data flowing to platforms, platforms delegate their computation to a trustworthy escrow. This inversion empowers individuals with transparency and control over their dataflows. We present four contributions: (1) a dataflow model that explicitly incorporates computational purpose as a first-class primitive; (2) a minimally invasive programming interface, run(access(), compute()), built on a unified relational interface that virtualizes on-device data sources and a computation offloading component; (3) a concrete implementation of our escrow within the Apple ecosystem, demonstrating its practicality; and (4) both qualitative and quantitative evaluations demonstrating that our solution is expressive enough to implement a wide range of dataflows from real-world applications and introduces minimal runtime overhead. In summary, our work serves as a stepping stone toward achieving personal dataflow sovereignty.
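
The delegated-computation idea in the abstract above can be illustrated with a minimal sketch. This is not the paper's implementation: the `Escrow` class, the `purpose` argument, the `allowed_purposes` set, and the `audit_log` are all hypothetical names introduced here for illustration. Only the `run(access(), compute())` shape comes from the abstract.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set


@dataclass
class Escrow:
    """Toy escrow (hypothetical): holds the data and runs computation for a platform."""
    tables: Dict[str, List[dict]]                     # stand-in for virtualized on-device sources
    allowed_purposes: Set[str] = field(default_factory=set)
    audit_log: List[str] = field(default_factory=list)

    def run(self,
            access: Callable[[Dict[str, List[dict]]], Any],
            compute: Callable[[Any], Any],
            purpose: str) -> Any:
        # Assumption: the purpose is declared explicitly and checked before any data is read.
        if purpose not in self.allowed_purposes:
            self.audit_log.append(f"denied: {purpose}")
            raise PermissionError(f"purpose not approved: {purpose}")
        rows = access(self.tables)    # data is read inside the escrow
        result = compute(rows)        # the platform's computation runs inside the escrow
        self.audit_log.append(f"ran: {purpose}")
        return result                 # only the computed result is returned


escrow = Escrow(
    tables={"steps": [{"day": "mon", "count": 4000}, {"day": "tue", "count": 6000}]},
    allowed_purposes={"weekly_step_average"},
)

average = escrow.run(
    access=lambda tables: [row["count"] for row in tables["steps"]],
    compute=lambda counts: sum(counts) / len(counts),
    purpose="weekly_step_average",
)
```

In this sketch the platform never receives the raw rows. It supplies the two functions and gets back only the result, while the log gives the individual a record of which purposes were run or denied.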

preprint, 2026, arXiv

The Pneuma Project: Reifying Information Needs as Relational Schemas to Automate Discovery, Guide Preparation, and Align Data with Intent

Data discovery and preparation remain persistent bottlenecks in the data management lifecycle, especially when user intent is vague, evolving, or difficult to operationalize. The Pneuma Project introduces Pneuma-Seeker, a system that helps users articulate and fulfill information needs through iterative interaction with a language model-powered platform. The system reifies the user's evolving information need as a relational data model and incrementally converges toward a usable document aligned with that intent. To achieve this, the system combines three architectural ideas: context specialization to reduce LLM burden across subtasks, a conductor-style planner to assemble dynamic execution plans, and a convergence mechanism based on shared state. The system integrates recent advances in retrieval-augmented generation (RAG), agentic frameworks, and structured data preparation to support semi-automatic, language-guided workflows. We evaluate the system through LLM-based user simulations and show that it helps surface latent intent, guide discovery, and produce fit-for-purpose documents. It also acts as an emergent documentation layer, capturing institutional knowledge and supporting organizational memory.

preprint, 2020, arXiv

ARDA: Automatic Relational Data Augmentation for Machine Learning

Automatic machine learning (AutoML) is a family of techniques to automate the process of training predictive models, aiming to both improve performance and make machine learning more accessible. While many recent works have focused on aspects of the machine learning pipeline like model selection, hyperparameter tuning, and feature selection, relatively few works have focused on automatic data augmentation. Automatic data augmentation involves finding new features relevant to the user's predictive task with minimal "human-in-the-loop" involvement. We present ARDA, an end-to-end system that takes as input a dataset and a data repository, and outputs an augmented dataset such that training a predictive model on this augmented dataset results in improved performance. Our system has two distinct components: (1) a framework to search and join data with the input data, based on various attributes of the input, and (2) an efficient feature selection algorithm that prunes out noisy or irrelevant features from the resulting join. We perform an extensive empirical evaluation of different system components and benchmark our feature selection algorithm on real-world datasets.
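
The two components named in the abstract above (join, then prune) can be sketched as follows. This is a toy illustration, not ARDA's algorithm: the abstract does not describe the feature selection method, so a plain Pearson-correlation filter is swapped in here. The function names, the `threshold` parameter, and the sample tables are all hypothetical.

```python
from math import sqrt
from typing import List


def left_join(base: List[dict], candidate: List[dict], key: str) -> List[dict]:
    """Left-join candidate columns onto base rows by key; unmatched rows get 0.0."""
    index = {row[key]: row for row in candidate}
    extra_cols = [col for col in candidate[0] if col != key]
    joined = []
    for row in base:
        match = index.get(row[key], {})
        joined.append({**row, **{col: match.get(col, 0.0) for col in extra_cols}})
    return joined


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return 0.0 if sd_x == 0 or sd_y == 0 else cov / (sd_x * sd_y)


def select_features(rows: List[dict], target: str, key: str,
                    threshold: float = 0.5) -> List[str]:
    """Keep columns whose absolute correlation with the target meets the threshold."""
    ys = [row[target] for row in rows]
    return [
        col for col in rows[0]
        if col not in (target, key)
        and abs(pearson([row[col] for row in rows], ys)) >= threshold
    ]


# Input dataset (key + prediction target) and one candidate table from a repository.
base = [{"id": i, "price": p} for i, p in zip(range(1, 5), [10.0, 20.0, 30.0, 40.0])]
candidate = [
    {"id": 1, "sqft": 100.0, "noise": 5.0},
    {"id": 2, "sqft": 200.0, "noise": 1.0},
    {"id": 3, "sqft": 300.0, "noise": 1.0},
    {"id": 4, "sqft": 400.0, "noise": 5.0},
]

augmented = left_join(base, candidate, key="id")
selected = select_features(augmented, target="price", key="id")
```

Here `sqft` tracks `price` exactly and is kept, while `noise` is uncorrelated with the target and is pruned. A real system would need to handle many candidate tables, non-numeric columns, and a more robust selection criterion.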

preprint, 2020, arXiv

Data Market Platforms: Trading Data Assets to Solve Data Problems

Data only generates value for a few organizations with expertise and resources to make data shareable, discoverable, and easy to integrate. Sharing data that is easy to discover and integrate is hard because data owners lack information (who needs what data) and they do not have incentives to prepare the data in a way that is easy to consume by others. In this paper, we propose data market platforms to address the lack of information and incentives and tackle the problems of data sharing, discovery, and integration. In a data market platform, data owners want to share data because they will be rewarded if they do so. Consumers are encouraged to share their data needs because the market will solve the discovery and integration problem for them in exchange for some form of currency. We consider internal markets that operate within organizations to bring down data silos, as well as external markets that operate across organizations to increase the value of data for everybody. We outline a research agenda that revolves around two problems: the problem of market design, or how to design rules that lead to the outcomes we want, and the systems problem, or how to implement the market and enforce the rules. Treating data as a first-class asset is sorely needed to extend the value of data to more organizations, and we propose data market platforms as one mechanism to achieve this goal.

preprint, 2020, arXiv

The Data Station: Combining Data, Compute, and Market Forces

This paper introduces Data Stations, a new data architecture that we are designing to tackle some of the most challenging data problems that we face today: access to sensitive data; data discovery and integration; and governance and compliance. Data Stations depart from modern data lakes in that both data and derived data products, such as machine learning models, are sealed and cannot be directly seen, accessed, or downloaded by anyone. Data Stations do not deliver data to users; instead, users bring questions to data. This inversion of the usual relationship between data and compute mitigates many of the security risks that are otherwise associated with sharing and working with sensitive data. Data Stations are designed following the principle that many data problems require human involvement, and that incentives are the key to obtaining such involvement. To that end, Data Stations implement market designs to create, manage, and coordinate the use of incentives. We explain the motivation for this new kind of platform and its design.
