Researcher profile

Juliana Freire

Juliana Freire contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 (Emerging) · Verification L1 · Unclaimed author
8 works
0 followers
4 topics
4 close collaborators

Published work

8 published items

preprint · 2022 · arXiv

GPU-Powered Spatial Database Engine for Commodity Hardware: Extended Version

Given the massive growth in the volume of spatial data, there is a great need for systems that can efficiently evaluate spatial queries over large data sets. These queries are notoriously expensive using traditional database solutions. While faster response times can be attained through powerful clusters or servers with large main memory, these options, due to cost and complexity, are out of reach for many of the data scientists and analysts who make up the long tail. Graphics Processing Units (GPUs), which are now widely available even in commodity desktops and laptops, provide a cost-effective alternative to support high-performance computing, opening up new opportunities for the efficient evaluation of spatial queries. While GPU-based approaches proposed in the literature have shown great improvements in performance, they are tied to specific GPU hardware and only handle specific queries over fixed geometry types. In this paper we present SPADE, a GPU-powered spatial database engine that supports a rich set of spatial queries. We discuss the challenges involved in attaining efficient query evaluation over large datasets as well as portability across different GPU hardware, and how these are addressed in SPADE. We performed a detailed experimental evaluation to assess the effectiveness of the system for a wide range of queries and datasets, and report results which show that SPADE is scalable and able to handle data larger than main memory, and that its performance on a laptop is on par with that of other systems that require clusters or large-memory servers.

preprint · 2022 · arXiv

Understanding how people consume low quality and extreme news using web traffic data

To mitigate the spread of fake news, researchers need to understand who visits fake news sites, what brings people to those sites, where visitors come from, and what content they prefer to consume. In this paper, we analyze web traffic data from The Gateway Pundit (TGP), a popular far-right website that is known for repeatedly sharing false information and that has made its web traffic available to the general public. We collect data on 68 million web traffic visits to the site over a one-month period and analyze how people consume news via multiple features. Our traffic analysis shows that search engines and social media platforms are the main drivers of traffic; our geo-location analysis reveals that TGP is more popular in counties that voted for Trump in 2020; and our topic analysis shows that conspiratorial articles receive more visits than factual articles. Due to the inability to observe direct website traffic, existing research uses alternative data sources such as engagement signals from social media posts. To validate whether social media engagement signals correlate with actual web visit counts, we collect all Facebook and Twitter posts with URLs from TGP during the same time period. We show that all engagement signals positively correlate with web visit counts, but with varying correlation strengths. Metrics based on Facebook posts correlate better than metrics based on Twitter. Our unique web traffic data set and insights can help researchers to better measure the impact of far-right and fake news URLs on social media platforms.
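The validation step described in this abstract, checking whether social media engagement signals track actual visit counts, can be illustrated with a rank correlation. The sketch below is only an illustration: the abstract does not say which correlation measure the authors used, so Spearman's rank correlation is an assumption, and the per-article numbers and the names `visits`, `facebook_shares` and `tweets` are made up for the example.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-article counts, invented purely for illustration.
visits = [120, 340, 90, 560, 410]
facebook_shares = [10, 35, 8, 70, 30]
tweets = [5, 6, 4, 12, 3]

print(round(spearman(visits, facebook_shares), 3))  # 0.9
print(round(spearman(visits, tweets), 3))           # 0.4
```

With these toy numbers both signals correlate positively with visits, one more strongly than the other. The numbers were chosen to mirror the kind of comparison the abstract reports and are not a reproduction of its results.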

preprint · 2020 · arXiv

A GPU-friendly Geometric Data Model and Algebra for Spatial Queries: Extended Version

The availability of low-cost sensors has led to an unprecedented growth in the volume of spatial data. However, the time required to evaluate even simple spatial queries over large data sets greatly hampers our ability to interactively explore these data sets and extract actionable insights. Graphics Processing Units (GPUs) are increasingly being used to speed up spatial queries. However, existing GPU-based solutions have two important drawbacks: they are often tightly coupled to the specific query types they target, making it hard to adapt them for other queries; and since their design is based on CPU-based approaches, it can be difficult to effectively utilize all the benefits provided by the GPU. As a first step towards making GPU spatial query processing mainstream, we propose a new model that represents spatial data as geometric objects and define an algebra consisting of GPU-friendly composable operators that operate over these objects. We demonstrate the expressiveness of the proposed algebra by formulating standard spatial queries as algebraic expressions. We also present a proof-of-concept prototype that supports a subset of the operators and show that it is at least two orders of magnitude faster than a CPU-based implementation. This performance gain is obtained both using a discrete Nvidia mobile GPU and the less powerful integrated GPUs common in commodity laptops.

preprint · 2020 · arXiv

BugDoc: Algorithms to Debug Computational Processes

Data analysis for scientific experiments and enterprises, large-scale simulations, and machine learning tasks all entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous outputs, the pipeline may fail to execute or produce incorrect results. Inferring the root cause(s) of such failures is challenging, usually requiring time and much human thought, while still being error-prone. We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our experimental data and processing software are available for use, reproducibility, and enhancement.
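The general idea of combining provenance (records of past runs) with iteration (new runs chosen to test a hypothesis) can be sketched as follows. This is a deliberately simplified illustration and not the BugDoc algorithms, which the abstract does not describe. It assumes that a failure is caused by a single parameter value and that at least one passing run is on record. The names `suspect_pairs`, `confirm_suspects` and `toy_pipeline` are hypothetical.

```python
def suspect_pairs(provenance):
    """Parameter-value pairs seen in some failing run and in no passing run."""
    in_fail, in_pass = set(), set()
    for config, ok in provenance:
        (in_pass if ok else in_fail).update(config.items())
    return in_fail - in_pass


def confirm_suspects(run_pipeline, provenance):
    """Test each suspect by swapping it into a known passing configuration.

    Returns the confirmed pairs and the provenance extended with the new runs.
    """
    provenance = list(provenance)
    good = next(cfg for cfg, ok in provenance if ok)  # assumes one passing run
    confirmed = set()
    for param, value in sorted(suspect_pairs(provenance), key=repr):
        trial = {**good, param: value}  # change exactly one parameter
        ok = run_pipeline(trial)
        provenance.append((trial, ok))
        if not ok:
            confirmed.add((param, value))
    return confirmed, provenance


def toy_pipeline(config):
    """Hypothetical pipeline that fails only with library version 2.0."""
    return config["lib"] != "2.0"


history = [
    ({"lib": "1.0", "scaler": "minmax"}, True),
    ({"lib": "2.0", "scaler": "standard"}, False),
]
causes, history = confirm_suspects(toy_pipeline, history)
print(causes)  # {('lib', '2.0')}
```

Both values from the failing run start out as suspects. The two extra runs clear `scaler="standard"` and confirm `lib="2.0"`. Failures caused by a combination of parameter values would need a richer search than this sketch performs.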

preprint · 2020 · arXiv

Debugging Machine Learning Pipelines

Machine learning tasks entail the use of complex computational pipelines to reach quantitative and qualitative conclusions. If some of the activities in a pipeline produce erroneous or uninformative outputs, the pipeline may fail or produce incorrect results. Inferring the root cause of failures and unexpected behavior is challenging, usually requiring much human thought, and is both time-consuming and error-prone. We propose a new approach that makes use of iteration and provenance to automatically infer the root causes and derive succinct explanations of failures. Through a detailed experimental evaluation, we assess the cost, precision, and recall of our approach compared to the state of the art. Our source code and experimental data will be available for reproducibility and enhancement.

preprint · 2020 · arXiv

Effective Discovery of Meaningful Outlier Relationships

We propose PODS (Predictable Outliers in Data-trendS), a method that, given a collection of temporal data sets, derives data-driven explanations for outliers by identifying meaningful relationships between them. First, we formalize the notion of meaningfulness, which so far has been informally framed in terms of explainability. Next, since outliers are rare and it is difficult to determine whether their relationships are meaningful, we develop a new criterion that does so by checking if these relationships could have been predicted from non-outliers, i.e., if we could see the outlier relationships coming. Finally, searching for meaningful outlier relationships between every pair of data sets in a large data collection is computationally infeasible. To address that, we propose an indexing strategy that prunes irrelevant comparisons across data sets, making the approach scalable. We present the results of an experimental evaluation using real data sets and different baselines, which demonstrates the effectiveness, robustness, and scalability of our approach.
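The criterion described in this abstract asks whether a relationship between outliers could have been predicted from non-outliers. One way to make that intuition concrete is sketched below. It is not the PODS formalization, which the abstract does not spell out. It assumes two aligned series, a linear relationship, and outlier positions that are already known. The function names and the `tolerance` parameter are hypothetical.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def predictable_outliers(xs, ys, outlier_idx, tolerance=3.0):
    """Outlier indices whose y value the non-outlier trend would have predicted.

    The line is fitted on non-outlier points only. An outlier counts as
    predictable if its residual is within `tolerance` standard deviations
    of the non-outlier residuals.
    """
    outliers = set(outlier_idx)
    normal = [i for i in range(len(xs)) if i not in outliers]
    slope, intercept = fit_line([xs[i] for i in normal], [ys[i] for i in normal])
    residuals = [ys[i] - (slope * xs[i] + intercept) for i in normal]
    spread = pstdev(residuals) or 1e-9  # guard against a perfect fit
    return [
        i for i in sorted(outliers)
        if abs(ys[i] - (slope * xs[i] + intercept)) <= tolerance * spread
    ]


# Made-up aligned series; positions 5 and 6 are outliers in x.
xs = [1, 2, 3, 4, 5, 20, 21]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 40.0, 5.0]
print(predictable_outliers(xs, ys, outlier_idx=[5, 6]))  # [5]
```

In this toy data the spike at position 5 follows the trend learned from ordinary points, so it is flagged as predictable. The value at position 6 breaks the trend and is not.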

preprint · 2020 · arXiv

PipelineProfiler: A Visual Analytics Tool for the Exploration of AutoML Pipelines

In recent years, a wide variety of automated machine learning (AutoML) methods have been proposed to search and generate end-to-end learning pipelines. While these techniques facilitate the creation of models for real-world applications, given their black-box nature, the complexity of the underlying algorithms, and the large number of pipelines they derive, it is difficult for their developers to debug these systems. It is also challenging for machine learning experts to select an AutoML system that is well suited for a given problem or class of problems. In this paper, we present the PipelineProfiler, an interactive visualization tool that allows the exploration and comparison of the solution space of machine learning (ML) pipelines produced by AutoML systems. PipelineProfiler is integrated with Jupyter Notebook and can be used together with common data science tools to enable a rich set of analyses of the ML pipelines and provide insights about the algorithms that generated them. We demonstrate the utility of our tool through several use cases where PipelineProfiler is used to better understand and improve a real-world AutoML system. Furthermore, we validate our approach by presenting a detailed analysis of a think-aloud experiment with six data scientists who develop and evaluate AutoML tools.

preprint · 2020 · arXiv

Towards Evaluating Exploratory Model Building Process with AutoML Systems

The use of Automated Machine Learning (AutoML) systems is highly open-ended and exploratory. While rigorously evaluating how end-users interact with AutoML is crucial, establishing a robust evaluation methodology for such exploratory systems is challenging. First, AutoML is complex, including multiple sub-components that support a variety of sub-tasks for synthesizing ML pipelines, such as data preparation, problem specification, and model generation, making it difficult to yield insights that tell us which components were successful or not. Second, because the usage pattern of AutoML is highly exploratory, it is not possible to rely solely on widely used task efficiency and effectiveness metrics as success metrics. To tackle the challenges in evaluation, we propose an evaluation methodology that (1) guides AutoML builders to divide their AutoML system into multiple sub-system components, and (2) helps them reason about each component through visualization of end-users' behavioral patterns and attitudinal data. We conducted a study to understand when, how, and why applying our methodology can help builders to better understand their systems and end-users. We recruited 3 teams of professional AutoML builders. The teams prepared their own systems and let 41 end-users use the systems. Using our methodology, we visualized end-users' behavioral and attitudinal data and distributed the results to the teams. We analyzed the results in two directions: (1) what types of novel insights the AutoML builders learned from end-users, and (2) how the evaluation methodology helped the builders to understand workflows and the effectiveness of their systems. Our findings suggest new insights explaining future design opportunities in the AutoML domain as well as how using our methodology helped the builders to determine insights and let them draw concrete directions for improving their systems.