Source author record

Ricardo Baeza-Yates

Ricardo Baeza-Yates appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Social and Information Networks, Human-Computer Interaction, cs.CY, Artificial Intelligence, Information Retrieval, physics.soc-ph, Computer Vision, Computation and Language

Catalog footprint

What is connected

15 works

8 topics

4 close collaborators

preprint, 2026, arXiv

Improving Model Safety by Targeted Error Correction

The widespread adoption of machine learning in critical applications demands techniques to mitigate high-consequence errors. Our method utilizes a dual-classifier GBDT pipeline to distinguish routine human-like errors from high-risk non-human misclassifications. We evaluate the framework across three domains: animal breed classification, skin lesion diagnosis (ISIC 2018), and prostate histopathology (SICAPv2). In all three it demonstrates robust safety improvements. To address real-world deployment concerns, our results confirm the pipeline introduces negligible inference latency (1.60% overhead for the animal dataset, 1.84% for ISIC, and 1.70% for SICAPv2) while outperforming traditional Maximum Class Probability (MCP) baselines in correction precision. Our conservative correction strategy reduced dangerous non-human errors by 34.1% in ISIC and 12.57% in SICAPv2, improving super-class diagnostic safety to 90.41% and 92.13% respectively. This shows that safety-critical reliability can be substantially enhanced post-hoc without expensive model retraining.

Keywords: Error Analysis, Post-hoc Correction, Trustworthy AI.
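For readers unfamiliar with the Maximum Class Probability (MCP) baseline named in this abstract, the following is a minimal sketch of the general idea: the highest softmax probability is used as a confidence score, and low-confidence predictions are flagged for review. The function names, the example logits, and the 0.7 threshold are illustrative assumptions; this is not the paper's GBDT pipeline.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_class_probability(logits):
    """MCP confidence: the largest softmax probability."""
    return max(softmax(logits))

def needs_review(logits, threshold=0.7):
    """Flag a prediction whose MCP falls below a (hypothetical) threshold."""
    return max_class_probability(logits) < threshold

# Hypothetical logits for a three-class problem.
example_logits = [2.0, 1.0, 0.1]
confidence = max_class_probability(example_logits)
flagged = needs_review(example_logits)
```

With these assumed logits the confidence is roughly 0.66, so the prediction would be flagged under the assumed threshold.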

preprint, 2022, arXiv

Bots don't Vote, but They Surely Bother! A Study of Anomalous Accounts in a National Referendum

The Web contains several social media platforms for discussion, exchange of ideas, and content publishing. These platforms are used by people, but also by distributed agents known as bots. Although bots have existed for decades, with many of them being benevolent, their influence in propagating and generating deceptive information has increased in recent years. Here we present a characterization of the discussion on Twitter about the 2020 Chilean constitutional referendum. The characterization uses a profile-oriented analysis that enables the isolation of anomalous content using machine learning. As a result, we obtain a characterization that matches national vote turnout, and we measure how anomalous accounts (some of which are automated bots) produce content and interact to promote (false) information.

preprint, 2020, arXiv

Every Colour You Are: Stance Prediction and Turnaround in Controversial Issues

Web platforms have allowed political manifestation and debate for decades. Technology changes have brought new opportunities for expression, and the availability of longitudinal data on these debates raises new questions regarding who participates and who updates their opinion. The aim of this work is to provide a methodology to measure these phenomena, and to test this methodology on a specific topic, abortion, as observed on one of the most popular micro-blogging platforms. To do so, we followed the discussion on Twitter about abortion in two Spanish-speaking countries from 2015 to 2018. Our main insights are twofold. On the one hand, people adopted new technologies to express their stances, particularly colored variations of heart emojis ([green heart] & [purple heart]) in a way that mirrored physical manifestations on abortion. On the other hand, even on issues with strong opinions, opinions can change, and these changes differ across demographic groups. These findings imply that debate on the Web embraces new ways of stance adherence, and that changes of opinion can be measured and characterized.

preprint, 2019, arXiv

Predicting risk of dyslexia with an online gamified test

Dyslexia is a specific learning disorder related to school failure. Detection is both crucial and challenging, especially in languages with transparent orthographies, such as Spanish. To make detecting dyslexia easier, we designed an online gamified test and a predictive machine learning model. In a study with more than 3,600 participants, our model correctly detected over 80% of the participants with dyslexia. To check the robustness of the method, we tested it on a new data set with over 1,300 participants, using age-customized tests in a different environment -- a tablet instead of a desktop computer -- and reached a recall of over 72% for the class with dyslexia among children 9 years old or older. Our work shows that dyslexia can be screened using a machine learning approach. An online screening tool based on our methods has already been used by more than 200,000 people.

preprint, 2016, arXiv

Data Portraits and Intermediary Topics: Encouraging Exploration of Politically Diverse Profiles

In micro-blogging platforms, people connect and interact with others. However, due to cognitive biases, they tend to interact with like-minded people and read agreeable information only. Many efforts to make people connect with those who think differently have not worked well. In this paper, we hypothesize, first, that previous approaches have not worked because they have been direct -- they have tried to explicitly connect people with those having opposing views on sensitive issues. Second, that neither recommendation nor presentation of information is enough by itself to encourage behavioral change. We propose a platform that mixes a recommender algorithm and a visualization-based user interface to explore recommendations. It recommends politically diverse profiles in terms of distance of latent topics, and displays those recommendations in a visual representation of each user's personal content. We performed an "in the wild" evaluation of this platform, and found that people explored more recommendations when using a biased algorithm instead of ours. In line with our hypothesis, we also found that the mixture of our recommender algorithm and our user interface allowed politically interested users to exhibit an unbiased exploration of the recommended profiles. Finally, our results contribute insights in two aspects: first, which individual differences are important when designing platforms aimed at behavioral change; and second, which algorithms and user interfaces should be mixed to help users avoid cognitive mechanisms that lead to biased behavior.

preprint, 2016, arXiv

Scalable Semantic Matching of Queries to Ads in Sponsored Search Advertising

Sponsored search represents a major source of revenue for web search engines. This popular advertising model brings a unique possibility for advertisers to target users' immediate intent communicated through a search query, usually by displaying their ads alongside organic search results for queries deemed relevant to their products or services. However, due to a large number of unique queries it is challenging for advertisers to identify all such relevant queries. For this reason search engines often provide a service of advanced matching, which automatically finds additional relevant queries for advertisers to bid on. We present a novel advanced matching approach based on the idea of semantic embeddings of queries and ads. The embeddings were learned using a large data set of user search sessions, consisting of search queries, clicked ads and search links, while utilizing contextual information such as dwell time and skipped ads. To address the large-scale nature of our problem, both in terms of data and vocabulary size, we propose a novel distributed algorithm for training of the embeddings. Finally, we present an approach for overcoming a cold-start problem associated with new ads and queries. We report results of editorial evaluation and online tests on actual search traffic. The results show that our approach significantly outperforms baselines in terms of relevance, coverage, and incremental revenue. Lastly, we open-source learned query embeddings to be used by researchers in computational advertising and related fields.

preprint, 2016, arXiv

Sentiment Visualisation Widgets for Exploratory Search

This paper proposes the usage of visualisation widgets for exploratory search with sentiment as a facet. Starting from specific design goals for depiction of ambivalence in sentiment, two visualization widgets were implemented: scatter plot and parallel coordinates. Those widgets were evaluated against a text baseline in a small-scale usability study with exploratory tasks using Wikipedia as dataset. The study results indicate that users spend more time browsing with scatter plots in a positive way. A post-hoc analysis of individual differences in behavior revealed that when considering two types of users, explorers and achievers, engagement with scatter plots is positive and significantly greater when users are explorers. We discuss the implications of these findings for sentiment-based exploratory search and personalised user interfaces.

preprint, 2016, arXiv

Visual Congruent Ads for Image Search

The quality of user experience online is affected by the relevance and placement of advertisements. We propose a new system for selecting and displaying visual advertisements in image search result sets. Our method compares the visual similarity of candidate ads to the image search results and selects the most visually similar ad to be displayed. The method further selects an appropriate location in the displayed image grid to minimize the perceptual visual differences between the ad and its neighbors. We conduct an experiment with about 900 users and find that our proposed method provides significant improvement in the users' overall satisfaction with the image search experience, without diminishing the users' ability to see the ad or recall the advertised brand.
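The selection step described in this abstract (choose the candidate ad most visually similar to the search results) can be illustrated with a small sketch. It assumes each image is already represented by a numeric feature vector and uses cosine similarity against the centroid of the result vectors. The feature representation, the similarity measure, and all names below are assumptions for illustration; the abstract does not specify them.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def pick_most_similar_ad(result_vectors, ad_vectors):
    """Return the name of the ad whose vector is closest to the centroid of the results."""
    centroid = [sum(column) / len(result_vectors) for column in zip(*result_vectors)]
    return max(ad_vectors, key=lambda name: cosine_similarity(ad_vectors[name], centroid))

# Hypothetical 3-dimensional visual features for two search results and two ads.
results = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
ads = {"red_shoes": [1.0, 0.1, 0.0], "blue_car": [0.0, 0.0, 1.0]}
chosen = pick_most_similar_ad(results, ads)
```

Here the hypothetical "red_shoes" ad is chosen because its vector points in nearly the same direction as the result centroid. The placement step (choosing a grid position that minimizes perceptual differences with neighbors) is not covered by this sketch.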

preprint, 2016, arXiv

Wisdom of the Crowd or Wisdom of a Few? An Analysis of Users' Content Generation

In this paper we analyze how user generated content (UGC) is created, challenging the well known wisdom of crowds concept. Although it is known that user activity in most settings follows a power law, that is, few people do a lot, while most do nothing, there are few studies that characterize this activity well. In our analysis of datasets from two different social networks, Facebook and Twitter, we find that a small percentage of active users, and a much smaller percentage of all users, represent 50% of the UGC. We also analyze the dynamic behavior of the generation of this content to find that the set of most active users is quite stable in time. Moreover, we study the social graph, finding that those active users are highly connected among themselves. This implies that most of the wisdom comes from a few users, challenging the independence assumption needed to have a wisdom of crowds. We also address the content that is never seen by anyone, which we call digital desert, and which challenges the assumption that the content of every person should be taken into account in a collective decision. We also compare our results with Wikipedia data and we address the quality of UGC using an Amazon dataset. In the end our results are not surprising, as the Web is a reflection of our own society, where economic or political power is also in the hands of minorities.
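The headline measurement in this abstract (what share of users accounts for half of the content) can be computed as in the following sketch. The function name and the activity counts are hypothetical, chosen only to show the arithmetic; they are not data from the paper.

```python
def share_of_users_for_content(posts_per_user, target=0.5):
    """Smallest fraction of users, most active first, who produce at least `target` of all content."""
    counts = sorted(posts_per_user, reverse=True)
    total = sum(counts)
    if total == 0:
        raise ValueError("no content to analyze")
    running = 0
    for rank, count in enumerate(counts, start=1):
        running += count
        if running >= target * total:
            return rank / len(counts)
    return 1.0

# Hypothetical activity: ten users, one of whom never posts.
activity = [50, 20, 10, 5, 5, 4, 3, 2, 1, 0]
share_all_users = share_of_users_for_content(activity)

# The same measurement restricted to active users (at least one post).
active_only = [count for count in activity if count > 0]
share_active_users = share_of_users_for_content(active_only)
```

In this toy data a single user out of ten produces half of the content, so the share is 10% of all users and about 11% of active users, which mirrors the distinction the abstract draws between the two denominators.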

preprint, 2015, arXiv

Encouraging Diversity- and Representation-Awareness in Geographically Centralized Content

In centralized countries, not only are population, media and economic power concentrated, but people also give more attention to central locations. While this is not inherently bad, this behavior extends to micro-blogging platforms: central locations get more attention in terms of information flow. In this paper we study the effects of an information filtering algorithm that decentralizes content in such platforms. In particular, we find that users from non-central locations were not able to identify the geographical diversity on timelines generated by the algorithm, which were diverse by construction. To make users see the inherent diversity, we define a design rationale to approach this problem, focused on an already known visualization technique: treemaps. Using interaction data from an "in the wild" deployment of our proposed system, we find that, even though there are effects of centralization in exploratory user behavior, the treemap was able to make users see the inherent geographical diversity of timelines, and engage with user generated content. With these results in mind, we propose practical actions for micro-blogging platforms to account for the differences and biased behavior induced by centralization.

preprint, 2015, arXiv

Finding Intermediary Topics Between People of Opposing Views: A Case Study

In micro-blogging platforms, people can connect with others and have conversations on a wide variety of topics. However, because of homophily and selective exposure, users tend to connect with like-minded people and only read agreeable information. Motivated by this scenario, in this paper we study the diversity of intermediary topics, which are latent topics estimated from user generated content. These topics can be used as features in recommender systems aimed at introducing people of diverse political viewpoints. We conducted a case study on Twitter, considering the debate about a sensitive issue in Chile, where we quantified homophilic behavior in terms of political discussion and then we evaluated the diversity of intermediary topics in terms of political stances of users.

preprint, 2013, arXiv

Caracterizando la Web Chilena

This article presents a characterization of the web space of Chile in 2007. The characterization shows distributions of sites and domains, a characterization of content based on document types, and server configuration. In addition, the network structure created by hyperlinks in the documents of the Chilean Web is studied, including how the different components of this structure vary when hyperlinks are aggregated at the site level. (Translated from the original Spanish abstract.)

preprint, 2013, arXiv

Evolution of the Chilean Web: A Larger Study

In this paper we extend our previous and only study on the dynamics of the Chilean Web. This new study doubles the time period and to the best of our knowledge is the only study of its type known about any country in the Web. The new results corroborate the trends found before, in particular the exponential growth of the Web, and reinforce the conclusion that the Web is more chaotic than we would like. Hence, modeling most Web characteristics is not trivial.

preprint, 2012, arXiv

Learning to Rank Query Recommendations by Semantic Similarities

Logs of the interactions with a search engine show that users often reformulate their queries. Examining these reformulations shows that recommendations that sharpen the focus of a query are helpful, like those based on expansions of the original queries. But it also shows that queries that express some topical shift with respect to the original query can help users access the information they need more rapidly. We propose a method to identify, from the query logs of past users, queries that either focus or shift the initial query topic. This method combines various click-based, topic-based and session-based ranking strategies and uses supervised learning in order to maximize the semantic similarities between the query and the recommendations, while at the same time diversifying them. We evaluate our method using the query/click logs of a Japanese web search engine and we show that the combination of the three methods proposed is significantly better than any of them taken individually.

preprint, 2010, arXiv

Capacity Planning for Vertical Search Engines

Vertical search engines focus on specific slices of content, such as the Web of a single country or the document collection of a large corporation. Despite this, like general open web search engines, they are expensive to maintain, expensive to operate, and hard to design. Because of this, predicting the response time of a vertical search engine is usually done empirically through experimentation, requiring a costly setup. An alternative is to develop a model of the search engine for predicting performance. However, this alternative is of interest only if its predictions are accurate. In this paper we propose a methodology for analyzing the performance of vertical search engines. Applying the proposed methodology, we present a capacity planning model based on a queueing network for search engines with a scale typically suitable for the needs of large corporations. The model is simple and yet reasonably accurate and, in contrast to previous work, considers the imbalance in query service times among homogeneous index servers. We discuss how we tune up the model and how we apply it to predict the impact on the query response time when parameters such as CPU and disk capacities are changed. This allows a manager of a vertical search engine to determine a priori whether a new configuration of the system might keep the query response under specified performance constraints.
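To give a feel for the kind of question such a capacity planning model answers, here is a deliberately simplified sketch. It treats each index server as an independent M/M/1 queue, which is a textbook simplification and not the queueing network proposed in the paper, and reports the mean response time of the slowest server. Because a query must wait for every index server, this value is only an optimistic lower bound on the mean query response time. All rates and function names are hypothetical.

```python
def mm1_mean_response_time(arrival_rate, service_rate):
    """Mean response time of an M/M/1 queue: 1 / (mu - lambda), valid only when lambda < mu."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)

def slowest_server_response_time(arrival_rate, service_rates):
    """Largest per-server mean response time when every query visits every index server."""
    return max(mm1_mean_response_time(arrival_rate, mu) for mu in service_rates)

# Hypothetical setup: 50 queries/s broadcast to three index servers with imbalanced service rates.
query_rate = 50.0
imbalanced = [100.0, 90.0, 75.0]   # queries/s each server can process
balanced = [100.0, 100.0, 100.0]   # after a hypothetical hardware upgrade

before = slowest_server_response_time(query_rate, imbalanced)
after = slowest_server_response_time(query_rate, balanced)
```

Under these assumed rates the slowest server's mean response time drops from about 40 ms to about 20 ms, which illustrates why an imbalance in service times among index servers matters when predicting whether a configuration stays within a response time constraint.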
