Source author record

Bruno Ribeiro

Bruno Ribeiro appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Catalog footprint

What is connected

33 works

20 topics

4 close collaborators

preprint · 2026 · arXiv

Bridging Input Feature Spaces Towards Graph Foundation Models

Unlike vision and language domains, graph learning lacks a shared input space, as input features differ across graph datasets not only in semantics, but also in value ranges and dimensionality. This misalignment prevents graph models from generalizing across datasets, limiting their use as foundation models. In this work, we propose ALL-IN, a simple and theoretically grounded method that enables transferability across datasets with different input features. Our approach projects node features into a shared random space and constructs representations via covariance-based statistics, thus eliminating dependence on the original feature space. We show that the computed node-covariance operators and the resulting node representations are invariant in distribution to permutations of the input features. We further demonstrate that the expected operator exhibits invariance to general orthogonal transformations of the input features. Empirically, ALL-IN achieves strong performance across diverse node- and graph-level tasks on unseen datasets with new input features, without requiring architecture changes or retraining. These results point to a promising direction for input-agnostic, transferable graph models.

preprint · 2026 · arXiv

Imitative Membership Inference Attack

A Membership Inference Attack (MIA) assesses how much a target machine learning model reveals about its training data by determining whether specific query instances were part of the training set. State-of-the-art MIAs rely on training hundreds of shadow models that are independent of the target model, leading to significant computational overhead. In this paper, we introduce Imitative Membership Inference Attack (IMIA), which employs a novel imitative training technique to strategically construct a small number of target-informed imitative models that closely replicate the target model's behavior for inference. Extensive experimental results demonstrate that IMIA substantially outperforms existing MIAs in various attack settings while only requiring less than 5% of the computational cost of state-of-the-art approaches.

preprint · 2022 · arXiv

Bias Challenges in Counterfactual Data Augmentation

Deep learning models tend not to be out-of-distribution robust primarily due to their reliance on spurious features to solve the task. Counterfactual data augmentations provide a general way of (approximately) achieving representations that are counterfactual-invariant to spurious features, a requirement for out-of-distribution (OOD) robustness. In this work, we show that counterfactual data augmentations may not achieve the desired counterfactual-invariance if the augmentation is performed by a context-guessing machine, an abstract machine that guesses the most-likely context of a given input. We theoretically analyze the invariance imposed by such counterfactual data augmentations and describe an exemplar NLP task where counterfactual data augmentation by a context-guessing machine does not lead to robust OOD classifiers.

preprint · 2022 · arXiv

Veritas: Answering Causal Queries from Video Streaming Traces

In this paper, we seek to answer what-if questions - i.e., given recorded data of an existing deployed networked system, what would be the performance impact if we changed the design of the system (a task also known as causal inference). We make three contributions. First, we expose the complexity of causal inference in the context of adaptive bit rate video streaming, a challenging domain where the network conditions during the session act as a sequence of latent and confounding variables, and a change at any point in the session has a cascading impact on the rest of the session. Second, we present Veritas, a novel framework that tackles causal reasoning for video streaming without resorting to randomised trials. Integral to Veritas is an easy to interpret domain-specific ML model (an embedded Hidden Markov Model) that relates the latent stochastic process (intrinsic bandwidth that the video session can achieve) to actual observations (download times) while exploiting control variables such as the TCP state (e.g., congestion window) observed at the start of the download of video chunks. We show through experiments on an emulation testbed that Veritas can answer both counterfactual queries (e.g., the performance of a completed video session had it used a different buffer size) and interventional queries (e.g., estimating the download time for every possible video quality choice for the next chunk in a session in progress). In doing so, Veritas achieves accuracy close to an ideal oracle, while significantly outperforming both a commonly used baseline approach, and Fugu (an off-the-shelf neural network) neither of which account for causal effects.

preprint · 2021 · arXiv

Membership Inference Attacks and Defenses in Classification Models

We study the membership inference (MI) attack against classifiers, where the attacker's goal is to determine whether a data instance was used for training the classifier. Through systematic cataloging of existing MI attacks and extensive experimental evaluations of them, we find that a model's vulnerability to MI attacks is tightly related to the generalization gap -- the difference between training accuracy and test accuracy. We then propose a defense against MI attacks that aims to close the gap by intentionally reducing the training accuracy. More specifically, the training process attempts to match the training and validation accuracies, by means of a new set regularizer using the Maximum Mean Discrepancy between the softmax output empirical distributions of the training and validation sets. Our experimental results show that combining this approach with another simple defense (mix-up training) significantly improves state-of-the-art defense against MI attacks, with minimal impact on testing accuracy.
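As a rough illustration of the quantity this abstract names, the sketch below computes a biased estimate of the squared Maximum Mean Discrepancy between two small sets of softmax outputs. The Gaussian kernel, the bandwidth, and the sample values are assumptions made for the example; the abstract does not say which kernel or estimator the authors use, so this should not be read as their regularizer.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical softmax outputs: an overfit model is very confident on
# training points and less confident on held-out validation points.
train_probs = [[0.98, 0.01, 0.01], [0.95, 0.03, 0.02], [0.97, 0.02, 0.01]]
val_probs = [[0.60, 0.30, 0.10], [0.55, 0.25, 0.20], [0.70, 0.20, 0.10]]

# A larger value means the two output distributions are easier to tell apart.
gap_penalty = mmd_squared(train_probs, val_probs)
print(gap_penalty)
```

In a training loop, a term like `gap_penalty` would presumably be added to the loss so that training and validation outputs are pushed toward the same distribution.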

preprint · 2020 · arXiv

ALMA reveals the molecular gas properties of 5 star-forming galaxies across the main sequence at 3 < z < 3.5

We present the detection of CO(5-4) with S/N> 7 - 13 and a lower CO transition with S/N > 3 (CO(4-3) for 4 galaxies, and CO(3-2) for one) with ALMA in band 3 and 4 in five main sequence star-forming galaxies with stellar masses 3-6x10^10 M/M_sun at 3 < z < 3.5. We find a good correlation between the total far-infrared luminosity LFIR and the luminosity of the CO(5-4) transition L'CO(5-4), where L'CO(5-4) increases with SFR, indicating that CO(5-4) is a good tracer of the obscured SFR in these galaxies. The two galaxies that lie closer to the star-forming main sequence have CO SLED slopes that are comparable to other star-forming populations, such as local SMGs and BzK star-forming galaxies; the three objects with higher specific star formation rates (sSFR) have far steeper CO SLEDs, which possibly indicates a more concentrated episode of star formation. By exploiting the CO SLED slopes to extrapolate the luminosity of the CO(1-0) transition, and using a classical conversion factor for main sequence galaxies of alpha_CO = 3.8 M_sun(K km s^-1 pc^-2)^-1, we find that these galaxies are very gas rich, with molecular gas fractions between 60 and 80%, and quite long depletion times, between 0.2 and 1 Gyr. Finally, we obtain dynamical masses that are comparable with the sum of stellar and gas mass (at least for four out of five galaxies), allowing us to put a first constraint on the alpha_CO parameter for main sequence galaxies at an unprecedented redshift.

preprint · 2020 · arXiv

Deceptive Deletions for Protecting Withdrawn Posts on Social Platforms

Over-sharing poorly-worded thoughts and personal information is prevalent on online social platforms. In many of these cases, users regret posting such content. To retrospectively rectify these errors in users' sharing decisions, most platforms offer (deletion) mechanisms to withdraw the content, and social media users often utilize them. Ironically and perhaps unfortunately, these deletions make users more susceptible to privacy violations by malicious actors who specifically hunt post deletions at large scale. The reason for such hunting is simple: deleting a post acts as a powerful signal that the post might be damaging to its owner. Today, multiple archival services are already scanning social media for these deleted posts. Moreover, as we demonstrate in this work, powerful machine learning models can detect damaging deletions at scale. Towards restraining such a global adversary against users' right to be forgotten, we introduce Deceptive Deletion, a decoy mechanism that minimizes the adversarial advantage. Our mechanism injects decoy deletions, hence creating a two-player minmax game between an adversary that seeks to classify damaging content among the deleted posts and a challenger that employs decoy deletions to masquerade real damaging deletions. We formalize the Deceptive Game between the two players, determine conditions under which either the adversary or the challenger provably wins the game, and discuss the scenarios in-between these two extremes. We apply the Deceptive Deletion mechanism to a real-world task on Twitter: hiding damaging tweet deletions. We show that a powerful global adversary can be beaten by a powerful challenger, raising the bar significantly and giving a glimmer of hope in the ability to be really forgotten on social platforms.

preprint · 2020 · arXiv

Infinity Learning: Learning Markov Chains from Aggregate Steady-State Observations

We consider the task of learning a parametric Continuous Time Markov Chain (CTMC) sequence model without examples of sequences, where the training data consists entirely of aggregate steady-state statistics. Making the problem harder, we assume that the states we wish to predict are unobserved in the training data. Specifically, given a parametric model over the transition rates of a CTMC and some known transition rates, we wish to extrapolate its steady state distribution to states that are unobserved. A technical roadblock to learning a CTMC from its steady state has been that the chain rule to compute gradients will not work over the arbitrarily long sequences necessary to reach steady state, from which the aggregate statistics are sampled. To overcome this optimization challenge, we propose $\infty$-SGD, a principled stochastic gradient descent method that uses randomly-stopped estimators to avoid infinite sums required by the steady state computation, while learning even when only a subset of the CTMC states can be observed. We apply $\infty$-SGD to a real-world testbed and synthetic experiments showcasing its accuracy, ability to extrapolate the steady state distribution to unobserved states under unobserved conditions (heavy loads, when training under light loads), and succeeding in difficult scenarios where even a tailor-made extension of existing methods fails.

preprint · 2020 · arXiv

On the Equivalence between Positional Node Embeddings and Structural Graph Representations

This work provides the first unifying theoretical framework for node (positional) embeddings and structural graph representations, bridging methods like matrix factorization and graph neural networks. Using invariant theory, we show that the relationship between structural representations and node embeddings is analogous to that of a distribution and its samples. We prove that all tasks that can be performed by node embeddings can also be performed by structural representations and vice-versa. We also show that the concept of transductive and inductive learning is unrelated to node embeddings and graph representations, clearing another source of confusion in the literature. Finally, we introduce new practical guidelines for generating and using node embeddings, which fix significant shortcomings of standard operating procedures used today.

preprint · 2020 · arXiv

Random Spiking and Systematic Evaluation of Defenses Against Adversarial Examples

Image classifiers often suffer from adversarial examples, which are generated by strategically adding a small amount of noise to input images to trick classifiers into misclassification. Over the years, many defense mechanisms have been proposed, and different researchers have made seemingly contradictory claims on their effectiveness. We present an analysis of possible adversarial models, and propose an evaluation framework for comparing different defense mechanisms. As part of the framework, we introduce a more powerful and realistic adversary strategy. Furthermore, we propose a new defense mechanism called Random Spiking (RS), which generalizes dropout and introduces random noise in the training process in a controlled manner. Evaluations under our proposed framework suggest RS delivers better protection against adversarial examples than many existing schemes.

preprint · 2020 · arXiv

The evolution of rest-frame UV properties, Lya EWs and the SFR-Stellar mass relation at z~2-6 for SC4K LAEs

We explore deep rest-frame UV to FIR data in the COSMOS field to measure the individual spectral energy distributions (SED) of the ~4000 SC4K (Sobral et al. 2018) Lyman-alpha (Lya) emitters (LAEs) at z~2-6. We find typical stellar masses of 10$^{9.3\pm0.6}$ M$_{\odot}$ and star formation rates (SFR) of SFR$_{SED}=4.4^{+10.5}_{-2.4}$ M$_{\odot}$/yr and SFR$_{Lya}=5.9^{+6.3}_{-2.6}$ M$_{\odot}$/yr, combined with very blue UV slopes of beta=-2.1$^{+0.5}_{-0.4}$, but with significant variations within the population. M$_{UV}$ and beta are correlated in a similar way to UV-selected sources, but LAEs are consistently bluer. This suggests that LAEs are the youngest and/or most dust-poor subset of the UV-selected population. We also study the Lya rest-frame equivalent width (EW$_0$) and find 45 "extreme" LAEs with EW$_0>240$ A (3 $σ$), implying a low number density of $(7\pm1)\times10^{-7}$ Mpc$^{-3}$. Overall, we measure little to no evolution of the Lya EW$_0$ and scale length parameter ($w_0$) which are consistently high (EW$_0=140^{+280}_{-70}$ A, $w_0=129^{+11}_{-11}$ A) from z~6 to z~2 and below. However, $w_0$ is anti-correlated with M$_{UV}$ and stellar mass. Our results imply that sources selected as LAEs have a high Lya escape fraction (f$_{esc, Lya}$) irrespective of cosmic time, but f$_{esc, Lya}$ is still higher for UV-fainter and lower mass LAEs. The least massive LAEs ($<10^{9.5}$ M$_{\odot}$) are typically located above the star formation "Main Sequence" (MS), but the offset from the MS decreases towards z~6 and towards $10^{10}$ M$_{\odot}$. Our results imply a lack of evolution in the properties of LAEs across time and reveal the increasing overlap in properties of LAEs and UV-continuum selected galaxies as typical star-forming galaxies at high redshift effectively become LAEs.

preprint · 2020 · arXiv

Towards Studying Hierarchical Assembly in Real Time: A Milky Way Progenitor Galaxy at z = 2.36 under the Microscope

We use Hubble Space Telescope (HST) imaging and near-infrared spectroscopy from Keck/MOSFIRE to study the sub-structure around the progenitor of a Milky Way-mass galaxy in the Hubble Frontier Fields (HFF). Specifically, we study an $r_e = 40^{+70}_{-30}$ pc, $M_{\star} \sim 10^{8.2} M_{\odot}$ rest-frame ultra-violet luminous "clump" at a projected distance of $\sim$100 pc from a $M_{\star} \sim 10^{9.8}$ M$_{\odot}$ galaxy at $z = 2.36$ with a magnification $μ= 5.21$. We measure the star formation history of the clump and galaxy by jointly modeling the broadband spectral energy distribution from HST photometry and H$α$ from MOSFIRE spectroscopy. Given our inferred properties (e.g., mass, metallicity, dust) of the clump and galaxy, we explore scenarios in which the clump formed in-situ (e.g., a star forming complex) or ex-situ (e.g., a dwarf galaxy being accreted). If it formed in-situ, we conclude that the clump is likely a single entity as opposed to an aggregation of smaller star clusters, making it one of the densest star clusters cataloged. If it formed ex-situ, then we are witnessing an accretion event with a 1:40 stellar mass ratio. However, our data alone are not informative enough to distinguish between in-situ and ex-situ scenarios to a high level of significance. We posit that the addition of high-fidelity metallicity information, such as [OIII]4363Å, which can be detected at modest S/N with only a few hours of JWST/NIRSpec time, may be a powerful discriminant. We suggest that studying larger samples of moderately lensed sub-structures across cosmic time can provide unique insight into the hierarchical formation of galaxies like the Milky Way.

preprint · 2019 · arXiv

VIS3COS: III. environmental effects on the star formation histories of galaxies at z~0.8 seen in [OII], H$δ$, and Dn4000

[ABRIDGED] We present spectroscopic observations of 466 galaxies in and around a superstructure at $z\sim0.84$ targeted by the VIMOS Spectroscopic Survey of a Supercluster in the COSMOS field (VIS$^{3}$COS). We use [OII]$λ$3727, H$δ$, and $D_n4000$ to trace the recent, mid-, and long-term star formation histories and investigate how stellar mass and the local environment impact them. By studying trends both in individual and composite galaxy spectra, we find that both stellar mass and environment play a role in the observed galactic properties. We find that the median [OII] equivalent width (|EW$_\mathrm{[OII]}$|) decreases from $27\pm2$ Å to $2.0_{-0.4}^{+0.5}$ Å and $D_n4000$ increases from $1.09\pm0.01$ to $1.56\pm0.03$ with increasing stellar mass (from $\sim10^{9.25}$ to $\sim10^{11.35}\ \mathrm{M_\odot}$). Concerning the dependence on the environment, we find that at fixed stellar mass |EW$_\mathrm{[OII]}$| is tentatively lower in higher density environments. Regarding $D_n4000$, we find that the increase with stellar mass is sharper in denser environments, hinting that such environments may accelerate galaxy evolution. Moreover, we find larger $D_n4000$ values in denser environments at fixed stellar mass, suggesting that galaxies are on average older and/or more metal-rich in such dense environments. This set of tracers depicts a scenario where the most massive galaxies have, on average, the lowest sSFRs and the oldest stellar populations (age $\gtrsim1$ Gyr, showing a mass-downsizing effect). We also hypothesize that the observed increase in star formation (higher |EW$_\mathrm{[OII]}$|, higher sSFR) at intermediate densities may lead to quenching since we find the quenched fraction to increase sharply from the filament to cluster-like regions at similar stellar masses.

preprint · 2016 · arXiv

TribeFlow: Mining & Predicting User Trajectories

Which song will Smith listen to next? Which restaurant will Alice go to tomorrow? Which product will John click next? These applications have in common the prediction of user trajectories that are in a constant state of flux over a hidden network (e.g. website links, geographic location). What users are doing now may be unrelated to what they will be doing in an hour from now. Mindful of these challenges we propose TribeFlow, a method designed to cope with the complex challenges of learning personalized predictive models of non-stationary, transient, and time-heterogeneous user trajectories. TribeFlow is a general method that can perform next product recommendation, next song recommendation, next location prediction, and general arbitrary-length user trajectory prediction without domain-specific knowledge. TribeFlow is more accurate and up to 413x faster than top competitors.

preprint · 2015 · arXiv

Bayesian Inference of Online Social Network Statistics via Lightweight Random Walk Crawls

Online social networks (OSNs) contain an extensive amount of information about the underlying society that is yet to be explored. One of the most feasible techniques to fetch information from OSNs, crawling through Application Programming Interface (API) requests, poses serious concerns over the guarantees of the estimates. In this work, we focus on making reliable statistical inference with limited API crawls. Based on regenerative properties of the random walks, we propose an unbiased estimator for the aggregated sum of functions over edges and prove the connection between the variance of the estimator and the spectral gap. In order to facilitate Bayesian inference on the true value of the estimator, we derive the approximate posterior distribution of the estimate. The proposed ideas are then validated with numerical experiments on inference problems in real-world networks.
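The regenerative idea mentioned in this abstract can be illustrated with a small sketch. It relies on a standard fact about simple random walks on connected undirected graphs: the expected sum of an edge function over one tour (a walk that starts at a node and stops on first return to it) equals the sum of that function over all directed edges divided by the start node's degree. The toy graph, the function name `tour_estimate`, and the choice of edge function are assumptions for the example; the abstract does not give the authors' exact estimator or its Bayesian posterior.

```python
import random


def tour_estimate(adj, start, edge_fn, num_tours, seed=0):
    """Estimate the sum of edge_fn over all directed edges from random-walk tours.

    Each tour starts at `start` and ends at the first return to `start`.
    Scaling the average per-tour sum by deg(start) gives an unbiased estimate.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_tours):
        node = start
        while True:
            nxt = rng.choice(adj[node])
            total += edge_fn(node, nxt)
            node = nxt
            if node == start:
                break
    return len(adj[start]) * total / num_tours


# Hypothetical toy graph with 6 undirected edges (adjacency lists).
adj = {
    0: [1, 2, 4],
    1: [0, 2],
    2: [0, 1, 3],
    3: [2, 4],
    4: [3, 0],
}

# With edge_fn = 1 the directed-edge sum is 2|E|, so halving estimates |E|.
num_edges_estimate = tour_estimate(adj, 0, lambda u, v: 1.0, 20000, seed=1) / 2
print(num_edges_estimate)
```

Because tours are independent and identically distributed, their per-tour sums can be averaged and given a variance estimate, which is presumably what makes the approach convenient for the limited-crawl setting the abstract describes.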

preprint · 2014 · arXiv

Classifying Latent Infection States in Complex Networks

Algorithms for identifying the infection states of nodes in a network are crucial for understanding and containing infections. Often, however, only a relatively small set of nodes have a known infection state. Moreover, the length of time that each node has been infected is also unknown. This missing data -- infection state of most nodes and infection time of the unobserved infected nodes -- poses a challenge to the study of real-world cascades. In this work, we develop techniques to identify the latent infected nodes in the presence of missing infection time-and-state data. Based on the likely epidemic paths predicted by the simple susceptible-infected epidemic model, we propose a measure (Infection Betweenness) for uncovering these unknown infection states. Our experimental results using machine learning algorithms show that Infection Betweenness is the most effective feature for identifying latent infected nodes.

preprint · 2014 · arXiv

Efficient Network Generation Under General Preferential Attachment

Preferential attachment (PA) models of network structure are widely used due to their explanatory power and conceptual simplicity. PA models are able to account for the scale-free degree distributions observed in many real-world large networks through the remarkably simple mechanism of sequentially introducing nodes that attach preferentially to high-degree nodes. The ability to efficiently generate instances from PA models is a key asset in understanding both the models themselves and the real networks that they represent. Surprisingly, little attention has been paid to the problem of efficient instance generation. In this paper, we show that the complexity of generating network instances from a PA model depends on the preference function of the model, provide efficient data structures that work under any preference function, and present empirical results from an implementation based on these data structures. We demonstrate that, by indexing growing networks with a simple augmented heap, we can implement a network generator which scales many orders of magnitude beyond existing capabilities ($10^6$ -- $10^8$ nodes). We show the utility of an efficient and general PA network generator by investigating the consequences of varying the preference functions of an existing model. We also provide "quicknet", a freely-available open-source implementation of the methods described in this work.
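To make the generation problem concrete, here is a minimal sketch of a PA generator that accepts an arbitrary positive preference function. It swaps in a Fenwick (binary indexed) tree for weighted sampling, which is not the augmented heap described in the abstract and is not the "quicknet" implementation; the class and function names are hypothetical. Each new node attaches with one edge to an existing node chosen with probability proportional to `preference(degree)`.

```python
import random


class FenwickSampler:
    """Weighted sampling with O(log n) updates and draws (Fenwick tree)."""

    def __init__(self, capacity):
        self.n = capacity
        self.tree = [0.0] * (capacity + 1)
        self.weights = [0.0] * capacity
        self.total = 0.0

    def set_weight(self, i, w):
        delta = w - self.weights[i]
        self.weights[i] = w
        self.total += delta
        j = i + 1
        while j <= self.n:
            self.tree[j] += delta
            j += j & -j

    def sample(self, rng):
        """Return index i with probability weights[i] / total (total must be > 0)."""
        while True:
            target = rng.random() * self.total
            pos, step = 0, 1 << self.n.bit_length()
            while step:
                nxt = pos + step
                if nxt <= self.n and self.tree[nxt] <= target:
                    target -= self.tree[nxt]
                    pos = nxt
                step >>= 1
            # Guard against floating-point drift landing on a zero-weight slot.
            if pos < self.n and self.weights[pos] > 0:
                return pos


def generate_pa_graph(num_nodes, preference=lambda d: d + 1, seed=0):
    """Grow a tree where each new node attaches to one earlier node."""
    rng = random.Random(seed)
    sampler = FenwickSampler(num_nodes)
    degree = [0] * num_nodes
    edges = []
    sampler.set_weight(0, preference(0))
    for new in range(1, num_nodes):
        target = sampler.sample(rng)  # only earlier nodes have weight so far
        edges.append((new, target))
        degree[new] += 1
        degree[target] += 1
        sampler.set_weight(target, preference(degree[target]))
        sampler.set_weight(new, preference(degree[new]))
    return edges, degree


edges, degree = generate_pa_graph(1000, preference=lambda d: d + 1, seed=42)
print(max(degree))
```

Passing a different `preference`, for example `lambda d: (d + 1) ** 0.5`, changes the attachment rule without touching the sampler, which is the kind of flexibility the abstract attributes to its data structures.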

preprint · 2014 · arXiv

Efficiently Estimating Motif Statistics of Large Networks

Exploring statistics of locally connected subgraph patterns (also known as network motifs) has helped researchers better understand the structure and function of biological and online social networks (OSNs). Nowadays the massive size of some critical networks -- often stored in already overloaded relational databases -- effectively limits the rate at which nodes and edges can be explored, making it a challenge to accurately discover subgraph statistics. In this work, we propose sampling methods to accurately estimate subgraph statistics from as few queried nodes as possible. We present sampling algorithms that efficiently and accurately estimate subgraph properties of massive networks. Our algorithms require no pre-computation or complete network topology information. At the same time, we provide theoretical guarantees of convergence. We perform experiments using widely known data sets, and show that for the same accuracy, our algorithms require an order of magnitude fewer queries (samples) than the current state-of-the-art algorithms.

preprint · 2014 · arXiv

Modeling and Predicting the Growth and Death of Membership-based Websites

Driven by outstanding success stories of Internet startups such as Facebook and The Huffington Post, recent studies have thoroughly described their growth. These highly visible online success stories, however, overshadow an untold number of similar ventures that fail. The study of website popularity is ultimately incomplete without general mechanisms that can describe both successes and failures. In this work we present six years of the daily number of users (DAU) of twenty-two membership-based websites - encompassing online social networks, grassroots movements, online forums, and membership-only Internet stores - well balanced between successes and failures. We then propose a combination of reaction-diffusion-decay processes whose resulting equations seem not only to describe the observed DAU time series well but also to provide a means to roughly predict their evolution. This model allows an approximate automatic DAU-based classification of websites as self-sustainable vs. unsustainable, and by whether the startup growth is mostly driven by marketing & media campaigns or word-of-mouth adoptions.

preprint · 2014 · arXiv

Modeling Website Popularity Competition in the Attention-Activity Marketplace

How does a new startup drive the popularity of competing websites into oblivion like Facebook famously did to MySpace? This question is of great interest to academics, technologists, and financial investors alike. In this work we exploit the singular way in which Facebook wiped out the popularity of MySpace, Hi5, Friendster, and Multiply to guide the design of a new popularity competition model. Our model provides new insights into what Nobel Laureate Herbert A. Simon called the "marketplace of attention," which we recast as the attention-activity marketplace. Our model design is further substantiated by user-level activity of 250,000 MySpace users obtained between 2004 and 2009. The resulting model not only accurately fits the observed Daily Active Users (DAU) of Facebook and its competitors but also predicts their fate four years into the future.

preprint · 2014 · arXiv

On the duration and intensity of cumulative advantage competitions

The role of skill (fitness) and luck (randomness) as driving forces on the dynamics of resource accumulation in a myriad of systems has long puzzled scientists. Fueled by undisputed inequalities that emerge from actual competitions, there is a pressing need for better understanding the effects of skill and luck in resource accumulation. When such competitions are driven by externalities such as cumulative advantage (CA), the rich-get-richer effect, little is known with respect to fundamental properties such as their duration and intensity. In this work we provide a mathematical understanding of how CA exacerbates the role of luck to the detriment of skill in simple and well-studied competition models. We show, for instance, that if two agents are competing for resources that arrive sequentially at each time unit, an early stroke of luck can place the less skilled in the lead for an extremely long period of time, a phenomenon we call "struggle of the fittest". In the absence of CA, the more skilled quickly prevails despite any early stroke of luck that the less skilled may have. We prove that the duration of a simple skill and luck competition model exhibits power law tails when CA is present, regardless of skill difference, which is in sharp contrast to exponential tails when CA is absent. Our findings have important implications for competitions not only in complex social systems but also in contexts that leverage such models.
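A toy simulation can make the two-agent setting in this abstract easier to picture. The sketch below is one plausible reading, not the authors' model: one resource unit arrives per time step, and an agent wins it with probability proportional to its fitness, multiplied by its current wealth when cumulative advantage is switched on. The function name, the starting wealth of 1, and the fitness values are all assumptions for illustration.

```python
import random


def last_lead_of_weaker(fitness_strong, fitness_weak, steps,
                        cumulative_advantage, seed=0):
    """Return the last time step at which the less skilled agent was ahead."""
    rng = random.Random(seed)
    strong, weak = 1, 1  # assumed initial wealth of each agent
    last_lead = 0
    for t in range(1, steps + 1):
        # With CA, current wealth amplifies the chance of winning the next unit.
        w_strong = fitness_strong * (strong if cumulative_advantage else 1)
        w_weak = fitness_weak * (weak if cumulative_advantage else 1)
        if rng.random() < w_strong / (w_strong + w_weak):
            strong += 1
        else:
            weak += 1
        if weak > strong:
            last_lead = t
    return last_lead


# Compare how long the weaker agent can stay ahead across several runs.
for ca in (False, True):
    leads = [last_lead_of_weaker(1.2, 1.0, 5000, ca, seed=s) for s in range(20)]
    print("CA" if ca else "no CA", max(leads))
```

Running many seeds and inspecting the distribution of `last_lead` is one informal way to look at the tail behavior the abstract describes, though a simulation of this size cannot by itself confirm a power law.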

preprint · 2014 · arXiv

Online Dating Recommendations: Matching Markets and Learning Preferences

Recommendation systems for online dating have recently attracted much attention from the research community. In this paper we propose a two-sided matching framework for online dating recommendations and design an LDA model to learn user preferences from the observed user messaging behavior and user profile features. Experimental results using data from a large online dating website show that two-sided matching significantly improves the rate of successful matches, by as much as 45%. Finally, using simulated matchings, we show that the LDA model can correctly capture user preferences.

preprint · 2014 · arXiv

Revisit Behavior in Social Media: The Phoenix-R Model and Discoveries

How many listens will an artist receive on an online radio? How about plays on a YouTube video? How many of these visits are from new or returning users? Modeling and mining popularity dynamics of social activity has important implications for researchers, content creators and providers. We here investigate the effect of revisits (successive visits from a single user) on content popularity. Using four datasets of social activity, with up to tens of millions of media objects (e.g., YouTube videos, Twitter hashtags or LastFM artists), we show the effect of revisits in the popularity evolution of such objects. Secondly, we propose the Phoenix-R model which captures the popularity dynamics of individual objects. Phoenix-R has the desired properties of being: (1) parsimonious, being based on the minimum description length principle, and achieving lower root mean squared error than state-of-the-art baselines; (2) applicable, the model is effective for predicting future popularity values of objects.

preprint · 2014 · arXiv

Who is Dating Whom: Characterizing User Behaviors of a Large Online Dating Site

Online dating sites have become popular platforms for people to look for potential romantic partners. It is important to understand users' dating preferences in order to make better recommendations on potential dates. The message sending and replying actions of a user are strong indicators for what he/she is looking for in a potential date and reflect the user's actual dating preferences. We study how users' online dating behaviors correlate with various user attributes using a large real-world dataset from a major online dating site in China. Many of our results on user messaging behavior align with notions in social and evolutionary psychology: males tend to look for younger females while females put more emphasis on the socioeconomic status (e.g., income, education level) of a potential date. In addition, we observe that the geographic distance between two users and the photo count of users play an important role in their dating behaviors. Our results show that it is important to differentiate between users' true preferences and random selection. Some user behaviors in choosing attributes in a potential date may largely be a result of random selection. We also find that both males and females are more likely to reply to users whose attributes come closest to the stated preferences of the receivers, and there is significant discrepancy between a user's stated dating preference and his/her actual online dating behavior. These results can provide valuable guidelines to the design of a recommendation engine for potential dates.

preprint · 2013 · arXiv

Characterizing Branching Processes from Sampled Data

Branching processes model the evolution of populations of agents that randomly generate offspring. These processes, more patently Galton-Watson processes, are widely used to model biological, social, cognitive, and technological phenomena, such as the diffusion of ideas, knowledge, chain letters, viruses, and the evolution of humans through their Y-chromosome DNA or mitochondrial RNA. A practical challenge of modeling real phenomena using a Galton-Watson process is the offspring distribution, which must be measured from the population. In most cases, however, directly measuring the offspring distribution is unrealistic due to lack of resources or the death of agents. So far, researchers have relied on informed guesses to guide their choice of offspring distribution. In this work we propose two methods to estimate the offspring distribution from real sampled data. Using a small sampled fraction of the agents, together with the identity of the ancestors of the sampled agents, we show that accurate offspring distribution estimates can be obtained by sampling as little as 14% of the population.
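For readers unfamiliar with the model named in this abstract, a minimal sketch of a Galton-Watson process follows. It only simulates the forward process (each agent independently draws its number of offspring from a fixed distribution); it does not implement the paper's estimation methods. The function names and the example offspring distribution are illustrative assumptions, not taken from the paper.

```python
import random


def simulate_galton_watson(offspring_pmf, generations, seed=0):
    """Return the population size of each generation, starting from one agent.

    offspring_pmf[k] is the probability that an agent has exactly k offspring
    (a hypothetical input format chosen for this sketch).
    """
    rng = random.Random(seed)
    counts = list(range(len(offspring_pmf)))
    sizes = [1]
    for _ in range(generations):
        current = sizes[-1]
        # Each agent independently draws its number of offspring.
        draws = rng.choices(counts, weights=offspring_pmf, k=current)
        sizes.append(sum(draws))
    return sizes


def mean_offspring(offspring_pmf):
    """Expected number of offspring per agent."""
    return sum(k * p for k, p in enumerate(offspring_pmf))


# Illustrative distribution: 0, 1 or 2 offspring.
pmf = [0.25, 0.25, 0.5]
sizes = simulate_galton_watson(pmf, generations=10)
```

Once the population reaches zero it stays at zero, which is the extinction event of the process.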

preprint2013arXiv

Practical Characterization of Large Networks Using Neighborhood Information

Characterizing large online social networks (OSNs) through node querying is a challenging task. OSNs often impose severe constraints on the query rate, hence limiting the sample size to a small fraction of the total network. Various ad-hoc subgraph sampling methods have been proposed, but many of them give biased estimates and offer no theoretical basis for their accuracy. In this work, we focus on developing sampling methods for OSNs where querying a node also reveals partial structural information about its neighbors. Our methods are optimized for NoSQL graph databases (if the database can be accessed directly), or utilize the Web APIs available on most major OSNs for graph sampling. We show that our sampling method has provable convergence guarantees on being an unbiased estimator, and it is more accurate than current state-of-the-art methods. We characterize metrics such as node label density estimation and edge label density estimation, two of the most fundamental network characteristics from which other network characteristics can be derived. We evaluate our methods on-the-fly over several live networks using their native APIs. Our simulation studies over a variety of offline datasets show that by including neighborhood information, our method drastically (4-fold) reduces the number of samples required to achieve the same estimation accuracy as state-of-the-art methods.

preprint2013arXiv

Quantifying the effect of temporal resolution on time-varying networks

Time-varying networks describe a wide array of systems whose constituents and interactions evolve over time. They are defined by an ordered stream of interactions between nodes, yet they are often represented in terms of a sequence of static networks, each aggregating all edges and nodes present in a time interval of size Δt. In this work we quantify the impact of an arbitrary Δt on the description of a dynamical process taking place upon a time-varying network. We focus on the elementary random walk, and put forth a simple mathematical framework that well describes the behavior observed on real datasets. The analytical description of the bias introduced by time integrating techniques represents a step forward in the correct characterization of dynamical processes on time-varying graphs.
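The aggregation step described in this abstract, turning an ordered stream of interactions into a sequence of static networks over windows of size Δt, can be sketched as below. The event format `(t, u, v)` and the function name are assumptions made for illustration; the paper's analysis of the resulting bias is not reproduced here.

```python
from collections import defaultdict


def aggregate_snapshots(events, dt):
    """Group (t, u, v) interaction events into windows [k*dt, (k+1)*dt).

    Returns a dict mapping the window index k to the set of undirected
    edges observed in that window.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    snapshots = defaultdict(set)
    for t, u, v in events:
        window = int(t // dt)
        # frozenset makes the edge undirected; the set removes repeats.
        snapshots[window].add(frozenset((u, v)))
    return dict(snapshots)


events = [(0.1, "a", "b"), (0.4, "b", "a"), (1.2, "b", "c"), (2.7, "a", "c")]
fine = aggregate_snapshots(events, 1.0)    # three snapshots, one edge each
coarse = aggregate_snapshots(events, 3.0)  # one snapshot with all edges
```

With the larger window, the ordering of the three interactions is lost: the coarse snapshot contains a path from "a" to "c" through "b" as well as the direct edge, regardless of when each interaction occurred.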

preprint2013arXiv

Red bulgeless galaxies in SDSS DR7. Are there any AGN hosts?

With the main goal of finding bulgeless galaxies harbouring supermassive black holes and showing, at most, just residual star formation activity, we have selected a sample of massive bulgeless red sequence galaxies from the SDSS-DR7, based on the NYU-VAGC catalogue. Multiwavelength data were retrieved using EURO-VO tools, and the objects are characterised in terms of degree of star formation and the presence of an AGN. We have found seven objects that are quenched massive galaxies, that have no prominent bulge and that show signs of extra activity in their nuclei, five of them being central in their halo. These objects are rather robust candidates for rare systems that, though devoid of a significant bulge, harbour a supermassive black hole with an activity level likely capable of having halted the star formation through feedback.

preprint2012arXiv

Characterizing Continuous Time Random Walks on Time Varying Graphs

In this paper we study the behavior of a continuous time random walk (CTRW) on a stationary and ergodic time varying dynamic graph. We establish conditions under which the CTRW is a stationary and ergodic process. In general, the stationary distribution of the walker depends on the walker rate and is difficult to characterize. However, we characterize the stationary distribution in the following cases: i) the walker rate is significantly larger or smaller than the rate at which the graph changes (time-scale separation), ii) the walker rate is proportional to the degree of the node that it resides on (coupled dynamics), and iii) the degrees of nodes belonging to the same connected component are identical (structural constraints). We provide examples that illustrate our theoretical findings.

preprint2012arXiv

Multiple Random Walks to Uncover Short Paths in Power Law Networks

Consider the following routing problem in the context of a large scale network $G$, with particular interest paid to power law networks, although our results do not assume a particular degree distribution. A small number of nodes want to exchange messages and are looking for short paths on $G$. These nodes do not have access to the topology of $G$ but are allowed to crawl the network within a limited budget. Only crawlers whose sample paths cross are allowed to exchange topological information. In this work we study the use of random walks (RWs) to crawl $G$. We show that the ability of RWs to find short paths bears no relation to the paths that they take. Instead, it relies on two properties of RWs on power law networks: 1) the RWs' ability to observe a sizable fraction of the network edges; and 2) an almost certainty that two distinct RW sample paths cross after a small percentage of the nodes have been visited. We show promising simulation results on several real world networks.

preprint2012arXiv

On Set Size Distribution Estimation and the Characterization of Large Networks via Sampling

In this work we study the set size distribution estimation problem, where elements are randomly sampled from a collection of non-overlapping sets and we seek to recover the original set size distribution from the samples. This problem has applications in capacity planning and network theory, among other areas. Examples of real-world applications include characterizing in-degree distributions in large graphs and uncovering TCP/IP flow size distributions on the Internet. We demonstrate that it is hard to estimate the original set size distribution. The recoverability of original set size distributions presents a sharp threshold with respect to the fraction of elements that remain in the sets. If this fraction remains below a threshold, typically half of the elements in power-law and heavier-than-exponential-tailed distributions, then the original set size distribution is unrecoverable. We also discuss practical implications of our findings.
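The sampling setup in this abstract can be illustrated with a small forward simulation: every element of every set is kept independently with probability `p`, and only sets with at least one surviving element are observed. This is a sketch of the observation process only, under the assumption of independent per-element sampling; it is not the paper's estimator or its threshold analysis, and all names are hypothetical.

```python
import random
from collections import Counter


def sample_set_sizes(set_sizes, p, seed=0):
    """Keep each element with probability p; return observed set sizes.

    Sets whose elements are all dropped are invisible to the observer,
    so they do not appear in the output.
    """
    rng = random.Random(seed)
    observed = []
    for size in set_sizes:
        kept = sum(1 for _ in range(size) if rng.random() < p)
        if kept > 0:
            observed.append(kept)
    return observed


def size_distribution(sizes):
    """Empirical distribution {size: fraction of sets with that size}."""
    counts = Counter(sizes)
    total = len(sizes)
    return {k: c / total for k, c in sorted(counts.items())}


original = [1] * 50 + [2] * 30 + [5] * 15 + [20] * 5
observed = sample_set_sizes(original, p=0.5)
observed_dist = size_distribution(observed)
```

Comparing `size_distribution(original)` with `observed_dist` shows the distortion the estimation problem has to undo: large sets shrink, and many small sets disappear entirely.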

preprint2012arXiv

Online Myopic Network Covering

Efficient marketing or awareness-raising campaigns seek to recruit $n$ influential individuals -- where $n$ is the campaign budget -- that are able to cover a large target audience through their social connections. So far most of the related literature on maximizing this network cover assumes that the social network topology is known. Even in such a case the optimal solution is NP-hard. In practice, however, the network topology is generally unknown and needs to be discovered on-the-fly. In this work we consider an unknown topology where recruited individuals disclose their social connections (a feature known as one-hop lookahead). The goal of this work is to provide an efficient greedy online algorithm that recruits individuals so as to maximize the size of the target audience covered by the campaign. We propose a new greedy online algorithm, Maximum Expected $d$-Excess Degree (MEED), and provide, to the best of our knowledge, the first detailed theoretical analysis of the cover size of a variety of well known network sampling algorithms on finite networks. Our proposed algorithm greedily maximizes the expected size of the cover. For a class of random power law networks we show that MEED simplifies into a straightforward procedure, which we denote MOD (Maximum Observed Degree). We substantiate our analytical results with extensive simulations and show that MOD significantly outperforms all analyzed myopic algorithms. We note that performance may be further improved if the node degree distribution is known or can be estimated online during the campaign.
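The abstract does not define MOD beyond its name, so the following is only one plausible reading, stated as an assumption: under one-hop lookahead each recruited node reveals its neighbors, the "observed degree" of a not-yet-recruited node is taken to be the number of recruited nodes that have revealed it, and the next recruit is the candidate with the largest observed degree. The function name, the adjacency-list input and the alphabetical tie-breaking are all hypothetical choices for this sketch.

```python
def mod_recruit(adj, start, budget):
    """Greedy recruiting by maximum observed degree (assumed reading of MOD).

    adj maps each node to a list of neighbors; it is consulted only for
    recruited nodes, mimicking one-hop lookahead on an unknown topology.
    Returns (recruited nodes in order, set of covered nodes).
    """
    recruited = [start]
    covered = {start} | set(adj[start])
    observed_degree = {}
    for v in adj[start]:
        observed_degree[v] = observed_degree.get(v, 0) + 1

    while len(recruited) < budget:
        candidates = {
            v: d for v, d in observed_degree.items() if v not in recruited
        }
        if not candidates:
            break  # nothing left to recruit among discovered nodes
        # Ties are broken alphabetically (an arbitrary choice for this sketch).
        best = max(sorted(candidates), key=lambda v: candidates[v])
        recruited.append(best)
        covered |= {best} | set(adj[best])
        for v in adj[best]:
            observed_degree[v] = observed_degree.get(v, 0) + 1
    return recruited, covered


graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b", "e"],
    "e": ["d"],
}
recruited, covered = mod_recruit(graph, start="a", budget=2)
```

On this toy graph, raising the budget to 3 recruits "c" next (observed degree 2) even though it adds nothing to the cover, which illustrates why such a rule is called myopic.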

preprint2010arXiv

Estimating and Sampling Graphs with Multidimensional Random Walks

Estimating characteristics of large graphs via sampling is a vital part of the study of complex networks. Current sampling methods such as (independent) random vertex and random walks are useful but have drawbacks. Random vertex sampling may require too many resources (time, bandwidth, or money). Random walks, which normally require fewer resources per sample, can suffer from large estimation errors in the presence of disconnected or loosely connected graphs. In this work we propose a new $m$-dimensional random walk that uses $m$ dependent random walkers. We show that the proposed sampling method, which we call Frontier sampling, exhibits all of the nice sampling properties of a regular random walk. At the same time, our simulations over large real world graphs show that, in the presence of disconnected or loosely connected components, Frontier sampling exhibits lower estimation errors than regular random walks. We also show that Frontier sampling is more suitable than random vertex sampling to sample the tail of the degree distribution of the graph.
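The abstract names Frontier sampling but does not spell out its steps. The sketch below follows a common description of the procedure and should be read as an assumption rather than the authors' exact algorithm: start `m` walkers at uniformly chosen vertices, repeatedly pick one walker with probability proportional to the degree of its current vertex, and move it along a uniformly chosen incident edge. The adjacency-list format and all names are illustrative.

```python
import random


def frontier_sampling(adj, m, budget, seed=0):
    """Sample `budget` edges using m dependent random walkers.

    adj maps each node to a list of neighbors (undirected graph).
    Returns the list of traversed edges (u, v).
    """
    rng = random.Random(seed)
    # Isolated nodes cannot host a walker, so they are skipped here.
    nodes = [v for v in adj if adj[v]]
    walkers = [rng.choice(nodes) for _ in range(m)]
    sampled_edges = []
    for _ in range(budget):
        # Choose a walker with probability proportional to its node degree.
        degrees = [len(adj[u]) for u in walkers]
        i = rng.choices(range(m), weights=degrees, k=1)[0]
        u = walkers[i]
        # Move that walker along a uniformly chosen incident edge.
        v = rng.choice(adj[u])
        sampled_edges.append((u, v))
        walkers[i] = v
    return sampled_edges


# Two disconnected triangles: a single walker would never leave its component.
two_triangles = {
    0: [1, 2], 1: [0, 2], 2: [0, 1],
    3: [4, 5], 4: [3, 5], 5: [3, 4],
}
edges = frontier_sampling(two_triangles, m=4, budget=20)
```

With `m=1` this reduces to an ordinary random walk, which stays inside whichever component it starts in; multiple walkers give the sampler a chance to cover several components.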

33 published item(s)

Bridging Input Feature Spaces Towards Graph Foundation Models

Imitative Membership Inference Attack

Bias Challenges in Counterfactual Data Augmentation

Veritas: Answering Causal Queries from Video Streaming Traces

Membership Inference Attacks and Defenses in Classification Models

ALMA reveals the molecular gas properties of 5 star-forming galaxies across the main sequence at 3 < z < 3.5

Deceptive Deletions for Protecting Withdrawn Posts on Social Platforms

Infinity Learning: Learning Markov Chains from Aggregate Steady-State Observations

On the Equivalence between Positional Node Embeddings and Structural Graph Representations

Random Spiking and Systematic Evaluation of Defenses Against Adversarial Examples

The evolution of rest-frame UV properties, Lya EWs and the SFR-Stellar mass relation at z~2-6 for SC4K LAEs

Towards Studying Hierarchical Assembly in Real Time: A Milky Way Progenitor Galaxy at z = 2.36 under the Microscope

VIS3COS: III. environmental effects on the star formation histories of galaxies at z~0.8 seen in [OII], H$δ$, and Dn4000

TribeFlow: Mining & Predicting User Trajectories

Bayesian Inference of Online Social Network Statistics via Lightweight Random Walk Crawls

Classifying Latent Infection States in Complex Networks

Efficient Network Generation Under General Preferential Attachment

Efficiently Estimating Motif Statistics of Large Networks

Modeling and Predicting the Growth and Death of Membership-based Websites

Modeling Website Popularity Competition in the Attention-Activity Marketplace

On the duration and intensity of cumulative advantage competitions

Online Dating Recommendations: Matching Markets and Learning Preferences

Revisit Behavior in Social Media: The Phoenix-R Model and Discoveries

Who is Dating Whom: Characterizing User Behaviors of a Large Online Dating Site

Characterizing Branching Processes from Sampled Data

Practical Characterization of Large Networks Using Neighborhood Information

Quantifying the effect of temporal resolution on time-varying networks

Red bulgeless galaxies in SDSS DR7. Are there any AGN hosts?

Characterizing Continuous Time Random Walks on Time Varying Graphs

Multiple Random Walks to Uncover Short Paths in Power Law Networks

On Set Size Distribution Estimation and the Characterization of Large Networks via Sampling

Online Myopic Network Covering

Estimating and Sampling Graphs with Multidimensional Random Walks