Source author record

Nicolas Kourtellis

Nicolas Kourtellis appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Topics: cs.CY; Social and Information Networks; Distributed, Parallel, and Cluster Computing; Cryptography and Security; physics.soc-ph; Machine Learning; Data Structures and Algorithms; Information Retrieval; Networking and Internet Architecture; Artificial Intelligence; Databases; Robotics

Catalog footprint

What is connected

30 works

12 topics

4 close collaborators

preprint, 2022, arXiv

A deep dive into the consistently toxic 1% of Twitter

Misbehavior in online social networks (OSN) is an ever-growing phenomenon. The research to date tends to focus on the deployment of machine learning to identify and classify types of misbehavior such as bullying, aggression, and racism to name a few. The main goal of identification is to curb natural and mechanical misconduct and make OSNs a safer place for social discourse. Going beyond past works, we perform a longitudinal study of a large selection of Twitter profiles, which enables us to characterize profiles in terms of how consistently they post highly toxic content. Our data spans 14 years of tweets from 122K Twitter profiles and more than 293M tweets. From this data, we selected the most extreme profiles in terms of consistency of toxic content and examined their tweet texts, and the domains, hashtags, and URLs they shared. We found that these selected profiles keep to a narrow theme with lower diversity in hashtags, URLs, and domains, they are thematically similar to each other (in a coordinated manner, if not through intent), and have a high likelihood of bot-like behavior (likely to have progenitors with intentions to influence). Our work contributes a substantial and longitudinal online misbehavior dataset to the research community and establishes the consistency of a profile's toxic behavior as a useful factor when exploring misbehavior as potential accessories to influence operations on OSNs.

preprint, 2022, arXiv

Hierarchical Federated Learning with Privacy

Federated learning (FL), where data remains at the federated clients, and where only gradient updates are shared with a central aggregator, was assumed to be private. Recent work demonstrates that adversaries with gradient-level access can mount successful inference and reconstruction attacks. In such settings, differentially private (DP) learning is known to provide resilience. However, approaches used in the status quo (i.e., central and local DP) introduce disparate utility vs. privacy trade-offs. In this work, we take the first step towards mitigating such trade-offs through hierarchical FL (HFL). We demonstrate that by the introduction of a new intermediary level where calibrated DP noise can be added, better privacy vs. utility trade-offs can be obtained; we term this hierarchical DP (HDP). Our experiments with 3 different datasets (commonly used as benchmarks for FL) suggest that HDP produces models as accurate as those obtained using central DP, where noise is added at a central aggregator. Such an approach also provides comparable benefit against inference adversaries as in the local DP case, where noise is added at the federated clients.
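The idea of adding noise at an intermediary level can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names, the clipping step, the Gaussian noise, and the `sigma` parameter are hypothetical and are not taken from the paper, and the noise scale is not calibrated to any formal privacy budget.

```python
import math
import random


def clip(update, max_norm):
    """Scale an update vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def mean(vectors):
    """Coordinate-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def intermediary_aggregate(client_updates, max_norm, sigma, rng):
    """Average clipped client updates, then add Gaussian noise once.

    Assumption: the noise standard deviation is sigma * max_norm / n,
    a common choice for a noisy mean; the paper may calibrate differently.
    """
    clipped = [clip(u, max_norm) for u in client_updates]
    avg = mean(clipped)
    noise_std = sigma * max_norm / len(clipped)
    return [x + rng.gauss(0.0, noise_std) for x in avg]


def hierarchical_round(groups, max_norm, sigma, rng):
    """One round: noise is added per intermediary, not per client or centrally."""
    noisy = [intermediary_aggregate(g, max_norm, sigma, rng) for g in groups]
    return mean(noisy)


# Hypothetical usage: two intermediaries, three clients in total.
rng = random.Random(0)
groups = [[[1.0, 0.0], [0.0, 1.0]], [[3.0, 4.0]]]
global_update = hierarchical_round(groups, max_norm=1.0, sigma=0.5, rng=rng)
```

Setting `sigma=0` removes the noise, which makes it easy to check the aggregation logic separately from the privacy mechanism.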

preprint, 2022, arXiv

Leveraging Google's Publisher-specific IDs to Detect Website Administration

Digital advertising is the most popular way for content monetization on the Internet. Publishers spawn new websites, and older ones change hands with the sole purpose of monetizing user traffic. In this ever-evolving ecosystem, it is challenging to effectively answer questions such as: Which entities monetize what websites? What categories of websites does an average entity typically monetize on and how diverse are these websites? How has this website administration ecosystem changed across time? In this paper, we propose a novel, graph-based methodology to detect administration of websites on the Web, by exploiting the ad-related publisher-specific IDs. We apply our methodology across the top 1 million websites and study the characteristics of the created graphs of website administration. Our findings show that approximately 90% of the websites are associated each with a single publisher, and that small publishers tend to manage less popular websites. We perform a historical analysis of up to 8 million websites, and find a new, constantly rising number of (intermediary) publishers that control and monetize traffic from hundreds of websites, seeking a share of the ad-market pie. We also observe that over time, websites tend to move from big to smaller administrators.

preprint, 2022, arXiv

Measuring the (Over)use of Service Workers for In-Page Push Advertising Purposes

Rich offline experience, periodic background sync, push notification functionality, network request control, and improved performance via request caching are only a few of the functionalities provided by the Service Worker (SW) API. This new technology, supported by all major browsers, can significantly improve users' experience by providing the publisher with the technical foundations that would normally require a native application. Despite the capabilities of this new technique and its important role in the ecosystem of Progressive Web Apps (PWAs), it is still unclear what the actual purpose of SWs is on the web, and how publishers leverage the provided functionality in their web applications. In this study, we shed light on the real-world deployment of SWs by conducting the first large-scale analysis of the prevalence of SWs in the wild. We see that SWs are becoming more and more popular, with adoption increasing by 26% within the last 5 months alone. Surprisingly, despite their rich capabilities, we see that SWs are mostly used for In-Page Push Advertising, in 65.08% of the SWs that connect with 3rd parties. We highlight that this is a relatively new way for advertisers to bypass ad-blockers and render ads on the user's display natively.

preprint, 2022, arXiv

User Tracking in the Post-cookie Era: How Websites Bypass GDPR Consent to Track Users

During the past few years, mostly as a result of the GDPR and the CCPA, websites have started to present users with cookie consent banners. These banners are web forms where the users can state their preference and declare which cookies they would like to accept, if such an option exists. Although requesting consent before storing any identifiable information is a good start towards respecting user privacy, previous research has shown that websites do not always respect user choices. Furthermore, considering the ever-decreasing reliance of trackers on cookies and the actions browser vendors take by blocking or restricting third-party cookies, we anticipate a world where stateless tracking emerges, either because trackers or websites do not use cookies, or because users simply refuse to accept any. In this paper, we explore whether websites use more persistent and sophisticated forms of tracking in order to track users who said they do not want cookies. Such forms of tracking include first-party ID leaking, ID synchronization, and browser fingerprinting. Our results suggest that websites do use such modern forms of tracking even before users had the opportunity to register their choice with respect to cookies. To add insult to injury, when users choose to raise their voice and reject all cookies, user tracking only intensifies. As a result, users' choices play very little role with respect to tracking: we measured that more than 75% of tracking activities happened before users had the opportunity to make a selection in the cookie consent banner, or when users chose to reject all cookies.

preprint, 2022, arXiv

YouTubers Not madeForKids: Detecting Channels Sharing Inappropriate Videos Targeting Children

In recent years, hundreds of new YouTube channels have been creating and sharing videos targeting children, with themes related to animation, superhero movies, comics, etc. Unfortunately, many of these videos are inappropriate for consumption by their target audience, due to disturbing, violent, or sexual scenes. In this paper, we study YouTube channels found to post suitable or disturbing videos targeting kids in the past. We identify a clear discrepancy between what YouTube assumes and flags as inappropriate content and channels, vs. what is found to be disturbing content and still available on the platform, targeting kids. In particular, we find that almost 60% of videos that were manually annotated and classified as disturbing by an older study in 2019 (a collection bootstrapped with Elsa and other keywords related to children's videos) are still available on YouTube in mid 2021. In the meantime, 44% of channels that uploaded such disturbing videos have yet to be suspended and their videos to be removed. For the first time in the literature, we also study the "madeForKids" flag, a new feature that YouTube introduced at the end of 2019, and compare its application to the channels that shared disturbing videos, as flagged by the previous study. Apparently, these channels are less likely to be set as "madeForKids" than those sharing suitable content. In addition, channels posting disturbing videos utilize their channel features such as keywords, description, topics, posts, etc., to appeal to kids (e.g., using game-related keywords). Finally, we use a collection of such channel and content features to train ML classifiers able to detect, at channel creation time, when a channel will be related to disturbing content uploads. These classifiers can help YouTube moderators reduce such incidents, pointing to potentially suspicious accounts without analyzing actual videos.

preprint, 2021, arXiv

Differential Tracking Across Topical Webpages of Indian News Media

Online user privacy and tracking have been extensively studied in recent years, especially due to privacy and personal data-related legislation in the EU and the USA, such as the General Data Protection Regulation, ePrivacy Regulation, and California Consumer Privacy Act. Research has revealed novel tracking and personally identifiable information leakage methods that first and third parties employ on websites around the world, as well as the intensity of tracking performed on such websites. However, for the sake of scaling to cover a large portion of the Web, most past studies focused on homepages of websites, and did not look deeper into the tracking practices on their topical subpages. The majority of studies focused on the Global North markets such as the EU and the USA. Large markets such as India, which covers 20% of the world population and has no explicit privacy laws, have not been studied in this regard. We aim to address these gaps and focus on the following research questions: Is tracking on topical subpages of Indian news websites different from their homepage? Do third-party trackers prefer to track specific topics? How does this preference compare to the similarity of content shown on these topical subpages? To answer these questions, we propose a novel method for automatic extraction and categorization of Indian news topical subpages based on the details in their URLs. We study the identified topical subpages and compare them with their homepages with respect to the intensity of cookie injection and third-party embeddedness and type. We find differential user tracking among subpages, and between subpages and homepages. We also find a preferential attachment of third-party trackers to specific topics. Also, embedded third parties tend to track specific subpages simultaneously, revealing possible user profiling in action.

preprint, 2021, arXiv

Under the Spotlight: Web Tracking in Indian Partisan News Websites

India is experiencing intense political partisanship and sectarian divisions. The paper performs, to the best of our knowledge, the first comprehensive analysis on the Indian online news media with respect to tracking and partisanship. We build a dataset of 103 online, mostly mainstream news websites. With the help of two experts, alongside data from the Media Ownership Monitor of the Reporters without Borders, we label these websites according to their partisanship (Left, Right, or Centre). We study and compare user tracking on these sites with different metrics: numbers of cookies, cookie synchronizations, device fingerprinting, and invisible pixel-based tracking. We find that Left and Centre websites serve more cookies than Right-leaning websites. However, through cookie synchronization, more user IDs are synchronized in Left websites than Right or Centre. Canvas fingerprinting is used similarly by Left and Right, and less by Centre. Invisible pixel-based tracking is 50% more intense in Centre-leaning websites than Right, and 25% more than Left. Desktop versions of news websites deliver more cookies than their mobile counterparts. A handful of third-parties are tracking users in most websites in this study. This paper, by demonstrating intense web tracking, has implications for research on overall privacy of users visiting partisan news websites in India.

preprint, 2020, arXiv

Clash of the Trackers: Measuring the Evolution of the Online Tracking Ecosystem

Websites are constantly adapting the methods used, and intensity with which they track online visitors. However, the wide-range enforcement of GDPR since one year ago (May 2018) forced websites serving EU-based online visitors to eliminate or at least reduce such tracking activity, given they receive proper user consent. Therefore, it is important to record and analyze the evolution of this tracking activity and assess the overall "privacy health" of the Web ecosystem and if it is better after GDPR enforcement. This work makes a significant step towards this direction. In this paper, we analyze the online ecosystem of 3rd-parties embedded in top websites which amass the majority of online tracking through 6 time snapshots taken every few months apart, in the duration of the last 2 years. We perform this analysis in three ways: 1) by looking into the network activity that 3rd-parties impose on each publisher hosting them, 2) by constructing a bipartite graph of "publisher-to-tracker", connecting 3rd parties with their publishers, 3) by constructing a "tracker-to-tracker" graph connecting 3rd-parties who are commonly found in publishers. We record significant changes through time in number of trackers, traffic induced in publishers (incoming vs. outgoing), embeddedness of trackers in publishers, popularity and mixture of trackers across publishers. We also report how such measures compare with the ranking of publishers based on Alexa. On the last level of our analysis, we dig deeper and look into the connectivity of trackers with each other and how this relates to potential cookie synchronization activity.

preprint, 2020, arXiv

Cookie Synchronization: Everything You Always Wanted to Know But Were Afraid to Ask

User data is the primary input of digital advertising, fueling the free Internet as we know it. As a result, web companies invest a lot in elaborate tracking mechanisms to acquire user data that they can sell to data markets and advertisers. However, with the same-origin policy, and cookies as a primary identification mechanism on the web, each tracker knows the same user with a different ID. To mitigate this, Cookie Synchronization (CSync) came to the rescue, facilitating an information sharing channel between third parties that may or may not have direct access to the website the user visits. In the background, with CSync, they merge user data they own, but also reconstruct a user's browsing history, bypassing the same-origin policy. In this paper, we perform, to our knowledge, the first in-depth study of CSync in the wild, using a year-long weblog from 850 real mobile users. Through our study, we aim to understand the characteristics of the CSync protocol and the impact it has on web users' privacy. For this, we design and implement CONRAD, a holistic mechanism to detect CSync events in real time, and the privacy loss on the user side, even when the synced IDs are obfuscated. Using CONRAD, we find that 97% of the regular web users are exposed to CSync: most of them within the first week of their browsing, and the median userID gets leaked, on average, to 3.5 different domains. Finally, we see that CSync increases the number of domains that track the user by a factor of 6.75.

preprint, 2020, arXiv

I call BS: Fraud Detection in Crowdfunding Campaigns

Donations to charity-based crowdfunding environments have been on the rise in the last few years. Unsurprisingly, deception and fraud in such platforms have also increased, but have not been thoroughly studied to understand what characteristics can expose such behavior and allow its automatic detection and blocking. Indeed, crowdfunding platforms are the only ones typically performing oversight for the campaigns launched in each service. However, they are not properly incentivized to combat fraud among users and the campaigns they launch: on the one hand, a platform's revenue is directly proportional to the number of transactions performed (since the platform charges a fixed amount per donation); on the other hand, if a platform is transparent with respect to how much fraud it has, it may discourage potential donors from participating. In this paper, we take the first step in studying fraud in crowdfunding campaigns. We analyze data collected from different crowdfunding platforms, and annotate 700 campaigns as fraud or not. We compute various textual and image-based features and study their distributions and how they associate with campaign fraud. Using these attributes, we build machine learning classifiers, and show that it is possible to automatically classify such fraudulent behavior with up to 90.14% accuracy and 96.01% AUC, only using features available from the campaign's description at the moment of publication (i.e., with no user or money activity), making our method applicable for real-time operation on a user browser.

preprint, 2020, arXiv

Not one but many Tradeoffs: Privacy Vs. Utility in Differentially Private Machine Learning

Data holders are increasingly seeking to protect their users' privacy, whilst still maximizing their ability to produce machine models with high quality predictions. In this work, we empirically evaluate various implementations of differential privacy (DP), and measure their ability to fend off real-world privacy attacks, in addition to measuring their core goal of providing accurate classifications. We establish an evaluation framework to ensure each of these implementations is fairly evaluated. Our selection of DP implementations adds DP noise at different positions within the framework, either at the point of data collection/release, during updates while training the model, or after training by perturbing learned model parameters. We evaluate each implementation across a range of privacy budgets and datasets, each implementation providing the same mathematical privacy guarantees. By measuring the models' resistance to real-world attacks of membership and attribute inference, and their classification accuracy, we determine which implementations provide the most desirable tradeoff between privacy and utility. We found that the number of classes of a given dataset is unlikely to influence where the privacy and utility tradeoff occurs. Additionally, in the scenario that high privacy constraints are required, perturbing input training data does not trade off as much utility, as compared to noise added later in the ML process.

preprint, 2020, arXiv

S2CE: A Hybrid Cloud and Edge Orchestrator for Mining Exascale Distributed Streams

The explosive increase in volume, velocity, variety, and veracity of data generated by distributed and heterogeneous nodes such as IoT and other devices continuously challenges the state of the art in big data processing platforms and mining techniques. Consequently, it reveals an urgent need to address the ever-growing gap between this expected exascale data generation and the extraction of insights from these data. To address this need, this paper proposes Stream to Cloud & Edge (S2CE), a first-of-its-kind, optimized, multi-cloud and edge orchestrator that is easily configurable, scalable, and extensible. S2CE will enable machine and deep learning over voluminous and heterogeneous data streams running on hybrid cloud and edge settings, while offering the necessary functionalities for practical and scalable processing: data fusion and preprocessing, sampling and synthetic stream generation, cloud and edge smart resource management, and distributed processing.

preprint, 2020, arXiv

Stop Tracking Me Bro! Differential Tracking Of User Demographics On Hyper-partisan Websites

Websites with hyper-partisan, left or right-leaning focus offer content that is typically biased towards the expectations of their target audience. Such content often polarizes users, who are repeatedly primed to specific (extreme) content, usually reflecting hard party lines on political and socio-economic topics. Though this polarization has been extensively studied with respect to content, it is still unknown how it associates with the online tracking experienced by browsing users, especially when they exhibit certain demographic characteristics. For example, it is unclear how such websites enable the ad-ecosystem to track users based on their gender or age. In this paper, we take a first step to shed light and measure such potential differences in tracking imposed on users when visiting specific party-line's websites. For this, we design and deploy a methodology to systematically probe such websites and measure differences in user tracking. This methodology allows us to create user personas with specific attributes like gender and age and automate their browsing behavior in a consistent and repeatable manner. Thus, we systematically study how personas are being tracked by these websites and their third parties, especially if they exhibit particular demographic properties. Overall, we test 9 personas on 556 hyper-partisan websites and find that right-leaning websites tend to track users more intensely than left-leaning, depending on user demographics, using both cookies and cookie synchronization methods and leading to more costly delivered ads.

preprint, 2016, arXiv

The Minimum Wiener Connector

The Wiener index of a graph is the sum of all pairwise shortest-path distances between its vertices. In this paper we study the novel problem of finding a minimum Wiener connector: given a connected graph $G=(V,E)$ and a set $Q\subseteq V$ of query vertices, find a subgraph of $G$ that connects all query vertices and has minimum Wiener index. We show that The Minimum Wiener Connector admits a polynomial-time (albeit impractical) exact algorithm for the special case where the number of query vertices is bounded. We show that in general the problem is NP-hard, and has no PTAS unless $\mathbf{P} = \mathbf{NP}$. Our main contribution is a constant-factor approximation algorithm running in time $\widetilde{O}(|Q||E|)$. A thorough experimentation on a large variety of real-world graphs confirms that our method returns smaller and denser solutions than other methods, and does so by adding to the query set $Q$ a small number of important vertices (i.e., vertices with high centrality).
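The definition of the Wiener index above can be made concrete with a short sketch for unweighted, undirected graphs. It computes only the index itself by running a breadth-first search from every vertex; it is not the paper's connector algorithm, and the adjacency-dict representation and function names are assumptions chosen for illustration.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def wiener_index(adj):
    """Sum of shortest-path distances over all unordered vertex pairs."""
    total = 0
    for u in adj:
        dist = bfs_distances(adj, u)
        if len(dist) != len(adj):
            raise ValueError("graph must be connected")
        total += sum(dist.values())
    # Every unordered pair was counted once from each endpoint.
    return total // 2


# A path and a star on four vertices: the star is more compact.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(wiener_index(path))  # 10
print(wiener_index(star))  # 9
```

The star has the lower index because a central vertex shortens the distances between all other vertices, which matches the abstract's observation that good connectors add a few high-centrality vertices to the query set.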

preprint, 2016, arXiv

VHT: Vertical Hoeffding Tree

IoT Big Data requires new machine learning methods able to scale to large size of data arriving at high speed. Decision trees are popular machine learning models since they are very effective, yet easy to interpret and visualize. In the literature, we can find distributed algorithms for learning decision trees, and also streaming algorithms, but not algorithms that combine both features. In this paper we present the Vertical Hoeffding Tree (VHT), the first distributed streaming algorithm for learning decision trees. It features a novel way of distributing decision trees via vertical parallelism. The algorithm is implemented on top of Apache SAMOA, a platform for mining distributed data streams, and thus able to run on real-world clusters. We run several experiments to study the accuracy and throughput performance of our new VHT algorithm, as well as its ability to scale while keeping its superior performance with respect to non-distributed decision trees.

preprint, 2016, arXiv

When Two Choices Are not Enough: Balancing at Scale in Distributed Stream Processing

Carefully balancing load in distributed stream processing systems has a fundamental impact on execution latency and throughput. Load balancing is challenging because real-world workloads are skewed: some tuples in the stream are associated with keys which are significantly more frequent than others. Skew is remarkably more problematic in large deployments: more workers implies fewer keys per worker, so it becomes harder to "average out" the cost of hot keys with cold keys. We propose a novel load balancing technique that uses a heavy hitter algorithm to efficiently identify the hottest keys in the stream. These hot keys are assigned to $d \geq 2$ choices to ensure a balanced load, where $d$ is tuned automatically to minimize the memory and computation cost of operator replication. The technique works online and does not require the use of routing tables. Our extensive evaluation shows that our technique can balance real-world workloads on large deployments, and improve throughput and latency by 150% and 60% respectively over the previous state-of-the-art when deployed on Apache Storm.
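Identifying the hottest keys in one pass with bounded memory can be sketched with the Misra-Gries summary. This is a stand-in chosen for brevity: the abstract does not say which heavy hitter algorithm is used, and the function name and parameter `k` are assumptions. With `k - 1` counters, every key that occurs more than `n / k` times in a stream of length `n` is guaranteed to remain in the summary.

```python
def misra_gries(stream, k):
    """Return a summary holding every key with frequency above len(stream) / k.

    Stored counts are lower bounds on the true frequencies.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical skewed stream: two hot keys and a long tail of cold keys.
stream = ["a"] * 50 + ["b"] * 30 + [f"x{i}" for i in range(20)]
summary = misra_gries(stream, k=4)
hot_keys = set(summary)
```

Keys that end up in `hot_keys` would be the candidates for spreading over more than two workers, while all other keys keep the cheaper default routing.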

preprint, 2015, arXiv

Cultures in Community Question Answering

CQA services are collaborative platforms where users ask and answer questions. We investigate the influence of national culture on people's online questioning and answering behavior. For this, we analyzed a sample of 200 thousand users in Yahoo Answers from 67 countries. We measure empirically a set of cultural metrics defined in Geert Hofstede's cultural dimensions and Robert Levine's Pace of Life and show that behavioral cultural differences exist in community question answering platforms. We find that national cultures differ in Yahoo Answers along a number of dimensions such as temporal predictability of activities, contribution-related behavioral patterns, privacy concerns, and power inequality.

preprint, 2015, arXiv

Dynamic Matrix Factorization with Priors on Unknown Values

Advanced and effective collaborative filtering methods based on explicit feedback assume that unknown ratings do not follow the same model as the observed ones (not missing at random). In this work, we build on this assumption, and introduce a novel dynamic matrix factorization framework that allows setting an explicit prior on unknown values. When new ratings, users, or items enter the system, we can update the factorization in time independent of the size of data (number of users, items and ratings). Hence, we can quickly recommend items even to very recent users. We test our methods on three large datasets, including two very sparse ones, in static and dynamic conditions. In each case, we outrank state-of-the-art matrix factorization methods that do not use a prior on unknown ratings.

preprint, 2015, arXiv

Enabling Social Applications via Decentralized Social Data Management

An unprecedented information wealth produced by online social networks, further augmented by location/collocation data, is currently fragmented across different proprietary services. Combined, it can accurately represent the social world and enable novel socially-aware applications. We present Prometheus, a socially-aware peer-to-peer service that collects social information from multiple sources into a multigraph managed in a decentralized fashion on user-contributed nodes, and exposes it through an interface implementing non-trivial social inferences while complying with user-defined access policies. Simulations and experiments on PlanetLab with emulated application workloads show the system exhibits good end-to-end response time, low communication overhead and resilience to malicious attacks.

preprint, 2015, arXiv

Partial Key Grouping: Load-Balanced Partitioning of Distributed Streams

We study the problem of load balancing in distributed stream processing engines, which is exacerbated in the presence of skew. We introduce Partial Key Grouping (PKG), a new stream partitioning scheme that adapts the classical "power of two choices" to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation. In so doing, it achieves better load balancing than key grouping while being more scalable than shuffle grouping. We test PKG on several large datasets, both real-world and synthetic. Compared to standard hashing, PKG reduces the load imbalance by up to several orders of magnitude, and often achieves nearly-perfect load balance. This result translates into an improvement of up to 175% in throughput and up to 45% in latency when deployed on a real Storm cluster. PKG has been integrated in Apache Storm v0.10.
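The two techniques named above, key splitting and local load estimation, can be illustrated with a minimal sketch. It is not the Apache Storm implementation: the class name, the SHA-256-based hash functions, and the tie-breaking rule are assumptions made for illustration. Each key has two candidate workers, and the sender picks whichever candidate it has itself sent fewer tuples to so far.

```python
import hashlib


def _hash(key, seed, n_workers):
    """Deterministic seeded hash of a key onto a worker index."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_workers


class PartialKeyGrouper:
    """Routes each tuple to the less loaded of its key's two candidates."""

    def __init__(self, n_workers):
        self.n_workers = n_workers
        # Local load estimation: counts only what this sender has routed.
        self.local_load = [0] * n_workers

    def candidates(self, key):
        return _hash(key, 0, self.n_workers), _hash(key, 1, self.n_workers)

    def route(self, key):
        a, b = self.candidates(key)
        chosen = a if self.local_load[a] <= self.local_load[b] else b
        self.local_load[chosen] += 1
        return chosen


# Hypothetical skewed stream: one hot key plus many cold keys.
stream = ["hot"] * 1000 + [f"k{i}" for i in range(200)]
grouper = PartialKeyGrouper(n_workers=8)
assignments = [grouper.route(key) for key in stream]
print(grouper.local_load)
```

Because a key can reach at most two workers, downstream operators only need to merge two partial states per key, which is the trade-off that sits between key grouping (one worker per key) and shuffle grouping (any worker).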

preprint, 2015, arXiv

Privacy Concerns vs. User Behavior in Community Question Answering

Community-based question answering (CQA) platforms are crowd-sourced services for sharing user expertise on various topics, from mechanical repairs to parenting. While they naturally build in an online social network infrastructure, they carry a very different purpose from Facebook-like social networks, where users "hang out" with their friends and tend to share more personal information. It is unclear, thus, how the privacy concerns and their correlation with user behavior in an online social network translate into a CQA platform. This study analyzes one year of recorded traces from a mature CQA platform to understand the association between users' privacy concerns as manifested by their account settings and their activity in the platform. The results show that privacy preference is correlated with behavior in the community in terms of engagement, retention, accomplishments and deviance from the norm. We find privacy-concerned users have higher qualitative and quantitative contributions, show higher retention, report more abuses, have higher perception on answer quality and have larger social circles. However, at the same time, these users also exhibit more deviant behavior than the users with public profiles.

preprint, 2015, arXiv

Scalable Online Betweenness Centrality in Evolving Graphs

Betweenness centrality is a classic measure that quantifies the importance of a graph element (vertex or edge) according to the fraction of shortest paths passing through it. This measure is notoriously expensive to compute, and the best known algorithm runs in O(nm) time. The problems of efficiency and scalability are exacerbated in a dynamic setting, where the input is an evolving graph seen edge by edge, and the goal is to keep the betweenness centrality up to date. In this paper we propose the first truly scalable algorithm for online computation of betweenness centrality of both vertices and edges in an evolving graph where new edges are added and existing edges are removed. Our algorithm is carefully engineered with out-of-core techniques and tailored for modern parallel stream processing engines that run on clusters of shared-nothing commodity hardware. Hence, it is amenable to real-world deployment. We experiment on graphs that are two orders of magnitude larger than previous studies. Our method is able to keep the betweenness centrality measures up to date online, i.e., the time to update the measures is smaller than the inter-arrival time between two consecutive updates.
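For reference, the static O(nm) computation mentioned above can be sketched with Brandes' algorithm for unweighted, undirected graphs. This is only the offline baseline for vertices, not the paper's online algorithm for evolving graphs; the adjacency-dict representation and function name are assumptions for illustration.

```python
from collections import deque


def betweenness_centrality(adj):
    """Vertex betweenness of an unweighted, undirected graph (Brandes)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths (sigma) and predecessors.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: accumulate dependencies in order of decreasing distance.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected graph each pair is seen from both endpoints.
    return {v: c / 2 for v, c in bc.items()}


path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(betweenness_centrality(path))  # {0: 0.0, 1: 2.0, 2: 2.0, 3: 0.0}
```

Recomputing this from scratch after every edge insertion or removal costs O(nm) per update, which is the cost an online algorithm has to avoid.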

preprint · 2015 · arXiv

Socially-Aware Distributed Hash Tables for Decentralized Online Social Networks

Many decentralized online social networks (DOSNs) have been proposed due to an increase in awareness related to privacy and scalability issues in centralized social networks. Such decentralized networks transfer processing and storage functionalities from the service providers towards the end users. DOSNs require individualistic implementations of services (i.e., search, information dissemination, storage, and publish/subscribe). However, many of these services mostly perform social queries, where OSN users are interested in accessing information of their friends. In our work, we design a socially-aware distributed hash table (DHT) for efficient implementation of DOSNs. In particular, we propose a gossip-based algorithm to place users in a DHT, while maximizing the social awareness among them. Through a set of experiments, we show that our approach reduces the lookup latency by almost 30% and improves the reliability of the communication by nearly 10% via trusted contacts.

preprint · 2015 · arXiv

The Power of Both Choices: Practical Load Balancing for Distributed Stream Processing Engines

We study the problem of load balancing in distributed stream processing engines, which is exacerbated in the presence of skew. We introduce Partial Key Grouping (PKG), a new stream partitioning scheme that adapts the classical "power of two choices" to a distributed streaming setting by leveraging two novel techniques: key splitting and local load estimation. In so doing, it achieves better load balancing than key grouping while being more scalable than shuffle grouping. We test PKG on several large datasets, both real-world and synthetic. Compared to standard hashing, PKG reduces the load imbalance by up to several orders of magnitude, and often achieves nearly-perfect load balance. This result translates into an improvement of up to 60% in throughput and up to 45% in latency when deployed on a real Storm cluster.
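
The two techniques named in the abstract can be illustrated with a small sketch. Key splitting is modeled here as giving each key two candidate workers through two hash functions, and local load estimation as each source counting only the messages it has sent itself. The class name, the `candidates` helper and the SHA-256-based hashing are assumptions for illustration, not the paper's implementation or any Storm API.

```python
import hashlib


def candidates(key, num_workers):
    """Return the two candidate workers for a key (hypothetical helper)."""
    def h(seed):
        digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % num_workers
    return h(0), h(1)


class PartialKeyGrouper:
    """Routes messages from one source using two choices per key."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        # Local estimate: messages sent by THIS source to each worker.
        self.local_load = [0] * num_workers

    def route(self, key):
        a, b = candidates(key, self.num_workers)
        # Send to whichever candidate this source has loaded less so far.
        target = a if self.local_load[a] <= self.local_load[b] else b
        self.local_load[target] += 1
        return target


grouper = PartialKeyGrouper(num_workers=8)
targets = [grouper.route("hot-key") for _ in range(1000)]
```

Because a key can land on either of its two candidates, a downstream operator would need to merge partial state for the same key from both workers. In exchange, a single hot key no longer overloads one worker as it would under plain key grouping.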

preprint · 2015 · arXiv

The Social World of Content Abusers in Community Question Answering

Community-based question answering platforms can be rich sources of information on a variety of specialized topics, from finance to cooking. The usefulness of such platforms depends heavily on user contributions (questions and answers), but also on respecting the community rules. As a crowd-sourced service, such platforms rely on their users for monitoring and flagging content that violates community rules. Common wisdom is to eliminate the users who receive many flags. Our analysis of a year of traces from a mature Q&A site shows that the number of flags does not tell the full story: on one hand, users with many flags may still contribute positively to the community. On the other hand, users who never get flagged are found to violate community rules and get their accounts suspended. This analysis, however, also shows that abusive users are betrayed by their network properties: we find strong evidence of homophilous behavior and use this finding to detect abusive users who go under the community radar. Based on our empirical observations, we build a classifier that is able to detect abusive users with an accuracy as high as 83%.

preprint · 2014 · arXiv

The power of indirect social ties

While direct social ties have been intensely studied in the context of computer-mediated social networks, indirect ties (e.g., friends of friends) have seen little attention. Yet in real life, we often rely on friends of our friends for recommendations (of good doctors, good schools, or good babysitters), for introduction to a new job opportunity, and for many other occasional needs. In this work we attempt to 1) quantify the strength of indirect social ties, 2) validate it, and 3) empirically demonstrate its usefulness for distributed applications on two examples. We quantify social strength of indirect ties using a(ny) measure of the strength of the direct ties that connect two people and the intuition provided by the sociology literature. We validate the proposed metric experimentally by comparing correlations with other direct social tie evaluators. We show via data-driven experiments that the proposed metric for social strength can be used successfully for social applications. Specifically, we show that it alleviates known problems in friend-to-friend storage systems by addressing two previously documented shortcomings: reduced set of storage candidates and data availability correlations. We also show that it can be used for predicting the effects of a social diffusion with an accuracy of up to 93.5%.

preprint · 2013 · arXiv

Data Survivability in Networks of Mobile Robots in Urban Disaster Environments

Mobile multi-robot teams deployed for monitoring or search-and-rescue missions in urban disaster areas can greatly improve the quality of vital data collected on-site. Analysis of such data can identify hazards and save lives. Unfortunately, such real deployments at scale are cost prohibitive and robot failures lead to data loss. Moreover, scaled-down deployments do not capture significant levels of interaction and communication complexity. To tackle this problem, we propose novel mobility and failure generation frameworks that allow realistic simulations of mobile robot networks for large scale disaster scenarios. Furthermore, since data replication techniques can improve the survivability of data collected during the operation, we propose an adaptive, scalable data replication technique that achieves high data survivability with low overhead. Our technique considers the anticipated robot failures and robot heterogeneity to decide how aggressively to replicate data. In addition, it considers survivability priorities, with some data requiring more effort to be saved than others. Using our novel simulation generation frameworks, we compare our adaptive technique with flooding and broadcast-based replication techniques and show that for failure rates of up to 60% it ensures better data survivability with lower communication costs.

preprint · 2012 · arXiv

Leveraging Peer Centrality in the Design of Socially-Informed Peer-to-Peer Systems

Social applications mine user social graphs to improve performance in search, provide recommendations, allow resource sharing and increase data privacy. When such applications are implemented on a peer-to-peer (P2P) architecture, the social graph is distributed on the P2P system: the traversal of the social graph translates into a socially-informed routing in the peer-to-peer layer. In this work we introduce the model of a projection graph that is the result of decentralizing a social graph onto a peer-to-peer network. We focus on three social network metrics: degree, node betweenness and edge betweenness centrality and analytically formulate the relation between metrics in the social graph and in the projection graph. Through experimental evaluation on real networks, we demonstrate that when mapping user communities of sizes up to 50-150 users on each peer, the association between the properties of the social graph and the projection graph is high, and thus the properties of the (dynamic) projection graph can be inferred from the properties of the (slower changing) social graph. Furthermore, we demonstrate with two application scenarios on large-scale social networks the usability of the projection graph in designing social search applications and unstructured P2P overlays.
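
One way to picture the projection graph is sketched below, under an assumed definition: two peers are linked when at least one social edge connects users hosted on them, and the link weight counts those social edges. The abstract does not spell out the exact construction, so the definition, the function name and the input formats are all assumptions for illustration.

```python
from collections import Counter


def project(social_edges, user_to_peer):
    """Build a weighted peer-level graph from a social graph.

    social_edges: iterable of (user, user) pairs.
    user_to_peer: dict mapping each user to the peer that hosts it.
    Returns a Counter keyed by frozenset({peer_a, peer_b}).
    """
    weights = Counter()
    for u, v in social_edges:
        pu, pv = user_to_peer[u], user_to_peer[v]
        if pu != pv:
            # Only social edges that cross peers create peer-level links.
            weights[frozenset((pu, pv))] += 1
    return weights


edges = [(1, 2), (2, 3), (3, 4), (1, 4)]
hosting = {1: "A", 2: "A", 3: "B", 4: "B"}
projection = project(edges, hosting)
```

In this toy example, two of the four social edges cross between peers A and B, so the projection has a single link of weight 2. Social edges between users on the same peer never reach the P2P layer.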

preprint · 2011 · arXiv

Cheaters in the Steam Community Gaming Social Network

Online gaming is a multi-billion-dollar industry that entertains a large, global population. One unfortunate phenomenon, however, poisons the competition and the fun: cheating. The costs of cheating range from industry-supported expenditures to detect and limit cheating, to victims' monetary losses due to cyber crime. This paper studies cheaters in the Steam Community, an online social network built on top of the world's dominant digital game delivery platform. We collected information about more than 12 million gamers connected in a global social network, of whom more than 700 thousand have their profiles flagged as cheaters. We also collected in-game interaction data of over 10 thousand players from a popular multiplayer gaming server. We show that cheaters are well embedded in the social and interaction networks: their network position is largely indistinguishable from that of fair players. We observe that the cheating behavior appears to spread through a social mechanism: the presence and the number of cheater friends of a fair player is correlated with the likelihood of her becoming a cheater in the future. Also, we observe that there is a social penalty involved with being labeled as a cheater: cheaters are likely to switch to more restrictive privacy settings once they are tagged and they lose more friends than fair players. Finally, we observe that the number of cheaters is not correlated with the geographical, real-world population density, or with the local popularity of the Steam Community. This analysis can ultimately inform the design of mechanisms to deal with anti-social behavior (e.g., spamming, automated collection of data) in generic online social networks.

30 published items

A deep dive into the consistently toxic 1% of Twitter

Hierarchical Federated Learning with Privacy

Leveraging Google's Publisher-specific IDs to Detect Website Administration

Measuring the (Over)use of Service Workers for In-Page Push Advertising Purposes

User Tracking in the Post-cookie Era: How Websites Bypass GDPR Consent to Track Users

YouTubers Not madeForKids: Detecting Channels Sharing Inappropriate Videos Targeting Children

Differential Tracking Across Topical Webpages of Indian News Media

Under the Spotlight: Web Tracking in Indian Partisan News Websites

Clash of the Trackers: Measuring the Evolution of the Online Tracking Ecosystem

Cookie Synchronization: Everything You Always Wanted to Know But Were Afraid to Ask

I call BS: Fraud Detection in Crowdfunding Campaigns

Not one but many Tradeoffs: Privacy Vs. Utility in Differentially Private Machine Learning

S2CE: A Hybrid Cloud and Edge Orchestrator for Mining Exascale Distributed Streams

Stop Tracking Me Bro! Differential Tracking Of User Demographics On Hyper-partisan Websites

The Minimum Wiener Connector

VHT: Vertical Hoeffding Tree

When Two Choices Are not Enough: Balancing at Scale in Distributed Stream Processing

Cultures in Community Question Answering

Dynamic Matrix Factorization with Priors on Unknown Values

Enabling Social Applications via Decentralized Social Data Management

Partial Key Grouping: Load-Balanced Partitioning of Distributed Streams

Privacy Concerns vs. User Behavior in Community Question Answering

Scalable Online Betweenness Centrality in Evolving Graphs

Socially-Aware Distributed Hash Tables for Decentralized Online Social Networks

The Power of Both Choices: Practical Load Balancing for Distributed Stream Processing Engines

The Social World of Content Abusers in Community Question Answering

The power of indirect social ties

Data Survivability in Networks of Mobile Robots in Urban Disaster Environments

Leveraging Peer Centrality in the Design of Socially-Informed Peer-to-Peer Systems

Cheaters in the Steam Community Gaming Social Network