Source author record

Emiliano De Cristofaro

Emiliano De Cristofaro appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Cryptography and Security cs.CY Social and Information Networks Networking and Internet Architecture Artificial Intelligence Genomics Human-Computer Interaction Machine Learning Computational Engineering, Finance, and Science Computer Vision Emerging Technologies physics.soc-ph Software Engineering

Catalog footprint

What is connected

45works

13topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Adherence to Misinformation on Social Media Through Socio-Cognitive and Group-Based Processes

Previous work suggests that people's preference for different kinds of information depends on more than just accuracy. This could happen because the messages contained within different pieces of information may either be well-liked or repulsive. Whereas factual information must often convey uncomfortable truths, misinformation can have little regard for veracity and leverage psychological processes which increase its attractiveness and proliferation on social media. In this review, we argue that when misinformation proliferates, this happens because the social media environment enables adherence to misinformation by reducing, rather than increasing, the psychological cost of doing so. We cover how attention may often be shifted away from accuracy and towards other goals, how social and individual cognition is affected by misinformation and the cases under which debunking it is most effective, and how the formation of online groups affects information consumption patterns, often leading to more polarization and radicalization. Throughout, we make the case that polarization and misinformation adherence are closely tied. We identify ways in which the psychological cost of adhering to misinformation can be increased when designing anti-misinformation interventions or resilient affordances, and we outline open research questions that the CSCW community can take up in further understanding this cost.

preprint2022arXiv

Cerberus: Exploring Federated Prediction of Security Events

Modern defenses against cyberattacks increasingly rely on proactive approaches, e.g., to predict the adversary's next actions based on past events. Building accurate prediction models requires knowledge from many organizations; alas, this entails disclosing sensitive information, such as network structures, security postures, and policies, which might often be undesirable or outright impossible. In this paper, we explore the feasibility of using Federated Learning (FL) to predict future security events. To this end, we introduce Cerberus, a system enabling collaborative training of Recurrent Neural Network (RNN) models for participating organizations. The intuition is that FL could potentially offer a middle-ground between the non-private approach where the training data is pooled at a central server and the low-utility alternative of only training local models. We instantiate Cerberus on a dataset obtained from a major security company's intrusion prevention product and evaluate it vis-a-vis utility, robustness, and privacy, as well as how participants contribute to and benefit from the system. Overall, our work sheds light on both the positive aspects and the challenges of using FL for this task and paves the way for deploying federated approaches to predictive security.

preprint2022arXiv

Feels Bad Man: Dissecting Automated Hateful Meme Detection Through the Lens of Facebook's Challenge

Internet memes have become a dominant method of communication; at the same time, however, they are also increasingly being used to advocate extremism and foster derogatory beliefs. Nonetheless, we do not have a firm understanding as to which perceptual aspects of memes cause this phenomenon. In this work, we assess the efficacy of current state-of-the-art multimodal machine learning models toward hateful meme detection, and in particular with respect to their generalizability across platforms. We use two benchmark datasets comprising 12,140 and 10,567 images from 4chan's "Politically Incorrect" board (/pol/) and Facebook's Hateful Memes Challenge dataset to train the competition's top-ranking machine learning models for the discovery of the most prominent features that distinguish viral hateful memes from benign ones. We conduct three experiments to determine the importance of multimodality on classification performance, the influential capacity of fringe Web communities on mainstream social platforms and vice versa, and the models' learning transferability on 4chan memes. Our experiments show that memes' image characteristics provide a greater wealth of information than its textual content. We also find that current systems developed for online detection of hate speech in memes necessitate further concentration on its visual elements to improve their interpretation of underlying cultural connotations, implying that multimodal models fail to adequately grasp the intricacies of hate speech in memes and generalize across social media platforms.

preprint2022arXiv

Local and Central Differential Privacy for Robustness and Privacy in Federated Learning

Federated Learning (FL) allows multiple participants to train machine learning models collaboratively by keeping their datasets local while only exchanging model updates. Alas, this is not necessarily free from privacy and robustness vulnerabilities, e.g., via membership, property, and backdoor attacks. This paper investigates whether and to what extent one can use differential Privacy (DP) to protect both privacy and robustness in FL. To this end, we present a first-of-its-kind evaluation of Local and Central Differential Privacy (LDP/CDP) techniques in FL, assessing their feasibility and effectiveness. Our experiments show that both DP variants do d fend against backdoor attacks, albeit with varying levels of protection-utility trade-offs, but anyway more effectively than other robustness defenses. DP also mitigates white-box membership inference attacks in FL, and our work is the first to show it empirically. Neither LDP nor CDP, however, defend against property inference. Overall, our work provides a comprehensive, re-usable measurement methodology to quantify the trade-offs between robustness/privacy and utility in differentially private FL.

preprint2022arXiv

On Utility and Privacy in Synthetic Genomic Data

The availability of genomic data is essential to progress in biomedical research, personalized medicine, etc. However, its extreme sensitivity makes it problematic, if not outright impossible, to publish or share it. As a result, several initiatives have been launched to experiment with synthetic genomic data, e.g., using generative models to learn the underlying distribution of the real data and generate artificial datasets that preserve its salient characteristics without exposing it. This paper provides the first evaluation of both utility and privacy protection of six state-of-the-art models for generating synthetic genomic data. We assess the performance of the synthetic data on several common tasks, such as allele population statistics and linkage disequilibrium. We then measure privacy through the lens of membership inference attacks, i.e., inferring whether a record was part of the training data. Our experiments show that no single approach to generate synthetic genomic data yields both high utility and strong privacy across the board. Also, the size and nature of the training dataset matter. Moreover, while some combinations of datasets and models produce synthetic data with distributions close to the real data, there often are target data points that are vulnerable to membership inference. Looking forward, our techniques can be used by practitioners to assess the risks of deploying synthetic genomic data in the wild and serve as a benchmark for future work.

preprint2022arXiv

Robin Hood and Matthew Effects: Differential Privacy Has Disparate Impact on Synthetic Data

Generative models trained with Differential Privacy (DP) can be used to generate synthetic data while minimizing privacy risks. We analyze the impact of DP on these models vis-a-vis underrepresented classes/subgroups of data, specifically, studying: 1) the size of classes/subgroups in the synthetic data, and 2) the accuracy of classification tasks run on them. We also evaluate the effect of various levels of imbalance and privacy budgets. Our analysis uses three state-of-the-art DP models (PrivBayes, DP-WGAN, and PATE-GAN) and shows that DP yields opposite size distributions in the generated synthetic data. It affects the gap between the majority and minority classes/subgroups; in some cases by reducing it (a "Robin Hood" effect) and, in others, by increasing it (a "Matthew" effect). Either way, this leads to (similar) disparate impacts on the accuracy of classification tasks on the synthetic data, affecting disproportionately more the underrepresented subparts of the data. Consequently, when training models on synthetic data, one might incur the risk of treating different subpopulations unevenly, leading to unreliable or unfair conclusions.

preprint2022arXiv

The Gospel According to Q: Understanding the QAnon Conspiracy from the Perspective of Canonical Information

The QAnon conspiracy theory claims that a cabal of (literally) blood-thirsty politicians and media personalities are engaged in a war to destroy society. By interpreting cryptic "drops" of information from an anonymous insider calling themself Q, adherents of the conspiracy theory believe that Donald Trump is leading them in an active fight against this cabal. QAnon has been covered extensively by the media, as its adherents have been involved in multiple violent acts, including the January 6th, 2021 seditious storming of the US Capitol building. Nevertheless, we still have relatively little understanding of how the theory evolved and spread on the Web, and the role played in that by multiple platforms. To address this gap, we study QAnon from the perspective of "Q" themself. We build a dataset of 4,949 canonical Q drops collected from six "aggregation sites," which curate and archive them from their original posting to anonymous and ephemeral image boards. We expose that these sites have a relatively low (overall) agreement, and thus at least some Q drops should probably be considered apocryphal. We then analyze the Q drops' contents to identify topics of discussion and find statistically significant indications that drops were not authored by a single individual. Finally, we look at how posts on Reddit are used to disseminate Q drops to wider audiences. We find that dissemination was (initially) limited to a few sub-communities and that, while heavy-handed moderation decisions have reduced the overall issue, the "gospel" of Q persists on the Web.

preprint2022arXiv

Toxicity in the Decentralized Web and the Potential for Model Sharing

The "Decentralised Web" (DW) is an evolving concept, which encompasses technologies aimed at providing greater transparency and openness on the web. The DW relies on independent servers (aka instances) that mesh together in a peer-to-peer fashion to deliver a range of services (e.g. micro-blogs, image sharing, video streaming). However, toxic content moderation in this decentralised context is challenging. This is because there is no central entity that can define toxicity, nor a large central pool of data that can be used to build universal classifiers. It is therefore unsurprising that there have been several high-profile cases of the DW being misused to coordinate and disseminate harmful material. Using a dataset of 9.9M posts from 117K users on Pleroma (a popular DW microblogging service), we quantify the presence of toxic content. We find that toxic content is prevalent and spreads rapidly between instances. We show that automating per-instance content moderation is challenging due to the lack of sufficient training data available and the effort required in labelling. We therefore propose and evaluate ModPair, a model sharing system that effectively detects toxic content, gaining an average per-instance macro-F1 score 0.89.

preprint2022arXiv

Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots

Chatbots are used in many applications, e.g., automated agents, smart home assistants, interactive characters in online games, etc. Therefore, it is crucial to ensure they do not behave in undesired manners, providing offensive or toxic responses to users. This is not a trivial task as state-of-the-art chatbot models are trained on large, public datasets openly collected from the Internet. This paper presents a first-of-its-kind, large-scale measurement of toxicity in chatbots. We show that publicly available chatbots are prone to providing toxic responses when fed toxic queries. Even more worryingly, some non-toxic queries can trigger toxic responses too. We then set out to design and experiment with an attack, ToxicBuddy, which relies on fine-tuning GPT-2 to generate non-toxic queries that make chatbots respond in a toxic manner. Our extensive experimental evaluation demonstrates that our attack is effective against public chatbot models and outperforms manually-crafted malicious queries proposed by previous work. We also evaluate three defense mechanisms against ToxicBuddy, showing that they either reduce the attack performance at the cost of affecting the chatbot's utility or are only effective at mitigating a portion of the attack. This highlights the need for more research from the computer security and online safety communities to ensure that chatbot models do not hurt their users. Overall, we are confident that ToxicBuddy can be used as an auditing tool and that our work will pave the way toward designing more effective defenses for chatbot safety.

preprint2021arXiv

"Is it a Qoincidence?": An Exploratory Study of QAnon on Voat

Online fringe communities offer fertile grounds for users seeking and sharing ideas fueling suspicion of mainstream news and conspiracy theories. Among these, the QAnon conspiracy theory emerged in 2017 on 4chan, broadly supporting the idea that powerful politicians, aristocrats, and celebrities are closely engaged in a global pedophile ring. Simultaneously, governments are thought to be controlled by "puppet masters," as democratically elected officials serve as a fake showroom of democracy. This paper provides an empirical exploratory analysis of the QAnon community on Voat.co, a Reddit-esque news aggregator, which has captured the interest of the press for its toxicity and for providing a platform to QAnon followers. More precisely, we analyze a large dataset from /v/GreatAwakening, the most popular QAnon-related subverse (the Voat equivalent of a subreddit), to characterize activity and user engagement. To further understand the discourse around QAnon, we study the most popular named entities mentioned in the posts, along with the most prominent topics of discussion, which focus on US politics, Donald Trump, and world events. We also use word embeddings to identify narratives around QAnon-specific keywords. Our graph visualization shows that some of the QAnon-related ones are closely related to those from the Pizzagate conspiracy theory and so-called drops by "Q." Finally, we analyze content toxicity, finding that discussions on /v/GreatAwakening are less toxic than in the broad Voat community.

preprint2021arXiv

A Multi-Platform Analysis of Political News Discussion and Sharing on Web Communities

The news ecosystem has become increasingly complex, encompassing a wide range of sources with varying levels of trustworthiness, and with public commentary giving different spins to the same stories. In this paper, we present a multi-platform measurement of this ecosystem. We compile a list of 1,073 news websites and extract posts from four Web communities (Twitter, Reddit, 4chan, and Gab) that contain URLs from these sources. This yields a dataset of 38M posts containing 15M news URLs, spanning almost three years. We study the data along several axes, assessing the trustworthiness of shared news, designing a method to group news articles into stories, analyzing these stories are discussed and measuring the influence various Web communities have in that. Our analysis shows that different communities discuss different types of news, with polarized communities like Gab and /r/The_Donald subreddit disproportionately referencing untrustworthy sources. We also find that fringe communities often have a disproportionate influence on other platforms w.r.t. pushing narratives around certain news, for example about political elections, immigration, or foreign policy.

preprint2021arXiv

Dissecting the Meme Magic: Understanding Indicators of Virality in Image Memes

Despite the increasingly important role played by image memes, we do not yet have a solid understanding of the elements that might make a meme go viral on social media. In this paper, we investigate what visual elements distinguish image memes that are highly viral on social media from those that do not get re-shared, across three dimensions: composition, subjects, and target audience. Drawing from research in art theory, psychology, marketing, and neuroscience, we develop a codebook to characterize image memes, and use it to annotate a set of 100 image memes collected from 4chan's Politically Incorrect Board (/pol/). On the one hand, we find that highly viral memes are more likely to use a close-up scale, contain characters, and include positive or negative emotions. On the other hand, image memes that do not present a clear subject the viewer can focus attention on, or that include long text are not likely to be re-shared by users. We train machine learning models to distinguish between image memes that are likely to go viral and those that are unlikely to be re-shared, obtaining an AUC of 0.866 on our dataset. We also show that the indicators of virality identified by our model can help characterize the most viral memes posted on mainstream online social networks too, as our classifiers are able to predict 19 out of the 20 most popular image memes posted on Twitter and Reddit between 2016 and 2018. Overall, our analysis sheds light on what indicators characterize viral and non-viral visual content online, and set the basis for developing better techniques to create or moderate content that is more likely to catch the viewer's attention.

preprint2020arXiv

An Overview of Privacy in Machine Learning

Over the past few years, providers such as Google, Microsoft, and Amazon have started to provide customers with access to software interfaces allowing them to easily embed machine learning tasks into their applications. Overall, organizations can now use Machine Learning as a Service (MLaaS) engines to outsource complex tasks, e.g., training classifiers, performing predictions, clustering, etc. They can also let others query models trained on their data. Naturally, this approach can also be used (and is often advocated) in other contexts, including government collaborations, citizen science projects, and business-to-business partnerships. However, if malicious users were able to recover data used to train these models, the resulting information leakage would create serious issues. Likewise, if the inner parameters of the model are considered proprietary information, then access to the model should not allow an adversary to learn such parameters. In this document, we set to review privacy challenges in this space, providing a systematic review of the relevant research literature, also exploring possible countermeasures. More specifically, we provide ample background information on relevant concepts around machine learning and privacy. Then, we discuss possible adversarial models and settings, cover a wide range of attacks that relate to private and/or sensitive information leakage, and review recent results attempting to defend against such attacks. Finally, we conclude with a list of open problems that require more work, including the need for better evaluations, more targeted defenses, and the study of the relation to policy and data protection efforts.

preprint2020arXiv

Measuring Membership Privacy on Aggregate Location Time-Series

While location data is extremely valuable for various applications, disclosing it prompts serious threats to individuals' privacy. To limit such concerns, organizations often provide analysts with aggregate time-series that indicate, e.g., how many people are in a location at a time interval, rather than raw individual traces. In this paper, we perform a measurement study to understand Membership Inference Attacks (MIAs) on aggregate location time-series, where an adversary tries to infer whether a specific user contributed to the aggregates. We find that the volume of contributed data, as well as the regularity and particularity of users' mobility patterns, play a crucial role in the attack's success. We experiment with a wide range of defenses based on generalization, hiding, and perturbation, and evaluate their ability to thwart the attack vis-a-vis the utility loss they introduce for various mobility analytics tasks. Our results show that some defenses fail across the board, while others work for specific tasks on aggregate location time-series. For instance, suppressing small counts can be used for ranking hotspots, data generalization for forecasting traffic, hotspot discovery, and map inference, while sampling is effective for location labeling and anomaly detection when the dataset is sparse. Differentially private techniques provide reasonable accuracy only in very specific settings, e.g., discovering hotspots and forecasting their traffic, and more so when using weaker privacy notions like crowd-blending privacy. Overall, our measurements show that there does not exist a unique generic defense that can preserve the utility of the analytics for arbitrary applications, and provide useful insights regarding the disclosure of sanitized aggregate location time-series.

preprint2020arXiv

On the Feasibility of Acoustic Attacks Using Commodity Smart Devices

Sound at frequencies above (ultrasonic) or below (infrasonic) the range of human hearing can, in some settings, cause adverse physiological and psychological effects to individuals. In this paper, we investigate the feasibility of cyber-attacks that could make smart consumer devices produce possibly imperceptible sound at both high (17-21kHz) and low (60-100Hz) frequencies, at the maximum available volume setting, potentially turning them into acoustic cyber-weapons. To do so, we deploy attacks targeting different smart devices and take sound measurements in an anechoic chamber. For comparison, we also test possible attacks on traditional devices. Overall, we find that many of the devices tested are capable of reproducing frequencies within both high and low ranges, at levels exceeding those recommended in published guidelines. Generally speaking, such attacks are often trivial to develop and in many cases could be added to existing malware payloads, as they may be attractive to adversaries with specific motivations or targets. Finally, we suggest a number of countermeasures, both platform-specific and generic ones.

preprint2020arXiv

Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board

This paper presents a dataset with over 3.3M threads and 134.5M posts from the Politically Incorrect board (/pol/) of the imageboard forum 4chan, posted over a period of almost 3.5 years (June 2016-November 2019). To the best of our knowledge, this represents the largest publicly available 4chan dataset, providing the community with an archive of posts that have been permanently deleted from 4chan and are otherwise inaccessible. We augment the data with a set of additional labels, including toxicity scores and the named entities mentioned in each post. We also present a statistical analysis of the dataset, providing an overview of what researchers interested in using it can expect, as well as a simple content analysis, shedding light on the most prominent discussion topics, the most popular entities mentioned, and the toxicity level of each post. Overall, we are confident that our work will motivate and assist researchers in studying and understanding 4chan, as well as its role on the greater Web. For instance, we hope this dataset may be used for cross-platform studies of social media, as well as being useful for other types of research like natural language processing. Finally, our dataset can assist qualitative work focusing on in-depth case studies of specific narratives, events, or social theories.

preprint2019arXiv

Characterizing the Use of Images in State-Sponsored Information Warfare Operations by Russian Trolls on Twitter

State-sponsored organizations are increasingly linked to efforts aimed to exploit social media for information warfare and manipulating public opinion. Typically, their activities rely on a number of social network accounts they control, aka trolls, that post and interact with other users disguised as "regular" users. These accounts often use images and memes, along with textual content, in order to increase the engagement and the credibility of their posts. In this paper, we present the first study of images shared by state-sponsored accounts by analyzing a ground truth dataset of 1.8M images posted to Twitter by accounts controlled by the Russian Internet Research Agency. First, we analyze the content of the images as well as their posting activity. Then, using Hawkes Processes, we quantify their influence on popular Web communities like Twitter, Reddit, 4chan's Politically Incorrect board (/pol/), and Gab, with respect to the dissemination of images. We find that the extensive image posting activity of Russian trolls coincides with real-world events (e.g., the Unite the Right rally in Charlottesville), and shed light on their targets as well as the content disseminated via images. Finally, we show that the trolls were more effective in disseminating politics-related imagery than other images.

preprint2016arXiv

Ad-Blocking and Counter Blocking: A Slice of the Arms Race

Adblocking tools like Adblock Plus continue to rise in popularity, potentially threatening the dynamics of advertising revenue streams. In response, a number of publishers have ramped up efforts to develop and deploy mechanisms for detecting and/or counter-blocking adblockers (which we refer to as anti-adblockers), effectively escalating the online advertising arms race. In this paper, we develop a scalable approach for identifying third-party services shared across multiple web-sites and use it to provide a first characterization of anti-adblocking across the Alexa Top-5K websites. We map websites that perform anti-adblocking as well as the entities that provide anti-adblocking scripts. We study the modus operandi of these scripts and their impact on popular adblockers. We find that at least 6.7% of websites in the Alexa Top-5K use anti-adblocking scripts, acquired from 12 distinct entities -- some of which have a direct interest in nourishing the online advertising industry.

preprint2016arXiv

Combating Fraud in Online Social Networks: Detecting Stealthy Facebook Like Farms

As businesses increasingly rely on social networking sites to engage with their customers, it is crucial to understand and counter reputation manipulation activities, including fraudulently boosting the number of Facebook page likes using like farms. To this end, several fraud detection algorithms have been proposed and some deployed by Facebook that use graph co-clustering to distinguish between genuine likes and those generated by farm-controlled profiles. However, as we show in this paper, these tools do not work well with stealthy farms whose users spread likes over longer timespans and like popular pages, aiming to mimic regular users. We present an empirical analysis of the graph-based detection tools used by Facebook and highlight their shortcomings against more sophisticated farms. Next, we focus on characterizing content generated by social networks accounts on their timelines, as an indicator of genuine versus fake social activity. We analyze a wide range of features extracted from timeline posts, which we group into two main classes: lexical and non-lexical. We postulate and verify that like farm accounts tend to often re-share content, use fewer words and poorer vocabulary, and more often generate duplicate comments and likes compared to normal users. We extract relevant lexical and non-lexical features and and use them to build a classifier to detect like farms accounts, achieving significantly higher accuracy, namely, at least 99% precision and 93% recall.

preprint2016arXiv

Efficient Private Statistics with Succinct Sketches

Large-scale collection of contextual information is often essential in order to gather statistics, train machine learning models, and extract knowledge from data. The ability to do so in a {\em privacy-preserving} way -- i.e., without collecting fine-grained user data -- enables a number of additional computational scenarios that would be hard, or outright impossible, to realize without strong privacy guarantees. In this paper, we present the design and implementation of practical techniques for privately gathering statistics from large data streams. We build on efficient cryptographic protocols for private aggregation and on data structures for succinct data representation, namely, Count-Min Sketch and Count Sketch. These allow us to reduce the communication and computation complexity incurred by each data source (e.g., end-users) from linear to logarithmic in the size of their input, while introducing a parametrized upper-bounded error that does not compromise the quality of the statistics. We then show how to use our techniques, efficiently, to instantiate real-world privacy-friendly systems, supporting recommendations for media streaming services, prediction of user locations, and computation of median statistics for Tor hidden services.

preprint2016arXiv

Experimental Analysis of Popular Smartphone Apps Offering Anonymity, Ephemerality, and End-to-End Encryption

As social networking takes to the mobile world, smartphone apps provide users with ever-changing ways to interact with each other. Over the past couple of years, an increasing number of apps have entered the market offering end-to-end encryption, self-destructing messages, or some degree of anonymity. However, little work thus far has examined the properties they offer. To this end, this paper presents a taxonomy of 18 of these apps: we first look at the features they promise in their appeal to broaden their reach and focus on 8 of the more popular ones. We present a technical evaluation, based on static and dynamic analysis, and identify a number of gaps between the claims and reality of their promises.

preprint2016arXiv

Privacy-Friendly Mobility Analytics using Aggregate Location Data

Location data can be extremely useful to study commuting patterns and disruptions, as well as to predict real-time traffic volumes. At the same time, however, the fine-grained collection of user locations raises serious privacy concerns, as this can reveal sensitive information about the users, such as, life style, political and religious inclinations, or even identities. In this paper, we study the feasibility of crowd-sourced mobility analytics over aggregate location information: users periodically report their location, using a privacy-preserving aggregation protocol, so that the server can only recover aggregates -- i.e., how many, but not which, users are in a region at a given time. We experiment with real-world mobility datasets obtained from the Transport For London authority and the San Francisco Cabs network, and present a novel methodology based on time series modeling that is geared to forecast traffic volumes in regions of interest and to detect mobility anomalies in them. In the presence of anomalies, we also make enhanced traffic volume predictions by feeding our model with additional information from correlated regions. Finally, we present and evaluate a mobile app prototype, called Mobility Data Donors (MDD), in terms of computation, communication, and energy overhead, demonstrating the real-world deployability of our techniques.

preprint2016arXiv

Privacy-Preserving Genetic Relatedness Test

An increasing number of individuals are turning to Direct-To-Consumer (DTC) genetic testing to learn about their predisposition to diseases, traits, and/or ancestry. DTC companies like 23andme and Ancestry.com have started to offer popular and affordable ancestry and genealogy tests, with services allowing users to find unknown relatives and long-distant cousins. Naturally, access and possible dissemination of genetic data prompts serious privacy concerns, thus motivating the need to design efficient primitives supporting private genetic tests. In this paper, we present an effective protocol for privacy-preserving genetic relatedness test (PPGRT), enabling a cloud server to run relatedness tests on input an encrypted genetic database and a test facility's encrypted genetic sample. We reduce the test to a data matching problem and perform it, privately, using searchable encryption. Finally, a performance evaluation of hamming distance based PP-GRT attests to the practicality of our proposals.

preprint2016arXiv

Private Processing of Outsourced Network Functions: Feasibility and Constructions

Aiming to reduce the cost and complexity of maintaining networking infrastructures, organizations are increasingly outsourcing their network functions (e.g., firewalls, traffic shapers and intrusion detection systems) to the cloud, and a number of industrial players have started to offer network function virtualization (NFV)-based solutions. Alas, outsourcing network functions in its current setting implies that sensitive network policies, such as firewall rules, are revealed to the cloud provider. In this paper, we investigate the use of cryptographic primitives for processing outsourced network functions, so that the provider does not learn any sensitive information. More specifically, we present a cryptographic treatment of privacy-preserving outsourcing of network functions, introducing security definitions as well as an abstract model of generic network functions, and then propose a few instantiations using partial homomorphic encryption and public-key encryption with keyword search. We include a proof-of-concept implementation of our constructions and show that network functions can be privately processed by an untrusted cloud provider in a few milliseconds.

preprint2016arXiv

SplitBox: Toward Efficient Private Network Function Virtualization

This paper presents SplitBox, a scalable system for privately processing network functions that are outsourced as software processes to the cloud. Specifically, providers processing the network functions do not learn the network policies instructing how the functions are to be processed. We first propose an abstract model of a generic network function based on match-action pairs, assuming that this is processed in a distributed manner by multiple honest-but-curious providers. Then, we introduce our SplitBox system for private network function virtualization and present a proof-of-concept implementation on FastClick -- an extension of the Click modular router -- using a firewall as a use case. Our experimental results show that SplitBox achieves a throughput of over 2 Gbps with 1 kB-sized packets on average, traversing up to 60 firewall rules.

preprint2015arXiv

"`They brought in the horrible key ring thing!" Analysing the Usability of Two-Factor Authentication in UK Online Banking

To prevent password breaches and guessing attacks, banks increasingly turn to two-factor authentication (2FA), requiring users to present at least one more factor, such as a one-time password generated by a hardware token or received via SMS, besides a password. We can expect some solutions -- especially those adding a token -- to create extra work for users, but little research has investigated usability, user acceptance, and perceived security of deployed 2FA. This paper presents an in-depth study of 2FA usability with 21 UK online banking customers, 16 of whom had accounts with more than one bank. We collected a rich set of qualitative and quantitative data through two rounds of semi-structured interviews, and an authentication diary over an average of 11 days. Our participants reported a wide range of usability issues, especially with the use of hardware tokens, showing that the mental and physical workload involved shapes how they use online banking. Key targets for improvements are (i) the reduction in the number of authentication steps, and (ii) removing features that do not add any security but negatively affect the user experience.

preprint2015arXiv

Controlled Data Sharing for Collaborative Predictive Blacklisting

Although sharing data across organizations is often advocated as a promising way to enhance cybersecurity, collaborative initiatives are rarely put into practice owing to confidentiality, trust, and liability challenges. In this paper, we investigate whether collaborative threat mitigation can be realized via a controlled data sharing approach, whereby organizations make informed decisions as to whether or not, and how much, to share. Using appropriate cryptographic tools, entities can estimate the benefits of collaboration and agree on what to share in a privacy-preserving way, without having to disclose their datasets. We focus on collaborative predictive blacklisting, i.e., forecasting attack sources based on one's logs and those contributed by other organizations. We study the impact of different sharing strategies by experimenting on a real-world dataset of two billion suspicious IP addresses collected from Dshield over two months. We find that controlled data sharing yields up to 105% accuracy improvement on average, while also reducing the false positive rate.

preprint2015arXiv

Danger is My Middle Name: Experimenting with SSL Vulnerabilities in Android Apps

This paper presents a measurement study of information leakage and SSL vulnerabilities in popular Android apps. We perform static and dynamic analysis on 100 apps, downloaded at least 10M times, that request full network access. Our experiments show that, although prior work has drawn a lot of attention to SSL implementations on mobile platforms, several popular apps (32/100) accept all certificates and all hostnames, and four actually transmit sensitive data unencrypted. We set up an experimental testbed simulating man-in-the-middle attacks and find that many apps (up to 91% when the adversary has a certificate installed on the victim's device) are vulnerable, allowing the attacker to access sensitive information, including credentials, files, personal details, and credit card numbers. Finally, we provide a few recommendations to app developers and highlight several open research problems.

preprint2015arXiv

How Far Removed Are You? Scalable Privacy-Preserving Estimation of Social Path Length with Social PaL

Social relationships are a natural basis on which humans make trust decisions. Online Social Networks (OSNs) are increasingly often used to let users base trust decisions on the existence and the strength of social relationships. While most OSNs allow users to discover the length of the social path to other users, they do so in a centralized way, thus requiring them to rely on the service provider and reveal their interest in each other. This paper presents Social PaL, a system supporting the privacy-preserving discovery of arbitrary-length social paths between any two social network users. We overcome the bootstrapping problem encountered in all related prior work, demonstrating that Social PaL allows its users to find all paths of length two and to discover a significant fraction of longer paths, even when only a small fraction of OSN users is in the Social PaL system - e.g., discovering 70% of all paths with only 40% of the users. We implement Social PaL using a scalable server-side architecture and a modular Android client library, allowing developers to seamlessly integrate it into their apps.

preprint2015arXiv

The Chills and Thrills of Whole Genome Sequencing

In recent years, Whole Genome Sequencing (WGS) evolved from a futuristic-sounding research project to an increasingly affordable technology for determining complete genome sequences of complex organisms, including humans. This prompts a wide range of revolutionary applications, as WGS promises to improve modern healthcare and provide a better understanding of the human genome -- in particular, its relation to diseases and response to treatments. However, this progress raises worrisome privacy and ethical issues, since, besides uniquely identifying its owner, the genome contains a treasure trove of highly personal and sensitive information. In this article, after summarizing recent advances in genomics, we discuss some important privacy issues associated with human genomic information and identify a number of particularly relevant research challenges.

preprint2015arXiv

Whole Genome Sequencing: Innovation Dream or Privacy Nightmare?

Over the past several years, DNA sequencing has emerged as one of the driving forces in life-sciences, paving the way for affordable and accurate whole genome sequencing. As genomes represent the entirety of an organism's hereditary information, the availability of complete human genomes prompts a wide range of revolutionary applications. The hope for improving modern healthcare and better understanding the human genome propels many interesting and challenging research frontiers. Unfortunately, however, the proliferation of human genomes amplifies worrisome privacy concerns, since a genome represents a treasure trove of highly personal and sensitive information. In this article, we provide an overview of positive results and biomedical advances in the field, and discuss privacy issues associated with human genomic information. Finally, we survey available privacy-enhancing technologies and list a number of open research challenges.

preprint2014arXiv

A Comparative Usability Study of Two-Factor Authentication

Two-factor authentication (2F) aims to enhance resilience of password-based authentication by requiring users to provide an additional authentication factor, e.g., a code generated by a security token. However, it also introduces non-negligible costs for service providers and requires users to carry out additional actions during the authentication process. In this paper, we present an exploratory comparative study of the usability of 2F technologies. First, we conduct a pre-study interview to identify popular technologies as well as contexts and motivations in which they are used. We then present the results of a quantitative study based on a survey completed by 219 Mechanical Turk users, aiming to measure the usability of three popular 2F solutions: codes generated by security tokens, one-time PINs received via email or SMS, and dedicated smartphone apps (e.g., Google Authenticator). We record contexts and motivations, and study their impact on perceived usability. We find that 2F technologies are overall perceived as usable, regardless of motivation and/or context of use. We also present an exploratory factor analysis, highlighting that three metrics -- ease-of-use, required cognitive efforts, and trustworthiness -- are enough to capture key factors affecting 2F usability.

preprint2014arXiv

An Exploratory Ethnographic Study of Issues and Concerns with Whole Genome Sequencing

Progress in Whole Genome Sequencing (WGS) will soon allow a large number of individuals to have their genome fully sequenced. This lays the foundations to improve modern healthcare, enabling a new era of personalized medicine where diagnosis and treatment is tailored to the patient's genetic makeup. It also allows individuals motivated by personal curiosity to have access to their genetic information, and use it, e.g., to trace their ancestry. However, the very same progress also amplifies a number of ethical and privacy concerns, that stem from the unprecedented sensitivity of genomic information and that are not well studied. This paper presents an exploratory ethnographic study of users' perception of privacy and ethical issues with WGS, as well as their attitude toward different WGS programs. We report on a series of semi-structured interviews, involving 16 participants, and analyze the results both quantitatively and qualitatively. Our analysis shows that users exhibit common trust concerns and fear of discrimination, and demand to retain strict control over their genetic information. Finally, we highlight the need for further research in the area and follow-up studies that build on our initial findings.

preprint2014arXiv

Censorship in the Wild: Analyzing Internet Filtering in Syria

Internet censorship is enforced by numerous governments worldwide, however, due to the lack of publicly available information, as well as the inherent risks of performing active measurements, it is often hard for the research community to investigate censorship practices in the wild. Thus, the leak of 600GB worth of logs from 7 Blue Coat SG-9000 proxies, deployed in Syria to filter Internet traffic at a country scale, represents a unique opportunity to provide a detailed snapshot of a real-world censorship ecosystem. This paper presents the methodology and the results of a measurement analysis of the leaked Blue Coat logs, revealing a relatively stealthy, yet quite targeted, censorship. We find that traffic is filtered in several ways: using IP addresses and domain names to block subnets or websites, and keywords or categories to target specific content. We show that keyword-based censorship produces some collateral damage as many requests are blocked even if they do not relate to sensitive content. We also discover that Instant Messaging is heavily censored, while filtering of social media is limited to specific pages. Finally, we show that Syrian users try to evade censorship by using web/socks proxies, Tor, VPNs, and BitTorrent. To the best of our knowledge, our work provides the first analytical look into Internet filtering in Syria.

preprint2014arXiv

Paying for Likes? Understanding Facebook Like Fraud Using Honeypots

Facebook pages offer an easy way to reach out to a very large audience as they can easily be promoted using Facebook's advertising platform. Recently, the number of likes of a Facebook page has become a measure of its popularity and profitability, and an underground market of services boosting page likes, aka like farms, has emerged. Some reports have suggested that like farms use a network of profiles that also like other pages to elude fraud protection algorithms, however, to the best of our knowledge, there has been no systematic analysis of Facebook pages' promotion methods. This paper presents a comparative measurement study of page likes garnered via Facebook ads and by a few like farms. We deploy a set of honeypot pages, promote them using both methods, and analyze garnered likes based on likers' demographic, temporal, and social characteristics. We highlight a few interesting findings, including that some farms seem to be operated by bots and do not really try to hide the nature of their operations, while others follow a stealthier approach, mimicking regular users' behavior.

preprint2014arXiv

What's the Gist? Privacy-Preserving Aggregation of User Profiles

Over the past few years, online service providers have started gathering increasing amounts of personal information to build user profiles and monetize them with advertisers and data brokers. Users have little control of what information is processed and are often left with an all-or-nothing decision between receiving free services or refusing to be profiled. This paper explores an alternative approach where users only disclose an aggregate model -- the "gist" -- of their data. We aim to preserve data utility and simultaneously provide user privacy. We show that this approach can be efficiently supported by letting users contribute encrypted and differentially-private data to an aggregator. The aggregator combines encrypted contributions and can only extract an aggregate model of the underlying data. We evaluate our framework on a dataset of 100,000 U.S. users obtained from the U.S. Census Bureau and show that (i) it provides accurate aggregates with as little as 100 users, (ii) it generates revenue for both users and data brokers, and (iii) its overhead is appreciably low.

preprint2013arXiv

Bootstrapping Trust in Online Dating: Social Verification of Online Dating Profiles

Online dating is an increasingly thriving business which boasts billion-dollar revenues and attracts users in the tens of millions. Notwithstanding its popularity, online dating is not impervious to worrisome trust and privacy concerns raised by the disclosure of potentially sensitive data as well as the exposure to self-reported (and thus potentially misrepresented) information. Nonetheless, little research has, thus far, focused on how to enhance privacy and trustworthiness. In this paper, we report on a series of semi-structured interviews involving 20 participants, and show that users are significantly concerned with the veracity of online dating profiles. To address some of these concerns, we present the user-centered design of an interface, called Certifeye, which aims to bootstrap trust in online dating profiles using existing social network data. Certifeye verifies that the information users report on their online dating profile (e.g., age, relationship status, and/or photos) matches that displayed on their own Facebook profile. Finally, we present the results of a 161-user Mechanical Turk study assessing whether our veracity-enhancing interface successfully reduced concerns in online dating users and find a statistically significant trust increase.

preprint2013arXiv

EsPRESSo: Efficient Privacy-Preserving Evaluation of Sample Set Similarity

Electronic information is increasingly often shared among entities without complete mutual trust. To address related security and privacy issues, a few cryptographic techniques have emerged that support privacy-preserving information sharing and retrieval. One interesting open problem in this context involves two parties that need to assess the similarity of their datasets, but are reluctant to disclose their actual content. This paper presents an efficient and provably-secure construction supporting the privacy-preserving evaluation of sample set similarity, where similarity is measured as the Jaccard index. We present two protocols: the first securely computes the (Jaccard) similarity of two sets, and the second approximates it, using MinHash techniques, with lower complexities. We show that our novel protocols are attractive in many compelling applications, including document/multimedia similarity, biometric authentication, and genetic tests. In the process, we demonstrate that our constructions are appreciably more efficient than prior work.

preprint2013arXiv

Extended Capabilities for a Privacy-Enhanced Participatory Sensing Infrastructure (PEPSI)

Participatory sensing is emerging as an innovative computing paradigm that targets the ubiquity of always-connected mobile phones and their sensing capabilities. In this context, a multitude of pioneering applications increasingly carry out pervasive collection and dissemination of information and environmental data, such as, traffic conditions, pollution, temperature, etc. Participants collect and report measurements from their mobile devices and entrust them to the cloud to be made available to applications and users. Naturally, due to the personal information associated to the reports (e.g., location, movements, etc.), a number of privacy concerns need to be taken into account prior to a large-scale deployment of these applications. Motivated by the need for privacy protection in Participatory Sensing, this work presents PEPSI: a Privacy-Enhanced Participatory Sensing Infrastructure. We explore realistic architectural assumptions and a minimal set of formal requirements aiming at protecting privacy of both data producers and consumers. We propose two instantiations that attain privacy guarantees with provable security at very low additional computational cost and almost no extra communication overhead.

preprint2013arXiv

Participatory Privacy: Enabling Privacy in Participatory Sensing

Participatory Sensing is an emerging computing paradigm that enables the distributed collection of data by self-selected participants. It allows the increasing number of mobile phone users to share local knowledge acquired by their sensor-equipped devices, e.g., to monitor temperature, pollution level or consumer pricing information. While research initiatives and prototypes proliferate, their real-world impact is often bounded to comprehensive user participation. If users have no incentive, or feel that their privacy might be endangered, it is likely that they will not participate. In this article, we focus on privacy protection in Participatory Sensing and introduce a suitable privacy-enhanced infrastructure. First, we provide a set of definitions of privacy requirements for both data producers (i.e., users providing sensed information) and consumers (i.e., applications accessing the data). Then, we propose an efficient solution designed for mobile phone users, which incurs very low overhead. Finally, we discuss a number of open problems and possible research directions.

preprint2013arXiv

Privacy in Content-Oriented Networking: Threats and Countermeasures

As the Internet struggles to cope with scalability, mobility, and security issues, new network architectures are being proposed to better accommodate the needs of modern systems and applications. In particular, Content-Oriented Networking (CON) has emerged as a promising next-generation Internet architecture: it sets to decouple content from hosts, at the network layer, by naming data rather than hosts. CON comes with a potential for a wide range of benefits, including reduced congestion and improved delivery speed by means of content caching, simpler configuration of network devices, and security at the data level. However, it remains an interesting open question whether or not, and to what extent, this emerging networking paradigm bears new privacy challenges. In this paper, we provide a systematic privacy analysis of CON and the common building blocks among its various architectural instances in order to highlight emerging privacy threats, and analyze a few potential countermeasures. Finally, we present a comparison between CON and today's Internet in the context of a few privacy concepts, such as, anonymity, censoring, traceability, and confidentiality.

preprint2012arXiv

Harvesting SSL Certificate Data to Identify Web-Fraud

Web-fraud is one of the most unpleasant features of today's Internet. Two well-known examples of fraudulent activities on the web are phishing and typosquatting. Their effects range from relatively benign (such as unwanted ads) to downright sinister (especially, when typosquatting is combined with phishing). This paper presents a novel technique to detect web-fraud domains that utilize HTTPS. To this end, we conduct the first comprehensive study of SSL certificates. We analyze certificates of legitimate and popular domains and those used by fraudulent ones. Drawing from extensive measurements, we build a classifier that detects such malicious domains with high accuracy.

preprint2011arXiv

Countering Gattaca: Efficient and Secure Testing of Fully-Sequenced Human Genomes (Full Version)

Recent advances in DNA sequencing technologies have put ubiquitous availability of fully sequenced human genomes within reach. It is no longer hard to imagine the day when everyone will have the means to obtain and store one's own DNA sequence. Widespread and affordable availability of fully sequenced genomes immediately opens up important opportunities in a number of health-related fields. In particular, common genomic applications and tests performed in vitro today will soon be conducted computationally, using digitized genomes. New applications will be developed as genome-enabled medicine becomes increasingly preventive and personalized. However, this progress also prompts significant privacy challenges associated with potential loss, theft, or misuse of genomic data. In this paper, we begin to address genomic privacy by focusing on three important applications: Paternity Tests, Personalized Medicine, and Genetic Compatibility Tests. After carefully analyzing these applications and their privacy requirements, we propose a set of efficient techniques based on private set operations. This allows us to implement in in silico some operations that are currently performed via in vitro methods, in a secure fashion. Experimental results demonstrate that proposed techniques are both feasible and practical today.

preprint2011arXiv

EphPub: Toward Robust Ephemeral Publishing

The increasing amount of personal and sensitive information disseminated over the Internet prompts commensurately growing privacy concerns. Digital data often lingers indefinitely and users lose its control. This motivates the desire to restrict content availability to an expiration time set by the data owner. This paper presents and formalizes the notion of Ephemeral Publishing (EphPub), to prevent the access to expired content. We propose an efficient and robust protocol that builds on the Domain Name System (DNS) and its caching mechanism. With EphPub, sensitive content is published encrypted and the key material is distributed, in a steganographic manner, to randomly selected and independent resolvers. The availability of content is then limited by the evanescence of DNS cache entries. The EphPub protocol is transparent to existing applications, and does not rely on trusted hardware, centralized servers, or user proactive actions. We analyze its robustness and show that it incurs a negligible overhead on the DNS infrastructure. We also perform a large-scale study of the caching behavior of 900K open DNS resolvers. Finally, we propose Firefox and Thunderbird extensions that provide ephemeral publishing capabilities, as well as a command-line tool to create ephemeral files.

preprint2010arXiv

Private Information Disclosure from Web Searches. (The case of Google Web History)

As the amount of personal information stored at remote service providers increases, so does the danger of data theft. When connections to remote services are made in the clear and authenticated sessions are kept using HTTP cookies, data theft becomes extremely easy to achieve. In this paper, we study the architecture of the world's largest service provider, i.e., Google. First, with the exception of a few services that can only be accessed over HTTPS (e.g., Gmail), we find that many Google services are still vulnerable to simple session hijacking. Next, we present the Historiographer, a novel attack that reconstructs the web search history of Google users, i.e., Google's Web History, even though such a service is supposedly protected from session hijacking by a stricter access control policy. The Historiographer uses a reconstruction technique inferring search history from the personalized suggestions fed by the Google search engine. We validate our technique through experiments conducted over real network traffic and discuss possible countermeasures. Our attacks are general and not only specific to Google, and highlight privacy concerns of mixed architectures using both secure and insecure connections.

Emiliano De Cristofaro

What is connected

Connect this record

See the researcher in context

Building this map preview

45 published item(s)

Adherence to Misinformation on Social Media Through Socio-Cognitive and Group-Based Processes

Cerberus: Exploring Federated Prediction of Security Events

Feels Bad Man: Dissecting Automated Hateful Meme Detection Through the Lens of Facebook's Challenge

Local and Central Differential Privacy for Robustness and Privacy in Federated Learning

On Utility and Privacy in Synthetic Genomic Data

Robin Hood and Matthew Effects: Differential Privacy Has Disparate Impact on Synthetic Data

The Gospel According to Q: Understanding the QAnon Conspiracy from the Perspective of Canonical Information

Toxicity in the Decentralized Web and the Potential for Model Sharing

Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots

"Is it a Qoincidence?": An Exploratory Study of QAnon on Voat

A Multi-Platform Analysis of Political News Discussion and Sharing on Web Communities

Dissecting the Meme Magic: Understanding Indicators of Virality in Image Memes

An Overview of Privacy in Machine Learning

Measuring Membership Privacy on Aggregate Location Time-Series

On the Feasibility of Acoustic Attacks Using Commodity Smart Devices

Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board

Characterizing the Use of Images in State-Sponsored Information Warfare Operations by Russian Trolls on Twitter

Ad-Blocking and Counter Blocking: A Slice of the Arms Race

Combating Fraud in Online Social Networks: Detecting Stealthy Facebook Like Farms

Efficient Private Statistics with Succinct Sketches

Experimental Analysis of Popular Smartphone Apps Offering Anonymity, Ephemerality, and End-to-End Encryption

Privacy-Friendly Mobility Analytics using Aggregate Location Data

Privacy-Preserving Genetic Relatedness Test

Private Processing of Outsourced Network Functions: Feasibility and Constructions

SplitBox: Toward Efficient Private Network Function Virtualization

"`They brought in the horrible key ring thing!" Analysing the Usability of Two-Factor Authentication in UK Online Banking

Controlled Data Sharing for Collaborative Predictive Blacklisting

Danger is My Middle Name: Experimenting with SSL Vulnerabilities in Android Apps

How Far Removed Are You? Scalable Privacy-Preserving Estimation of Social Path Length with Social PaL

The Chills and Thrills of Whole Genome Sequencing

Whole Genome Sequencing: Innovation Dream or Privacy Nightmare?

A Comparative Usability Study of Two-Factor Authentication

An Exploratory Ethnographic Study of Issues and Concerns with Whole Genome Sequencing

Censorship in the Wild: Analyzing Internet Filtering in Syria

Paying for Likes? Understanding Facebook Like Fraud Using Honeypots

What's the Gist? Privacy-Preserving Aggregation of User Profiles

Bootstrapping Trust in Online Dating: Social Verification of Online Dating Profiles

EsPRESSo: Efficient Privacy-Preserving Evaluation of Sample Set Similarity

Extended Capabilities for a Privacy-Enhanced Participatory Sensing Infrastructure (PEPSI)

Participatory Privacy: Enabling Privacy in Participatory Sensing

Privacy in Content-Oriented Networking: Threats and Countermeasures

Harvesting SSL Certificate Data to Identify Web-Fraud

Countering Gattaca: Efficient and Secure Testing of Fully-Sequenced Human Genomes (Full Version)

EphPub: Toward Robust Ephemeral Publishing

Private Information Disclosure from Web Searches. (The case of Google Web History)