Researcher profile

Jeremy Blackburn

Jeremy Blackburn contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
14works
0followers
6topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

14 published item(s)

preprint2022arXiv

Feels Bad Man: Dissecting Automated Hateful Meme Detection Through the Lens of Facebook's Challenge

Internet memes have become a dominant method of communication; at the same time, however, they are also increasingly being used to advocate extremism and foster derogatory beliefs. Nonetheless, we do not have a firm understanding as to which perceptual aspects of memes cause this phenomenon. In this work, we assess the efficacy of current state-of-the-art multimodal machine learning models toward hateful meme detection, and in particular with respect to their generalizability across platforms. We use two benchmark datasets comprising 12,140 and 10,567 images from 4chan's "Politically Incorrect" board (/pol/) and Facebook's Hateful Memes Challenge dataset to train the competition's top-ranking machine learning models for the discovery of the most prominent features that distinguish viral hateful memes from benign ones. We conduct three experiments to determine the importance of multimodality on classification performance, the influential capacity of fringe Web communities on mainstream social platforms and vice versa, and the models' learning transferability on 4chan memes. Our experiments show that memes' image characteristics provide a greater wealth of information than its textual content. We also find that current systems developed for online detection of hate speech in memes necessitate further concentration on its visual elements to improve their interpretation of underlying cultural connotations, implying that multimodal models fail to adequately grasp the intricacies of hate speech in memes and generalize across social media platforms.

preprint2022arXiv

On Xing Tian and the Perseverance of Anti-China Sentiment Online

Sinophobia, anti-Chinese sentiment, has existed on the Web for a long time. The outbreak of COVID-19 and the extended quarantine has further amplified it. However, we lack a quantitative understanding of the cause of Sinophobia as well as how it evolves over time. In this paper, we conduct a large-scale longitudinal measurement of Sinophobia, between 2016 and 2021, on two mainstream and fringe Web communities. By analyzing 8B posts from Reddit and 206M posts from 4chan's /pol/, we investigate the origins, evolution, and content of Sinophobia. We find that, anti-Chinese content may be evoked by political events not directly related to China, e.g., the U.S. withdrawal from the Paris Agreement. And during the COVID-19 pandemic, daily usage of Sinophobic slurs has significantly increased even with the hate-speech ban policy. We also show that the semantic meaning of the words "China" and "Chinese" are shifting towards Sinophobic slurs with the rise of COVID-19 and remain the same in the pandemic period. We further use topic modeling to show the topics of Sinophobic discussion are pretty diverse and broad. We find that both Web communities share some common Sinophobic topics like ethnics, economics and commerce, weapons and military, foreign relations, etc. However, compared to 4chan's /pol/, more daily life-related topics including food, game, and stock are found in Reddit. Our finding also reveals that the topics related to COVID-19 and blaming the Chinese government are more prevalent in the pandemic period. To the best of our knowledge, this paper is the longest quantitative measurement of Sinophobia.

preprint2022arXiv

The Gospel According to Q: Understanding the QAnon Conspiracy from the Perspective of Canonical Information

The QAnon conspiracy theory claims that a cabal of (literally) blood-thirsty politicians and media personalities are engaged in a war to destroy society. By interpreting cryptic "drops" of information from an anonymous insider calling themself Q, adherents of the conspiracy theory believe that Donald Trump is leading them in an active fight against this cabal. QAnon has been covered extensively by the media, as its adherents have been involved in multiple violent acts, including the January 6th, 2021 seditious storming of the US Capitol building. Nevertheless, we still have relatively little understanding of how the theory evolved and spread on the Web, and the role played in that by multiple platforms. To address this gap, we study QAnon from the perspective of "Q" themself. We build a dataset of 4,949 canonical Q drops collected from six "aggregation sites," which curate and archive them from their original posting to anonymous and ephemeral image boards. We expose that these sites have a relatively low (overall) agreement, and thus at least some Q drops should probably be considered apocryphal. We then analyze the Q drops' contents to identify topics of discussion and find statistically significant indications that drops were not authored by a single individual. Finally, we look at how posts on Reddit are used to disseminate Q drops to wider audiences. We find that dissemination was (initially) limited to a few sub-communities and that, while heavy-handed moderation decisions have reduced the overall issue, the "gospel" of Q persists on the Web.

preprint2022arXiv

Why So Toxic? Measuring and Triggering Toxic Behavior in Open-Domain Chatbots

Chatbots are used in many applications, e.g., automated agents, smart home assistants, interactive characters in online games, etc. Therefore, it is crucial to ensure they do not behave in undesired manners, providing offensive or toxic responses to users. This is not a trivial task as state-of-the-art chatbot models are trained on large, public datasets openly collected from the Internet. This paper presents a first-of-its-kind, large-scale measurement of toxicity in chatbots. We show that publicly available chatbots are prone to providing toxic responses when fed toxic queries. Even more worryingly, some non-toxic queries can trigger toxic responses too. We then set out to design and experiment with an attack, ToxicBuddy, which relies on fine-tuning GPT-2 to generate non-toxic queries that make chatbots respond in a toxic manner. Our extensive experimental evaluation demonstrates that our attack is effective against public chatbot models and outperforms manually-crafted malicious queries proposed by previous work. We also evaluate three defense mechanisms against ToxicBuddy, showing that they either reduce the attack performance at the cost of affecting the chatbot's utility or are only effective at mitigating a portion of the attack. This highlights the need for more research from the computer security and online safety communities to ensure that chatbot models do not hurt their users. Overall, we are confident that ToxicBuddy can be used as an auditing tool and that our work will pave the way toward designing more effective defenses for chatbot safety.

preprint2021arXiv

"Go eat a bat, Chang!": On the Emergence of Sinophobic Behavior on Web Communities in the Face of COVID-19

The outbreak of the COVID-19 pandemic has changed our lives in unprecedented ways. In the face of the projected catastrophic consequences, many countries have enacted social distancing measures in an attempt to limit the spread of the virus. Under these conditions, the Web has become an indispensable medium for information acquisition, communication, and entertainment. At the same time, unfortunately, the Web is being exploited for the dissemination of potentially harmful and disturbing content, such as the spread of conspiracy theories and hateful speech towards specific ethnic groups, in particular towards Chinese people since COVID-19 is believed to have originated from China. In this paper, we make a first attempt to study the emergence of Sinophobic behavior on the Web during the outbreak of the COVID-19 pandemic. We collect two large-scale datasets from Twitter and 4chan's Politically Incorrect board (/pol/) over a time period of approximately five months and analyze them to investigate whether there is a rise or important differences with regard to the dissemination of Sinophobic content. We find that COVID-19 indeed drives the rise of Sinophobia on the Web and that the dissemination of Sinophobic content is a cross-platform phenomenon: it exists on fringe Web communities like \dspol, and to a lesser extent on mainstream ones like Twitter. Also, using word embeddings over time, we characterize the evolution and emergence of new Sinophobic slurs on both Twitter and /pol/. Finally, we find interesting differences in the context in which words related to Chinese people are used on the Web before and after the COVID-19 outbreak: on Twitter we observe a shift towards blaming China for the situation, while on /pol/ we find a shift towards using more (and new) Sinophobic slurs.

preprint2021arXiv

"Is it a Qoincidence?": An Exploratory Study of QAnon on Voat

Online fringe communities offer fertile grounds for users seeking and sharing ideas fueling suspicion of mainstream news and conspiracy theories. Among these, the QAnon conspiracy theory emerged in 2017 on 4chan, broadly supporting the idea that powerful politicians, aristocrats, and celebrities are closely engaged in a global pedophile ring. Simultaneously, governments are thought to be controlled by "puppet masters," as democratically elected officials serve as a fake showroom of democracy. This paper provides an empirical exploratory analysis of the QAnon community on Voat.co, a Reddit-esque news aggregator, which has captured the interest of the press for its toxicity and for providing a platform to QAnon followers. More precisely, we analyze a large dataset from /v/GreatAwakening, the most popular QAnon-related subverse (the Voat equivalent of a subreddit), to characterize activity and user engagement. To further understand the discourse around QAnon, we study the most popular named entities mentioned in the posts, along with the most prominent topics of discussion, which focus on US politics, Donald Trump, and world events. We also use word embeddings to identify narratives around QAnon-specific keywords. Our graph visualization shows that some of the QAnon-related ones are closely related to those from the Pizzagate conspiracy theory and so-called drops by "Q." Finally, we analyze content toxicity, finding that discussions on /v/GreatAwakening are less toxic than in the broad Voat community.

preprint2021arXiv

A Multi-Platform Analysis of Political News Discussion and Sharing on Web Communities

The news ecosystem has become increasingly complex, encompassing a wide range of sources with varying levels of trustworthiness, and with public commentary giving different spins to the same stories. In this paper, we present a multi-platform measurement of this ecosystem. We compile a list of 1,073 news websites and extract posts from four Web communities (Twitter, Reddit, 4chan, and Gab) that contain URLs from these sources. This yields a dataset of 38M posts containing 15M news URLs, spanning almost three years. We study the data along several axes, assessing the trustworthiness of shared news, designing a method to group news articles into stories, analyzing these stories are discussed and measuring the influence various Web communities have in that. Our analysis shows that different communities discuss different types of news, with polarized communities like Gab and /r/The_Donald subreddit disproportionately referencing untrustworthy sources. We also find that fringe communities often have a disproportionate influence on other platforms w.r.t. pushing narratives around certain news, for example about political elections, immigration, or foreign policy.

preprint2021arXiv

Dissecting the Meme Magic: Understanding Indicators of Virality in Image Memes

Despite the increasingly important role played by image memes, we do not yet have a solid understanding of the elements that might make a meme go viral on social media. In this paper, we investigate what visual elements distinguish image memes that are highly viral on social media from those that do not get re-shared, across three dimensions: composition, subjects, and target audience. Drawing from research in art theory, psychology, marketing, and neuroscience, we develop a codebook to characterize image memes, and use it to annotate a set of 100 image memes collected from 4chan's Politically Incorrect Board (/pol/). On the one hand, we find that highly viral memes are more likely to use a close-up scale, contain characters, and include positive or negative emotions. On the other hand, image memes that do not present a clear subject the viewer can focus attention on, or that include long text are not likely to be re-shared by users. We train machine learning models to distinguish between image memes that are likely to go viral and those that are unlikely to be re-shared, obtaining an AUC of 0.866 on our dataset. We also show that the indicators of virality identified by our model can help characterize the most viral memes posted on mainstream online social networks too, as our classifiers are able to predict 19 out of the 20 most popular image memes posted on Twitter and Reddit between 2016 and 2018. Overall, our analysis sheds light on what indicators characterize viral and non-viral visual content online, and set the basis for developing better techniques to create or moderate content that is more likely to catch the viewer's attention.

preprint2020arXiv

A First Look at Zoombombing

Online meeting tools like Zoom and Google Meet have become central to our professional, educational, and personal lives. This has opened up new opportunities for large scale harassment. In particular, a phenomenon known as zoombombing has emerged, in which aggressors join online meetings with the goal of disrupting them and harassing their participants. In this paper, we conduct the first data-driven analysis of calls for zoombombing attacks on social media. We identify ten popular online meeting tools and extract posts containing meeting invitations to these platforms on a mainstream social network, Twitter, and on a fringe community known for organizing coordinated attacks against online users, 4chan. We then perform manual annotation to identify posts that are calling for zoombombing attacks, and apply thematic analysis to develop a codebook to better characterize the discussion surrounding calls for zoombombing. During the first seven months of 2020, we identify over 200 calls for zoombombing between Twitter and 4chan, and analyze these calls both quantitatively and qualitatively. Our findings indicate that the vast majority of calls for zoombombing are not made by attackers stumbling upon meeting invitations or bruteforcing their meeting ID, but rather by insiders who have legitimate access to these meetings, particularly students in high school and college classes. This has important security implications, because it makes common protections against zoombombing, such as password protection, ineffective. We also find instances of insiders instructing attackers to adopt the names of legitimate participants in the class to avoid detection, making countermeasures like setting up a waiting room and vetting participants less effective. Based on these observations, we argue that the only effective defense against zoombombing is creating unique join links for each participant.

preprint2020arXiv

Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board

This paper presents a dataset with over 3.3M threads and 134.5M posts from the Politically Incorrect board (/pol/) of the imageboard forum 4chan, posted over a period of almost 3.5 years (June 2016-November 2019). To the best of our knowledge, this represents the largest publicly available 4chan dataset, providing the community with an archive of posts that have been permanently deleted from 4chan and are otherwise inaccessible. We augment the data with a set of additional labels, including toxicity scores and the named entities mentioned in each post. We also present a statistical analysis of the dataset, providing an overview of what researchers interested in using it can expect, as well as a simple content analysis, shedding light on the most prominent discussion topics, the most popular entities mentioned, and the toxicity level of each post. Overall, we are confident that our work will motivate and assist researchers in studying and understanding 4chan, as well as its role on the greater Web. For instance, we hope this dataset may be used for cross-platform studies of social media, as well as being useful for other types of research like natural language processing. Finally, our dataset can assist qualitative work focusing on in-depth case studies of specific narratives, events, or social theories.

preprint2020arXiv

The Pushshift Reddit Dataset

Social media data has become crucial to the advancement of scientific understanding. However, even though it has become ubiquitous, just collecting large-scale social media data involves a high degree of engineering skill set and computational resources. In fact, research is often times gated by data engineering problems that must be overcome before analysis can proceed. This has resulted recognition of datasets as meaningful research contributions in and of themselves. Reddit, the so called "front page of the Internet," in particular has been the subject of numerous scientific studies. Although Reddit is relatively open to data acquisition compared to social media platforms like Facebook and Twitter, the technical barriers to acquisition still remain. Thus, Reddit's millions of subreddits, hundreds of millions of users, and hundreds of billions of comments are at the same time relatively accessible, but time consuming to collect and analyze systematically. In this paper, we present the Pushshift Reddit dataset. Pushshift is a social media data collection, analysis, and archiving platform that since 2015 has collected Reddit data and made it available to researchers. Pushshift's Reddit dataset is updated in real-time, and includes historical data back to Reddit's inception. In addition to monthly dumps, Pushshift provides computational tools to aid in searching, aggregating, and performing exploratory analysis on the entirety of the dataset. The Pushshift Reddit dataset makes it possible for social media researchers to reduce time spent in the data collection, cleaning, and storage phases of their projects.

preprint2020arXiv

The Pushshift Telegram Dataset

Messaging platforms, especially those with a mobile focus, have become increasingly ubiquitous in society. These mobile messaging platforms can have deceivingly large user bases, and in addition to being a way for people to stay in touch, are often used to organize social movements, as well as a place for extremists and other ne'er-do-well to congregate. In this paper, we present a dataset from one such mobile messaging platform: Telegram. Our dataset is made up of over 27.8K channels and 317M messages from 2.2M unique users. To the best of our knowledge, our dataset comprises the largest and most complete of its kind. In addition to the raw data, we also provide the source code used to collect it, allowing researchers to run their own data collection instance. We believe the Pushshift Telegram dataset can help researchers from a variety of disciplines interested in studying online social movements, protests, political extremism, and disinformation.

preprint2019arXiv

A Quantitative Approach to Understanding Online Antisemitism

A new wave of growing antisemitism, driven by fringe Web communities, is an increasingly worrying presence in the socio-political realm. The ubiquitous and global nature of the Web has provided tools used by these groups to spread their ideology to the rest of the Internet. Although the study of antisemitism and hate is not new, the scale and rate of change of online data has impacted the efficacy of traditional approaches to measure and understand these troubling trends. In this paper, we present a large-scale, quantitative study of online antisemitism. We collect hundreds of million posts and images from alt-right Web communities like 4chan's Politically Incorrect board (/pol/) and Gab. Using scientifically grounded methods, we quantify the escalation and spread of antisemitic memes and rhetoric across the Web. We find the frequency of antisemitic content greatly increases (in some cases more than doubling) after major political events such as the 2016 US Presidential Election and the "Unite the Right" rally in Charlottesville. We extract semantic embeddings from our corpus of posts and demonstrate how automated techniques can discover and categorize the use of antisemitic terminology. We additionally examine the prevalence and spread of the antisemitic "Happy Merchant" meme, and in particular how these fringe communities influence its propagation to more mainstream communities like Twitter and Reddit. Taken together, our results provide a data-driven, quantitative framework for understanding online antisemitism. Our methods serve as a framework to augment current qualitative efforts by anti-hate groups, providing new insights into the growth and spread of hate online.

preprint2019arXiv

Characterizing the Use of Images in State-Sponsored Information Warfare Operations by Russian Trolls on Twitter

State-sponsored organizations are increasingly linked to efforts aimed to exploit social media for information warfare and manipulating public opinion. Typically, their activities rely on a number of social network accounts they control, aka trolls, that post and interact with other users disguised as "regular" users. These accounts often use images and memes, along with textual content, in order to increase the engagement and the credibility of their posts. In this paper, we present the first study of images shared by state-sponsored accounts by analyzing a ground truth dataset of 1.8M images posted to Twitter by accounts controlled by the Russian Internet Research Agency. First, we analyze the content of the images as well as their posting activity. Then, using Hawkes Processes, we quantify their influence on popular Web communities like Twitter, Reddit, 4chan's Politically Incorrect board (/pol/), and Gab, with respect to the dissemination of images. We find that the extensive image posting activity of Russian trolls coincides with real-world events (e.g., the Unite the Right rally in Charlottesville), and shed light on their targets as well as the content disseminated via images. Finally, we show that the trolls were more effective in disseminating politics-related imagery than other images.