Researcher profile

Gareth Tyson

Gareth Tyson contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
13works
0followers
7topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

13 published item(s)

preprint2026arXiv

Understanding the Consequences of VTuber Reincarnation

The rapid proliferation of VTubers, digital avatars controlled and voiced by human actors (Nakanohito), has created a lucrative and popular entertainment ecosystem. However, the prevailing industry model, where corporations retain ownership of the VTuber persona while the Nakanohito bears the immense pressure of dual-identity management, exposes the Nakanohito to significant vulnerabilities, including burnout, harassment, and precarious labor conditions. When these pressures become untenable, the Nakanohito may terminate their contracts and later debut with a new persona, a process known as "reincarnation". This phenomenon, a rising concern in the industry, inflicts substantial losses on the Nakanohito, agencies, and audiences alike. Understanding the quantitative fallout of reincarnation is crucial for mitigating this damage and fostering a more sustainable industry. To address this gap, we conduct the first large-scale empirical study of VTuber reincarnation, analyzing 12 significant cases using a comprehensive dataset of 728K livestream sessions and 4.5B viewer interaction records. Our results suggest reincarnation significantly damages a Nakanohito's career, leading to a decline in audience and financial support, an increase in harassment, and negative repercussions for the wider VTuber industry. Overall, these insights carry immediate implications for mitigating the significant professional and personal costs of the reincarnation, and fostering a healthier and more equitable VTuber ecosystem.

preprint2023arXiv

A Twitter Dataset for Pakistani Political Discourse

We share the largest dataset for the Pakistani Twittersphere consisting of over 49 million tweets, collected during one of the most politically active periods in the country. We collect the data after the deposition of the government by a No Confidence Vote in April 2022. This large-scale dataset can be used for several downstream tasks such as political bias, bots detection, trolling behavior, (dis)misinformation, and censorship related to Pakistani Twitter users. In addition, this dataset provides a large collection of tweets in Urdu and Roman Urdu that can be used for optimizing language processing tasks.

preprint2022arXiv

A Reddit Dataset for the Russo-Ukrainian Conflict in 2022

Reddit consists of sub-communities that cover a focused topic. This paper provides a list of relevant subreddits for the ongoing Russo-Ukrainian crisis. We perform an exhaustive subreddit exploration using keyword search and shortlist 12 subreddits as potential candidates that contain nominal discourse related to the crisis. These subreddits contain over 300,000 posts and 8 million comments collectively. We provide an additional categorization of content into two categories, "R-U Conflict", and "Military Related", based on their primary focus. We further perform content characterization of those subreddits. The results show a surge of posts and comments soon after Russia launched the invasion. "Military Related" posts are more likely to receive more replies than "R-U Conflict" posts. Our textual analysis shows an apparent preference for the Pro-Ukraine stance in "R-U Conflict", while "Military Related" retain a neutral stance.

preprint2022arXiv

A Study of Third-party Resources Loading on Web

This paper performs a large-scale study of dependency chains in the web, to find that around 50% of first-party websites render content that they did not directly load. Although the majority (84.91%) of websites have short dependency chains (below 3 levels), we find websites with dependency chains exceeding 30. Using VirusTotal, we show that 1.2% of these third-parties are classified as suspicious -- although seemingly small, this limited set of suspicious third-parties have remarkable reach into the wider ecosystem. We find that 73% of websites under-study load resources from suspicious third-parties, and 24.8% of first-party webpages contain at least three third-parties classified as suspicious in their dependency chain. By running sandboxed experiments, we observe a range of activities with the majority of suspicious JavaScript codes downloading malware.

preprint2022arXiv

Design and Evaluation of IPFS: A Storage Layer for the Decentralized Web

Recent years have witnessed growing consolidation of web operations. For example, the majority of web traffic now originates from a few organizations, and even micro-websites often choose to host on large pre-existing cloud infrastructures. In response to this, the "Decentralized Web" attempts to distribute ownership and operation of web services more evenly. This paper describes the design and implementation of the largest and most widely used Decentralized Web platform - the InterPlanetary File System (IPFS) - an open-source, content-addressable peer-to-peer network that provides distributed data storage and delivery. IPFS has millions of daily content retrievals and already underpins dozens of third-party applications. This paper evaluates the performance of IPFS by introducing a set of measurement methodologies that allow us to uncover the characteristics of peers in the IPFS network. We reveal presence in more than 2700 Autonomous Systems and 152 countries, the majority of which operate outside large central cloud providers like Amazon or Azure. We further evaluate IPFS performance, showing that both publication and retrieval delays are acceptable for a wide range of use cases. Finally, we share our datasets, experiences and lessons learned.

preprint2022arXiv

Jettisoning Junk Messaging in the Era of End-to-End Encryption: A Case Study of WhatsApp

WhatsApp is a popular messaging app used by over a billion users around the globe. Due to this popularity, understanding misbehavior on WhatsApp is an important issue. The sending of unwanted junk messages by unknown contacts via WhatsApp remains understudied by researchers, in part because of the end-to-end encryption offered by the platform. We address this gap by studying junk messaging on a multilingual dataset of 2.6M messages sent to 5K public WhatsApp groups in India. We characterise both junk content and senders. We find that nearly 1 in 10 messages is unwanted content sent by junk senders, and a number of unique strategies are employed to reflect challenges faced on WhatsApp, e.g., the need to change phone numbers regularly. We finally experiment with on-device classification to automate the detection of junk, whilst respecting end-to-end encryption.

preprint2022arXiv

Toxicity in the Decentralized Web and the Potential for Model Sharing

The "Decentralised Web" (DW) is an evolving concept, which encompasses technologies aimed at providing greater transparency and openness on the web. The DW relies on independent servers (aka instances) that mesh together in a peer-to-peer fashion to deliver a range of services (e.g. micro-blogs, image sharing, video streaming). However, toxic content moderation in this decentralised context is challenging. This is because there is no central entity that can define toxicity, nor a large central pool of data that can be used to build universal classifiers. It is therefore unsurprising that there have been several high-profile cases of the DW being misused to coordinate and disseminate harmful material. Using a dataset of 9.9M posts from 117K users on Pleroma (a popular DW microblogging service), we quantify the presence of toxic content. We find that toxic content is prevalent and spreads rapidly between instances. We show that automating per-instance content moderation is challenging due to the lack of sufficient training data available and the effort required in labelling. We therefore propose and evaluate ModPair, a model sharing system that effectively detects toxic content, gaining an average per-instance macro-F1 score 0.89.

preprint2022arXiv

Twitter Dataset for 2022 Russo-Ukrainian Crisis

Online Social Networks (OSNs) play a significant role in information sharing during a crisis. The data collected during such a crisis can reflect the large scale public opinions and sentiment. In addition, OSN data can also be used to study different campaigns that are employed by various entities to engineer public opinions. Such information sharing campaigns can range from spreading factual information to propaganda and misinformation. We provide a Twitter dataset of the 2022 Russo-Ukrainian conflict. In the first release, we share over 1.6 million tweets shared during the 1st week of the crisis.

preprint2021arXiv

An Empirical Assessment of Global COVID-19 Contact Tracing Applications

The rapid spread of COVID-19 has made manual contact tracing difficult. Thus, various public health authorities have experimented with automatic contact tracing using mobile applications (or "apps"). These apps, however, have raised security and privacy concerns. In this paper, we propose an automated security and privacy assessment tool, COVIDGUARDIAN, which combines identification and analysis of Personal Identification Information (PII), static program analysis and data flow analysis, to determine security and privacy weaknesses. Furthermore, in light of our findings, we undertake a user study to investigate concerns regarding contact tracing apps. We hope that COVIDGUARDIAN, and the issues raised through responsible disclosure to vendors, can contribute to the safe deployment of mobile contact tracing. As part of this, we offer concrete guidelines, and highlight gaps between user requirements and app performance.

preprint2020arXiv

A First Instagram Dataset on COVID-19

The novel coronavirus (COVID-19) pandemic outbreak is drastically shaping and reshaping many aspects of our life, with a huge impact on our social life. In this era of lockdown policies in most of the major cities around the world, we see a huge increase in people and professional engagement in social media. Social media is playing an important role in news propagation as well as keeping people in contact. At the same time, this source is both a blessing and a curse as the coronavirus infodemic has become a major concern, and is already a topic that needs special attention and further research. In this paper, we provide a multilingual coronavirus (COVID-19) Instagram dataset that we have been continuously collected since March 30, 2020. We are making our dataset available to the research community at Github. We believe that this contribution will help the community to better understand the dynamics behind this phenomenon in Instagram, as one of the major social media. This dataset could also help study the propagation of misinformation related to this outbreak.

preprint2020arXiv

Analyzing Temporal Relationships between Trending Terms on Twitter and Urban Dictionary Activity

As an online, crowd-sourced, open English-language slang dictionary, the Urban Dictionary platform contains a wealth of opinions, jokes, and definitions of terms, phrases, acronyms, and more. However, it is unclear exactly how activity on this platform relates to larger conversations happening elsewhere on the web, such as discussions on larger, more popular social media platforms. In this research, we study the temporal activity trends on Urban Dictionary and provide the first analysis of how this activity relates to content being discussed on a major social network: Twitter. By collecting the whole of Urban Dictionary, as well as a large sample of tweets over seven years, we explore the connections between the words and phrases that are defined and searched for on Urban Dictionary and the content that is talked about on Twitter. Through a series of cross-correlation calculations, we identify cases in which Urban Dictionary activity closely reflects the larger conversation happening on Twitter. Then, we analyze the types of terms that have a stronger connection to discussions on Twitter, finding that Urban Dictionary activity that is positively correlated with Twitter is centered around terms related to memes, popular public figures, and offline events. Finally, We explore the relationship between periods of time when terms are trending on Twitter and the corresponding activity on Urban Dictionary, revealing that new definitions are more likely to be added to Urban Dictionary for terms that are currently trending on Twitter.

preprint2020arXiv

Characterising User Content on a Multi-lingual Social Network

Social media has been on the vanguard of political information diffusion in the 21st century. Most studies that look into disinformation, political influence and fake-news focus on mainstream social media platforms. This has inevitably made English an important factor in our current understanding of political activity on social media. As a result, there has only been a limited number of studies into a large portion of the world, including the largest, multilingual and multi-cultural democracy: India. In this paper we present our characterisation of a multilingual social network in India called ShareChat. We collect an exhaustive dataset across 72 weeks before and during the Indian general elections of 2019, across 14 languages. We investigate the cross lingual dynamics by clustering visually similar images together, and exploring how they move across language barriers. We find that Telugu, Malayalam, Tamil and Kannada languages tend to be dominant in soliciting political images (often referred to as memes), and posts from Hindi have the largest cross-lingual diffusion across ShareChat (as well as images containing text in English). In the case of images containing text that cross language barriers, we see that language translation is used to widen the accessibility. That said, we find cases where the same image is associated with very different text (and therefore meanings). This initial characterisation paves the way for more advanced pipelines to understand the dynamics of fake and political content in a multi-lingual and non-textual setting.

preprint2020arXiv

Characterizing EOSIO Blockchain

EOSIO has become one of the most popular blockchain platforms since its mainnet launch in June 2018. In contrast to the traditional PoW-based systems (e.g., Bitcoin and Ethereum), which are limited by low throughput, EOSIO is the first high throughput Delegated Proof of Stake system that has been widely adopted by many applications. Although EOSIO has millions of accounts and billions of transactions, little is known about its ecosystem, especially related to security and fraud. In this paper, we perform a large-scale measurement study of the EOSIO blockchain and its associated DApps. We gather a large-scale dataset of EOSIO and characterize activities including money transfers, account creation and contract invocation. Using our insights, we then develop techniques to automatically detect bots and fraudulent activity. We discover thousands of bot accounts (over 30\% of the accounts in the platform) and a number of real-world attacks (301 attack accounts). By the time of our study, 80 attack accounts we identified have been confirmed by DApp teams, causing 828,824 EOS tokens losses (roughly 2.6 million US\$) in total.