Source author record

Gianluca Demartini

Gianluca Demartini appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Information Retrieval Artificial Intelligence Computation and Language Human-Computer Interaction Databases Social and Information Networks

Catalog footprint

What is connected

9works

6topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Query-Document Dense Vectors for LLM Relevance Judgment Bias Analysis

Large Language Models (LLMs) have been used as relevance assessors for Information Retrieval (IR) evaluation collection creation due to reduced cost and increased scalability as compared to human assessors. While previous research has looked at the reliability of LLMs as compared to human assessors, in this work, we aim to understand if LLMs make systematic mistakes when judging relevance, rather than just understanding how good they are on average. To this aim, we propose a novel representational method for queries and documents that allows us to analyze relevance label distributions and compare LLM and human labels to identify patterns of disagreement and localize systematic areas of disagreement. We introduce a clustering-based framework that embeds query-document (Q-D) pairs into a joint semantic space, treating relevance as a relational property. Experiments on TREC Deep Learning 2019 and 2020 show that systematic disagreement between humans and LLMs is concentrated in specific semantic clusters rather than distributed randomly. Query-level analyses reveal recurring failures, most often in definition-seeking, policy-related, or ambiguous contexts. Queries with large variation in agreement across their clusters emerge as disagreement hotspots, where LLMs tend to under-recall relevant content or over-include irrelevant material. This framework links global diagnostics with localized clustering to uncover hidden weaknesses in LLM judgments, enabling bias-aware and more reliable IR evaluation.

preprint2026arXiv

The Impact of AI Generated Content on Decision Making for Topics Requiring Expertise

Modelling users' online decision-making and opinion change is a complex issue that needs to consider users' personal determinants, the nature of the topic and the information retrieval activities. Furthermore, generative-AIbased products like ChatGPT gradually become an essential element for the retrieval of online information. However, the interaction between domainspecific knowledge and AI-generated content during online decision-making is unclear. We conducted a lab-based explanatory sequential study with university students to overcome this research gap. In the experiment, we surveyed participants about a set of general domain topics that are easy to grasp and another set of domain-specific topics that require adequate levels of chemical science knowledge to fully comprehend. We provided participants with decision-supporting information that was either produced using generative AI or collected from selected expert human-written sources to explore the role of AI-generated content compared to ordinary information during decision-making. Our result revealed that participants are less likely to change opinions on domain-specific topics. Since participants without professional knowledge had difficulty performing in-depth and independent reasoning based on the information, they favoured relying on conclusions presented in the provided materials and tended to stick to their initial opinion. Besides, information that is labelled as AI-generated is equivalently helpful as information labelled as dedicatedly human-written for participants in this experiment, indicating the vast potential as well as concerns for AI replacing human experts to help users tackle professional topics or issues.

preprint2024arXiv

Identification of Regulatory Requirements Relevant to Business Processes: A Comparative Study on Generative AI, Embedding-based Ranking, Crowd and Expert-driven Methods

Organizations face the challenge of ensuring compliance with an increasing amount of requirements from various regulatory documents. Which requirements are relevant depends on aspects such as the geographic location of the organization, its domain, size, and business processes. Considering these contextual factors, as a first step, relevant documents (e.g., laws, regulations, directives, policies) are identified, followed by a more detailed analysis of which parts of the identified documents are relevant for which step of a given business process. Nowadays the identification of regulatory requirements relevant to business processes is mostly done manually by domain and legal experts, posing a tremendous effort on them, especially for a large number of regulatory documents which might frequently change. Hence, this work examines how legal and domain experts can be assisted in the assessment of relevant requirements. For this, we compare an embedding-based NLP ranking method, a generative AI method using GPT-4, and a crowdsourced method with the purely manual method of creating relevancy labels by experts. The proposed methods are evaluated based on two case studies: an Australian insurance case created with domain experts and a global banking use case, adapted from SAP Signavio's workflow example of an international guideline. A gold standard is created for both BPMN2.0 processes and matched to real-world textual requirements from multiple regulatory documents. The evaluation and discussion provide insights into strengths and weaknesses of each method regarding applicability, automation, transparency, and reproducibility and provide guidelines on which method combinations will maximize benefits for given characteristics such as process usage, impact, and dynamics of an application scenario.

preprint2022arXiv

Crowdsourced Fact-Checking at Twitter: How Does the Crowd Compare With Experts?

Fact-checking is one of the effective solutions in fighting online misinformation. However, traditional fact-checking is a process requiring scarce expert human resources, and thus does not scale well on social media because of the continuous flow of new content to be checked. Methods based on crowdsourcing have been proposed to tackle this challenge, as they can scale with a smaller cost, but, while they have shown to be feasible, have always been studied in controlled environments. In this work, we study the first large-scale effort of crowdsourced fact-checking deployed in practice, started by Twitter with the Birdwatch program. Our analysis shows that crowdsourcing may be an effective fact-checking strategy in some settings, even comparable to results obtained by human experts, but does not lead to consistent, actionable results in others. We processed 11.9k tweets verified by the Birdwatch program and report empirical evidence of i) differences in how the crowd and experts select content to be fact-checked, ii) how the crowd and the experts retrieve different resources to fact-check, and iii) the edge the crowd shows in fact-checking scalability and efficiency as compared to expert checkers.

preprint2020arXiv

Can The Crowd Identify Misinformation Objectively? The Effects of Judgment Scale and Assessor's Background

Truthfulness judgments are a fundamental step in the process of fighting misinformation, as they are crucial to train and evaluate classifiers that automatically distinguish true and false statements. Usually such judgments are made by experts, like journalists for political statements or medical doctors for medical statements. In this paper, we follow a different approach and rely on (non-expert) crowd workers. This of course leads to the following research question: Can crowdsourcing be reliably used to assess the truthfulness of information and to create large-scale labeled collections for information credibility systems? To address this issue, we present the results of an extensive study based on crowdsourcing: we collect thousands of truthfulness assessments over two datasets, and we compare expert judgments with crowd judgments, expressed on scales with various granularity levels. We also measure the political bias and the cognitive background of the workers, and quantify their effect on the reliability of the data provided by the crowd.

preprint2020arXiv

Proceedings of the KG-BIAS Workshop 2020 at AKBC 2020

The KG-BIAS 2020 workshop touches on biases and how they surface in knowledge graphs (KGs), biases in the source data that is used to create KGs, methods for measuring or remediating bias in KGs, but also identifying other biases such as how and which languages are represented in automatically constructed KGs or how personal KGs might incur inherent biases. The goal of this workshop is to uncover how various types of biases are introduced into KGs, investigate how to measure, and propose methods to remediate them.

preprint2020arXiv

The COVID-19 Infodemic: Can the Crowd Judge Recent Misinformation Objectively?

Misinformation is an ever increasing problem that is difficult to solve for the research community and has a negative impact on the society at large. Very recently, the problem has been addressed with a crowdsourcing-based approach to scale up labeling efforts: to assess the truthfulness of a statement, instead of relying on a few experts, a crowd of (non-expert) judges is exploited. We follow the same approach to study whether crowdsourcing is an effective and reliable method to assess statements truthfulness during a pandemic. We specifically target statements related to the COVID-19 health emergency, that is still ongoing at the time of the study and has arguably caused an increase of the amount of misinformation that is spreading online (a phenomenon for which the term "infodemic" has been used). By doing so, we are able to address (mis)information that is both related to a sensitive and personal issue like health and very recent as compared to when the judgment is done: two issues that have not been analyzed in related work. In our experiment, crowd workers are asked to assess the truthfulness of statements, as well as to provide evidence for the assessments as a URL and a text justification. Besides showing that the crowd is able to accurately judge the truthfulness of the statements, we also report results on many different aspects, including: agreement among workers, the effect of different aggregation functions, of scales transformations, and of workers background / bias. We also analyze workers behavior, in terms of queries submitted, URLs found / selected, text justifications, and other behavioral data like clicks and mouse actions collected by means of an ad hoc logger.

preprint2016arXiv

Pairwise, Magnitude, or Stars: What's the Best Way for Crowds to Rate?

We compare three popular techniques of rating content: the ubiquitous five star rating, the less used pairwise comparison, and the recently introduced (in crowdsourcing) magnitude estimation approach. Each system has specific advantages and disadvantages, in terms of required user effort, achievable user preference prediction accuracy and number of ratings required. We design an experiment where the three techniques are compared in an unbiased way. We collected 39'000 ratings on a popular crowdsourcing platform, allowing us to release a dataset that will be useful for many related studies on user rating techniques.

preprint2016arXiv

The Effect of Class Imbalance and Order on Crowdsourced Relevance Judgments

In this paper we study the effect on crowd worker efficiency and effectiveness of the dominance of one class in the data they process. We aim at understanding if there is any positive or negative bias in workers seeing many negative examples in the identification of positive labels. To test our hypothesis, we design an experiment where crowd workers are asked to judge the relevance of documents presented in different orders. Our findings indicate that there is a significant improvement in the quality of relevance judgements when presenting relevant results before the non-relevant ones.

Gianluca Demartini

What is connected

Connect this record

See the researcher in context

Building this map preview

9 published item(s)

Query-Document Dense Vectors for LLM Relevance Judgment Bias Analysis

The Impact of AI Generated Content on Decision Making for Topics Requiring Expertise

Identification of Regulatory Requirements Relevant to Business Processes: A Comparative Study on Generative AI, Embedding-based Ranking, Crowd and Expert-driven Methods

Crowdsourced Fact-Checking at Twitter: How Does the Crowd Compare With Experts?

Can The Crowd Identify Misinformation Objectively? The Effects of Judgment Scale and Assessor's Background

Proceedings of the KG-BIAS Workshop 2020 at AKBC 2020

The COVID-19 Infodemic: Can the Crowd Judge Recent Misinformation Objectively?

Pairwise, Magnitude, or Stars: What's the Best Way for Crowds to Rate?

The Effect of Class Imbalance and Order on Crowdsourced Relevance Judgments