Source author record

Michael Mathioudakis

Michael Mathioudakis appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

cs.CY, Data Structures and Algorithms, Social and Information Networks, Databases, Machine Learning, Artificial Intelligence, Computation and Language, Digital Libraries, Information Retrieval

Catalog footprint

What is connected

11 works

9 topics

4 close collaborators

preprint, 2026, arXiv

Matching Meaning at Scale: Evaluating Semantic Search for 18th-Century Intellectual History through the Case of Locke

While digitized corpora have transformed the study of intellectual transmission, current methods rely heavily on lexical text reuse detection, capturing verbatim quotations but fundamentally missing paraphrases and complex implicit engagement. This paper evaluates semantic search in 18th-century intellectual history through the reception of John Locke's foundational work. Using expert annotation grounded in a semantic taxonomy, we examine whether an off-the-shelf semantic search pipeline can surface meaning-level correspondences overlooked by lexical methods. Our results demonstrate that semantic search retrieves substantially more implicit receptions than lexical baselines. However, linguistic diagnostics also reveal a "lexical gatekeeping" effect, where retrieval remains partially constrained by surface vocabulary overlap. These findings highlight both the potential and the limitations of semantic retrieval for analyzing the circulation of ideas in large historical corpora. The data is available at https://github.com/COMHIS/locke-sim-data.

preprint, 2021, arXiv

Affirmative Action Policies for Top-k Candidates Selection, With an Application to the Design of Policies for University Admissions

We consider the problem of designing affirmative action policies for selecting the top-k candidates from a pool of applicants. We assume that for each candidate we have socio-demographic attributes and a series of variables that serve as indicators of future performance (e.g., results on standardized tests). We further assume that we have access to historical data including the actual performance of previously selected candidates. Critically, performance information is only available for candidates who were selected under some previous selection policy. In this work we assume that due to legal requirements or voluntary commitments, an organization wants to increase the presence of people from disadvantaged socio-demographic groups among the selected candidates. Hence, we seek to design an affirmative action or positive action policy. This policy has two concurrent objectives: (i) to select candidates who, given what can be learnt from historical data, are more likely to perform well, and (ii) to select candidates in a way that increases the representation of disadvantaged socio-demographic groups. Our motivating application is the design of university admission policies to bachelor's degrees. We use a causal model as a framework to describe several families of policies (changing component weights, giving bonuses, and enacting quotas), and compare them both theoretically and through extensive experimentation on a large real-world dataset containing thousands of university applicants. Our paper is the first to place the problem of affirmative-action policy design within the framework of algorithmic fairness. Our empirical results indicate that simple policies could favor the admission of disadvantaged groups without significantly compromising on the quality of accepted candidates.
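As a rough illustration of two of the policy families named in the abstract (bonuses and quotas), the following minimal sketch selects the top-k candidates from a toy pool. It is not the paper's causal model or its evaluation; the `Candidate` type, the function names, and the example scores are all hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    # Hypothetical record: a name, one aggregate score, and a group flag.
    name: str
    score: float
    disadvantaged: bool


def select_with_bonus(cands: List[Candidate], k: int, bonus: float) -> List[str]:
    """Rank by score, adding a fixed bonus for disadvantaged candidates."""
    ranked = sorted(
        cands,
        key=lambda c: c.score + (bonus if c.disadvantaged else 0.0),
        reverse=True,
    )
    return [c.name for c in ranked[:k]]


def select_with_quota(cands: List[Candidate], k: int, min_disadvantaged: int) -> List[str]:
    """Reserve up to min_disadvantaged seats, then fill the rest by score."""
    by_score = sorted(cands, key=lambda c: c.score, reverse=True)
    reserved = [c for c in by_score if c.disadvantaged][: min(min_disadvantaged, k)]
    rest = [c for c in by_score if c not in reserved][: k - len(reserved)]
    return [c.name for c in reserved + rest]


# Toy pool (made-up numbers, for illustration only).
pool = [
    Candidate("a", 90, False),
    Candidate("b", 85, False),
    Candidate("c", 80, True),
    Candidate("d", 78, False),
    Candidate("e", 70, True),
]
```

With this pool, a bonus of 0 reproduces the plain top-k ranking, while a larger bonus or a quota of 2 admits both disadvantaged candidates at the cost of a lower-scoring selection.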

preprint, 2021, arXiv

Fair and Representative Subset Selection from Data Streams

We study the problem of extracting a small subset of representative items from a large data stream. In many data mining and machine learning applications such as social network analysis and recommender systems, this problem can be formulated as maximizing a monotone submodular function subject to a cardinality constraint $k$. In this work, we consider the setting where data items in the stream belong to one of several disjoint groups and investigate the optimization problem with an additional fairness constraint that limits selection to a given number of items from each group. We then propose efficient algorithms for the fairness-aware variant of the streaming submodular maximization problem. In particular, we first give a $(\frac{1}{2}-\varepsilon)$-approximation algorithm that requires $O(\frac{1}{\varepsilon} \log \frac{k}{\varepsilon})$ passes over the stream for any constant $\varepsilon>0$. Moreover, we give a single-pass streaming algorithm that has the same approximation ratio of $(\frac{1}{2}-\varepsilon)$ when unlimited buffer sizes and post-processing time are permitted, and discuss how to adapt it to more practical settings where the buffer sizes are bounded. Finally, we demonstrate the efficiency and effectiveness of our proposed algorithms on two real-world applications, namely maximum coverage on large graphs and personalized recommendation.
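To make the constraint concrete, here is a minimal offline sketch of greedy maximum coverage with a per-group cap on selected items. It is a simple baseline that sees all items at once, not the multi-pass or single-pass streaming algorithms the abstract proposes; the function name, argument names, and toy data are assumptions.

```python
from typing import Dict, Hashable, List, Set, Tuple


def greedy_fair_cover(
    items: Dict[str, Set[Hashable]],  # item id -> elements it covers
    group_of: Dict[str, str],         # item id -> group label
    caps: Dict[str, int],             # group label -> max items selectable
    k: int,                           # overall cardinality constraint
) -> Tuple[List[str], Set[Hashable]]:
    """Offline greedy for coverage (a monotone submodular function)."""
    chosen: List[str] = []
    covered: Set[Hashable] = set()
    used = {g: 0 for g in caps}
    while len(chosen) < k:
        best, best_gain = None, 0
        for item, elems in items.items():
            # Skip items already chosen or whose group has reached its cap.
            if item in chosen or used[group_of[item]] >= caps[group_of[item]]:
                continue
            gain = len(elems - covered)  # marginal coverage gain
            if gain > best_gain:
                best, best_gain = item, gain
        if best is None:  # no feasible item adds coverage
            break
        chosen.append(best)
        covered |= items[best]
        used[group_of[best]] += 1
    return chosen, covered


# Toy instance: groups "x" and "y", at most one item from each.
items = {"a": {1, 2, 3, 4}, "b": {4, 5, 6}, "c": {7}, "d": {1, 7, 8}}
group_of = {"a": "x", "b": "x", "c": "y", "d": "y"}
chosen, covered = greedy_fair_cover(items, group_of, {"x": 1, "y": 1}, k=2)
```

In this toy instance the cap on group "x" blocks item "b" after "a" is taken, so the second pick comes from group "y".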

preprint, 2021, arXiv

Intersectional Affirmative Action Policies for Top-k Candidates Selection

We study the problem of selecting the top-k candidates from a pool of applicants, where each candidate is associated with a score indicating their aptitude. Depending on the specific scenario, such as job search or college admissions, these scores may be the results of standardized tests or other predictors of future performance and utility. We consider a situation in which some groups of candidates experience historical and present disadvantage that makes their chances of being accepted much lower than other groups. In these circumstances, we wish to apply an affirmative action policy to reduce acceptance rate disparities, while avoiding any large decrease in the aptitude of the candidates that are eventually selected. Our algorithmic design is motivated by the frequently observed phenomenon that discrimination disproportionately affects individuals who simultaneously belong to multiple disadvantaged groups, defined along intersecting dimensions such as gender, race, sexual orientation, socio-economic status, and disability. In short, our algorithm's objective is to simultaneously select candidates with high utility and level up the representation of disadvantaged intersectional classes. This naturally involves trade-offs and is computationally challenging due to the combinatorial explosion of potential subgroups as more attributes are considered. We propose two algorithms to solve this problem, analyze them, and evaluate them experimentally using a dataset of university application scores and admissions to bachelor's degrees in an OECD country. Our conclusion is that it is possible to significantly reduce disparities in admission rates affecting intersectional classes with a small loss in terms of selected candidate aptitude. To the best of our knowledge, we are the first to study fairness constraints with regard to intersectional classes in the context of top-k selection.

preprint, 2021, arXiv

Query the model: precomputations for efficient inference with Bayesian Networks

Variable Elimination is a fundamental algorithm for probabilistic inference over Bayesian networks. In this paper, we propose a novel materialization method for Variable Elimination, which can lead to significant efficiency gains when answering inference queries. We evaluate our technique using real-world Bayesian networks. Our results show that a modest amount of materialization can lead to significant improvements in the running time of queries. Furthermore, in comparison with junction tree methods that also rely on materialization, our approach achieves comparable efficiency during inference using significantly lighter materialization.

preprint, 2020, arXiv

GRMR: Generalized Regret-Minimizing Representatives

Extracting a small subset of representative tuples from a large database is an important task in multi-criteria decision making. The regret-minimizing set (RMS) problem was recently proposed for representative discovery from databases. Specifically, for a set of tuples (points) in $d$ dimensions, the RMS problem asks for the smallest subset such that, for any possible ranking function, the relative difference in scores between the top-ranked point in the subset and the top-ranked point in the entire database is within a parameter $\varepsilon \in (0,1)$. Although RMS and its variations have been extensively investigated in the literature, existing approaches only consider the class of nonnegative (monotonic) linear functions for ranking, which have limitations in modeling user preferences and decision-making processes. To address this issue, we define the generalized regret-minimizing representative (GRMR) problem that extends RMS by taking into account all linear functions including non-monotonic ones with negative weights. For two-dimensional databases, we propose an optimal algorithm for GRMR via a transformation into the shortest cycle problem in a directed graph. Since GRMR is proven to be NP-hard even in three dimensions, we further develop a polynomial-time heuristic algorithm for GRMR on databases in arbitrary dimensions. Finally, we conduct extensive experiments on real and synthetic datasets to confirm the efficiency, effectiveness, and scalability of our proposed algorithms.
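The relative score difference described in the abstract can be sketched for a single linear ranking function as follows. This only evaluates a given subset against a finite sample of weight vectors; it is not the paper's optimal two-dimensional algorithm or its heuristic. The function names are hypothetical, and the sketch assumes the top score over the full database is positive, since the abstract does not say how other cases are handled.

```python
from typing import List, Sequence

Point = Sequence[float]


def score(w: Sequence[float], p: Point) -> float:
    """Linear ranking function: inner product of weights and point."""
    return sum(wi * pi for wi, pi in zip(w, p))


def regret_ratio(points: List[Point], subset: List[Point], w: Sequence[float]) -> float:
    """Relative gap between the best score in the database and in the subset."""
    best_all = max(score(w, p) for p in points)
    best_sub = max(score(w, p) for p in subset)
    if best_all <= 0:
        # Assumption of this sketch: only positive top scores are supported.
        raise ValueError("top score must be positive in this sketch")
    return (best_all - best_sub) / best_all


def max_regret(points: List[Point], subset: List[Point],
               directions: List[Sequence[float]]) -> float:
    """Worst regret ratio over a finite sample of weight vectors."""
    return max(regret_ratio(points, subset, w) for w in directions)


points = [(1.0, 0.0), (0.0, 1.0), (0.8, 0.8)]
subset = [(0.8, 0.8)]
# A non-monotonic function such as w = (1, -1) scores the subset point at 0,
# so this subset has regret ratio 1 for it.
worst = max_regret(points, subset, [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0)])
```

The toy data shows why negative weights matter: a point that is a good representative for all nonnegative weightings can be a poor one once a weight changes sign.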

preprint, 2020, arXiv

Towards Data-Driven Affirmative Action Policies under Uncertainty

In this paper, we study university admissions under a centralized system that uses grades and standardized test scores to match applicants to university programs. We consider affirmative action policies that seek to increase the number of admitted applicants from underrepresented groups. Since such a policy has to be announced before the start of the application period, there is uncertainty about the score distribution of the students applying to each program. This poses a difficult challenge for policy-makers. We explore the possibility of using a predictive model trained on historical data to help optimize the parameters of such policies.

preprint, 2018, arXiv

Markov Chain Monitoring

In networking applications, one often wishes to obtain estimates about the number of objects at different parts of the network (e.g., the number of cars at an intersection of a road network or the number of packets expected to reach a node in a computer network) by monitoring the traffic in a small number of network nodes or edges. We formalize this task by defining the 'Markov Chain Monitoring' problem. Given an initial distribution of items over the nodes of a Markov chain, we wish to estimate the distribution of items at subsequent times. We do this by asking a limited number of queries that retrieve, for example, how many items transitioned to a specific node or over a specific edge at a particular time. We consider different types of queries, each defining a different variant of the Markov chain monitoring problem. For each variant, we design efficient algorithms for choosing the queries that make our estimates as accurate as possible. In our experiments with synthetic and real datasets we demonstrate the efficiency and the efficacy of our algorithms in a variety of settings.
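The quantity being estimated can be sketched with the standard forward update $x_{t+1} = x_t P$ for expected item counts. The code below covers only that baseline expectation and the expected flow over one edge (the kind of value a query would observe); it does not implement the paper's query-selection algorithms. The function names and the two-node chain are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def step(x: List[float], P: Matrix) -> List[float]:
    """One step of the chain on expected counts: x_next[j] = sum_i x[i] * P[i][j]."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]


def expected_counts(x0: List[float], P: Matrix, t: int) -> List[float]:
    """Expected number of items at each node after t steps."""
    x = list(x0)
    for _ in range(t):
        x = step(x, P)
    return x


def expected_edge_flow(x0: List[float], P: Matrix, t: int, i: int, j: int) -> float:
    """Expected number of items moving over edge (i, j) between time t and t + 1."""
    return expected_counts(x0, P, t)[i] * P[i][j]


# Toy chain: node 0 keeps or forwards items with equal probability,
# node 1 keeps everything it receives.
P = [[0.5, 0.5],
     [0.0, 1.0]]
x2 = expected_counts([10.0, 0.0], P, 2)
```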

preprint, 2016, arXiv

Extracting Patterns of Urban Activity from Geotagged Social Data

Data generated on location-based social networks provide rich information on the whereabouts of urban dwellers. Specifically, such data reveal who spends time where, when, and on what type of activity (e.g., shopping at a mall, or dining at a restaurant). That information can, in turn, be used to describe city regions in terms of activity that takes place therein. For example, the data might reveal that citizens visit one region mainly for shopping in the morning, while another for dining in the evening. Furthermore, once such a description is available, one can ask more elaborate questions: What are the features that distinguish one region from another -- is it simply the type of venues they host or is it the visitors they attract? What regions are similar across cities? In this paper, we attempt to answer these questions using publicly shared Foursquare data. In contrast with previous work, our method makes use of a probabilistic model with minimal assumptions about the data and thus relieves us from having to make arbitrary decisions in our analysis (e.g., regarding the granularity of discovered regions or the importance of different features). We perform an empirical comparison with previous work and discuss insights obtained through our findings.

preprint, 2015, arXiv

Absorbing random-walk centrality: Theory and algorithms

We study a new notion of graph centrality based on absorbing random walks. Given a graph $G=(V,E)$ and a set of query nodes $Q\subseteq V$, we aim to identify the $k$ most central nodes in $G$ with respect to $Q$. Specifically, we consider central nodes to be absorbing for random walks that start at the query nodes $Q$. The goal is to find the set of $k$ central nodes that minimizes the expected length of a random walk until absorption. The proposed measure, which we call $k$ absorbing random-walk centrality, favors diverse sets, as it is beneficial to place the $k$ absorbing nodes in different parts of the graph so as to "intercept" random walks that start from different query nodes. Although similar problem definitions have been considered in the literature, e.g., in information-retrieval settings where the goal is to diversify web-search results, in this paper we study the problem formally and prove some of its properties. We show that the problem is NP-hard, while the objective function is monotone and supermodular, implying that a greedy algorithm provides solutions with an approximation guarantee. On the other hand, the greedy algorithm involves expensive matrix operations that make it prohibitive to employ on large datasets. To confront this challenge, we develop more efficient algorithms based on spectral clustering and on personalized PageRank.
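The objective and the greedy strategy mentioned in the abstract can be sketched on a small graph. The code assumes a plain random walk on a connected undirected graph with no restart or other refinements the paper may use, and it solves a dense linear system for every candidate. That is exactly the expensive step the abstract warns about, so this is a brute-force illustration rather than the paper's efficient algorithms. All function names are hypothetical.

```python
from typing import Dict, List, Sequence, Set

Graph = Dict[int, List[int]]  # node -> list of neighbours


def solve(A: List[List[float]], b: List[float]) -> List[float]:
    """Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                for c in range(col, n + 1):
                    M[r][c] -= f * M[col][c]
    return [M[i][n] / M[i][i] for i in range(n)]


def absorption_times(adj: Graph, absorbing: Set[int]) -> Dict[int, float]:
    """Expected steps to absorption: h(v) = 1 + mean of h over neighbours of v."""
    transient = [v for v in adj if v not in absorbing]
    idx = {v: i for i, v in enumerate(transient)}
    n = len(transient)
    A = [[0.0] * n for _ in range(n)]
    for v in transient:
        A[idx[v]][idx[v]] += 1.0
        for u in adj[v]:
            if u in idx:  # absorbing neighbours contribute h = 0
                A[idx[v]][idx[u]] -= 1.0 / len(adj[v])
    h = solve(A, [1.0] * n)
    times = {v: 0.0 for v in absorbing}
    times.update({v: h[idx[v]] for v in transient})
    return times


def mean_time(adj: Graph, query: Sequence[int], absorbing: Set[int]) -> float:
    """Average expected absorption time over walks started at the query nodes."""
    t = absorption_times(adj, absorbing)
    return sum(t[q] for q in query) / len(query)


def greedy_central(adj: Graph, query: Sequence[int], k: int) -> List[int]:
    """Add, k times, the node that most reduces the mean absorption time."""
    chosen: List[int] = []
    for _ in range(k):
        candidates = [v for v in adj if v not in chosen]
        best = min(candidates, key=lambda v: mean_time(adj, query, set(chosen) | {v}))
        chosen.append(best)
    return chosen


# Path graph 0-1-2-3-4 with query nodes at both ends.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
first = greedy_central(path, [0, 4], 1)
```

On the path graph the first greedy pick is the middle node, which intercepts walks from both ends.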

preprint, 2015, arXiv

Exploring Controversy in Twitter

Among the topics discussed on social media, some spark more heated debate than others. For example, experience suggests that major political events, such as a vote on a healthcare law in the US, would spark more debate between opposing sides than other events, such as a concert by a popular music band. Exploring the topics of discussion on Twitter and understanding which ones are controversial is extremely useful for a variety of purposes, such as for journalists to understand what issues divide the public, or for social scientists to understand how controversy is manifested in social interactions. The system we present processes the daily trending topics discussed on the platform, and assigns to each topic a controversy score, which is computed based on the interactions among Twitter users, and a visualization of these interactions, which provides an intuitive visual cue regarding the controversy of the topic. The system also allows users to explore the messages (tweets) associated with each topic, and sort and explore the topics by different criteria (e.g., by controversy score, time, or related keywords).