Source author record

Jeffrey Chan

Jeffrey Chan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Machine Learning, astro-ph.GA, astro-ph.CO, cs.CY, Information Retrieval, Artificial Intelligence, Computation and Language, Data Structures and Algorithms, math.PR, physics.soc-ph, Populations and Evolution, Social and Information Networks

Catalog footprint

14 works, 12 topics, 4 close collaborators

preprint, 2026, arXiv

Task-Aware Automated User Profile Generation for Recommendation Simulation Using Large Language Models

Large Language Model (LLM)-based agent simulation has emerged as a promising approach to meet the increasing demand for real-time and rigorous evaluation in modern recommender systems. A typical LLM-driven simulation framework comprises three essential components: the profile module, memory module, and action module. However, existing studies have primarily concentrated on enhancing the memory and action modules, with limited attention to profile generation, which plays a pivotal role in ensuring realistic agent behaviours and aligning simulated interactions with real user dynamics. Moreover, the scarcity of datasets specifically designed for recommendation simulations has led to heavy reliance on manually crafted profiles, significantly limiting the scalability and generalisability of simulation frameworks across different datasets. To address these challenges, this work proposes an Automated Profile Generation Framework for Recommendation Simulation, APG4RecSim, that constructs realistic, coherent, and robust user profiles with minimal supervision. Extensive experiments on three benchmark datasets demonstrate that APG4RecSim achieves the best overall performance on discrimination, ranking, and rating tasks, improving ranking quality by up to 7% in nDCG@10 and reducing rating distribution divergence by 8% in JSD compared to existing profile-generation baselines. Beyond overall performance gains, our results show that profiles generated by APG4RecSim are resilient to popularity- and position-induced biases and maintain stable performance across datasets and different LLMs.
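The two metrics quoted above, nDCG@10 and Jensen-Shannon divergence (JSD), can be sketched as follows. This is a minimal illustration, not the evaluation code of the paper: it assumes linear gain, an ideal ranking built from the same list of relevance scores, and base-2 logarithms; the function names are hypothetical.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k items of a ranked list."""
    # Assumption: linear gain (rel) rather than exponential gain (2**rel - 1).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    # Assumption: the ideal ordering is the same relevance list sorted descending.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With base-2 logarithms JSD lies between 0 (identical distributions) and 1 (disjoint support), so a lower value means the simulated rating distribution is closer to the real one.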

preprint, 2022, arXiv

Measuring disentangled generative spatio-temporal representation

Disentangled representation learning offers useful properties such as dimension reduction and interpretability, which are essential to modern deep learning approaches. Although deep learning techniques have been widely applied to spatio-temporal data mining, little attention has been paid to further disentangling the latent features and understanding their contribution to model performance, particularly their mutual information and correlation across features. In this study, we adopt two state-of-the-art disentangled representation learning methods and apply them to three large-scale public spatio-temporal datasets. To evaluate their performance, we propose an internal evaluation metric focusing on the degree of correlation among latent variables of the learned representations and the prediction performance of the downstream tasks. Empirical results show that our modified method can learn disentangled representations that achieve the same level of performance as existing state-of-the-art ST deep learning methods in a spatio-temporal sequence forecasting problem. Additionally, we find that our methods can be used to discover real-world spatio-temporal semantics to describe the variables in the learned representation.

preprint, 2022, arXiv

MurTree: Optimal Classification Trees via Dynamic Programming and Search

Decision tree learning is a widely used approach in machine learning, favoured in applications that require concise and interpretable models. Heuristic methods are traditionally used to quickly produce models with reasonably high accuracy. A commonly criticised point, however, is that the resulting trees may not necessarily be the best representation of the data in terms of accuracy and size. In recent years, this motivated the development of optimal classification tree algorithms that globally optimise the decision tree in contrast to heuristic methods that perform a sequence of locally optimal decisions. We follow this line of work and provide a novel algorithm for learning optimal classification trees based on dynamic programming and search. Our algorithm supports constraints on the depth of the tree and number of nodes. The success of our approach is attributed to a series of specialised techniques that exploit properties unique to classification trees. Whereas algorithms for optimal classification trees have traditionally been plagued by high runtimes and limited scalability, we show in a detailed experimental study that our approach uses only a fraction of the time required by the state-of-the-art and can handle datasets with tens of thousands of instances, providing several orders of magnitude improvements and notably contributing towards the practical realisation of optimal decision trees.
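The contrast between globally optimal and greedy trees can be illustrated with the plain recursion that underlies depth-constrained optimal classification trees. The sketch below is an exhaustive search over binary features, not the MurTree algorithm: it has none of the specialised techniques the abstract refers to, and the function name and data layout are hypothetical.

```python
def min_misclassifications(data, depth):
    """Fewest training errors of any tree of depth at most `depth`.

    data: list of (binary feature tuple, binary label) pairs.
    """
    positives = sum(label for _, label in data)
    # A leaf predicts the majority class, so it misclassifies the minority.
    leaf_errors = min(positives, len(data) - positives)
    if depth == 0 or leaf_errors == 0:
        return leaf_errors
    best = leaf_errors
    for f in range(len(data[0][0])):
        left = [row for row in data if row[0][f] == 0]
        right = [row for row in data if row[0][f] == 1]
        if not left or not right:
            continue  # this feature does not split the data
        # Optimal cost of splitting on f is the sum of the optimal subtree costs.
        cost = (min_misclassifications(left, depth - 1)
                + min_misclassifications(right, depth - 1))
        best = min(best, cost)
    return best
```

On XOR-labelled data no single split reduces the error, so a greedy heuristic has no good first move, while the exhaustive recursion finds the perfect depth-2 tree. The cost grows exponentially with depth, which is the scalability problem the abstract describes.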

preprint, 2022, arXiv

Sample-Efficient, Exploration-Based Policy Optimisation for Routing Problems

Model-free deep reinforcement learning algorithms have been applied to a range of combinatorial optimisation problems (COPs) [bello2016neural; kool2018attention; nazari2018reinforcement]. However, these approaches suffer from two key challenges when applied to combinatorial problems: insufficient exploration and the need for many training examples of the search space to achieve reasonable performance. Combinatorial optimisation can be complex, characterised by large search spaces with many optima. Therefore, a new method is needed that finds good solutions more sample-efficiently. This paper presents a new reinforcement learning approach that is based on entropy. In addition, we design an off-policy reinforcement learning technique that maximises the expected return and improves sample efficiency to achieve faster learning during training. We systematically evaluate our approach on a range of route optimisation tasks typically used to evaluate learning-based optimisation, such as the Travelling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP). We show that our model can generalise to various route problems, such as the split-delivery VRP (SDVRP), and compare the performance of our method with that of current state-of-the-art approaches. The empirical results show that the proposed method can improve on state-of-the-art methods in terms of solution quality and computation time and generalise to problems of different sizes.

preprint, 2020, arXiv

An Ambient-Physical System to Infer Concentration in Open-plan Workplace

One of the core challenges in open-plan workspaces is to ensure a good level of concentration for the workers while performing their tasks. Hence, being able to infer concentration levels of workers will allow building designers, managers, and workers to estimate what effect different open-plan layouts will have and to find an optimal one. In this research, we present an ambient-physical system to investigate the concentration inference problem. Specifically, we deploy a series of pervasive sensors to capture various ambient and physical signals related to perceived concentration at work. The practicality of our system has been tested on two large open-plan workplaces with different designs and layouts. The empirical results highlight promising applications of pervasive sensing in occupational concentration inference, which can be adopted to enhance the capabilities of modern workplaces.

preprint, 2020, arXiv

Less is More: Rejecting Unreliable Reviews for Product Question Answering

Promptly and accurately answering questions on products is important for e-commerce applications. Manually answering product questions (e.g. on community question answering platforms) results in slow response and does not scale. Recent studies show that product reviews are a good source for real-time, automatic product question answering (PQA). In the literature, PQA is formulated as a retrieval problem with the goal to search for the most relevant reviews to answer a given product question. In this paper, we focus on the issue of answerability and answer reliability for PQA using reviews. Our investigation is based on the intuition that many questions may not be answerable with a finite set of reviews. When a question is not answerable, a system should return nil answers rather than providing a list of irrelevant reviews, which can have significant negative impact on user experience. Moreover, for answerable questions, only the most relevant reviews that answer the question should be included in the result. We propose a conformal prediction based framework to improve the reliability of PQA systems, where we reject unreliable answers so that the returned results are more concise and accurate at answering the product question, including returning nil answers for unanswerable questions. Experiments on a widely used Amazon dataset show encouraging results of our proposed framework. More broadly, our results demonstrate a novel and effective application of conformal methods to a retrieval task.
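The rejection idea can be sketched with a generic inductive conformal predictor. This is an illustration under stated assumptions, not the framework from the paper: each candidate review is assumed to carry a nonconformity score (higher means it looks less like a relevant review), the calibration scores are assumed to come from held-out question-review pairs known to be relevant, and all names are hypothetical.

```python
def conformal_p_value(score, calibration_scores):
    """Share of calibration scores at least as nonconforming as `score`."""
    at_least = sum(1 for s in calibration_scores if s >= score)
    # The +1 terms count the candidate itself, as in inductive conformal prediction.
    return (at_least + 1) / (len(calibration_scores) + 1)

def filter_reviews(candidates, calibration_scores, significance=0.1):
    """Keep reviews whose p-value exceeds the significance level.

    candidates: list of (review, nonconformity score) pairs.
    An empty result plays the role of a nil answer for an unanswerable question.
    """
    return [review for review, score in candidates
            if conformal_p_value(score, calibration_scores) > significance]
```

Lowering `significance` keeps more reviews and raising it rejects more, so the parameter trades conciseness against the risk of discarding a useful review.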

preprint, 2020, arXiv

The kinematics of massive quiescent galaxies at $1.4 < z < 2.1$: dark matter fractions, IMF variation, and the relation to local early-type galaxies

We study the dynamical properties of massive quiescent galaxies at $1.4 < z < 2.1$ using deep Hubble Space Telescope WFC3/F160W imaging and a combination of literature stellar velocity dispersion measurements and new near-infrared spectra obtained using KMOS on the ESO VLT. We use these data to show that the typical dynamical-to-stellar mass ratio has increased by $\sim$0.2 dex from $z = 2$ to the present day, and investigate this evolution in the context of possible changes in the stellar initial mass function (IMF) and/or fraction of dark matter contained within the galaxy effective radius, $f_\mathrm{DM}$. Comparing our high-redshift sample to their likely descendants at low-redshift, we find that $f_\mathrm{DM}$ has increased by a factor of more than 4 since $z \approx 1.8$, from $f_\mathrm{DM}$ = $6.6\pm1.0$% to $\sim$24%. The observed increase appears robust to changes in the methods used to estimate dynamical masses or match progenitors and descendants. We quantify possible variation of the stellar IMF through the offset parameter $\alpha$, defined as the ratio of dynamical mass in stars to the stellar mass estimated using a Chabrier IMF. We demonstrate that the correlation between stellar velocity dispersion and $\alpha$ reported among quiescent galaxies at low-redshift is already in place at $z = 2$, and argue that subsequent evolution through (mostly minor) merging should act to preserve this relation while contributing significantly to galaxies' overall growth in size and stellar mass.

preprint, 2016, arXiv

Ground Truth Bias in External Cluster Validity Indices

It has been noticed that some external cluster validity indices (CVIs) exhibit a preferential bias towards a larger or smaller number of clusters which is monotonic (directly or inversely) in the number of clusters in candidate partitions. This type of bias is caused by the functional form of the CVI model. For example, the popular Rand index (RI) exhibits a monotone increasing (NCinc) bias, while the Jaccard index (JI) suffers from a monotone decreasing (NCdec) bias. This type of bias has been previously recognized in the literature. In this work, we identify a new type of bias arising from the distribution of the ground truth (reference) partition against which candidate partitions are compared. We call this new type of bias ground truth (GT) bias. This type of bias occurs if a change in the reference partition causes a change in the bias status (e.g., NCinc, NCdec) of a CVI. For example, NCinc bias in the RI can be changed to NCdec bias by skewing the distribution of clusters in the ground truth partition. It is important for users to be aware of this new type of biased behaviour, since it may affect the interpretations of CVI results. The objective of this article is to study the empirical and theoretical implications of GT bias. To the best of our knowledge, this is the first extensive study of such a property for external cluster validity indices.
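Both indices named above are computed from counts of item pairs, which is where their dependence on the number of clusters comes from. A minimal pair-counting sketch follows; the function names are hypothetical, partitions are given as lists of cluster labels, and at least two items are assumed.

```python
from itertools import combinations

def pair_counts(truth, candidate):
    """Count item pairs by whether they share a cluster in each partition."""
    both = truth_only = cand_only = neither = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t = truth[i] == truth[j]
        same_c = candidate[i] == candidate[j]
        if same_t and same_c:
            both += 1
        elif same_t:
            truth_only += 1
        elif same_c:
            cand_only += 1
        else:
            neither += 1
    return both, truth_only, cand_only, neither

def rand_index(truth, candidate):
    """Share of pairs on which the two partitions agree."""
    a, b, c, d = pair_counts(truth, candidate)
    return (a + d) / (a + b + c + d)

def jaccard_index(truth, candidate):
    """Like the Rand index, but ignoring pairs separated in both partitions."""
    a, b, c, _ = pair_counts(truth, candidate)
    # Assumption: two all-singleton partitions are treated as a perfect match.
    return a / (a + b + c) if (a + b + c) else 1.0
```

The only difference between the two is the `neither` count: the Rand index rewards pairs that are separated in both partitions and the Jaccard index ignores them, and that count grows as a candidate partition uses more clusters.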

preprint, 2016, arXiv

The evolution of metallicity and metallicity gradients from z=2.7-0.6 with KMOS3D

We present measurements of the [NII]/Ha ratio as a probe of gas-phase oxygen abundance for a sample of 419 star-forming galaxies at z=0.6-2.7 from the KMOS3D near-IR multi-IFU survey. The mass-metallicity relation (MZR) is determined consistently with the same sample selection, metallicity tracer, and methodology over the wide redshift range probed by the survey. We find good agreement with long-slit surveys in the literature, except for the low-mass slope of the relation at z~2.3, where this sample is less biased than previous samples based on optical spectroscopic redshifts. In this regime we measure a steeper slope than some literature results. Excluding the AGN contribution from the MZR reduces sensitivity at the high mass end, but produces otherwise consistent results. There is no significant dependence of the [NII]/Ha ratio on SFR or environment at fixed redshift and stellar mass. The IFU data allow spatially resolved measurements of [NII]/Ha, from which we can infer abundance gradients for 180 galaxies, thus tripling the current sample in the literature. The observed gradients are on average flat, with only 15 gradients statistically offset from zero at >3sigma. We have modelled the effect of beam-smearing, assuming a smooth intrinsic radial gradient and known seeing, inclination and effective radius for each galaxy. Our seeing-limited observations can recover up to 70% of the intrinsic gradient for the largest, face-on disks, but only 30% for the smaller, more inclined galaxies. We do not find significant trends between observed or corrected gradients and any stellar population, dynamical or structural galaxy parameters, mostly in agreement with existing studies with much smaller sample sizes. In cosmological simulations, strong feedback is generally required to produce flat gradients at high redshift.

preprint, 2016, arXiv

Two-Locus Likelihoods under Variable Population Size and Fine-Scale Recombination Rate Estimation

Two-locus sampling probabilities have played a central role in devising an efficient composite likelihood method for estimating fine-scale recombination rates. Due to mathematical and computational challenges, these sampling probabilities are typically computed under the unrealistic assumption of a constant population size, and simulation studies have shown that resulting recombination rate estimates can be severely biased in certain cases of historical population size changes. To alleviate this problem, we develop here new methods to compute the sampling probability for variable population size functions that are piecewise constant. Our main theoretical result, implemented in a new software package called LDpop, is a novel formula for the sampling probability that can be evaluated by numerically exponentiating a large but sparse matrix. This formula can handle moderate sample sizes ($n \leq 50$) and demographic size histories with a large number of epochs ($\mathcal{D} \geq 64$). In addition, LDpop implements an approximate formula for the sampling probability that is reasonably accurate and scales to hundreds in sample size ($n \geq 256$). Finally, LDpop includes an importance sampler for the posterior distribution of two-locus genealogies, based on a new result for the optimal proposal distribution in the variable-size setting. Using our methods, we study how a sharp population bottleneck followed by rapid growth affects the correlation between partially linked sites. Then, through an extensive simulation study, we show that accounting for population size changes under such a demographic model leads to substantial improvements in fine-scale recombination rate estimation. LDpop is freely available for download at https://github.com/popgenmethods/ldpop

preprint, 2015, arXiv

First results from the VIRIAL survey: the stellar content of $UVJ$-selected quiescent galaxies at $1.5 < z < 2$ from KMOS

We investigate the stellar populations of 25 massive galaxies ($\log[M_\ast/M_\odot] \geq 10.9$) at $1.5 < z < 2$ using data obtained with the K-band Multi-Object Spectrograph (KMOS) on the ESO VLT. Targets were selected to be quiescent based on their broadband colors and redshifts using data from the 3D-HST grism survey. The mean redshift of our sample is $\bar{z} = 1.75$, where KMOS YJ-band data probe age- and metallicity-sensitive absorption features in the rest-frame optical, including the $G$ band, Fe I, and high-order Balmer lines. Fitting simple stellar population models to a stack of our KMOS spectra, we derive a mean age of $1.03^{+0.13}_{-0.08}$ Gyr. We confirm previous results suggesting a correlation between color and age for quiescent galaxies, finding mean ages of $1.22^{+0.56}_{-0.19}$ Gyr and $0.85^{+0.08}_{-0.05}$ Gyr for the reddest and bluest galaxies in our sample. Combining our KMOS measurements with those obtained from previous studies at $0.2 < z < 2$ we find evidence for a $2-3$ Gyr spread in the formation epoch of massive galaxies. At $z < 1$ the measured stellar ages are consistent with passive evolution, while at $1 < z \lesssim2$ they appear to saturate at $\sim$1 Gyr, which likely reflects changing demographics of the (mean) progenitor population. By comparing to star-formation histories inferred for "normal" star-forming galaxies, we show that the timescales required to form massive galaxies at $z \gtrsim 1.5$ are consistent with the enhanced $\alpha$-element abundances found in massive local early-type galaxies.

preprint, 2015, arXiv

MOOCs Meet Measurement Theory: A Topic-Modelling Approach

This paper adapts topic models to the psychometric testing of MOOC students based on their online forum postings. Measurement theory from education and psychology provides statistical models for quantifying a person's attainment of intangible attributes such as attitudes, abilities or intelligence. Such models infer latent skill levels by relating them to individuals' observed responses on a series of items such as quiz questions. The set of items can be used to measure a latent skill if individuals' responses on them conform to a Guttman scale. Such well-scaled items differentiate between individuals, and inferred levels span the entire range from the most basic to the most advanced. In practice, education researchers manually devise items (quiz questions) while optimising well-scaled conformance. Due to the costly nature and expert requirements of this process, psychometric testing has found limited use in everyday teaching. We aim to develop usable measurement models for highly-instrumented MOOC delivery platforms, by using participation in automatically-extracted online forum topics as items. The challenge is to formalise the Guttman scale educational constraint and incorporate it into topic models. To favour topics that automatically conform to a Guttman scale, we introduce a novel regularisation into non-negative matrix factorisation-based topic modelling. We demonstrate the suitability of our approach with both quantitative experiments on three Coursera MOOCs, and with a qualitative survey of topic interpretability on two MOOCs by domain expert interviews.
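The Guttman scale condition mentioned above can be made concrete with a check on a binary response matrix. This is a sketch of the textbook definition of a perfect Guttman scale, not the regularisation introduced in the paper; the function name and the person-by-item layout are hypothetical.

```python
def is_guttman_scale(responses):
    """Check whether binary responses form a perfect Guttman scale.

    responses: list of rows, one per person, each a list of 0/1 item responses.
    """
    n_items = len(responses[0])
    # Order items from most to least endorsed, i.e. easiest first.
    order = sorted(range(n_items), key=lambda j: -sum(row[j] for row in responses))
    for row in responses:
        ordered = [row[j] for j in order]
        # A conforming person passes every item up to some difficulty and none after.
        if any(a < b for a, b in zip(ordered, ordered[1:])):
            return False
    return True
```

For example, `is_guttman_scale([[1, 1, 1], [1, 1, 0], [1, 0, 0]])` holds because each person's passes form a prefix of the difficulty order. Real forum data would rarely conform exactly, which is presumably why the abstract frames conformance as something to favour through regularisation and not as a hard constraint.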

preprint, 2014, arXiv

A Consistent Study of Metallicity Evolution at 0.8 < z < 2.6

We present the correlations between stellar mass, star formation rate (SFR) and [NII]/Ha flux ratio as an indicator of gas-phase metallicity for a sample of 222 galaxies at 0.8 < z < 2.6 and log(M*/Msun)=9.0-11.5 from the LUCI, SINS/zC-SINF and KMOS3D surveys. This sample provides a unique analysis of the mass-metallicity relation (MZR) over an extended redshift range using consistent data analysis techniques and a strong-line metallicity indicator. We find a constant slope at the low-mass end of the relation and can fully describe its redshift evolution through the evolution of the characteristic turnover mass where the relation begins to flatten at the asymptotic metallicity. At fixed mass and redshift, our data do not show a correlation between the [NII]/Ha ratio and SFR, which disagrees with the 0.2-0.3 dex offset in [NII]/Ha predicted by the "fundamental relation" between stellar mass, SFR and metallicity discussed in recent literature. However, the overall evolution towards lower [NII]/Ha at earlier times does broadly agree with these predictions.

preprint, 2012, arXiv

A Time Decoupling Approach for Studying Forum Dynamics

Online forums are rich sources of information about user communication activity over time. Finding temporal patterns in online forum communication threads can advance our understanding of the dynamics of conversations. The main challenge of temporal analysis in this context is the complexity of forum data. There can be thousands of interacting users, who can be numerically described in many different ways. Moreover, user characteristics can evolve over time. We propose an approach that decouples temporal information about users into sequences of user events and inter-event times. We develop a new feature space to represent the event sequences as paths, and we model the distribution of the inter-event times. We study over 30,000 users across four Internet forums, and discover novel patterns in user communication. We find that users tend to exhibit consistency over time. Furthermore, in our feature space, we observe regions that represent unlikely user behaviors. Finally, we show how to derive a numerical representation for each forum, and we then use this representation to derive a novel clustering of multiple forums.
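The decoupling step described above, splitting a user's timestamped activity into an event sequence and a list of inter-event times, can be sketched as below. The event labels, the numeric timestamps and the function name are hypothetical; the paper's feature space and distribution model are not reproduced here.

```python
def decouple(events):
    """Split one user's activity into an event sequence and inter-event times.

    events: list of (timestamp, event_type) pairs in any order.
    """
    ordered = sorted(events, key=lambda event: event[0])
    # What happened, in order, with the timing removed.
    sequence = [kind for _, kind in ordered]
    # How long elapsed between consecutive events, with the event types removed.
    gaps = [later - earlier
            for (earlier, _), (later, _) in zip(ordered, ordered[1:])]
    return sequence, gaps
```

For instance, `decouple([(10, "reply"), (0, "post"), (4, "reply")])` yields the sequence `["post", "reply", "reply"]` and the gaps `[4, 6]`, which can then be analysed separately.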
