Researcher profile

Lin Shi

Lin Shi contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
6works
0followers
6topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

6 published item(s)

preprint2026arXiv

Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65\% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/ .

preprint2022arXiv

Automatic Comment Generation via Multi-Pass Deliberation

Deliberation is a common and natural behavior in human daily life. For example, when writing papers or articles, we usually first write drafts, and then iteratively polish them until satisfied. In light of such a human cognitive process, we propose DECOM, which is a multi-pass deliberation framework for automatic comment generation. DECOM consists of multiple Deliberation Models and one Evaluation Model. Given a code snippet, we first extract keywords from the code and retrieve a similar code fragment from a pre-defined corpus. Then, we treat the comment of the retrieved code as the initial draft and input it with the code and keywords into DECOM to start the iterative deliberation process. At each deliberation, the deliberation model polishes the draft and generates a new comment. The evaluation model measures the quality of the newly generated comment to determine whether to end the iterative process or not. When the iterative process is terminated, the best-generated comment will be selected as the target comment. Our approach is evaluated on two real-world datasets in Java (87K) and Python (108K), and experiment results show that our approach outperforms the state-of-the-art baselines. A human evaluation study also confirms the comments generated by DECOM tend to be more readable, informative, and useful.

preprint2022arXiv

BugListener: Identifying and Synthesizing Bug Reports from Collaborative Live Chats

In community-based software development, developers frequently rely on live-chatting to discuss emergent bugs/errors they encounter in daily development tasks. However, it remains a challenging task to accurately record such knowledge due to the noisy nature of interleaved dialogs in live chat data. In this paper, we first formulate the task of identifying and synthesizing bug reports from community live chats, and propose a novel approach, named BugListener, to address the challenges. Specifically, BugListener automates three sub-tasks: 1) Disentangle the dialogs from massive chat logs by using a Feed-Forward neural network; 2) Identify the bug-report dialogs from separated dialogs by modeling the original dialog to the graph-structured dialog and leveraging the graph neural network to learn the contextual information; 3) Synthesize the bug reports by utilizing the TextCNN model and Transfer Learning network to classify the sentences into three groups: observed behaviors (OB), expected behaviors (EB), and steps to reproduce the bug (SR). BugListener is evaluated on six open source projects. The results show that: for bug report identification, BugListener achieves the average F1 of 74.21%, improving the best baseline by 10.37%; and for bug report synthesis task, BugListener could classify the OB, EB, and SR sentences with the F1 of 67.37%, 87.14%, and 65.03%, improving the best baselines by 7.21%, 7.38%, 5.30%, respectively. A human evaluation also confirms the effectiveness of BugListener in generating relevant and accurate bug reports. These demonstrate the significant potential of applying BugListener in community-based software development, for promoting bug discovery and quality improvement.

preprint2022arXiv

Where is Your App Frustrating Users?

User reviews of mobile apps provide a communication channel for developers to perceive user satisfaction. Many app features that users have problems with are usually expressed by key phrases such as "upload pictures", which could be buried in the review texts. The lack of fine-grained view about problematic features could obscure the developers' understanding of where the app is frustrating users, and postpone the improvement of the apps. Existing pattern-based approaches to extract target phrases suffer from low accuracy due to insufficient semantic understanding of the reviews, thus can only summarize the high-level topics/aspects of the reviews. This paper proposes a semantic-aware, fine-grained app review analysis approach (SIRA) to extract, cluster, and visualize the problematic features of apps. The main component of SIRA is a novel BERT+Attr-CRF model for fine-grained problematic feature extraction, which combines textual descriptions and review attributes to better model the semantics of reviews and boost the performance of the traditional BERT-CRF model. SIRA also clusters the extracted phrases based on their semantic relations and presents a visualization of the summaries. Our evaluation on 3,426 reviews from six apps confirms the effectiveness of SIRA in problematic feature extraction and clustering. We further conduct an empirical study with SIRA on 318,534 reviews of 18 popular apps to explore its potential application and examine its usefulness in real-world practice.

preprint2020arXiv

QC-SPHRAM: Quasi-conformal Spherical Harmonics Based Geometric Distortions on Hippocampal Surfaces for Early Detection of the Alzheimer's Disease

We propose a disease classification model, called the QC-SPHARM, for the early detection of the Alzheimer's Disease (AD). The proposed QC-SPHARM can distinguish between normal control (NC) subjects and AD patients, as well as between amnestic mild cognitive impairment (aMCI) patients having high possibility progressing into AD and those who do not. Using the spherical harmonics (SPHARM) based registration, hippocampal surfaces segmented from the ADNI data are individually registered to a template surface constructed from the NC subjects using SPHARM. Local geometric distortions of the deformation from the template surface to each subject are quantified in terms of conformality distortions and curvatures distortions. The measurements are combined with the spherical harmonics coefficients and the total volume change of the subject from the template. Afterwards, a t-test based feature selection method incorporating the bagging strategy is applied to extract those local regions having high discriminating power of the two classes. The disease diagnosis machine can therefore be built using the data under the Support Vector Machine (SVM) setting. Using 110 NC subjects and 110 AD patients from the ADNI database, the proposed algorithm achieves 85:2% testing accuracy on 80 random samples as testing subjects, with the incorporation of surface geometry in the classification machine. Using 20 aMCI patients who has advanced to AD during a two-year period and another 20 aMCI patients who remain non-AD for the next two years, the algorithm achieves 81:2% accuracy using 10 randomly picked subjects as testing data. Our proposed method is 6%-15% better than other classification models without the incorporation of surface geometry. The results demonstrate the advantages of using local geometric distortions as the discriminating criterion for early AD diagnosis.

preprint2019arXiv

Anharmonic corrections to the multiphonon deep-level charge capture ab initio calculations for semiconductors

Nonradiative carrier recombination at semiconductor deep centers is of great importance to both fundamental physics and device engineering. In this letter, we provide a revised analysis of K. Huang's original nonradiative multi-phonon (NMP) theory with ab initio calculations. First, we identify at first-principle level that Huang's concise formula gives the same results as the matrix based formula, and Huang's high temperature formula provides an analytical expression for the coupling constant in Marcus theory. Secondly, the anharmonic effects are corrected by taking into account local phonon mode variation at different charge states of the defect. The corrected capture rates for defects in GaN and SiC agree well with experiments.