Researcher profile

Yi Bu

Yi Bu contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
12 works
0 followers
10 topics
4 close collaborators

Published work

12 published items

preprint 2026 arXiv

Automated Visualization Code Synthesis via Multi-Path Reasoning and Feedback-Driven Optimization

Large Language Models (LLMs) have become a cornerstone for automated visualization code generation, enabling users to create charts through natural language instructions. Despite improvements from techniques like few-shot prompting and query expansion, existing methods often struggle when requests are underspecified in actionable details (e.g., data preprocessing assumptions, solver or library choices), frequently necessitating manual intervention. To overcome these limitations, we propose VisPath: a Multi-Path Reasoning and Feedback-Driven Optimization Framework for Visualization Code Generation. VisPath handles underspecified queries through structured, multi-stage processing. It begins by using Chain-of-Thought (CoT) prompting to reformulate the initial user input, generating multiple extended queries in parallel to surface alternative plausible concretizations of the request. These queries are then used to generate candidate visualization scripts, which are executed to produce diverse images. By assessing the visual quality and correctness of each output, VisPath generates targeted feedback that is aggregated to synthesize an optimal final result. Extensive experiments on MatPlotBench and Qwen-Agent Code Interpreter Benchmark show that VisPath outperforms state-of-the-art methods, providing a more reliable framework for AI-driven visualization generation.

preprint 2026 arXiv

Scilit with the Integrated Impact Indicator Assessment

In this study, we systematically elucidate the background and functionality of the Scilit database and evaluate the feasibility and advantages of the comprehensive impact metrics I3 and I3/N, introduced within the Scilit framework. Using a matched dataset of 17,816 journals, we conduct a comparative analysis of Scilit I3/N, Journal Impact Factor, and CiteScore for 2023 and 2024, covering descriptive statistics and distributional characteristics from both disciplinary and publisher perspectives. The comparison reveals that the Scilit I3 and I3/N framework significantly outperforms traditional mean-based metrics in terms of coverage, methodological robustness, and disciplinary fairness. It provides a more accurate, diagnosable, and responsible solution for interdisciplinary journal impact assessment. Our research serves as a "getting started guide" for Scilit, offering scholars, librarians, and academic publishers in the fields of bibliometrics or scientometrics a valuable perspective for exploring I3 and I3/N within an inclusive database. This enables a more accurate and comprehensive understanding of disciplinary development and scientific progress. We advocate for piloting and validating this method in broader evaluation contexts to foster a more precise and diverse representation of scientific progress.

preprint 2026 arXiv

SPIO: Ensemble and Selective Strategies via LLM-Based Multi-Agent Planning in Automated Data Science

Large Language Models (LLMs) have enabled dynamic reasoning in automated data analytics, yet recent multi-agent systems remain limited by rigid, single-path workflows that restrict strategic exploration and often lead to suboptimal outcomes. To overcome these limitations, we propose SPIO (Sequential Plan Integration and Optimization), a framework that replaces rigid workflows with adaptive, multi-path planning across four core modules: data preprocessing, feature engineering, model selection, and hyperparameter tuning. In each module, specialized agents generate diverse candidate strategies, which are cascaded and refined by an optimization agent. SPIO offers two operating modes: SPIO-S for selecting a single optimal pipeline, and SPIO-E for ensembling top-k pipelines to maximize robustness. Extensive evaluations on Kaggle and OpenML benchmarks show that SPIO consistently outperforms state-of-the-art baselines, achieving an average performance gain of 5.6%. By explicitly exploring and integrating multiple solution paths, SPIO delivers a more flexible, accurate, and reliable foundation for automated data science.
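The difference between the two operating modes can be illustrated with a minimal sketch. It is not the SPIO implementation: in the abstract an LLM-based optimization agent does the selection, whereas this sketch assumes each candidate pipeline already carries a validation score and a list of predictions, and it blends the top-k pipelines by simple averaging. The names select_best, ensemble_top_k, score, and preds are hypothetical.

```python
def select_best(candidates):
    """SPIO-S style (assumed): keep the single highest-scoring pipeline."""
    return max(candidates, key=lambda c: c["score"])


def ensemble_top_k(candidates, k):
    """SPIO-E style (assumed): average the predictions of the top-k pipelines."""
    top = sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]
    n = len(top[0]["preds"])
    return [sum(c["preds"][i] for c in top) / len(top) for i in range(n)]


# Hypothetical candidate pipelines with made-up scores and predictions.
candidates = [
    {"name": "p1", "score": 0.81, "preds": [0.2, 0.9]},
    {"name": "p2", "score": 0.85, "preds": [0.4, 0.7]},
    {"name": "p3", "score": 0.60, "preds": [0.9, 0.1]},
]

best = select_best(candidates)           # the pipeline named "p2"
blended = ensemble_top_k(candidates, 2)  # mean of p2 and p1 predictions
```

Selecting one pipeline is cheaper to run, while averaging several trades some compute for robustness to a single bad choice.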

preprint 2022 arXiv

Exploring the Distribution Regularities of User Attention and Sentiment toward Product Aspects in Online Reviews

[Purpose] To better understand online reviews and to help potential consumers, businesspeople, and product manufacturers efficiently obtain users' evaluations of product aspects, this paper explores the distribution regularities of user attention and sentiment toward product aspects from the temporal perspective of online reviews. [Design/methodology/approach] Using more than 340k smartphone reviews of three products from JD.COM (a well-known online shopping platform in China), we employ the temporal characteristics of online reviews (purchase time, review time, and the interval between the two), clustering of similar attributes, and attribute-level sentiment computation. [Findings] The empirical results show that a power-law distribution fits user attention to product aspects, and that reviews posted after short time intervals contain more product aspects. In addition, user sentiment values for product aspects are significantly higher or lower in short time intervals, which helps in judging the advantages and weaknesses of a product. [Research limitations] Because shopping platforms restrict review crawling, we could not acquire online reviews with temporal characteristics for more products to verify the findings. [Originality/value] This work reveals the distribution regularities of user attention and sentiment toward product aspects, which is of great significance in assisting decision-making, optimizing review presentation, and improving the shopping experience.

preprint 2022 arXiv

Team formation and team performance: The balance between team freshness and repeat collaboration

Incorporating fresh members in teams is considered a pathway to team creativity. However, it remains unclear whether freshness improves team performance and what level of involvement of fresh members is optimal. This study uses the group of authors on the byline of a publication as a proxy for a scientific team. We extend an indicator, team freshness, to measure the extent to which a scientific team incorporates new members, by calculating the fraction of new collaboration relations established within the team. Based on more than 43 million scientific publications from Microsoft Academic Graph, covering more than half a century of research, this study provides a holistic picture of the development of team freshness by outlining its temporal evolution and disciplinary distribution. Subsequently, using a multivariable regression approach, we examine the association between team freshness and papers' short-term and long-term citations. The major findings are as follows: (1) team freshness in scientific teams has been increasing over the past half-century; (2) there is an inverted-U-shaped association between team freshness and papers' citations in all disciplines and in different periods; (3) the inverted-U-shaped relationship between team freshness and papers' citations is found only in small teams, while in large teams, team freshness is significantly positively related to papers' citations.
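The indicator described in the abstract, the fraction of new collaboration relations within a team, can be sketched as follows. This is an illustrative reading of the abstract, not the paper's exact operationalization: it assumes a "collaboration relation" is an unordered pair of co-authors and that a pair is "new" if it does not appear in a set of earlier co-authorships. The function name team_freshness and the value returned for single-author papers are assumptions.

```python
from itertools import combinations


def team_freshness(team, prior_pairs):
    """Fraction of co-author pairs in `team` that have no prior collaboration.

    `prior_pairs` is a set of frozensets, each holding two author ids that
    co-authored at least one earlier paper (assumed input format).
    """
    pairs = [frozenset(p) for p in combinations(sorted(set(team)), 2)]
    if not pairs:
        return 0.0  # single-author paper: no relations to count (assumed convention)
    new = sum(1 for p in pairs if p not in prior_pairs)
    return new / len(pairs)


# Hypothetical example: A-B and B-C have collaborated before; D is a newcomer.
prior = {frozenset({"A", "B"}), frozenset({"B", "C"})}
freshness = team_freshness(["A", "B", "C", "D"], prior)  # 4 of 6 pairs are new
```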

preprint 2022 arXiv

Team Power Dynamics and Team Impact: New Perspectives on Scientific Collaboration using Career Age as a Proxy for Team Power

Power dynamics influence every aspect of scientific collaboration. Team power dynamics can be measured by team power level and team power hierarchy. Team power level is conceptualized as the average level of a team's possession of resources, expertise, or decision-making authority. Team power hierarchy represents the vertical differences in the possession of resources within a team. In the Science of Science, few studies have looked at scientific collaboration from the perspective of team power dynamics. To fill this gap, this research examines how team power dynamics affect team impact. All co-authors of one publication are treated as one team. The team power level and team power hierarchy of a team are measured by the mean and the Gini index of the career ages of its co-authors. Team impact is quantified by the citations of the paper authored by the team. By analyzing over 7.7 million teams from Science (e.g., Computer Science, Physics), Social Sciences (e.g., Sociology, Library & Information Science), and Arts & Humanities (e.g., Art), we find that a flat team structure is associated with higher team impact, especially when teams have a high team power level. These findings are replicated in all five disciplines except Art, and are consistent across various types of teams from Computer Science, including teams from industry or academia, teams with different gender groups, teams with geographical contrast, and teams of different sizes.
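The two measures named in the abstract, the mean and the Gini index of co-authors' career ages, can be sketched as below. The Gini formula used here is the standard one over sorted values; the abstract does not say which estimator the authors used, so treat this as an assumption. The function names and the example career ages are hypothetical.

```python
def gini(values):
    """Gini index of non-negative values: 0 means perfectly equal."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with i = 1..n over sorted x
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def team_power(career_ages):
    """Return (power level, power hierarchy) as (mean, Gini) of career ages."""
    level = sum(career_ages) / len(career_ages)
    hierarchy = gini(career_ages)
    return level, hierarchy


# Hypothetical team: two junior researchers and one senior researcher.
level, hierarchy = team_power([2, 3, 25])
```

A team whose members all have the same career age gets a hierarchy of 0, which corresponds to the "flat" structure the abstract refers to.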

preprint 2022 arXiv

The Gene of Scientific Success

This paper describes how to identify and evaluate the causal factors that improve scientific impact. Analyzing scientific impact can benefit various academic activities, including funding applications, mentor recommendation, and the discovery of potential collaborators. It is widely acknowledged that high-impact scholars often have more opportunities to receive awards in recognition of their hard work. Therefore, scholars devote great effort to making scientific achievements and improving their scientific impact throughout their academic careers. However, what are the determining factors that control scholars' academic success? The answer to this question can help scholars conduct their research more efficiently. With this in mind, our paper presents and analyzes the causal factors that are crucial for scholars' academic success. We first propose five major groups of factors: article-centered, author-centered, venue-centered, institution-centered, and temporal factors. Then, we apply recent machine learning algorithms and the jackknife method to assess the importance of each causal factor. Our empirical results show that author-centered and article-centered factors are the most relevant to scholars' future success in the computer science area. Additionally, we discover an interesting phenomenon: the h-indices of scholars within the same institution or university are very close to each other.

preprint 2020 arXiv

An empirical review of the different variants of the Probabilistic Affinity Index as applied to scientific collaboration

Responsible indicators are crucial for research assessment and monitoring. Transparency and accuracy of indicators are required to make research assessment fair and ensure reproducibility. However, it is sometimes difficult to conduct or replicate indicator-based studies because of a lack of transparency in conceptualization and operationalization. In this paper, we review the different variants of the Probabilistic Affinity Index (PAI), considering both the conceptual and empirical underpinnings. We begin with a review of the historical development of the indicator and the different alternatives proposed. To illustrate the utility of the indicator, we apply PAI to identifying preferred partners in scientific collaboration. A streamlined procedure is provided to demonstrate the variations and appropriate calculations. We then compare the results of implementation for five specific countries involved in international scientific collaboration. Despite the different proposals on its calculation, we do not observe large differences between the PAI variants, particularly with respect to country size. As with any indicator, the selection of a particular variant is dependent on the research question. To facilitate appropriate use, we provide recommendations for the use of the indicator given specific contexts.

preprint 2020 arXiv

Building a PubMed knowledge graph

PubMed is an essential resource for the medical domain, but useful concepts are either difficult to extract or ambiguous, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge graph (PKG) by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting affiliation history and educational background of authors from ORCID, and identifying fine-grained affiliation data from MapAffil. By integrating these credible multi-source data, we could create connections among the bio-entities, authors, articles, affiliations, and funding. Data validation revealed that the BioBERT deep learning method of bio-entity extraction significantly outperformed the state-of-the-art models based on the F1 score (by 0.51%), with the author name disambiguation (AND) achieving an F1 score of 98.09%. PKG can trigger broader innovations, not only enabling us to measure scholarly impact, knowledge usage, and knowledge transfer, but also assisting us in profiling authors and organizations based on their connections with bio-entities. The PKG is freely available on Figshare (https://figshare.com/s/6327a55355fc2c99f3a2, simplified version that excludes PubMed raw data) and the TACC website (http://er.tacc.utexas.edu/datasets/ped, full version).

preprint 2020 arXiv

Citation Cascade and the Evolution of Topic Relevance

Citation analysis, as a tool for quantitative studies of science, has long emphasized direct citation relations, leaving indirect or high order citations overlooked. However, a series of early and recent studies demonstrate the existence of indirect and continuous citation impact across generations. Adding to the literature on high order citations, we introduce the concept of a citation cascade: the constitution of a series of subsequent citing events initiated by a certain publication. We investigate this citation structure by analyzing more than 450,000 articles and over 6 million citation relations. We show that citation impact exists not only within the three generations documented in prior research, but also in much further generations. Still, our experimental results indicate that two to four generations are generally adequate to trace a work's scientific impact. We also explore specific structural properties such as depth, width, structural virality, and size, which account for differences among individual citation cascades. Finally, we find evidence that it is more important for a scientific work to inspire trans domain (or indirectly related domain) works than to receive only intra domain recognition in order to achieve high impact. Our methods and findings can serve as a new tool for scientific evaluation and the modeling of scientific history.
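The cascade structure and three of the properties named in the abstract (size, depth, width) can be sketched with a breadth-first traversal. The definitions used here are assumptions, since the abstract does not state them: a paper's generation is its shortest citation distance from the root, size is the number of papers reached (root excluded), depth is the largest generation, and width is the largest number of papers in any single generation. Structural virality is left out. The function name cascade_stats and the toy data are hypothetical.

```python
from collections import Counter, deque


def cascade_stats(citing, root):
    """Size, depth, and width of the citation cascade started by `root`.

    `citing` maps a paper id to the list of papers that cite it directly.
    """
    generation = {root: 0}
    queue = deque([root])
    while queue:
        paper = queue.popleft()
        for child in citing.get(paper, []):
            if child not in generation:  # keep the shortest distance only
                generation[child] = generation[paper] + 1
                queue.append(child)
    per_generation = Counter(g for g in generation.values() if g > 0)
    return {
        "size": len(generation) - 1,
        "depth": max(per_generation, default=0),
        "width": max(per_generation.values(), default=0),
    }


# Toy cascade: A and B cite P; C cites A and B; D cites B; E cites C.
citing = {"P": ["A", "B"], "A": ["C"], "B": ["C", "D"], "C": ["E"]}
stats = cascade_stats(citing, "P")
```

In this toy example the third generation contains only E, so stopping at two generations would already capture four of the five papers in the cascade.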

preprint 2020 arXiv

Coronavirus Knowledge Graph: A Case Study

The emergence of the novel COVID-19 pandemic has had a significant impact on global healthcare and the economy over the past few months. The virus's rapid spread has led to a proliferation of biomedical research addressing the pandemic and its related topics. Knowledge Graphs are among the essential Knowledge Discovery tools that could help the biomedical research community understand and eventually find a cure for COVID-19. The CORD-19 dataset is a collection of publicly available full-text research articles that have been recently published on COVID-19 and coronavirus topics. Here, we use several Machine Learning, Deep Learning, and Knowledge Graph construction and mining techniques to formalize and extract insights from the PubMed dataset and the CORD-19 dataset to identify COVID-19 related experts and bio-entities. In addition, we suggest possible techniques to predict related diseases, drug candidates, genes, gene mutations, and related compounds as part of a systematic effort to apply Knowledge Discovery methods to help biomedical researchers tackle the pandemic.

preprint 2020 arXiv

The Pace of Artificial Intelligence Innovations: Speed, Talent, and Trial-and-Error

Innovations in artificial intelligence (AI) are occurring at speeds faster than ever witnessed before. However, few studies have managed to measure or depict this increasing velocity of innovations in the field of AI. In this paper, we combine data on AI from arXiv and Semantic Scholar to explore the pace of AI innovations from three perspectives: AI publications, AI players, and AI updates (trial and error). A research framework and three novel indicators, Average Time Interval (ATI), Innovation Speed (IS) and Update Speed (US), are proposed to measure the pace of innovations in the field of AI. The results show that: (1) In 2019, more than 3 AI preprints were submitted to arXiv per hour, over 148 times faster than in 1994. Furthermore, there was one deep learning-related preprint submitted to arXiv every 0.87 hours in 2019, over 1,064 times faster than in 1994. (2) For AI players, 5.26 new researchers entered the field of AI each hour in 2019, more than 175 times faster than in the 1990s. (3) As for AI updates (trial and error), one updated AI preprint was submitted to arXiv every 41 days, with around 33% of AI preprints having been updated at least twice in 2019. In addition, as reported in 2019, it took, on average, only around 0.2 years for AI preprints to receive their first citations, which is 5 times faster than in 2000-2007. This swift pace illustrates the increasing popularity of AI innovation. The systematic and fine-grained analysis of the AI field made it possible to portray the pace of AI innovation and demonstrated that the proposed approach can be adopted to understand other fast-growing fields such as cancer research and nanoscience.