Source author record

Jinseok Kim

Jinseok Kim appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher: unclaimed source record

Digital Libraries, Information Retrieval, Social and Information Networks, Machine Learning, physics.soc-ph, Databases

Catalog footprint

What is connected

9 works

6 topics

4 close collaborators


Preprint, 2021, arXiv

A fast and integrative algorithm for clustering performance evaluation in author name disambiguation

Author name disambiguation results are often evaluated by measures such as Cluster-F, K-metric, Pairwise-F, Splitting & Lumping Error, and B-cubed. Although these measures have distinctive evaluation schemes, this paper shows that they can be calculated in a single framework by a set of common steps that compare truth and predicted clusters through two hash tables recording information about name instances with their predicted cluster indices and frequencies of those indices per truth cluster. This integrative calculation reduces greatly calculation runtime, which is scalable to a clustering task involving millions of name instances within a few seconds. During the integration process, B-cubed and K-metric are shown to produce the same precision and recall scores. In this framework, especially, name instance pairs for Pairwise-F are counted using a heuristic, surpassing a state-of-the-art algorithm in speedy calculation. Details of the integrative calculation are described with examples and pseudo-code to assist scholars to implement each measure easily and validate the correctness of implementation. The integrative calculation will help scholars compare similarities and differences of multiple measures before they select ones that characterize best the clustering performances of their disambiguation methods.
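The shared-table idea in this abstract can be sketched in Python. This is a minimal illustration, not the paper's algorithm: the function name `evaluate_clusters` and the input layout (two dicts keyed by name-instance id) are assumptions, only B-cubed and Pairwise-F are covered, and pairs are counted with the standard k(k-1)/2 formula rather than the heuristic the paper mentions.

```python
from collections import Counter, defaultdict


def evaluate_clusters(truth, pred):
    """Score predicted clusters against truth clusters.

    truth, pred: non-empty dicts mapping each name-instance id to a
    cluster label (hypothetical input layout for this sketch).
    Returns (precision, recall, F1) for B-cubed and for pairwise scoring.
    """
    # Table 1 is `pred` itself: name instance -> predicted cluster index.
    # Table 2: truth cluster -> frequency of each predicted cluster index.
    freq = defaultdict(Counter)
    for inst, t in truth.items():
        freq[t][pred[inst]] += 1

    truth_size = Counter(truth.values())
    pred_size = Counter(pred[inst] for inst in truth)
    n = len(truth)
    # Every non-empty overlap between a truth and a predicted cluster.
    overlaps = [(t, p, c) for t, row in freq.items() for p, c in row.items()]

    def f1(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0

    def pairs(k):
        return k * (k - 1) // 2

    # B-cubed: per-instance overlap ratios, summed via squared overlap counts.
    b3_p = sum(c * c / pred_size[p] for _, p, c in overlaps) / n
    b3_r = sum(c * c / truth_size[t] for t, _, c in overlaps) / n

    # Pairwise: count instance pairs from cluster and overlap sizes.
    tp = sum(pairs(c) for _, _, c in overlaps)
    pred_pairs = sum(pairs(s) for s in pred_size.values())
    truth_pairs = sum(pairs(s) for s in truth_size.values())
    # Assumed convention: a side with no pairs at all scores 1.0.
    pw_p = tp / pred_pairs if pred_pairs else 1.0
    pw_r = tp / truth_pairs if truth_pairs else 1.0

    return {
        "bcubed": (b3_p, b3_r, f1(b3_p, b3_r)),
        "pairwise": (pw_p, pw_r, f1(pw_p, pw_r)),
    }
```

Both measures read the same overlap counts, so one pass over the name instances is enough to feed each of them; this is the property the abstract credits for the runtime reduction.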

Preprint, 2021, arXiv

Effect of forename string on author name disambiguation

In author name disambiguation, author forenames are used to decide which name instances are disambiguated together and how much they are likely to refer to the same author. Despite such a crucial role of forenames, their effect on the performances of heuristic (string matching) and algorithmic disambiguation is not well understood. This study assesses the contributions of forenames in author name disambiguation using multiple labeled datasets under varying ratios and lengths of full forenames, reflecting real-world scenarios in which an author is represented by forename variants (synonym) and some authors share the same forenames (homonym). Results show that increasing the ratios of full forenames improves substantially the performances of both heuristic and machine-learning-based disambiguation. Performance gains by algorithmic disambiguation are pronounced when many forenames are initialized or homonym is prevalent. As the ratios of full forenames increase, however, they become marginal compared to the performances by string matching. Using a small portion of forename strings does not reduce much the performances of both heuristic and algorithmic disambiguation compared to using full-length strings. These findings provide practical suggestions such as restoring initialized forenames into a full-string format via record linkage for improved disambiguation performances.

Preprint, 2021, arXiv

Formational bounds of link prediction in collaboration networks

Link prediction in collaboration networks is often solved by identifying structural properties of existing nodes that are disconnected at one point in time, and that share a link later on. The maximally possible recall rate or upper bound of this approach's success is capped by the proportion of links that are formed among existing nodes embedded in these properties. Consequentially, sustained ties as well as links that involve one or two new network participants are typically not predicted. The purpose of this study is to highlight formational constraints that need to be considered to increase the practical value of link prediction methods for collaboration networks. In this study, we identify the distribution of basic link formation types based on four large-scale, over-time collaboration networks, showing that current link predictors can maximally anticipate around 25% of links that involve at least one prior network member. This implies that for collaboration networks, increasing the accuracy of computational link prediction solutions may not be a reasonable goal when the ratio of collaboration ties that are eligible to the classic link prediction process is low.
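The formation types described in this abstract can be tallied from two network snapshots. The sketch below is an assumption-laden illustration: the function name `link_formation_types`, the four category labels, and the choice of denominator (later links involving at least one prior member) are this example's own reading of the abstract, not the paper's procedure.

```python
def link_formation_types(prior_edges, later_edges):
    """Classify links in a later snapshot by how they relate to a prior one.

    prior_edges, later_edges: iterables of (u, v) node pairs, undirected.
    Returns (counts, share), where share is the fraction of later links
    with at least one prior member that classic link prediction could
    in principle anticipate (new links between two existing nodes).
    """
    prior = {frozenset(e) for e in prior_edges}
    prior_nodes = {node for edge in prior for node in edge}
    counts = {"sustained": 0, "old_old_new": 0, "old_new": 0, "new_new": 0}

    for edge in {frozenset(e) for e in later_edges}:
        if len(edge) != 2:
            continue  # ignore self-loops in this sketch
        known = sum(node in prior_nodes for node in edge)
        if edge in prior:
            counts["sustained"] += 1      # tie already existed
        elif known == 2:
            counts["old_old_new"] += 1    # the only predictable type
        elif known == 1:
            counts["old_new"] += 1        # one newcomer
        else:
            counts["new_new"] += 1        # two newcomers

    with_prior_member = sum(counts.values()) - counts["new_new"]
    share = counts["old_old_new"] / with_prior_member if with_prior_member else 0.0
    return counts, share
```

With prior links a-b and b-c and later links a-b, a-c, a-d and d-e, each category receives one link and the predictable share is one third.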

Preprint, 2021, arXiv

Generating automatically labeled data for author name disambiguation: An iterative clustering method

To train algorithms for supervised author name disambiguation, many studies have relied on hand-labeled truth data that are very laborious to generate. This paper shows that labeled training data can be automatically generated using information features such as email address, coauthor names, and cited references that are available from publication records. For this purpose, high-precision rules for matching name instances on each feature are decided using an external-authority database. Then, selected name instances in target ambiguous data go through the process of pairwise matching based on the rules. Next, they are merged into clusters by a generic entity resolution algorithm. The clustering procedure is repeated over other features until further merging is impossible. Tested on 26,566 instances out of the population of 228K author name instances, this iterative clustering produced accurately labeled data with pairwise F1 = 0.99. The labeled data represented the population data in terms of name ethnicity and co-disambiguating name group size distributions. In addition, trained on the labeled data, machine learning algorithms disambiguated 24K names in test data with performance of pairwise F1 = 0.90 ~ 0.92. Several challenges are discussed for applying this method to resolving author name ambiguity in large-scale scholarly data.
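The merge-and-repeat loop in this abstract can be sketched as follows. Everything specific here is hypothetical: the function name `iterative_clustering`, the record layout, and the matching rule (two clusters merge when they share at least a minimum number of values on a feature). The paper derives its high-precision rules from an external-authority database, which this sketch does not attempt.

```python
def iterative_clustering(records, rules):
    """Merge name instances feature by feature until nothing changes.

    records: {instance_id: {feature: set_of_values}}
    rules:   {feature: minimum number of shared values needed to merge}
    Returns a list of sets of instance ids.
    """
    # Start with one cluster per name instance; clusters pool their values.
    clusters = [
        {"members": {rid}, "values": {f: set(vals.get(f, ())) for f in rules}}
        for rid, vals in records.items()
    ]
    changed = True
    while changed:  # repeat over all features until no further merge
        changed = False
        for feature, min_shared in rules.items():
            i = 0
            while i < len(clusters):
                j = i + 1
                while j < len(clusters):
                    a, b = clusters[i], clusters[j]
                    shared = a["values"][feature] & b["values"][feature]
                    if len(shared) >= min_shared:
                        # Absorb cluster b into cluster a.
                        a["members"] |= b["members"]
                        for f in rules:
                            a["values"][f] |= b["values"][f]
                        del clusters[j]
                        changed = True
                    else:
                        j += 1
                i += 1
    return [c["members"] for c in clusters]
```

Pooling values at the cluster level is what makes repetition useful: an instance that shares only one coauthor with each of two records can still join them once an email match has merged those two records and combined their coauthor lists. The pairwise scan is quadratic and suited only to small illustrative inputs.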

Preprint, 2021, arXiv

ORCID-linked labeled data for evaluating author name disambiguation at scale

How can we evaluate the performance of a disambiguation method implemented on big bibliographic data? This study suggests that the open researcher profile system, ORCID, can be used as an authority source to label name instances at scale. This study demonstrates the potential by evaluating the disambiguation performances of Author-ity2009 (which algorithmically disambiguates author names in MEDLINE) using 3 million name instances that are automatically labeled through linkage to 5 million ORCID researcher profiles. Results show that although ORCID-linked labeled data do not effectively represent the population of name instances in Author-ity2009, they do effectively capture the 'high precision over high recall' performances of Author-ity2009. In addition, ORCID-linked labeled data can provide nuanced details about Author-ity2009's performance when name instances are evaluated within and across ethnicity categories. As ORCID continues to be expanded to include more researchers, labeled data via ORCID-linkage can be improved in representing the population of a whole disambiguated data and updated on a regular basis. This can benefit author name disambiguation researchers and practitioners who need large-scale labeled data but lack resources for manual labeling or access to other authority sources for linkage-based labeling. The ORCID-linked labeled data for Author-ity2009 are publicly available for validation and reuse.

Preprint, 2021, arXiv

Over-time measurement of triadic closure in coauthorship networks

Applying the concept of triadic closure to coauthorship networks means that scholars are likely to publish a joint paper if they have previously coauthored with the same people. Prior research has identified moderate to high (20% to 40%) closure rates, suggesting that this mechanism is a reasonable explanation for tie formation between future coauthors. We show how calculating triadic closure based on prior operationalizations of closure, namely Newman's measure for one-mode networks (NCC) and Opsahl's measure for two-mode networks (OCC), may lead to higher amounts of closure as compared to measuring closure over time via a metric that we introduce and test in this paper. Based on empirical experiments using four large-scale, longitudinal datasets, we find a lower bound of about 1~3% closure rates and an upper bound of about 4~7%. These results motivate research on new explanatory factors for the formation of coauthorship links.
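One way to measure closure across time, rather than within a single snapshot, is sketched below. The definition used here is an assumption made for illustration and is not confirmed to be the metric introduced in the paper: it takes every unlinked pair that shares at least one coauthor in the prior period and reports the fraction that become linked in the later period. The function name `over_time_closure` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def over_time_closure(prior_edges, later_edges):
    """Fraction of open pairs in the prior period that close later.

    An open pair is two nodes with a common neighbor but no direct
    link in the prior period (assumed definition for this sketch).
    """
    adj = defaultdict(set)
    for u, v in prior_edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    # Collect unlinked pairs that share at least one prior coauthor.
    open_pairs = set()
    for neighbors in adj.values():
        for a, b in combinations(neighbors, 2):
            if b not in adj[a]:
                open_pairs.add(frozenset((a, b)))

    later = {frozenset(e) for e in later_edges}
    closed = open_pairs & later
    return len(closed) / len(open_pairs) if open_pairs else 0.0
```

Because the denominator counts only pairs that were open beforehand, ties that existed in both periods do not contribute to the rate.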

Preprint, 2020, arXiv

BinaryDuo: Reducing Gradient Mismatch in Binary Activation Network by Coupling Binary Activations

Binary Neural Networks (BNNs) have been garnering interest thanks to their compute cost reduction and memory savings. However, BNNs suffer from performance degradation mainly due to the gradient mismatch caused by binarizing activations. Previous works tried to address the gradient mismatch problem by reducing the discrepancy between activation functions used at forward pass and its differentiable approximation used at backward pass, which is an indirect measure. In this work, we use the gradient of smoothed loss function to better estimate the gradient mismatch in quantized neural network. Analysis using the gradient mismatch estimator indicates that using higher precision for activation is more effective than modifying the differentiable approximation of activation function. Based on the observation, we propose a new training scheme for binary activation networks called BinaryDuo in which two binary activations are coupled into a ternary activation during training. Experimental results show that BinaryDuo outperforms state-of-the-art BNNs on various benchmarks with the same amount of parameters and computing cost.

Preprint, 2015, arXiv

Coauthorship networks: A directed network approach considering the order and number of coauthors

In many scientific fields, the order of coauthors on a paper conveys information about each individual's contribution to a piece of joint work. We argue that in prior network analyses of coauthorship networks, the information on ordering has been insufficiently considered because ties between authors are typically symmetrized. This is basically the same as assuming that each co-author has contributed equally to a paper. We introduce a solution to this problem by adopting a coauthorship credit allocation model proposed by Kim and Diesner (2014), which in its core conceptualizes co-authoring as a directed, weighted, and self-looped network. We test and validate our application of the adopted framework based on a sample data of 861 authors who have published in the journal Psychometrika. Results suggest that this novel sociometric approach can complement traditional measures based on undirected networks and expand insights into coauthoring patterns such as the hierarchy of collaboration among scholars. As another form of validation, we also show how our approach accurately detects prominent scholars in the Psychometric Society affiliated with the journal.

Preprint, 2015, arXiv

Distortive Effects of Initial-Based Name Disambiguation on Measurements of Large-Scale Coauthorship Networks

Scholars have often relied on name initials to resolve name ambiguities in large-scale coauthorship network research. This approach bears the risk of incorrectly merging or splitting author identities. The use of initial-based disambiguation has been justified by the assumption that such errors would not affect research findings too much. This paper tests this assumption by analyzing coauthorship networks from five academic fields - biology, computer science, nanoscience, neuroscience, and physics - and an interdisciplinary journal, PNAS. Name instances in datasets of this study were disambiguated based on heuristics gained from previous algorithmic disambiguation solutions. We use disambiguated data as a proxy of ground-truth to test the performance of three types of initial-based disambiguation. Our results show that initial-based disambiguation can misrepresent statistical properties of coauthorship networks: it deflates the number of unique authors, number of components, average shortest paths, clustering coefficient, and assortativity, while it inflates average productivity, density, average coauthor number per author, and largest component size. Also, on average, more than half of top 10 productive or collaborative authors drop off the lists. Asian names were found to account for the majority of misidentification by initial-based disambiguation due to their common surname and given name initials.
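The merging risk of initial-based disambiguation can be shown with a short Python sketch. The two key schemes below (surname plus first initial, surname plus all initials) are common conventions assumed for illustration; the abstract does not list the three schemes the paper tests. The helper names and the sample names are hypothetical.

```python
def initial_key(surname, forename, mode="first"):
    """Build a disambiguation key from a surname and forename initials.

    mode="first" keeps only the first initial; mode="all" keeps every
    initial. Hyphenated forenames are treated as separate parts.
    """
    parts = forename.replace("-", " ").split()
    initials = "".join(p[0] for p in parts).upper()
    if mode == "first":
        initials = initials[:1]
    return f"{surname.lower()}, {initials}"


def count_identities(names, mode):
    """Number of distinct author identities a key scheme produces."""
    return len({initial_key(s, f, mode) for s, f in names})


# Three distinct (hypothetical) authors who share a surname.
names = [("Kim", "Minho"), ("Kim", "Mirae"), ("Kim", "Min-Soo")]
first_initial_count = count_identities(names, "first")
all_initials_count = count_identities(names, "all")
```

Both schemes return fewer identities than the three actual people in this sample, which corresponds to the deflated unique-author count the abstract reports.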
