Source author record

Vincent Larivière

Vincent Larivière appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Digital Libraries, Computation and Language, Information Retrieval, physics.soc-ph, Social and Information Networks, Artificial Intelligence, cs.CY, Machine Learning

Catalog footprint

What is connected

16 works

8 topics

4 close collaborators

preprint, 2026, arXiv

Sorting the Babble in Babel: Assessing the Performance of Language Identification Algorithms on the OpenAlex Database

This project aims to optimize the linguistic indexing of the OpenAlex database by comparing the performance of various Python-based language identification procedures on different metadata corpora extracted from a manually annotated article sample. (OpenAlex used the results presented in this article to inform the language metadata overhaul carried out as part of its recent Walden system launch.) The precision and recall performance of each algorithm, corpus, and language is first analyzed, followed by an assessment of processing speeds recorded for each algorithm and corpus type. These different performance measures are then simulated at the database level using probabilistic confusion matrices for each algorithm, corpus, and language, as well as a probabilistic modeling of relative article language frequencies for the whole OpenAlex database. Results show that procedure performance strongly depends on the importance given to each of the measures implemented: for contexts where precision is preferred, using the LangID algorithm on the greedy corpus gives the best results; however, for all cases where recall is considered at least slightly more important than precision or as soon as processing times are given any kind of consideration, the procedure that consists in the application of the FastText algorithm on the Titles corpus outperforms all other alternatives. Given the lack of truly multilingual large-scale bibliographic databases, it is hoped that these results help confirm and foster the unparalleled potential of the OpenAlex database for cross-linguistic and comprehensive measurement and evaluation.
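As a rough illustration of the per-language precision and recall measures mentioned in this abstract, the following minimal sketch computes both from pairs of manual and predicted labels. The function name `precision_recall` and the tiny sample are hypothetical and are not taken from the paper; the paper's own evaluation code is not shown in this record.

```python
from collections import Counter


def precision_recall(pairs):
    """Return {label: (precision, recall)} from (true_label, predicted_label) pairs."""
    true_counts = Counter(t for t, _ in pairs)        # how often each label is the truth
    pred_counts = Counter(p for _, p in pairs)        # how often each label is predicted
    correct = Counter(t for t, p in pairs if t == p)  # true positives per label
    scores = {}
    for label in sorted(set(true_counts) | set(pred_counts)):
        tp = correct[label]
        precision = tp / pred_counts[label] if pred_counts[label] else 0.0
        recall = tp / true_counts[label] if true_counts[label] else 0.0
        scores[label] = (precision, recall)
    return scores


# Hypothetical annotated sample: (manual label, algorithm output).
sample = [
    ("en", "en"), ("en", "en"),
    ("fr", "en"), ("fr", "fr"),
    ("de", "de"), ("de", "fr"),
]
scores = precision_recall(sample)
for label, (p, r) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

In this toy sample, "en" is over-predicted, so its recall is perfect while its precision drops; this is the kind of trade-off the abstract describes when weighting one measure against the other.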

preprint, 2022, arXiv

Impact of Geographic Diversity on Citation of Collaborative Research

Diversity in human capital is widely seen as critical to creating holistic and high-quality research, especially in areas that engage with diverse cultures, environments, and challenges. Quantification of diverse academic collaborations and their effect on research quality is lacking, especially at international scale and across different domains. Here, we present the first effort to measure the impact of geographic diversity in coauthorships on the citation of their papers across different academic domains. Our results unequivocally show that geographic coauthor diversity improves paper citation, but very long distance collaborations have variable impact. We also discover "well-trodden" collaboration circles that yield much less impact than similar travel distances. These relationships are observed to exist across different subject areas, but with varying strengths. These findings can help academics identify new opportunities from a diversity perspective, as well as inform funders on areas that require additional mobility support.

preprint, 2021, arXiv

Avoiding bias when inferring race using name-based approaches

Racial disparity in academia is a widely acknowledged problem. The quantitative understanding of race-based systemic inequalities is an important step towards a more equitable research system. However, because of the lack of robust information on authors' race, few large-scale analyses have been performed on this topic. Algorithmic approaches offer one solution, using known information about authors, such as their names, to infer their perceived race. As with any other algorithm, the process of racial inference can generate biases if it is not carefully considered. The goal of this article is to assess the extent to which algorithmic bias is introduced using different approaches for name-based racial inference. We use information from the U.S. Census and mortgage applications to infer the race of U.S.-affiliated authors in the Web of Science. We estimate the effects of using given and family names, thresholds or continuous distributions, and imputation. Our results demonstrate that the validity of name-based inference varies by race/ethnicity and that threshold approaches underestimate Black authors and overestimate White authors. We conclude with recommendations to avoid potential biases. This article lays the foundation for more systematic and less biased investigations into racial disparities in science.

preprint, 2020, arXiv

Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program)

One of the challenges in machine learning research is to ensure that presented and published results are sound and reliable. Reproducibility, that is obtaining similar results as presented in a paper or talk, using the same code and data (when available), is a necessary step to verify the reliability of research findings. Reproducibility is also an important step to promote open and accessible research, thereby allowing the scientific community to quickly integrate new findings and convert ideas to practice. Reproducibility also promotes the use of robust experimental workflows, which potentially reduce unintentional errors. In 2019, the Neural Information Processing Systems (NeurIPS) conference, the premier international conference for research in machine learning, introduced a reproducibility program, designed to improve the standards across the community for how we conduct, communicate, and evaluate machine learning research. The program contained three components: a code submission policy, a community-wide reproducibility challenge, and the inclusion of the Machine Learning Reproducibility checklist as part of the paper submission process. In this paper, we describe each of these components, how it was deployed, as well as what we were able to learn from this initiative.

preprint, 2020, arXiv

Textual analysis of artificial intelligence manuscripts reveals features associated with peer review outcome

We analysed a dataset of scientific manuscripts that were submitted to various conferences in artificial intelligence. We performed a combination of semantic, lexical and psycholinguistic analyses of the full text of the manuscripts and compared them with the outcome of the peer review process. We found that accepted manuscripts scored lower than rejected manuscripts on two indicators of readability, and that they also used more scientific and artificial intelligence jargon. We also found that accepted manuscripts were written with words that are less frequent, that are acquired at an older age, and that are more abstract than rejected manuscripts. The analysis of references included in the manuscripts revealed that the subset of accepted submissions were more likely to cite the same publications. This finding was echoed by pairwise comparisons of the word content of the manuscripts (i.e., an indicator of semantic similarity), which were more similar in the subset of accepted manuscripts. Finally, we predicted the peer review outcome of manuscripts with their word content, with words related to machine learning and neural networks positively related with acceptance, whereas words related to logic, symbolic processing and knowledge-based systems negatively related with acceptance.

preprint, 2018, arXiv

The many faces of mobility: Using bibliometric data to measure the movement of scientists

This paper presents a methodological framework for developing scientific mobility indicators based on bibliometric data. We identify nearly 16 million individual authors from publications covered in the Web of Science for the 2008-2015 period. Based on the information provided across individuals' publication records, we propose a general classification for analyzing scientific mobility using institutional affiliation changes. We distinguish between migrants--authors who have ruptures with their country of origin--and travelers--authors who gain additional affiliations while maintaining affiliation with their country of origin. We find that 3.7 percent of researchers who have published at least one paper over the period are mobile. Travelers represent 72.7 percent of all mobile scholars, but migrants have higher scientific impact. We apply this classification at the country level, expanding the classification to incorporate the directionality of scientists' mobility (i.e., incoming and outgoing). We provide a brief analysis to highlight the utility of the proposed taxonomy to study scholarly mobility and discuss the implications for science policy.
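The migrant/traveler distinction in this abstract can be sketched in a few lines. This is only one possible reading of the definitions given above, not the paper's actual classification procedure: the function `classify_mobility`, the use of the first publication's countries as the "country of origin", and the country codes are all assumptions made for illustration.

```python
def classify_mobility(affiliation_history):
    """Classify an author from a chronological list of country-code sets.

    Each element is the set of countries listed in the affiliations of one
    publication. Assumption: the first publication defines the origin.
    """
    if not affiliation_history:
        return "unknown"
    origin = affiliation_history[0]
    later = affiliation_history[1:]
    # No change in affiliation countries at all: not mobile.
    if all(countries == origin for countries in later):
        return "non-mobile"
    # At least one publication with no tie to the origin: a rupture.
    if any(not (countries & origin) for countries in later):
        return "migrant"
    # Origin is always kept while other countries are added.
    return "traveler"


print(classify_mobility([{"CA"}, {"CA"}]))
print(classify_mobility([{"CA"}, {"CA", "US"}]))
print(classify_mobility([{"CA"}, {"US"}]))
```

A real implementation would also need author disambiguation and a rule for directionality (incoming versus outgoing), which the abstract mentions but this sketch leaves out.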

preprint, 2016, arXiv

On the Composition of Scientific Abstracts

Scientific abstracts contain what is considered by the author(s) as information that best describes documents' content. They represent a compressed view of the informational content of a document and allow readers to evaluate the relevance of the document to a particular information need. However, little is known about their composition. This paper contributes to the understanding of the structure of abstracts, by comparing similarity between scientific abstracts and the text content of research articles. More specifically, using sentence-based similarity metrics, we quantify the phenomenon of text re-use in abstracts and examine the positions of the sentences that are similar to sentences in abstracts in the IMRaD structure (Introduction, Methods, Results and Discussion), using a corpus of over 85,000 research articles published in the seven PLOS journals. We provide evidence that 84% of abstracts have at least one sentence in common with the body of the article. Our results also show that the sections of the paper from which abstract sentences are taken are invariant across the PLOS journals, with sentences mainly coming from the beginning of the introduction and the end of the conclusion.
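To make the idea of sentence-based text re-use concrete, here is a minimal sketch that flags abstract sentences with a near-identical counterpart in the article body. The abstract does not name the similarity metrics used, so Jaccard overlap of word sets is swapped in here as a stand-in; the helper names, the naive sentence splitter, and the 0.8 threshold are likewise assumptions.

```python
import re


def sentences(text):
    """Split text into sentences on terminal punctuation (a naive splitter)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokens(sentence):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity between the word sets of two sentences."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def reused_sentences(abstract, body, threshold=0.8):
    """Return abstract sentences whose best match in the body meets the threshold."""
    body_sents = sentences(body)
    reused = []
    for sent in sentences(abstract):
        best = max((jaccard(sent, b) for b in body_sents), default=0.0)
        if best >= threshold:
            reused.append(sent)
    return reused


# Hypothetical toy document.
abstract = "We study text reuse. Results are new."
body = "This is an introduction. We study text reuse. More content follows here."
print(reused_sentences(abstract, body))
```

Recording the index of the best-matching body sentence, rather than only the score, would be the natural next step for locating re-used sentences within the IMRaD sections.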

preprint, 2016, arXiv

Scholarly use of social media and altmetrics: a review of the literature

Social media has become integrated into the fabric of the scholarly communication system in fundamental ways: principally through scholarly use of social media platforms and the promotion of new indicators on the basis of interactions with these platforms. Research and scholarship in this area have accelerated since the coining of, and subsequent advocacy for, altmetrics -- that is, research indicators based on social media activity. This review provides an extensive account of the state of the art in both scholarly use of social media and altmetrics. The review consists of two main parts: the first examines the use of social media in academia, examining the various functions these platforms have in the scholarly communication process and the factors that affect this use. The second part reviews empirical studies of altmetrics, discussing the various interpretations of altmetrics, data collection and methodological limitations, and differences according to platform. The review ends with a critical discussion of the implications of this transformation in the scholarly communication system.

preprint, 2015, arXiv

Social media in scholarly communication

Social media metrics - commonly coined as "altmetrics" - have been heralded as great democratizers of science, providing broader and timelier indicators of impact than citations. These metrics come from a range of sources, including Twitter, blogs, social reference managers, post-publication peer review, and other social media platforms. Social media metrics have begun to be used as indicators of scientific impact, yet the theoretical foundation, empirical validity, and extent of use of platforms underlying these metrics lack thorough treatment in the literature. This editorial provides an overview of terminology and definitions of altmetrics and summarizes current research regarding social media use in academia, social media metrics as well as data reliability and validity. The papers of the special issue are introduced.

preprint, 2014, arXiv

Astrophysicists on Twitter: An in-depth analysis of tweeting and scientific publication behavior

This paper analyzes the tweeting behavior of 37 astrophysicists on Twitter and compares their tweeting behavior with their publication behavior and citation impact to show whether they tweet research-related topics or not. Astrophysicists on Twitter are selected to compare their tweets with their publications from Web of Science. Different user groups are identified based on tweeting and publication frequency. A moderate negative correlation (ρ=-0.390*) is found between the number of publications and tweets per day, while retweet and citation rates do not correlate. The similarity between tweets and abstracts is very low (cos=0.081). User groups show different tweeting behavior such as retweeting and including hashtags, usernames and URLs. The study is limited in terms of the small set of astrophysicists. Results are not necessarily representative of the entire astrophysicist community on Twitter and they most certainly do not apply to scientists in general. Future research should apply the methods to a larger set of researchers and other scientific disciplines. To a certain extent, this study helps to understand how researchers use Twitter. The results hint at the fact that impact on Twitter can neither be equated with nor replace traditional research impact metrics. However, tweets and other so-called altmetrics might be able to reflect other kinds of impact of scientists such as public outreach and science communication. To the best of our knowledge, this is the first in-depth study comparing researchers' tweeting activity and behavior with scientific publication output in terms of quantity, content and impact.
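The "cos" value reported in this abstract is a cosine similarity between tweet and abstract text. A minimal sketch of that measure on raw term-frequency vectors follows; the tokenization, the absence of stop-word removal or weighting, and the example strings are assumptions, since the abstract does not describe the paper's preprocessing.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Term-frequency vector as a Counter of lowercased alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    va, vb = term_counts(a), term_counts(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical tweet and abstract snippets.
tweet = "great coffee before the dark matter talk"
abstract = "we model dark matter halo formation in galaxy clusters"
print(round(cosine(tweet, abstract), 3))
```

A value near 0 means the two texts share almost no vocabulary, and a value of 1 means their term distributions are proportional.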

preprint, 2014, arXiv

The role of handbooks in knowledge creation and diffusion: A case of science and technology studies

Genre is considered to be an important element in scholarly communication and in the practice of scientific disciplines. However, scientometric studies have typically focused on a single genre, the journal article. The goal of this study is to understand the role that handbooks play in knowledge creation and diffusion and their relationship with the genre of journal articles, particularly in highly interdisciplinary and emergent social science and humanities disciplines. To shed light on these questions we focused on handbooks and journal articles published over the last four decades belonging to the research area of Science and Technology Studies (STS), broadly defined. To get a detailed picture we used the full-text of five handbooks (500,000 words) and a well-defined set of 11,700 STS articles. We confirmed the methodological split of STS into qualitative and quantitative (scientometric) approaches. Even when the two traditions explore similar topics (e.g., science and gender) they approach them from different starting points. The change in cognitive foci in both handbooks and articles partially reflects the changing trends in STS research, often driven by technology. Using text similarity measures we found that, in the case of STS, handbooks play no special role in either focusing the research efforts or marking their decline. In general, they do not represent the summaries of research directions that have emerged since the previous edition of the handbook.

preprint, 2014, arXiv

Tweets as impact indicators: Examining the implications of automated bot accounts on Twitter

This brief communication presents preliminary findings on automated Twitter accounts distributing links to scientific papers deposited on the preprint repository arXiv. It discusses the implication of the presence of such bots from the perspective of social media metrics (altmetrics), where mentions of scholarly documents on Twitter have been suggested as a means of measuring impact that is both broader and timelier than citations. We present preliminary findings that automated Twitter accounts create a considerable number of tweets linking to scientific papers and that they behave differently from common social bots, which has critical implications for the use of raw tweet counts in research evaluation and assessment. We discuss some definitions of Twitter cyborgs and bots in scholarly communication and propose differentiating between different levels of engagement from tweeting only bibliographic information to discussing or commenting on the content of a paper.

preprint, 2014, arXiv

Tweets vs. Mendeley readers: How do these two social media metrics differ?

A set of 1.4 million biomedical papers was analyzed with regard to how often articles are mentioned on Twitter or saved by users on Mendeley. While Twitter is a microblogging platform used by a general audience to distribute information, Mendeley is a reference manager targeted at an academic user group to organize scholarly literature. Both platforms are used as sources for so-called altmetrics to measure a new kind of research impact. This analysis shows how far they differ from each other and how they compare to traditional citation impact metrics, based on a large set of PubMed papers.

preprint, 2013, arXiv

Tweeting biomedicine: an analysis of tweets and citations in the biomedical literature

Data collected by social media platforms have recently been introduced as a new source for indicators to help measure the impact of scholarly research in ways that are complementary to traditional citation-based indicators. Data generated from social media activities related to scholarly content can be used to reflect broad types of impact. This paper aims to provide systematic evidence regarding how often Twitter is used to diffuse journal articles in the biomedical and life sciences. The analysis is based on a set of 1.4 million documents covered by both PubMed and Web of Science (WoS) and published between 2010 and 2012. The number of tweets containing links to these documents was analyzed to evaluate the degree to which certain journals, disciplines, and specialties were represented on Twitter. It is shown that, with fewer than 10% of PubMed articles mentioned on Twitter, its uptake is low in general. The relationship between tweets and WoS citations was examined for each document at the level of journals and specialties. The results show that tweeting behavior varies between journals and specialties and correlations between tweets and citations are low, implying that impact metrics based on tweets are different from those based on citations. A framework utilizing the coverage of articles and the correlation between Twitter mentions and citations is proposed to facilitate the evaluation of novel social-media based metrics and to shed light on the question of how far the number of tweets is a valid metric to measure research impact.

preprint, 2012, arXiv

Green and Gold Open Access Percentages and Growth, by Discipline

Most refereed journal articles today are published in subscription journals, accessible only to subscribing institutions, hence losing considerable research impact. Making articles freely accessible online ("Open Access," OA) maximizes their impact. Articles can be made OA in two ways: by self-archiving them on the web ("Green OA") or by publishing them in OA journals ("Gold OA"). We compared the percent and growth rate of Green and Gold OA for 14 disciplines in two random samples of 1300 articles per discipline out of the 12,500 journals indexed by Thomson-Reuters-ISI using a robot that trawled the web for OA full-texts. We sampled in 2009 and 2011 for publication year ranges 1998-2006 and 2005-2010, respectively. Green OA (21.4%) exceeds Gold OA (2.4%) in proportion and growth rate in all but the biomedical disciplines, probably because it can be provided for all journal articles and does not require paying extra Gold OA publication fees. The spontaneous overall OA growth rate is still very slow (about 1% per year). If institutions make Green OA self-archiving mandatory, however, this triples the percentage of Green OA and accelerates its growth rate.

preprint, 2011, arXiv

A small world of citations? The influence of collaboration networks on citation practices

This paper examines the proximity of authors to those they cite using degrees of separation in a co-author network, essentially using collaboration networks to expand on the notion of self-citations. While the proportion of direct self-citations (including co-authors of both citing and cited papers) is relatively constant in time and across specialties in the natural sciences (10% of citations) and the social sciences (20%), the same cannot be said for citations to authors who are members of the co-author network. Differences between fields and trends over time lie not only in the degree of co-authorship which defines the large-scale topology of the collaboration network, but also in the referencing practices within a given discipline, computed by defining a propensity to cite at a given distance within the collaboration network. Overall, there is little tendency to cite those nearby in the collaboration network, excluding direct self-citations. By analyzing these social references, we characterize the social capital of local collaboration networks in terms of the knowledge production within scientific fields. These results have implications for the long-standing debate over biases common to most types of citation analysis, and for understanding citation practices across scientific disciplines over the past 50 years. In addition, our findings have important practical implications for the availability of 'arm's length' expert reviewers of grant applications and manuscripts.
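The "degrees of separation in a co-author network" used in this abstract can be illustrated with a breadth-first search over an undirected co-authorship graph. This is a minimal sketch under stated assumptions: the function names and the toy author lists are hypothetical, and the paper's own network construction (time windows, author disambiguation) is not reproduced here.

```python
from collections import deque


def build_coauthor_graph(papers):
    """Build an undirected graph {author: set(co-authors)} from author lists."""
    graph = {}
    for authors in papers:
        for author in authors:
            graph.setdefault(author, set()).update(
                other for other in authors if other != author
            )
    return graph


def separation(graph, source, target):
    """Shortest co-authorship distance between two authors, or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Hypothetical author lists, one per paper.
papers = [["A", "B"], ["B", "C"], ["D"]]
graph = build_coauthor_graph(papers)
print(separation(graph, "A", "C"))
```

Under this convention a distance of 0 corresponds to a citing author citing their own work, 1 to citing a direct co-author, and larger values to citations further out in the collaboration network.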
