Researcher profile

Wentian Li

Wentian Li contributes to research discovery and scholarly infrastructure.

Published work

15 published item(s)

preprint, 2024, arXiv

Extending 1089 attractor to any number of digits and any number of steps

The well-known 1089 trick reflects an amazing trait of the digital reversal process and is reminiscent of a limiting attractor in dynamical systems, even though it takes only two steps. It is natural to consider situations where the number of digits goes beyond the three of the original 1089 trick, as well as situations where the number of steps goes beyond two. The first part has been mostly done by Webster, whose results we reproduce. After two steps, the resulting integers are called Papadakis-Webster integers (PWI); they are always divisible by 99, and the resulting quotients consist of only 0's and 1's, which we name Papadakis-Webster binary strings (PWBS). Not every binary string is a PWBS, and we define the hairpin pairing rule to determine whether a binary string is a PWBS. For the second part, we propose a two-option iteration system named iterative digital reversal (IDR), which suitably interweaves additions and subtractions. The simplest limiting behavior of IDR is a 2-cycle. The elements in an IDR 2-cycle are all composed of repetitions of the 10(9)$_L$89 ($L \ge 0$) motif, and are all PWIs. The lower 2-cycle elements, after division by 99, belong to the subset of PWBS that are palindromic and consist of 0- and 1-blocks with a minimal length of two. IDR also has higher p-cycles (p=10,12,71) whose elements seem to contain at least one PWI. Another interesting finding about IDR is that it contains non-periodic, diverging trajectories in which the integer values grow to infinity. In these diverging trajectories, while the number of flanking digits around the middle point increases with each iteration, the middle part has an 8-cycle rhythm or signature, which has been found in all diverging trajectories. Overall, the generalization of the original 1089 trick in both space and time leads to new patterns in integers and new phenomenology in dynamics.
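
Illustration (not taken from the paper): a minimal Python sketch of the two-step digit-reversal procedure summarized above. It assumes that numbers are zero-padded to a fixed width before reversal and that the first step takes the absolute difference; the function names `reverse_digits` and `two_step` are hypothetical.

```python
def reverse_digits(n, width):
    """Reverse n after zero-padding it to `width` digits."""
    return int(str(n).zfill(width)[::-1])


def two_step(n, width):
    """Step 1: |n - reverse(n)|. Step 2: add the padded reverse of that difference."""
    diff = abs(n - reverse_digits(n, width))
    return diff + reverse_digits(diff, width)


# Three digits: the classic trick, for numbers whose first and last digits differ.
print(two_step(321, 3))

# Four digits: collect the quotients after dividing the two-step results by 99.
quotients = sorted({two_step(n, 4) // 99 for n in range(1000, 10000)})
print(quotients)
```

Under these assumptions, the printed quotients for four-digit inputs can be inspected directly to see that they contain only the digits 0 and 1, as the abstract states for PWBS.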

preprint, 2022, arXiv

Human mobility patterns in Mexico City and their links with socioeconomic variables during the COVID-19 pandemic

The availability of cellphone geolocation data provides a remarkable opportunity to study human mobility patterns and how these patterns are affected by the recent pandemic. Two simple centrality metrics allow us to measure two different aspects of mobility in origin-destination networks constructed with this type of data: the variety of places connected to a certain node (degree) and the number of people that travel to or from a given node (strength). In this contribution, we present an analysis of node degree and strength in daily origin-destination networks for Greater Mexico City during 2020. Unlike what is observed in many complex networks, these origin-destination networks are not scale free. Instead, there is a characteristic scale defined by the distribution peak; centrality distributions exhibit a skewed two-tail distribution with power law decay on each side of the peak. We found that high mobility areas tend to be closer to the city center and to have higher population and better socioeconomic conditions. Areas with anomalous behavior are almost always on the periphery of the city, where we can also observe a qualitative difference in mobility patterns between east and west. Finally, we study the effect of mobility restrictions due to the outbreak of the COVID-19 pandemic on these mobility patterns.
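
Illustration (not taken from the paper): a minimal Python sketch of the two centrality metrics named above, degree and strength, for a single origin-destination network. The trip list and zone names are made up, and the sketch assumes that trips within the same zone are ignored.

```python
from collections import defaultdict


def degree_and_strength(trips):
    """trips: iterable of (origin, destination, travellers) tuples.

    degree   = number of distinct zones connected to a zone
    strength = number of people travelling to or from a zone
    """
    neighbours = defaultdict(set)
    strength = defaultdict(int)
    for origin, destination, travellers in trips:
        if origin == destination:
            continue  # assumption: within-zone trips are not counted
        neighbours[origin].add(destination)
        neighbours[destination].add(origin)
        strength[origin] += travellers
        strength[destination] += travellers
    degree = {zone: len(others) for zone, others in neighbours.items()}
    return degree, dict(strength)


# Hypothetical daily trips between three zones.
toy_trips = [("A", "B", 10), ("B", "A", 5), ("A", "C", 3)]
degree, strength = degree_and_strength(toy_trips)
print(degree)
print(strength)
```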

preprint, 2020, arXiv

Revisiting the Neutral Dynamics Derived Limiting Guanine-Cytosine Content Using the Human De Novo Point Mutation Data

We revisit the topic of human genome guanine-cytosine content under neutral evolution. For this study, de novo mutation data within humans are used to estimate the mutation rate, instead of base substitution data between related species. We then define a new measure of mutation bias which separates the de novo mutation counts from the background guanine-cytosine content itself, making comparisons between different datasets easier. We derive a new formula for calculating the limiting guanine-cytosine content by treating CpG-involved mutational events as an independent variable. With this formula, when CpG-involved mutations are considered, the guanine-cytosine content drops less severely in the limit of neutral dynamics. We provide evidence, under certain assumptions, that an isochore-like structure might remain as a limiting configuration of the neutral mutational dynamics.

preprint, 2016, arXiv

Population patterns in World's administrative units

While there has been an extended discussion concerning city population distribution, little has been said about administrative units. Even though there might be a correspondence between cities and administrative divisions, they are conceptually different entities, and the correspondence breaks down as artificial divisions form and evolve. In this work we investigate the population distribution of second-level administrative units for 150 countries and propose the Discrete Generalized Beta Distribution (DGBD) rank-size function to describe the data. Testing the goodness of fit of this two-parameter function against the power law, which is the most common model for city population, we find that DGBD is a good statistical model for 73% of our data sets and better than the power law in almost every case. In particular, DGBD is better than the power law for fitting country population data. The fitted parameters of this function allow us to construct a phenomenological characterization of countries according to the way in which people are distributed inside them. We present a computational model to simulate the formation of administrative divisions and give numerical evidence that DGBD arises from it. This model, along with the DGBD function, proves adequate to reproduce and describe local unit evolution and its effect on population distribution.
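
Illustration (not taken from the paper): a minimal Python sketch of fitting a DGBD-type rank-size function. The abstract does not state the formula, so the sketch assumes the commonly used form $f(r) = A (N+1-r)^b / r^a$, with two shape parameters $a$, $b$ and a scale $A$, where $b = 0$ reduces to a power law. The fit is an ordinary least-squares fit in log space, which is a choice made here for simplicity and not necessarily the paper's procedure; all function names are hypothetical and the data are synthetic.

```python
import math


def dgbd(rank, n, scale, a, b):
    """Assumed DGBD form: scale * (n + 1 - rank)**b / rank**a."""
    return scale * (n + 1 - rank) ** b / rank ** a


def solve_linear(matrix, rhs):
    """Gauss-Jordan elimination with partial pivoting for a small square system."""
    size = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(size):
            if i != col:
                factor = aug[i][col] / aug[col][col]
                aug[i] = [x - factor * p for x, p in zip(aug[i], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def fit_dgbd(sizes):
    """Fit log f = log(scale) - a*log(r) + b*log(n+1-r); sizes sorted descending."""
    n = len(sizes)
    rows = [(1.0, -math.log(r), math.log(n + 1 - r)) for r in range(1, n + 1)]
    targets = [math.log(s) for s in sizes]
    normal = [[sum(row[i] * row[j] for row in rows) for j in range(3)]
              for i in range(3)]
    moment = [sum(row[i] * t for row, t in zip(rows, targets)) for i in range(3)]
    log_scale, a, b = solve_linear(normal, moment)
    return math.exp(log_scale), a, b


# Synthetic check: generate noise-free sizes from known parameters and refit.
synthetic = [dgbd(r, 50, 1000.0, 0.8, 0.5) for r in range(1, 51)]
print(fit_dgbd(synthetic))
```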

preprint, 2013, arXiv

Application of Volcano Plots in Analyses of mRNA Differential Expressions with Microarrays

A volcano plot displays the unstandardized signal (e.g. log-fold-change) against the noise-adjusted/standardized signal (e.g. t-statistic or -log10(p-value) from the t test). We review the basic and interactive uses of the volcano plot, and its crucial role in understanding the regularized t-statistic. The joint filtering gene selection criterion based on regularized statistics has a curved discriminant line in the volcano plot, as compared to the two perpendicular lines for the "double filtering" criterion. This review attempts to provide a unifying framework for discussions on alternative measures of differential expression, improved methods for estimating variance, and visual display of a microarray analysis result. We also discuss the possibility of applying volcano plots to fields beyond microarrays.
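
Illustration (not taken from the paper): a minimal Python sketch of the two coordinates of one gene in a volcano plot, together with the "double filtering" criterion mentioned above. It assumes positive raw intensities, a log2 transform, and an unequal-variance (Welch) t-statistic; the cutoff values and function names are hypothetical.

```python
import math
from statistics import mean, variance


def volcano_point(control, treated):
    """Return (log2 fold change, t-statistic) for one gene."""
    log_control = [math.log2(v) for v in control]
    log_treated = [math.log2(v) for v in treated]
    log_fc = mean(log_treated) - mean(log_control)
    std_err = math.sqrt(variance(log_control) / len(log_control)
                        + variance(log_treated) / len(log_treated))
    return log_fc, log_fc / std_err


def double_filter(log_fc, t_stat, fc_cut=1.0, t_cut=2.0):
    """Select a gene only if it passes both cutoffs (two perpendicular lines)."""
    return abs(log_fc) >= fc_cut and abs(t_stat) >= t_cut


# Made-up intensities for a single gene in two groups of three samples.
log_fc, t_stat = volcano_point([1, 2, 4], [8, 16, 32])
print(log_fc, t_stat, double_filter(log_fc, t_stat))
```

The joint filtering criterion based on a regularized statistic, which the abstract contrasts with double filtering, is not implemented in this sketch.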

preprint, 2012, arXiv

Analyses of Baby Name Popularity Distribution in U.S. for the Last 131 Years

We examine the complete dataset of baby name popularity collected by the U.S. Social Security Administration for the last 131 years (1880-2010). The ranked baby name popularity can be fitted empirically by a piecewise function consisting of a Beta function for the high-ranking names and a power-law function for the low-ranking names, but not by a power law (Zipf's law) or a Beta function alone.

preprint, 2012, arXiv

Characterizing Ranked Chinese Syllable-to-Character Mapping Spectrum: A Bridge Between the Spoken and Written Chinese Language

One important aspect of the relationship between spoken and written Chinese is the ranked syllable-to-character mapping spectrum, which is the list of syllables ranked by the number of characters that map to each syllable. Previously, this spectrum was analyzed for more than 400 syllables without distinguishing the four intonations. In the current study, the spectrum with 1280 toned syllables is analyzed by a logarithmic function, a Beta rank function, and a piecewise logarithmic function. Of the three fitting functions, the two-piece logarithmic function fits the data best, both by the smallest sum of squared errors (SSE) and by the lowest Akaike information criterion (AIC) value. The Beta rank function is a close second. By sampling from a Poisson distribution whose parameter value is chosen from the observed data, we empirically estimate the $p$-value for testing the hypothesis that the two-piece logarithmic function is better than the Beta rank function to be 0.16. For practical purposes, the piecewise logarithmic function and the Beta rank function can be considered a tie.

preprint, 2011, arXiv

Effective Sample Size: Quick Estimation of the Effect of Related Samples in Genetic Case-Control Association Analyses

Affected relatives are essential for pedigree linkage analysis; however, they violate the independent sample assumption in case-control association studies. To avoid the correlation between samples, a common practice is to take only one affected sample per pedigree in association analysis. Although several methods exist for handling correlated samples, they are still not widely used, in part because they are not easily implemented or because they are not widely known. We advocate the effective sample size method as a simple and accessible approach for case-control association analysis with correlated samples. This method modifies the chi-square test statistic, p-value, and 95% confidence interval of the odds-ratio by replacing the apparent number of allele or genotype counts with the effective ones in the standard formula, without the need for specialized computer programs. We present a simple formula for calculating effective sample size for many types of relative pairs and relative sets. For allele frequency estimation, the effective sample size method captures the variance inflation exactly. For genotype frequency, simulations showed that effective sample size provides a satisfactory approximation. A gene which was previously identified as a type 1 diabetes susceptibility locus, the interferon-induced helicase gene (IFIH1), is shown to be significantly associated with rheumatoid arthritis when the effective sample size method is applied. This significant association is not established if only one affected sib per pedigree is used in the association analysis. Relationships between the effective sample size method and other methods (the generalized estimation equation, variance of eigenvalues for correlation matrices, and genomic controls) are discussed.

preprint, 2011, arXiv

Fitting Ranked English and Spanish Letter Frequency Distribution in U.S. and Mexican Presidential Speeches

The limited range of the abscissa in ranked letter frequency distributions allows multiple functions to fit the observed distribution reasonably well. In order to critically compare various functions, we apply statistical model selection to ten functions, using the texts of U.S. and Mexican presidential speeches from the last one to two centuries. Despite minor switching of the ranking order of certain letters during the temporal evolution of both datasets, the letter usage is generally stable. The best fitting function, judged either by least-square-error or by AIC/BIC model selection, is the Cocho/Beta function. We also use a novel method to discover clusters of letters by their observed-over-expected frequency ratios.

preprint, 2009, arXiv

Copy-number-variation and copy-number-alteration region detection by cumulative plots

Background: Regions with copy number variations (in germline cells) or copy number alterations (in somatic cells) are of great interest for human disease gene mapping and cancer studies. They represent a new type of mutation and are larger in scale than single nucleotide polymorphisms. Using genotyping microarrays for copy number variation detection has become standard, and there is a need for improved analysis methods. Results: We apply the cumulative plot to the detection of regions with copy number variation/alteration, on samples taken from a chronic lymphocytic leukemia patient. Two sets of whole-genome genotyping of 317k single nucleotide polymorphisms, one from the normal cell and another from the cancer cell, are analyzed. We demonstrate the utility of the cumulative plot in detecting a 9 Mb (9 x 10^6 bases) hemizygous deletion and a 1 Mb homozygous deletion on chromosome 13. We also show the possibility of detecting smaller copy number variation/alteration regions below the 100 kb range. Conclusions: As a graphic tool, the cumulative plot is an intuitive and scale-free (window-less) way of detecting copy number variation/alteration regions, especially when such regions are small.

preprint, 2009, arXiv

Partial correlation analysis indicates causal relationships between GC-content, exon density and recombination rate in the human genome

Background: Several features are known to correlate with the GC-content in the human genome, including recombination rate, gene density and distance to telomere. However, by testing for pairwise correlation only, it is impossible to distinguish direct associations from indirect ones and to distinguish between causes and effects. Results: We use partial correlations to construct partially directed graphs for the following four variables: GC-content, recombination rate, exon density and distance-to-telomere. Recombination rate and exon density are unconditionally uncorrelated, but become inversely correlated by conditioning on GC-content. This pattern indicates a model where recombination rate and exon density are two independent causes of GC-content variation. Conclusions: Causal inference and graphical models are useful methods to understand genome evolution and the mechanisms of isochore evolution in the human genome.
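
Illustration (not taken from the paper): a minimal Python sketch of the pattern described above, using the standard first-order partial correlation formula $r_{xy \cdot z} = (r_{xy} - r_{xz} r_{yz}) / \sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}$. The toy values are invented so that two uncorrelated variables jointly determine a third; they are not genomic data, and the variable names are only labels.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after conditioning on z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Toy data: two uncorrelated "causes" and a common "effect" with some noise.
recombination = [1, -1, 1, -1]
exon_density = [1, 1, -1, -1]
noise = [0.5, -0.5, -0.5, 0.5]
gc_content = [r + e + w for r, e, w in zip(recombination, exon_density, noise)]

print(pearson(recombination, exon_density))
print(partial_corr(recombination, exon_density, gc_content))
```

In this toy example the two causes are uncorrelated on their own, and conditioning on their common effect induces a negative partial correlation, which is the signature the abstract uses to orient the graph.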

preprint, 2009, arXiv

Two-Parameter Characterization of Chromosome-Scale Recombination Rate

The genome-wide recombination rate ($RR$) of a species is often described by one parameter, the ratio between total genetic map length ($G$) and physical map length ($P$), measured in centiMorgans per Megabase (cM/Mb). The value of this parameter varies greatly between species, but the cause for these differences is not entirely clear. A constraining factor of overall $RR$ in a species, which may cause increased $RR$ for smaller chromosomes, is the requirement of at least one chiasma per chromosome (or chromosome-arm) per meiosis. In the present study, we quantify the relative excess of recombination events on smaller chromosomes by a linear regression model, which relates the genetic length of chromosomes to their physical length. We find for several species that the two-parameter regression, $G= G_0 + k \cdot P$ provides a better characterization of the relationship between genetic and physical map length than the one-parameter regression that runs through the origin. A non-zero intercept ($G_0$) indicates a relative excess of recombination on smaller chromosomes in a genome. Given $G_0$, the parameter $k$ predicts the increase of genetic map length over the increase of physical map length. The observed values of $G_0$ have a similar magnitude for diverse species, whereas $k$ varies by two orders of magnitude. The implications of this strategy for the genetic maps of human, mouse, rat, chicken, honeybee, worm and yeast are discussed.
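
Illustration (not taken from the paper): a minimal Python sketch comparing the two-parameter regression $G = G_0 + k \cdot P$ with the one-parameter regression through the origin, using ordinary least squares. The chromosome lengths are made-up numbers chosen to lie exactly on a line with a non-zero intercept; the function names are hypothetical.

```python
def fit_with_intercept(physical, genetic):
    """Least-squares fit of G = G0 + k * P; returns (G0, k)."""
    n = len(physical)
    mean_p, mean_g = sum(physical) / n, sum(genetic) / n
    k = (sum((p - mean_p) * (g - mean_g) for p, g in zip(physical, genetic))
         / sum((p - mean_p) ** 2 for p in physical))
    return mean_g - k * mean_p, k


def fit_through_origin(physical, genetic):
    """Least-squares fit of G = k * P (no intercept); returns k."""
    return (sum(p * g for p, g in zip(physical, genetic))
            / sum(p * p for p in physical))


# Hypothetical chromosomes: physical length in Mb, genetic length in cM.
physical_mb = [50, 100, 150, 200]
genetic_cm = [100, 150, 200, 250]

g0, k = fit_with_intercept(physical_mb, genetic_cm)
k_origin = fit_through_origin(physical_mb, genetic_cm)
per_chromosome_rate = [g / p for g, p in zip(genetic_cm, physical_mb)]
print(g0, k, k_origin, per_chromosome_rate)
```

With a positive intercept, the per-chromosome rate in cM/Mb is higher for the shorter chromosomes in this toy data, which is the relative excess of recombination that $G_0$ is meant to capture.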

preprint, 2006, arXiv

Does Logarithm Transformation of Microarray Data Affect Ranking Order of Differentially Expressed Genes?

A common practice in microarray analysis is to transform the microarray raw data (light intensity) by a logarithmic transformation, and the justification for this transformation is to make the distribution more symmetric and Gaussian-like. Since this transformation is not universally practiced in microarray analysis, we examined whether this discrepancy in the treatment of raw data affects the "high level" analysis result, in particular, whether the differentially expressed genes obtained by the t-test, regularized t-test, or logistic regression have altered rank orders due to the presence or absence of the transformation. We show that as many as 20% to 40% of significant genes are "discordant" (significant in only one form of the data and not in both), depending on the test being used and the threshold value for claiming significance. The t-test is more likely to be affected by logarithmic transformation than logistic regression, and the regularized t-test is more affected than the t-test. On the other hand, the very top ranking genes (e.g. up to the top 20 to 50 genes, depending on the test) are not affected by the logarithmic transformation.