Researcher profile

Miao Fan

Miao Fan contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
10 works
0 followers
11 topics
4 close collaborators

Published work

10 published items

preprint | 2022 | arXiv

Understanding the Impact of the COVID-19 Pandemic on Transportation-related Behaviors with Human Mobility Data

The constrained outbreak of COVID-19 in Mainland China has recently been regarded as a successful example of fighting this highly contagious virus. Both the short period of transmission (about three months) and the sub-exponential increase of confirmed cases in Mainland China indicate that the Chinese authorities took effective epidemic prevention measures, such as case isolation, travel restrictions, closing recreational venues, and banning public gatherings. These measures can effectively control the spread of the COVID-19 pandemic. At the same time, they may dramatically change human mobility patterns, such as the daily transportation-related behaviors of the public. To better understand the impact of COVID-19 on transportation-related behaviors and to provide more targeted anti-epidemic measures, we use the large volume of human mobility data collected from Baidu Maps, a widely used Web mapping service in China, to examine in detail how people there reacted during the pandemic. Specifically, we conduct a data-driven analysis of transportation-related behaviors during the pandemic from the perspectives of 1) means of transportation, 2) type of visited venues, 3) check-in time of venues, 4) preference for "origin-destination" distance, and 5) "origin-transportation-destination" patterns. For each topic, we also give our specific insights and policy-making suggestions. Given that the COVID-19 pandemic is still spreading in more than 200 countries and territories worldwide, infecting millions of people, the insights and suggestions provided here may help fight COVID-19.

preprint | 2020 | arXiv

Quantifying the Economic Impact of COVID-19 in Mainland China Using Human Mobility Data

To contain the coronavirus (COVID-19) pandemic in Mainland China, the authorities have put in place a series of measures, including quarantines, social distancing, and travel restrictions. While these strategies have effectively dealt with the critical situations of outbreaks, the combination of the pandemic and mobility controls has slowed China's economic growth, resulting in the first quarterly decline of Gross Domestic Product (GDP) since GDP began to be calculated in 1992. To characterize the potential shrinkage of the domestic economy from the perspective of mobility, we propose two new economic indicators, the New Venues Created (NVC) and the Volumes of Visits to Venue (V^3), as complementary measures of domestic investment and consumption activities, using data from Baidu Maps. The historical records of these two indicators correlate strongly with past figures of Chinese GDP, while the status quo has changed dramatically this year because of the pandemic. We present a quantitative analysis that projects the impact of the pandemic on the economy using the recent trends of NVC and V^3. We find that the most affected sectors would be travel-dependent businesses, such as hotels, educational institutes, and public transportation, while the sectors that are essential to daily life, such as workplaces, residential areas, restaurants, and shopping sites, have been recovering rapidly. Analysis at the provincial level shows that self-sufficient and self-sustaining economic regions, with internal supplies, production, and consumption, have recovered faster than regions relying on global supply chains.

preprint | 2015 | arXiv

Detecting Table Region in PDF Documents Using Distant Supervision

This paper contributes a novel paradigm that leverages large-scale unlabeled PDF documents for open-domain table detection, and it is superior to the state-of-the-art approaches that compete in table recognition on the 67 annotated government reports in PDF format released by the ICDAR 2013 Table Competition. We integrate the paradigm into our latest system (PdfExtra) to detect table regions using 9,466 academic articles from the entire repository of the ACL Anthology, where almost all papers are archived in PDF format without table annotations. The paradigm first designs heuristics to automatically construct weakly labeled data. It then feeds diverse evidence, such as document layouts and linguistic features, which are extracted by Apache PDFBox and processed by the Stanford NLP toolkit, into different canonical classifiers. We finally use these classifiers, i.e. Naive Bayes, Logistic Regression and Support Vector Machine, to collaboratively vote on the region of tables. Experimental results show that PdfExtra achieves a substantial improvement over the state-of-the-art approach. Moreover, we discuss how different features, learning models and even document domains may affect performance. Extensive evaluations demonstrate that our paradigm is flexible enough to leverage various features and learning models for open-domain table region detection within PDF files.
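The collaborative voting step can be illustrated with a minimal sketch. It assumes each classifier emits one binary label per text line (1 = inside a table), which is an assumption made here for illustration; the abstract does not state the unit of classification or the exact voting rule. The function names and sample predictions are hypothetical.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label chosen by most classifiers."""
    return Counter(votes).most_common(1)[0][0]

def vote_on_lines(per_classifier_labels):
    """Combine per-line labels from several classifiers into one label per line."""
    return [majority_vote(line_votes) for line_votes in zip(*per_classifier_labels)]

# Hypothetical per-line outputs: 1 = line belongs to a table, 0 = it does not.
naive_bayes = [0, 1, 1, 0]
logistic_regression = [0, 1, 0, 0]
svm = [1, 1, 1, 0]

combined = vote_on_lines([naive_bayes, logistic_regression, svm])
print(combined)
```

With an odd number of binary voters there are no ties, which is one practical reason to combine three classifiers.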

preprint | 2015 | arXiv

Distant Supervision for Entity Linking

Entity linking is an indispensable operation for populating knowledge repositories in information extraction. It aligns a textual entity mention with its corresponding disambiguated entry in a knowledge repository. In this paper, we propose a new paradigm named distantly supervised entity linking (DSEL), in which the disambiguated entities that belong to a huge knowledge repository (Freebase) are automatically aligned to the corresponding descriptive webpages (Wiki pages). In this way, a large amount of weakly labeled data can be generated without manual annotation and fed to a classifier for linking newly discovered entities. Compared with traditional paradigms based on a single knowledge base, DSEL benefits from jointly leveraging the respective advantages of Freebase and Wikipedia. Specifically, the proposed paradigm bridges the disambiguated labels (Freebase) of entities and their textual descriptions (Wikipedia) for Web-scale entities. Experiments conducted on a dataset of 140,000 items and 60,000 features achieve a baseline F1-measure of 0.517. Furthermore, we analyze the feature performance and improve the F1-measure to 0.545.

preprint | 2015 | arXiv

Jointly Embedding Relations and Mentions for Knowledge Population

This paper contributes a joint embedding model for predicting relations between a pair of entities in the scenario of relation inference. It differs from most stand-alone approaches, which operate separately on either knowledge bases or free texts. The proposed model simultaneously learns low-dimensional vector representations for both triplets in knowledge repositories and the mentions of relations in free texts, so that we can leverage evidence from both resources to make more accurate predictions. We use NELL to evaluate the performance of our approach against cutting-edge methods. Results of extensive experiments show that our model achieves significant improvement on relation extraction.

preprint | 2015 | arXiv

Large Margin Nearest Neighbor Embedding for Knowledge Representation

The traditional way of storing facts in triplets (head_entity, relation, tail_entity), abbreviated as (h, r, t), makes knowledge intuitive to display and easy for humans to acquire, but hard for machines to compute over or reason about. Inspired by the success of applying distributed representations to AI-related fields, recent studies aim to represent each entity and relation with a unique low-dimensional embedding, which differs from the symbolic and atomic framework of displaying knowledge in triplets. In this way, knowledge computing and reasoning can be facilitated by means of a simple vector calculation, i.e. ${\bf h} + {\bf r} \approx {\bf t}$. We thus contribute an effective model to learn better embeddings satisfying the formula by pulling the positive tail entities ${\bf t^{+}}$ together and close to ${\bf h} + {\bf r}$ (Nearest Neighbor), and simultaneously pushing the negatives ${\bf t^{-}}$ away from the positives ${\bf t^{+}}$ by keeping a Large Margin. We also design a corresponding learning algorithm that efficiently searches for the optimal solution using Stochastic Gradient Descent in an iterative fashion. Quantitative experiments illustrate that our approach achieves state-of-the-art performance compared with several recent methods on benchmark datasets for two classical applications, i.e. link prediction and triplet classification. Moreover, we analyze the parameter complexity of all the evaluated models, and the analysis indicates that our model needs fewer computational resources while outperforming the other methods.
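The pull-and-push idea can be sketched with a generic margin-based hinge loss over the translation distance. This is a simplified, TransE-style ranking loss shown for illustration, not the paper's exact large margin nearest neighbor objective; the Euclidean norm, the margin value, and all vectors below are assumptions.

```python
import math

def distance(h, r, t):
    """Euclidean norm of h + r - t; small values mean the triplet looks plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def margin_loss(h, r, t_pos, t_neg, margin=1.0):
    """Hinge loss: zero once the negative tail is at least `margin` farther than the positive."""
    return max(0.0, margin + distance(h, r, t_pos) - distance(h, r, t_neg))

# Hypothetical 2-dimensional embeddings.
h = [1.0, 0.0]
r = [0.0, 1.0]
t_pos = [1.0, 1.0]    # exactly h + r, so its distance is 0
t_far = [4.0, 5.0]    # distance 5 from h + r, already outside the margin
t_near = [1.0, 1.5]   # distance 0.5 from h + r, violates the margin

print(margin_loss(h, r, t_pos, t_far))
print(margin_loss(h, r, t_pos, t_near))
```

Only negatives that fall inside the margin contribute a non-zero loss, so gradient updates concentrate on the hard cases.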

preprint | 2015 | arXiv

Learning Embedding Representations for Knowledge Inference on Imperfect and Incomplete Repositories

This paper makes a first attempt at knowledge inference on large-scale imperfect repositories with incomplete coverage by means of embedding entities and relations. We propose IIKE (Imperfect and Incomplete Knowledge Embedding), a probabilistic model which measures the probability of each belief, i.e. $\langle h,r,t\rangle$, in large-scale knowledge bases such as NELL and Freebase. Our objective is to learn a better low-dimensional vector representation for each entity ($h$ and $t$) and relation ($r$) by minimizing the loss of fitting the corresponding confidence given by machine learning (NELL) or crowdsourcing (Freebase), so that we can use $\|{\bf h} + {\bf r} - {\bf t}\|$ to assess the plausibility of a belief when conducting inference. We use subsets of those inexact knowledge bases to train our model and test the performance of link prediction and triplet classification on ground truth beliefs. The results of extensive experiments show that IIKE achieves significant improvement compared with the baseline and state-of-the-art approaches.

preprint | 2015 | arXiv

Parallel Knowledge Embedding with MapReduce on a Multi-core Processor

This article is a first attempt to explore parallel algorithms for learning distributed representations of both entities and relations in large-scale knowledge repositories with the MapReduce programming model on a multi-core processor. We accelerate the training process of a canonical knowledge embedding method, i.e. the translating embedding (TransE) model, by dividing a whole knowledge repository into several balanced subsets and feeding each subset into an individual core, where local embeddings can be updated concurrently during the Map phase. However, this usually produces inconsistent low-dimensional vector representations of the same key, collected from different Map workers, which leads to conflicts when conducting Reduce to merge the various vectors associated with the same key. Therefore, we try several strategies to acquire merged embeddings that not only retain the performance of entity inference, relation prediction, and even triplet classification achieved by the single-thread TransE on several well-known knowledge bases such as Freebase and NELL, but also scale up the learning speed with the number of cores within a processor. So far, the empirical studies show that we can achieve results comparable to the single-thread TransE trained with the stochastic gradient descent (SGD) algorithm, and increase the training speed multiple times by adapting the batch gradient descent (BGD) algorithm to the MapReduce paradigm.
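The Reduce-side conflict can be made concrete with a small sketch. Element-wise averaging is used here as one plausible merge strategy; the abstract says several strategies were tried without naming them, so this choice is an assumption. The worker outputs are hypothetical, and the code runs sequentially rather than on separate cores.

```python
from collections import defaultdict

def reduce_phase(worker_outputs):
    """Merge vectors that share a key across Map workers by element-wise averaging."""
    grouped = defaultdict(list)
    for output in worker_outputs:
        for key, vector in output.items():
            grouped[key].append(vector)
    return {
        key: [sum(column) / len(vectors) for column in zip(*vectors)]
        for key, vectors in grouped.items()
    }

# Hypothetical local embeddings produced by two Map workers.
worker_a = {"Paris": [1.0, 2.0], "capital_of": [0.0, 1.0]}
worker_b = {"Paris": [3.0, 4.0], "France": [5.0, 5.0]}

merged = reduce_phase([worker_a, worker_b])
print(merged["Paris"])
```

Keys seen by only one worker pass through unchanged, while keys seen by several workers are reconciled into a single vector.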

preprint | 2015 | arXiv

Probabilistic Belief Embedding for Knowledge Base Completion

This paper contributes a novel embedding model which measures the probability of each belief $\langle h,r,t,m\rangle$ in a large-scale knowledge repository by simultaneously learning distributed representations for entities ($h$ and $t$), relations ($r$), and the words in relation mentions ($m$). It facilitates knowledge completion by means of simple vector operations to discover new beliefs. Given an imperfect belief, we can not only infer the missing entities and predict the unknown relations, but also assess the plausibility of the belief, using only the learnt embeddings of the remaining evidence. To demonstrate the scalability and effectiveness of our model, we conduct experiments on several large-scale repositories which contain millions of beliefs from WordNet, Freebase and NELL, and compare it with other cutting-edge approaches on the tasks of entity inference, relation prediction and triplet classification, each with its respective metrics. Extensive experimental results show that the proposed model outperforms state-of-the-art approaches with significant improvements.
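Entity inference with learnt embeddings can be sketched as ranking candidate tails by a vector score. The distance $\|h + r - t\|$ used below is borrowed from translation-based models and is an assumption; the abstract does not give the paper's probabilistic scoring function, and mention words ($m$) are left out. All names and vectors are hypothetical.

```python
import math

def rank_tails(h, r, candidates):
    """Sort candidate tail names by ||h + r - t||, most plausible first."""
    def dist(t):
        return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))
    return sorted(candidates, key=lambda name: dist(candidates[name]))

# Hypothetical 2-dimensional embeddings for a head, a relation and three candidate tails.
h = [1.0, 0.0]
r = [0.0, 1.0]
candidates = {"a": [5.0, 5.0], "b": [1.0, 1.2], "c": [2.0, 2.0]}

ranking = rank_tails(h, r, candidates)
print(ranking)
```

The same ranking idea applies to predicting a missing relation or head by swapping which element is treated as the unknown.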

preprint | 2014 | arXiv

Errata: Distant Supervision for Relation Extraction with Matrix Completion

The essence of distantly supervised relation extraction is that it is an incomplete multi-label classification problem with sparse and noisy features. To tackle the sparsity and noise challenges, we propose solving the classification problem using matrix completion on a factorized matrix of minimized rank. We formulate relation classification as completing the unknown labels of testing items (entity pairs) in a sparse matrix that concatenates training and testing textual features with training labels. Our algorithmic framework is based on the assumption that the rank of the joint item-by-feature and item-by-label matrix is low. We apply two optimization models to recover the underlying low-rank matrix, leveraging the sparsity of the feature-label matrix. The matrix completion problem is then solved by the fixed point continuation (FPC) algorithm, which can find the global optimum. Experiments on two widely used datasets with different dimensions of textual features demonstrate that our low-rank matrix completion approach significantly outperforms the baseline and the state-of-the-art methods.
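For orientation, a generic form of nuclear-norm-regularized recovery and the fixed point iteration used by FPC can be written as follows. This is a sketch under assumptions: the squared-error data term, the linear observation operator $\mathcal{A}$, the step size $\tau$ and the weight $\mu$ are standard textbook choices, and the abstract does not specify the loss functions of the paper's two optimization models.

```latex
\begin{aligned}
\min_{X}\;& \mu \,\|X\|_{*} + \tfrac{1}{2}\,\|\mathcal{A}(X) - b\|_2^2 \\
Y^{k} &= X^{k} - \tau\,\mathcal{A}^{*}\!\big(\mathcal{A}(X^{k}) - b\big) \\
X^{k+1} &= S_{\tau\mu}\big(Y^{k}\big), \qquad
S_{\nu}(Y) = U\,\operatorname{diag}\!\big(\max(\sigma - \nu,\, 0)\big)\,V^{\top},
\quad Y = U\,\operatorname{diag}(\sigma)\,V^{\top}
\end{aligned}
```

Here $\|X\|_{*}$ is the nuclear norm (the sum of singular values), which serves as a convex surrogate for rank, and $S_{\nu}$ shrinks each singular value toward zero. Because the relaxed problem is convex, the iteration can converge to a global optimum.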