Source author record

Guanhua Tian

Guanhua Tian appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computation and Language Information Retrieval Distributed, Parallel, and Cluster Computing Performance

Catalog footprint

What is connected

3works

4topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2017arXiv

Self-Taught Convolutional Neural Networks for Short Text Clustering

Short text clustering is a challenging problem due to its sparseness of text representation. Here we propose a flexible Self-Taught Convolutional neural network framework for Short Text Clustering (dubbed STC^2), which can flexibly and successfully incorporate more useful semantic features and learn non-biased deep text representation in an unsupervised manner. In our framework, the original raw text features are firstly embedded into compact binary codes by using one existing unsupervised dimensionality reduction methods. Then, word embeddings are explored and fed into convolutional neural networks to learn deep feature representations, meanwhile the output units are used to fit the pre-trained binary codes in the training process. Finally, we get the optimal clusters by employing K-means to cluster the learned representations. Extensive experimental results demonstrate that the proposed framework is effective, flexible and outperform several popular clustering methods when tested on three public short text datasets.

preprint2015arXiv

Short Text Hashing Improved by Integrating Multi-Granularity Topics and Tags

Due to computational and storage efficiencies of compact binary codes, hashing has been widely used for large-scale similarity search. Unfortunately, many existing hashing methods based on observed keyword features are not effective for short texts due to the sparseness and shortness. Recently, some researchers try to utilize latent topics of certain granularity to preserve semantic similarity in hash codes beyond keyword matching. However, topics of certain granularity are not adequate to represent the intrinsic semantic information. In this paper, we present a novel unified approach for short text Hashing using Multi-granularity Topics and Tags, dubbed HMTT. In particular, we propose a selection method to choose the optimal multi-granularity topics depending on the type of dataset, and design two distinct hashing strategies to incorporate multi-granularity topics. We also propose a simple and effective method to exploit tags to enhance the similarity of related texts. We carry out extensive experiments on one short text dataset as well as on one normal text dataset. The results demonstrate that our approach is effective and significantly outperforms baselines on several evaluation metrics.

preprint2010arXiv

Decreasing log data of multi-tier services for effective request tracing

Previous work shows request tracing systems help understand and debug the performance problems of multi-tier services. However, for large-scale data centers, more than hundreds of thousands of service instances provide online service at the same time. Previous work such as white-box or black box tracing systems will produce large amount of log data, which would be correlated into large quantities of causal paths for performance debugging. In this paper, we propose an innovative algorithm to eliminate valueless logs of multitiers services. Our experiment shows our method filters 84% valueless causal paths and is promising to be used in large-scale data centers.