Researcher profile

Huajie Shao

Huajie Shao contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 19 - UnverifiedVerification L1Unclaimed author
5works
0followers
4topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

5 published item(s)

preprint2026arXiv

Lens: A Knowledge-Guided Foundation Model for Network Traffic

Network traffic refers to the amount of data being sent and received over the Internet or any system that connects computers. Analyzing network traffic is vital for security and management, yet remains challenging due to the heterogeneity of plain-text packet headers and encrypted payloads. To capture the latent semantics of traffic, recent studies have adopted Transformer-based pretraining techniques to learn network representations from massive traffic data. However, these methods pre-train on data-driven tasks but overlook network knowledge, such as masking partial digits of the indivisible network port numbers for prediction, thereby limiting semantic understanding. In addition, they struggle to extend classification to new classes during fine-tuning due to the distribution shift. Motivated by these limitations, we propose \Lens, a unified knowledge-guided foundation model for both network traffic classification and generation. In pretraining, we propose a Knowledge-Guided Mask Span Prediction method with textual context for learning knowledge-enriched representations. For extending to new classes in finetuning, we reframe the traffic classification as a closed-ended generation task and introduce context-aware finetuning to adapt to the distribution shift. Evaluation results across various benchmark datasets demonstrate that the proposed Lens~achieves superior performance on both classification and generation tasks. For traffic classification, Lens~outperforms competitive baselines substantially on 8 out of 12 tasks with an average accuracy of \textbf{96.33\%} and extends to novel classes with significantly better performance. For traffic generation, Lens~generates better high-fidelity network traffic for network simulation, gaining up to \textbf{30.46\%} and \textbf{33.3\%} better accuracy and F1 in fuzzing tests. We will open-source the code upon publication.

preprint2022arXiv

Unsupervised Belief Representation Learning with Information-Theoretic Variational Graph Auto-Encoders

This paper develops a novel unsupervised algorithm for belief representation learning in polarized networks that (i) uncovers the latent dimensions of the underlying belief space and (ii) jointly embeds users and content items (that they interact with) into that space in a manner that facilitates a number of downstream tasks, such as stance detection, stance prediction, and ideology mapping. Inspired by total correlation in information theory, we propose the Information-Theoretic Variational Graph Auto-Encoder (InfoVGAE) that learns to project both users and content items (e.g., posts that represent user views) into an appropriate disentangled latent space. To better disentangle latent variables in that space, we develop a total correlation regularization module, a Proportional-Integral (PI) control module, and adopt rectified Gaussian distribution to ensure the orthogonality. The latent representation of users and content can then be used to quantify their ideological leaning and detect/predict their stances on issues. We evaluate the performance of the proposed InfoVGAE on three real-world datasets, of which two are collected from Twitter and one from U.S. Congress voting records. The evaluation results show that our model outperforms state-of-the-art unsupervised models by reducing 10.5% user clustering errors and achieving 12.1% higher F1 scores for stance separation of content items. In addition, InfoVGAE produces a comparable result with supervised models. We also discuss its performance on stance prediction and user ranking within ideological groups.

preprint2021arXiv

Controllable and Diverse Text Generation in E-commerce

In E-commerce, a key challenge in text generation is to find a good trade-off between word diversity and accuracy (relevance) in order to make generated text appear more natural and human-like. In order to improve the relevance of generated results, conditional text generators were developed that use input keywords or attributes to produce the corresponding text. Prior work, however, do not finely control the diversity of automatically generated sentences. For example, it does not control the order of keywords to put more relevant ones first. Moreover, it does not explicitly control the balance between diversity and accuracy. To remedy these problems, we propose a fine-grained controllable generative model, called~\textit{Apex}, that uses an algorithm borrowed from automatic control (namely, a variant of the \textit{proportional, integral, and derivative (PID) controller}) to precisely manipulate the diversity/accuracy trade-off of generated text. The algorithm is injected into a Conditional Variational Autoencoder (CVAE), allowing \textit{Apex} to control both (i) the order of keywords in the generated sentences (conditioned on the input keywords and their order), and (ii) the trade-off between diversity and accuracy. Evaluation results on real-world datasets show that the proposed method outperforms existing generative models in terms of diversity and relevance. Apex is currently deployed to generate production descriptions and item recommendation reasons in Taobao owned by Alibaba, the largest E-commerce platform in China. The A/B production test results show that our method improves click-through rate (CTR) by 13.17\% compared to the existing method for production descriptions. For item recommendation reason, it is able to increase CTR by 6.89\% and 1.42\% compared to user reviews and top-K item recommendation without reviews, respectively.

preprint2020arXiv

ControlVAE: Controllable Variational Autoencoder

Variational Autoencoders (VAE) and their variants have been widely used in a variety of applications, such as dialog generation, image generation and disentangled representation learning. However, the existing VAE models have some limitations in different applications. For example, a VAE easily suffers from KL vanishing in language modeling and low reconstruction quality for disentangling. To address these issues, we propose a novel controllable variational autoencoder framework, ControlVAE, that combines a controller, inspired by automatic control theory, with the basic VAE to improve the performance of resulting generative models. Specifically, we design a new non-linear PI controller, a variant of the proportional-integral-derivative (PID) control, to automatically tune the hyperparameter (weight) added in the VAE objective using the output KL-divergence as feedback during model training. The framework is evaluated using three applications; namely, language modeling, disentangled representation learning, and image generation. The results show that ControlVAE can achieve better disentangling and reconstruction quality than the existing methods. For language modelling, it not only averts the KL-vanishing, but also improves the diversity of generated text. Finally, we also demonstrate that ControlVAE improves the reconstruction quality of generated images compared to the original VAE.

preprint2020arXiv

paper2repo: GitHub Repository Recommendation for Academic Papers

GitHub has become a popular social application platform, where a large number of users post their open source projects. In particular, an increasing number of researchers release repositories of source code related to their research papers in order to attract more people to follow their work. Motivated by this trend, we describe a novel item-item cross-platform recommender system, $\textit{paper2repo}$, that recommends relevant repositories on GitHub that match a given paper in an academic search system such as Microsoft Academic. The key challenge is to identify the similarity between an input paper and its related repositories across the two platforms, $\textit{without the benefit of human labeling}$. Towards that end, paper2repo integrates text encoding and constrained graph convolutional networks (GCN) to automatically learn and map the embeddings of papers and repositories into the same space, where proximity offers the basis for recommendation. To make our method more practical in real life systems, labels used for model training are computed automatically from features of user actions on GitHub. In machine learning, such automatic labeling is often called {\em distant supervision\/}. To the authors' knowledge, this is the first distant-supervised cross-platform (paper to repository) matching system. We evaluate the performance of paper2repo on real-world data sets collected from GitHub and Microsoft Academic. Results demonstrate that it outperforms other state of the art recommendation methods.