Source author record

Hai Jin

Hai Jin appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Artificial Intelligence; Machine Learning; Cryptography and Security; Software Engineering; Computer Vision; Distributed, Parallel, and Cluster Computing; Social and Information Networks; cond-mat.str-el; Data Structures and Algorithms; Programming Languages

Catalog footprint

15 works

10 topics

4 close collaborators

preprint 2023 arXiv

Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit

Code intelligence leverages machine learning techniques to extract knowledge from extensive code corpora, with the aim of developing intelligent tools to improve the quality and productivity of computer programming. There is already a thriving research community focusing on code intelligence, with efforts spanning software engineering, machine learning, data mining, natural language processing, and programming languages. In this paper, we conduct a comprehensive literature review on deep learning for code intelligence, from the aspects of code representation learning, deep learning techniques, and application tasks. We also benchmark several state-of-the-art neural models for code intelligence, and provide an open-source toolkit tailored for the rapid prototyping of deep-learning-based code intelligence models. In particular, we inspect the existing code intelligence models on the basis of code representation learning, and provide a comprehensive overview of the present state of code intelligence. Furthermore, we publicly release the source code and data resources to provide the community with a ready-to-use benchmark, which can facilitate the evaluation and comparison of existing and future code intelligence models (https://xcodemind.github.io). Finally, we point out several challenging and promising directions for future research.

preprint 2023 arXiv

Dipolar Spin Liquid Ending with Quantum Critical Point in a Gd-based Triangular Magnet

Through experimental and model studies on a triangular-lattice dipolar magnet KBaGd(BO$_3$)$_2$ (KBGB), we find that this highly frustrated magnet with a planar anisotropy hosts a strongly fluctuating dipolar spin liquid (DSL), which originates from the intriguing interplay between dipolar and Heisenberg interactions. The DSL constitutes an extended regime in the field-temperature phase diagram, which gets lowered in temperature as the field increases and eventually ends with an unconventional quantum critical point (QCP) at $B_c \simeq 0.75$ T. Based on dipolar Heisenberg model calculations, we identify the DSL as a Berezinskii-Kosterlitz-Thouless (BKT) phase with emergent U(1) symmetry. Due to the tremendous entropy accumulation that can be related to the strong BKT and quantum fluctuations, unprecedented magnetic cooling effects are observed in the DSL regime and particularly near the QCP, making KBGB a superior dipolar coolant to commercial Gd-based refrigerants. We establish the phase diagram for triangular-lattice dipolar quantum magnets where emergent symmetry plays an essential role, and provide a basis and open an avenue for their applications in sub-Kelvin refrigeration.

preprint 2022 arXiv

BadHash: Invisible Backdoor Attacks against Deep Hashing with Clean Label

Due to its powerful feature learning capability and high efficiency, deep hashing has achieved great success in large-scale image retrieval. Meanwhile, extensive works have demonstrated that deep neural networks (DNNs) are susceptible to adversarial examples, and exploring adversarial attacks against deep hashing has attracted many research efforts. Nevertheless, the backdoor attack, another well-known threat to DNNs, has not yet been studied for deep hashing. Although various backdoor attacks have been proposed in the field of image classification, existing approaches fail to realize a truly imperceptible backdoor attack that combines invisible triggers with a clean-label setting, and they also cannot meet the intrinsic demands of an image retrieval backdoor. In this paper, we propose BadHash, the first generative-based imperceptible backdoor attack against deep hashing, which can effectively generate invisible and input-specific poisoned images with clean labels. Specifically, we first propose a new conditional generative adversarial network (cGAN) pipeline to effectively generate poisoned samples. For any given benign image, it seeks to generate a natural-looking poisoned counterpart with a unique invisible trigger. In order to improve the attack effectiveness, we introduce a label-based contrastive learning network, LabCLN, to exploit the semantic characteristics of different labels, which are subsequently used for confusing and misleading the target model to learn the embedded trigger. We finally explore the mechanism of backdoor attacks on image retrieval in the hash space. Extensive experiments on multiple benchmark datasets verify that BadHash can generate imperceptible poisoned samples with strong attack ability and transferability over state-of-the-art deep hashing schemes.

preprint 2022 arXiv

Cross-Language Binary-Source Code Matching with Intermediate Representations

Binary-source code matching plays an important role in many security and software engineering related tasks such as malware detection, reverse engineering and vulnerability assessment. Currently, several approaches have been proposed for binary-source code matching by jointly learning the embeddings of binary code and source code in a common vector space. Despite much effort, existing approaches only target matching binary code and source code written in a single programming language. However, in practice, software applications are often written in different programming languages to cater for different requirements and computing platforms. Matching binary and source code across programming languages introduces additional challenges when maintaining multi-language and multi-platform applications. To this end, this paper formulates the problem of cross-language binary-source code matching, and develops a new dataset for this new problem. We present a novel approach, XLIR, a Transformer-based neural network that learns intermediate representations for both binary and source code. To validate the effectiveness of XLIR, comprehensive experiments are conducted on two tasks, cross-language binary-source code matching and cross-language source-source code matching, on top of our curated dataset. Experimental results and analysis show that our proposed XLIR with intermediate representations significantly outperforms other state-of-the-art models in both tasks.

preprint 2022 arXiv

FedHM: Efficient Federated Learning for Heterogeneous Models via Low-rank Factorization

One underlying assumption of recent federated learning (FL) paradigms is that all local models usually share the same network architecture and size, which becomes impractical for devices with different hardware resources. A scalable federated learning framework should address the heterogeneity that clients have different computing capacities and communication capabilities. To this end, this paper proposes FedHM, a novel heterogeneous federated model compression framework, distributing the heterogeneous low-rank models to clients and then aggregating them into a full-rank model. Our solution enables the training of heterogeneous models with varying computational complexities and aggregates them into a single global model. Furthermore, FedHM significantly reduces the communication cost by using low-rank models. Extensive experimental results demonstrate that FedHM is superior in the performance and robustness of models of different sizes, compared with state-of-the-art heterogeneous FL methods under various FL settings. Additionally, the convergence guarantee of FL for heterogeneous devices is first theoretically analyzed.
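The idea of sending low-rank models of different sizes to clients and merging them into one full-rank model can be sketched as follows. This is only an illustration under stated assumptions, not FedHM's actual procedure: the helper names (`matmul`, `factor_params`, `aggregate`) are hypothetical, each client is assumed to hold a factor pair (U, V) for a single weight matrix, and the server is assumed to rebuild each product U·V and take a plain unweighted average.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def factor_params(rows, cols, rank):
    """Parameters needed to store a rank-`rank` factorization U (rows x rank), V (rank x cols)."""
    return rank * (rows + cols)


def aggregate(client_factors):
    """Rebuild each client's full matrix U @ V and average them element-wise.

    Assumption: an unweighted mean; a real system might weight clients differently.
    """
    full = [matmul(u, v) for u, v in client_factors]
    n = len(full)
    rows, cols = len(full[0]), len(full[0][0])
    return [[sum(w[i][j] for w in full) / n for j in range(cols)]
            for i in range(rows)]


# A weak client holds a rank-1 model, a stronger client a rank-2 model of the same 2x2 layer.
clients = [
    ([[1.0], [2.0]], [[1.0, 0.0]]),
    ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]),
]
print(aggregate(clients))
# Communication saving for a 64x32 layer at rank 4: 384 values instead of 2048.
print(factor_params(64, 32, 4), 64 * 32)
```

The parameter count shows why low-rank factors reduce communication: a rank-r pair costs r·(rows + cols) values, which is smaller than rows·cols whenever r is small relative to the layer dimensions.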

preprint 2022 arXiv

Protecting Facial Privacy: Generating Adversarial Identity Masks via Style-robust Makeup Transfer

While deep face recognition (FR) systems have shown amazing performance in identification and verification, they also arouse privacy concerns for their excessive surveillance of users, especially for public face images widely spread on social networks. Recently, some studies have adopted adversarial examples to protect photos from being identified by unauthorized face recognition systems. However, existing methods of generating adversarial face images suffer from many limitations, such as awkward visuals, a white-box setting, and weak transferability, making them difficult to apply to face privacy protection in practice. In this paper, we propose adversarial makeup transfer GAN (AMT-GAN), a novel face protection method aiming at constructing adversarial face images that achieve stronger black-box transferability and better visual quality simultaneously. AMT-GAN leverages generative adversarial networks (GAN) to synthesize adversarial face images with makeup transferred from reference images. In particular, we introduce a new regularization module along with a joint training strategy to reconcile the conflicts between the adversarial noises and the cycle consistency loss in makeup transfer, achieving a desirable balance between the attack strength and visual changes. Extensive experiments verify that, compared with the state of the art, AMT-GAN can not only preserve a comfortable visual quality, but also achieve a higher attack success rate over commercial FR APIs, including Face++, Aliyun, and Microsoft.

preprint 2022 arXiv

Significant Engagement Community Search on Temporal Networks: Concepts and Algorithms

Community search, which retrieves the cohesive subgraph containing a query vertex, has been widely studied over the past decades. The existing studies on community search mainly focus on static networks. However, real-world networks are usually temporal networks where each edge is associated with timestamps. The previous methods do not work when handling temporal networks. We study the problem of identifying the significant engagement community to which the user-specified query belongs. Specifically, given an integer k and a query vertex u, we search for the subgraph H which satisfies (i) u $\in$ H; (ii) the de-temporal graph of H is a connected k-core; (iii) u has the maximum engagement level in H. To address our problem, we first develop a top-down greedy peeling algorithm named TDGP, which iteratively removes the vertices with the maximum temporal degree. To boost the efficiency, we then design a bottom-up local search algorithm named BULS and its enhanced versions BULS+ and BULS*. Lastly, we empirically show the superiority of our proposed solutions on six real-world temporal graphs.
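Conditions (i) and (ii) above can be illustrated with a small sketch. It is not TDGP or BULS, and it does not model the engagement level in condition (iii): it only drops timestamps to form the de-temporal graph, peels vertices of degree below k, and returns the connected component that contains the query vertex. The function name `connected_k_core` and the `(u, v, timestamp)` edge format are assumptions made for illustration.

```python
from collections import defaultdict, deque


def connected_k_core(temporal_edges, k, query):
    """Return the vertex set of the connected k-core containing `query`.

    `temporal_edges` is an iterable of (u, v, timestamp) triples (assumed format).
    Returns an empty set if `query` is not in any k-core.
    """
    # De-temporal graph: ignore timestamps, merge repeated edges, skip self-loops.
    adj = defaultdict(set)
    for u, v, _timestamp in temporal_edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    # Standard peeling: repeatedly remove vertices whose degree is below k.
    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    queue = deque(v for v, d in degree.items() if d < k)
    removed = set(queue)
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in removed:
                degree[w] -= 1
                if degree[w] < k:
                    removed.add(w)
                    queue.append(w)

    if query not in adj or query in removed:
        return set()

    # Breadth-first search over surviving vertices to get the component of `query`.
    component = {query}
    frontier = deque([query])
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            if w not in removed and w not in component:
                component.add(w)
                frontier.append(w)
    return component


edges = [
    ("a", "b", 1), ("b", "c", 2), ("a", "c", 3), ("a", "b", 5),  # triangle, one repeated edge
    ("c", "d", 4),                                               # pendant vertex d
    ("x", "y", 1), ("y", "z", 2), ("x", "z", 3),                 # separate triangle
]
print(sorted(connected_k_core(edges, 2, "a")))  # ['a', 'b', 'c']
print(connected_k_core(edges, 2, "d"))          # set()
```

The separate triangle x, y, z survives peeling but is excluded because it is not connected to the query vertex.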

preprint 2022 arXiv

What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code

Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These models leverage masked pre-training and the Transformer architecture and have achieved promising results. However, there is still little progress regarding the interpretability of existing pre-trained code models. It is not clear why these models work and what feature correlations they can capture. In this paper, we conduct a thorough structural analysis aiming to provide an interpretation of pre-trained language models for source code (e.g., CodeBERT and GraphCodeBERT) from three distinctive perspectives: (1) attention analysis, (2) probing on the word embedding, and (3) syntax tree induction. Through comprehensive analysis, this paper reveals several insightful findings that may inspire future studies: (1) Attention aligns strongly with the syntax structure of code. (2) Pre-trained language models of code can preserve the syntax structure of code in the intermediate representations of each Transformer layer. (3) The pre-trained models of code have the ability to induce syntax trees of code. These findings suggest that it may be helpful to incorporate the syntax structure of code into the process of pre-training for better code representations.

preprint 2021 arXiv

Multi-Task Representation Learning with Multi-View Graph Convolutional Networks

Link prediction and node classification are two important downstream tasks of network representation learning. Existing methods have achieved acceptable results, but they perform these two tasks separately, which duplicates a lot of work and ignores the correlations between tasks. Besides, conventional models treat the information of multiple views identically, so they fail to learn robust representations for downstream tasks. To this end, we tackle link prediction and node classification problems simultaneously via multi-task multi-view learning in this paper. We first explain the feasibility and advantages of multi-task multi-view learning for these two tasks. Then we propose a novel model named MT-MVGCN to perform link prediction and node classification tasks simultaneously. More specifically, we design a multi-view graph convolutional network to extract abundant information of multiple views in a network, which is shared by different tasks. We further apply two attention mechanisms, a view attention mechanism and a task attention mechanism, to make views and tasks adjust the view fusion process. Moreover, view reconstruction can be introduced as an auxiliary task to boost the performance of the proposed model. Experiments on real-world network datasets demonstrate that our model is both efficient and effective, and outperforms advanced baselines in these two tasks.

preprint 2021 arXiv

SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities

The detection of software vulnerabilities (or vulnerabilities for short) is an important problem that has yet to be tackled, as manifested by the many vulnerabilities reported on a daily basis. This calls for machine learning methods for vulnerability detection. Deep learning is attractive for this purpose because it alleviates the requirement to manually define features. Despite the tremendous success of deep learning in other application domains, its applicability to vulnerability detection is not systematically understood. In order to fill this void, we propose the first systematic framework for using deep learning to detect vulnerabilities in C/C++ programs with source code. The framework, dubbed Syntax-based, Semantics-based, and Vector Representations (SySeVR), focuses on obtaining program representations that can accommodate syntax and semantic information pertinent to vulnerabilities. Our experiments with 4 software products demonstrate the usefulness of the framework: we detect 15 vulnerabilities that are not reported in the National Vulnerability Database. Among these 15 vulnerabilities, 7 are unknown and have been reported to the vendors, and the other 8 have been "silently" patched by the vendors when releasing newer versions of the pertinent software products.

preprint 2020 arXiv

$μ$VulDeePecker: A Deep Learning-Based System for Multiclass Vulnerability Detection

Fine-grained software vulnerability detection is an important and challenging problem. Ideally, a detection system (or detector) not only should be able to detect whether or not a program contains vulnerabilities, but also should be able to pinpoint the type of a vulnerability in question. Existing vulnerability detection methods based on deep learning can detect the presence of vulnerabilities (i.e., addressing the binary classification or detection problem), but cannot pinpoint types of vulnerabilities (i.e., incapable of addressing multiclass classification). In this paper, we propose the first deep learning-based system for multiclass vulnerability detection, dubbed $μ$VulDeePecker. The key insight underlying $μ$VulDeePecker is the concept of code attention, which can capture information that can help pinpoint types of vulnerabilities, even when the samples are small. For this purpose, we create a dataset from scratch and use it to evaluate the effectiveness of $μ$VulDeePecker. Experimental results show that $μ$VulDeePecker is effective for multiclass vulnerability detection and that accommodating control-dependence (other than data-dependence) can lead to higher detection capabilities.

preprint 2016 arXiv

Differentially Private Online Learning for Cloud-Based Video Recommendation with Multimedia Big Data in Social Networks

With the rapid growth in multimedia services and the enormous supply of video content in online social networks, users have difficulty finding content that matches their interests. Therefore, various personalized recommendation systems have been proposed. However, they ignore that the accelerated proliferation of social media data has led to the big data era, which has greatly impeded the process of video recommendation. In addition, none of them has considered both the privacy of users' contexts (e.g., social status, ages and hobbies) and video service vendors' repositories, which are extremely sensitive and of significant commercial value. To handle these problems, we propose a cloud-assisted differentially private video recommendation system based on distributed online learning. In our framework, service vendors are modeled as distributed cooperative learners, recommending videos according to the user's context, while simultaneously adapting the video-selection strategy based on user-click feedback to maximize total user clicks (reward). Considering the sparsity and heterogeneity of big social media data, we also propose a novel geometric differentially private model, which can greatly reduce the performance (recommendation accuracy) loss. Our simulation shows that the proposed algorithms outperform other existing methods and keep a delicate balance between computing accuracy and privacy-preserving level.

preprint 2016 arXiv

Lifetime-Based Memory Management for Distributed Data Processing Systems

In-memory caching of intermediate data and eager combining of data in shuffle buffers have been shown to be very effective in minimizing the re-computation and I/O cost in distributed data processing systems like Spark and Flink. However, it has also been widely reported that these techniques create a large number of long-living data objects in the heap, which may quickly saturate the garbage collector, especially when handling a large dataset, and hence limit the scalability of the system. To eliminate this problem, we propose a lifetime-based memory management framework, which, by automatically analyzing the user-defined functions and data types, obtains the expected lifetime of the data objects, and then allocates and releases memory space accordingly to minimize the garbage collection overhead. In particular, we present Deca, a concrete implementation of our proposal on top of Spark, which transparently decomposes and groups objects with similar lifetimes into byte arrays and releases their space altogether when their lifetimes come to an end. An extensive experimental study using both synthetic and real datasets shows that, compared to Spark, Deca is able to 1) reduce the garbage collection time by up to 99.9%, 2) achieve up to 22.7x speedup in terms of execution time in cases without data spilling and 41.6x speedup in cases with data spilling, and 3) consume up to 46.6% less memory.

preprint 2016 arXiv

Parallel Algorithms for Core Maintenance in Dynamic Graphs

This paper initiates the study of parallel algorithms for core maintenance in dynamic graphs. The core number is a fundamental index reflecting the cohesiveness of a graph, which is widely used in large-scale graph analytics. The core maintenance problem requires updating the core numbers of vertices after a set of edges and vertices are inserted into or deleted from the graph. We investigate the parallelism in the core update process when multiple edges and vertices are inserted or deleted. Specifically, we discover a structure called superior edge set, the insertion or deletion of edges in which can be processed in parallel. Based on the structure of superior edge set, efficient parallel algorithms are then devised for incremental and decremental core maintenance respectively. To the best of our knowledge, the proposed algorithms are the first parallel ones for the fundamental core maintenance problem. The algorithms show a significant speedup in the processing time compared with previous results that sequentially handle edge and vertex insertions/deletions. Finally, extensive experiments are conducted on different types of real-world and synthetic datasets, and the results illustrate the efficiency, stability and scalability of the proposed algorithms.
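To make the core maintenance problem concrete, the sketch below computes core numbers from scratch before and after an edge insertion. This is the naive recomputation baseline that maintenance algorithms aim to avoid, not the parallel superior-edge-set method of the paper. The function name `core_numbers` and the edge-list input format are assumptions, and the minimum-degree scan is deliberately simple rather than efficient.

```python
def core_numbers(edges):
    """Compute the core number of every vertex of an undirected simple graph.

    Uses minimum-degree peeling: repeatedly remove a vertex of smallest
    remaining degree; its core number is the largest such degree seen so far.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    degree = {v: len(neighbors) for v, neighbors in adj.items()}
    remaining = set(adj)
    core = {}
    current = 0
    while remaining:
        v = min(remaining, key=lambda x: degree[x])  # linear scan, fine for a sketch
        current = max(current, degree[v])
        core[v] = current
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


# Triangle a-b-c with a pendant vertex d attached to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(core_numbers(edges))

# Inserting edge (d, a) raises d's core number; the other vertices are unchanged.
print(core_numbers(edges + [("d", "a")]))
```

In this example only d changes (from 1 to 2) after the insertion, which hints at why maintenance algorithms can restrict their work to a small affected region instead of recomputing every vertex.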

preprint 2015 arXiv

On Performance Debugging of Unnecessary Lock Contentions on Multicore Processors: A Replay-based Approach

Locks have been widely used as an effective synchronization mechanism among processes and threads. However, we observe that a large number of false inter-thread dependencies (i.e., unnecessary lock contentions) exist during the program execution on multicore processors, thereby incurring significant performance overhead. This paper presents a performance debugging framework, PERFPLAY, to facilitate a comprehensive and in-depth understanding of the performance impact of unnecessary lock contentions. The core technique of our debugging framework is trace replay. Specifically, PERFPLAY records the program execution trace, on the basis of which the unnecessary lock contentions can be identified through trace analysis. We then propose a novel technique of trace transformation to transform these identified unnecessary lock contentions in the original trace into the correct pattern as a new trace free of unnecessary lock contentions. Through replaying both traces, PERFPLAY can quantify the performance impact of unnecessary lock contentions. To demonstrate the effectiveness of our debugging framework, we study five real-world programs and PARSEC benchmarks. Our experimental results demonstrate the significant performance overhead of unnecessary lock contentions, and the effectiveness of PERFPLAY in identifying the performance critical unnecessary lock contentions in real applications.
