Source author record

Zhigang Lu

Zhigang Lu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Cryptography and Security Machine Learning Information Retrieval

Catalog footprint

What is connected

9works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Optimized Algorithms for Text Clustering with LLM-Generated Constraints

Clustering is a fundamental tool that has garnered significant interest across a wide range of applications including text analysis. To improve clustering accuracy, many researchers have incorporated background knowledge, typically in the form of must-link and cannot-link constraints, to guide the clustering process. With the recent advent of large language models (LLMs), there is growing interest in improving clustering quality through LLM-based automatic constraint generation. In this paper, we propose a novel constraint-generation approach that reduces resource consumption by generating constraint sets rather than using traditional pairwise constraints. This approach improves both query efficiency and constraint accuracy compared to state-of-the-art methods. We further introduce a constrained clustering algorithm tailored to the characteristics of LLM-generated constraints. Our method incorporates a confidence threshold and a penalty mechanism to address potentially inaccurate constraints. We evaluate our approach on five text datasets, considering both the cost of constraint generation and the overall clustering performance. The results show that our method achieves clustering accuracy comparable to the state-of-the-art algorithms while reducing the number of LLM queries by more than 20 times.

preprint2026arXiv

Sentient: Detecting APTs Via Capturing Indirect Dependencies and Behavioral Logic

Advanced Persistent Threats (APTs) are difficult to detect due to their complexity and stealthiness. To mitigate such attacks, many approaches model entities and their relationship using provenance graphs to detect the stealthy and persistent characteristics of APTs. However, existing detection methods suffer from the flaws of missing indirect dependencies, noisy complex scenarios, and missing behavioral logical associations, which make it difficult to detect complex scenarios and effectively identify stealthy threats. In this paper, we propose Sentient, an APT detection method that combines pre-training and intent analysis. It employs a graph transformer to learn structural and semantic information from provenance graphs to avoid missing indirect dependencies. We mitigate scenario noise by combining global and local information. Additionally, we design an Intent Analysis Module (IAM) to associate logical relationships between behaviors. Sentient is trained solely on easily obtainable benign data to detect malicious behaviors that deviate from benign behavioral patterns. We evaluated Sentient on three widely-used datasets covering real-world attacks and simulated attacks. Notably, compared to six state-of-the-art methods, Sentient achieved an average reduction of 44% in false positive rate(FPR) for detection.

preprint2022arXiv

A Differentially Private Framework for Deep Learning with Convexified Loss Functions

Differential privacy (DP) has been applied in deep learning for preserving privacy of the underlying training sets. Existing DP practice falls into three categories - objective perturbation, gradient perturbation and output perturbation. They suffer from three main problems. First, conditions on objective functions limit objective perturbation in general deep learning tasks. Second, gradient perturbation does not achieve a satisfactory privacy-utility trade-off due to over-injected noise in each epoch. Third, high utility of the output perturbation method is not guaranteed because of the loose upper bound on the global sensitivity of the trained model parameters as the noise scale parameter. To address these problems, we analyse a tighter upper bound on the global sensitivity of the model parameters. Under a black-box setting, based on this global sensitivity, to control the overall noise injection, we propose a novel output perturbation framework by injecting DP noise into a randomly sampled neuron (via the exponential mechanism) at the output layer of a baseline non-private neural network trained with a convexified loss function. We empirically compare the privacy-utility trade-off, measured by accuracy loss to baseline non-private models and the privacy leakage against black-box membership inference (MI) attacks, between our framework and the open-source differentially private stochastic gradient descent (DP-SGD) approaches on six commonly used real-world datasets. The experimental evaluations show that, when the baseline models have observable privacy leakage under MI attacks, our framework achieves a better privacy-utility trade-off than existing DP-SGD implementations, given an overall privacy budget $ε\leq 1$ for a large number of queries.

preprint2020arXiv

Data and Model Dependencies of Membership Inference Attack

Machine learning (ML) models have been shown to be vulnerable to Membership Inference Attacks (MIA), which infer the membership of a given data point in the target dataset by observing the prediction output of the ML model. While the key factors for the success of MIA have not yet been fully understood, existing defense mechanisms such as using L2 regularization \cite{10shokri2017membership} and dropout layers \cite{salem2018ml} take only the model's overfitting property into consideration. In this paper, we provide an empirical analysis of the impact of both the data and ML model properties on the vulnerability of ML techniques to MIA. Our results reveal the relationship between MIA accuracy and properties of the dataset and training model in use. In particular, we show that the size of shadow dataset, the class and feature balance and the entropy of the target dataset, the configurations and fairness of the training model are the most influential factors. Based on those experimental findings, we conclude that along with model overfitting, multiple properties jointly contribute to MIA success instead of any single property. Building on our experimental findings, we propose using those data and model properties as regularizers to protect ML models against MIA. Our results show that the proposed defense mechanisms can reduce the MIA accuracy by up to 25\% without sacrificing the ML model prediction utility.

preprint2020arXiv

Differentially Private k-Means Clustering with Guaranteed Convergence

Iterative clustering algorithms help us to learn the insights behind the data. Unfortunately, this may allow adversaries to infer the privacy of individuals with some background knowledge. In the worst case, the adversaries know the centroids of an arbitrary iteration and the information of n-1 out of n items. To protect individual privacy against such an inference attack, preserving differential privacy (DP) for the iterative clustering algorithms has been extensively studied in the interactive settings. However, existing interactive differentially private clustering algorithms suffer from a non-convergence problem, i.e., these algorithms may not terminate without a predefined number of iterations. This problem severely impacts the clustering quality and the efficiency of a differentially private algorithm. To resolve this problem, in this paper, we propose a novel differentially private clustering framework in the interactive settings which controls the orientation of the movement of the centroids over the iterations to ensure the convergence by injecting DP noise in a selected area. We prove that, in the expected case, algorithm under our framework converges in at most twice the iterations of Lloyd's algorithm. We perform experimental evaluations on real-world datasets to show that our algorithm outperforms the state-of-the-art of the interactive differentially private clustering algorithms with guaranteed convergence and better clustering quality to meet the same DP requirement.

preprint2020arXiv

Protect Edge Privacy in Path Publishing with Differential Privacy

Paths in a given network are a generalised form of time-serial chains in many real-world applications, such as trajectories and Internet flows. Differentially private trajectory publishing concerns publishing path information that is usable to the genuine users yet secure against adversaries to reconstruct the path with maximum background knowledge. The exiting studies all assume this knowledge to be all but one vertex on the path. To prevent the adversaries recovering the missing information, they publish a perturbed path where each vertex is sampled from a pre-defined set with differential privacy (DP) to replace the corresponding vertex in the original path. In this paper, we relax this assumption to be all but one edge on the path, and hence consider the scenario of more powerful adversaries with the maximum background knowledge of the entire network topology and the path (including all the vertices) except one (arbitrary) missing edge. Under such an assumption, the perturbed path produced by the existing work is vulnerable, because the adversary can reconstruct the missing edge from the existence of an edge in the perturbed path. To address this vulnerability and effectively protect edge privacy, instead of publishing a perturbed path, we propose a novel scheme of graph-based path publishing to protect the original path by embedding the path in a graph that contains fake edges and replicated vertices applying the differential privacy technique, such that only the legitimate users who have the full knowledge of the network topology are able to recover the exact vertices and edges of the original path with high probability. We theoretically analyse the performance of our algorithm in differential privacy, utility, and execution efficiency. We also conduct extensive experimental evaluations on a high-performance cluster system to validate our analytical results.

preprint2015arXiv

A Faster Algorithm to Build New Users Similarity List in Neighbourhood-based Collaborative Filtering

Neighbourhood-based Collaborative Filtering (CF) has been applied in the industry for several decades, because of the easy implementation and high recommendation accuracy. As the core of neighbourhood-based CF, the task of dynamically maintaining users' similarity list is challenged by cold-start problem and scalability problem. Recently, several methods are presented on solving the two problems. However, these methods applied an $O(n^2)$ algorithm to compute the similarity list in a special case, where the new users, with enough recommendation data, have the same rating list. To address the problem of large computational cost caused by the special case, we design a faster ($O(\frac{1}{125}n^2)$) algorithm, TwinSearch Algorithm, to avoid computing and sorting the similarity list for the new users repeatedly to save the computational resources. Both theoretical and experimental results show that the TwinSearch Algorithm achieves better running time than the traditional method.

preprint2015arXiv

A Security-assured Accuracy-maximised Privacy Preserving Collaborative Filtering Recommendation Algorithm

The neighbourhood-based Collaborative Filtering is a widely used method in recommender systems. However, the risks of revealing customers' privacy during the process of filtering have attracted noticeable public concern recently. Specifically, $k$NN attack discloses the target user's sensitive information by creating $k$ fake nearest neighbours by non-sensitive information. Among the current solutions against $k$NN attack, the probabilistic methods showed a powerful privacy preserving effect. However, the existing probabilistic methods neither guarantee enough prediction accuracy due to the global randomness, nor provide assured security enforcement against $k$NN attack. To overcome the problems of current probabilistic methods, we propose a novel approach, Partitioned Probabilistic Neighbour Selection, to ensure a required security guarantee while achieving the optimal prediction accuracy against $k$NN attack. In this paper, we define the sum of $k$ neighbours' similarity as the accuracy metric $α$, the number of user partitions, across which we select the $k$ neighbours, as the security metric $β$. Differing from the present methods that globally selected neighbours, our method selects neighbours from each group with exponential differential privacy to decrease the magnitude of noise. Theoretical and experimental analysis show that to achieve the same security guarantee against $k$NN attack, our approach ensures the optimal prediction accuracy.

preprint2015arXiv

An Accuracy-Assured Privacy-Preserving Recommender System for Internet Commerce

Recommender systems, tool for predicting users' potential preferences by computing history data and users' interests, show an increasing importance in various Internet applications such as online shopping. As a well-known recommendation method, neighbourhood-based collaborative filtering has attracted considerable attention recently. The risk of revealing users' private information during the process of filtering has attracted noticeable research interests. Among the current solutions, the probabilistic techniques have shown a powerful privacy preserving effect. When facing $k$ Nearest Neighbour attack, all the existing methods provide no data utility guarantee, for the introduction of global randomness. In this paper, to overcome the problem of recommendation accuracy loss, we propose a novel approach, Partitioned Probabilistic Neighbour Selection, to ensure a required prediction accuracy while maintaining high security against $k$NN attack. We define the sum of $k$ neighbours' similarity as the accuracy metric alpha, the number of user partitions, across which we select the $k$ neighbours, as the security metric beta. We generalise the $k$ Nearest Neighbour attack to beta k Nearest Neighbours attack. Differing from the existing approach that selects neighbours across the entire candidate list randomly, our method selects neighbours from each exclusive partition of size $k$ with a decreasing probability. Theoretical and experimental analysis show that to provide an accuracy-assured recommendation, our Partitioned Probabilistic Neighbour Selection method yields a better trade-off between the recommendation accuracy and system security.

Zhigang Lu

What is connected

Connect this record

See the researcher in context

Building this map preview

9 published item(s)

Optimized Algorithms for Text Clustering with LLM-Generated Constraints

Sentient: Detecting APTs Via Capturing Indirect Dependencies and Behavioral Logic

A Differentially Private Framework for Deep Learning with Convexified Loss Functions

Data and Model Dependencies of Membership Inference Attack

Differentially Private k-Means Clustering with Guaranteed Convergence

Protect Edge Privacy in Path Publishing with Differential Privacy

A Faster Algorithm to Build New Users Similarity List in Neighbourhood-based Collaborative Filtering

A Security-assured Accuracy-maximised Privacy Preserving Collaborative Filtering Recommendation Algorithm

An Accuracy-Assured Privacy-Preserving Recommender System for Internet Commerce