Source author record

Guoai Xu

Guoai Xu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

cond-mat.stat-mech Cryptography and Security math-ph math.MP Software Engineering Computation and Language Computational Complexity Machine Learning

Catalog footprint

What is connected

8 works

8 topics

4 close collaborators

preprint · 2024 · arXiv

Digger: Detecting Copyright Content Mis-usage in Large Language Model Training

Pre-training, which utilizes extensive and varied datasets, is a critical factor in the success of Large Language Models (LLMs) across numerous applications. However, the detailed makeup of these datasets is often not disclosed, leading to concerns about data security and potential misuse. This is particularly relevant when copyrighted material, still under legal protection, is used inappropriately, either intentionally or unintentionally, infringing on the rights of the authors. In this paper, we introduce a detailed framework designed to detect and assess the presence of content from potentially copyrighted books within the training datasets of LLMs. This framework also provides a confidence estimation for the likelihood of each content sample's inclusion. To validate our approach, we conduct a series of simulated experiments, the results of which affirm the framework's effectiveness in identifying and addressing instances of content misuse in LLM training processes. Furthermore, we investigate the presence of recognizable quotes from famous literary works within these datasets. The outcomes of our study have significant implications for ensuring the ethical use of copyrighted materials in the development of LLMs, highlighting the need for more transparent and responsible data management practices in this field.

preprint · 2022 · arXiv

Dataset Bias in Android Malware Detection

Researchers have proposed many kinds of malware detection methods to address the explosive growth of mobile security threats. We argue that the reported experimental results are inflated by research bias introduced by the variability of malware datasets. We explore the impact of bias in Android malware detection in three aspects: the method used to flag the ground truth, the distribution of malware families in the dataset, and the way the dataset is used. We run a set of experiments with different VT thresholds and find that the method used to flag malware samples directly affects detection performance. We further compare in detail the impact of malware family types and composition on malware detection. The relative superiority of each approach differs under various combinations of malware families. Through extensive experiments, we show that the way the dataset is used can have a misleading impact on evaluation, and that the performance difference can exceed 40%. We argue that the research biases observed in this paper should be carefully controlled or eliminated to ensure a fair comparison of malware detection techniques. Providing reasonable and explainable results is better than only reporting high detection accuracy with vague dataset and experimental settings.
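The effect of the ground-truth threshold can be illustrated with a minimal sketch. It assumes that "VT threshold" means the minimum number of antivirus engines that must flag a sample before it is labelled malware; the sample names, detection counts, and the `label_samples` helper are hypothetical and are not taken from the paper.

```python
def label_samples(detections, threshold):
    """Label a sample as malware (True) if at least `threshold` engines flagged it."""
    return {name: count >= threshold for name, count in detections.items()}


# Hypothetical data: sample name -> number of engines that flagged the sample.
detections = {"app_a": 0, "app_b": 1, "app_c": 4, "app_d": 12, "app_e": 27}

for threshold in (1, 5, 10):
    labels = label_samples(detections, threshold)
    positives = sorted(name for name, is_malware in labels.items() if is_malware)
    print(f"threshold={threshold}: {len(positives)} labelled malware -> {positives}")
```

The same five samples yield different positive sets as the threshold moves, so any detector evaluated against these labels is scored against a different ground truth each time.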

preprint · 2020 · arXiv

Characterizing Cryptocurrency Exchange Scams

As the indispensable trading platforms of the ecosystem, hundreds of cryptocurrency exchanges are emerging to facilitate the trading of digital assets. However, they also attract the attention of attackers. A number of scam attacks targeting cryptocurrency exchanges have been reported, leading to a huge amount of financial loss. However, no previous work in our research community has systematically studied this problem. In this paper, we make the first effort to identify and characterize cryptocurrency exchange scams. We first identify over 1,500 scam domains and over 300 fake apps by collecting existing reports and using typosquatting generation techniques. Then we investigate the relationships between them and identify 94 scam domain families and 30 fake app families. We further characterize the impact of such scams and reveal that they have incurred a financial loss of at least 520k US dollars. We also observe that fake apps have been sneaked into major app markets (including Google Play) to infect unsuspecting users. Our findings demonstrate the urgency of identifying and preventing cryptocurrency exchange scams. To facilitate future research, we have publicly released all the identified scam domains and fake apps to the community.
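Typosquatting generation can be sketched as follows. This is a minimal illustration covering only three common typo models (character omission, character repetition, and adjacent-character swap); the abstract does not say which models the authors used, and the `typo_candidates` helper and the `example.com` input are hypothetical.

```python
def typo_candidates(domain):
    """Return sorted typo variants of the label before the first dot."""
    name, dot, tld = domain.partition(".")
    candidates = set()
    for i in range(len(name)):
        # Omission: drop the character at position i.
        candidates.add(name[:i] + name[i + 1:])
        # Repetition: type the character at position i twice.
        candidates.add(name[:i] + name[i] * 2 + name[i + 1:])
        # Swap: exchange the characters at positions i and i + 1.
        if i + 1 < len(name):
            candidates.add(name[:i] + name[i + 1] + name[i] + name[i + 2:])
    candidates.discard(name)  # swapping two equal letters reproduces the original
    candidates.discard("")    # a one-letter name minus its letter is not a domain
    return sorted(c + dot + tld for c in candidates)


variants = typo_candidates("example.com")
print(len(variants), variants[:5])
```

In a real measurement each candidate would still have to be checked for registration and inspected manually or automatically; the generator only proposes names worth looking at.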

preprint · 2013 · arXiv

Analysis of diffusion and trapping efficiency for random walks on non-fractal scale-free trees

We study discrete random walks on non-fractal scale-free trees (NFSFT) and provide new methods to calculate analytic solutions of the mean first-passage time (MFPT) for any pair of nodes, the mean trapping time (MTT) for any target node, and the mean diffusing time (MDT) for any source node. Furthermore, using the MTT and the MDT as measures of trapping efficiency and diffusion efficiency respectively, we compare the trapping efficiency and diffusion efficiency of any two nodes of the NFSFT and find the best (or worst) trapping sites and the best (or worst) diffusion sites. Our results show that the two hubs of the NFSFT are the best trapping sites but also the worst diffusion sites, while the nodes farthest from the two hubs are the worst trapping sites but also the best diffusion sites. Comparing the maximum and minimum of the MTT and the MDT, we find that the ratio between the maximum and minimum of the MTT grows logarithmically with network order, whereas the ratio between the maximum and minimum of the MDT is almost equal to $1$. These results imply that the trap's position has a great effect on trapping efficiency, but the position of the source node has almost no effect on diffusion efficiency. We also conducted numerical simulations to test the results we derived; the two are consistent.

preprint · 2013 · arXiv

Effects of node position on diffusion and trapping efficiency for random walks on fractal scale-free trees

We study unbiased discrete random walks on fractal scale-free trees (FSFT) based on their self-similar structure and the relations between random walks and electrical networks. First, we provide new methods to derive analytic solutions of the mean first-passage time (MFPT) for any pair of nodes, the mean trapping time (MTT) for any target node, and the mean diffusing time (MDT) for any starting node. Then, using the MTT and the MDT as measures of trapping efficiency and diffusion efficiency respectively, we analyze the effect of the trap's position on trapping efficiency and the effect of the starting position on diffusion efficiency. Comparing the trapping efficiency and diffusion efficiency among all nodes of the FSFT, we find the best (or worst) trapping sites and the best (or worst) diffusing sites. Our results show that the node at the center of the FSFT is the best trapping site but also the worst diffusing site. The nodes farthest from the two hubs are the worst trapping sites but also the best diffusion sites. Comparing the maximum and minimum of the MTT and the MDT, we find that the maximum of the MTT is almost $\frac{20m^2+32m+12}{4m^2+4m+1}$ times the minimum of the MTT, but the maximum of the MDT is almost equal to the minimum of the MDT. These results show that the position of the target node has a big effect on trapping efficiency, but the position of the starting node has almost no effect on diffusion efficiency. We also conducted numerical simulations to test the results we derived; the two are consistent.
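One standard way the link between random walks and electrical networks is used is the commute-time identity. It is stated here as general background rather than as the authors' derivation, and the symbols $F_{i \to j}$ (MFPT from $i$ to $j$), $R_{ij}$ (effective resistance with a unit resistor on every edge), $E$ (number of edges), $N$ (number of nodes) and $d_{ij}$ (shortest-path distance) are notation introduced for this note. For an unbiased walk on a connected graph,

$$
F_{i \to j} + F_{j \to i} = 2E\,R_{ij}.
$$

On a tree, $E = N - 1$ and the unique path between two nodes gives $R_{ij} = d_{ij}$, so

$$
F_{i \to j} + F_{j \to i} = 2(N-1)\,d_{ij}.
$$

For example, on a star with one hub and three leaves ($N = 4$), the MFPT from a leaf to the hub is $1$ and from the hub to a given leaf is $5$, and $1 + 5 = 2 \cdot 3 \cdot 1$.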

preprint · 2013 · arXiv

Efficiency analysis of diffusion on T-fractals in the sense of random walks

Efficiently controlling the diffusion process is crucial in the study of diffusion problems in complex systems. In the sense of random walks with a single trap, the mean trapping time (MTT) and the mean diffusing time (MDT) are good measures of trapping efficiency and diffusion efficiency respectively. Both vary with the location of the node. In this paper, we study random walks on the T-fractal and provide general methods to calculate the MTT for any target node and the MDT for any source node. Using the MTT and the MDT as measures of trapping efficiency and diffusion efficiency respectively, we compare the trapping efficiency and diffusion efficiency among all nodes of the T-fractal and find the best (or worst) trapping sites and the best (or worst) diffusing sites. Our results show that the hub node of the T-fractal is the best trapping site but also the worst diffusing site, while the three boundary nodes are the worst trapping sites but also the best diffusing sites. Comparing the minimum and maximum of the MTT and the MDT, we find that the maximum of the MTT is almost $6$ times its minimum, while the maximum of the MDT is almost equal to its minimum. These results show that the location of the target node has a big effect on trapping efficiency, but the location of the source node has almost no effect on diffusion efficiency. We also conducted numerical simulations to test the results we derived; the two are consistent.
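The two measures can be computed exactly on a small graph by solving the first-passage equations directly. The sketch below is a brute-force illustration, not the paper's analytic method. It assumes that the MTT of a target is the MFPT averaged over all other source nodes and that the MDT of a source is the MFPT averaged over all other target nodes; the 4-node star (one hub, three boundary nodes) and the `mfpt_to_target` helper are chosen only for illustration.

```python
from fractions import Fraction


def mfpt_to_target(adj, target):
    """Exact MFPT from every node to `target` for an unbiased walk.

    Solves T[i] = 1 + (1/deg(i)) * sum of T[k] over neighbours k, with
    T[target] = 0, by Gauss-Jordan elimination over fractions.
    """
    nodes = [n for n in adj if n != target]
    index = {n: r for r, n in enumerate(nodes)}
    size = len(nodes)
    rows = []
    for n in nodes:
        # Row encodes deg(n)*T[n] - sum of T[k] (k != target) = deg(n).
        row = [Fraction(0)] * (size + 1)
        row[index[n]] = Fraction(len(adj[n]))
        for k in adj[n]:
            if k != target:
                row[index[k]] -= 1
        row[size] = Fraction(len(adj[n]))
        rows.append(row)
    for col in range(size):
        pivot = next(r for r in range(col, size) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [value / rows[col][col] for value in rows[col]]
        for r in range(size):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    times = {n: rows[index[n]][size] for n in nodes}
    times[target] = Fraction(0)
    return times


# Illustrative graph: node 0 is the hub, nodes 1-3 are boundary nodes.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
n = len(star)
mfpt = {t: mfpt_to_target(star, t) for t in star}  # mfpt[t][s]: time from s to t
mtt = {t: sum(mfpt[t].values()) / (n - 1) for t in star}
mdt = {s: sum(mfpt[t][s] for t in star) / (n - 1) for s in star}

print("MTT:", mtt)  # hub 1, each boundary node 17/3
print("MDT:", mdt)  # hub 5, each boundary node 13/3
```

Even on this tiny example the hub has the smallest MTT and the largest MDT, and the MTT varies far more across nodes (ratio 17/3) than the MDT (ratio 15/13), which matches the qualitative pattern described above.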

preprint · 2013 · arXiv

Tutte polynomial of pseudofractal scale-free web

The Tutte polynomial of a graph is a 2-variable polynomial that is important in both combinatorics and statistical physics. It contains various numerical and polynomial invariants, such as the number of spanning trees, the number of spanning forests, the number of acyclic orientations, the reliability polynomial, the chromatic polynomial and the flow polynomial. In this paper, we derive recursive formulas for the Tutte polynomial of the pseudofractal scale-free web (PSW), which yield a logarithmic-complexity algorithm for calculating the Tutte polynomial of the PSW, although the problem is NP-hard for general graphs. We also obtain a rigorous solution for the number of spanning trees of the PSW by solving the recurrence relations derived from the Tutte polynomial, which gives an alternative approach for explicitly determining the number of spanning trees of the PSW. Furthermore, we analyze the all-terminal reliability of the PSW and compare the results with those of the Sierpinski gasket, which has the same number of nodes and edges as the PSW. In contrast with the well-known conclusion that scale-free networks are more robust against removal of nodes than homogeneous networks (e.g., exponential networks and regular networks), our results show that the Sierpinski gasket (a regular network) is more robust against random edge failures than the PSW (a scale-free network). Whether this holds for all regular networks and scale-free networks is still an unresolved problem.
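For a small graph the Tutte polynomial can be computed with the textbook deletion-contraction recursion, sketched below. This is the generic exponential-time method, not the recursive formulas derived in the paper, and the helper names are hypothetical. A polynomial is stored as a dict that maps `(i, j)` to the coefficient of $x^i y^j$, and the triangle is used as the test graph.

```python
def _reachable(start, goal, edges):
    """Return True if `goal` can be reached from `start` along `edges`."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def _times(poly, dx, dy):
    """Multiply a polynomial by x^dx * y^dy."""
    return {(i + dx, j + dy): c for (i, j), c in poly.items()}


def _plus(p, q):
    """Add two polynomials."""
    out = dict(p)
    for key, c in q.items():
        out[key] = out.get(key, 0) + c
    return out


def tutte(edges):
    """Tutte polynomial of a multigraph given as a list of (u, v) edges."""
    if not edges:
        return {(0, 0): 1}
    (u, v), rest = edges[0], list(edges[1:])
    if u == v:  # loop: T(G) = y * T(G - e)
        return _times(tutte(rest), 0, 1)
    # Contract e by renaming v to u in the remaining edges.
    contracted = [(u if a == v else a, u if b == v else b) for a, b in rest]
    if not _reachable(u, v, rest):  # bridge: T(G) = x * T(G / e)
        return _times(tutte(contracted), 1, 0)
    return _plus(tutte(rest), tutte(contracted))  # T(G - e) + T(G / e)


triangle = [(0, 1), (1, 2), (0, 2)]
poly = tutte(triangle)
print(poly)                # coefficients of x^2 + x + y
print(sum(poly.values()))  # T(1, 1): number of spanning trees, here 3
```

Evaluating the result at $x = y = 1$ counts spanning trees, which is the invariant the abstract extracts in closed form for the PSW. The recursion branches on every edge that is neither a loop nor a bridge, so it is only practical for very small graphs.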

preprint · 2011 · arXiv

Average path length for Sierpinski pentagon

In this paper, we investigate the diameter and the average path length (APL) of the Sierpinski pentagon based on its recursive construction and self-similar structure. We find that the diameter of the Sierpinski pentagon is just the shortest path length between two nodes of generation 0. Deriving and solving the linear homogeneous recurrence relation that the diameter satisfies, we obtain a rigorous solution for the diameter. We also obtain an approximate solution for the APL of the Sierpinski pentagon. Both the diameter and the APL grow approximately as a power-law function of the network order $N(t)$, with exponent $\frac{\ln(1+\sqrt{3})}{\ln(5)}$. Although the solution for the APL is approximate, it is trustworthy because we have calculated all terms of the APL exactly except for the compensation ($\Delta_t$) of the total distances between non-adjacent branches ($\Lambda_t^{1,3}$), which is obtained approximately by least-squares curve fitting. The compensation ($\Delta_t$) is only a small part of the total distances between non-adjacent branches ($\Lambda_t^{1,3}$) and has little effect on the APL. Furthermore, using the data obtained by iteration to test the fitting results, we find that the relative error for $\Delta_t$ is less than $10^{-7}$; hence the approximate solution for the average path length is almost exact.
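The form of the exponent can be motivated by a short calculation. It rests on two assumptions that the abstract does not state explicitly: that the network order grows as $N(t) \propto 5^{t}$ for large $t$, and that the diameter $D(t)$ grows as $(1+\sqrt{3})^{t}$. The symbol $D(t)$ is notation introduced for this note. Under these assumptions $t \approx \ln N(t) / \ln 5$, and therefore

$$
D(t) \propto (1+\sqrt{3})^{t} \approx (1+\sqrt{3})^{\ln N(t)/\ln 5} = N(t)^{\ln(1+\sqrt{3})/\ln 5} \approx N(t)^{0.624}.
$$

Because the exponent is below $1$, the diameter grows sublinearly in the number of nodes.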
