Source author record

Dan Meng

Dan Meng appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Topics: Distributed, Parallel, and Cluster Computing; Cryptography and Security; Hardware Architecture; Performance; Artificial Intelligence; Machine Learning

Catalog footprint

What is connected

11 works

6 topics

4 close collaborators

preprint · 2022 · arXiv

Attack detection based on machine learning algorithms for different variants of Spectre attacks and different Meltdown attack implementations

To improve the overall performance of processors, computer architects use various performance optimization techniques in modern processors, such as speculative execution, branch prediction, and out-of-order execution. Both now and in the future, these optimization techniques are critical for improving the execution speed of processor instructions. In recent years, however, researchers have discovered that these techniques introduce hidden, inherent security flaws, which attacks such as Meltdown and Spectre exploit. These attacks combine out-of-order or speculative execution with cache-based side-channel attacks to leak protected data. The impact of these vulnerabilities is enormous because they are prevalent in existing and future processors. To date, Meltdown and Spectre have not been effectively addressed; instead, multiple attack variants and different attack implementations have evolved from them. This paper proposes to optimize four different hardware performance events through feature selection and to use machine learning algorithms to build a real-time detection mechanism for Spectre v1, v2, v4, and different implementations of Meltdown attacks, ultimately achieving an accuracy of over 99%. To verify the practicality of the attack detection model, it is tested with a variety of benign programs and with implementations of Spectre attacks that differ from those used in the modeling process. The absolute accuracy again exceeds 99%, showing that the approach can cope with different attack variants and with different implementations of the same attack.

preprint · 2022 · arXiv

TensorFHE: Achieving Practical Computation on Encrypted Data Using GPGPU

In this paper, we propose TensorFHE, an FHE acceleration solution based on GPGPU for real applications on encrypted data. TensorFHE utilizes Tensor Core Units (TCUs) to accelerate the Number Theoretic Transform (NTT), which is the most time-consuming part of FHE. Moreover, TensorFHE focuses on performing as many FHE operations as possible in a given time period rather than on reducing the latency of a single operation. Based on this idea, TensorFHE introduces operation-level batching to fully utilize the data parallelism of the GPGPU. We show experimentally that a GPGPU can achieve performance comparable to state-of-the-art ASIC accelerators. TensorFHE performs 913 KOPS and 88 KOPS for NTT and HMULT (key FHE kernels) on an NVIDIA A100 GPGPU, which is 2.61x faster than the state-of-the-art FHE implementation on GPGPU. Moreover, TensorFHE provides performance comparable to ASIC FHE accelerators and is even 2.9x faster than F1+ on a specific workload. Such a pure software acceleration on commercial hardware with high performance can open up the use of state-of-the-art FHE algorithms for a broad set of applications in real systems.
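As a rough illustration of why the NTT can map onto matrix hardware, the sketch below writes a toy NTT as a modular matrix-vector product and uses it for a cyclic convolution. It is not TensorFHE's implementation: the abstract does not describe the TCU mapping, and the parameters (modulus 17, length 4, root of unity 4) and all function names are assumptions chosen only to keep the example small.

```python
# Toy NTT as a matrix-vector product (illustrative sketch only).
# Assumed parameters, not from the paper: Q = 17, N = 4, OMEGA = 4,
# where OMEGA is a primitive N-th root of unity modulo Q.
Q = 17
N = 4
OMEGA = 4


def ntt_matrix(n, omega, q):
    """Build the n x n matrix W with W[i][j] = omega^(i*j) mod q."""
    return [[pow(omega, i * j, q) for j in range(n)] for i in range(n)]


def mat_vec_mod(matrix, vector, q):
    """Multiply a matrix by a vector, reducing every entry modulo q."""
    return [sum(a * b for a, b in zip(row, vector)) % q for row in matrix]


def ntt(vector):
    """Forward transform: one modular matrix-vector product."""
    return mat_vec_mod(ntt_matrix(N, OMEGA, Q), vector, Q)


def intt(vector):
    """Inverse transform: use omega^-1, then scale by N^-1 (Python 3.8+)."""
    omega_inv = pow(OMEGA, -1, Q)
    n_inv = pow(N, -1, Q)
    out = mat_vec_mod(ntt_matrix(N, omega_inv, Q), vector, Q)
    return [(x * n_inv) % Q for x in out]


def ntt_batch(vectors):
    """Transform several inputs at once; stacked together, this is
    conceptually a single matrix-matrix product."""
    return [ntt(v) for v in vectors]


def cyclic_convolution(a, b):
    """Multiply two length-N polynomials modulo x^N - 1 and Q."""
    fa, fb = ntt(a), ntt(b)
    return intt([(x * y) % Q for x, y in zip(fa, fb)])
```

Transforming many inputs together, as in the hypothetical `ntt_batch`, is one simple way to picture throughput-oriented batching; how TensorFHE actually batches operations is not specified in the abstract.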

preprint · 2020 · arXiv

A Lightweight Isolation Mechanism for Secure Branch Predictors

Recently exposed vulnerabilities reveal the need to improve the security of branch predictors. Branch predictors record history about the execution of different programs, and such information from different processes is stored in the same structure and is thus accessible across processes. This gives attackers opportunities for malicious training and malicious perception. Instead of flush-based or physical isolation of hardware resources, we aim to isolate the content of these hardware tables with lightweight processing based on randomization, as follows. (1) Content encoding. We propose to use hardware-based thread-private random numbers to encode the contents of the branch predictor tables (both direction and destination histories), a scheme we call XOR-BP. Specifically, the data is encoded by an XOR operation with the key before being written to the table and decoded after being read from the table. This mechanism obfuscates the information, making cross-process or cross-privilege-level analysis and perception more difficult. It achieves an effect similar to logical isolation but adds little space or time overhead. (2) Index encoding. We propose a randomized index mechanism for the branch predictor (Noisy-XOR-BP). As in XOR-BP, another thread-private random number is used together with the branch instruction address as the input to compute the index into the branch predictor. This randomized indexing mechanism disrupts the correspondence between the branch instruction address and the branch predictor entry, thus increasing the noise for malicious perception attacks. Our analyses using an FPGA-based RISC-V processor prototype and additional auxiliary simulations suggest that the proposed mechanisms incur a very small performance cost while providing strong protection.
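The two encodings described above can be pictured with a small software model. This is a minimal sketch under stated assumptions, not the authors' hardware design: the class `XorBranchTable`, the helper `new_thread_keys`, the table size, and the use of a plain XOR to derive the index are all hypothetical choices made for illustration.

```python
import secrets


def new_thread_keys():
    """Return hypothetical (content_key, index_key) for one thread."""
    return secrets.randbits(32), secrets.randbits(32)


class XorBranchTable:
    """Software model of a shared predictor table with per-thread keys."""

    def __init__(self, index_bits=4, value_bits=32):
        self.index_mask = (1 << index_bits) - 1
        self.value_mask = (1 << value_bits) - 1
        self.entries = [0] * (1 << index_bits)

    def _index(self, pc, index_key):
        # Index encoding (assumed form): mix the branch address with a
        # thread-private key before selecting a table entry.
        return (pc ^ index_key) & self.index_mask

    def write(self, pc, target, content_key, index_key):
        # Content encoding: XOR with the thread's key before storing.
        slot = self._index(pc, index_key)
        self.entries[slot] = (target ^ content_key) & self.value_mask

    def read(self, pc, content_key, index_key):
        # Decoding: XOR with the same key after reading.
        slot = self._index(pc, index_key)
        return (self.entries[slot] ^ content_key) & self.value_mask


# The owning thread recovers its own entry.
table = XorBranchTable()
owner_content_key, owner_index_key = 0xAAAA5555, 0x3
table.write(0x40, 0x1234, owner_content_key, owner_index_key)
own_view = table.read(0x40, owner_content_key, owner_index_key)

# A thread with a different content key reads the same slot but
# decodes a different value.
other_view = table.read(0x40, 0x0F0F0F0F, owner_index_key)
```

In this model `own_view` equals the stored target, while `other_view` is the stored target XORed with both keys, which is the obfuscation effect the abstract attributes to content encoding.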

preprint · 2020 · arXiv

Knowledge Federation: A Unified and Hierarchical Privacy-Preserving AI Framework

With strict protections and regulations on data privacy and security, conventional machine learning based on centralized datasets faces significant challenges, making artificial intelligence (AI) impractical in many mission-critical and data-sensitive scenarios, such as finance, government, and health. Meanwhile, vast datasets are scattered in isolated silos across industries, organizations, different units of an organization, or different branches of an international organization. These valuable data resources are heavily underused. To advance AI theories and applications, we propose a comprehensive framework (called Knowledge Federation - KF) to address these challenges by enabling AI while preserving data privacy and ownership. Beyond the concepts of federated learning and secure multi-party computation, KF consists of four levels of federation: (1) information level, low-level statistics and computation of data, meeting the requirements of simple queries, searching and simplistic operators; (2) model level, supporting training, learning, and inference; (3) cognition level, enabling abstract feature representation at various levels of abstraction and context; (4) knowledge level, fusing knowledge discovery, representation, and reasoning. We further clarify the relationship and differences between knowledge federation and other related research areas. We have developed a reference implementation of KF, called iBond Platform, to offer a production-quality KF platform that enables industrial applications in finance, insurance, and other sectors. The iBond platform will also help establish the KF community and a comprehensive ecosystem and usher in a paradigm shift towards secure, privacy-preserving and responsible AI. As far as we know, knowledge federation is the first hierarchical and unified framework for secure multi-party computing and learning.

preprint · 2020 · arXiv

Zipper Stack: Shadow Stacks Without Shadow

Return-Oriented Programming (ROP) is a typical attack technique that exploits return addresses to abuse existing code repeatedly. Most current mechanisms for protecting return addresses (also known as Backward-Edge Control-Flow Integrity) work only in limited threat models. For example, they assume that the attacker cannot break memory isolation or has no knowledge of a secret key or random values. This paper presents a novel, lightweight mechanism for protecting return addresses, Zipper Stack, which authenticates all return addresses by a chain structure using cryptographic message authentication codes (MACs). This design can defend against the most powerful attackers, who have full control over the program's memory and even know the secret key of the MAC function. This threat model is stronger than the one used in related work. At the same time, it incurs low performance overhead. We implemented Zipper Stack by extending the RISC-V instruction set architecture, and the evaluation on FPGA shows that the performance overhead of Zipper Stack is only 1.86%. Thus, we think Zipper Stack is suitable for actual deployment.
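The chain idea can be sketched in software as follows. This is a minimal model based only on the abstract's description of a MAC chain over return addresses, not the paper's RISC-V extension: the class `ChainedReturnStack`, the use of HMAC-SHA256, the 8-byte address encoding, and the choice to keep only the newest MAC outside attacker-writable memory are all assumptions.

```python
import hashlib
import hmac


class ChainedReturnStack:
    """Toy model: each return address is bound to the MAC before it."""

    def __init__(self, key: bytes):
        self._key = key
        # Models a value held outside attacker-writable memory
        # (an assumption of this sketch).
        self._top = b"\x00" * 32
        # Models the in-memory stack, which an attacker may overwrite:
        # entries are (return_address, previous_top).
        self.memory = []

    def _mac(self, return_address: int, previous: bytes) -> bytes:
        message = return_address.to_bytes(8, "little") + previous
        return hmac.new(self._key, message, hashlib.sha256).digest()

    def call(self, return_address: int) -> None:
        # Save the address and the old top in memory, then advance the chain.
        self.memory.append((return_address, self._top))
        self._top = self._mac(return_address, self._top)

    def ret(self) -> int:
        # Recompute the MAC from memory and compare it with the trusted top.
        return_address, previous = self.memory.pop()
        expected = self._mac(return_address, previous)
        if not hmac.compare_digest(expected, self._top):
            raise RuntimeError("return address chain check failed")
        self._top = previous
        return return_address


# Normal nesting: returns come back in reverse call order.
stack = ChainedReturnStack(key=b"demo key, not a real secret")
stack.call(0x1000)
stack.call(0x2000)
returned = [stack.ret(), stack.ret()]

# Tampering: overwrite the saved return address in memory.
stack.call(0x3000)
stack.memory[-1] = (0xDEAD, stack.memory[-1][1])
try:
    stack.ret()
    tamper_detected = False
except RuntimeError:
    tamper_detected = True
```

Because every entry is checked against a value derived from all earlier entries, changing one saved address in this model breaks the comparison at the next return.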

preprint · 2011 · arXiv

Automatic Performance Debugging of SPMD-style Parallel Programs

The single program, multiple data (SPMD) programming model is widely used for both high performance computing and Cloud computing. In this paper, we design and implement an innovative system, AutoAnalyzer, that automates the process of debugging performance problems of SPMD-style parallel programs, including data collection, performance behavior analysis, locating bottlenecks, and uncovering their root causes. AutoAnalyzer is unique in two respects: first, without any a priori knowledge, it automatically locates bottlenecks and uncovers their root causes for performance optimization; second, it is lightweight in terms of the size of performance data to be collected and analyzed. Our contributions are three-fold: first, we propose two effective clustering algorithms to investigate the existence of performance bottlenecks that cause process behavior dissimilarity or code region behavior disparity, respectively, and we present two searching algorithms to locate bottlenecks; second, on the basis of rough set theory, we propose an innovative approach to automatically uncovering root causes of bottlenecks; third, on cluster systems with two different configurations, we use two production applications, written in Fortran 77, and one open source code, MPIBZIP2 (http://compression.ca/mpibzip2/), written in C++, to verify the effectiveness and correctness of our methods. For the three applications, we also propose an experimental approach to investigating the effects of different metrics on locating bottlenecks.

preprint · 2010 · arXiv

Automatic Performance Debugging of SPMD Parallel Programs

Automatic performance debugging of parallel applications usually involves two steps: automatic detection of performance bottlenecks and uncovering their root causes for performance optimization. Previous work fails to resolve this challenging issue in several ways: first, several previous efforts automate analysis processes, but present the results in a confined way that only identifies performance problems with a priori knowledge; second, several tools use exploratory or confirmatory data analysis to automatically discover relevant performance data relationships. However, these efforts do not focus on locating performance bottlenecks or uncovering their root causes. In this paper, we design and implement an innovative system, AutoAnalyzer, to automatically debug the performance problems of single program, multiple data (SPMD) parallel programs. Our system is unique in two respects: first, without any a priori knowledge, we automatically locate bottlenecks and uncover their root causes for performance optimization; second, our method is lightweight in terms of the size of collected and analyzed performance data. Our contribution is three-fold. First, we propose a set of simple performance metrics to represent the behavior of different processes of parallel programs, and present two effective clustering and searching algorithms to locate bottlenecks. Second, we propose to use the rough set algorithm to automatically uncover the root causes of bottlenecks. Third, we design and implement the AutoAnalyzer system, and use two production applications to verify the effectiveness and correctness of our methods. Guided by the analysis results of AutoAnalyzer, we optimize two parallel programs, with performance improvements of at least 20% and at most 170%.

preprint · 2010 · arXiv

Phoenix Cloud: Consolidating Different Computing Loads on Shared Cluster System for Large Organization

Different departments of a large organization often run dedicated cluster systems for different computing loads, such as HPC (high performance computing) jobs or Web service applications. In this paper, we design and implement a cloud management system, Phoenix Cloud, to consolidate heterogeneous workloads from different departments of the same organization on a shared cluster system. We also propose cooperative resource provisioning and management policies that allow a large organization and its affiliated departments, running HPC jobs and Web service applications, to share the consolidated cluster system. The experiments show that, compared with the case in which each department operates its own dedicated cluster system, Phoenix Cloud significantly decreases the scale of the cluster system required by a large organization, improves the benefit of the scientific computing department, and at the same time provisions enough resources to the other department running Web services with varying loads.

preprint · 2010 · arXiv

Precise Request Tracing and Performance Debugging for Multi-tier Services of Black Boxes

As more and more multi-tier services are developed from commercial components or heterogeneous middleware without the source code available, both developers and administrators need a precise request tracing tool to help understand and debug performance problems of large concurrent services of black boxes. Previous work fails to resolve this issue in several ways: it either accepts the imprecision of probabilistic correlation methods, or relies on knowledge of protocols to isolate requests in pursuit of tracing accuracy. This paper introduces a tool named PreciseTracer to help debug performance problems of multi-tier services of black boxes. Our contributions are two-fold: first, we propose a precise request tracing algorithm for multi-tier services of black boxes, which only uses application-independent knowledge; second, we present a component activity graph abstraction to represent causal paths of requests and facilitate end-to-end performance debugging. Its low overhead and tolerance of noise make PreciseTracer a promising tracing tool for use on production systems.

preprint · 2010 · arXiv

Precise, Scalable and Online Request Tracing for Multi-tier Services of Black Boxes

As more and more multi-tier services are developed from commercial off-the-shelf components or heterogeneous middleware without source code available, both developers and administrators need a request tracing tool to (1) know exactly how a user request of interest travels through services of black boxes; (2) obtain macro-level user request behavior information of services without having to wade through massive logs. Previous research efforts either accept the imprecision of probabilistic correlation methods or present precise but unscalable tracing approaches that have to collect and analyze large amounts of logs. Moreover, previous precise request tracing approaches for black boxes fail to propose macro-level abstractions that enable debugging performance-in-the-large, and hence users have to manually interpret massive logs. This paper introduces a precise, scalable and online request tracing tool, named PreciseTracer, for multi-tier services of black boxes. Our contributions are four-fold: first, we propose a precise request tracing algorithm for multi-tier services of black boxes, which only uses application-independent knowledge; second, we present micro-level and macro-level abstractions, component activity graphs and dominated causal path patterns, to represent respectively the causal paths of each individual request and the repeatedly executed causal paths that account for significant fractions; third, we present two mechanisms, tracing on demand and sampling, to significantly increase system scalability; fourth, we design and implement an online request tracing tool. PreciseTracer's fast response, low overhead and scalability make it a promising tracing tool for large-scale production systems.

preprint · 2010 · arXiv

Scalable Group Management in Large-Scale Virtualized Clusters

To save costs, more and more users now choose to provision virtual machine resources in cluster systems, especially in data centres. Maintaining a consistent member view is the foundation of reliable cluster management, and it also raises several challenging issues for large-scale cluster systems deployed with virtual machines (which we call virtualized clusters). In this paper, we describe our experience in the design and implementation of scalable member view management on large-scale virtual clusters. Our research contributions are three-fold: 1) we propose a scalable and reliable management infrastructure that combines a peer-to-peer structure and a hierarchical structure to maintain a consistent member view in virtual clusters; 2) we present a lightweight group membership algorithm that can reach a consistent member view within a single round of message exchange; and 3) we design and implement a scalable membership service that can provision virtual machines and maintain a consistent member view in virtual clusters. Our work is verified on Dawning 5000A, which ranked No. 10 on the TOP500 list of supercomputers in November 2008.
