Source author record

Xiaodong Wang

Xiaodong Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Catalog footprint

What is connected

97 works

32 topics

4 close collaborators

preprint, 2026, arXiv

A modified Bakry-Émery $Γ_2$ criterion inequality and the monotonicity of the Tsallis entropy

The Bakry-Émery $Γ_2$ criterion inequality provides a method for establishing the logarithmic Sobolev inequality. We prove a one-parameter family of weighted Bakry-Émery $Γ_2$ criterion inequalities which in the limit case yields the improved constant due to Ji [Ji24]. Furthermore, we establish a modified weighted $Γ_2$ criterion inequality which can be interpreted as a monotonicity of the Tsallis entropy under the heat flow and yields a family of sharp Sobolev inequalities.

preprint, 2022, arXiv

A Novel Algorithm to Solve for an Underwater Line Source Sound Field Based on Coupled Modes and a Spectral Method

A high-precision numerical sound field is the basis of underwater target detection, positioning and communication. A line source in a plane is a common type of sound source in computational ocean acoustics. The exciting waveguide in a range-dependent ocean environment is often structurally complicated; however, traditional algorithms often assume that the waveguide has a simple seabed boundary and that the line source is located at a horizontal range of 0 m, although this ideal situation is rarely encountered in the actual ocean. In this paper, a novel algorithm is designed that can solve for the sound field excited by a line source at any position in a range-dependent ocean environment. The proposed algorithm uses the classic stepwise approximation approach to address the range dependence of the environment and uses the Chebyshev--Tau spectral method to solve for the horizontal wavenumbers and modes of approximately range-independent segments. Once the modal information of these flat segments has been obtained, a global matrix is constructed to solve for the coupling coefficients of all segments, and finally, the complete sound field is synthesized. Numerical experiments using a robust numerical program developed based on this algorithm verify the correctness and usability of our novel algorithm and software. Furthermore, a detailed analysis and test of the computational cost of this algorithm show that it is efficient.

preprint, 2022, arXiv

A Sharp Inequality Relating Yamabe Invariants on Asymptotically Poincare-Einstein Manifolds with a Ricci Curvature Lower Bound

For an asymptotically Poincare-Einstein manifold with a lower Ricci curvature bound, we establish a sharp inequality relating the type II Yamabe invariant of the interior and the Yamabe invariant of its conformal infinity.

preprint, 2022, arXiv

Hybrid Mechanical and Electronic Beam Steering for Maximizing OAM Channel Capacity

Radio frequency-orbital angular momentum (RF-OAM) is a novel approach to multiplexing a set of orthogonal modes on the same frequency channel to achieve high spectrum efficiencies. Since OAM requires precise alignment of the transmit and the receive antennas, the electronic beam steering approach has been proposed for the uniform circular array (UCA)-based OAM communication system to circumvent the large performance degradation induced by small antenna misalignment in practical environments. However, in the case of large-angle misalignment, the OAM channel capacity cannot be effectively compensated by electronic beam steering alone. To solve this problem, we propose a hybrid mechanical and electronic beam steering scheme, in which mechanical rotating devices controlled by pulse width modulation (PWM) signals serve as the execution unit to eliminate the large misalignment angle, while electronic beam steering handles the remaining small misalignment angle caused by perturbations. Furthermore, due to the interferometry, the receive signal-to-noise ratios (SNRs) are not uniform at the elements of the receive UCA. Therefore, a rotatable UCA structure is proposed for the OAM receiver to maximize the channel capacity: the simulated annealing algorithm is first used to obtain the optimal rotation angle, then the servo system performs the mechanical rotation, and finally the electronic beam steering is adjusted accordingly. Both mathematical analysis and simulation results validate that the proposed hybrid mechanical and electronic beam steering scheme can effectively eliminate the effect of diverse misalignment errors of any practical OAM channel and maximize the OAM channel capacity.

preprint, 2022, arXiv

Scaling Blockchains with Error Correction Codes: A Survey on Coded Blockchains

This paper reviews and highlights how coding schemes have been used to solve various problems in blockchain systems. Specifically, these problems relate to scaling blockchains in terms of their data storage, computation and communication cost, as well as security. To this end, this paper considers the use of coded blocks or shards that allows participants to store only a fraction of the total blockchain, protect against malicious nodes or erasures due to nodes leaving a blockchain system, ensure data availability in order to promote transparency, and scale the security of sharded blockchains. Further, it helps reduce communication cost when disseminating blocks, which is critical to bootstrapping new nodes and helps speed up consensus of blocks. For each category of solutions, we highlight problems and issues that motivated their designs and use of coding. Moreover, we provide a qualitative analysis of their storage, communication and computation cost.

preprint, 2022, arXiv

Two stages for visual object tracking

Siamese-based trackers have achieved promising performance on visual object tracking tasks. Most existing Siamese-based trackers contain two separate branches for tracking: a classification branch and a bounding box regression branch. In addition, image segmentation provides an alternative way to obtain a more accurate target region. In this paper, we propose a novel tracker with two stages: detection and segmentation. The detection stage locates the target with Siamese networks. More accurate tracking results are then obtained by the segmentation module, given the coarse state estimation from the first stage. We conduct experiments on four benchmarks. Our approach achieves state-of-the-art results, with an EAO of 52.6$\%$ on VOT2016, 51.3$\%$ on VOT2018, and 39.0$\%$ on VOT2019, respectively.

preprint, 2021, arXiv

Discovering Multiple Phases of Dynamics by Dissecting Multivariate Time Series

We propose a data-driven approach to dissect multivariate time series in order to discover multiple phases underlying the dynamics of complex systems. This computing approach is developed as a multiple-dimension version of the Hierarchical Factor Segmentation (HFS) technique. This expanded approach proposes a systematic protocol for choosing various extreme events in multi-dimensional space. For each chosen event, an empirical distribution of event recurrence, or waiting time between the excursions, is fitted by a geometric distribution with time-varying parameters. Iterative fittings are performed across all chosen events. We then collect and summarize the local recurrent patterns into a global dynamic mechanism. Clustering is applied to partition the whole time period into alternating segments, in which variables are identically distributed. Feature weighting techniques are also considered to compensate for some drawbacks of clustering. Our simulation results show that this expanded approach can even detect systematic differences when the joint distribution varies. In real data experiments, we analyze the relationship among returns, trading volume, and transaction number of a single stock, as well as of multiple stocks, in the S&P500. We not only map out volatile periods but also provide potential associative links between stocks.

preprint, 2021, arXiv

Improved Sobolev inequality under constraints

We give a new proof of Aubin's improvement of the Sobolev inequality on $\mathbb{S}^{n}$ under the vanishing of first order moments of the area element and generalize it to higher order moments case. By careful study of an extremal problem on $\mathbb{S}^{n}$, we determine the constant explicitly in the second order moments case.

preprint, 2021, arXiv

MIMO OFDM Dual-Function Radar-Communication Under Error Rate and Beampattern Constraints

In this work we consider a multiple-input multiple-output (MIMO) dual-function radar-communication (DFRC) system, which senses multiple spatial directions and serves multiple users. Upon resorting to an orthogonal frequency division multiplexing (OFDM) transmission format and a differential phase shift keying (DPSK) modulation, we study the design of the radiated waveforms and of the receive filters employed by the radar and the users. The approach is communication-centric, in the sense that a radar-oriented objective is optimized under constraints on the average transmit power, the power leakage towards specific directions, and the error rate of each user, thus safeguarding the communication quality of service (QoS). We adopt a unified design approach allowing a broad family of radar objectives, including both estimation- and detection-oriented merit functions. We devise a suboptimal solution based on alternating optimization of the involved variables, a convex restriction of the feasible search set, and minorization-maximization, offering a single algorithm for all of the radar merit functions in the considered family. Finally, the performance is inspected through numerical examples.

preprint, 2021, arXiv

Privacy-preserving Channel Estimation in Cell-free Hybrid Massive MIMO Systems

We consider a cell-free hybrid massive multiple-input multiple-output (MIMO) system with $K$ users and $M$ access points (APs), each with $N_a$ antennas and $N_r< N_a$ radio frequency (RF) chains. When $K\ll M{N_a}$, efficient uplink channel estimation and data detection with reduced number of pilots can be performed based on low-rank matrix completion. However, such a scheme requires the central processing unit (CPU) to collect received signals from all APs, which may enable the CPU to infer the private information of user locations. We therefore develop and analyze privacy-preserving channel estimation schemes under the framework of differential privacy (DP). As the key ingredient of the channel estimator, two joint differentially private noisy matrix completion algorithms based respectively on Frank-Wolfe iteration and singular value decomposition are presented. We provide an analysis on the tradeoff between the privacy and the channel estimation error. In particular, we show that the estimation error can be mitigated while maintaining the same privacy level by increasing the payload size with fixed pilot size; and the scaling laws of both the privacy-induced and privacy-independent error components in terms of payload size are characterized. Simulation results are provided to further demonstrate the tradeoff between privacy and channel estimation performance.

preprint, 2021, arXiv

Reconfigurable-intelligent-surface-assisted Downlink Transmission Design via Bayesian Optimization

This paper investigates the transmission design in the reconfigurable-intelligent-surface (RIS)-assisted downlink system. The channel state information (CSI) is usually difficult to estimate at the base station (BS) when the RIS is not equipped with radio frequency chains. In this paper, we propose a downlink transmission framework with unknown CSI via Bayesian optimization. Since the CSI is not available at the BS, we treat the unknown objective function as a black-box function and take the beamformer, the phase shift, and the receiving filter as the input. The objective function is then decomposed into a sum of low-dimensional subfunctions to reduce the complexity. By re-expressing the power constraint of the BS in spherical coordinates, the original constrained problem is converted into an equivalent unconstrained problem. The users estimate the sum MSE of the training symbols as the objective value and feed it back to the BS. We assume a Gaussian prior on the feedback samples, and the next query point is updated by minimizing the constructed acquisition function. Furthermore, this framework can also be applied to the power transfer system and fairness problems. Simulation results validate the effectiveness of the proposed transmission scheme in downlink data transmission and power transfer.

preprint, 2021, arXiv

Remark on an inequality for closed hypersurfaces in complete manifolds with nonnegative Ricci curvature

We give a simple proof of a recent result due to Agostiniani, Fogagnolo and Mazzieri.

preprint, 2020, arXiv

Compressing Recurrent Neural Networks Using Hierarchical Tucker Tensor Decomposition

Recurrent Neural Networks (RNNs) have been widely used in sequence analysis and modeling. However, when processing high-dimensional data, RNNs typically require very large model sizes, which brings a series of deployment challenges. Although state-of-the-art tensor decomposition approaches can provide good model compression performance, these existing methods still suffer from some inherent limitations, such as restricted representation capability and insufficient model complexity reduction. To overcome these limitations, in this paper we propose to develop compact RNN models using Hierarchical Tucker (HT) decomposition. HT decomposition brings strong hierarchical structure to the decomposed RNN models, which is very useful and important for enhancing the representation capability. Meanwhile, HT decomposition provides higher storage and computational cost reduction than the existing tensor decomposition approaches for RNN compression. Our experimental results show that, compared with state-of-the-art compressed RNN models such as TT-LSTM, TR-LSTM and BT-LSTM, our proposed HT-based LSTM (HT-LSTM) consistently achieves simultaneous and significant increases in both compression ratio and test accuracy on different datasets.

preprint, 2020, arXiv

Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems

Large-scale training is important to ensure high performance and accuracy of machine-learning models. At Facebook we use many different models, including computer vision, video and language models. However, in this paper we focus on the deep learning recommendation models (DLRMs), which are responsible for more than 50% of the training demand in our data centers. Recommendation models present unique challenges in training because they exercise not only compute but also memory capacity as well as memory and network bandwidth. As model size and complexity increase, efficiently scaling training becomes a challenge. To address it we design Zion - Facebook's next-generation large-memory training platform that consists of both CPUs and accelerators. Also, we discuss the design requirements of future scale-out training systems.

preprint, 2020, arXiv

DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference

Neural personalized recommendation is the cornerstone of a wide collection of cloud services and products, constituting a significant compute demand on the cloud infrastructure. Thus, improving the execution efficiency of neural recommendation directly translates into infrastructure capacity saving. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm and system co-design methodology to custom-design systems for recommendation use cases. Leveraging the insights from the recommendation characterization, a new dynamic scheduler, DeepRecSched, is proposed to maximize latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, recommendation model architectures, and underlying hardware systems. By doing so, system throughput is doubled across the eight industry-representative recommendation models. Finally, design, deployment, and evaluation in an at-scale production datacenter show over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.

preprint, 2020, arXiv

Exploiting Parallelism Opportunities with Deep Learning Frameworks

State-of-the-art machine learning frameworks support a wide variety of design features to enable a flexible machine learning programming interface and to ease the programmability burden on machine learning developers. Identifying and using a performance-optimal setting in feature-rich frameworks, however, involves a non-trivial amount of performance profiling efforts and often relies on domain-specific knowledge. This paper takes a deep dive into analyzing the performance impact of key design features in a machine learning framework and quantifies the role of parallelism. The observations and insights distill into a simple set of guidelines that one can use to achieve much higher training and inference speedup. Across a diverse set of real-world deep learning models, the evaluation results show that the proposed performance tuning guidelines outperform the Intel and TensorFlow recommended settings by 1.29x and 1.34x, respectively.

preprint, 2020, arXiv

From learning gait signatures of many individuals to reconstructing gait dynamics of one single individual

Based on the same databases, we computationally address two seemingly closely related, but in fact drastically distinct, questions via data-driven algorithms: 1) how to precisely differentiate the gait signatures of many individuals, and 2) how to reconstruct an individual's complex gait dynamics in full. Our brains can "effortlessly" resolve the first question but will definitely fail at the second, since many fine temporal-scale gait patterns escape our eyes. Based on accelerometers' 3D gait time series databases, we link the answers to both questions via multiscale structural dependency within the gait dynamics of our musculoskeletal system. Two types of dependency manifestations are explored. We first develop a simple algorithmic computing approach called Principle System-State Analysis (PSSA) for the coarse dependency in implicit forms. PSSA is shown to classify efficiently among many subjects. We then develop a multiscale Local-1st-Global-2nd (L1G2) Coding Algorithm and a landmark computing algorithm. With both algorithms, we can precisely dissect rhythmic gait cycles, and then decompose each cycle into a series of cyclic gait phases. With proper color-coding and stacking, we reconstruct and represent an individual's gait dynamics via a 3D cylinder to collectively reveal universal deterministic and stochastic structural patterns on the centisecond (10 milliseconds) scale across all rhythmic cycles. This 3D cylinder can serve as a "passtensor" for authentication purposes related to clinical diagnoses and cybersecurity.

preprint, 2020, arXiv

GEVO: GPU Code Optimization using Evolutionary Computation

GPUs are a key enabler of the revolution in machine learning and high performance computing, functioning as de facto co-processors to accelerate large-scale computation. As the programming stack and tool support have matured, GPUs have also become accessible to programmers, who may lack detailed knowledge of the underlying architecture and fail to fully leverage the GPU's computation power. GEVO (Gpu optimization using EVOlutionary computation) is a tool for automatically discovering optimization opportunities and tuning the performance of GPU kernels in the LLVM representation. GEVO uses population-based search to find edits to GPU code compiled to LLVM-IR and improves performance on desired criteria while retaining required functionality. We demonstrate that GEVO improves the execution time of the GPU programs in the Rodinia benchmark suite and the machine learning models, SVM and ResNet18, on NVIDIA Tesla P100. For the Rodinia benchmarks, GEVO improves GPU kernel runtime performance by an average of 49.48% and by as much as 412% over the fully compiler-optimized baseline. If kernel output accuracy is relaxed to tolerate up to 1% error, GEVO can find kernel variants that outperform the baseline version by an average of 51.08%. For the machine learning workloads, GEVO achieves kernel performance improvement for SVM on the MNIST handwriting recognition (3.24X) and the a9a income prediction (2.93X) datasets with no loss of model accuracy. GEVO achieves 1.79X kernel performance improvement on image classification using ResNet18/CIFAR-10, with less than 1% model accuracy reduction.

preprint, 2020, arXiv

Liouville type theorems on manifolds with nonnegative curvature and strictly convex boundary

We prove some Liouville type theorems on smooth compact Riemannian manifolds with nonnegative sectional curvature and strictly convex boundary. This gives a nonlinear generalization in low dimension of the recent sharp lower bound of the first Steklov eigenvalue by Xia-Xiong and verifies partially a conjecture by the third author. As a consequence, we derive several sharp Sobolev trace inequalities on these manifolds.

preprint, 2020, arXiv

Long-term scheduling and power control for wirelessly powered cell-free IoT

We investigate the long-term scheduling and power control scheme for a wirelessly powered cell-free Internet of Things (IoT) network which consists of distributed access points (APs) and a large number of sensors. In each time slot, a subset of sensors is scheduled for uplink data transmission or downlink power transfer. Through asymptotic analysis, we obtain closed-form expressions for the harvested energy and the achievable rates that are independent of random pilots. Then, using these expressions, we formulate a long-term scheduling and power control problem to maximize the minimum time average achievable rate among all sensors, while maintaining the battery state of each sensor higher than a predefined minimum level. Using Lyapunov optimization, the transmission mode, the active sensor set, and the power control coefficients for each time slot are jointly determined. Finally, simulation results validate the accuracy of our derived closed-form expressions and reveal that the minimum time average achievable rate is boosted significantly by the proposed scheme compared with the simple greedy transmission scheme.

preprint, 2020, arXiv

mmWave/THz Channel Estimation Using Frequency-Selective Atomic Norm Minimization

We propose a MIMO channel estimation method for millimeter-wave (mmWave) and terahertz (THz) systems based on frequency-selective atomic norm minimization (FS-ANM). Owing to the strong line-of-sight property of the channel in such high-frequency bands, prior knowledge on the ranges of angles of departure/arrival (AoD/AoA) can be obtained and exploited by the proposed channel estimator to improve the estimation accuracy. Simulation results show that the proposed method can achieve considerable performance gain when compared with the existing approaches that do not incorporate the strong line-of-sight property.

preprint, 2020, arXiv

Multi-mode OAM Radio Waves: Generation, Angle of Arrival Estimation and Reception With UCAs

Orbital angular momentum (OAM) at radio frequency (RF) provides a novel approach to multiplexing a set of orthogonal modes on the same frequency channel to achieve high spectrum efficiencies. However, there are still big challenges in multi-mode OAM generation, OAM antenna alignment and OAM signal reception. To solve these problems, we propose an overall scheme of line-of-sight multi-carrier and multi-mode OAM (LoS MCMM-OAM) communication based on uniform circular arrays (UCAs). First, we verify that a UCA can generate a multi-mode OAM radio beam with both the RF analog synthesis method and the baseband digital synthesis method. Then, for the considered UCA-based LoS MCMM-OAM communication system, a distance and AoA estimation method is proposed based on the two-dimensional ESPRIT (2-D ESPRIT) algorithm. A salient feature of the proposed LoS MCMM-OAM and LoS MCMM-OAM-MIMO systems is that the channel matrices are completely characterized by three parameters, namely, the azimuth angle, the elevation angle and the distance, independent of the numbers of subcarriers and antennas, which significantly reduces the burden by avoiding the estimation of large channel matrices required in traditional MIMO-OFDM systems. After that, we propose an OAM reception scheme including beam steering with the estimated AoA and amplitude detection with the estimated distance. Finally, the proposed methods are extended to the LoS MCMM-OAM-MIMO system equipped with uniform concentric circular arrays (UCCAs). Both mathematical analysis and simulation results validate that the proposed OAM reception scheme can eliminate the effect of the misalignment error of a practical OAM channel and approaches the performance of an ideally aligned OAM channel.

preprint, 2020, arXiv

On compact Riemannian manifolds with convex boundary and Ricci curvature bounded from below

We propose a new approach to the study of compact Riemannian manifolds with nonnegative Ricci curvature and strictly convex boundary or positive Ricci curvature and convex boundary. Several conjectures are formulated. Some partial results that support these conjectures are established.

preprint, 2020, arXiv

Progressive Local Filter Pruning for Image Retrieval Acceleration

This paper focuses on network pruning for image retrieval acceleration. Prevailing image retrieval works target discriminative feature learning, while little attention is paid to accelerating model inference, which should be taken into consideration in real-world practice. The challenge of pruning image retrieval models is that the middle-level feature should be preserved as much as possible. These differing requirements of retrieval and classification models make traditional pruning methods poorly suited to our task. To solve the problem, we propose a new Progressive Local Filter Pruning (PLFP) method for image retrieval acceleration. Specifically, layer by layer, we analyze the local geometric properties of each filter and select the one that can be replaced by its neighbors. Then we progressively prune the filter by gradually changing the filter weights. In this way, the representation ability of the model is preserved. To verify this, we evaluate our method on two widely used image retrieval datasets, i.e., Oxford5k and Paris6K, and one person re-identification dataset, i.e., Market-1501. The proposed method achieves superior performance to the conventional pruning methods, suggesting the effectiveness of the proposed method for image retrieval.

preprint, 2020, arXiv

Real-Time Nonparametric Anomaly Detection in High-Dimensional Settings

Timely detection of abrupt anomalies is crucial for real-time monitoring and security of modern systems producing high-dimensional data. With this goal, we propose effective and scalable algorithms. Proposed algorithms are nonparametric as both the nominal and anomalous multivariate data distributions are assumed unknown. We extract useful univariate summary statistics and perform anomaly detection in a single-dimensional space. We model anomalies as persistent outliers and propose to detect them via a cumulative sum-like algorithm. In case the observed data have a low intrinsic dimensionality, we learn a submanifold in which the nominal data are embedded and evaluate whether the sequentially acquired data persistently deviate from the nominal submanifold. Further, in the general case, we learn an acceptance region for nominal data via Geometric Entropy Minimization and evaluate whether the sequentially observed data persistently fall outside the acceptance region. We provide an asymptotic lower bound and an asymptotic approximation for the average false alarm period of the proposed algorithm. Moreover, we provide a sufficient condition to asymptotically guarantee that the decision statistic of the proposed algorithm does not diverge in the absence of anomalies. Experiments illustrate the effectiveness of the proposed schemes in quick and accurate anomaly detection in high-dimensional settings.
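The "cumulative sum-like" detection of persistent outliers mentioned in this abstract can be illustrated with a generic CUSUM-style recursion. The following is a minimal sketch, not the paper's algorithm: the function name `cusum_alarm_time`, the per-sample `scores` (assumed positive when a sample looks like an outlier and negative otherwise) and the `threshold` are all hypothetical stand-ins for the summary statistics and decision rule that the paper derives.

```python
from typing import Iterable, Optional


def cusum_alarm_time(scores: Iterable[float], threshold: float) -> Optional[int]:
    """Return the index of the first sample at which an alarm is raised.

    Generic CUSUM-style recursion (illustrative assumption, not the
    paper's statistic): the decision statistic accumulates the scores,
    is clipped at zero, and triggers once it reaches the threshold.
    """
    statistic = 0.0
    for t, score in enumerate(scores):
        # Accumulate evidence; clipping at zero forgets old nominal samples.
        statistic = max(0.0, statistic + score)
        if statistic >= threshold:
            return t
    return None  # no alarm within the observed samples


# Nominal samples (negative scores) followed by persistent outliers.
print(cusum_alarm_time([-1, -1, 0.5, -1, 1, 1, 1, 1], threshold=3.0))
```

A single isolated outlier (the `0.5` above) is absorbed by the clipping at zero, whereas a run of positive scores accumulates until the threshold is crossed, which is the sense in which such a rule reacts to persistent rather than isolated deviations.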

preprint, 2020, arXiv

Spectral Method for Phase Retrieval: an Expectation Propagation Perspective

Phase retrieval refers to the problem of recovering a signal $\mathbf{x}_{\star}\in\mathbb{C}^n$ from its phaseless measurements $y_i=|\mathbf{a}_i^{\mathrm{H}}\mathbf{x}_{\star}|$, where $\{\mathbf{a}_i\}_{i=1}^m$ are the measurement vectors. Many popular phase retrieval algorithms are based on the following two-step procedure: (i) initialize the algorithm based on a spectral method, (ii) refine the initial estimate by a local search algorithm (e.g., gradient descent). The quality of the spectral initialization step can have a major impact on the performance of the overall algorithm. In this paper, we focus on the model where the measurement matrix $\mathbf{A}=[\mathbf{a}_1,\ldots,\mathbf{a}_m]^{\mathrm{H}}$ has orthonormal columns, and study the spectral initialization under the asymptotic setting $m,n\to\infty$ with $m/n\to δ\in(1,\infty)$. We use the expectation propagation framework to characterize the performance of spectral initialization for Haar distributed matrices. Our numerical results confirm that the predictions of the EP method are accurate not only for Haar distributed matrices, but also for realistic Fourier based models (e.g., the coded diffraction model). The main findings of this paper are the following: (1) There exists a threshold on $δ$ (denoted as $δ_{\mathrm{weak}}$) below which the spectral method cannot produce a meaningful estimate. We show that $δ_{\mathrm{weak}}=2$ for the column-orthonormal model. In contrast, previous results by Mondelli and Montanari show that $δ_{\mathrm{weak}}=1$ for the i.i.d. Gaussian model. (2) The optimal design for the spectral method coincides with that for the i.i.d. Gaussian model, where the latter was recently introduced by Luo, Alghamdi and Lu.

preprint, 2020, arXiv

The Architectural Implications of Facebook's DNN-based Personalized Recommendation

The widespread application of deep learning has changed the landscape of computation in the data center. In particular, personalized recommendation for content ranking is now largely accomplished leveraging deep neural networks. However, despite the importance of these models and the amount of compute cycles they consume, relatively little research attention has been devoted to systems for recommendation. To facilitate research and to advance the understanding of these workloads, this paper presents a set of real-world, production-scale DNNs for personalized recommendation coupled with relevant performance metrics for evaluation. In addition to releasing a set of open-source workloads, we conduct in-depth analysis that underpins future system design and optimization for at-scale recommendation: Inference latency varies by 60% across three Intel server generations, batching and co-location of inferences can drastically improve latency-bounded throughput, and the diverse composition of recommendation models leads to different optimization strategies.

preprint, 2020, arXiv

Wirelessly Powered Cell-free IoT: Analysis and Optimization

In this paper, we propose a wirelessly powered Internet of Things (IoT) system based on the cell-free massive MIMO technology. In such a system, during the downlink phase, the sensors harvest radio-frequency (RF) energy emitted by the distributed access points (APs). During the uplink phase, sensors transmit data to the APs using the harvested energy. Collocated massive MIMO and small-cell IoT can be treated as special cases of cell-free IoT. We derive the tight closed-form lower bound on the amount of harvested energy, and the closed-form expression of SINR as the metrics of power transfer and data transmission, respectively. To improve the energy efficiency, we jointly optimize the uplink and downlink power control coefficients to minimize the total transmit energy consumption while meeting the target SINRs. Extended simulation results show that cell-free IoT outperforms collocated massive MIMO and small-cell IoT both in terms of the per user throughput for uplink, and the amount of energy harvested for downlink. Moreover, significant gains can be achieved by the proposed joint power control in terms of both per user throughput and energy consumption.

preprint2019arXiv

RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing

Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator- and data-level parallelism lead to memory bandwidth saturation, limiting recommendation inference performance. We propose RecNMP which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models. RecNMP is specifically tailored to production environments with heavy co-location of operators on a single server. Several hardware/software co-optimization techniques such as memory-side caching, table-aware packet scheduling, and hot entry profiling are studied, resulting in up to 9.8x memory latency speedup over a highly-optimized baseline. Overall, RecNMP offers 4.2x throughput improvement and 45.8% memory energy savings.

preprint2016arXiv

A simple linear space algorithm for computing a longest common increasing subsequence

This paper reformulates the problem of finding a longest common increasing subsequence of two given input sequences in a very succinct way. An extremely simple linear space algorithm based on the new formula can find a longest common increasing subsequence of two sequences of sizes $n$ and $m$, respectively, in time $O(nm)$ using additional $\min\{n,m\}+1$ space.
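The time and space bounds quoted in this abstract can be illustrated with the classical dynamic program for the longest common increasing subsequence. This is a minimal sketch of that textbook technique, not the reformulated formula from the paper; the function name `lcis_length` is hypothetical, and only the length is computed.

```python
def lcis_length(x, y):
    """Length of a longest common increasing subsequence of x and y.

    Textbook O(len(x) * len(y)) dynamic program, shown as an illustration
    only; it is not taken from the paper.
    """
    # Index the DP array by the shorter sequence so the extra space is
    # min(n, m) cells plus the single scalar `current`.
    if len(y) > len(x):
        x, y = y, x
    # best[j] = length of the longest common increasing subsequence
    # found so far that ends with y[j].
    best = [0] * len(y)
    for a in x:
        # Longest qualifying subsequence ending in a value smaller than a,
        # among the positions of y scanned so far in this pass.
        current = 0
        for j, b in enumerate(y):
            if b == a:
                best[j] = max(best[j], current + 1)
            elif b < a:
                current = max(current, best[j])
    return max(best, default=0)
```

For example, `lcis_length([3, 4, 9, 1], [5, 3, 8, 9, 10, 2, 1])` evaluates to 2, corresponding to the subsequence 3, 9.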

preprint2016arXiv

Beamspace Channel Estimation for Millimeter-Wave Massive MIMO Systems with Lens Antenna Array

By employing the lens antenna array, beamspace MIMO can utilize beam selection to reduce the number of required RF chains in mmWave massive MIMO systems without obvious performance loss. However, to achieve the capacity-approaching performance, beam selection requires accurate information of the large beamspace channel, which is challenging, especially when the number of RF chains is limited. To solve this problem, in this paper we propose a reliable support detection (SD)-based channel estimation scheme. Specifically, we propose to decompose the total beamspace channel estimation problem into a series of sub-problems, each of which only considers one sparse channel component. For each channel component, we first reliably detect its support by utilizing the structural characteristics of the mmWave beamspace channel. Then, the influence of this channel component is removed from the total beamspace channel estimation problem. After the supports of all channel components have been detected, the nonzero elements of the sparse beamspace channel can be estimated with low pilot overhead. Simulation results show that the proposed SD-based channel estimation outperforms conventional schemes and enjoys satisfactory accuracy, even in the low SNR region.

preprint2016arXiv

Low-tubal-rank Tensor Completion using Alternating Minimization

The low-tubal-rank tensor model has been recently proposed for real-world multidimensional data. In this paper, we study the low-tubal-rank tensor completion problem, i.e., to recover a third-order tensor by observing a subset of its elements selected uniformly at random. We propose a fast iterative algorithm, called Tubal-Alt-Min, that is inspired by a similar approach for low-rank matrix completion. The unknown low-tubal-rank tensor is represented as the product of two much smaller tensors with the low-tubal-rank property being automatically incorporated, and Tubal-Alt-Min alternates between estimating those two tensors using tensor least squares minimization. First, we note that tensor least squares minimization is different from its matrix counterpart and nontrivial as the circular convolution operator of the low-tubal-rank tensor model is intertwined with the sub-sampling operator. Second, the theoretical performance guarantee is challenging since Tubal-Alt-Min is iterative and nonconvex in nature. We prove that 1) Tubal-Alt-Min guarantees exponential convergence to the global optima, and 2) for an $n \times n \times k$ tensor with tubal-rank $r \ll n$, the required sampling complexity is $O(nr^2k \log^3 n)$ and the computational complexity is $O(n^2rk^2 \log^2 n)$. Third, on both synthetic data and real-world video data, evaluation results show that compared with tensor-nuclear norm minimization (TNN-ADMM), Tubal-Alt-Min improves the recovery error dramatically (by orders of magnitude). It is estimated that Tubal-Alt-Min converges at an exponential rate $10^{-0.4423 \text{Iter}}$ where $\text{Iter}$ denotes the number of iterations, which is much faster than TNN-ADMM's $10^{-0.0332 \text{Iter}}$, and the running time can be accelerated by more than $5$ times for a $200 \times 200 \times 20$ tensor.
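To make the two estimated convergence rates in this abstract concrete, the short calculation below converts each rate into an iteration count. It is a rough sketch that assumes the error after t iterations is exactly $10^{-c t}$ with the quoted exponents; the target error of $10^{-6}$ and the helper name `iterations_to_reach` are illustrative assumptions, not values from the paper.

```python
import math


def iterations_to_reach(target_error, decay_exponent):
    """Smallest t with 10 ** (-decay_exponent * t) <= target_error.

    Assumes the idealized error model 10 ** (-decay_exponent * t).
    """
    return math.ceil(-math.log10(target_error) / decay_exponent)


# Exponents quoted in the abstract; the 1e-6 target is a hypothetical choice.
tubal_alt_min_iters = iterations_to_reach(1e-6, 0.4423)
tnn_admm_iters = iterations_to_reach(1e-6, 0.0332)
```

Under this idealized model, Tubal-Alt-Min needs about 14 iterations and TNN-ADMM about 181 to reach the same target, a ratio close to 0.4423 / 0.0332, or roughly 13.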

preprint2016arXiv

Modeling and analysis of the electromechanical behavior of surface-bonded piezoelectric actuators using finite element method

Piezoelectric actuators have been widely used to form self-monitoring smart systems for structural health monitoring (SHM). One of the most fundamental issues in using actuators is to determine the actuation effects being transferred from the actuators to the host structure. This report summarizes the state of the art of modeling techniques for piezoelectric actuators and provides a numerical analysis of the static and dynamic electromechanical behavior of piezoelectric actuators surface-bonded to an elastic medium under in-plane mechanical and electric loads using the finite element method. A case study is also conducted to examine the effect of material properties, bonding layer and loading frequency using static and harmonic analysis in ANSYS. Finally, stresses and displacements are determined, and singularity behavior at the tips of the actuator is proved. The results indicate that material properties, bonding layers and frequency have a significant influence on the stresses transferred to the host structure.

preprint2016arXiv

Position-aided Large-scale MIMO Channel Estimation for High-Speed Railway Communication Systems

We consider channel estimation for high-speed railway communication systems, where both the transmitter and the receiver are equipped with large-scale antenna arrays. It is known that the throughput of conventional training schemes monotonically decreases with mobility. Assuming that the moving terminal employs a large linear antenna array, this paper proposes a position-aided channel estimation scheme whereby only a portion of the transmit antennas send pilot symbols and the full channel matrix can be well estimated by using these pilots together with the antenna position information based on the joint spatial-temporal correlation. The relationship between mobility and throughput/DoF is established. Furthermore, the optimal selections of transmit power and time interval partition between the training and data phases as well as the antenna size are presented accordingly. Both analytical and simulation results show that the system throughput with the position-aided channel estimator does not deteriorate appreciably as the mobility increases, in sharp contrast to the conventional scheme.

preprint2016arXiv

Sequential Hypothesis Test with Online Usage-Constrained Sensor Selection

This work investigates the sequential hypothesis testing problem with online sensor selection and sensor usage constraints. That is, in a sensor network, the fusion center sequentially acquires samples by selecting one "most informative" sensor at each time until a reliable decision can be made. In particular, the sensor selection is carried out in an online fashion since it depends on all the previous samples at each time. Our goal is to develop the sequential test (i.e., stopping rule and decision function) and sensor selection strategy that minimize the expected sample size subject to the constraints on the error probabilities and sensor usages. To this end, we first recast the usage-constrained formulation into a Bayesian optimal stopping problem with different sampling costs for the usage-constrained sensors. The Bayesian problem is then studied under both finite- and infinite-horizon setups, based on which the optimal solution to the original usage-constrained problem can be readily established. Moreover, by capitalizing on the structures of the optimal solution, a lower bound is obtained for the optimal expected sample size. In addition, we propose algorithms to approximately evaluate the parameters in the optimal sequential test so that the sensor usage and error probability constraints are satisfied. Finally, numerical experiments are provided to illustrate the theoretical findings and to compare with existing methods.

preprint2015arXiv

A Practical O(R\log\log n+n) time Algorithm for Computing the Longest Common Subsequence

In this paper, we revisit the much studied LCS problem for two given sequences. Based on the algorithm of Iliopoulos and Rahman for solving the LCS problem, we propose three improved algorithms. We first reformulate the problem in a very succinct form. The LCS problem is abstracted to an abstract data type DS on an ordered positive integer set with a special operation Update(S,x). For the two input sequences X and Y of equal length n, the first improved algorithm uses a van Emde Boas tree for DS and its time and space complexities are $O(R\log\log n+n)$ and $O(R)$, where R is the number of matched pairs of the two input sequences. The second algorithm uses a balanced binary search tree for DS and its time and space complexities are $O(R\log L+n)$ and $O(R)$, where L is the length of the longest common subsequence of X and Y. The third algorithm uses an ordered vector for DS and its time and space complexities are $O(nL)$ and $O(R)$.
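The idea of driving an LCS computation by the R matched pairs rather than by the full n-by-n table can be sketched with the classical Hunt-Szymanski style method, which keeps an ordered vector and updates it by binary search. This is an illustration of that well-known technique under stated assumptions, not the Update(S,x) data type or any of the three algorithms from the paper; the function name `lcs_length_matched_pairs` is hypothetical, and only the length is returned.

```python
from bisect import bisect_left
from collections import defaultdict


def lcs_length_matched_pairs(x, y):
    """Length of a longest common subsequence, visiting matched pairs only.

    Classical Hunt-Szymanski style sketch; each matched pair (i, j) with
    x[i] == y[j] costs one binary search on an ordered vector.
    """
    # positions[s] lists, in increasing order, the indices j with y[j] == s.
    positions = defaultdict(list)
    for j, symbol in enumerate(y):
        positions[symbol].append(j)

    # tails[k] = smallest index in y at which a common subsequence of
    # length k + 1 can end, given the prefix of x processed so far.
    tails = []
    for symbol in x:
        # Visit matches from right to left so that a single position of x
        # is never used twice within the same subsequence.
        for j in reversed(positions.get(symbol, ())):
            k = bisect_left(tails, j)
            if k == len(tails):
                tails.append(j)
            else:
                tails[k] = j
    return len(tails)
```

For the textbook pair `"ABCBDAB"` and `"BDCABA"` the function returns 4. Sequences with few matching symbol pairs are handled much faster than with the quadratic table, which is the motivation for bounds expressed in R.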

preprint2015arXiv

An Efficient Dynamic Programming Algorithm for STR-IC-SEQ-EC-LCS Problem

In this paper, we consider a generalized longest common subsequence problem, in which a constraining sequence of length $s$ must be included as a substring and the other constraining sequence of length $t$ must be excluded as a subsequence of two main sequences and the length of the result must be maximal. For the two input sequences $X$ and $Y$ of lengths $n$ and $m$, and the given two constraining sequences of length $s$ and $t$, we present an $O(nmst)$ time dynamic programming algorithm for solving the new generalized longest common subsequence problem. The time complexity can be reduced further to cubic time in a more detailed analysis. The correctness of the new algorithm is proved.

preprint2015arXiv

An efficient dynamic programming algorithm for the generalized LCS problem with multiple substring inclusive constraints

In this paper, we consider a generalized longest common subsequence problem with multiple substring inclusive constraints. For the two input sequences $X$ and $Y$ of lengths $n$ and $m$, and a set of $d$ constraints $P=\{P_1,\cdots,P_d\}$ of total length $r$, the problem is to find a common subsequence $Z$ of $X$ and $Y$ that includes each constraint string in $P$ as a substring and whose length is maximized. A new dynamic programming solution to this problem is presented in this paper. The correctness of the new algorithm is proved. The time complexity of our algorithm is $O(d2^dnmr)$. When the number of constraint strings is fixed, our new algorithm for the generalized longest common subsequence problem with multiple substring inclusive constraints requires $O(nmr)$ time and space.

preprint2015arXiv

An LS-Decomposition Approach for Robust Data Recovery in Wireless Sensor Networks

Wireless sensor networks are widely adopted in military, civilian and commercial applications, which fuels an exponential explosion of sensory data. However, a major challenge to deploy effective sensing systems is the presence of massive missing entries, measurement noise, and anomaly readings. Existing works assume that sensory data matrices have low-rank structures. This does not hold in reality due to anomaly readings, causing serious performance degradation. In this paper, we introduce an LS-Decomposition approach for robust sensory data recovery, which decomposes a corrupted data matrix as the superposition of a low-rank matrix and a sparse anomaly matrix. First, we prove that LS-Decomposition solves a convex program with bounded approximation error. Second, using data sets from the IntelLab, GreenOrbs, and NBDC-CTD projects, we find that sensory data matrices contain anomaly readings. Third, we propose an accelerated proximal gradient algorithm and prove that it approximates the optimal solution with convergence rate $O(1/k^2)$ ($k$ is the number of iterations). Evaluations on real data sets show that our scheme achieves recovery error $\leq 5\%$ for sampling rate $\geq 50\%$ and almost exact recovery for sampling rate $\geq 60\%$, while state-of-the-art methods have error $10\% \sim 15\%$ at sampling rate $90\%$.

preprint2015arXiv

Hyperbolicity versus non-hyperbolic ergodic measures inside homoclinic classes

We prove that, for $C^1$-generic diffeomorphisms, if a homoclinic class is not hyperbolic, then there is a non-hyperbolic ergodic measure supported on it. This proves a conjecture by Díaz and Gorodetski [28]. We also discuss the conjectured existence of periodic points with different stable dimension in the class.

preprint2015arXiv

Information and Energy Cooperation in OFDM Relaying: Protocols and Optimization

Integrating power transfer into wireless communications for supporting simultaneous wireless information and power transfer (SWIPT) is a promising technique in energy-constrained wireless networks. While most existing work on SWIPT focuses on capacity-energy characterization, the benefits of cooperative transmission for SWIPT are much less investigated. In this paper, we consider SWIPT in an orthogonal frequency-division multiplexing (OFDM) relaying system, where a source node transfers information and a fraction of power simultaneously to a relay node, and the relay node uses the harvested power from the source node to forward the source information to the destination. To support the simultaneous information and energy cooperation, we first propose a transmission protocol assuming that the direct link between the source and destination does not exist, namely power splitting (PS) relaying protocol, where the relay node splits the received signal power in the first hop into two separate parts, one for information decoding and the other for energy harvesting. Then, we consider the case that the direct link between the source and destination is available, and the transmission mode adaptation (TMA) protocol is proposed, where the transmission can be completed by cooperative mode and direct mode simultaneously (over different subcarriers). In direct mode, when the source transmits signal to the destination, the destination receives the signal as information and the relay node concurrently receives the signal for energy harvesting. Joint resource allocation problems are formulated to maximize the system throughput. By using the Lagrangian dual method, we develop efficient algorithms to solve the nonconvex optimization problems.

preprint2015arXiv

On the dominated splitting of Lyapunov stable aperiodic classes

Recent works of J. Yang, S. Crovisier, M. Sambarino and D. Yang related to the Palis conjecture showed that any aperiodic class of a $C^1$-generic diffeomorphism far away from homoclinic bifurcations (or homoclinic tangencies) is partially hyperbolic. We show in this paper that, generically, a non-trivial dominated splitting implies partial hyperbolicity for an aperiodic class if it is Lyapunov stable. More precisely, for $C^1$-generic diffeomorphisms, if a Lyapunov stable aperiodic class has a non-trivial dominated splitting $E\oplus F$, then one of the two bundles is hyperbolic (either $E$ is contracted or $F$ is expanded).

97 published item(s)

A modified Bakry-Émery $Γ_2$ criterion inequality and the monotonicity of the Tsallis entropy

A Novel Algorithm to Solve for an Underwater Line Source Sound Field Based on Coupled Modes and a Spectral Method

A Sharp Inequality Relating Yamabe Invariants on Asymptotically Poincare-Einstein Manifolds with a Ricci Curvature Lower Bound

Hybrid Mechanical and Electronic Beam Steering for Maximizing OAM Channel Capacity

Scaling Blockchains with Error Correction Codes: A Survey on Coded Blockchains

Two stages for visual object tracking

Discovering Multiple Phases of Dynamics by Dissecting Multivariate Time Series

Improved Sobolev inequality under constraints

MIMO OFDM Dual-Function Radar-Communication Under Error Rate and Beampattern Constraints

Privacy-preserving Channel Estimation in Cell-free Hybrid Massive MIMO Systems

Reconfigurable-intelligent-surface-assisted Downlink Transmission Design via Bayesian Optimization

Remark on an inequality for closed hypersurfaces in complete manifolds with nonnegative Ricci curvature

Compressing Recurrent Neural Networks Using Hierarchical Tucker Tensor Decomposition

Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems

DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference

Exploiting Parallelism Opportunities with Deep Learning Frameworks

From learning gait signatures of many individuals to reconstructing gait dynamics of one single individual

GEVO: GPU Code Optimization using Evolutionary Computation

Liouville type theorems on manifolds with nonnegative curvature and strictly convex boundary

Long-term scheduling and power control for wirelessly powered cell-free IoT

mmWave/THz Channel Estimation Using Frequency-Selective Atomic Norm Minimization

Multi-mode OAM Radio Waves: Generation, Angle of Arrival Estimation and Reception With UCAs

On compact Riemannian manifolds with convex boundary and Ricci curvature bounded from below

Progressive Local Filter Pruning for Image Retrieval Acceleration

Real-Time Nonparametric Anomaly Detection in High-Dimensional Settings

Spectral Method for Phase Retrieval: an Expectation Propagation Perspective

The Architectural Implications of Facebook's DNN-based Personalized Recommendation

Wirelessly Powered Cell-free IoT: Analysis and Optimization

RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing

A simple linear space algorithm for computing a longest common increasing subsequence

Beamspace Channel Estimation for Millimeter-Wave Massive MIMO Systems with Lens Antenna Array

Low-tubal-rank Tensor Completion using Alternating Minimization

Modeling and analysis of the electromechanical behavior of surface-bonded piezoelectric actuators using finite element method

Position-aided Large-scale MIMO Channel Estimation for High-Speed Railway Communication Systems

Sequential Hypothesis Test with Online Usage-Constrained Sensor Selection

A Practical O(R\log\log n+n) time Algorithm for Computing the Longest Common Subsequence

An Efficient Dynamic Programming Algorithm for STR-IC-SEQ-EC-LCS Problem

An efficient dynamic programming algorithm for the generalized LCS problem with multiple substring inclusive constraints

An LS-Decomposition Approach for Robust Data Recovery in Wireless Sensor Networks

Hyperbolicity versus non-hyperbolic ergodic measures inside homoclinic classes

Information and Energy Cooperation in OFDM Relaying: Protocols and Optimization

On the dominated splitting of Lyapunov stable aperiodic classes

A note on the largest number of red nodes in red-black trees

Boundary Effect of Ricci Curvature

Cooperative Change Detection for Online Power Quality Monitoring

Distributed Energy Efficient Cross-layer Optimization for Multihop MIMO Cognitive Radio Networks with Primary User Rate Protection

Dynamic Optimization For Heterogeneous Powered Wireless Multimedia Sensor Networks With Correlated Sources and Network Coding

Energy Management and Cross Layer Optimization for Wireless Sensor Network Powered by Heterogeneous Energy Sources

Massive MIMO Multicasting in Noncooperative Cellular Networks

Multiuser Joint Energy-Bandwidth Allocation with Energy Harvesting - Part I: Optimum Algorithm & Multiple Point-to-Point Channels

Multiuser Joint Energy-Bandwidth Allocation with Energy Harvesting - Part II: Multiple Broadcast Channels & Proportional Fairness

On the hyperbolicity of $C^1$-generic homoclinic classes

Online Dating Recommendations: Matching Markets and Learning Preferences

Resource Allocation for Power Minimization in the Downlink of THP-based Spatial Multiplexing MIMO-OFDMA Systems

Self-organized vanadium and nitrogen co-doped titania nanotube arrays with enhanced photocatalytic reduction of CO2 into CH4

Sequential and Decentralized Estimation of Linear Regression Parameters in Wireless Sensor Networks

Sequential Distributed Detection in Energy-Constrained Wireless Sensor Networks

Sequential Joint Detection and Estimation: Optimum Tests and Applications

Sequential Joint Spectrum Sensing and Channel Estimation for Dynamic Spectrum Access

Stochastic Optimal Linear Control of Wireless Networked Control Systems with Delays and Packet Losses

Who is Dating Whom: Characterizing User Behaviors of a Large Online Dating Site

A Dynamic Programming Solution to a Generalized LCS Problem

A new characterization of the CR sphere and the sharp eigenvalue estimate for the Kohn Laplacian

An Efficient Dynamic Programming Algorithm for the Generalized LCS Problem with Multiple Substring Exclusion Constrains

An Obata-type Theorem in CR Geometry

Complete Solutions for a Combinatorial Puzzle in Linear Time

Hybrid Group Decoding for Scalable Video over MIMO-OFDM Downlink Systems

On a remarkable formula of Jerison and Lee in CR geometry

On Finite Block-Length Quantization Distortion

On the Capacity and Degrees of Freedom Regions of MIMO Interference Channels with Limited Receiver Cooperation

On the Capacity Region and the Generalized Degrees of Freedom Region for the MIMO Interference Channel with Feedback

On the isoperimetric constant of symmetric spaces of noncompact type

Optimal Distributed Control for Networked Control Systems with Delays

Optimal Sequential Joint Detection and Estimation