Trust snapshot

Quick read

Trust 21 - Emerging, Verification L1, Unclaimed author
29 works
0 followers
16 topics
4 close collaborators

Published work

29 published item(s)

preprint 2026 arXiv

A modified Bakry-Émery $Γ_2$ criterion inequality and the monotonicity of the Tsallis entropy

The Bakry-Émery $Γ_2$ criterion inequality provides a method for establishing the logarithmic Sobolev inequality. We prove a one-parameter family of weighted Bakry-Émery $Γ_2$ criterion inequalities which in the limit case yields the improved constant due to Ji [Ji24]. Furthermore, we establish a modified weighted $Γ_2$ criterion inequality which could be interpreted as a monotonicity of the Tsallis entropy under the heat flow and yields a family of sharp Sobolev inequalities.

preprint 2022 arXiv

A Novel Algorithm to Solve for an Underwater Line Source Sound Field Based on Coupled Modes and a Spectral Method

A high-precision numerical sound field is the basis of underwater target detection, positioning and communication. A line source in a plane is a common type of sound source in computational ocean acoustics. The exciting waveguide in a range-dependent ocean environment is often structurally complicated; however, traditional algorithms often assume that the waveguide has a simple seabed boundary and that the line source is located at a horizontal range of 0 m, although this ideal situation is rarely encountered in the actual ocean. In this paper, a novel algorithm is designed that can solve for the sound field excited by a line source at any position in a range-dependent ocean environment. The proposed algorithm uses the classic stepwise approximation approach to address the range dependence of the environment and uses the Chebyshev-Tau spectral method to solve for the horizontal wavenumbers and modes of approximately range-independent segments. Once the modal information of these flat segments has been obtained, a global matrix is constructed to solve for the coupling coefficients of all segments, and finally, the complete sound field is synthesized. Numerical experiments using a robust numerical program developed based on this algorithm verify the correctness and usability of our novel algorithm and software. Furthermore, a detailed analysis and test of the computational cost of this algorithm show that it is efficient.

preprint 2022 arXiv

Hybrid Mechanical and Electronic Beam Steering for Maximizing OAM Channel Capacity

Radio frequency-orbital angular momentum (RF-OAM) is a novel approach of multiplexing a set of orthogonal modes on the same frequency channel to achieve high spectrum efficiencies. Since OAM requires precise alignment of the transmit and receive antennas, electronic beam steering has been proposed for the uniform circular array (UCA)-based OAM communication system to circumvent the large performance degradation induced by small antenna misalignment in practical environments. However, in the case of large-angle misalignment, electronic beam steering alone cannot effectively compensate the OAM channel capacity. To solve this problem, we propose a hybrid mechanical and electronic beam steering scheme, in which mechanical rotating devices controlled by pulse width modulation (PWM) signals serve as the execution unit to eliminate the large misalignment angle, while electronic beam steering handles the remaining small misalignment angle caused by perturbations. Furthermore, due to interferometry, the receive signal-to-noise ratios (SNRs) are not uniform across the elements of the receive UCA. Therefore, a rotatable UCA structure is proposed for the OAM receiver to maximize the channel capacity: the simulated annealing algorithm first obtains the optimal rotation angle, then the servo system performs the mechanical rotation, and finally the electronic beam steering is adjusted accordingly. Both mathematical analysis and simulation results validate that the proposed hybrid mechanical and electronic beam steering scheme can effectively eliminate the effect of diverse misalignment errors of any practical OAM channel and maximize the OAM channel capacity.

preprint 2022 arXiv

Scaling Blockchains with Error Correction Codes: A Survey on Coded Blockchains

This paper reviews and highlights how coding schemes have been used to solve various problems in blockchain systems. Specifically, these problems relate to scaling blockchains in terms of their data storage, computation and communication cost, as well as security. To this end, this paper considers the use of coded blocks or shards that allows participants to store only a fraction of the total blockchain, protect against malicious nodes or erasures due to nodes leaving a blockchain system, ensure data availability in order to promote transparency, and scale the security of sharded blockchains. Further, it helps reduce communication cost when disseminating blocks, which is critical to bootstrapping new nodes and helps speed up consensus of blocks. For each category of solutions, we highlight problems and issues that motivated their designs and use of coding. Moreover, we provide a qualitative analysis of their storage, communication and computation cost.
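As a rough illustration of how coded shards protect against erasures, the following minimal sketch uses a single XOR parity shard. This is a generic textbook code chosen for brevity, not a scheme taken from the survey; the function names and the sample shards are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(shards):
    """Append one XOR parity shard to a list of equal-length data shards."""
    return list(shards) + [reduce(xor_bytes, shards)]


def recover(coded, missing):
    """Rebuild the shard at index `missing` by XOR-ing all remaining shards."""
    others = [s for i, s in enumerate(coded) if i != missing]
    return reduce(xor_bytes, others)


# Hypothetical block split into three equal-length shards.
block = [b"tx-a", b"tx-b", b"tx-c"]
coded = encode(block)

# Any single erased shard can be rebuilt from the others.
print(recover(coded, 1) == block[1])
```

A single parity shard tolerates only one erasure; codes that tolerate more losses follow the same store-a-fraction, rebuild-on-demand pattern.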

preprint 2022 arXiv

Two stages for visual object tracking

Siamese-based trackers have achieved promising performance on visual object tracking tasks. Most existing Siamese-based trackers contain two separate branches for tracking: a classification branch and a bounding box regression branch. In addition, image segmentation provides an alternative way to obtain a more accurate target region. In this paper, we propose a novel tracker with two stages: detection and segmentation. The detection stage locates the target using Siamese networks. The segmentation module then produces more accurate tracking results from the coarse state estimate of the first stage. We conduct experiments on four benchmarks. Our approach achieves state-of-the-art results, with an EAO of 52.6% on VOT2016, 51.3% on VOT2018, and 39.0% on VOT2019.

preprint 2021 arXiv

Discovering Multiple Phases of Dynamics by Dissecting Multivariate Time Series

We propose a data-driven approach to dissect multivariate time series in order to discover multiple phases underlying the dynamics of complex systems. This computing approach is developed as a multiple-dimension version of the Hierarchical Factor Segmentation (HFS) technique. This expanded approach proposes a systematic protocol for choosing various extreme events in multi-dimensional space. Upon each chosen event, an empirical distribution of event-recurrence, or waiting time between the excursions, is fitted by a geometric distribution with time-varying parameters. Iterative fittings are performed across all chosen events. We then collect and summarize the local recurrent patterns into a global dynamic mechanism. Clustering is applied for partitioning the whole time period into alternating segments, in which variables are identically distributed. Feature weighting techniques are also considered to compensate for some drawbacks of clustering. Our simulation results show that this expanded approach can even detect systematic differences when the joint distribution varies. In real data experiments, we analyze the relationship among returns, trading volume, and transaction number of a single stock, as well as of multiple stocks, in the S&P500. We can not only map out volatile periods but also provide potential associative links between stocks.

preprint 2021 arXiv

MIMO OFDM Dual-Function Radar-Communication Under Error Rate and Beampattern Constraints

In this work we consider a multiple-input multiple-output (MIMO) dual-function radar-communication (DFRC) system, which senses multiple spatial directions and serves multiple users. Upon resorting to an orthogonal frequency division multiplexing (OFDM) transmission format and a differential phase shift keying (DPSK) modulation, we study the design of the radiated waveforms and of the receive filters employed by the radar and the users. The approach is communication-centric, in the sense that a radar-oriented objective is optimized under constraints on the average transmit power, the power leakage towards specific directions, and the error rate of each user, thus safeguarding the communication quality of service (QoS). We adopt a unified design approach allowing a broad family of radar objectives, including both estimation- and detection-oriented merit functions. We devise a suboptimal solution based on alternating optimization of the involved variables, a convex restriction of the feasible search set, and minorization-maximization, offering a single algorithm for all of the radar merit functions in the considered family. Finally, the performance is inspected through numerical examples.

preprint 2021 arXiv

Privacy-preserving Channel Estimation in Cell-free Hybrid Massive MIMO Systems

We consider a cell-free hybrid massive multiple-input multiple-output (MIMO) system with $K$ users and $M$ access points (APs), each with $N_a$ antennas and $N_r< N_a$ radio frequency (RF) chains. When $K\ll M{N_a}$, efficient uplink channel estimation and data detection with reduced number of pilots can be performed based on low-rank matrix completion. However, such a scheme requires the central processing unit (CPU) to collect received signals from all APs, which may enable the CPU to infer the private information of user locations. We therefore develop and analyze privacy-preserving channel estimation schemes under the framework of differential privacy (DP). As the key ingredient of the channel estimator, two joint differentially private noisy matrix completion algorithms based respectively on Frank-Wolfe iteration and singular value decomposition are presented. We provide an analysis on the tradeoff between the privacy and the channel estimation error. In particular, we show that the estimation error can be mitigated while maintaining the same privacy level by increasing the payload size with fixed pilot size; and the scaling laws of both the privacy-induced and privacy-independent error components in terms of payload size are characterized. Simulation results are provided to further demonstrate the tradeoff between privacy and channel estimation performance.

preprint 2021 arXiv

Reconfigurable-intelligent-surface-assisted Downlink Transmission Design via Bayesian Optimization

This paper investigates the transmission design in the reconfigurable-intelligent-surface (RIS)-assisted downlink system. The channel state information (CSI) is usually difficult to estimate at the base station (BS) when the RIS is not equipped with radio frequency chains. In this paper, we propose a downlink transmission framework with unknown CSI via Bayesian optimization. Since the CSI is not available at the BS, we treat the unknown objective function as a black-box function and take the beamformer, the phase shift, and the receiving filter as the input. Then the objective function is decomposed as the sum of low-dimension subfunctions to reduce the complexity. By re-expressing the power constraint of the BS in spherical coordinates, the original constrained problem is converted into an equivalent unconstrained problem. The users estimate the sum MSE of the training symbols as the objective value and feed it back to the BS. We assume a Gaussian prior of the feedback samples and the next query point is updated by minimizing the constructed acquisition function. Furthermore, this framework can also be applied to the power transfer system and fairness problems. Simulation results validate the effectiveness of the proposed transmission scheme in the downlink data transmission and power transfer.

preprint 2020 arXiv

Compressing Recurrent Neural Networks Using Hierarchical Tucker Tensor Decomposition

Recurrent Neural Networks (RNNs) have been widely used in sequence analysis and modeling. However, when processing high-dimensional data, RNNs typically require very large model sizes, thereby bringing a series of deployment challenges. Although the state-of-the-art tensor decomposition approaches can provide good model compression performance, these existing methods still suffer from some inherent limitations, such as restricted representation capability and insufficient model complexity reduction. To overcome these limitations, in this paper we propose to develop compact RNN models using Hierarchical Tucker (HT) decomposition. HT decomposition brings strong hierarchical structure to the decomposed RNN models, which is very useful and important for enhancing the representation capability. Meanwhile, HT decomposition provides higher storage and computational cost reduction than the existing tensor decomposition approaches for RNN compression. Our experimental results show that, compared with the state-of-the-art compressed RNN models, such as TT-LSTM, TR-LSTM and BT-LSTM, our proposed HT-based LSTM (HT-LSTM) consistently achieves simultaneous and significant increases in both compression ratio and test accuracy on different datasets.

preprint 2020 arXiv

Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems

Large-scale training is important to ensure high performance and accuracy of machine-learning models. At Facebook we use many different models, including computer vision, video and language models. However, in this paper we focus on the deep learning recommendation models (DLRMs), which are responsible for more than 50% of the training demand in our data centers. Recommendation models present unique challenges in training because they exercise not only compute but also memory capacity as well as memory and network bandwidth. As model size and complexity increase, efficiently scaling training becomes a challenge. To address it we design Zion - Facebook's next-generation large-memory training platform that consists of both CPUs and accelerators. Also, we discuss the design requirements of future scale-out training systems.

preprint 2020 arXiv

DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference

Neural personalized recommendation is the cornerstone of a wide collection of cloud services and products, constituting significant compute demand of the cloud infrastructure. Thus, improving the execution efficiency of neural recommendation directly translates into infrastructure capacity saving. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm and system co-design methodology to custom-design systems for recommendation use cases. Leveraging the insights from the recommendation characterization, a new dynamic scheduler, DeepRecSched, is proposed to maximize latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, recommendation model architectures, and underlying hardware systems. By doing so, system throughput is doubled across the eight industry-representative recommendation models. Finally, design, deployment, and evaluation in an at-scale production datacenter show over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.

preprint 2020 arXiv

Exploiting Parallelism Opportunities with Deep Learning Frameworks

State-of-the-art machine learning frameworks support a wide variety of design features to enable a flexible machine learning programming interface and to ease the programmability burden on machine learning developers. Identifying and using a performance-optimal setting in feature-rich frameworks, however, involves a non-trivial amount of performance profiling efforts and often relies on domain-specific knowledge. This paper takes a deep dive into analyzing the performance impact of key design features in a machine learning framework and quantifies the role of parallelism. The observations and insights distill into a simple set of guidelines that one can use to achieve much higher training and inference speedup. Across a diverse set of real-world deep learning models, the evaluation results show that the proposed performance tuning guidelines outperform the Intel and TensorFlow recommended settings by 1.29x and 1.34x, respectively.

preprint 2020 arXiv

From learning gait signatures of many individuals to reconstructing gait dynamics of one single individual

Based on the same databases, we computationally address two seemingly highly related, in fact drastically distinct, questions via computational data-driven algorithms: 1) how to precisely achieve the big task of differentiating gait signatures of many individuals? 2) how to reconstruct an individual's complex gait dynamics in full? Our brains can "effortlessly" resolve the first question, but will definitely fail in the second one, since many fine temporal scale gait patterns surely escape our eyes. Based on accelerometers' 3D gait time series databases, we link the answers toward both questions via multiscale structural dependency within gait dynamics of our musculoskeletal system. Two types of dependency manifestations are explored. We first develop simple algorithmic computing called Principle System-State Analysis (PSSA) for the coarse dependency in implicit forms. PSSA is shown to be able to efficiently classify among many subjects. We then develop a multiscale Local-1st-Global-2nd (L1G2) Coding Algorithm and a landmark computing algorithm. With both algorithms, we can precisely dissect rhythmic gait cycles, and then decompose each cycle into a series of cyclic gait phases. With proper color-coding and stacking, we reconstruct and represent an individual's gait dynamics via a 3D cylinder to collectively reveal universal deterministic and stochastic structural patterns on centisecond (10 milliseconds) scale across all rhythmic cycles. This 3D cylinder can serve as "passtensor" for authentication purposes related to clinical diagnoses and cybersecurity.

preprint 2020 arXiv

GEVO: GPU Code Optimization using Evolutionary Computation

GPUs are a key enabler of the revolution in machine learning and high performance computing, functioning as de facto co-processors to accelerate large-scale computation. As the programming stack and tool support have matured, GPUs have also become accessible to programmers, who may lack detailed knowledge of the underlying architecture and fail to fully leverage the GPU's computation power. GEVO (Gpu optimization using EVOlutionary computation) is a tool for automatically discovering optimization opportunities and tuning the performance of GPU kernels in the LLVM representation. GEVO uses population-based search to find edits to GPU code compiled to LLVM-IR and improves performance on desired criteria while retaining required functionality. We demonstrate that GEVO improves the execution time of the GPU programs in the Rodinia benchmark suite and the machine learning models, SVM and ResNet18, on NVIDIA Tesla P100. For the Rodinia benchmarks, GEVO improves GPU kernel runtime performance by an average of 49.48% and by as much as 412% over the fully compiler-optimized baseline. If kernel output accuracy is relaxed to tolerate up to 1% error, GEVO can find kernel variants that outperform the baseline version by an average of 51.08%. For the machine learning workloads, GEVO achieves kernel performance improvement for SVM on the MNIST handwriting recognition (3.24X) and the a9a income prediction (2.93X) datasets with no loss of model accuracy. GEVO achieves 1.79X kernel performance improvement on image classification using ResNet18/CIFAR-10, with less than 1% model accuracy reduction.

preprint 2020 arXiv

Liouville type theorems on manifolds with nonnegative curvature and strictly convex boundary

We prove some Liouville type theorems on smooth compact Riemannian manifolds with nonnegative sectional curvature and strictly convex boundary. This gives a nonlinear generalization in low dimension of the recent sharp lower bound of the first Steklov eigenvalue by Xia-Xiong and verifies partially a conjecture by the third author. As a consequence, we derive several sharp Sobolev trace inequalities on these manifolds.

preprint 2020 arXiv

Long-term scheduling and power control for wirelessly powered cell-free IoT

We investigate the long-term scheduling and power control scheme for a wirelessly powered cell-free Internet of Things (IoT) network which consists of distributed access points (APs) and a large number of sensors. In each time slot, a subset of sensors is scheduled for uplink data transmission or downlink power transfer. Through asymptotic analysis, we obtain closed-form expressions for the harvested energy and the achievable rates that are independent of random pilots. Then, using these expressions, we formulate a long-term scheduling and power control problem to maximize the minimum time average achievable rate among all sensors, while maintaining the battery state of each sensor higher than a predefined minimum level. Using Lyapunov optimization, the transmission mode, the active sensor set, and the power control coefficients for each time slot are jointly determined. Finally, simulation results validate the accuracy of our derived closed-form expressions and reveal that the minimum time average achievable rate is boosted significantly by the proposed scheme compared with the simple greedy transmission scheme.

preprint 2020 arXiv

mmWave/THz Channel Estimation Using Frequency-Selective Atomic Norm Minimization

We propose a MIMO channel estimation method for millimeter-wave (mmWave) and terahertz (THz) systems based on frequency-selective atomic norm minimization (FS-ANM). Owing to the strong line-of-sight property of the channel in such high-frequency bands, prior knowledge of the ranges of angles of departure/arrival (AoD/AoA) can be obtained and exploited by the proposed channel estimator to improve the estimation accuracy. Simulation results show that the proposed method achieves considerable performance gain compared with existing approaches that do not incorporate the strong line-of-sight property.

preprint 2020 arXiv

Multi-mode OAM Radio Waves: Generation, Angle of Arrival Estimation and Reception With UCAs

Orbital angular momentum (OAM) at radio frequency (RF) provides a novel approach of multiplexing a set of orthogonal modes on the same frequency channel to achieve high spectrum efficiencies. However, there are still big challenges in the multi-mode OAM generation, OAM antenna alignment and OAM signal reception. To solve these problems, we propose an overall scheme of the line-of-sight multi-carrier and multi-mode OAM (LoS MCMM-OAM) communication based on uniform circular arrays (UCAs). First, we verify that a UCA can generate multi-mode OAM radio beams with both the RF analog synthesis method and the baseband digital synthesis method. Then, for the considered UCA-based LoS MCMM-OAM communication system, a distance and AoA estimation method is proposed based on the two-dimensional ESPRIT (2-D ESPRIT) algorithm. A salient feature of the proposed LoS MCMM-OAM and LoS MCMM-OAM-MIMO systems is that the channel matrices are completely characterized by three parameters, namely, the azimuth angle, the elevation angle and the distance, independent of the numbers of subcarriers and antennas, which significantly reduces the burden by avoiding the estimation of large channel matrices required in traditional MIMO-OFDM systems. After that, we propose an OAM reception scheme including the beam steering with the estimated AoA and the amplitude detection with the estimated distance. Finally, the proposed methods are extended to the LoS MCMM-OAM-MIMO system equipped with uniform concentric circular arrays (UCCAs). Both mathematical analysis and simulation results validate that the proposed OAM reception scheme can eliminate the effect of the misalignment error of a practical OAM channel and approaches the performance of an ideally aligned OAM channel.
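To make the idea of orthogonal OAM modes on a UCA concrete, here is a minimal sketch of the standard phase-progression excitation, in which element n of an N-element array is fed with phase equal to the mode number times its azimuth angle. The abstract does not spell out its synthesis formulas, so this textbook weighting, the function name and the array size are assumptions for illustration.

```python
import numpy as np


def oam_weights(mode: int, num_elements: int) -> np.ndarray:
    """Unit-norm UCA excitation with phase mode * azimuth on each element."""
    azimuth = 2 * np.pi * np.arange(num_elements) / num_elements
    return np.exp(1j * mode * azimuth) / np.sqrt(num_elements)


N = 8  # hypothetical number of array elements
w1 = oam_weights(1, N)
w2 = oam_weights(2, N)

# Inner products: near 1 for the same mode, near 0 for distinct modes.
print(abs(np.vdot(w1, w1)))
print(abs(np.vdot(w1, w2)))
```

Distinct modes (modulo N) give orthogonal weight vectors, which is what lets several modes share one frequency channel in the idealized, perfectly aligned case.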

preprint 2020 arXiv

Progressive Local Filter Pruning for Image Retrieval Acceleration

This paper focuses on network pruning for image retrieval acceleration. Prevailing image retrieval works target discriminative feature learning, while little attention is paid to accelerating model inference, which should be taken into consideration in real-world practice. The challenge of pruning image retrieval models is that the middle-level feature should be preserved as much as possible. Such different requirements of retrieval and classification models make traditional pruning methods poorly suited to our task. To solve the problem, we propose a new Progressive Local Filter Pruning (PLFP) method for image retrieval acceleration. Specifically, layer by layer, we analyze the local geometric properties of each filter and select the one that can be replaced by the neighbors. Then we progressively prune the filter by gradually changing the filter weights. In this way, the representation ability of the model is preserved. To verify this, we evaluate our method on two widely-used image retrieval datasets, i.e., Oxford5k and Paris6K, and one person re-identification dataset, i.e., Market-1501. The proposed method achieves superior performance to conventional pruning methods, suggesting its effectiveness for image retrieval.

preprint 2020 arXiv

Real-Time Nonparametric Anomaly Detection in High-Dimensional Settings

Timely detection of abrupt anomalies is crucial for real-time monitoring and security of modern systems producing high-dimensional data. With this goal, we propose effective and scalable algorithms. The proposed algorithms are nonparametric as both the nominal and anomalous multivariate data distributions are assumed unknown. We extract useful univariate summary statistics and perform anomaly detection in a single-dimensional space. We model anomalies as persistent outliers and propose to detect them via a cumulative sum-like algorithm. In case the observed data have a low intrinsic dimensionality, we learn a submanifold in which the nominal data are embedded and evaluate whether the sequentially acquired data persistently deviate from the nominal submanifold. Further, in the general case, we learn an acceptance region for nominal data via Geometric Entropy Minimization and evaluate whether the sequentially observed data persistently fall outside the acceptance region. We provide an asymptotic lower bound and an asymptotic approximation for the average false alarm period of the proposed algorithm. Moreover, we provide a sufficient condition to asymptotically guarantee that the decision statistic of the proposed algorithm does not diverge in the absence of anomalies. Experiments illustrate the effectiveness of the proposed schemes in quick and accurate anomaly detection in high-dimensional settings.
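The notion of detecting persistent outliers with a cumulative sum-like statistic can be sketched as follows. This is a generic one-sided CUSUM-style recursion over a stream of univariate outlier scores, not the paper's exact decision statistic; the `drift` and `threshold` parameters and the sample scores are hypothetical.

```python
from typing import Iterable, Optional


def cusum_alarm(scores: Iterable[float], drift: float, threshold: float) -> Optional[int]:
    """Return the index of the first alarm, or None if no alarm is raised."""
    stat = 0.0
    for t, score in enumerate(scores):
        # Accumulate evidence of an anomaly; clipping at zero keeps the
        # statistic small while scores stay below the drift level.
        stat = max(0.0, stat + score - drift)
        if stat >= threshold:
            return t
    return None


# Hypothetical outlier scores: low under nominal data, persistently high after a change.
nominal = [0.1, 0.3, 0.2, 0.0, 0.4]
anomalous = [1.5, 1.7, 1.6, 1.8]

print(cusum_alarm(nominal + anomalous, drift=0.5, threshold=3.0))
```

An isolated high score raises the statistic only briefly, while a run of high scores pushes it over the threshold, which is what separates persistent anomalies from one-off outliers.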

preprint 2020 arXiv

Spectral Method for Phase Retrieval: an Expectation Propagation Perspective

Phase retrieval refers to the problem of recovering a signal $\mathbf{x}_{\star}\in\mathbb{C}^n$ from its phaseless measurements $y_i=|\mathbf{a}_i^{\mathrm{H}}\mathbf{x}_{\star}|$, where $\{\mathbf{a}_i\}_{i=1}^m$ are the measurement vectors. Many popular phase retrieval algorithms are based on the following two-step procedure: (i) initialize the algorithm based on a spectral method, (ii) refine the initial estimate by a local search algorithm (e.g., gradient descent). The quality of the spectral initialization step can have a major impact on the performance of the overall algorithm. In this paper, we focus on the model where the measurement matrix $\mathbf{A}=[\mathbf{a}_1,\ldots,\mathbf{a}_m]^{\mathrm{H}}$ has orthonormal columns, and study the spectral initialization under the asymptotic setting $m,n\to\infty$ with $m/n\to δ\in(1,\infty)$. We use the expectation propagation framework to characterize the performance of spectral initialization for Haar distributed matrices. Our numerical results confirm that the predictions of the EP method are accurate not only for Haar distributed matrices, but also for realistic Fourier based models (e.g., the coded diffraction model). The main findings of this paper are the following: (1) There exists a threshold on $δ$ (denoted as $δ_{\mathrm{weak}}$) below which the spectral method cannot produce a meaningful estimate. We show that $δ_{\mathrm{weak}}=2$ for the column-orthonormal model. In contrast, previous results by Mondelli and Montanari show that $δ_{\mathrm{weak}}=1$ for the i.i.d. Gaussian model. (2) The optimal design for the spectral method coincides with that for the i.i.d. Gaussian model, where the latter was recently introduced by Luo, Alghamdi and Lu.
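The spectral initialization step in (i) can be sketched as taking the leading eigenvector of a data-weighted covariance matrix. The sketch below makes two simplifying assumptions that are not from the paper: it uses the plain weighting $y_i^2$ rather than an optimized preprocessing function, and it draws i.i.d. complex Gaussian measurement vectors rather than a column-orthonormal matrix. The function name, dimensions and seed are hypothetical.

```python
import numpy as np


def spectral_init(A: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Leading eigenvector of D = (1/m) * sum_i y_i^2 a_i a_i^H.

    Row i of A holds a_i^H, so the measurements are y = |A @ x|.
    """
    m = A.shape[0]
    D = (A.conj().T * (y ** 2)) @ A / m    # Hermitian n-by-n matrix
    eigvals, eigvecs = np.linalg.eigh(D)   # eigenvalues in ascending order
    return eigvecs[:, -1]


rng = np.random.default_rng(0)
n, m = 10, 4000  # assumed sizes, far above the thresholds discussed in the text

# Unit-norm complex signal and i.i.d. complex Gaussian measurements (assumption).
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
x /= np.linalg.norm(x)
A = (rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))) / np.sqrt(2)
y = np.abs(A @ x)

x_hat = spectral_init(A, y)
corr = abs(np.vdot(x_hat, x))  # invariant to the unrecoverable global phase
print(corr)
```

The estimate is compared through the magnitude of the inner product because phaseless measurements determine the signal only up to a global phase; a local search such as gradient descent would then refine `x_hat`.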

preprint 2020 arXiv

The Architectural Implications of Facebook's DNN-based Personalized Recommendation

The widespread application of deep learning has changed the landscape of computation in the data center. In particular, personalized recommendation for content ranking is now largely accomplished leveraging deep neural networks. However, despite the importance of these models and the amount of compute cycles they consume, relatively little research attention has been devoted to systems for recommendation. To facilitate research and to advance the understanding of these workloads, this paper presents a set of real-world, production-scale DNNs for personalized recommendation coupled with relevant performance metrics for evaluation. In addition to releasing a set of open-source workloads, we conduct in-depth analysis that underpins future system design and optimization for at-scale recommendation: Inference latency varies by 60% across three Intel server generations, batching and co-location of inferences can drastically improve latency-bounded throughput, and the diverse composition of recommendation models leads to different optimization strategies.

preprint 2020 arXiv

Wirelessly Powered Cell-free IoT: Analysis and Optimization

In this paper, we propose a wirelessly powered Internet of Things (IoT) system based on the cell-free massive MIMO technology. In such a system, during the downlink phase, the sensors harvest radio-frequency (RF) energy emitted by the distributed access points (APs). During the uplink phase, sensors transmit data to the APs using the harvested energy. Collocated massive MIMO and small-cell IoT can be treated as special cases of cell-free IoT. We derive the tight closed-form lower bound on the amount of harvested energy, and the closed-form expression of SINR as the metrics of power transfer and data transmission, respectively. To improve the energy efficiency, we jointly optimize the uplink and downlink power control coefficients to minimize the total transmit energy consumption while meeting the target SINRs. Extended simulation results show that cell-free IoT outperforms collocated massive MIMO and small-cell IoT both in terms of the per user throughput for uplink, and the amount of energy harvested for downlink. Moreover, significant gains can be achieved by the proposed joint power control in terms of both per user throughput and energy consumption.

preprint 2019 arXiv

RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing

Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator- and data-level parallelism lead to memory bandwidth saturation, limiting recommendation inference performance. We propose RecNMP which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models. RecNMP is specifically tailored to production environments with heavy co-location of operators on a single server. Several hardware/software co-optimization techniques such as memory-side caching, table-aware packet scheduling, and hot entry profiling are studied, resulting in up to 9.8x memory latency speedup over a highly-optimized baseline. Overall, RecNMP offers 4.2x throughput improvement and 45.8% memory energy savings.