Source author record

Malik Magdon-Ismail

Malik Magdon-Ismail appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Machine Learning; Data Structures and Algorithms; Social and Information Networks; Discrete Mathematics; astro-ph.GA; Information Theory; math.CO; math.IT; Multiagent Systems; physics.soc-ph; Artificial Intelligence; Computational Complexity; Computational Engineering, Finance, and Science; math.NA; Populations and Evolution; Computation; Numerical Analysis; q-fin.CP; q-fin.TR

Catalog footprint

What is connected

35 works

19 topics

4 close collaborators

preprint 2021 arXiv

An Algorithm for Reconstructing the Orphan Stream Progenitor with MilkyWay@home Volunteer Computing

We have developed a method for estimating the properties of the progenitor dwarf galaxy from the tidal stream of stars that were ripped from it as it fell into the Milky Way. In particular, we show that the mass and radial profile of a progenitor dwarf galaxy evolved along the orbit of the Orphan Stream, including the stellar and dark matter components, can be reconstructed from the distribution of stars in the tidal stream it produced. We use MilkyWay@home, a PetaFLOPS-scale distributed supercomputer, to optimize our dwarf galaxy parameters until we arrive at best-fit parameters. The algorithm fits the dark matter mass, dark matter radius, stellar mass, radial profile of stars, and orbital time. The parameters are recovered even though the dark matter component extends well past the half light radius of the dwarf galaxy progenitor, proving that we are able to extract information about the dark matter halos of dwarf galaxies from the tidal debris. Our simulations assumed that the Milky Way potential, dwarf galaxy orbit, and the form of the density model for the dwarf galaxy were known exactly; more work is required to evaluate the sources of systematic error in fitting real data. This method can be used to estimate the dark matter content in dwarf galaxies without the assumption of virial equilibrium that is required to estimate the mass using line-of-sight velocities. This demonstration is a first step towards building an infrastructure that will fit the Milky Way potential using multiple tidal streams.

preprint 2020 arXiv

A New Mathematical Model for Controlled Pandemics Like COVID-19 : AI Implemented Predictions

We present a new mathematical model to explicitly capture the effects that three restriction measures (the lockdown date and duration; social distancing and masks; and school and border closings) have in controlling the spread of COVID-19 infections $i(r, t)$. Before restrictions were introduced, the random spread of infections as described by the SEIR model grew exponentially. The addition of control measures introduces a mixing of order and disorder in the system's evolution, which falls under a different mathematical class of models that can eventually lead to critical phenomena. A generic analytical solution is hard to obtain. We use machine learning to solve the new equations for $i(r,t)$, the infections $i$ in any region $r$ at time $t$, and derive predictions for the spread of infections over time as a function of the strength of the specific measure taken and its duration. The machine is trained on all of the published COVID-19 data for each region, county, state, and country in the world. It uses optimization to learn the best-fit values of the model's parameters from past data in each region in the world, and it updates the predicted infection curves for any future restrictions that may be added or relaxed anywhere. We hope this interdisciplinary effort, a new mathematical model that predicts the impact of each measure in slowing down infection spread combined with the solving power of machine learning, is a useful tool in the fight against the current pandemic and potentially future ones.

preprint 2020 arXiv

Inferring Degrees from Incomplete Networks and Nonlinear Dynamics

Inferring topological characteristics of complex networks from observed data is critical to understanding the dynamical behavior of networked systems, ranging from the Internet and the World Wide Web to biological networks and social networks. Prior studies usually focus on structure-based estimation to infer network sizes, degree distributions, average degrees, and more. Little effort has been made to estimate the specific degree of each vertex from a sampled induced graph, which prevents us from measuring the lethality of nodes in protein networks and identifying influencers in social networks. The current approaches fail dramatically for a tiny sampled induced graph and require a specific sampling method and a large sample size. These approaches neglect information about the vertex state, representing the dynamical behavior of the networked system, such as the biomass of species or expression of a gene, which is useful for degree estimation. We fill this gap by developing a framework to infer individual vertex degrees using information from both the sampled topology and the vertex state. We combine mean-field theory with combinatorial optimization to learn vertex degrees. Experimental results on real networks with a variety of dynamics demonstrate that our framework can produce reliable degree estimates and dramatically improve existing link prediction methods by replacing the sampled degrees with our estimated degrees.

preprint 2020 arXiv

Machine Learning the Phenomenology of COVID-19 From Early Infection Dynamics

We present a robust data-driven machine learning analysis of the COVID-19 pandemic from its early infection dynamics, specifically infection counts over time. The goal is to extract actionable public health insights. These insights include the infectious force, the rate of a mild infection becoming serious, estimates for asymptomatic infections and predictions of new infections over time. We focus on USA data starting from the first confirmed infection on January 20, 2020. Our methods reveal significant asymptomatic (hidden) infection, a lag of about 10 days, and we quantitatively confirm that the infectious force is strong with about a 0.14% transition from mild to serious infection. Our methods are efficient, robust and general, being agnostic to the specific virus and applicable to different populations or cohorts.

preprint 2020 arXiv

True Nonlinear Dynamics from Incomplete Networks

We study nonlinear dynamics on complex networks. Each vertex $i$ has a state $x_i$ which evolves according to a networked dynamics to a steady-state $x_i^*$. We develop fundamental tools to learn the true steady-state of a small part of the network, without knowing the full network. A naive approach and the current state-of-the-art is to follow the dynamics of the observed partial network to local equilibrium. This dramatically fails to extract the true steady state. We use a mean-field approach to map the dynamics of the unseen part of the network to a single node, which allows us to recover accurate estimates of steady-state on as few as 5 observed vertices in domains ranging from ecology to social networks to gene regulation. Incomplete networks are the norm in practice, and we offer new ways to think about nonlinear dynamics when only sparse information is available.

preprint 2016 arXiv

Node-By-Node Greedy Deep Learning for Interpretable Features

Multilayer networks have seen a resurgence under the umbrella of deep learning. Current deep learning algorithms train the layers of the network sequentially, improving algorithmic performance as well as providing some regularization. We present a new training algorithm for deep networks which trains \emph{each node in the network} sequentially. Our algorithm is orders of magnitude faster, creates more interpretable internal representations at the node level, while not sacrificing on the ultimate out-of-sample performance.

preprint 2015 arXiv

Approximating Sparse PCA from Incomplete Data

We study how well one can recover sparse principal components of a data matrix using a sketch formed from a few of its elements. We show that for a wide class of optimization problems, if the sketch is close (in the spectral norm) to the original data matrix, then one can recover a near optimal solution to the optimization problem by using the sketch. In particular, we use this approach to obtain sparse principal components and show that for \math{m} data points in \math{n} dimensions, \math{O(ε^{-2}\tilde k\max\{m,n\})} elements gives an \math{ε}-additive approximation to the sparse PCA problem (\math{\tilde k} is the stable rank of the data matrix). We demonstrate our algorithms extensively on image, text, biological and financial data. The results show that not only are we able to recover the sparse PCAs from the incomplete data, but by using our sparse sketch, the running time drops by a factor of five or more.

preprint 2015 arXiv

Column Selection via Adaptive Sampling

Selecting a good column (or row) subset of massive data matrices has found many applications in data analysis and machine learning. We propose a new adaptive sampling algorithm that can be used to improve any relative-error column selection algorithm. Our algorithm delivers a tighter theoretical bound on the approximation error which we also demonstrate empirically using two well known relative-error column subset selection algorithms. Our experimental results on synthetic and real-world data show that our algorithm outperforms non-adaptive sampling as well as prior adaptive sampling approaches.

preprint 2015 arXiv

Extracting Hidden Groups and their Structure from Streaming Interaction Data

When actors in a social network interact, it usually means they have some general goal towards which they are collaborating. This could be a research collaboration in a company or a foursome planning a golf game. We call such groups \emph{planning groups}. In many social contexts, it might be possible to observe the \emph{dyadic interactions} between actors, even if the actors do not explicitly declare what groups they belong to. When groups are not explicitly declared, we call them \emph{hidden groups}. Our particular focus is hidden planning groups. By virtue of their need to further their goal, the actors within such groups must interact in a manner which differentiates their communications from random background communications. In such a case, one can infer (from these interactions) the composition and structure of the hidden planning groups. We formulate the problem of hidden group discovery from streaming interaction data, and we propose efficient algorithms for identifying the hidden group structures by isolating the hidden group's non-random, planning-related communications from the random background communications. We validate our algorithms on real data (the Enron email corpus and Blog communication data). Analysis of the results reveals that our algorithms extract meaningful hidden group structures.

preprint 2015 arXiv

Feature Selection for Linear SVM with Provable Guarantees

We give two provably accurate feature-selection techniques for the linear SVM. The algorithms run in deterministic and randomized time respectively. Our algorithms can be used in an unsupervised or supervised setting. The supervised approach is based on sampling features from support vectors. We prove that the margin in the feature space is preserved to within $ε$-relative error of the margin in the full feature space in the worst-case. In the unsupervised setting, we also provide worst-case guarantees of the radius of the minimum enclosing ball, thereby ensuring comparable generalization as in the full feature space and resolving an open problem posed in Dasgupta et al. We present extensive experiments on real-world datasets to support our theory and to demonstrate that our method is competitive and often better than prior state-of-the-art, for which there are no known provable guarantees.

preprint 2015 arXiv

NP-Hardness and Inapproximability of Sparse PCA

We give a reduction from {\sc clique} to establish that sparse PCA is NP-hard. The reduction has a gap which we use to exclude an FPTAS for sparse PCA (unless P=NP). Under weaker complexity assumptions, we also exclude polynomial constant-factor approximation algorithms.

preprint 2015 arXiv

Optimal Sparse Linear Auto-Encoders and Sparse PCA

Principal components analysis (PCA) is the optimal linear auto-encoder of data, and it is often used to construct features. Enforcing sparsity on the principal components can promote better generalization, while improving the interpretability of the features. We study the problem of constructing optimal sparse linear auto-encoders. Two natural questions in such a setting are: i) Given a level of sparsity, what is the best approximation to PCA that can be achieved? ii) Are there low-order polynomial-time algorithms which can asymptotically achieve this optimal tradeoff between the sparsity and the approximation quality? In this work, we answer both questions by giving efficient low-order polynomial-time algorithms for constructing asymptotically \emph{optimal} linear auto-encoders (in particular, sparse features with near-PCA reconstruction error) and demonstrate the performance of our algorithms on real data.

preprint 2015 arXiv

Recovering PCA from Hybrid-$(\ell_1,\ell_2)$ Sparse Sampling of Data Elements

This paper addresses how well we can recover a data matrix when only given a few of its elements. We present a randomized algorithm that element-wise sparsifies the data, retaining only a few of its elements. Our new algorithm independently samples the data using sampling probabilities that depend on both the squares ($\ell_2$ sampling) and absolute values ($\ell_1$ sampling) of the entries. We prove that the hybrid algorithm recovers a near-PCA reconstruction of the data from a sublinear sample-size: hybrid-($\ell_1,\ell_2$) inherits the $\ell_2$-ability to sample the important elements as well as the regularization properties of $\ell_1$ sampling, and gives strictly better performance than either $\ell_1$ or $\ell_2$ on their own. We also give a one-pass version of our algorithm and show experiments to corroborate the theory.

preprint 2014 arXiv

Faster SVD-Truncated Least-Squares Regression

We develop a fast algorithm for computing the "SVD-truncated" regularized solution to the least-squares problem: $\min_{x} \|Ax - b\|_2$. Let $A_k$ of rank $k$ be the best rank-$k$ matrix computed via the SVD of $A$. Then, the SVD-truncated regularized solution is: $x_k = A_k^{+} b$. If $A$ is $m \times n$, then it takes $O(mn\min\{m,n\})$ time to compute $x_k$ using the SVD of $A$. We give an approximation algorithm for $x_k$ which constructs a rank-$k$ approximation $\tilde{A}_k$ and computes $\tilde{x}_k = \tilde{A}_k^{+} b$ in roughly $O(\mathrm{nnz}(A)\, k \log n)$ time. Our algorithm uses a randomized variant of the subspace iteration. We show that, with high probability: $\|A\tilde{x}_k - b\|_2 \approx \|Ax_k - b\|_2$ and $\|x_k - \tilde{x}_k\|_2 \approx 0$.

preprint 2014 arXiv

MilkyWay@home: Harnessing volunteer computers to constrain dark matter in the Milky Way

MilkyWay@home is a volunteer computing project that allows people from every country in the world to volunteer their otherwise idle processors to Milky Way research. Currently, more than 25,000 people (150,000 since November 9, 2007) contribute about half a PetaFLOPS of computing power to our project. We currently run two types of applications: one application fits the spatial density profile of tidal streams using statistical photometric parallax, and the other application finds the N-body simulation parameters that produce tidal streams that best match the measured density profile of known tidal streams. The stream fitting application is well developed and is producing published results. The Sagittarius dwarf leading tidal tail has been fit, and the algorithm is currently running on the trailing tidal tail and bifurcated pieces. We will soon have a self-consistent model for the density of the smooth component of the stellar halo and the largest tidal streams. The $N$-body application has been implemented for fitting dwarf galaxy progenitor properties only, and is in the testing stages. We use an Earth-Mover Distance method to measure goodness-of-fit for density of stars along the tidal stream. We will add additional spatial dimensions as well as kinematic measures in a piecemeal fashion, with the eventual goal of fitting the orbit and parameters of the Milky Way potential (and thus the density distribution of dark matter) using multiple tidal streams.
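As a rough illustration of the goodness-of-fit idea mentioned above, the sketch below computes a one-dimensional Earth-Mover Distance between two binned star-count histograms along a stream. It assumes equal-width bins shared by both histograms and normalizes each to unit mass; the function name `emd_1d` and the sample counts are hypothetical, and the project's actual implementation may differ.

```python
def emd_1d(observed, simulated):
    """1-D Earth-Mover Distance between two histograms on the same bins.

    Both histograms are normalized to unit mass (an assumption of this
    sketch), and distance is measured in units of one bin width.
    """
    if len(observed) != len(simulated):
        raise ValueError("histograms must share the same bins")
    total_obs = float(sum(observed))
    total_sim = float(sum(simulated))
    if total_obs <= 0 or total_sim <= 0:
        raise ValueError("histograms must have positive total mass")

    distance = 0.0
    carried = 0.0  # surplus mass that still has to be moved to later bins
    for o, s in zip(observed, simulated):
        carried += o / total_obs - s / total_sim
        distance += abs(carried)
    return distance


# Hypothetical star counts in four bins along the stream.
data_counts = [10, 30, 40, 20]
model_counts = [20, 40, 30, 10]
print(emd_1d(data_counts, model_counts))
```

A smaller value means the simulated density profile needs less mass moved to match the observed one, which is what makes it usable as a fitness score for an optimizer.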

preprint 2014 arXiv

Random Projections for Linear Support Vector Machines

Let X be a data matrix of rank ρ, whose rows represent n points in d-dimensional space. The linear support vector machine constructs a hyperplane separator that maximizes the 1-norm soft margin. We develop a new oblivious dimension reduction technique which is precomputed and can be applied to any input matrix X. We prove that, with high probability, the margin and minimum enclosing ball in the feature space are preserved to within ε-relative error, ensuring comparable generalization as in the original space in the case of classification. For regression, we show that the margin is preserved to ε-relative error with high probability. We present extensive experiments with real and synthetic data to support our theory.

preprint 2014 arXiv

The Fast Cauchy Transform and Faster Robust Linear Regression

We provide fast algorithms for overconstrained $\ell_p$ regression and related problems: for an $n\times d$ input matrix $A$ and vector $b\in\mathbb{R}^n$, in $O(nd\log n)$ time we reduce the problem $\min_{x\in\mathbb{R}^d} \|Ax-b\|_p$ to the same problem with input matrix $\tilde A$ of dimension $s \times d$ and corresponding $\tilde b$ of dimension $s\times 1$. Here, $\tilde A$ and $\tilde b$ are a coreset for the problem, consisting of sampled and rescaled rows of $A$ and $b$; and $s$ is independent of $n$ and polynomial in $d$. Our results improve on the best previous algorithms when $n\gg d$, for all $p\in[1,\infty)$ except $p=2$. We also provide a suite of improved results for finding well-conditioned bases via ellipsoidal rounding, illustrating tradeoffs between running time and conditioning quality, including a one-pass conditioning algorithm for general $\ell_p$ problems. We also provide an empirical evaluation of implementations of our algorithms for $p=1$, comparing them with related algorithms. Our empirical results show that, in the asymptotic regime, the theory is a very good guide to the practical performance of these algorithms. Our algorithms use our faster constructions of well-conditioned bases for $\ell_p$ spaces and, for $p=1$, a fast subspace embedding of independent interest that we call the Fast Cauchy Transform: a distribution over matrices $Π:\mathbb{R}^n\mapsto \mathbb{R}^{O(d\log d)}$, found obliviously to $A$, that approximately preserves the $\ell_1$ norms: that is, with large probability, simultaneously for all $x$, $\|Ax\|_1 \approx \|ΠAx\|_1$, with distortion $O(d^{2+η})$, for an arbitrarily small constant $η>0$; and, moreover, $ΠA$ can be computed in $O(nd\log d)$ time. The techniques underlying our Fast Cauchy Transform include fast Johnson-Lindenstrauss transforms, low-coherence matrices, and rescaling by Cauchy random variables.

preprint 2013 arXiv

A note on sparse least-squares regression

We compute a \emph{sparse} solution to the classical least-squares problem $\min_x||A x -b||,$ where $A$ is an arbitrary matrix. We describe a novel algorithm for this sparse least-squares problem. The algorithm operates as follows: first, it selects columns from $A$, and then solves a least-squares problem only with the selected columns. The column selection algorithm that we use is known to perform well for the well studied column subset selection problem. The contribution of this article is to show that it gives favorable results for sparse least-squares as well. Specifically, we prove that the solution vector obtained by our algorithm is close to the solution vector obtained via what is known as the "SVD-truncated regularization approach".

preprint 2013 arXiv

A Spatial Characterization of the Sagittarius Dwarf Galaxy Tidal Tails

We measure the spatial density of F turnoff stars in the Sagittarius dwarf tidal stream, from Sloan Digital Sky Survey (SDSS) data, using statistical photometric parallax. We find a set of continuous, consistent parameters that describe the leading Sgr stream's position, direction, and width for 15 stripes in the North Galactic Cap, and 3 stripes in the South Galactic Cap. We produce a catalog of stars that has the density characteristics of the dominant leading Sgr tidal stream that can be compared with simulations. We find that the width of the leading (North) tidal tail is consistent with recent triaxial and axisymmetric halo model simulations. The density along the stream is roughly consistent with common disruption models in the North, but possibly not in the South. We explore the possibility that one or more of the dominant Sgr streams has been mis-identified, and that one or more of the `bifurcated' pieces is the real Sgr tidal tail, but we do not reach definite conclusions. If two dwarf progenitors are assumed, fits to the planes of the dominant and `bifurcated' tidal tails favor an association of the Sgr dwarf spheroidal galaxy with the dominant Southern stream and the `bifurcated' stream in the North. In the North Galactic Cap, the best fit Hernquist density profile for the smooth component of the stellar halo is oblate, with a flattening parameter q = 0.53, and a scale length of r_0 = 6.73. The Southern data for both the tidal debris and the smooth component of the stellar halo do not match the model fits to the North, although the stellar halo is still overwhelmingly oblate. Finally, we verify that we can reproduce the parameter fits on the asynchronous MilkyWay@home volunteer computing platform.

preprint 2013 arXiv

Deterministic Feature Selection for $k$-means Clustering

We study feature selection for $k$-means clustering. Although the literature contains many methods with good empirical performance, algorithms with provable theoretical behavior have only recently been developed. Unfortunately, these algorithms are randomized and fail with, say, a constant probability. We address this issue by presenting a deterministic feature selection algorithm for k-means with theoretical guarantees. At the heart of our algorithm lies a deterministic method for decompositions of the identity.

preprint 2013 arXiv

Near-Optimal Column-Based Matrix Reconstruction

We consider low-rank reconstruction of a matrix using its columns and we present asymptotically optimal algorithms for both spectral norm and Frobenius norm reconstruction. The main tools we introduce to obtain our results are: (i) the use of fast approximate SVD-like decompositions for column reconstruction, and (ii) two deterministic algorithms for selecting rows from matrices with orthonormal columns, building upon the sparse representation theorem for decompositions of the identity that appeared in \cite{BSS09}.

preprint 2013 arXiv

Near-optimal Coresets For Least-Squares Regression

We study (constrained) least-squares regression as well as multiple response least-squares regression and ask the question of whether a subset of the data, a coreset, suffices to compute a good approximate solution to the regression. We give deterministic, low order polynomial-time algorithms to construct such coresets with approximation guarantees, together with lower bounds indicating that there is not much room for improvement upon our results.

preprint 2013 arXiv

Seeding Influential Nodes in Non-Submodular Models of Information Diffusion

We consider the model of information diffusion in social networks from \cite{Hui2010a} which incorporates trust (weighted links) between actors, and allows actors to actively participate in the spreading process, specifically through the ability to query friends for additional information. This model captures how social agents transmit and act upon information more realistically as compared to the simpler threshold and cascade models. However, it is more difficult to analyze, in particular with respect to seeding strategies. We present efficient, scalable algorithms for determining good seed sets -- initial nodes to inject with the information. Our general approach is to reduce our model to a class of simpler models for which provably good sets can be constructed. By tuning this class of simpler models, we obtain a good seed set for the original more complex model. We call this the \emph{projected greedy approach} because you `project' your model onto a class of simpler models where a greedy seed set selection is near-optimal. We demonstrate the effectiveness of our seeding strategy on synthetic graphs as well as a realistic San Diego evacuation network constructed during the 2007 fires.

preprint 2012 arXiv

Fast approximation of matrix coherence and statistical leverage

The statistical leverage scores of a matrix $A$ are the squared row-norms of the matrix containing its (top) left singular vectors and the coherence is the largest leverage score. These quantities are of interest in recently-popular problems such as matrix completion and Nyström-based low-rank matrix approximation as well as in large-scale statistical data analysis applications more generally; moreover, they are of interest since they define the key structural nonuniformity that must be dealt with in developing fast randomized matrix algorithms. Our main result is a randomized algorithm that takes as input an arbitrary $n \times d$ matrix $A$, with $n \gg d$, and that returns as output relative-error approximations to all $n$ of the statistical leverage scores. The proposed algorithm runs (under assumptions on the precise values of $n$ and $d$) in $O(n d \log n)$ time, as opposed to the $O(nd^2)$ time required by the naïve algorithm that involves computing an orthogonal basis for the range of $A$. Our analysis may be viewed in terms of computing a relative-error approximation to an underconstrained least-squares approximation problem, or, relatedly, it may be viewed as an application of Johnson-Lindenstrauss type ideas. Several practically-important extensions of our basic result are also described, including the approximation of so-called cross-leverage scores, the extension of these ideas to matrices with $n \approx d$, and the extension to streaming environments.
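For orientation, the sketch below implements only the naive baseline that the abstract contrasts against: orthonormalize the columns of $A$ and take squared row norms. It is not the fast randomized algorithm. It assumes $A$ has full column rank, and the function name `leverage_scores` and the example matrix are hypothetical.

```python
import math


def leverage_scores(A):
    """Exact leverage scores of an n x d matrix with full column rank.

    Orthonormalizes the columns with modified Gram-Schmidt (O(n d^2) work)
    and returns the squared row norms of the resulting basis.
    """
    n, d = len(A), len(A[0])
    # cols[j] is the j-th column of A.
    cols = [[float(A[i][j]) for i in range(n)] for j in range(d)]
    basis = []
    for col in cols:
        v = col[:]
        for q in basis:
            proj = sum(qi * vi for qi, vi in zip(q, v))
            v = [vi - proj * qi for qi, vi in zip(q, v)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm < 1e-12:
            raise ValueError("matrix must have full column rank")
        basis.append([vi / norm for vi in v])
    # Row i of the orthonormal basis is (basis[0][i], ..., basis[d-1][i]).
    return [sum(q[i] ** 2 for q in basis) for i in range(n)]


# Hypothetical design matrix: an intercept column and one outlying point.
A = [[1, 0], [1, 1], [1, 2], [1, 10]]
scores = leverage_scores(A)
coherence = max(scores)  # the coherence is the largest leverage score
print(scores, coherence)
```

The scores sum to $d$ for a full-column-rank matrix, and the outlying last row carries most of the leverage, which is the kind of nonuniformity that sampling-based algorithms have to account for.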

preprint 2012 arXiv

Near-Optimal Target Learning With Stochastic Binary Signals

We study learning in a noisy bisection model: specifically, Bayesian algorithms to learn a target value $V$ given access only to noisy realizations of whether $V$ is less than or greater than a threshold $\theta$. At step $t = 0, 1, 2, \ldots$, the learner sets threshold $\theta_t$ and observes a noisy realization of $\mathrm{sign}(V - \theta_t)$. After $T$ steps, the goal is to output an estimate $\hat{V}$ which is within an $\eta$-tolerance of $V$. This problem has been studied, predominantly in environments with a fixed error probability $q < 1/2$ for the noisy realization of $\mathrm{sign}(V - \theta_t)$. In practice, it is often the case that $q$ can approach $1/2$, especially as $\theta \to V$, and there is little known when this happens. We give a pseudo-Bayesian algorithm which provably converges to $V$. When the true prior matches our algorithm's Gaussian prior, we show near-optimal expected performance. Our methods extend to the general multiple-threshold setting where the observation noisily indicates which of $k \ge 2$ regions $V$ belongs to.
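To make the setup concrete, here is a minimal sketch of one Bayesian update in the simpler fixed-error setting that the abstract describes as previously studied. It is not the paper's pseudo-Bayesian algorithm: the discretized grid, the uniform prior, the median query rule, and the names `bayes_update` and `posterior_median` are all assumptions made for illustration.

```python
def bayes_update(grid, posterior, theta, observation, q):
    """One Bayesian update for a noisy threshold query.

    observation is +1 or -1, a noisy reading of sign(V - theta) that is
    flipped with fixed probability q < 1/2 (an assumption of this sketch).
    """
    updated = []
    for v, p in zip(grid, posterior):
        true_sign = 1 if v > theta else -1
        likelihood = (1.0 - q) if observation == true_sign else q
        updated.append(p * likelihood)
    total = sum(updated)
    return [u / total for u in updated]


def posterior_median(grid, posterior):
    """Smallest grid point whose cumulative probability reaches 1/2."""
    cumulative = 0.0
    for v, p in zip(grid, posterior):
        cumulative += p
        if cumulative >= 0.5:
            return v
    return grid[-1]


grid = [i / 100.0 for i in range(101)]      # candidate values of V
prior = [1.0 / len(grid)] * len(grid)       # uniform prior (hypothetical)
theta = posterior_median(grid, prior)       # query at the current median
posterior = bayes_update(grid, prior, theta, +1, q=0.2)
print(theta, posterior_median(grid, posterior))
```

Observing +1 shifts probability mass above the threshold, so the next median query moves upward. When q is close to 1/2 the two likelihoods are nearly equal and each query carries little information, which is the regime the abstract singles out as poorly understood.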

preprint 2012 arXiv

Spreading Processes and Large Components in Ordered, Directed Random Graphs

Order the vertices of a directed random graph \math{v_1,...,v_n}; edge \math{(v_i,v_j)} for \math{i<j} exists independently with probability \math{p}. This random graph model is related to certain spreading processes on networks. We consider the component reachable from \math{v_1} and prove existence of a sharp threshold \math{p^*=\log n/n} at which this reachable component transitions from \math{o(n)} to \math{Ω(n)}.
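The threshold can be explored numerically with a small simulation. The sketch below relies on one observation: because edges only point from lower to higher indices, vertex $v_j$ is reachable from $v_1$ exactly when at least one already-reachable earlier vertex has an edge to it. The function name, the graph size, the seed, and the multiples of $\log n/n$ tried are arbitrary illustrative choices, and a simulation of this kind does not prove anything about the asymptotic statement.

```python
import math
import random


def reachable_count(n, p, rng):
    """Number of vertices reachable from v_1 (v_1 included).

    Vertices are scanned in order. Vertex j is reachable iff at least one
    of the already-reachable earlier vertices has an edge to it, which
    happens with probability 1 - (1 - p)^r when r vertices are reachable.
    """
    reachable = 1  # v_1 reaches itself
    for _ in range(2, n + 1):
        if rng.random() < 1.0 - (1.0 - p) ** reachable:
            reachable += 1
    return reachable


rng = random.Random(0)  # fixed seed, chosen arbitrarily
n = 2000
p_star = math.log(n) / n
for factor in (0.25, 1.0, 4.0):
    sizes = [reachable_count(n, factor * p_star, rng) for _ in range(20)]
    # Average fraction of the graph reachable from v_1 at this edge probability.
    print(factor, sum(sizes) / len(sizes) / n)
```

Sampling one Bernoulli trial per vertex avoids materializing the edges, so each run costs $O(n)$ time instead of $O(n^2)$.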

preprint 2011 arXiv

A Note On Estimating the Spectral Norm of A Matrix Efficiently

We give an efficient algorithm which can obtain a relative error approximation to the spectral norm of a matrix, combining the power iteration method with some techniques from matrix reconstruction which use random sampling.
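As background for the power iteration ingredient, the following sketch estimates $\|A\|_2$ by running plain power iteration on $A^T A$. It deliberately omits the random-sampling techniques that the abstract combines with it, so it should be read as a textbook baseline rather than the paper's algorithm; the function name, iteration count, and seed are hypothetical.

```python
import math
import random


def spectral_norm_estimate(A, iterations=100, seed=0):
    """Estimate ||A||_2 by power iteration on A^T A.

    Plain power iteration only; the random-sampling ingredient mentioned in
    the abstract is not reproduced here.
    """
    m, n = len(A), len(A[0])
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # random starting vector
    for _ in range(iterations):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]  # y = A x
        z = [sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]  # z = A^T y
        norm = math.sqrt(sum(v * v for v in z))
        if norm == 0.0:
            return 0.0
        x = [v / norm for v in z]
    y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
    return math.sqrt(sum(v * v for v in y))  # ||A x|| for a unit vector x


print(spectral_norm_estimate([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

Because the returned value is $\|Ax\|$ for a unit vector $x$, the estimate never exceeds the true spectral norm; how quickly it approaches it depends on the gap between the two largest singular values.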

preprint 2011 arXiv

An Analysis of Optimal Link Bombs

We analyze the phenomenon of collusion for the purpose of boosting the pagerank of a node in an interlinked environment. We investigate the optimal attack pattern for a group of nodes (attackers) attempting to improve the ranking of a specific node (the victim). We consider attacks where the attackers can only manipulate their own outgoing links. We show that the optimal attacks in this scenario are uncoordinated, i.e., the attackers link directly to the victim and to no one else; in particular, the attacking nodes do not link to each other. We also discuss optimal attack patterns for a group that wants to hide itself by not pointing directly to the victim. In these disguised attacks, the attackers link to nodes $l$ hops away from the victim. We show that an optimal disguised attack exists and how it can be computed. The optimal disguised attack also allows us to find optimal link farm configurations. A link farm can be considered a special case of our approach: the target page of the link farm is the victim and the other nodes in the link farm are the attackers for the purpose of improving the rank of the victim. The target page can however control its own outgoing links for the purpose of improving its own rank, which can be modeled as an optimal disguised attack of 1-hop on itself. Our results are unique in the literature as we show optimality not only in the pagerank score, but also in the rank based on the pagerank score. We further validate our results with experiments on a variety of random graph models.

preprint 2011 arXiv

Exponential Inapproximability of Selecting a Maximum Volume Sub-matrix

Given a matrix $A \in \mathbb{R}^{m \times n}$ ($n$ vectors in $m$ dimensions), and a positive integer $k < n$, we consider the problem of selecting $k$ column vectors from $A$ such that the volume of the parallelepiped they define is maximum over all possible choices. We prove that there exists $δ<1$ and $c>0$ such that this problem is not approximable within $2^{-ck}$ for $k = δn$, unless $P=NP$.
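To pin down the objective, the sketch below evaluates the volume of the parallelepiped spanned by $k$ chosen columns as $\sqrt{\det(C^T C)}$, where $C$ holds those columns, and finds the best subset by exhaustive search. The exhaustive search is exponential in general and is shown only to define the problem on a toy input; the function names and the example vectors are hypothetical.

```python
import itertools
import math


def gram_determinant(vectors):
    """det(C^T C) for the matrix C whose columns are `vectors`."""
    k = len(vectors)
    G = [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]
    det = 1.0
    # Gaussian elimination with partial pivoting on the Gram matrix.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(G[r][col]))
        if abs(G[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            G[col], G[pivot] = G[pivot], G[col]
            det = -det
        det *= G[col][col]
        for r in range(col + 1, k):
            factor = G[r][col] / G[col][col]
            for c in range(col, k):
                G[r][c] -= factor * G[col][c]
    return det


def max_volume_subset(columns, k):
    """Exhaustive search over all k-subsets; exponential time in general."""
    best_volume, best_subset = -1.0, None
    for subset in itertools.combinations(range(len(columns)), k):
        det = gram_determinant([columns[i] for i in subset])
        volume = math.sqrt(max(det, 0.0))
        if volume > best_volume:
            best_volume, best_subset = volume, subset
    return best_subset, best_volume


# Four hypothetical column vectors in three dimensions.
columns = [[1, 0, 0], [0, 2, 0], [1, 1, 0], [0, 0, 3]]
print(max_volume_subset(columns, 2))
```

The number of subsets grows as $\binom{n}{k}$, which is why approximation guarantees matter for this problem and why the inapproximability result above is informative.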

preprint 2011 arXiv

Pushing Your Point of View: Behavioral Measures of Manipulation in Wikipedia

As a major source for information on virtually any topic, Wikipedia serves an important role in public dissemination and consumption of knowledge. As a result, it presents tremendous potential for people to promulgate their own points of view; such efforts may be more subtle than typical vandalism. In this paper, we introduce new behavioral metrics to quantify the level of controversy associated with a particular user: a Controversy Score (C-Score) based on the amount of attention the user focuses on controversial pages, and a Clustered Controversy Score (CC-Score) that also takes into account topical clustering. We show that both these measures are useful for identifying people who try to "push" their points of view, by showing that they are good predictors of which editors get blocked. The metrics can be used to triage potential POV pushers. We apply this idea to a dataset of users who requested promotion to administrator status and easily identify some editors who significantly changed their behavior upon becoming administrators. At the same time, such behavior is not rampant. Those who are promoted to administrator status tend to have more stable behavior than comparable groups of prolific editors. This suggests that the Adminship process works well, and that the Wikipedia community is not overwhelmed by users who become administrators to promote their own points of view.

preprint2011arXiv

Using a Non-Commutative Bernstein Bound to Approximate Some Matrix Algorithms in the Spectral Norm

We focus on row-sampling-based approximations for matrix algorithms, in particular matrix multiplication, sparse matrix reconstruction, and $\ell_2$ regression. For $A \in \mathbb{R}^{m\times d}$ ($m$ points in $d \ll m$ dimensions), and appropriate row-sampling probabilities, which typically depend on the norms of the rows of the $m\times d$ left singular matrix of $A$ (the leverage scores), we give row-sampling algorithms with linear (up to polylog factors) dependence on the stable rank of $A$. This result is achieved through the application of non-commutative Bernstein bounds. Keywords: row-sampling; matrix multiplication; matrix reconstruction; estimating spectral norm; linear regression; randomized
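The stable rank mentioned above is usually defined as $\|A\|_F^2 / \|A\|_2^2$. The sketch below computes it under that standard definition, estimating the spectral norm by power iteration on $A^T A$. The function name, iteration count and seeded random start vector are assumptions for illustration and are not taken from the paper.

```python
import random
from math import sqrt


def stable_rank(rows, iterations=200, seed=0):
    """Stable rank ||A||_F^2 / ||A||_2^2 of a matrix given as a list of rows."""
    m, d = len(rows), len(rows[0])
    frob_sq = sum(x * x for row in rows for x in row)
    rng = random.Random(seed)
    v = [rng.uniform(0.5, 1.5) for _ in range(d)]
    norm = sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    top = 0.0  # running estimate of the largest eigenvalue of A^T A
    for _ in range(iterations):
        av = [sum(a * b for a, b in zip(row, v)) for row in rows]      # A v
        w = [sum(rows[i][j] * av[i] for i in range(m)) for j in range(d)]  # A^T (A v)
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        top = norm
        v = [x / norm for x in w]
    return frob_sq / top if top > 0.0 else 0.0


print(stable_rank([[1.0, 2.0], [2.0, 4.0]]))  # a rank-one matrix
```

The stable rank never exceeds the ordinary rank and is small when a few singular values dominate, so a dependence on it can be much better than a dependence on $d$.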

preprint2010arXiv

Comparing Prediction Market Structures, With an Application to Market Making

Ensuring sufficient liquidity is one of the key challenges for designers of prediction markets. Various market making algorithms have been proposed in the literature and deployed in practice, but there has been little effort to evaluate their benefits and disadvantages in a systematic manner. We introduce a novel experimental design for comparing market structures in live trading that ensures fair comparison between two different microstructures with the same trading population. Participants trade on outcomes related to a two-dimensional random walk that they observe on their computer screens. They can simultaneously trade in two markets, corresponding to the independent horizontal and vertical random walks. We use this experimental design to compare the popular inventory-based logarithmic market scoring rule (LMSR) market maker and a new information-based Bayesian market maker (BMM). Our experiments reveal that BMM can offer significant benefits in terms of price stability and expected loss when controlling for liquidity; the caveat is that, unlike LMSR, BMM does not guarantee bounded loss. Our investigation also elucidates some general properties of market makers in prediction markets. In particular, there is an inherent tradeoff between adaptability to market shocks and convergence during market equilibrium.
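For readers unfamiliar with LMSR, the following sketch implements its standard cost function $C(q) = b \log \sum_i e^{q_i/b}$ and the prices derived from it, and checks the bounded-loss property the abstract contrasts with BMM. The function names and the liquidity parameter value are illustrative assumptions; the BMM is not sketched because the abstract does not specify it.

```python
from math import exp, log


def lmsr_cost(quantities, b):
    """LMSR cost function C(q) = b * log(sum_i exp(q_i / b))."""
    m = max(quantities)  # shift by the maximum for numerical stability
    return m + b * log(sum(exp((q - m) / b) for q in quantities))


def lmsr_prices(quantities, b):
    """Instantaneous prices: the gradient of the cost function, summing to 1."""
    m = max(quantities)
    weights = [exp((q - m) / b) for q in quantities]
    total = sum(weights)
    return [w / total for w in weights]


def trade_cost(quantities, outcome, shares, b):
    """Amount a trader pays to buy `shares` of `outcome` at the current state."""
    after = list(quantities)
    after[outcome] += shares
    return lmsr_cost(after, b) - lmsr_cost(quantities, b)


b = 10.0                      # hypothetical liquidity parameter
state = [0.0, 0.0]            # two-outcome market with no shares sold yet
print(lmsr_prices(state, b))  # both outcomes start at price 0.5

# Worst case for the market maker: a trader buys a large position in the
# outcome that then occurs. The loss stays below b * log(number of outcomes).
paid = trade_cost(state, 0, 1000.0, b)
loss = 1000.0 - paid
print(loss, b * log(2))
```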

preprint2010arXiv

Efficient Computation of Optimal Trading Strategies

Given the return series for a set of instruments, a trading strategy is a switching function that transfers wealth from one instrument to another at specified times. We present efficient algorithms for constructing (ex-post) trading strategies that are optimal with respect to the total return, the Sterling ratio and the Sharpe ratio. Such ex-post optimal strategies are useful analysis tools. They can be used to analyze the "profitability of a market" in terms of optimal trading; to develop benchmarks against which real trading can be compared; and, within an inductive framework, the optimal trades can be used to teach learning systems (predictors) which are then used to identify future trading opportunities.
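For the total-return criterion, an ex-post optimal switching strategy can be found by a simple dynamic program over (period, instrument held). The sketch below is one such formulation under stated assumptions: all wealth sits in a single instrument per period, entering the first position is free, and each switch pays a hypothetical proportional fee `switch_cost`. It is not the paper's algorithm and does not cover the Sterling or Sharpe ratio.

```python
from math import exp, log


def optimal_total_return(returns, switch_cost=0.0):
    """Ex-post optimal switching strategy by dynamic programming.

    returns[t][j] is the simple return (> -1) of instrument j over period t.
    Returns (final wealth multiple, instrument held in each period).
    """
    n = len(returns[0])
    move = log(1.0 - switch_cost)  # log-wealth penalty for changing instrument
    best = [log(1.0 + r) for r in returns[0]]  # best log-wealth ending in j
    back = []
    for period in returns[1:]:
        new_best, choice = [], []
        for j in range(n):
            prev = max(range(n),
                       key=lambda i: best[i] + (0.0 if i == j else move))
            penalty = 0.0 if prev == j else move
            new_best.append(best[prev] + penalty + log(1.0 + period[j]))
            choice.append(prev)
        best = new_best
        back.append(choice)
    j = max(range(n), key=lambda i: best[i])
    final = exp(best[j])
    path = [j]
    for choice in reversed(back):  # follow back-pointers to recover holdings
        j = choice[j]
        path.append(j)
    path.reverse()
    return final, path


# Hypothetical data: instrument 0 is risky, instrument 1 is cash.
returns = [[0.10, 0.0], [-0.05, 0.0], [0.20, 0.0]]
print(optimal_total_return(returns))
print(optimal_total_return(returns, switch_cost=0.10))
```

Without fees the optimal strategy steps out of the risky instrument for the losing period; with a 10% fee per switch, staying invested throughout is better.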

preprint2010arXiv

Embedding a Forest in a Graph

For $p \ge 1$, we prove that every forest with $p$ trees whose sizes are $a_1, \ldots, a_p$ can be embedded in any graph containing at least $\sum_{i=1}^p (a_i + 1)$ vertices and having a minimum degree at least $\sum_{i=1}^p a_i$.
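The two hypotheses of this statement are easy to check for a concrete graph. The sketch below does so for a simple undirected graph stored as an adjacency dictionary. The function name is hypothetical, and reading "size" as the number of edges of a tree (so a tree of size $a_i$ has $a_i + 1$ vertices) is an assumption. A `False` result only means the sufficient condition does not apply, not that no embedding exists.

```python
def meets_embedding_condition(adjacency, tree_sizes):
    """Check the vertex-count and minimum-degree conditions quoted above."""
    vertices_needed = sum(a + 1 for a in tree_sizes)
    degree_needed = sum(tree_sizes)
    min_degree = min(len(neighbors) for neighbors in adjacency.values())
    return len(adjacency) >= vertices_needed and min_degree >= degree_needed


# Hypothetical example: a cycle on six vertices (every vertex has degree 2).
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

print(meets_embedding_condition(cycle, [1, 1]))  # two single-edge trees
print(meets_embedding_condition(cycle, [2, 1]))  # needs minimum degree 3
```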

preprint2010arXiv

Row Sampling for Matrix Algorithms via a Non-Commutative Bernstein Bound

We focus on the use of row sampling for approximating matrix algorithms. We give applications to matrix multiplication, sparse matrix reconstruction, and $\ell_2$ regression. For a matrix $A \in \mathbb{R}^{m\times d}$ which represents $m$ points in $d \ll m$ dimensions, all of these tasks can be achieved in $O(md^2)$ time via the singular value decomposition (SVD). For appropriate row-sampling probabilities (which typically depend on the norms of the rows of the $m\times d$ left singular matrix of $A$, the leverage scores), we give row-sampling algorithms with linear (up to polylog factors) dependence on the stable rank of $A$. This result is achieved through the application of non-commutative Bernstein bounds. We then give, to our knowledge, the first algorithms for computing approximations to the appropriate row-sampling probabilities without going through the SVD of $A$. Thus, these are the first $o(md^2)$ algorithms for row-sampling-based approximations to the matrix algorithms which use leverage scores as the sampling probabilities. The techniques we use to approximate sampling according to the leverage scores use some powerful recent results in the theory of random projections for embedding, and may be of independent interest. We confess that one may perform all these matrix tasks more efficiently using these same random projection methods; however, the resulting algorithms are in terms of a small number of linear combinations of all the rows. In many applications, the actual rows of $A$ have some physical meaning and so methods based on a small number of the actual rows are of interest.
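To show what sampling by leverage scores looks like, the sketch below computes exact leverage scores from an orthonormal basis of the column space and draws rescaled rows. This is the $O(md^2)$ baseline, not the paper's faster approximation. The Gram-Schmidt route, the probabilities $p_i = \ell_i / d$, the $1/\sqrt{r p_i}$ rescaling and all function names are assumptions for illustration, and the code assumes $A$ has full column rank.

```python
import random
from math import sqrt


def leverage_scores(rows):
    """Squared row norms of an orthonormal basis for the column space of A.

    Uses modified Gram-Schmidt on the columns; assumes full column rank.
    """
    m, d = len(rows), len(rows[0])
    cols = [[rows[i][j] for i in range(m)] for j in range(d)]
    basis = []
    for col in cols:
        v = list(col)
        for q in basis:
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    return [sum(q[i] ** 2 for q in basis) for i in range(m)]


def sample_rows(rows, count, rng):
    """Draw `count` rows i.i.d. with p_i = leverage_i / d.

    Each picked row is scaled by 1 / sqrt(count * p_i), so the sampled
    matrix S satisfies E[S^T S] = A^T A.
    """
    d = len(rows[0])
    probs = [s / d for s in leverage_scores(rows)]
    picks = rng.choices(range(len(rows)), weights=probs, k=count)
    return [[x / sqrt(count * probs[i]) for x in rows[i]] for i in picks]


# Hypothetical 4 x 2 example matrix.
a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
scores = leverage_scores(a)
print(scores, sum(scores))
sampled = sample_rows(a, 3, random.Random(0))
```

The scores always lie in $[0, 1]$ and sum to $d$ for a full-column-rank matrix, which is why dividing by $d$ yields a probability distribution.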

35 published item(s)

An Algorithm for Reconstructing the Orphan Stream Progenitor with MilkyWay@home Volunteer Computing

A New Mathematical Model for Controlled Pandemics Like COVID-19 : AI Implemented Predictions

Inferring Degrees from Incomplete Networks and Nonlinear Dynamics

Machine Learning the Phenomenology of COVID-19 From Early Infection Dynamics

True Nonlinear Dynamics from Incomplete Networks

Node-By-Node Greedy Deep Learning for Interpretable Features

Approximating Sparse PCA from Incomplete Data

Column Selection via Adaptive Sampling

Extracting Hidden Groups and their Structure from Streaming Interaction Data

Feature Selection for Linear SVM with Provable Guarantees

NP-Hardness and Inapproximability of Sparse PCA

Optimal Sparse Linear Auto-Encoders and Sparse PCA

Recovering PCA from Hybrid-$(\ell_1,\ell_2)$ Sparse Sampling of Data Elements

Faster SVD-Truncated Least-Squares Regression

MilkyWay@home: Harnessing volunteer computers to constrain dark matter in the Milky Way

Random Projections for Linear Support Vector Machines

The Fast Cauchy Transform and Faster Robust Linear Regression

A note on sparse least-squares regression

A Spatial Characterization of the Sagittarius Dwarf Galaxy Tidal Tails

Deterministic Feature Selection for $k$-means Clustering

Near-Optimal Column-Based Matrix Reconstruction

Near-optimal Coresets For Least-Squares Regression

Seeding Influential Nodes in Non-Submodular Models of Information Diffusion

Fast approximation of matrix coherence and statistical leverage

Near-Optimal Target Learning With Stochastic Binary Signals

Spreading Processes and Large Components in Ordered, Directed Random Graphs

A Note On Estimating the Spectral Norm of A Matrix Efficiently

An Analysis of Optimal Link Bombs

Exponential Inapproximability of Selecting a Maximum Volume Sub-matrix

Pushing Your Point of View: Behavioral Measures of Manipulation in Wikipedia

Using a Non-Commutative Bernstein Bound to Approximate Some Matrix Algorithms in the Spectral Norm

Comparing Prediction Market Structures, With an Application to Market Making

Efficient Computation of Optimal Trading Strategies

Embedding a Forest in a Graph

Row Sampling for Matrix Algorithms via a Non-Commutative Bernstein Bound