Source author record

Harsh Mehta

Harsh Mehta appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning; Computation and Language; Computer Vision; Artificial Intelligence; Cryptography and Security; Distributed, Parallel, and Cluster Computing; math.CA; math.NT; math.OC

Catalog footprint

What is connected

7 works

9 topics

4 close collaborators

Preprint · 2022 · arXiv

ALX: Large Scale Matrix Factorization on TPUs

We present ALX, an open-source library for distributed matrix factorization using Alternating Least Squares, written in JAX. Our design allows for efficient use of the TPU architecture and scales well to matrix factorization problems of O(B) rows/columns by scaling the number of available TPU cores. In order to spur future research on large-scale matrix factorization methods and to illustrate the scalability properties of our own implementation, we also built a real-world web link prediction dataset called WebGraph. This dataset can be easily modeled as a matrix factorization problem. We created several variants of this dataset based on locality and sparsity properties of sub-graphs. The largest variant of WebGraph has around 365M nodes, and training a single epoch finishes in about 20 minutes with 256 TPU cores. We include speed and performance numbers of ALX on all variants of WebGraph. Both the framework code and the dataset are open-sourced.
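As a rough illustration of the Alternating Least Squares idea named in this abstract, the sketch below fits a rank-1 factorization to a handful of observed entries in plain Python. It is not ALX code and says nothing about the library's JAX or TPU design; the function name `als_rank1`, the `reg` parameter and the toy matrix are all assumptions made for the example.

```python
# Minimal rank-1 ALS sketch on observed entries (illustrative; not ALX code).

def als_rank1(observed, n_rows, n_cols, reg=0.0, n_iters=10):
    """Fit M[i][j] ~ u[i] * v[j] using only the observed entries.

    observed: dict mapping (row, col) -> value.
    reg: L2 regularization strength (hypothetical parameter).
    """
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(n_iters):
        # Fix v and solve the 1-D least-squares problem for each row factor.
        num = [0.0] * n_rows
        den = [reg] * n_rows
        for (i, j), value in observed.items():
            num[i] += value * v[j]
            den[i] += v[j] * v[j]
        u = [n / d if d > 0 else old for n, d, old in zip(num, den, u)]
        # Fix u and solve for each column factor in the same way.
        num = [0.0] * n_cols
        den = [reg] * n_cols
        for (i, j), value in observed.items():
            num[j] += value * u[i]
            den[j] += u[i] * u[i]
        v = [n / d if d > 0 else old for n, d, old in zip(num, den, v)]
    return u, v


# Toy rank-1 matrix: the outer product of [1, 2, 3] and [4, 5], fully observed.
rows, cols = [1.0, 2.0, 3.0], [4.0, 5.0]
observed = {(i, j): a * b for i, a in enumerate(rows) for j, b in enumerate(cols)}
u, v = als_rank1(observed, n_rows=3, n_cols=2)
max_error = max(abs(u[i] * v[j] - value) for (i, j), value in observed.items())
```

Each half-step has a closed-form solution because the objective is quadratic in one factor when the other is held fixed; with higher rank the scalar division becomes a small linear solve per row or column.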

Preprint · 2022 · arXiv

Large Scale Transfer Learning for Differentially Private Image Classification

Differential Privacy (DP) provides a formal framework for training machine learning models with individual example-level privacy. In the field of deep learning, Differentially Private Stochastic Gradient Descent (DP-SGD) has emerged as a popular private training algorithm. Unfortunately, the computational cost of training large-scale models with DP-SGD is substantially higher than non-private training. This is further exacerbated by the fact that increasing the number of parameters leads to larger degradation in utility with DP. In this work, we zoom in on the ImageNet dataset and demonstrate that, similar to the non-private case, pre-training over-parameterized models on a large public dataset can lead to substantial gains when the model is finetuned privately. Moreover, by systematically comparing private and non-private models across a range of large batch sizes, we find that, as in the non-private setting, the choice of optimizer can further improve performance substantially with DP. Using the LAMB optimizer with DP-SGD, we saw an improvement of up to 20 percentage points (absolute). Finally, we show that finetuning just the last layer for a single step in the full-batch setting, combined with extremely small-scale (near-zero) initialization, leads to SOTA results of 81.7% under a wide privacy budget range of $ε \in [4, 10]$ and $δ = 10^{-6}$ while substantially reducing the computational overhead.
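The core DP-SGD step mentioned in this abstract can be sketched in a few lines: clip each per-example gradient to a fixed norm, sum, add Gaussian noise scaled to that norm, and average. This is a generic illustration of the commonly described algorithm, not the paper's training code; it omits privacy accounting for (ε, δ) and the LAMB optimizer, and the names `dp_sgd_gradient`, `clip_norm` and `noise_multiplier` are assumptions for the example.

```python
import math
import random


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale down only when the gradient exceeds the clipping norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for k in range(dim):
            total[k] += grad[k] * scale
    # Per-coordinate noise standard deviation is noise_multiplier * clip_norm.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [x / len(per_example_grads) for x in noisy]


# Toy batch: the first gradient (norm 5) is clipped, the second (norm 0.5) is not.
grads = [[3.0, 4.0], [0.3, 0.4]]
noiseless = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0,
                            rng=random.Random(0))
noisy = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0,
                        rng=random.Random(0))
```

Setting `noise_multiplier=0.0` isolates the clipping behaviour for inspection; a real private run needs a positive multiplier chosen by a privacy accountant.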

Preprint · 2022 · arXiv

Long Range Language Modeling via Gated State Spaces

State space models have been shown to be effective at modeling long-range dependencies, especially on sequence classification tasks. In this work we focus on autoregressive sequence modeling over English books, GitHub source code and ArXiv mathematics articles. Based on recent developments around the effectiveness of gated activation functions, we propose a new layer named Gated State Space (GSS) and show that it trains significantly faster than the diagonal version of S4 (i.e. DSS) on TPUs, is fairly competitive with several well-tuned Transformer-based baselines and exhibits zero-shot generalization to longer inputs while being straightforward to implement. Finally, we show that leveraging self-attention to model local dependencies improves the performance of GSS even further.

Preprint · 2020 · arXiv

Momentum Improves Normalized SGD

We provide an improved analysis of normalized SGD showing that adding momentum provably removes the need for large batch sizes on non-convex objectives. Then, we consider the case of objectives with bounded second derivative and show that in this case a small tweak to the momentum formula allows normalized SGD with momentum to find an $ε$-critical point in $O(1/ε^{3.5})$ iterations, matching the best-known rates without accruing any logarithmic factors or dependence on dimension. We also provide an adaptive method that automatically improves convergence rates when the variance in the gradients is small. Finally, we show that our method is effective when employed on popular large scale tasks such as ResNet-50 and BERT pretraining, matching the performance of the disparate methods used to get state-of-the-art results on both tasks.
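A minimal sketch of normalized SGD with momentum, assuming the usual form in which the momentum buffer is an exponential average of stochastic gradients and each update moves a fixed distance along its direction. It does not include the modified momentum formula or the adaptive method the abstract refers to; the toy objective, step size, and function names are assumptions for illustration.

```python
import math
import random


def normalized_sgd_momentum(grad_fn, x0, lr=0.05, beta=0.9, steps=500):
    """Run x <- x - lr * m / ||m||, with m an exponential average of gradients."""
    x = list(x0)
    m = [0.0] * len(x)
    for _ in range(steps):
        g = grad_fn(x)
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        norm = math.sqrt(sum(mi * mi for mi in m))
        if norm > 0:
            # Every update has length exactly lr, whatever the gradient scale.
            x = [xi - lr * mi / norm for xi, mi in zip(x, m)]
    return x


rng = random.Random(0)


def noisy_quadratic_grad(x):
    # Gradient of 0.5 * ||x||^2 plus Gaussian noise (a hypothetical toy objective).
    return [xi + rng.gauss(0.0, 0.1) for xi in x]


x_final = normalized_sgd_momentum(noisy_quadratic_grad, [3.0, -4.0])
final_norm = math.sqrt(sum(xi * xi for xi in x_final))
```

Because the step length is fixed, the iterate settles into a neighbourhood of the minimizer whose size depends on `lr`; in practice the step size would be decayed over time.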

Preprint · 2020 · arXiv

Retouchdown: Adding Touchdown to StreetLearn as a Shareable Resource for Language Grounding Tasks in Street View

The Touchdown dataset (Chen et al., 2019) provides instructions by human annotators for navigation through New York City streets and for resolving spatial descriptions at a given location. To enable the wider research community to work effectively with the Touchdown tasks, we are publicly releasing the 29k raw Street View panoramas needed for Touchdown. We follow the process used for the StreetLearn data release (Mirowski et al., 2019) to check panoramas for personally identifiable information and blur them as necessary. These have been added to the StreetLearn dataset and can be obtained via the same process as used previously for StreetLearn. We also provide a reference implementation for both of the Touchdown tasks: vision and language navigation (VLN) and spatial description resolution (SDR). We compare our model results to those given in Chen et al. (2019) and show that the panoramas we have added to StreetLearn fully support both Touchdown tasks and can be used effectively for further research and comparison.

Preprint · 2015 · arXiv

Products of binomial coefficients and unreduced Farey fractions

This paper studies the product $\bar{G}_n$ of the binomial coefficients in the $n$-th row of Pascal's triangle, which equals the reciprocal of the product of all the reduced and unreduced Farey fractions of order $n$. It studies its size as a real number, measured by its logarithm $\log(\bar{G}_n)$, and its prime factorization, measured by the order of divisibility by a fixed prime $p$, each viewed as a function of $n$. It derives three formulas for its prime power divisibility, $\operatorname{ord}_p(\bar{G}_n)$, two of which relate it to base $p$ radix expansions of $n$, and which display different facets of its behavior. These formulas are used to determine the maximal growth rate of each $\operatorname{ord}_p(\bar{G}_n)$ and the structure of the fluctuations of these functions. It also defines analogous functions for all integer bases $b$ replacing prime bases. A final topic relates the factorizations of $\bar{G}_n$ to Chebyshev-type prime-counting estimates and the prime number theorem.
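The quantities in this abstract are easy to compute for small $n$. The sketch below forms the row product directly and checks its $p$-adic order against a digit-sum expression obtained from Legendre's formula for $\operatorname{ord}_p(k!)$, using the identity that the row product equals $(n!)^{n+1} / \prod_{k=0}^{n} (k!)^2$. This is a standard identity chosen for illustration and is not necessarily one of the three formulas derived in the paper; all function names are assumptions.

```python
from math import comb


def binomial_product(n):
    """Product of the binomial coefficients in row n of Pascal's triangle."""
    result = 1
    for k in range(n + 1):
        result *= comb(n, k)
    return result


def ord_p(value, p):
    """Largest e such that p**e divides the positive integer value."""
    e = 0
    while value % p == 0:
        value //= p
        e += 1
    return e


def digit_sum(n, p):
    """Sum of the base-p digits of n."""
    total = 0
    while n:
        total += n % p
        n //= p
    return total


def ord_p_via_digits(n, p):
    """ord_p of the row product, using Legendre's formula for ord_p(k!)."""
    def ord_factorial(k):
        return (k - digit_sum(k, p)) // (p - 1)
    return (n + 1) * ord_factorial(n) - 2 * sum(ord_factorial(k) for k in range(n + 1))


g4 = binomial_product(4)  # 1 * 4 * 6 * 4 * 1 = 96
```

For example, `ord_p(g4, 2)` and `ord_p_via_digits(4, 2)` both give 5, since 96 = 2^5 · 3.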

Preprint · 2013 · arXiv

The L1 norm of the generalized de la Vallee Poussin kernel

Charles de la Vallée Poussin defined two different kernels that bear his name. This paper considers the ones that are linear combinations of two Fejér kernels, which are known as the delayed means. We show that the $L^1$ norms are constant in families of delayed means, and determine the exact value
