Source author record

Shankar Krishnan

Shankar Krishnan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Artificial Intelligence Computer Vision Data Structures and Algorithms Discrete Mathematics Information Theory math.IT Networking and Internet Architecture

Catalog footprint

What is connected

5works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2020arXiv

Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks

Batch Normalization (BN) uses mini-batch statistics to normalize the activations during training, introducing dependence between mini-batch elements. This dependency can hurt the performance if the mini-batch size is too small, or if the elements are correlated. Several alternatives, such as Batch Renormalization and Group Normalization (GN), have been proposed to address this issue. However, they either do not match the performance of BN for large batches, or still exhibit degradation in performance for smaller batches, or introduce artificial constraints on the model architecture. In this paper we propose the Filter Response Normalization (FRN) layer, a novel combination of a normalization and an activation function, that can be used as a replacement for other normalizations and activations. Our method operates on each activation channel of each batch element independently, eliminating the dependency on other batch elements. Our method outperforms BN and other alternatives in a variety of settings for all batch sizes. FRN layer performs $\approx 0.7-1.0\%$ better than BN on top-1 validation accuracy with large mini-batch sizes for Imagenet classification using InceptionV3 and ResnetV2-50 architectures. Further, it performs $>1\%$ better than GN on the same problem in the small mini-batch size regime. For object detection problem on COCO dataset, FRN layer outperforms all other methods by at least $0.3-0.5\%$ in all batch size regimes.

preprint2020arXiv

Weak and Strong Gradient Directions: Explaining Memorization, Generalization, and Hardness of Examples at Scale

Coherent Gradients (CGH) is a recently proposed hypothesis to explain why over-parameterized neural networks trained with gradient descent generalize well even though they have sufficient capacity to memorize the training set. The key insight of CGH is that, since the overall gradient for a single step of SGD is the sum of the per-example gradients, it is strongest in directions that reduce the loss on multiple examples if such directions exist. In this paper, we validate CGH on ResNet, Inception, and VGG models on ImageNet. Since the techniques presented in the original paper do not scale beyond toy models and datasets, we propose new methods. By posing the problem of suppressing weak gradient directions as a problem of robust mean estimation, we develop a coordinate-based median of means approach. We present two versions of this algorithm, M3, which partitions a mini-batch into 3 groups and computes the median, and a more efficient version RM3, which reuses gradients from previous two time steps to compute the median. Since they suppress weak gradient directions without requiring per-example gradients, they can be used to train models at scale. Experimentally, we find that they indeed greatly reduce overfitting (and memorization) and thus provide the first convincing evidence that CGH holds at scale. We also propose a new test of CGH that does not depend on adding noise to training labels or on suppressing weak gradient directions. Using the intuition behind CGH, we posit that the examples learned early in the training process (i.e., "easy" examples) are precisely those that have more in common with other training examples. Therefore, as per CGH, the easy examples should generalize better amongst themselves than the hard examples amongst themselves. We validate this hypothesis with detailed experiments, and believe that it provides further orthogonal evidence for CGH.

preprint2016arXiv

Spatio-temporal Interference Correlation and Joint Coverage in Cellular Networks

This paper provides an analytical framework with foundations in stochastic geometry to characterize the spatio-temporal interference correlation as well as the joint coverage probability at two spatial locations in a cellular network. In particular, modeling the locations of cellular base stations (BSs) as a Poisson Point Process (PPP), we study interference correlation at two spatial locations $\ell_1$ and $\ell_2$ separated by a distance $v$, when the user follows \emph{closest BS association policy} at both spatial locations and moves from $\ell_1$ to $\ell_2$. With this user displacement, two scenarios can occur: i) the user is handed off to a new serving BS at $\ell_2$, or ii) no handoff occurs and the user is served by the same BS at both locations. After providing intermediate results such as probability of handoff and distance distributions of the serving BS at the two user locations, we use them to derive exact expressions for spatio-temporal interference correlation coefficient and joint coverage probability for any distance separation $v$. We also study two different handoff strategies: i) \emph{handoff skipping}, and ii) \emph{conventional handoffs}, and derive the expressions of joint coverage probability for both strategies. The exact analysis is not straightforward and involves a careful treatment of the neighborhood of the two spatial locations and the resulting handoff scenarios. To provide analytical insights, we also provide easy-to-use expressions for two special cases: i) static user ($v =0$) and ii) highly mobile user ($v \rightarrow \infty)$. As expected, our analysis shows that the interference correlation and joint coverage probability decrease with increasing $v$, with $v \rightarrow \infty$ corresponding to a completely uncorrelated scenario.

preprint2013arXiv

COAST: A Convex Optimization Approach to Stress-Based Embedding

Visualizing graphs using virtual physical models is probably the most heavily used technique for drawing graphs in practice. There are many algorithms that are efficient and produce high-quality layouts. If one requires that the layout also respect a given set of non-uniform edge lengths, however, force-based approaches become problematic while energy-based layouts become intractable. In this paper, we propose a reformulation of the stress function into a two-part convex objective function to which we can apply semi-definite programming (SDP). We avoid the high computational cost associated with SDP by a novel, compact re-parameterization of the objective function using the eigenvectors of the graph Laplacian. This sparse representation makes our approach scalable. We provide experimental results to show that this method scales well and produces reasonable layouts while dealing with the edge length constraints.

preprint2012arXiv

Achieving Approximate Soft Clustering in Data Streams

In recent years, data streaming has gained prominence due to advances in technologies that enable many applications to generate continuous flows of data. This increases the need to develop algorithms that are able to efficiently process data streams. Additionally, real-time requirements and evolving nature of data streams make stream mining problems, including clustering, challenging research problems. In this paper, we propose a one-pass streaming soft clustering (membership of a point in a cluster is described by a distribution) algorithm which approximates the "soft" version of the k-means objective function. Soft clustering has applications in various aspects of databases and machine learning including density estimation and learning mixture models. We first achieve a simple pseudo-approximation in terms of the "hard" k-means algorithm, where the algorithm is allowed to output more than $k$ centers. We convert this batch algorithm to a streaming one (using an extension of the k-means++ algorithm recently proposed) in the "cash register" model. We also extend this algorithm when the clustering is done over a moving window in the data stream.