Source author record

Michael Rubinstein

Michael Rubinstein appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Computer Vision, Machine Learning, math.NT, Artificial Intelligence, Biological Physics, cond-mat.soft, Information Retrieval, Multimedia, physics.flu-dyn

Catalog footprint

What is connected

7 works

9 topics

4 close collaborators

preprint, 2023, arXiv

Muse: Text-To-Image Generation via Masked Generative Transformers

We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient because it uses discrete tokens and requires fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient because it uses parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, and cardinality. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
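The masked modeling task described in this abstract can be illustrated with a toy sketch. It is not the Muse implementation: the sentinel `MASK_ID`, the helper `mask_tokens`, the fixed mask ratio, and the eight-token example are all assumptions made for illustration, and no model is trained here. It only shows how a training example might be formed by hiding a random subset of discrete image tokens and recording which originals are to be predicted.

```python
import random

MASK_ID = -1  # hypothetical sentinel standing in for a "masked" token


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of discrete tokens with MASK_ID.

    Returns (masked_tokens, masked_positions). A model would be trained to
    predict the original token at every masked position, all in parallel.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK_ID
    return masked, positions


rng = random.Random(0)
image_tokens = [12, 7, 7, 3, 98, 41, 5, 60]  # toy codebook indices (made up)
masked, positions = mask_tokens(image_tokens, 0.5, rng)
targets = [image_tokens[p] for p in positions]  # what the model should recover
print(masked, positions, targets)
```

Because every masked position is a separate prediction target, all of them can in principle be filled in at once, which is the contrast with token-by-token autoregressive decoding that the abstract draws.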

preprint, 2022, arXiv

LASSIE: Learning Articulated Shapes from Sparse Image Ensemble via 3D Part Discovery

Creating high-quality articulated 3D models of animals is challenging, whether via manual creation or using 3D scanning tools. Therefore, techniques to reconstruct articulated 3D objects from 2D images are crucial and highly useful. In this work, we propose a practical problem setting to estimate 3D pose and shape of animals given only a few (10-30) in-the-wild images of a particular animal species (say, horse). Contrary to existing works that rely on pre-defined template shapes, we do not assume any form of 2D or 3D ground-truth annotations, nor do we leverage any multi-view or temporal information. Moreover, each input image ensemble can contain animal instances with varying poses, backgrounds, illuminations, and textures. Our key insight is that 3D parts have much simpler shapes than the overall animal and that they are robust w.r.t. animal pose articulations. Following these insights, we propose LASSIE, a novel optimization framework which discovers 3D parts in a self-supervised manner with minimal user intervention. A key driving force behind LASSIE is the enforcement of 2D-3D part consistency using self-supervisory deep features. Experiments on Pascal-Part and self-collected in-the-wild animal datasets demonstrate considerably better 3D reconstructions as well as both 2D and 3D part discovery compared to prior art. Project page: chhankyao.github.io/lassie/

preprint, 2022, arXiv

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency

YouTube users looking for instructions for a specific task may spend a long time browsing content trying to find the right video that matches their needs. Creating a visual summary (abridged version of a video) provides viewers with a quick overview and massively reduces search time. In this work, we focus on summarizing instructional videos, an under-explored area of video summarization. In comparison to generic videos, instructional videos can be parsed into semantically meaningful segments that correspond to important steps of the demonstrated task. Existing video summarization datasets rely on manual frame-level annotations, making them subjective and limited in size. To overcome this, we first automatically generate pseudo summaries for a corpus of instructional videos by exploiting two key assumptions: (i) relevant steps are likely to appear in multiple videos of the same task (Task Relevance), and (ii) they are more likely to be described by the demonstrator verbally (Cross-Modal Saliency). We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer. Using pseudo summaries as weak supervision, our network constructs a visual summary for an instructional video given only video and transcribed speech. To evaluate our model, we collect a high-quality test set, WikiHow Summaries, by scraping WikiHow articles that contain video demonstrations and visual depictions of steps allowing us to obtain the ground-truth summaries. We outperform several baselines and a state-of-the-art video summarization model on this new benchmark.

preprint, 2020, arXiv

SpeedNet: Learning the Speediness in Videos

We wish to automatically predict the "speediness" of moving objects in videos: whether they move faster than, at, or slower than their "natural" speed. The core component in our approach is SpeedNet, a novel deep network trained to detect if a video is playing at normal rate, or if it is sped up. SpeedNet is trained on a large corpus of natural videos in a self-supervised manner, without requiring any manual annotations. We show how this single, binary classification network can be used to detect arbitrary rates of speediness of objects. We demonstrate prediction results by SpeedNet on a wide range of videos containing complex natural motions, and examine the visual cues it utilizes for making those predictions. Importantly, we show that through predicting the speed of videos, the model learns a powerful and meaningful space-time representation that goes beyond simple motion cues. We demonstrate how those learned features can boost the performance of self-supervised action recognition, and can be used for video retrieval. Furthermore, we also apply SpeedNet to generate time-varying, adaptive video speedups, which can allow viewers to watch videos faster, but with less of the jittery, unnatural motions typical of videos that are sped up uniformly.
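One way to picture how such a binary classifier could be trained without manual annotations is to derive the labels from the sampling rate itself. The sketch below is an assumption-laden illustration, not the paper's pipeline: the helper `make_speed_pair`, the 2x speedup, and the integer stand-ins for frames are all hypothetical.

```python
def make_speed_pair(frames, clip_len, speedup=2):
    """Build two labeled clips from one video.

    Label 0 = frames taken consecutively (normal rate).
    Label 1 = every `speedup`-th frame (sped up).
    The labels come from how the clip was sampled, so no annotation is needed.
    """
    needed = clip_len * speedup
    if len(frames) < needed:
        raise ValueError("video too short for the requested clip")
    normal = frames[:clip_len]
    sped_up = frames[:needed:speedup]
    return [(normal, 0), (sped_up, 1)]


frames = list(range(16))  # integers stand in for decoded video frames
pairs = make_speed_pair(frames, clip_len=4, speedup=2)
for clip, label in pairs:
    print(label, clip)
```

Both clips have the same number of frames, so a classifier fed these pairs would have to rely on motion content rather than clip length to tell them apart.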

preprint, 2019, arXiv

Topological Linking Drives Anomalous Thickening of Ring Polymers In Weak Extensional Flows

Molecular dynamics simulations confirm recent extensional flow experiments showing ring polymer melts exhibit strong extension-rate thickening of the viscosity at Weissenberg numbers $Wi \ll 1$. Thickening coincides with the extreme elongation of a minority population of rings that grows with $Wi$. The large susceptibility of some rings to extend is due to the flow-driven formation of topological links that connect multiple rings into supramolecular chains. Links form spontaneously with a longer delay at lower $Wi$ and are pulled tight and stabilized by the flow. Once linked, these composite objects experience larger drag forces than individual rings, driving their strong elongation. The fraction of linked rings generated by flow depends non-monotonically on $Wi$, increasing to a maximum when $Wi \sim 1$ before rapidly decreasing when the strain rate approaches the relaxation rate of the smallest ring loops $\sim 1/\tau_e$.

preprint, 2014, arXiv

The highest lowest zero of general L-functions

Stephen D. Miller showed that, assuming the generalized Riemann Hypothesis, every entire $L$-function of real archimedean type has a zero of the form $\frac12 + it$ with $-t_0 < t < t_0$, where $t_0\approx 14.13$ corresponds to the first zero of the Riemann zeta function. We give an example of a self-dual degree-4 $L$-function whose first positive imaginary zero is at $t_1\approx 14.496$. In particular, Miller's result does not hold for general $L$-functions. We show that all $L$-functions satisfying some additional (conjecturally true) conditions have a zero in the interval $(-t_2,t_2)$ with $t_2\approx 22.661$.

preprint, 2012, arXiv

The distribution of solutions to xy = N mod a with an application to factoring integers

We consider the uniform distribution of solutions $(x,y)$ to $xy = N \bmod a$, and obtain a bound on the second moment of the number of solutions in squares of length approximately $a^{1/2}$. We use this to study a new factoring algorithm that factors $N=UV$ provably in $O(N^{1/3+\epsilon})$ time, and discuss the potential for improving the runtime to sub-exponential.
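The objects in this abstract can be made concrete with a small sketch. It does not implement the paper's factoring algorithm, whose details are not given here. It only enumerates solutions of the congruence for toy values, counts how many fall in a given square, and checks one elementary fact: if $N = UV$, then $(U \bmod a, V \bmod a)$ is itself a solution. The helper names `solutions` and `count_in_square` and the values $N = 91$, $a = 5$ are assumptions chosen for illustration.

```python
from math import gcd


def solutions(N, a):
    """Pairs (x, y) with 0 < x < a, gcd(x, a) == 1 and x*y = N (mod a).

    For each invertible x the matching y is unique: y = N * x^(-1) mod a.
    (Solutions with x not coprime to a are deliberately not enumerated.)
    """
    out = []
    for x in range(1, a):
        if gcd(x, a) == 1:
            y = (N * pow(x, -1, a)) % a  # modular inverse, Python 3.8+
            out.append((x, y))
    return out


def count_in_square(points, x0, y0, side):
    """Number of points in the square [x0, x0+side) x [y0, y0+side)."""
    return sum(1 for x, y in points
               if x0 <= x < x0 + side and y0 <= y < y0 + side)


N, a = 91, 5          # toy values; 91 = 7 * 13
U, V = 7, 13
pts = solutions(N, a)
print(pts)
print((U % a, V % a) in pts)           # the factor pair reduces to a solution
print(count_in_square(pts, 1, 1, 2))   # solutions in a 2-by-2 square
```

With realistic parameters the square would have side about $a^{1/2}$, matching the scale at which the abstract's second-moment bound is stated.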