Source author record

Yuheng Hu

Yuheng Hu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Social and Information Networks Artificial Intelligence cs.CY Databases physics.soc-ph Human-Computer Interaction Information Retrieval Machine Learning

Catalog footprint

What is connected

6works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2015arXiv

BayesWipe: A Scalable Probabilistic Framework for Cleaning BigData

Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like CFDs (which have to be provided by domain experts, or learned from a clean sample of the database). In this paper, we provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model learned from the noisy database directly. We thus avoid the necessity for a domain expert or clean master data. We also show how to efficiently perform consistent query answering using this model over a dirty database, in case write permissions to the database are unavailable. We evaluate our methods over both synthetic and real data.

preprint2014arXiv

Analyzing User Activities, Demographics, Social Network Structure and User-Generated Content on Instagram

Instagram is a relatively new form of communication where users can instantly share their current status by taking pictures and tweaking them using filters. It has seen a rapid growth in the number of users as well as uploads since it was launched in October 2010. Inspite of the fact that it is the most popular photo sharing application, it has attracted relatively less attention from the web and social media research community. In this paper, we present a large-scale quantitative analysis on millions of users and pictures we crawled over 1 month from Instagram. Our analysis reveals several insights on Instagram which were never studied before: 1) its social network properties are quite different from other popular social media like Twitter and Flickr, 2) people typically post once a week, and 3) people like to share their locations with friends. To the best of our knowledge, this is the first in-depth analysis of user activities, demographics, social network structure and user-generated content on Instagram.

preprint2014arXiv

Whoo.ly: Facilitating Information Seeking For Hyperlocal Communities Using Social Media

Social media systems promise powerful opportunities for people to connect to timely, relevant information at the hyper local level. Yet, finding the meaningful signal in noisy social media streams can be quite daunting to users. In this paper, we present and evaluate Whoo.ly, a web service that provides neighborhood-specific information based on Twitter posts that were automatically inferred to be hyperlocal. Whoo.ly automatically extracts and summarizes hyperlocal information about events, topics, people, and places from these Twitter posts. We provide an overview of our design goals with Whoo.ly and describe the system including the user interface and our unique event detection and summarization algorithms. We tested the usefulness of the system as a tool for finding neighborhood information through a comprehensive user study. The outcome demonstrated that most participants found Whoo.ly easier to use than Twitter and they would prefer it as a tool for exploring their neighborhoods.

preprint2012arXiv

Bayesian Data Cleaning for Web Data

Data Cleaning is a long standing problem, which is growing in importance with the mass of uncurated web data. State of the art approaches for handling inconsistent data are systems that learn and use conditional functional dependencies (CFDs) to rectify data. These methods learn data patterns--CFDs--from a clean sample of the data and use them to rectify the dirty/inconsistent data. While getting a clean training sample is feasible in enterprise data scenarios, it is infeasible in web databases where there is no separate curated data. CFD based methods are unfortunately particularly sensitive to noise; we will empirically demonstrate that the number of CFDs learned falls quite drastically with even a small amount of noise. In order to overcome this limitation, we propose a fully probabilistic framework for cleaning data. Our approach involves learning both the generative and error (corruption) models of the data and using them to clean the data. For generative models, we learn Bayes networks from the data. For error models, we consider a maximum entropy framework for combing multiple error processes. The generative and error models are learned directly from the noisy data. We present the details of the framework and demonstrate its effectiveness in rectifying web data.

preprint2012arXiv

ET-LDA: Joint Topic Modeling for Aligning Events and their Twitter Feedback

During broadcast events such as the Superbowl, the U.S. Presidential and Primary debates, etc., Twitter has become the de facto platform for crowds to share perspectives and commentaries about them. Given an event and an associated large-scale collection of tweets, there are two fundamental research problems that have been receiving increasing attention in recent years. One is to extract the topics covered by the event and the tweets; the other is to segment the event. So far these problems have been viewed separately and studied in isolation. In this work, we argue that these problems are in fact inter-dependent and should be addressed together. We develop a joint Bayesian model that performs topic modeling and event segmentation in one unified framework. We evaluate the proposed model both quantitatively and qualitatively on two large-scale tweet datasets associated with two events from different domains to show that it improves significantly over baseline models.

preprint2012arXiv

ET-LDA: Joint Topic Modeling For Aligning, Analyzing and Sensemaking of Public Events and Their Twitter Feeds

Social media channels such as Twitter have emerged as popular platforms for crowds to respond to public events such as speeches, sports and debates. While this promises tremendous opportunities to understand and make sense of the reception of an event from the social media, the promises come entwined with significant technical challenges. In particular, given an event and an associated large scale collection of tweets, we need approaches to effectively align tweets and the parts of the event they refer to. This in turn raises questions about how to segment the event into smaller yet meaningful parts, and how to figure out whether a tweet is a general one about the entire event or specific one aimed at a particular segment of the event. In this work, we present ET-LDA, an effective method for aligning an event and its tweets through joint statistical modeling of topical influences from the events and their associated tweets. The model enables the automatic segmentation of the events and the characterization of tweets into two categories: (1) episodic tweets that respond specifically to the content in the segments of the events, and (2) steady tweets that respond generally about the events. We present an efficient inference method for this model, and a comprehensive evaluation of its effectiveness over existing methods. In particular, through a user study, we demonstrate that users find the topics, the segments, the alignment, and the episodic tweets discovered by ET-LDA to be of higher quality and more interesting as compared to the state-of-the-art, with improvements in the range of 18-41%.

Yuheng Hu

What is connected

Connect this record

See the researcher in context

Building this map preview

6 published item(s)

BayesWipe: A Scalable Probabilistic Framework for Cleaning BigData

Analyzing User Activities, Demographics, Social Network Structure and User-Generated Content on Instagram

Whoo.ly: Facilitating Information Seeking For Hyperlocal Communities Using Social Media

Bayesian Data Cleaning for Web Data

ET-LDA: Joint Topic Modeling for Aligning Events and their Twitter Feedback

ET-LDA: Joint Topic Modeling For Aligning, Analyzing and Sensemaking of Public Events and Their Twitter Feeds