Source author record

Ashish Sureka

Ashish Sureka appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Software Engineering, cs.CY, Information Retrieval, Social and Information Networks, Artificial Intelligence, Databases, Digital Libraries, Human-Computer Interaction, Machine Learning, physics.soc-ph, Programming Languages

Catalog footprint

What is connected

18 works

11 topics

4 close collaborators


preprint, 2016, arXiv

A Bibliometric Study of Asia Pacific Software Engineering Conference from 2010 to 2015

The Asia-Pacific Software Engineering Conference (APSEC) is a reputed, long-running conference which had successfully completed more than two decades as of 2015. We conduct a bibliometric and scientific publication mining based study of how the conference has evolved over the past six years (2010 to 2015). Our objective is to perform an in-depth examination of the state of APSEC so that the APSEC community can identify strengths, areas for improvement and future directions for the conference. Our empirical analysis is based on various perspectives such as: paper submission acceptance rate trends, conference location, scholarly productivity and contributions from various countries, analysis of keynotes, workshops, conference organizers and sponsors, tutorials, identification of prolific authors, computation of citation impact of papers and contributing authors, internal and external collaboration, university and industry participation and collaboration, measurement of gender imbalance, topical analysis, yearly author churn and program committee characteristics.

preprint, 2016, arXiv

An Experimental Study on the Learning Outcome of Teaching Elementary Level Children using Lego Mindstorms EV3 Robotics Education Kit

Skills like computational thinking, problem solving, handling complexity, team-work and project management are essential for future careers and need to be taught to students at the elementary level itself. Computer programming knowledge and skills, experiencing technology and conducting science and engineering experiments are also important for students at the elementary level. However, teaching such skills effectively through active learning can be challenging for educators. In this paper, we present our approach and experiences in teaching such skills to several elementary level children using the Lego Mindstorms EV3 robotics education kit. We describe our learning environment consisting of lessons, worksheets, hands-on activities and assessment. We taught students how to design, construct and program robots using components such as motors, sensors, wheels, axles, beams, connectors and gears. Students also gained knowledge of basic programming constructs such as control flow, loops, branches and conditions using a visual programming environment. We carefully observed how students performed various tasks and solved problems. We present experimental results which demonstrate that our teaching methodology, consisting of both the course content and pedagogy, was effective in imparting the desired skills and knowledge to elementary level children. The students also participated in a competitive World Robot Olympiad India event and qualified during the regional round, which is evidence of the effectiveness of the approach.

preprint, 2016, arXiv

Application of Case-Based Teaching and Learning in Compiler Design Course

Compiler design is a course that discusses ideas used in the construction of programming language compilers. Students learn how a program written in a high level programming language and designed for human understanding is systematically converted into low level assembly language understood by machines. We propose and implement a Case-based and Project-based Learning environment for teaching important Compiler design concepts (CPLC) to B.Tech third year students of a Delhi University (India) college. A case is a text that describes a real-life situation, providing information but not a solution. Previous research shows that case-based teaching helps students to apply the principles discussed in class to solving complex practical problems. We divide one main project into sub-projects given to students in order to enhance their practical experience of designing a compiler. To measure the effectiveness of case-based discussions, students complete a survey on their perceptions of the benefits of case-based learning. The survey is analyzed using frequency distribution and the chi-square test of association. The results of the survey show that case-based teaching of compiler concepts does enhance students' skills of learning, critical thinking, engagement, communication and team work.

preprint, 2016, arXiv

Graph or Relational Databases: A Speed Comparison for Process Mining Algorithm

Process-Aware Information Systems (PAIS) are IT systems that manage and support business processes and generate large event logs from the execution of those processes. An event log is represented as a tuple of the form CaseID, TimeStamp, Activity and Actor. Process Mining is an emerging area of research that deals with the study and analysis of business processes based on event logs. Process Mining aims at analyzing event logs to discover business process models, enhance them or check for conformance with an a priori model. The large volume of event logs generated is stored in databases. Relational databases perform well for certain classes of applications. However, there are certain classes of applications for which relational databases are not able to scale. A number of NoSQL databases have emerged to counter the challenges of scalability. Discovering social networks from event logs is one of the most challenging and important Process Mining tasks. Similar-Task and Sub-Contract algorithms are some of the most widely used Organizational Mining techniques. Our objective is to investigate which of the databases (Relational or Graph) performs better for Organizational Mining under Process Mining. An intersection of Process Mining and Graph Databases can be accomplished by modelling these Organizational Mining metrics with graph databases. We implement Similar-Task and Sub-Contract algorithms on relational and NoSQL (graph-oriented) databases using only query language constructs. We conduct empirical analysis on a large real world data set to compare the performance of a row-oriented database and a NoSQL graph-oriented database. We benchmark performance factors like query execution time, CPU usage and disk/memory space usage for the NoSQL graph-oriented database against the row-oriented database.

preprint, 2016, arXiv

Parichayana: An Eclipse Plugin for Detecting Exception Handling Anti-Patterns and Code Smells in Java Programs

Anti-patterns and code-smells are signs in the source code which are not defects (they do not prevent the program from functioning and do not cause compile errors) but are rather indicators of deeper and bigger problems. Exception handling is a programming construct designed to handle the occurrence of anomalous or exceptional conditions (that change the normal flow of program execution). In this paper, we present an Eclipse plug-in (called Parichayana) for detecting exception handling anti-patterns and code smells in Java programs. Parichayana is capable of automatically detecting several commonly occurring exception handling programming mistakes. We extend the Eclipse IDE and create new menu entries and associated actions via the Parichayana plug-in (free and open-source, hosted on GitHub). We compare and contrast Parichayana with several code smell detection tools and demonstrate that our tool provides unique capabilities in the context of existing tools. We have created an update site, and developers can use the Eclipse update manager to install Parichayana from our site. We used Parichayana on several large open-source Java based projects and detected the presence of exception handling anti-patterns.

preprint, 2016, arXiv

Spider and the Flies : Focused Crawling on Tumblr to Detect Hate Promoting Communities

Tumblr is one of the largest and most popular microblogging websites on the Internet. Studies show that due to high reachability among viewers, low publication barriers and social networking connectivity, microblogging websites are being misused by existing extremist groups as a platform to post hateful speech and recruit new members. Manual identification of such posts and communities is overwhelmingly impractical due to the large number of posts and blogs being published every day. We propose a topic based web crawler primarily consisting of multiple phases: training a text classifier model on examples of only hate promoting users, extracting posts of an unknown Tumblr micro-blogger, classifying hate promoting bloggers based on their activity feeds, crawling through the external links to other bloggers and performing a social network analysis on connected extremist bloggers. To investigate the effectiveness of our approach, we conduct experiments on a large real world dataset. Experimental results reveal that the proposed approach is an effective method and has an F-score of 0.80. We apply social network analysis based techniques and identify influential and core bloggers in a community.

preprint, 2016, arXiv

Thirteen Years of Mining Software Repositories (MSR) Conference - What is the Bibliography Data Telling Us?

The Mining Software Repositories (MSR) conference is a reputed, long-running and flagship conference in the area of Software Analytics which had successfully completed more than one decade as of 2016. We conduct a bibliometric and scientific publication mining based study of how the conference has evolved over the past 13 years (from 2004 to 2007 as a workshop and then from 2008 to 2016 as a conference). Our objective is to perform an examination of the state of MSR so that the MSR community can identify strengths, areas for improvement and future directions for the conference.

preprint, 2015, arXiv

Anvaya: An Algorithm and Case-Study on Improving the Goodness of Software Process Models generated by Mining Event-Log Data in Issue Tracking System

Issue Tracking Systems (ITS) such as Bugzilla can be viewed as Process Aware Information Systems (PAIS) generating event-logs during the life-cycle of a bug report. Process Mining consists of mining event logs generated from PAIS for process model discovery, conformance and enhancement. We apply process map discovery techniques to mine event trace data generated from the ITS of the open source Firefox browser project to generate and study process models. The bug life-cycle exhibits diversity and variance. Therefore, the process models generated from the event-logs are spaghetti-like, with a large number of edges, inter-connections and nodes. Such models are complex to analyse and difficult for a process analyst to comprehend. We improve the Goodness (fitness and structural complexity) of the process models by splitting the event-log into homogeneous subsets by clustering structurally similar traces. We adapt the K-Medoid clustering algorithm with two different distance metrics: Longest Common Subsequence (LCS) and Dynamic Time Warping (DTW). We evaluate the goodness of the process models generated from the clusters using complexity and fitness metrics. We study back-forth & self-loops, bug reopening, and bottlenecks in the clusters obtained and show that clustering enables better analysis. We also propose an algorithm to automate the clustering process: the algorithm takes as input the event log and returns the best cluster set.
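As a rough illustration of the trace-distance idea in the abstract above, the following minimal sketch computes a Dynamic Time Warping distance between two activity sequences and picks a medoid from a set of traces. It is not the Anvaya algorithm itself: the 0/1 local cost, the function names `dtw_distance` and `medoid`, and the activity labels are assumptions made for this example.

```python
def dtw_distance(trace_a, trace_b):
    """Dynamic Time Warping distance between two activity sequences.

    Assumption: the local cost is 0 when two activities match, 1 otherwise.
    """
    n, m = len(trace_a), len(trace_b)
    inf = float("inf")
    # cost[i][j] holds the best alignment cost of trace_a[:i] and trace_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 0.0 if trace_a[i - 1] == trace_b[j - 1] else 1.0
            cost[i][j] = local + min(
                cost[i - 1][j],      # repeat an element of trace_b
                cost[i][j - 1],      # repeat an element of trace_a
                cost[i - 1][j - 1],  # advance both traces
            )
    return cost[n][m]


def medoid(traces, distance=dtw_distance):
    """Return the trace with the smallest total distance to all traces."""
    return min(traces, key=lambda t: sum(distance(t, other) for other in traces))


if __name__ == "__main__":
    # Hypothetical bug life-cycle traces; the labels are illustrative only.
    a = ["NEW", "NEW", "RESOLVED"]
    b = ["NEW", "RESOLVED"]
    print(dtw_distance(a, b))
```

A K-Medoid style procedure would alternate between assigning each trace to its nearest medoid and recomputing the medoid of each cluster; only the distance and the medoid step are shown here.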

preprint, 2015, arXiv

Applying Social Media Intelligence for Predicting and Identifying On-line Radicalization and Civil Unrest Oriented Threats

Research shows that various social media platforms on the Internet such as Twitter, Tumblr (micro-blogging websites), Facebook (a popular social networking website), YouTube (the largest video sharing and hosting website), blogs and discussion forums are being misused by extremist groups for spreading their beliefs and ideologies, promoting radicalization, recruiting members and creating online virtual communities sharing a common agenda. Popular microblogging websites such as Twitter are being used as a real-time platform for information sharing and communication during the planning and mobilization of civil unrest related events. Applying social media intelligence for predicting and identifying online radicalization and civil unrest oriented threats is an area that has attracted several researchers' attention over the past 10 years. Several algorithms, techniques and tools have been proposed in the existing literature to counter and combat cyber-extremism and to predict protest related events well in advance. In this paper, we conduct a literature review of these existing techniques and perform a comprehensive analysis to understand the state-of-the-art, trends and research gaps. We present a one class classification approach to collect scholarly articles targeting the topics and subtopics of our research scope. We perform characterization, classification and an in-depth meta-analysis of about 100 conference and journal papers to gain a better understanding of the existing literature.

preprint, 2015, arXiv

Intention-Oriented Process Model Discovery from Incident Management Event Logs

Intention-oriented process mining is based on the belief that the fundamental nature of processes is mostly intentional (unlike activity-oriented processes) and aims at discovering strategy and intentional process models from event-logs recorded during the process enactment. In this paper, we present an application of intention-oriented process mining for the domain of incident management of an Information Technology Infrastructure Library (ITIL) process. We apply the Map Miner Method (MMM) on a large real-world dataset for discovering hidden and unobservable user behavior, strategies and intentions. We first discover user strategies from the given activity sequence data by applying a Hidden Markov Model (HMM) based unsupervised learning technique. We then process the emission and transition matrices of the discovered HMM to generate a coarse-grained Map Process Model. We present the first application or study of the new and emerging field of intention-oriented process mining on an incident management event-log dataset and discuss its applicability, effectiveness and challenges.

preprint, 2015, arXiv

Kernel Based Sequential Data Anomaly Detection in Business Process Event Logs

Business Process Management Systems (BPMS) log events and traces of activities during the execution of a process. Anomalies are defined as a deviation or departure from the normal or common order. Anomaly detection in business process logs has several applications such as fraud detection and understanding the causes of process errors. In this paper, we present a novel approach for anomaly detection in business process logs. We model the event logs as sequential data and apply kernel based anomaly detection techniques to identify outliers and discordant observations. Our technique is unsupervised (it does not require a pre-annotated training dataset) and employs a kNN (k-nearest neighbor) kernel based technique with a normalized longest common subsequence (LCS) similarity measure. We conduct experiments on a recent, large, real-world incident management dataset of an enterprise and demonstrate that our approach is effective.
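The combination of a normalized LCS similarity with a kNN score, as described in the abstract above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the normalization (LCS length divided by the square root of the product of the two lengths), the scoring rule (one minus the similarity to the k-th most similar other trace), and all function names are choices made for this example.

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def nlcs_similarity(a, b):
    """Normalized LCS similarity in [0, 1] (assumed normalization)."""
    if not a or not b:
        return 1.0 if not a and not b else 0.0
    return lcs_length(a, b) / math.sqrt(len(a) * len(b))


def knn_anomaly_scores(traces, k=1):
    """Score each trace by 1 - similarity to its k-th most similar other trace."""
    scores = []
    for i, trace in enumerate(traces):
        sims = sorted(
            (nlcs_similarity(trace, other) for j, other in enumerate(traces) if j != i),
            reverse=True,
        )
        kth = sims[min(k, len(sims)) - 1] if sims else 0.0
        scores.append(1.0 - kth)
    return scores


if __name__ == "__main__":
    # Hypothetical incident traces; the activity labels are illustrative only.
    normal = ["open", "assign", "close"]
    odd = ["open", "escalate", "escalate", "cancel"]
    print(knn_anomaly_scores([normal, normal, normal, odd]))
```

Traces with a high score have no close neighbours under the similarity measure and are candidates for manual inspection; no labelled training data is needed.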

preprint, 2015, arXiv

Survey Results on Threats To External Validity, Generalizability Concerns, Data Sharing and University-Industry Collaboration in Mining Software Repository (MSR) Research

Mining Software Repositories (MSR) is an applied and practice-oriented field aimed at solving real problems encountered by practitioners and bringing value to Industry. Replication of results and findings, generalizability and external validity, University-Industry collaboration, data sharing and the creation of dataset repositories are important issues in MSR research. Research consisting of bibliometric analysis of MSR papers shows a lack of University-Industry collaboration, a deficiency of studies on closed or proprietary source datasets and a lack of data as well as tool sharing by researchers. We conduct a survey of authors from the past three years of the MSR conference (2012, 2013 and 2014) to collect data on their views and suggestions to address the stated concerns. We sent 20 questions to more than 100 authors and received responses from 39 authors. Our results show that about one-third of the respondents always make their dataset publicly available and about one-third believe that data sharing should be a mandatory condition for publication in MSR conferences. Our survey reveals that more than 50% of authors used solely open-source software (OSS) datasets for their research. More than 50% of the respondents mentioned that difficulty in sharing Industrial datasets outside the company is one of the major impediments to University-Industry collaboration.

preprint, 2014, arXiv

Chaff from the Wheat : Characterization and Modeling of Deleted Questions on Stack Overflow

Stack Overflow is the most popular CQA website for programmers on the web, with 2.05M users, 5.1M questions and 9.4M answers. Stack Overflow has explicit, detailed guidelines on how to post questions and an ebullient moderation community. Despite these precise communications and safeguards, questions posted on Stack Overflow can be extremely off topic or very poor in quality. Such questions can be deleted from Stack Overflow at the discretion of experienced community members and moderators. We present the first study of deleted questions on Stack Overflow. We divide our study into two parts: (i) characterization of deleted questions over approx. 5 years (2008-2013) of data, and (ii) prediction of deletion at the time of question creation. Our characterization study reveals multiple insights on the question deletion phenomenon. We observe a significant increase in the number of deleted questions over time. We find that it takes substantial time to vote a question to be deleted but, once voted, the community takes swift action. We also see that question authors delete their questions to salvage reputation points. We notice some instances of accidental deletion of good quality questions, but such questions are voted back to be undeleted quickly. We discover a pyramidal structure of question quality on Stack Overflow and find that deleted questions lie at the bottom (lowest quality) of the pyramid. We also build a predictive model to detect the deletion of a question at creation time. We experiment with 47 features based on User Profile, Community Generated, Question Content and Syntactic style and report an accuracy of 66%. Our feature analysis reveals that all four categories of features are important for the prediction task. Our findings reveal important suggestions for content quality maintenance on community based question answering websites.

preprint, 2013, arXiv

A Case-Study on Teaching Undergraduate-Level Software Engineering Course Using Inverted-Classroom, Large-Group, Real-Client and Studio-Based Instruction Model

We present a case-study on teaching an undergraduate level course on Software Engineering (second year and fifth semester of a bachelor's program in Computer Science) at a State University (New Delhi, India) using a novel teaching instruction model. Our approach has four main elements: inverted or flipped classroom, studio-based learning, real-client projects and deployment, large team and peer evaluation. We present our motivation and approach, challenges encountered, pedagogical benefits, findings (both positive and negative) and recommendations. Our motivation was to teach Software Engineering using active learning (significantly increasing the engagement and collaboration with the Instructor and other students in the class), team-work, a balance between theory and practice, both technical and managerial skills encountered in the real world, and problem-based learning (through an intensive semester-long project). We conduct a detailed survey (anonymous, optional and online) and present the results of student responses. Survey results reveal that for nearly every student (class size: 89) the instruction model was new, interesting and had a positive impact on motivation, in addition to meeting the learning outcome of the course.

preprint, 2013, arXiv

Fit or Unfit : Analysis and Prediction of 'Closed Questions' on Stack Overflow

Stack Overflow is widely regarded as the most popular Community driven Question Answering (CQA) website for programmers. Questions posted on Stack Overflow which are not related to programming topics are marked as 'closed' by experienced users and community moderators. A question can be 'closed' for five reasons: duplicate, off-topic, subjective, not a real question and too localized. In this work, we present the first study of 'closed' questions on Stack Overflow. We download 4 years of publicly available data which contains 3.4 Million questions. We first analyze and characterize the complete set of 0.1 Million 'closed' questions. Next, we use a machine learning framework and build a predictive model to identify a 'closed' question at the time of question creation. One of our key findings is that despite being marked as 'closed', subjective questions contain high information value and are very popular with the users. We observe an increasing trend in the percentage of closed questions over time and find that this increase is positively correlated with the number of newly registered users. In addition, we also see a decrease in community participation to mark a 'closed' question, which has led to an increase in moderation job time. We also find that questions closed with the Duplicate and Off Topic labels are relatively more prone to reputation gaming. For the 'closed' question prediction task, we make use of multiple genres of feature sets based on user profile, community process, textual style and question content. We use a state-of-the-art machine learning classifier based on an ensemble learning technique and achieve an overall accuracy of 73%. To the best of our knowledge, this is the first experimental study to analyze and predict 'closed' questions on Stack Overflow.

preprint, 2013, arXiv

Solutions to Detect and Analyze Online Radicalization : A Survey

Online Radicalization (also called Cyber-Terrorism or Extremism or Cyber-Racism or Cyber-Hate) is widespread and has become a major and growing concern to society, governments and law enforcement agencies around the world. Research shows that various platforms on the Internet (which have a low barrier to publishing content, allow anonymity, provide exposure to millions of users and offer the potential for very quick and widespread diffusion of a message) such as YouTube (a popular video sharing website), Twitter (an online micro-blogging service), Facebook (a popular social networking website), online discussion forums and the blogosphere are being misused for malicious intent. Such platforms are being used to form hate groups and racist communities, spread extremist agendas, incite anger or violence, promote radicalization, recruit members and create virtual organizations and communities. Automatic detection of online radicalization is a technically challenging problem because of the vast amount of data, unstructured and noisy user-generated content, dynamically changing content and adversary behavior. Several solutions have been proposed in the literature aiming to combat and counter cyber-hate and cyber-extremism. In this survey, we review solutions to detect and analyze online radicalization. We review 40 papers published at 12 venues from June 2003 to November 2011. We present a novel classification scheme to classify these papers. We analyze these techniques, perform trend analysis, discuss limitations of existing techniques and identify research gaps.

preprint, 2012, arXiv

Characterizing Pedophile Conversations on the Internet using Online Grooming

Cyber-crime targeting children, such as online pedophile activity, is a major and growing concern to society. A deep understanding of predatory chat conversations on the Internet has implications for designing effective solutions to automatically distinguish malicious conversations from regular conversations. We believe that a deeper understanding of the pedophile conversation can result in more sophisticated and robust surveillance systems than the majority of current systems, which rely only on shallow processing such as simple word-counting or key-word spotting. In this paper, we study pedophile conversations from the perspective of online grooming theory and perform a series of linguistic-based empirical analyses on several pedophile chat conversations to gain useful insights and patterns. We manually annotated 75 pedophile chat conversations with six stages of online grooming and tested several hypotheses on them. The results of our experiments reveal that relationship forming is the most dominant online grooming stage, in contrast to the sexual stage. We use a widely used word-counting program (LIWC) to create psycho-linguistic profiles for each of the six online grooming stages to discover interesting textual patterns useful for improving our understanding of the online pedophile phenomenon. Furthermore, we present empirical results that throw light on various aspects of a pedophile conversation such as the probability of state transitions from one stage to another, the distribution of a pedophile chat conversation across various online grooming stages and correlations between pre-defined word categories and online grooming stages.

preprint, 2011, arXiv

Mining User Comment Activity for Detecting Forum Spammers in YouTube

Research shows that comment spamming (comments which are unsolicited, unrelated, abusive, hateful, commercial advertisements, etc.) in online discussion forums has become a common phenomenon in Web 2.0 applications and there is a strong need to counter or combat comment spamming. We present a method to automatically detect comment spammers in YouTube (the largest and a popular video sharing website) forums. The proposed technique is based on mining the comment activity log of a user and extracting patterns (such as the time interval between subsequent comments, or the presence of exactly the same comment across multiple unrelated videos) indicating spam behavior. We perform empirical analysis on data crawled from YouTube and demonstrate that the proposed method is effective for the task of comment spammer detection.
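The two activity patterns named in the abstract above (short intervals between subsequent comments, and the exact same comment on multiple videos) could be extracted from a per-user comment log roughly as follows. This is a hedged sketch only: the log layout `(timestamp_seconds, video_id, text)`, the function names, and the thresholds `max_gap` and `min_repeats` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def spam_signals(comments):
    """Extract two simple signals from one user's comment log.

    comments: list of (timestamp_seconds, video_id, text) tuples (assumed layout).
    """
    ordered = sorted(comments)  # sorts by timestamp first
    gaps = [later[0] - earlier[0] for earlier, later in zip(ordered, ordered[1:])]
    videos_per_text = defaultdict(set)
    for _, video_id, text in ordered:
        videos_per_text[text].add(video_id)
    # Count comment texts that appear verbatim on more than one video.
    repeated = sum(1 for videos in videos_per_text.values() if len(videos) > 1)
    return {"min_gap": min(gaps) if gaps else None, "repeated_texts": repeated}


def looks_like_spammer(comments, max_gap=30, min_repeats=1):
    """Flag a user when both signals fire (illustrative thresholds only)."""
    signals = spam_signals(comments)
    fast = signals["min_gap"] is not None and signals["min_gap"] <= max_gap
    return fast and signals["repeated_texts"] >= min_repeats


if __name__ == "__main__":
    log = [
        (0, "v1", "Visit my channel"),
        (10, "v2", "Visit my channel"),
        (25, "v3", "Visit my channel"),
    ]
    print(spam_signals(log), looks_like_spammer(log))
```

A real detector would need to judge whether the videos are actually unrelated and to calibrate thresholds on labelled data; this sketch only shows how the signals can be derived from an activity log.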
