Source author record

P. Nagabhushan

P. Nagabhushan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Computer Vision, Information Retrieval, Machine Learning

Catalog footprint

What is connected

13 works

3 topics

4 close collaborators


preprint, 2020, arXiv

Automatic Page Segmentation Without Decompressing the Run-Length Compressed Text Documents

Page segmentation is a crucial stage in the automatic analysis of documents with complex layouts. It has traditionally been carried out on uncompressed documents, although most real-life documents exist in compressed form to make storage and transfer efficient. Carrying out page segmentation directly on compressed documents, without a decompression stage, is a challenging goal. This paper demonstrates the possibility of performing page segmentation directly on the run-length data of a CCITT Group-3 compressed text document, which may be single- or multi-columned and may contain text regions in inverted text color mode. Before the document is segmented into columns, each column into paragraphs, each paragraph into text lines, each line into words, and finally each word into characters, a pre-processing stage is needed. The pre-processing stage identifies the normal and inverted text regions and toggles the inverted regions to normal mode. Next, to initiate column separation, a new strategy of incremental assimilation of white space runs in the vertical direction, with auto-estimation of certain related parameters, is proposed. A procedure for column segmentation that employs these extracted parameters has been devised. A two-level horizontal row separation process then segments every column into paragraphs and, in turn, into text lines. Finally, a two-level vertical column separation process completes the separation into words and characters.
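As a rough illustration of the column-separation idea, the sketch below accumulates white runs row by row and keeps only the column ranges that stay white in every row. It is a simplified stand-in, not the paper's procedure: the row format (alternating run lengths that start with a white run, possibly of length 0), the function names and the `min_gap` parameter are all assumptions made for this example, and the paper's auto-estimated parameters are not reproduced.

```python
def white_intervals(row):
    """Return half-open (start, end) column intervals covered by white runs.

    Assumed format: alternating run lengths, white first (length 0 allowed).
    """
    intervals = []
    col = 0
    for i, run in enumerate(row):
        if i % 2 == 0 and run > 0:  # even positions hold white runs
            intervals.append((col, col + run))
        col += run
    return intervals


def intersect(a, b):
    """Intersect two sorted lists of half-open intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def common_white_gaps(rows, width, min_gap=2):
    """Column ranges that are white in every row and at least min_gap wide."""
    gaps = [(0, width)]
    for row in rows:
        gaps = intersect(gaps, white_intervals(row))
    return [(s, e) for s, e in gaps if e - s >= min_gap]


# Toy 12-pixel-wide rows with two text columns separated by white space.
rows = [
    [1, 4, 3, 3, 1],
    [0, 5, 2, 4, 1],
    [2, 2, 4, 4],
]
gaps = common_white_gaps(rows, width=12)
print(gaps)
```

In this toy input only columns 5 and 6 are white in all three rows, so they form the single candidate gutter between the two text columns.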

preprint, 2020, arXiv

Target specific mining of COVID-19 scholarly articles using one-class approach

In recent years, several research articles have been published on coronavirus-caused diseases such as severe acute respiratory syndrome (SARS), Middle East respiratory syndrome (MERS) and COVID-19. With so many articles available, extracting the best-suited ones manually is time-consuming and impractical. The objective of this paper is to extract the activity and trends of coronavirus-related research articles using machine learning approaches. The COVID-19 Open Research Dataset (CORD-19) is used for experiments, and several target tasks, along with explanations, are defined for classification based on domain knowledge. Clustering techniques are used to group the available articles, and task assignment is then performed using parallel one-class support vector machines (OCSVMs). Experiments with original and reduced features validate the performance of the approach. The k-means clustering algorithm followed by parallel OCSVMs outperforms other methods in both the original and the reduced feature space.

preprint, 2014, arXiv

Automatic Detection of Font Size Straight from Run Length Compressed Text Documents

Automatic detection of font size has many applications in intelligent OCR and document image analysis. It has traditionally been practiced on uncompressed documents, although in real life documents exist in compressed form for efficient storage and transmission. Carrying out font size detection directly on the compressed data, without decompression, would save considerable processing time and space. In this paper we therefore present a novel idea for learning and detecting font size directly from run-length compressed text documents at line level using simple line height features, which paves the way for intelligent OCR and document analysis directly on compressed documents. In the proposed model, mixed-case text documents of different font sizes are segmented into compressed text lines, and extracted features such as line height and ascender height are used to capture the pattern of font size in the form of a regression line, which is then used to detect font size automatically during the recognition stage. The method was tested on a dataset of 50 compressed documents consisting of 780 text lines of single font size and 375 text lines of mixed font size, resulting in an overall accuracy of 99.67%.
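The regression-line idea can be sketched with ordinary least squares on a single feature. This is a minimal illustration under stated assumptions: it uses line height only (the abstract also mentions ascender height), the function names are hypothetical, and the training pairs are made-up toy values, not data from the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_font_size(line_height, slope, intercept):
    """Map a measured line height to the nearest integer font size."""
    return round(slope * line_height + intercept)


# Toy training pairs (line height in pixel rows, font size in points); invented for illustration.
line_heights = [20, 24, 28, 32]
font_sizes = [10, 12, 14, 16]

slope, intercept = fit_line(line_heights, font_sizes)
estimate = predict_font_size(26, slope, intercept)
print(slope, intercept, estimate)
```

In a compressed-domain setting, the line height would be the number of consecutive compressed rows spanned by a segmented text line, so no pixel data needs to be reconstructed to obtain the feature.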

preprint, 2014, arXiv

Automatic Removal of Marginal Annotations in Printed Text Document

Recovering the original printed text from a document with handwritten annotations added in the marginal area is a challenging problem, especially when the original document is not available. This paper aims to recover the original document automatically from the annotated document by detecting and removing any handwritten annotations that appear in the marginal area, without any loss of information. A two-stage algorithm is proposed. In the first stage, the marginal boundary is detected approximately with horizontal and vertical projection profiles, so all of the marginal annotations are removed along with some of the original printed text that lies very close to the marginal boundary. In the second stage, a strategy based on connected components brings back the printed text components cropped during the first stage. The proposed method is validated using a dataset of 50 documents with complex handwritten annotations, giving an overall accuracy of 89.01% in removing the marginal annotations and 97.74% in retrieving the original printed text document.

preprint, 2014, arXiv

Direct Processing of Document Images in Compressed Domain

With the rapid increase in the volume of big data in this digital era, fax documents, invoices, receipts, etc. are traditionally subjected to compression for efficient data storage and transfer. However, processing these documents requires a decompression stage, which demands additional computing resources. This limitation motivates research into the possibility of directly processing compressed images. In this paper, we summarize the research work carried out to perform different operations straight from run-length compressed documents without going through the stage of decompression. The operations demonstrated are feature extraction; text-line, word and character segmentation; document block segmentation; and font size detection, all carried out in the compressed version of the document. The feature extraction methods demonstrate how to extract conventionally defined features such as projection profile, run-histogram and entropy directly from the compressed document data. Document segmentation involves the extraction of compressed segments of text lines, words and characters using the vertical and horizontal projection profile features. Further, an attempt is made to segment a randomly chosen block of interest from the compressed document and subsequently facilitate absolute and relative characterization of the segmented block, which finds real-time applications in automatic processing of bank cheques, challans, etc. in the compressed domain. Finally, an application to detect font size at text-line level is also investigated. All the proposed algorithms are validated experimentally with a sufficient data set of compressed documents.

preprint, 2014, arXiv

Direct Processing of Run Length Compressed Document Image for Segmentation and Characterization of a Specified Block

Extracting a block of interest, referred to as segmenting a specified block in an image, and studying its characteristics is of general research interest, and it could be challenging if such a segmentation task has to be carried out directly in a compressed image. This is the objective of the present research work. The proposal is to develop a method that segments and extracts a specified block and carries out its characterization without decompressing a compressed image, for two major reasons: most image archives contain images in compressed format, and decompressing an image demands additional computing time and space. Specifically, this research work proposes to work on run-length compressed document images.

preprint, 2014, arXiv

Entropy Computation of Document Images in Run-Length Compressed Domain

Compression of documents, images, audio and video has traditionally been practiced to increase the efficiency of data storage and transfer. However, in order to process the data or carry out any analytical computations, decompression has become an unavoidable prerequisite. In this research work, we attempt to compute the entropy, which is an important document analytic, directly from the compressed documents. We use the Conventional Entropy Quantifier (CEQ) and Spatial Entropy Quantifiers (SEQ) for entropy computations [1]. The entropies obtained are useful in applications such as establishing equivalence, word spotting and document retrieval. Experiments have been performed with all the data sets of [1] at character, word and line levels, taking compressed documents in the run-length compressed domain. The algorithms developed are computationally and space efficient, and the results obtained match 100% with the results reported in [1].
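The abstract does not define CEQ or SEQ (both come from [1]), so the sketch below does not attempt to reproduce them. It only illustrates the general point that an entropy measure can be computed from run lengths alone, using the plain Shannon entropy of the black/white pixel proportions as a stand-in. The row format (alternating run lengths starting with a white run) and the function name are assumptions for this example.

```python
import math


def pixel_entropy_from_runs(rows):
    """Shannon entropy (bits) of the black/white pixel distribution.

    Pixel counts are obtained by summing run lengths, so no bitmap is built.
    Assumed format: alternating run lengths per row, white first.
    """
    black = sum(sum(row[1::2]) for row in rows)  # odd positions hold black runs
    total = sum(sum(row) for row in rows)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in (black, total - black):
        if count:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


balanced = pixel_entropy_from_runs([[2, 2], [0, 2, 2]])  # half black, half white
blank = pixel_entropy_from_runs([[4]])                   # all white
print(balanced, blank)
```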

preprint, 2014, arXiv

Extraction of Line Word Character Segments Directly from Run Length Compressed Printed Text Documents

Segmentation of a text document into lines, words and characters, which is considered a crucial pre-processing stage in Optical Character Recognition (OCR), is traditionally carried out on uncompressed documents, although most real-life documents are available in compressed form for reasons such as transmission and storage efficiency. This implies that the compressed image must be decompressed, which demands additional computing resources. This limitation has motivated us to take up research in document image analysis using compressed documents. In this paper, we take a new approach to segmentation at line, word and character level in run-length compressed printed text documents. We extract the horizontal projection profile curve from the compressed file and perform line segmentation using its local minima points. However, tracing vertical information, which leads to tracking words and characters in a run-length compressed file, is not straightforward. Therefore, we propose a novel technique for carrying out simultaneous word and character segmentation by popping out column runs from each row in an intelligent sequence. The proposed algorithms have been validated with 1101 text lines, 1409 words and 7582 characters from a data set of 35 noise- and skew-free compressed documents in Bengali, Kannada and English scripts.
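The line-segmentation step can be sketched as follows. As a simplification of the local-minima rule described above, this example treats rows whose profile value is at or below a threshold as the valleys between text lines, which is reasonable only for noise- and skew-free input. The row format (alternating run lengths starting with a white run), the function name and the `threshold` parameter are assumptions for illustration.

```python
def segment_lines(profile, threshold=0):
    """Return (first_row, last_row) pairs for maximal runs with profile > threshold."""
    lines = []
    start = None
    for y, value in enumerate(profile):
        if value > threshold and start is None:
            start = y                      # a text line begins
        elif value <= threshold and start is not None:
            lines.append((start, y - 1))   # the text line ended on the previous row
            start = None
    if start is not None:                  # text line reaches the last row
        lines.append((start, len(profile) - 1))
    return lines


# Toy compressed rows, 8 pixels wide: white run first, then alternating.
rows = [[8], [1, 5, 2], [2, 4, 2], [8], [8], [0, 3, 5], [8]]

# Horizontal projection profile: black pixels per row, summed from the black runs.
profile = [sum(row[1::2]) for row in rows]
text_lines = segment_lines(profile)
print(profile, text_lines)
```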

preprint, 2014, arXiv

Extraction of Projection Profile, Run-Histogram and Entropy Features Straight from Run-Length Compressed Text-Documents

Document image analysis, like any digital image analysis, requires identification and extraction of proper features. These are generally extracted from uncompressed images, though in reality images are made available in compressed form for reasons such as transmission and storage efficiency. This implies that the compressed image must be decompressed, which demands additional computing resources. This limitation motivates research into extracting features directly from the compressed image. In this research, we propose to extract essential features such as projection profile, run-histogram and entropy for text document analysis directly from run-length compressed text documents. The experiments illustrate that the features are extracted directly from the compressed image without going through the stage of decompression, which reduces computing time. The feature values so extracted are identical to those extracted from uncompressed images.
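A minimal sketch of how projection profiles and a run histogram can be read off run-length data is given below. It assumes each row is stored as alternating run lengths that begin with a white run (possibly of length 0); this format and the function names are illustrative choices, not the paper's implementation.

```python
from collections import Counter


def horizontal_projection_profile(rows):
    """Black pixels per row: the sum of the black runs (odd positions)."""
    return [sum(row[1::2]) for row in rows]


def vertical_projection_profile(rows, width):
    """Black pixels per column, found by walking the runs of each row."""
    profile = [0] * width
    for row in rows:
        col = 0
        for i, run in enumerate(row):
            if i % 2 == 1:  # black run: increment every column it covers
                for c in range(col, col + run):
                    profile[c] += 1
            col += run
    return profile


def black_run_histogram(rows):
    """Frequency of each black run length across all rows."""
    return Counter(run for row in rows for run in row[1::2])


# Toy 6-pixel-wide image: BBBWWB / WWBBWW / WWWWWW
rows = [[0, 3, 2, 1], [2, 2, 2], [6]]

hpp = horizontal_projection_profile(rows)
vpp = vertical_projection_profile(rows, width=6)
hist = black_run_histogram(rows)
print(hpp, vpp, dict(hist))
```

The horizontal profile needs only one pass over the run lengths, while the vertical profile has to track column positions, which matches the observation elsewhere in this catalog that vertical information is harder to trace in run-length data.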

preprint, 2014, arXiv

Texture Defect Detection in Gradient Space

In this paper, we propose a machine vision algorithm for automatically detecting defects in patterned textures with the help of gradient space and its energy. Experiments on real fabric images with defects show that the proposed method can be used for automatic detection of fabric defects in textile industries.

preprint, 2013, arXiv

Periodicity Extraction using Superposition of Distance Matching Function and One-dimensional Haar Wavelet Transform

Periodicity of a texture is an important visual characteristic and is often used as a measure for textural discrimination at the structural level. Knowledge of the periodicity of a texture is essential in texture synthesis and texture compression, and also in the design of friezes and wallpapers. In this paper, we propose a method of periodicity extraction from noisy images based on superposition of the distance matching function (DMF) and wavelet decomposition, without de-noising the test images. The overall DMFs are subjected to single-level Haar wavelet decomposition to obtain approximation and detail coefficients. The extracted coefficients help determine periodicities in the row and column directions. We illustrate the usefulness and effectiveness of the proposed method in a texture synthesis application.
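The single-level Haar step can be sketched on its own. The example below applies the orthonormal one-dimensional Haar transform to an arbitrary toy signal; in the method described above the input would be an overall DMF, which is not defined in this abstract and is therefore not reproduced here. The function name is hypothetical.

```python
import math


def haar_single_level(signal):
    """One level of the orthonormal 1-D Haar transform.

    Returns (approximation, detail) coefficient lists, each half the input length.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


signal = [4, 4, 2, 0]  # toy values, not a real DMF
approx, detail = haar_single_level(signal)
print(approx, detail)
```

With this normalization the transform preserves energy: the sum of squared coefficients equals the sum of squared input samples.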

preprint, 2012, arXiv

Automatic Detection of Texture Defects Using Texture-Periodicity and Gabor Wavelets

In this paper, we propose a machine vision algorithm for automatically detecting defects in textures belonging to 16 out of 17 wallpaper groups using texture-periodicity and a family of Gabor wavelets. Input defective images are subjected to Gabor wavelet transformation at multiple scales and orientations, and a resultant image is obtained in L2 norm. The resultant image is split into several periodic blocks, and the energy of each block is used as a feature space to automatically identify defective and defect-free blocks using Ward's hierarchical clustering. Experiments on defective fabric images of three major wallpaper groups, namely pmm, p2 and p4m, show that the proposed method is robust in finding fabric defects without human intervention and can be used for automatic defect detection in fabric industries.

preprint, 2012, arXiv

GLCM-based chi-square histogram distance for automatic detection of defects on patterned textures

Chi-square histogram distance is one of the distance measures that can be used to find the dissimilarity between two histograms. Motivated by the fact that texture discrimination by the human vision system is based on second-order statistics, we use the histogram of the gray-level co-occurrence matrix (GLCM), which is based on second-order statistics, and propose a new machine vision algorithm for automatic defect detection on patterned textures. Input defective images are split into several periodic blocks, and GLCMs are computed after quantizing the gray levels from 0-255 to 0-63 to keep the GLCM compact and reduce computation time. The dissimilarity matrix derived from chi-square distances of the GLCMs is subjected to hierarchical clustering to automatically identify defective and defect-free blocks. The effectiveness of the proposed method is demonstrated through experiments on defective real-fabric images of two major wallpaper groups (pmm and p4m).
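The quantization, GLCM and distance steps can be sketched in plain Python. Several details here are assumptions, since the abstract does not fix them: the pixel offset (one step to the right), the unnormalized counts, and the chi-square variant with a 0.5 factor (other variants omit it). Function names are hypothetical, and the hierarchical clustering stage is left out.

```python
def quantize(block, levels=64, max_value=256):
    """Map gray levels 0..max_value-1 onto 0..levels-1 by integer division."""
    step = max_value // levels
    return [[pixel // step for pixel in row] for row in block]


def glcm(block, levels=64, dx=1, dy=0):
    """Co-occurrence counts for pixel pairs separated by the offset (dx, dy)."""
    matrix = [[0] * levels for _ in range(levels)]
    height, width = len(block), len(block[0])
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                matrix[block[y][x]][block[ny][nx]] += 1
    return matrix


def chi_square_distance(h1, h2):
    """0.5 * sum((a - b)^2 / (a + b)) over bins, skipping bins empty in both."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b:
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total


def flatten(matrix):
    return [value for row in matrix for value in row]


# Two toy 2x2 blocks with different horizontal structure.
block_a = [[0, 0], [255, 255]]
block_b = [[0, 255], [0, 255]]

hist_a = flatten(glcm(quantize(block_a)))
hist_b = flatten(glcm(quantize(block_b)))
distance_ab = chi_square_distance(hist_a, hist_b)
distance_aa = chi_square_distance(hist_a, hist_a)
print(distance_ab, distance_aa)
```

Quantizing to 64 levels shrinks each GLCM from 256 x 256 to 64 x 64 entries, which is the compactness benefit the abstract refers to.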
