A Method for Webpage Classification Based on URL Using Clustering

Sunita; Gurvinder Singh; Vijay Rana

doi:10.2174/2213275912666190612143913

Abstract

Background: Pattern mining is the process of extracting useful information from a large dataset. One sub-field of web mining is the sequential extraction of noisy data from user queries, which is considered together with redundancy handling. In the existing literature, this redundancy handling mechanism is known as ambiguity handling. The clustering mechanisms employed in existing systems include k-means and semantic search, in the context of the incremental growth of the internet.

Aims: The proposed work comprises an analysis of techniques used to extract useful URLs to replace noisy data.

Methods: We consider noisy data extraction from user queries together with redundancy handling. In the existing literature, this redundancy handling mechanism is known as ambiguity handling. The clustering mechanisms used in existing systems include k-means and semantic search. These mechanisms are static, which degrades performance in terms of execution time. This work suggests a mechanism for improving that performance.

Results: The MPV (Most-Probable-Values) clustering and N-gram techniques considered for improvement in the existing literature can be improved further using the research methodology specified in this work.

Conclusion: In the proposed system, results are based on MPV clustering combined with N-gram techniques. N-gram analysis counts the instances of a word or phrase across all query data. The evaluation parameters are execution time and the number of URLs retrieved for web page classification.
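
As an illustration of the N-gram counting described in the conclusion, the following is a minimal sketch and not the authors' implementation. It assumes queries are plain strings split on whitespace; the function names `ngrams` and `count_ngrams` and the sample queries are hypothetical, chosen only for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-token tuples found in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(queries, n):
    """Count how often each word n-gram occurs across all queries."""
    counts = Counter()
    for query in queries:
        # Assumption: lowercase and whitespace tokenization is sufficient.
        tokens = query.lower().split()
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical query log used only for illustration.
queries = [
    "cheap flights to delhi",
    "flights to delhi today",
    "hotels in delhi",
]

bigram_counts = count_ngrams(queries, 2)
print(bigram_counts.most_common(2))
```

Phrases that recur across many queries receive higher counts, which is one way frequent patterns could be separated from noisy, one-off terms before clustering.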

Keywords: Ambiguity, clustering, noisy data, user query, URL, webpage.


DOI https://dx.doi.org/10.2174/2213275912666190612143913	Print ISSN 2666-2558
Publisher Name Bentham Science Publisher	Online ISSN 2666-2566

Recent Advances in Computer Science and Communications
