With the increasing number of Mauritian-owned websites on the internet, the need for classification is becoming increasingly important. Our objective in this research is to classify a list of websites into seven broad categories, namely education, entertainment, government, health, tourism, sports and shopping. The homepages of three hundred and nineteen websites were used in this study. We have exploited...
Focused crawlers aim to fetch only pages related to a specific subject area from the millions of web pages on the Internet. The essential task in a focused crawler is to predict whether a page is related to the target subject area without actually fetching the page content itself. Link-context-based focused crawlers focus on the text surrounding each link to classify the page pointed by...
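The idea of a link context described above can be illustrated with a minimal Python sketch. It is not the method of the paper: the class name `LinkContextParser`, the function `link_contexts` and the fixed word window are assumptions made for illustration. The sketch collects the anchor text of each link plus a few words on either side, which a classifier could then score before the target page is fetched.

```python
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Hypothetical helper: records visible words and where each link's anchor text sits."""

    def __init__(self):
        super().__init__()
        self.words = []    # all text words in document order
        self.links = []    # (href, start, end) indices of anchor text in self.words
        self._open = None  # (href, start) for the anchor currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, len(self.words))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, len(self.words)))
            self._open = None

    def handle_data(self, data):
        # Assumption: every text node counts as context (script/style are not filtered).
        self.words.extend(data.split())


def link_contexts(html, window=5):
    """Map each href to its anchor text plus `window` words on either side."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    contexts = {}
    for href, start, end in parser.links:
        lo = max(0, start - window)
        hi = min(len(parser.words), end + window)
        # If the same href occurs twice, the later occurrence overwrites the earlier one.
        contexts[href] = " ".join(parser.words[lo:hi])
    return contexts
```

A fixed word window is only one possible definition of link context; the choice of window size is an assumption here.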
Current techniques use word matching and clustering to classify webpages. These techniques take an ad hoc approach of checking and matching all the keywords in a webpage for classification. These methods are efficient but not without problems. In general, they suffer from the following problems: 1) As they use brute force matching for the entire document, they tend to be slow...
In this study we propose a genetic algorithm that selects the best features for the Web page classification problem in order to improve the accuracy and run-time performance of the classifiers. The increase in the amount of information on the Web has created a need for accurate automated Web page classifiers to maintain Web directories and to increase search engines' performance. To determine whether a Web page belongs...
Spam detection is a crucial task in web information retrieval systems. The dynamic nature of information resources, as well as the continuous changes in users' information demands, makes web spam detection a challenging topic. So far, researchers with different backgrounds have proposed many different methods to tackle the problem of spam web pages. In...
In Web database integration, crawling data pages is important for data extraction. The fact that data are contained in multiple result pages increases the difficulty of accessing data for integration. Thus, it is necessary to accurately and automatically crawl query result pages from Web databases. To address this problem, we propose a novel approach based on URL classification to effectively identify...
In this paper, the mutual information formula is improved by introducing a hyperlink factor. Experiments show that introducing the hyperlink elements of web pages can improve classification accuracy in feature selection methods based on mutual information and correlation, especially for those of strong. The improvement is therefore effective in web page classification.
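For orientation, the standard mutual information score used in term-based feature selection can be sketched in Python as follows. This is only the conventional baseline, not the improved formula of the paper: the abstract does not specify how the hyperlink factor enters, so it is left out. The function name `mutual_information` and the four document-count arguments are assumptions made for illustration.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between term presence and class membership.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    if n == 0:
        return 0.0
    # Each cell: (joint count, term marginal count, class marginal count).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    total = 0.0
    for joint, term_marginal, class_marginal in cells:
        if joint > 0:  # empty cells contribute nothing (0 * log 0 is taken as 0)
            total += (joint / n) * math.log2(n * joint / (term_marginal * class_marginal))
    return total
```

Terms would typically be ranked by this score per class and the top-scoring ones kept as features; a term that is independent of the class scores 0.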
Web page categorization is becoming a pivotal technology for processing and organizing large volumes of documents and data. Features are selected to improve text processing by taking the hyperlink factor into account in the Maximum Entropy Model. Experiments show that the method is more effective. It not only yields the most consistent distribution, but also ensures accuracy and generality in webpage classification...
This paper presents a study on the performance of attribute selection methods used with the Ant-Miner algorithm for web text categorization. The data set generated by each attribute selection method was classified with Ant-Miner to assess performance in terms of predictive accuracy and the number of rules generated. The classification results were also compared with those of the C4.5 algorithm.
Genre classification is a key aspect of music descriptions. In 2006, Schedl et al. presented a method for genre classification through web-based co-occurrence analysis. We evaluate whether this method is still valid, given the evolution of web search technologies. We identify some issues with page count as the main parameter for the analysis in relation to the genre taxonomies used, choice of...
Client honeypots are security devices designed to find servers that attack clients. High-interaction client honeypots (HICHPs) classify potentially malicious Web pages by driving a dedicated vulnerable Web browser to retrieve and classify these pages. Considering the size of the Internet, the ability to identify many malicious Web pages is crucial. HICHPs, however, present challenges: They...
Despite the growth of the Web in recent years, some portions of the Web remain largely underdeveloped, as shown by the lack of high-quality content. An example is the botany-specific Web directory, where the lack of well-structured Web directories has limited users' ability to browse the necessary information. In this research we propose an improved framework for constructing a specific Web directory...
In this paper, the current classification is reclassified through K-means based on feedback from Web usage mining, in order to improve the accuracy of news recommendation and the convergence of classification. Based on Web usage feedback, the approach can extract the most relevant keywords and eliminate the disturbance caused by polysemous words within a category. The reclassification of news...
Traditional automatic classifiers often produce misclassifications. Folksonomy, a new manual classification scheme based on users tagging items with freely chosen keywords, can effectively resolve this problem. Even though the scalability of folksonomy is much higher than that of other manual classification schemes, the method cannot deal with a tremendous number of items, such as whole Weblog articles...
Classification based on mining association rules is a method with good accuracy and a human-readable classification model. The aim of this paper is to propose a modification of the basic association-based classification method that can be used for data extracted from Web pages. In this paper, the modifications of the method and the necessary discretization of numeric attributes are described. Next,...
This paper addresses practical aspects of Web page classification not captured by the classical text mining framework. Classifiers are supposed to perform well on a broad variety of pages. We argue that constructing training corpora is a bottleneck for building such classifiers, and that care has to be taken if the goal is to generalize to previously unseen kinds of pages on the Web. We study techniques...
Two important factors that indirectly influence Internet shoppers to make online purchases are the visual layout and the presentation of the web page. In this paper, we propose an approach to web page layout analysis in order to assess the design of e-commerce Web sites. Firstly, our proposed method segments each web page into five different blocks: top, left, center, right and bottom. We study...
The rapid growth of the Web has made it a huge source of information, whose data can be accessed more easily and efficiently if its content is well organized. Automatic classification of Web pages is one of the major methods in Web content mining (WCM) and can be of great value in the development and maintenance of Web directories. Based on the analysis done, the CMAC neural network showed...
Using the repeating information that appears in Web pages to represent semantic meaning can improve the accuracy of Web page classification. This paper analyses and improves the traditional repeating-pattern representation methods, and further proposes a new semantic representation of Web information based on repeating patterns. First, the repeating patterns are formal...
In recent years, web mining has become a hotspot of data mining with the development of the Internet. Web page classification is one of the essential techniques for web mining, since classifying web pages of a class of interest is often the first step in mining the web. The high-dimensional text vocabulary space is one of the main challenges of web pages. In this paper, we study the capabilities...