In this paper, we focus on improving the multi-label learning task with ensemble learning. Compared to traditional single-algorithm methods, ensemble methods are recognized to achieve much better performance than any constituent learned model, especially when the different classifiers are conditionally independent. Existing multi-label ensemble algorithms mainly focus on creating diverse...
In this paper, Multi-Task Linear Dependency Modeling is proposed to identify drug-related webpages that contain many images and much text. Linear Dependency Modeling exploits semantic relations between image features and text features, and Multi-Task Learning takes advantage of webpage metadata. Meaningful information from webpages can thus be fully exploited to improve classification accuracy. Experimental...
This paper addresses the issue of how to provide an overview of the knowledge associated with a given query keyword. In particular, we focus on the concerns of users who search for Web pages with a given query keyword. The Web search information needs for a given query keyword are collected through search-engine suggestions. Given a query keyword, we collect up to around 1,000 suggestions, many of which are redundant. We cluster...
Unlike general-purpose search engines, a vertical search engine needs to collect and index only a specific knowledge domain, and can therefore provide more professional search services to users. In this paper, we propose a novel library-resource vertical search engine based on ontology technology. In the vertical search engine, the information that the crawler collects from the Internet should be further...
The steady growth and popularization of the Web has led spammers to develop techniques to circumvent search engines, aiming at good visibility for their web pages in search results. Spammers are responsible for serious problems such as user dissatisfaction, irritation, exposure to unpleasant or malicious content, and financial loss. Although different machine learning approaches have been used to detect web spam,...
Spatial analysis in many fields requires effective address extraction from text reports. This problem is of particular importance in social science, where news reports contain information about socially relevant incidents. Previous address extraction work focuses on web pages where addresses are separated from other text; news reports, however, contain addresses embedded in running text. Hence the need for...
In order to manage and organize information on the web, we propose a novel web page classification strategy integrating a topic model and an SVM. We use the topic model to harness the implicit information on web pages for feature extraction. The strategy achieves an accuracy of 84.15%, 2.23% higher than the traditional classification strategy based on CHI.
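The abstract above combines topic-model features with an SVM classifier. A minimal sketch of that pipeline, assuming scikit-learn with LDA for the topic model and a linear SVM; the corpus, labels, and hyperparameters below are invented for illustration and are not the paper's data:

```python
# Sketch: topic-model features feeding an SVM, per the abstract's idea.
# Assumptions: LDA as the topic model, LinearSVC as the SVM; the tiny
# corpus and labels below are made up for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC

pages = [
    "football match score goal league",
    "election vote government policy minister",
    "goal striker league football season",
    "minister policy parliament vote election",
]
labels = ["sport", "politics", "sport", "politics"]

counts = CountVectorizer().fit_transform(pages)        # raw term counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(counts)             # per-page topic proportions

clf = LinearSVC().fit(topic_features, labels)          # SVM on topic features
print(clf.predict(topic_features))
```

The low-dimensional topic proportions stand in for the raw bag-of-words features a CHI-based strategy would select from, which is the comparison the abstract draws.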
The incentive for this work originates from the need to retrieve useful web news pages from an Indian news-website corpus. News web pages differ from other web pages, so it is particularly important to recognize web news accurately for precise classification. Our goal is a simple yet efficient technique to mine news articles from a web corpus. To accomplish this task, the automatic recognition method...
In this paper we present an overview of our proposed algorithms for classifying regions of web pages based on content and visual properties. We show how hidden Markov trees can be effective for this classification, and how this may ultimately offer improved experiences to users viewing web pages.
Detecting explicit user actions, i.e., requests for web pages such as hyperlink clicks, from passive traces is fundamental for many applications, such as network forensics or content popularity estimation. Every URL explicitly visited by a user usually triggers further automatic URL requests to obtain all objects that compose the web page. HTTP traces provide a summary of all URLs requested by users,...
This work addresses the problem of URL topic classification by making use of the text of Uniform Resource Locators (URLs). We introduce a method for classifying web pages into topics by extending the Jaccard distance measure and using an n-gram approach. We also compare our method with the best-performing known distance measures for Boolean data in the literature, i.e., Jaccard, Dice...
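The core ingredients named above, character n-grams of a URL plus the Jaccard distance, can be sketched as follows. This is a plain baseline under our own assumptions (nearest-neighbour assignment, trigrams), not the paper's extended measure; the URLs and labels are invented:

```python
# Sketch: URL topic classification via n-gram sets and Jaccard distance.
# Assumptions: character trigrams, nearest-neighbour label assignment;
# the example URLs/labels are hypothetical.
def ngrams(url, n=3):
    """Set of character n-grams of a lowercased URL."""
    url = url.lower()
    return {url[i:i + n] for i in range(len(url) - n + 1)}

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over two n-gram sets."""
    union = a | b
    return 1.0 if not union else 1.0 - len(a & b) / len(union)

def classify(url, labelled):
    """Return the label of the labelled URL nearest in Jaccard distance."""
    grams = ngrams(url)
    nearest = min(labelled, key=lambda pair: jaccard_distance(grams, ngrams(pair[0])))
    return nearest[1]

labelled = [
    ("www.bbc.com/sport/football", "sport"),
    ("www.bbc.com/news/politics", "news"),
]
print(classify("www.espn.com/football/scores", labelled))  # → sport
```

Because URLs are short, set-based n-gram measures like this are a natural fit, which is presumably why the paper benchmarks against the family of Boolean-data distances.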
Co-training is a semi-supervised learning paradigm that trains multiple classifiers and lets them label unlabelled instances for each other during the learning process. One challenge of co-training-style algorithms is training an initial weakly useful predictor when the number of labeled instances is very limited. In this paper, we use a Teaching-to-Learn and Learning-to-Teach strategy, in which each...
This paper focuses on optimizing advertising and other additional costs for small-business e-commerce web sites. The aim was to propose a dynamic-neural-network-based algorithm to predict the number of clicks on a particular advertising link across three web pages of three different small companies operating in the same business segment. The dynamic-neural-network-based algorithm was...
This paper presents TweeVist, a geo-tweet visualization system that helps users grasp how events unfold over time and space from tweets while they browse web pages, based on spatio-temporal analysis. TweeVist presents tag clouds of tweets from different time periods that are associated with web pages based on detected events. To detect events, the system extracts normal events (e.g., crowded restaurants,...
Web spam pollutes search engine results and decreases the usefulness of search engines. Web spam can be classified according to the methods used to raise a web page's ranking by subverting the algorithms web search engines use to rank results. The main types are content spam, link spam, and cloaking spam. There has been little or no work on automatically classifying web spam...
The Internet and search engines are increasingly prominent in modern life. Search engines like Google, Bing, and Yahoo are perhaps the largest sources of information that anyone can access at any time. People have different interests when using the Internet. Advanced users may be interested in automatically extracting information from pages for later processing and web mining,...
In this paper, we examine an algorithm for updating an n-gram word dictionary (thesaurus) and evaluate its effectiveness on a binary classification problem. The thesaurus is used as a reference to generate the numerical feature attributes of web pages. Generally, the n-gram word dictionary is built once from a set of training data and its content is never updated. Hence the content is static and its coverage...
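A minimal sketch of the setup described above: a word-n-gram dictionary built once from training pages, binary feature vectors generated against it, and an update step that adds newly seen n-grams so coverage grows. This is our reading of the idea under stated assumptions (word bigrams, binary features), not the paper's exact algorithm, and the example pages are invented:

```python
# Sketch: n-gram word dictionary as a feature reference, with an update
# step. Assumptions: word bigrams, binary presence features; the sample
# pages are hypothetical.
def word_ngrams(text, n=2):
    """Set of word n-grams of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_dictionary(pages, n=2):
    """Build the dictionary once from training pages (the static baseline)."""
    vocab = set()
    for page in pages:
        vocab |= word_ngrams(page, n)
    return vocab

def update_dictionary(vocab, page, n=2):
    """Add unseen n-grams from a new page; returns how many were added."""
    new = word_ngrams(page, n) - vocab
    vocab |= new
    return len(new)

def features(page, vocab, n=2):
    """Binary feature vector over a sorted view of the dictionary."""
    grams = word_ngrams(page, n)
    return [1 if g in grams else 0 for g in sorted(vocab)]

vocab = build_dictionary(["buy cheap pills now", "read latest news today"])
added = update_dictionary(vocab, "buy cheap watches now")
print(added, len(vocab))  # → 2 8
```

The contrast the abstract draws is exactly between the static `build_dictionary` baseline and dictionaries that keep growing via `update_dictionary`.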
Support Vector Machine (SVM) is a powerful classifier widely used in textual and web classification. It finds a hyperplane that separates positive and negative data while maximizing the margin. SVM is a kernel-based classifier, and the choice of kernel is critical. In this paper we propose an implicit-links-based Gaussian kernel that uses an implicit-links-based distance. This kernel helps...
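Building a Gaussian kernel from a distance function, as the abstract proposes, follows the standard construction K(x, y) = exp(-d(x, y)² / (2σ²)). A minimal sketch of that construction, using a plain Euclidean distance as a stand-in since the paper's implicit-links-based distance is not fully specified here:

```python
# Sketch: turning an arbitrary distance into a Gaussian (RBF-style) kernel.
# Assumption: Euclidean distance stands in for the paper's
# implicit-links-based distance, which is not detailed in this abstract.
import math

def gaussian_kernel(d, sigma=1.0):
    """Return k(x, y) = exp(-d(x, y)^2 / (2 * sigma^2))."""
    def k(x, y):
        return math.exp(-d(x, y) ** 2 / (2 * sigma ** 2))
    return k

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

k = gaussian_kernel(euclidean, sigma=1.0)
print(k((0.0, 0.0), (0.0, 0.0)))  # → 1.0 (identical points, maximal similarity)
```

Note that for an arbitrary distance this construction is not guaranteed to yield a positive semi-definite kernel, which is one reason the abstract stresses that the kernel choice is critical.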
Classification and extraction of web pages find applications in the semantic web, search, and information extraction. The first part of the paper addresses the problem of classifying web pages according to their content. The paper then presents a methodology to classify web pages hierarchically, to achieve topic-wise modeling of websites, using a multi-label tree classifier, a variant of classification where...
The explosive growth in the number of web pages has brought up some problems in the search process. One of these problems is that general-purpose search engines often return too many irrelevant results when users search for specific information on a given topic. Another is the massive increase in the number of pages to be indexed by Web search systems. In this research, two steps...