Online advertisements (ads) have taken over the web; nowadays most websites contain some form of ads. While ads produce revenue for site operators and businesses, they have become more intrusive and dangerous than ever. Ads consume extra bandwidth, show inappropriate content, and spread malware such as adware and ransomware. Although there are many products to block ads, also known as ad blockers,...
Web crawlers have been misused for malicious purposes such as downloading server data without permission from the website administrator. In this paper, based on the observation that normal users and malicious crawlers exhibit different short-term and long-term download behaviors, we develop a new anti-crawler mechanism called PathMarker to detect and constrain persistent distributed crawlers...
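The abstract above only names the behavioral idea, not the algorithm; a minimal sketch of one ingredient, short-term rate analysis over a sliding window, might look like the following (class and parameter names are hypothetical, and PathMarker itself additionally embeds markers in URL paths):

```python
from collections import deque

class RateBasedDetector:
    """Toy illustration of short-term behavior analysis: flag clients
    whose request rate within a sliding window exceeds a threshold a
    human browser would plausibly stay under."""

    def __init__(self, window_seconds=10.0, max_requests=20):
        self.window = window_seconds
        self.max_requests = max_requests
        self.history = {}  # client_id -> deque of request timestamps

    def observe(self, client_id, timestamp):
        q = self.history.setdefault(client_id, deque())
        q.append(timestamp)
        # Drop requests that have fallen outside the sliding window.
        while q and timestamp - q[0] > self.window:
            q.popleft()
        return len(q) > self.max_requests  # True -> likely a crawler
```

A long-term counterpart would aggregate the same statistics over hours or days per account, which is where persistent distributed crawlers become distinguishable from bursts of normal use.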
With the rise of Internet technology and the development of mobile applications, more and more data and information surround us. However, it is not always easy to find the information people need. Therefore, a good recommendation system is required to deliver useful or interesting information. To provide useful information to users, a good classification of data is needed for recommendation...
In order to manage and organize information on the web, we propose a novel web page classification strategy integrating a topic model and SVM. We use the topic model to harness the implicit information on web pages for feature extraction. The accuracy of the strategy is 84.15%, which is 2.23% higher than the traditional classification strategy based on CHI.
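The CHI baseline mentioned above commonly refers to chi-square feature selection, which scores how strongly a term's presence depends on a class. Assuming the standard 2x2 contingency-table form (the abstract does not spell out the variant used), a minimal sketch:

```python
def chi_square(a, b, c, d):
    """CHI score for a term/class pair from a 2x2 contingency table:
    a = docs in class containing the term, b = other docs containing it,
    c = docs in class without the term,  d = other docs without it.
    Higher scores mean the term is more informative for the class."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0
```

A term occurring independently of the class scores 0, while a term appearing only inside the class scores highest; feature selection keeps the top-scoring terms per class.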
Thanks to the proliferation of the Internet, a great deal of data is produced by both websites and personal users. Documents must be classified by their content so that the necessary information can be reached quickly and correctly from the produced data. One of the biggest difficulties in document classification systems is detecting the attributes that best represent the classes. In this research,...
Web spam pollutes search engine results and decreases the usefulness of search engines. Web spam can be classified according to the methods used to raise a web page's ranking by subverting the algorithms search engines use to rank results. The main types are content spam, link spam, and cloaking spam. There has been little or no work on automatically classifying web spam...
Support Vector Machine (SVM) is a powerful classifier used widely in textual and web classification. It tries to find a hyperplane that separates positive and negative data while maximizing the margin. SVM is a kernel-based classifier, and the choice of kernel is critical. In this paper we propose an implicit-links-based Gaussian kernel that uses an implicit-links-based distance. This kernel helps...
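The construction the abstract describes, a Gaussian kernel parameterized by a custom distance, has the general form k(x, y) = exp(-d(x, y)^2 / (2*sigma^2)). The paper's implicit-links-based distance is not given in the abstract, so the sketch below plugs in an ordinary Euclidean distance purely as a stand-in:

```python
import math

def gaussian_kernel(dist, sigma=1.0):
    """Build a Gaussian (RBF) kernel from an arbitrary distance function:
    k(x, y) = exp(-dist(x, y)**2 / (2 * sigma**2))."""
    def k(x, y):
        return math.exp(-dist(x, y) ** 2 / (2.0 * sigma ** 2))
    return k

# Stand-in distance; the paper's distance is derived from implicit
# links between pages, which the abstract does not define.
def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

k = gaussian_kernel(euclidean, sigma=1.0)
```

Any symmetric distance can be swapped in; whether the resulting kernel is positive semi-definite (as SVM training assumes) depends on the distance chosen, which is one reason the kernel choice is described as critical.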
With the rapid development of the Internet, people's demand for Internet retrieval is gradually increasing. A meta-search engine differs from a general search engine: it combines the results of multiple search engines and returns them to the user. To meet the needs of different users, however, we need to classify the results returned by the meta-search engine. Therefore, this article discusses...
Classification and extraction of web content find applications in the semantic web, searching, and information extraction. The first part of the paper deals with the problem of classifying web pages according to their content. It then presents a methodology to classify web pages hierarchically, in order to achieve topic-wise modeling of websites, using a multi-label tree classifier, a variant of classification where...
Traditionally in Web crawling, the required features are extracted from the whole content of HTML pages. However, the position of a word within the HTML tags indicates its importance in the web page. This research proposes two ideas concerning the feature selection stage for HTML web pages. The first idea reduces the features by simply extracting them from the important tags in an HTML...
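A small sketch of the first idea, restricting term extraction to "important" tags and weighting terms by the tag they appear in; the specific tag set and weights below are hypothetical, not taken from the paper:

```python
from html.parser import HTMLParser

# Hypothetical weights: terms inside prominent tags count more.
TAG_WEIGHTS = {"title": 3, "h1": 2, "b": 2}

class TagTermExtractor(HTMLParser):
    """Collect term counts only from 'important' HTML tags, weighting
    each occurrence by its enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open important tags
        self.counts = {}  # term -> weighted count

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if not self.stack:
            return  # text outside important tags is ignored entirely
        weight = TAG_WEIGHTS[self.stack[-1]]
        for word in data.lower().split():
            self.counts[word] = self.counts.get(word, 0) + weight
```

Because body text outside the chosen tags is skipped, the feature space shrinks while the surviving terms carry positional importance.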
The explosive growth of the number of webpages on the Web has brought up some problems in the search process. One of these problems is that general-purpose search engines often return too many irrelevant results when users are searching for specific information on a given topic. Another problem is the massive increase in the number of pages to be indexed by Web search systems. In this research, two steps...
Classification of web content is an interesting and widely pursued field of research in machine learning. Web classification can be done in various ways based upon the criteria chosen. Subjective classification involves classifying web pages based upon the subject to which they belong (say history, economics, politics, etc.). Another way of classifying web pages could be based upon...
The motivation behind this work is that predicting a web user's browsing behavior while surfing the Internet reduces the user's browsing access time, avoids visits to unnecessary pages, and eases network traffic. This research work introduces parallel Support Vector Machines for web page prediction. The web contains an enormous amount of data, and web data increases exponentially, but the training...
A huge amount of user request data is generated in web logs. Predicting users' future requests based on previously visited pages is important for web page recommendation, latency reduction, online advertising, etc. These applications must trade off prediction accuracy against modelling complexity. We propose a Web Navigation Prediction Framework for Webpage Recommendation (WNPWR) which creates and generates...
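To make the accuracy/complexity trade-off concrete, the simplest common baseline for next-request prediction from web-log sessions is a first-order Markov model; the sketch below is that baseline, not the WNPWR framework itself:

```python
from collections import Counter, defaultdict

class FirstOrderPredictor:
    """First-order Markov next-page predictor trained on web-log
    sessions: predict the page that most often followed the current
    page in the training data."""

    def __init__(self):
        self.next_counts = defaultdict(Counter)

    def train(self, sessions):
        for session in sessions:
            for cur, nxt in zip(session, session[1:]):
                self.next_counts[cur][nxt] += 1

    def predict(self, page):
        counts = self.next_counts.get(page)
        if not counts:
            return None  # page never seen as a predecessor
        return counts.most_common(1)[0][0]
```

Higher-order models condition on longer page histories and usually predict better, at the cost of state space and training data, which is exactly the compromise the abstract refers to.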
Currently, there are many e-commerce websites on the Internet. These e-commerce websites can be categorized into many types, one of which is C2C (Customer to Customer) websites such as eBay and Amazon. The main objective of a C2C website is an online marketplace where everyone can buy or sell anything at any time. Since there are a lot of products on e-commerce websites and each...
This paper proposes an event data extraction method that extracts business event data, such as coupons, tickets, and sales campaigns, from the homepages or blogs of shops and pushes them to users. Users no longer need to browse their favorite shops' homepages one by one. The method supports comprehensive and effective event data collection. The proposed method consists of two tasks: web page...
These days, the Internet is developing at an exponential rate and can cover just about any data required. Nonetheless, the immense number of web pages makes it more difficult for a user to effectively discover the target data. Therefore, an efficient method for classifying this huge amount of data is essential if web pages are to be exploited to their full potential. In the domain of automatic...
Web spam is one of the recent problems of search engines because it severely reduces the quality of search results. Web spam has an economic impact because spammers obtain a large amount of free advertising for their data or sites on search engines, and thereby an increase in web traffic. In this paper we have implemented a spam detection system based on an SVM classifier that combines new link features with content...
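The abstract is cut off before it names the features, so the sketch below only illustrates the general pattern of combining content-based and link-based features into one vector for an SVM; every feature and the spam-word list are illustrative assumptions, not the paper's features:

```python
import re

# Illustrative spam-word list, not from the paper.
SPAM_WORDS = {"free", "win", "cheap", "bonus"}

def content_features(text):
    """Two toy content features: word count and spam-word ratio."""
    words = re.findall(r"[a-z]+", text.lower())
    n = len(words) or 1
    return [len(words), sum(w in SPAM_WORDS for w in words) / n]

def link_features(out_links, in_links):
    """Toy link features: out-degree, in-degree, and their ratio."""
    return [out_links, in_links, out_links / (in_links + 1)]

def feature_vector(text, out_links, in_links):
    """Concatenate content and link features into one SVM input row."""
    return content_features(text) + link_features(out_links, in_links)
```

The resulting rows would then be fed to any SVM implementation; concatenation is the simplest way to let one classifier weigh both evidence types jointly.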
The information on the World Wide Web is dynamic and growing rapidly. Existing topic-based search engines are not adequate to retrieve the information users require, so there is a need to develop genre-based search engines. First, web genres have to be identified in order to develop genre-based search engines. Presently, there exist a few genre corpora which include web genres like articles, online...
Domain-specific search focuses on one area of knowledge. Applying broad-based ranking algorithms to vertical search domains is not desirable, since the broad-based ranking model builds upon data from multiple domains existing on the web. Vertical search engines attempt to use a focused crawler that indexes only web pages relevant to a predefined topic. With a Ranking Adaptation Model, one can adapt an...