Knowledge discovery from the Web is a cyclic process. In this paper we focus on one important part of it: transforming unstructured information from Web pages into structured relations. Relation extraction systems capture information from natural language text on Web pages, called Web text. However, extraction is costly and time-consuming. Worse, many Web pages may not contain a textual representation...
In this paper, we propose an algorithm for identifying malicious Web pages, intended for crawlers that collect Web pages for a later, content-based malicious-page detection task. Some organizations have to crawl Web pages automatically for later checking by humans. However, since manually checking Web pages is an expensive task, the total cost would be enormous...
With the rapid development of the Internet, popular entities have more and more instances on the Web. It is observed that, on the one hand, different instances of the same Web entity often contain different attributes, and different instances often use different labels for the same attribute; on the other hand, new Web entity instances which contain new attributes and labels are appearing...
Along with the rapid development of information retrieval and web technology, web entity retrieval has become a popular new way of getting specific information, such as looking for a book or a movie. As in document retrieval, a query generally returns too many results, so ranking is still a necessary step in the entity retrieval process. This paper will focus on the ranking...
With the widespread use of Internet applications, more and more enterprises build their own Web sites and provide business information through Web pages. Web page classification can be used to assign enterprise Web pages to one or more predefined business categories. For the purpose of Internet-based enterprise administration in E-government systems, algorithms and applications related to web page classification...
In this paper we propose a new multi-view semi-supervised learning algorithm called Local Co-Training (LCT). The proposed algorithm employs a set of local models with vector outputs to model the relations among examples in a local region on each view, and iteratively refines the dominant local models (i.e. the local models related to the unlabeled examples chosen for enriching the training set) using...
The extraordinary growth in both the size and popularity of the World Wide Web has created a growing interest not only in identifying Web page genres, but also in using these genres to classify Web pages. The hypothesis of this research is that an n-gram representation of a Web page can be used effectively to automatically classify that Web page by genre, even when the Web page belongs to more than...
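The n-gram representation mentioned in this abstract can be illustrated with a minimal sketch. It assumes character n-grams computed over the visible text of a page; the abstract does not say whether characters, bytes, or words are used, and the function names `char_ngrams` and `top_profile` are hypothetical, not taken from the paper.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in text.

    Assumption: text is lower-cased and runs of whitespace are collapsed
    before counting; the paper may normalize differently.
    """
    normalized = " ".join(text.lower().split())
    return Counter(
        normalized[i:i + n] for i in range(len(normalized) - n + 1)
    )


def top_profile(text, n=3, size=50):
    """Return the `size` most frequent n-grams as a simple page profile."""
    counts = char_ngrams(text, n)
    return [gram for gram, _ in counts.most_common(size)]


if __name__ == "__main__":
    page_text = "Frequently asked questions about frequent flyers"
    print(top_profile(page_text, n=3, size=5))
```

A profile of this kind could then be compared against per-genre profiles or fed to a classifier; that step is left out here because the abstract is truncated before it describes one.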
It is well known that Web users create links with different intentions. However, a key question that has not been well studied is how to categorize links and how to quantify the strength of the influence of one Web page on another when there is a link between the two. In this paper, we focus on the problem of link semantics analysis, and propose a novel supervised learning approach to...
Web page classification is the automated assignment of predefined subject categories to documents. Automatic Web page classification is one of the most essential techniques for Web mining, given that the Web is a huge repository of various kinds of information, including images, videos, etc. There is also a need to categorize Web pages to satisfy user needs. The classification of Web pages into each category...
The explosive growth of the Web makes it hard to organize and manage Web information automatically. Therefore, online learning methods such as incremental learning are gradually becoming effective instruments in practical applications. In our experiments, traditional incremental learning shows some flaws in the iterative process. To overcome the drawback caused by using only support vectors to represent the whole former...
The increasing number of Web pages in the cyber world reduces the effectiveness of document retrieval in matching users' needs. Web page classification is one solution to this problem. This paper proposes the VAMSVM_WPC model, a novel voting algorithm for classifying Web pages that uses a multi-class SVM method. First, features are generated from text and...
This paper presents a new algorithm for Web page classification, CUCS (Combined UC and SVM), for large training sets. CUCS combines the advantages of SVM (Support Vector Machine) and UC (Unsupervised Clustering), achieving high precision and fast speed. In the training stage, CUCS obtains clustering centers, which include positive example centers and negative ones, by means of UC. Then CUCS prunes training...
We describe a method to retrieve images found on Web pages with specified object class labels, using an analysis of the text around the image and of image appearance. Our method determines whether an object is both described in text and appears in an image using a discriminative image model and a generative text model. Our models are learnt by exploiting established online knowledge resources (Wikipedia...