Being able to identify locations associated with a Web resource is essential for providing location-based Web applications. However, geographical information in Web documents is rarely supplied in a machine-readable way and is therefore not easily discoverable. As a consequence, it is necessary to extract geographical keywords from Web documents and to associate locations with them. This method is called...
New e-services come online each year at an exponential rate. Most of them need to analyze and interpret enormous quantities of data, yet many do not take the emotions and sentiments expressed in Web pages into account in their analysis. Thus, in this work, we propose a novel system to obtain data of interest from a Web search engine by analyzing the emotional and sentimental content...
Software requirements documents (SRDs) are often authored in general-purpose rich-text editors, such as MS Word. SRDs contain instances of logical structures, such as use cases, business rules, and functional requirements. Automated recognition and extraction of these instances enables advanced requirements management features, such as automated traceability, template conformance checking, guided editing,...
A multi-agent Web mining model is designed to improve the efficiency of keyword-based search engines. The model divides the mining task among several parallel agents that work in coordination, greatly improving mining efficiency. Evolving from HITS, an algorithm named Grabber removes link-farm pages during expansion of the root set and makes anchor text similarity...
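For context, the classic HITS hub/authority iteration that Grabber evolves from can be sketched in a few lines. This is a sketch of plain HITS only; Grabber's link-farm removal and anchor-text scoring are not detailed in the snippet above, and the toy link graph is an illustrative assumption:

```python
# Minimal HITS iteration over a small link graph (illustrative only;
# the Grabber variant described above is not reproduced here).

def hits(links, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking to it.
        auth = {p: sum(hub[q] for q in links if p in links.get(q, []))
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return auth, hub

links = {"a": ["b", "c"], "b": ["c"], "c": []}
auth, hub = hits(links)
# "c" is linked to by both "a" and "b", so it gets the top authority score,
# while "a" links to the most authorities and gets the top hub score.
```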
In this work, we developed a self-organizing map (SOM) technique that uses web-based text analysis to forecast when a group is undergoing a phase change. By "phase change", we mean that an organization has fundamentally shifted its attitudes or behaviors. For instance, when ice melts into water, the characteristics of the substance change. A formerly peaceful group may suddenly adopt violence, or a violent...
The large number of software source code projects available on the Internet or within companies is creating new information retrieval challenges. Present-day source code search engines, such as Google Code Search, tend to treat source code as pure text, as they do web pages. However, source code files differ from web pages and pure text files in that each file may contain certain blocks expressing...
Web text mining is a growing research area within data mining. Interestingly, existing Web text mining algorithms have concentrated on finding frequent patterns while discarding the less frequent ones, which may contain outliers. In addition, the domain knowledge of one industry is partly different from that of others; yet whatever domain they belong to, web texts are analyzed using the same dictionary. This...
Communication through the Web is becoming increasingly popular thanks to wireless and cellular networks. As this awareness spreads far and wide across different countries, significant complexities arise in terms of language and communication means for extracting information on the Web. This is particularly true in India, where texts appear in more than fifteen officially recognized languages and many more variations in local...
Web page classification plays an essential role in facilitating more efficient information retrieval and information processing. Conventionally, web text documents are represented by a term frequency matrix for classification purposes. However, considering the limitations of representing documents using terms or keywords, we propose to represent web pages using information extraction patterns that are...
Several approaches to educational web-based content enrichment have been devised. Annotations, in the form of comments and other types of remarks, naturally support these approaches. Annotations enrich educational materials mainly by retaining key information or comments; they can also support visual search and collaboration. In this paper we present a method for the acquisition of new educational...
Language Model (LM) constitutes one of the key components in Keyword Spotting (KWS). The rapid development of the World Wide Web (WWW) makes it an extremely large and valuable data source for LM training, but it is not optimal to use the raw transcripts from WWW due to the mismatch of content between the web corpus and the test data. This paper proposes a novel two-step data selection method based...
In this paper, we propose high-speed, accurate algorithms for detecting hazardous Web pages. Our algorithms automatically choose strings that appear predominantly in the HTML elements of hazardous Web pages. We use these strings in combination as features for SVMs (support vector machines) to detect hazardous Web pages. Since our algorithms do not rely on the text parts of Web pages, they can detect Web...
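The feature construction step described above, binary indicators for characteristic strings found in a page's raw HTML, can be sketched as follows. The candidate strings here are made up for illustration (they are not the strings the paper's algorithms would select), and the SVM training step is omitted:

```python
# Binary feature vectors from the presence of characteristic strings in raw
# HTML. The candidate strings below are illustrative assumptions, not the
# automatically chosen strings from the abstract above.
CANDIDATE_STRINGS = ["onmouseover=", 'http-equiv="refresh"', "<iframe"]

def html_features(html):
    """Return one 0/1 feature per candidate string (case-insensitive)."""
    lowered = html.lower()
    return [1 if s in lowered else 0 for s in CANDIDATE_STRINGS]

page = '<html><body><iframe src="x"></iframe></body></html>'
print(html_features(page))  # -> [0, 0, 1]
```

Vectors of this form would then be fed to an off-the-shelf SVM trainer; the point of the sketch is only that the features come from markup, not from the visible text.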
Conference web pages display their topic information in different ways, and conferences in different domains accept papers on different topics. Automatic extraction of topic information from conference web pages is thus a difficult task and has not received much attention from the research community. In this paper, we propose a method for extracting topic information that uses a web page segmentation...
This paper addresses issues in generating responses by extracting sentences from the Web for spoken decision-making dialogue systems. Various decision criteria are usually involved when selecting an alternative from a given set of alternatives. Such a dialogue system is required to explain the alternatives in terms of each decision criterion, focusing on why an alternative is recommended. Preparation...
A focused crawler traverses the web, selecting relevant pages according to a predefined topic. While browsing the internet, it is difficult to identify relevant pages and to predict which links lead to high-quality pages. In this paper, we propose a crawler system that uses a genetic algorithm to improve its crawling performance. Apart from estimating the best path to follow, our system also expands its...
Parallel corpora are a valuable resource for several important applications of natural language processing, such as statistical machine translation, dictionary construction, and cross-language information retrieval. The Web is a huge resource of knowledge, which partly contains bilingual information in various kinds of web pages. It currently attracts many studies on building parallel corpora based on the...
In this paper, we propose a framework to answer questions of the opinion type. The data source is the set of web pages returned by a search engine. Using a Bayes classifier, the main texts on the pages are classified at the sentence level into three categories: positive review, negative review, and neutral review. The K-means method is then used to cluster the positive-review and negative-review sentences respectively...
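The sentence-level Bayes classification step can be sketched with a tiny Naive Bayes polarity classifier. The training sentences and word features below are illustrative assumptions, not data from the paper, and the neutral category and K-means clustering step are omitted:

```python
import math
from collections import Counter, defaultdict

# Toy training data: (sentence, label) pairs, made up for illustration.
train = [
    ("great battery and great screen", "positive"),
    ("love the design", "positive"),
    ("terrible battery life", "negative"),
    ("screen is awful", "negative"),
]

word_counts = defaultdict(Counter)   # per-class word frequencies
class_counts = Counter()             # per-class sentence counts
vocab = set()
for text, label in train:
    class_counts[label] += 1
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def classify(text):
    """Pick the class with the highest log posterior under a unigram
    Naive Bayes model with add-one (Laplace) smoothing."""
    scores = {}
    total_sents = sum(class_counts.values())
    for label in class_counts:
        score = math.log(class_counts[label] / total_sents)  # log prior
        total_words = sum(word_counts[label].values())
        for w in text.split():
            score += math.log(
                (word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(classify("great screen"))  # -> positive
```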
A representation of the World Wide Web as a directed graph, with vertices representing web pages and edges representing hypertext links, underpins the algorithms used by web search engines today. However, this representation involves a key oversimplification of the true complexity of the Web: an edge in the traditional Web graph represents only the existence of a hyperlink; information on the context...
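The contrast the abstract draws can be stated concretely: the traditional Web graph keeps only link existence, while a context-aware variant attaches information such as anchor text to each edge. Page names and anchor texts below are made-up examples:

```python
# Traditional Web graph: vertices are pages, and an edge records only that
# a hyperlink exists from one page to another.
plain_graph = {
    "pageA": ["pageB", "pageC"],
    "pageB": ["pageC"],
}

# A richer representation keys each edge by (source, target) and attaches
# context -- here the anchor text of the link, which the plain graph discards.
labeled_graph = {
    ("pageA", "pageB"): {"anchor_text": "contact us"},
    ("pageA", "pageC"): {"anchor_text": "our products"},
    ("pageB", "pageC"): {"anchor_text": "products"},
}

# Both structures describe the same link topology; only the edge
# annotations differ.
edges_from_labels = {}
for (src, dst) in labeled_graph:
    edges_from_labels.setdefault(src, []).append(dst)
```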
In web pages, reviews are written in natural language and follow an unstructured free-text scheme. Online product reviews are considered a significant informative resource that is useful for both potential customers and product manufacturers. Manually scanning through large numbers of reviews one by one is a computational burden and is not practical for businesses...
As the number of pages on the web is constantly increasing, there is a need to classify pages into categories to facilitate indexing and searching. In the method proposed here, we use both textual and visual information to find a suitable representation of web page content. In this paper, several term weights based on TF or TF-IDF weighting are proposed. The modification is based on visual areas,...
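The conventional TF-IDF baseline that the visual-area modification builds on can be sketched briefly; the visual weighting itself is not specified in the truncated snippet, and the toy document collection is an illustrative assumption:

```python
import math

# Toy document collection, made up for illustration.
docs = [
    "web page classification with terms",
    "web search and web mining",
    "image features for page layout",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf_idf(term, doc_tokens):
    """Plain TF-IDF: term frequency in the document times the log of the
    inverse document frequency (assumes the term occurs in the collection)."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for toks in tokenized if term in toks)
    idf = math.log(n_docs / df)
    return tf * idf

# "web" occurs in two of the three documents, so its IDF is low; "image"
# occurs in only one, so it is weighted more strongly where it appears.
```

A visual-area modification of the kind the abstract mentions would scale such weights by where a term appears on the rendered page, e.g. boosting terms in headings or prominent regions.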