Arabic web content is growing rapidly, and the need for its efficient management is gaining importance; the morphological complexity of Arabic raises many challenges in this regard. This paper reports on some of our work aimed at designing text mining and query pre-processing tools that can efficiently process and search large quantities of Arabic web data. In our research we try to...
In this study, we propose a method that uses a Web search engine to extract inaccurate example sentences from multilingual parallel texts. We developed a multilingual parallel-text sharing system named Tack Pad for multilingual communication in the medical field. However, parallel texts created by people can be inaccurate. Hence, we cannot use these parallel texts in...
Information extraction (IE) from corpora is the analysis of texts in order to extract structured information such as named entities (NEs), which may be names of persons, organizations, addresses, dates, locations, etc. ... GATE is a software toolkit, written in Java and developed since 1995, that is widely used worldwide by many communities (scientists, companies, teachers, students) for natural language processing. We have experimented...
Distributional semantics is the branch of natural language processing that attempts to model the meanings of words, phrases and documents from the distribution and usage of words in a corpus of text. In the past three years, research in this area has been accelerated by the availability of the Semantic Vectors package, a stable, fast, scalable, and free software package for creating and exploring...
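The core idea the abstract describes — modeling word meaning from distribution and usage in a corpus — can be sketched minimally as follows. This is a generic illustration of distributional semantics with sparse co-occurrence vectors and cosine similarity, not the Semantic Vectors package itself; the function names and the toy sentences are assumptions for illustration only.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Build a sparse context-count vector for each word, using a
    symmetric sliding window over whitespace-tokenized sentences."""
    vecs = defaultdict(Counter)
    for sent in sentences:
        toks = sent.lower().split()
        for i, w in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if j != i:
                    vecs[w][toks[j]] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    num = sum(u[w] * v[w] for w in set(u) & set(v))
    den = (math.sqrt(sum(c * c for c in u.values()))
           * math.sqrt(sum(c * c for c in v.values())))
    return num / den if den else 0.0
```

Words that occur in similar contexts ("cat" and "dog" in parallel sentences, say) end up with similar vectors and hence high cosine similarity, which is the distributional hypothesis in its simplest form.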
Association mining is widely used in pattern discovery. For large-scale financial textual data analysis, however, association mining is relatively rarely applied, due to the low efficiency of text manipulation. This paper presents a fast financial text mining system, based on a search engine and a concept graph, for large-scale financial textual association mining and visualization. Through the experiments...
We have developed a tool designed so that a user can construct a word network while evaluating which words to set as nodes. A co-occurrence network of words is a complex network with a huge number of nodes and links, and cannot be interpreted by a human as it is. Therefore, we have designed an interface that displays the network as its confines gradually expand in...
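The two ingredients of such a tool — building a weighted word co-occurrence network, and showing only a gradually expanding neighborhood around a chosen node — can be sketched generically as below. This is not the interface described in the abstract; the sentence-level co-occurrence definition and all names are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def build_network(sentences, min_weight=1):
    """Edge weights count how many sentences contain both words."""
    edges = Counter()
    for sent in sentences:
        toks = set(sent.lower().split())
        for a, b in combinations(sorted(toks), 2):
            edges[(a, b)] += 1
    return {e: w for e, w in edges.items() if w >= min_weight}

def neighborhood(edges, seed, hops=1):
    """Nodes within `hops` links of seed: the gradually expanding
    view that keeps a huge network interpretable."""
    frontier, seen = {seed}, {seed}
    for _ in range(hops):
        nxt = set()
        for a, b in edges:
            if a in frontier and b not in seen:
                nxt.add(b)
            if b in frontier and a not in seen:
                nxt.add(a)
        seen |= nxt
        frontier = nxt
    return seen
```

Displaying `neighborhood(edges, seed, hops)` for increasing `hops` mimics the incremental expansion strategy, letting the user judge each newly revealed word before widening the view further.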
In the blogosphere, the amount of digital content is expanding, imposing new challenges on search engines. Because information needs are changing, automatic methods are needed to help blog search users filter information by different facets. In our work, we aim to support blog search with genre and facet information. Since we focus on the news genre, our approach is to classify blogs...
In this paper, we propose a framework for answering opinion-type questions. The data source is the set of web pages returned by a search engine. Using a Bayes classifier, the main texts on the pages are classified into three categories at the sentence level: positive review, negative review and neutral review. The K-means method is then used to cluster the sentences of the positive and negative reviews respectively...
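The sentence-level classification step can be sketched with a minimal multinomial naive Bayes classifier, as below. This is a generic illustration under assumed inputs (the tiny training pairs and labels are invented for the example), not the paper's trained model or its feature set.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled):
    """labeled: list of (token_list, label) pairs.
    Returns log-priors and Laplace-smoothed log-likelihoods."""
    label_counts = Counter(lbl for _, lbl in labeled)
    word_counts = defaultdict(Counter)
    vocab = set()
    for toks, lbl in labeled:
        word_counts[lbl].update(toks)
        vocab.update(toks)
    priors = {l: math.log(c / len(labeled)) for l, c in label_counts.items()}
    v = len(vocab)
    like = {}
    for l in label_counts:
        total = sum(word_counts[l].values())
        like[l] = {w: math.log((word_counts[l][w] + 1) / (total + v))
                   for w in vocab}
        like[l]["__unk__"] = math.log(1 / (total + v))  # unseen words
    return priors, like

def classify(tokens, priors, like):
    """Pick the label maximizing log-prior plus summed log-likelihoods."""
    scores = {l: priors[l] + sum(like[l].get(t, like[l]["__unk__"])
                                 for t in tokens)
              for l in priors}
    return max(scores, key=scores.get)
```

In the framework the abstract describes, the positive and negative sentence sets produced by such a classifier would then be clustered (e.g. by K-means over bag-of-words vectors) to group redundant opinions.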
A representation of the World Wide Web as a directed graph, with vertices representing web pages and edges representing hypertext links, underpins the algorithms used by web search engines today. However, this representation involves a key oversimplification of the true complexity of the Web: an edge in the traditional Web graph represents only the existence of a hyperlink; information on the context...
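A canonical example of an algorithm built on exactly this vertices-and-edges view of the Web is PageRank, sketched minimally below. This is the standard textbook formulation over a toy adjacency list, offered only to make the graph representation concrete; it is not the extension the abstract goes on to propose.

```python
def pagerank(links, d=0.85, iters=50):
    """links: dict mapping each page to its list of outgoing links
    (the directed Web graph). Returns a rank score per page."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}
        for u, outs in links.items():
            if outs:
                share = d * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:  # dangling page: spread its rank evenly
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank
```

Note that each edge contributes only its existence to the computation, which is precisely the oversimplification the abstract criticizes: nothing about the hyperlink's context or anchor text survives in this representation.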
This paper presents a light-weight information retrieval and analysis architecture that addresses the complex task of gathering, combining, and storing documents to enable in-depth analysis. The growing interest in mining the Internet for conversation topics, opinions, and influencers has resulted in many free and commercial products. At the heart of such capability are two core technologies: information...
This paper presents the concept of surface text patterns for extracting purpose data from the web. In order to obtain an optimal set of patterns, we have developed a method for learning purpose patterns automatically. A corpus was downloaded from the Internet using bootstrapping by providing a few hand-crafted examples of each purpose pattern to a generic search engine. This corpus was then tagged...
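A surface text pattern of the kind described can be sketched as a lexical template matched directly against page text, as below. The seed patterns here are invented stand-ins for the paper's hand-crafted examples, and a real system would feed them to a search engine to bootstrap a corpus and learn further patterns; this fragment only shows the matching step.

```python
import re

# Hypothetical seed patterns for purpose expressions (illustrative only).
SEED_PATTERNS = [
    r"in order to ([^.,;]+)",
    r"for the purpose of ([^.,;]+)",
    r"so as to ([^.,;]+)",
]

def extract_purposes(text):
    """Return the purpose clauses captured by each seed pattern."""
    found = []
    for pattern in SEED_PATTERNS:
        found.extend(m.group(1).strip()
                     for m in re.finditer(pattern, text, re.IGNORECASE))
    return found
```

Bootstrapping would then take sentences matched by these seeds, generalize the surrounding context into new candidate patterns, and keep the ones that extract purpose data with sufficient precision.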
This paper details the implementation of Zycox, a document analyzer. It analyses the complete document by generating permutations and combinations of words and sentences, which are then searched on different search engines to find their relevant URLs; it thus gives a complete statistical analysis of the document, including its originality and the percentage of the document copied from the...
To improve the efficiency of the server-based Chinese input method for IPTV, the behaviors of querying for programs' text information are analyzed. We then propose integrating the Sphinx full-text search engine with the input method to mine accurate associated characters or words. To fit the main querying behavior of searching for program roles' names, the program synopses are mined...
The Web now plays an important part in people's real-life activities. Scientists not only in computer science but also in sociology and economics might be interested in mining information directly related to real-life events, or news-related information on the Web. In this paper we propose a system that enables mining of news-related articles instead of raw web pages. There are functionally two...
Free and open source software strongly promotes the reuse of source code. Some open source Java components/libraries are distributed as jar archives containing only the bytecode and some additional information. For anyone wanting to integrate such a jar into their own project, it is important to determine the license(s) of the code from which the jar archive was produced, as this affects the way that such...
Current automatic wrappers that use the DOM tree and visual properties of data records to extract the required information from search engine results pages generally have limitations, such as the inability to check the similarity of tree structures accurately. Our study of the properties of data records shows that data records located in search engine results pages not only have similar visual...
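Tree-structure similarity of the kind such wrappers need is often computed with the Simple Tree Matching algorithm, sketched below over bare (tag, children) tuples. This is the generic algorithm, not necessarily the measure this particular study uses or critiques; the tuple encoding is an assumption for the example.

```python
def tree_size(t):
    """t is a (tag, children) tuple; children is a list of such tuples."""
    return 1 + sum(tree_size(c) for c in t[1])

def simple_tree_match(a, b):
    """Count the nodes of the largest common top-down subtree of a and b,
    using dynamic programming over the ordered child sequences."""
    if a[0] != b[0]:
        return 0
    ca, cb = a[1], b[1]
    m, n = len(ca), len(cb)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                           dp[i - 1][j - 1]
                           + simple_tree_match(ca[i - 1], cb[j - 1]))
    return 1 + dp[m][n]

def structural_similarity(a, b):
    """Normalize the match count into a score in [0, 1]."""
    return 2.0 * simple_tree_match(a, b) / (tree_size(a) + tree_size(b))
```

Two data records rendered from the same result template typically share most of their DOM structure, so their normalized score approaches 1, while unrelated page regions score much lower.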
The development of the World Wide Web calls for efficient ways to exploit its information. Currently, search engines mostly return a set of related documents that contain the keywords. However, users expect an exact and concrete answer to each question. Therefore, it is necessary to build an automatic question answering (QA) system. In this paper, we focus on building a QA system for Vietnamese. This task especially...
Along with the fast development of network technology, the number of Web pages and of network search users has become enormous. To solve the problem of inefficiency and low precision in search for users with different demands and knowledge backgrounds, this paper presents a new text model called the vocabulary semantic net, which can be applied to build a personalized search engine and is tested with...
Today the Internet is found among almost all ethnic groups and cultures, and Web pages are developing very quickly in most countries and in many languages. The size and incoherence of the information available on the Internet have made the use of search engines obvious and necessary. Since search engines pay less attention to the linguistic and content features of documents in different languages...
Knowledge about herbal medicine can be contributed by experts from several cultures. With conventional techniques, it is hard for such experts to build a self-sustaining community for exchanging their information. In this paper, the Knowledge Unifying Initiator for Herbal Information (KUIHerb) is used as a platform for building a web community for collecting intercultural...