Thai is considered a non-segmented language: words are written as strings of symbols without explicit word boundaries, and the structure of written Thai is highly ambiguous. As a result, indexing has become a main issue in Thai text retrieval. To construct an inverted index for Thai texts, an index term extraction technique is usually required to segment texts...
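The abstract above is cut off before it names a specific technique, so the following is only a minimal sketch of the general idea: segment unsegmented text into index terms, then build an inverted index from them. The greedy longest-match segmenter, the `lexicon` argument, and the Latin-letter sample strings (standing in for Thai text) are all assumptions made for illustration, not the paper's method.

```python
from collections import defaultdict


def segment(text, lexicon):
    """Greedy longest-match segmentation (an assumed, illustrative technique).

    At each position, take the longest lexicon entry that matches;
    fall back to a single character when nothing matches.
    """
    max_len = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_inverted_index(docs, lexicon):
    """Map each extracted index term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in segment(text, lexicon):
            index[term].add(doc_id)
    return index


# Hypothetical toy data: unspaced Latin strings stand in for unsegmented text.
lexicon = {"thai", "text", "retrieval", "index"}
docs = {1: "thaitextretrieval", 2: "textindex"}
index = build_inverted_index(docs, lexicon)
```

Greedy matching commits to the first long match it finds, so an ambiguous string can be split wrongly, which is one reason segmentation quality matters for retrieval.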
This paper examines the feasibility of applying statistics-oriented term extraction and evaluation methods to extract historical terms from aligned parallel corpora of Chinese historical classics and their translations. It proposes using transliterations as anchor points to establish sentence-level alignment. It also investigates an approach to extracting term translation pairs based on 4000...
This paper proposes a novel approach to entity relation extraction. Unlike previous approaches, it includes a dedicated syntactic knowledge extraction step, which automatically extracts characteristic words and patterns based on hierarchy bootstrapping machine learning. It advocates using a small amount of seed information and a large collection of easily-obtained unlabeled...
Semantic orientation analysis of a sentiment word determines its polarity and degree, including its original orientation, dynamic orientation and modified orientation. In this paper, we correct the orientation in different contexts using dependency relations and a set of rules. The results show that accuracy and recall improve substantially.
Work in opinion mining and classification often assumes the incoming documents to be opinionated. An opinion mining system produces false hits when it attempts to compute polarity values for non-subjective or factual sentences or documents. It is therefore imperative to decide whether a given document contains subjective information, as well as to identify which portions of the document are subjective...
In this paper, we construct and compare several feature extraction approaches in order to find a better solution for classification of Turkish Web documents in the marketing domain. We produce our feature extraction techniques using characteristics of the Turkish language, structures of Web documents and online content in the marketing domain. We form datasets in different feature spaces and we apply...
In this paper, we propose a generic text summarization method that generates summaries of Turkish texts by ranking sentences according to their scores calculated using their surface level features and extracting the highest ranked ones from the original documents. In order to extract sentences which form a summary with an extensive coverage of main content of the text and less redundancy, we use the...
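The extract-by-ranking idea described above can be sketched as follows. The abstract does not say which surface-level features it uses, so the two features here (average word frequency and sentence position) and all function names are assumptions chosen for illustration; the redundancy-reduction step the abstract mentions is not modeled.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentence(sentence, position, total, word_freq):
    """Score a sentence from two assumed surface features.

    - average document-level frequency of its words
    - position (earlier sentences score higher)
    """
    words = re.findall(r"\w+", sentence.lower())
    if not words:
        return 0.0
    freq_score = sum(word_freq[w] for w in words) / len(words)
    position_score = 1.0 - position / total
    return freq_score + position_score


def summarize(text, k=2):
    """Return the k highest-scoring sentences, restored to document order."""
    sentences = split_sentences(text)
    word_freq = Counter(re.findall(r"\w+", text.lower()))
    scored = [
        (score_sentence(s, i, len(sentences), word_freq), i, s)
        for i, s in enumerate(sentences)
    ]
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:k]
    return [s for _, _, s in sorted(top, key=lambda t: t[1])]
```

Calling `summarize(text, k=2)` returns two sentences from `text` in their original order. A real system for Turkish would need language-aware tokenization and additional features.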
This paper presents three different ways to describe the notion of concept. The first one uses the idea of hierarchy and employs a graph to define the connections between attributes and concepts. To enable concept generation, manipulation or measurement, a matrix model is developed. Thus, the entire space of terms could be generated by a set of (linearly) independent terms over a numerical field. The...
Stemming is one of many tools used in information retrieval to combat the vocabulary mismatch problem, in which query words do not match document words. Stemming in the Arabic language does not fit into the usual mold, because stemming in most research in other languages so far depends only on eliminating prefixes and suffixes from the word, but Arabic words contain infixes as well. In this paper...
This article aims to solve the problem of extracting cultural terms and their corresponding English translations from the heterogeneous literature on translations of ancient Chinese classics. As a text-processing tool, regular expressions can help match patterned text. This research focuses on designing target-oriented regular expressions to fit the pattern...
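As a rough illustration of matching patterned text with regular expressions, the sketch below extracts term and translation pairs. The abstract is truncated before it describes its actual patterns, so the pattern assumed here (a Chinese term followed by an English gloss in half-width or full-width parentheses) and the names `TERM_GLOSS` and `extract_pairs` are hypothetical.

```python
import re

# Assumed pattern: a run of CJK characters, optional whitespace, then an
# English gloss inside half-width () or full-width （） parentheses.
TERM_GLOSS = re.compile(
    r"([\u4e00-\u9fff]+)\s*[(（]\s*([A-Za-z][A-Za-z' -]*?)\s*[)）]"
)


def extract_pairs(text):
    """Return (Chinese term, English gloss) pairs found in the text."""
    return [(m.group(1), m.group(2)) for m in TERM_GLOSS.finditer(text)]


# Hypothetical sample in a glossary-like layout.
sample = "仁 (benevolence); 礼 (ritual propriety); 道（the Way）"
pairs = extract_pairs(sample)
```

This pattern treats the entire run of Chinese characters before the parenthesis as the term, so in running prose it would also capture any preceding characters. Real source texts would need patterns tailored to each layout.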
The task of definition extraction aims to acquire definitions of terms from texts. It is a subtask of terminology extraction, ontology construction, semantic relation learning, and question answering, among others. This paper presents a bootstrapping approach to automatically extracting definitions of domain-specific terms from unannotated Chinese free texts. Experimental results in three domains of...
A novel architecture is presented for the matching of Web-services based on the extraction of interpretation graphs from natural language text. The graphs of candidate services are compared to that of the query using a numerical node-node similarity calculation based on the structure of the graphs. The similarity score of their best alignment with the query may then be used to rank the candidates.
This paper suggests an alternative solution for the task of spoken document retrieval (SDR). The proposed system runs retrieval on multi-level transcriptions (word and phone) produced by word and phone recognizers respectively, and their outputs are combined. We propose using a latent Dirichlet allocation (LDA) model to capture the semantic information in the word transcription. The LDA model is employed...
Recently, there has been a growing need to access historical Arabic handwritten manuscripts (HAH manuscripts) that are stored in large archives; therefore, tools for automatic searching, indexing, classifying and retrieval of HAH manuscripts are required. The peculiar characteristics of Arabic handwriting add an extra challenge to developing such systems. This paper presents...
With the continuous development and growing popularity of the Internet, the amount of online information is growing explosively. How to find the information we need accurately and quickly in this mass of data has come to the fore. Against this background, Internet search engines have developed rapidly. This article describes the general principles and common technologies of search engines,...
Classical information retrieval models are based on representation of document terms without considering linguistic elements. This article presents a model based on the Discourse Nominal Structure, which lets us take linguistic characteristics of text into account. The model presented is evaluated in comparison with the vector space model. Based on observations during the experimentation we propose...
In the last few years, information extraction (IE) has become a rapidly expanding field as the number of machine-readable documents keeps growing exponentially. IE is well suited to transforming factual knowledge from publications into database entries. Many efforts have been made to automatically extract and mine scientific texts ranging from biochemical literature to reports on terrorist attacks. This study is looking...
In this paper, an algorithm to normalize noisy text, which only focuses on the Arabic language, is introduced. Although there have been many theories that discuss Arabic text processing, there has not been, so far, one theory that focuses on noisy Arabic texts. Additionally, this paper introduces a new similarity measure to stem noisy Arabic documents. The need for such a new measure stems from the...
Extracting specific information from a collection of documents is called information extraction (IE). In general, the information on the Web is well structured in HTML or XML format. IE from structured documents (in HTML or XML) basically uses learning techniques for pattern matching in the content. In this paper, we propose a novel approach for interactive information extraction...
Given a text or collection of texts involving unconstrained language, a basic task in a multitude of applications is the identification of stems and endings for each word form, which is termed morphological analysis. In this paper, the use of an ant colony optimization (ACO) metaheuristic is proposed for a linguistic task that involves the automated morphological segmentation of Ancient Greek word...