Named entities such as people, locations, and organizations play a vital role in characterizing online content. They often reflect information of interest and are frequently used in search queries. Although named entities can be detected reliably from textual content, extracting relations among them is more challenging, yet useful in various applications (e.g., news recommender systems). In this...
A simple method for extracting a semantic lexicon from the Baidu Chinese Network Encyclopedia is proposed, based on one hypothesis and three filtering rules. The acquired affective lexicon includes emotional words and their lexical semantic relations, including synonyms and antonyms. The acquisition method is a recursive algorithm that starts from seed words. The extracted affective lexicon is labeled with affective tendency...
This paper studies cross-lingual semantic similarity (CLSS) between five European languages (i.e. English, French, German, Spanish and Italian) via unsupervised word embeddings from a cross-lingual lexicon. The vocabulary in each language is projected onto a separate high-dimensional vector space, and these vector spaces are then compared using several different distance measures (i.e., correlation,...
Wikipedia is an attractive source for comparable corpora in many languages. Most researchers develop their own scripts to perform the document alignment task, which requires effort and time. In this paper, we present WikiDocsAligner, a handy off-the-shelf tool for aligning Wikipedia articles. The implementation of WikiDocsAligner does not require the researchers to import/export of interlanguage...
The Web of Data is an increasingly rich source of information, which makes it useful for Big Data analysis. However, there is no guarantee that this Web of Data will provide the consumer with truthful and valuable information. Most research has focused on Big Data's Volume, Velocity, and Variety dimensions. Unfortunately, Veracity and Value, often regarded as the fourth and fifth dimensions, have...
Given the plethora of social networking sites, it can be difficult for users to browse so many sites and discover social friends. For example, how can a new diabetes patient find users with similar symptoms on different dedicated sites and form support groups with them? Since different sites may use different vocabularies, it is challenging to match users across different...
The popular “bag-of-visual-words” approach for representing and searching visual documents consists of describing images (or video keyframes) using a set of descriptors that correspond to quantized low-level features. Most existing approaches to visual words are inspired by work in text indexing, based on the implicit assumption that visual words can be handled the same way as text words....
In recent years, many researchers have recognized Wikipedia as a huge, dynamic knowledge base. This paper provides a new approach for measuring the semantic relatedness of terms, which maps terms to relevant Wikipedia articles that serve as background information for the analysis. The proposed algorithm WLA focuses on the hyperlink structure and summary paragraph extracted from the topic pages...
In language learning scenarios, the use of glossing techniques has a positive effect on incidental vocabulary acquisition as a by-product of reading. However, preparing materials that include glosses can be a time-consuming task for the teacher. Automatic glossing tools have attracted interest as a way to reduce this effort and to provide a better experience with electronic documents. Most glossing...
This paper describes our study on developing text and speech databases for automatic speech recognition of Vietnamese using an available source of linguistic data: the Internet. First, a two-stage procedure is applied to extract a general text corpus which can be used for research on the Vietnamese language such as speech recognition, audio-visual speech recognition, and natural language processing…...
This paper presents a text simplification method that transforms complex sentences into simplified forms. Our method uses NLP techniques to simplify the text based on the context of the target audience, improving its overall understandability. We evaluate our approach in two aspects: grammatical structure and understandability. In both aspects, our approach achieved good results, showing its applicability...
In this paper we introduce Linked Data driven development, a lightweight methodology for using Linked Data throughout the software life cycle. We explain the idea of Linked Data and how it plays an important role in the semantic web. Furthermore, we describe the necessary steps and approaches when using Linked Data for improving the software development process and give a discussion on the bonuses...
The vision of creating a Linked Data Web brings together the challenge of allowing queries across highly heterogeneous and distributed datasets. In order to query Linked Data on the Web today, end-users need to be aware of which datasets potentially contain the data and also which data model describes these datasets. The process of allowing users to expressively query relationships in RDF while abstracting...
In this paper, we present HAMEX, a new public dataset that contains mathematical expressions available in their on-line handwritten form and in their audio spoken form. We have designed this dataset so that, given a mathematical expression, its handwritten signal and its audio signal can be used jointly to design multimodal recognition systems. Here, we describe the different steps that allowed us...
Social bookmarking tools are rapidly emerging on the Web, as witnessed by the overwhelming number of participants. In such spaces, users annotate resources with any keyword or tag that they find relevant, giving rise to lightweight conceptual structures also known as folksonomies. In this respect, it goes without saying that ontologies can be of benefit for enhancing information retrieval metrics...
As recent research shows, efficient navigability of tagging systems is only possible if the number of tags grows hand in hand with the number of tagged resources. However, the number of resources typically grows faster than the number of tags. In this paper we analyze how enriching user tags with tags generated from Google queries influences navigability in tagging systems. The analysis dataset...
Automatic extraction of Chinese synonyms plays an important role in information retrieval and semantic resource construction. Based on an analysis and comparison of different synonym extraction techniques, this paper proposes a multi-strategy method, combining a literal similarity algorithm, a pattern matching algorithm, and the PageRank algorithm, to extract Chinese synonyms from encyclopedia resource...
This document describes the development process of the BEST 2009 word-segmented corpus. It is the first corpus to benchmark Thai word segmentation software. The corpus is composed of four genres, namely news, novels, encyclopedia, and academic articles. It contains 509 files totaling 64.1 MB. There are 5,036,229 tokens with 83,027 unique tokens. Common tokens appearing in all...
This paper presents a method for identifying the topics of documents based on the Wikipedia category network. It improves on the method previously proposed by Schonhofen by taking into account the weights of words in hyperlink texts in Wikipedia articles. Experiments in the computing and team sport domains show that the proposed method outperforms Schonhofen's.
The state of the art in visual object retrieval from large databases allows searching millions of images at the object level. Recently, complementary works have proposed systems to crawl large object databases from community photo collections on the Internet. We combine these two lines of work into a large-scale system for auto-annotation of holiday snaps. The resulting method allows for automatic labeling...