The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
Named entities such as people, locations, and organizations play a vital role in characterizing online content. They often reflect information of interest and are frequently used in search queries. Although named entities can be detected reliably from textual content, extracting relations among them is more challenging, yet useful in various applications (e.g., news recommending systems). In this...
This paper studies cross-lingual semantic similarity (CLSS) between five European languages (i.e. English, French, German, Spanish and Italian) via unsupervised word embeddings from a cross-lingual lexicon. The vocabulary in each language is projected onto a separate high-dimensional vector space, and these vector spaces are then compared using several different distance measures (i.e., correlation,...
Wikipedia encyclopedia is an attractive source for comparable corpora in many languages. Most researchers develop their own script to perform document alignment task, which requires efforts and time. In this paper, we present WikiDocsAligner, an off-the-shelf Wikipedia Articles alignment handy tool. The implementation of WikiDocsAligner does not require the researchers to import/export of interlanguage...
Many researchers have recognized Wikipedia as a resource of huge dynamic knowledge base in recent years. This paper provides a new approach for obtaining measures of terms semantic relatedness, which maps terms to relevant Wikipedia articles as the background information for analyzing. The proposed algorithm WLA focuses on the hyperlink structure and summary paragraph extracted from the topic pages...
This paper describes our study on developing the text and speech databases for automatic speech recognition of Vietnamese using an available source of linguistic data: the Internet. First, a two-stage procedure is applied to extract a general text corpus which can be used for researches on Vietnamese language such as speech recognition, audio-visual speech recognition, and natural language processing…...
In this paper, we present HAMEX, a new public dataset that contains mathematical expressions available in their on-line handwritten form and in their audio spoken form. We have designed this dataset so that, given a mathematical expression, its handwritten signal and its audio signal can be used jointly to design multimodal recognition systems. Here, we describe the different steps that allowed us...
Social bookmarking tools are rapidly emerging on the Web as it can be witnessed by the overwhelming number of participants. In such spaces, users annotate resources by means of any keyword or tag that they find relevant, giving raise to lightweight conceptual structures aka folksonomies. In this respect, needless to mention that ontologies can be of benefit for enhancing information retrieval metrics...
This paper presents a method to identify the topics of documents based on Wikipedia category network. It is to improve the method previously proposed by Schonhofen by taking into account the weights of words in hyperlink texts in Wikipedia articles. The experiments on computing and team sport domains have been carried out and showed that our proposed method outperforms the Schonhofen's one.
The state-of-the art in visual object retrieval from large databases allows to search millions of images on the object level. Recently, complementary works have proposed systems to crawl large object databases from community photo collections on the Internet. We combine these two lines of work to a large-scale system for auto-annotation of holiday snaps. The resulting method allows for automatic labeling...
Tagging is one of the most popular services in Web 2.0. As a special form of tagging, social tagging is done collaboratively by many users, which forms a so-called folksonomy. As tagging has become widespread on the Web, the tag vocabulary is now very informal, uncontrolled, and personalized. For this reason, many tags are unfamiliar and ambiguous to users so that they fail to understand the meaning...
As a corpus for knowledge extraction, Wikipedia has become one of the promising resources among researchers in various domains such as NLP, WWW, IR and AI since it has a great coverage of concepts for wide-range domain, remarkable accuracy and easy-handled structure for analysis. Relatedness measurement among concepts is one of the traditional research topics on Wikipedia analysis. The value of relatedness...
Rating the similarity of two or more text documents is an essential task in information retrieval. For example, document similarity can be used to rank search engine results, cluster documents according to topics etc. A major challenge in calculating document similarity originates from the fact that two documents can have the same topic or even mean the same, while they use different wording to describe...
This paper presents an approach to increasing the reusability of online course repositories by collecting and annotating learning content using consensually agreed vocabularies. It proposes support to courseware authors focused on two aspects: (i) providing means for creating ontological structures (to be used as a skeleton for structuring course related material) and partially populate them with...
Document indexation is an essential task achieved by archivists or automatic indexing tools. To retrieve relevant documents to a query, keywords describing this document have to be carefully chosen. Archivists have to find out the right topic of a document before starting to extract the keywords. For an archivist indexing specialized documents, experience plays an important role. But indexing documents...
The growing amount of geo-referenced image collections facilitates the enrichment and adaptation of existing geographic thesauri to image retrieval and their exploitation in map-based interfaces (Google Map and Earth, Yahoo! Map, etc.). Most large scale systems for visualizing pictures of geographic entities are weakly structured (unless for commercial entities), with inhomogeneous coverage; they...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.