This paper addresses the problem of how to get an overview of the knowledge associated with a given query keyword. In particular, we focus on the concerns of those who search for Web pages with that keyword. The Web search information needs for a given query keyword are collected through search engine suggestions. Given a query keyword, we collect up to around 1,000 suggestions, many of which are redundant. We cluster...
Comprehensibility is an important quality aspect of documents. Incomprehensible documents are of little utility to readers even if they are relevant. However, for many difficult queries such as technical ones, the topically relevant documents tend to be characterized by poor comprehensibility. This makes it difficult for users to satisfy their information needs when searching for documents about difficult...
Web entities contain a wealth of information. Users would prefer to get a list of relevant entities rather than a list of web pages when they submit a query to a search engine. Research on related entity finding (REF) is therefore meaningful work. In this paper we investigate the last task of REF: Entity Homepage Finding. We propose a combining multi-attributes (five attributes)...
We present SAFEWapp, an open-source static analysis framework for JavaScript web applications. It provides a faithful (partial) model of web application execution environments of various browsers, based on empirical data from the main web pages of the 9,465 most popular websites. A main feature of SAFEWapp is the configurability of DOM tree abstraction levels to allow users to adjust a trade-off between...
As a means to share knowledge, the community question answering (CQA) service provides users a chance to obtain or provide help by raising or answering questions. After a question is posted, the system must find an appropriate individual to answer this question. Several approaches have recently been proposed to find experts in CQA. In this paper, a new method to find experts in CQA is proposed by...
Wikipedia, a collaborative and user-driven encyclopedia, is considered to be the largest content thesaurus on the web, expanding into a massive database housing a huge amount of information. In this paper, we present the design and implementation of a MapReduce-based Wikipedia link analysis system that provides a hierarchical examination of document connectivity in Wikipedia and captures the semantic...
With the fast growth of information available through the World Wide Web, search engine ranking has become inadequate for dealing with such an enormous amount of information. Web search engines should be enriched with methodologies that enable them to understand the content of Web pages and then assign pages to the query category that best matches their content. In this paper, a proposed system is...
Numerous critical Internet applications with high-quality services, such as Web directories, search engines, Web crawlers, recommendation systems, and user profile detectors, almost all depend on an efficient and accurate web page classification system. Traditional supervised or semi-supervised machine learning methods find it increasingly difficult to adapt to the explosive growth of Internet information. This...
Interesting, targeted, relevant advertising is considered one of the most honest sources of revenue for personalized recommendation. Topic identification is the most important technique for unstructured web pages. Conventional content classification approaches based on bag of words have difficulty processing massive numbers of web pages. In this paper, Wikipedia Category Network (WCN) nodes are used to identify...
Social bookmarking services allow users to add bookmarks of web pages with freely chosen keywords as tags. Personalized recommender systems recommend new and useful bookmarks added by other users. We propose a new method to find similar users and to select relevant bookmarks in a social bookmarking service. Our method is lightweight, because it uses a small set of important tags for each user to...
We present a statistical model for content extraction from HTML documents. The model operates on the Document Object Model (DOM) tree of the corresponding HTML document. It evaluates each tree node and its associated statistical features to predict the significance of the node to the overall content of the document. The model exploits a feature set including link densities and text distribution across the nodes...
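As an illustration of one feature named in the abstract above, link density can be approximated as the share of a fragment's text that sits inside anchor elements. The sketch below is not the paper's model: the class `LinkDensityParser`, the function `link_density`, and the exact definition of the ratio are assumptions made for this example, and it uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Hypothetical helper: counts text characters overall and inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # > 0 while we are inside an <a> element
        self.total_chars = 0    # all non-whitespace-trimmed text characters
        self.link_chars = 0     # text characters that appear inside links

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.anchor_depth > 0:
            self.link_chars += n


def link_density(html_fragment: str) -> float:
    """Return link text length divided by total text length (0.0 if no text)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()  # flush any buffered trailing text
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars
```

Under this assumed definition, a navigation bar made only of links scores close to 1, while a paragraph of running text scores close to 0, which is why such a ratio is a plausible signal for separating boilerplate from main content.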
Nowadays we face daily information overload, which makes it difficult to get exactly the information we need. It often happens that while reading, we find a word we do not understand and need an explanation or some additional information about it. For this purpose, annotations in the Web environment are created and attached to such words. In this paper we propose a method for...
This paper presents a novel approach for extracting the main content from Web documents written in languages not based on the Latin alphabet. In practice, the HTML tags are based on the English language and, certainly, the English character set is encoded in the interval [0,127] of the Unicode character set. On the other hand, many languages, such as the Arabic language, use a different interval for...
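The observation in the abstract above, that HTML markup falls in the code point interval [0,127] while languages such as Arabic use other intervals, can be illustrated with a simple per-line measure. The following is a minimal sketch under stated assumptions, not the paper's algorithm: the function name `non_ascii_ratio` and the idea of scoring text by its share of code points above 127 are hypothetical choices made for this example.

```python
def non_ascii_ratio(text: str) -> float:
    """Share of non-whitespace characters whose code point lies outside [0, 127].

    Assumption for illustration: a high ratio suggests content written in a
    non-Latin script, a low ratio suggests markup or ASCII-only text.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    outside = sum(1 for c in chars if ord(c) > 127)
    return outside / len(chars)
```

For example, a line consisting only of tags such as `<div class='x'>` yields 0.0, while a line of Arabic text yields 1.0, so lines could be ranked by this ratio before any further processing.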
Crowdsourcing is becoming more and more important for commercial purposes. With the growth of crowdsourcing platforms like MTurk or Microworkers, a huge workforce and a large knowledge base can be easily accessed and utilized. But due to the anonymity of the workers, they are encouraged to cheat the employers in order to maximize their income. Thus, this paper presents two crowd-based approaches...
Currently, the user's web search is disjoint from the resources that are subsequently browsed. Specifically, the related instances of the search are not displayed on the following pages. This lack of continuity between the actual search and the web sites displayed may lead to skimming by the user to identify what is relevant on the pages. This paper presents an approach to the continuous modeling of...
Recent years have seen rapid advancement in Information and Communication Technology (ICT). However, the explosive growth of ICT and its many applications in education, health, agriculture, etc. are confined to a limited number of privileged people who have both language and digital literacy. At present, repositories on the Internet are mainly in English; as a consequence, users unfamiliar with English...
The majority of web personalization research concentrates on customizing a single website. In contrast, recommending web pages across websites is the focus of this study. We emphasize that eliciting user interests among different topics within a domain is an important concern in cross-website page recommendations. Enhancing Wikipedia's categorization system through heuristic information extraction,...
Web-scale relation extraction is crucial to building Web people search engines. Previous extraction models, such as Snowball, focus only on single-type extraction, while real applications always require as many types of relation as possible. In this paper, we propose a novel Web-scale relation extraction framework, Multi-Type Snowball (MultiSnowball). MultiSnowball targets extracting multiple...
The origin of a music artist or a band is an important kind of musical meta-data as it usually influences his/her/its music. In this paper, we propose three approaches to automatically determine the country of origin of a person or institution, which we apply to music artists and bands. The first approach investigates estimates of page counts returned for specific queries to Web search engines. The...
The content and structure of linked information such as sets of web pages or research paper archives are dynamic and keep changing. Even though different methods have been proposed to exploit both the link structure and the content information, no existing approach can effectively deal with this evolution. We propose a novel joint model, called Link-IPLSI, to combine texts and links in a topic modeling...