This paper addresses web information extraction to support automatic teacher information management. We propose an effective approach based on block segmentation. First, teacher introduction web pages are divided into independent blocks, using HTML tags and punctuation marks as segmentation criteria. Then a CRF model is employed to label the text. We apply this approach on...
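The block segmentation step described above could look roughly like the following minimal sketch. The choice of delimiters (any HTML tag, plus commas, semicolons and their full-width Chinese counterparts), the function name `segment_blocks`, and the sample page are assumptions for illustration only; the abstract does not specify them, and the CRF labeling step is not shown.

```python
import re
from typing import List

# Assumption: every HTML tag acts as a block boundary.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: these punctuation marks (ASCII and full-width) also split blocks.
PUNCT_RE = re.compile(r"[;,；，。]")


def segment_blocks(html: str) -> List[str]:
    """Split a page into text blocks at HTML tags and punctuation marks."""
    blocks = []
    for chunk in TAG_RE.split(html):
        for piece in PUNCT_RE.split(chunk):
            text = piece.strip()
            if text:  # drop empty fragments left between adjacent delimiters
                blocks.append(text)
    return blocks


# Hypothetical teacher introduction page.
sample = (
    "<div><p>Li Wei, Professor; School of Computer Science</p>"
    "<p>Research: information extraction</p></div>"
)
blocks = segment_blocks(sample)
```

Each resulting block would then be passed to a sequence labeler, so that a label such as name, title, or affiliation applies to a whole block rather than to individual tokens.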
Primary question detection in online forums is a subtask of extracting question-answer pairs. In this paper, after surveying the forms of questions in Chinese online forums, we adopt a combination of textual and N-gram features, obtained via feature selection, to help detect primary questions. Treating primary question detection as a binary classification problem, decision tree classifier C4.5 and support...
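As a rough illustration of the N-gram features mentioned above, the sketch below counts character N-grams in a post. Whether the paper uses character-level or word-level N-grams, and which values of N, is not stated in the abstract; character unigrams and bigrams are assumed here, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def char_ngrams(text: str, n: int) -> List[str]:
    """Return all contiguous character n-grams of the text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_features(text: str, n_values: Iterable[int] = (1, 2)) -> Counter:
    """Count character n-grams for each n in n_values (assumed feature set)."""
    counts = Counter()
    for n in n_values:
        counts.update(char_ngrams(text, n))
    return counts


# Hypothetical forum post title.
features = ngram_features("怎么办", n_values=(2,))
```

Counts like these would normally be filtered by a feature selection step and then fed, together with other textual features, to a binary classifier.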
Topic models are an increasingly useful tool for analyzing semantic-level meaning and capturing topical features. However, there is little research comparing topic models. In this paper, we describe our comparative study of three topic models in the extrinsic application of topic clustering. The topic model distance is defined on the converged parameters of topic models, which...
A topic-oriented search engine (topic-search) is a new IR service that presents multiple types of information about a user-queried topic on one page. It first categorizes the user query into a domain, and then organizes several types of information based on the query keywords into a magazine-style topic page for the user. In this paper, we propose a Chinese topic-oriented search engine service,...
Annual reports of Chinese securities companies have become the most significant and reliable source of information for domestic and foreign investors. Semantic annotation of these reports enhances information retrieval and improves interoperability. In this paper we first review the major features of annual reports, which are in tagged PDF format, then propose a novel ontology-based NLP approach to semantically annotate...
Temporal information is an important characteristic of an event. It can be used in the information retrieval process to organize the returned results. In Chinese, time expressions take very complex forms, which makes it difficult both to accurately recognize a time expression and to precisely connect it with a given event in a Web page that contains multiple events. To address these problems,...
Document genre information is one of the most distinguishing features in information retrieval, and it brings order to search results. Genre classification is concerned not with the topic but with the genre of a document. In this paper, two different feature sets were employed: bag-of-words features, derived by a feature selection method, and structural features, selected manually and subjectively...
Extracting question-answer pairs from online forums is meaningful work because of the huge amount of valuable user-generated content in forums. In this paper we consider the problem of extracting Chinese question-answer pairs for the first time. We present a strategy to detect Chinese questions and their answers. We propose a sequential rule-based method to find questions in a forum thread,...
Conditional random fields (CRFs) have been used for many sequence labeling tasks with excellent results. Furthermore, this supervised model depends heavily on large amounts of training data. Active learning is an alternative to relying on a large amount of random sampling. Nevertheless, random sampling can contribute constructively to choosing optimal training examples. Based on different query strategies,...
Web page content extraction can be performed by node-based or segmentation-based algorithms on top of the document object model (DOM). However, the node-based algorithm often removes content embedded as anchor text, while the segmentation-based approach cannot distinguish irrelevant text from content text when both fall into the same segment. The two kinds of algorithms don't keep...
Text clustering techniques are commonly used to organize text documents into topic-related groups, which help users gain a comprehensive understanding of a corpus or of the results from an information retrieval system. Most existing text clustering algorithms, which derive from traditional clustering of formatted data, rely heavily on term analysis methods and adopt the vector space model (VSM) as...
Similarity may be the most universal relationship that exists between any two objects in either the material world or the mental world. Although similarity modeling has been a focus of cognitive science for decades, many theoretical and practical issues remain controversial. In this paper, a new theoretical framework that conforms to the nature of similarity and incorporates...