Extracting Table Information from the Web

Yeon-Seok Kim; Kyong-Ho Lee

doi:10.1007/978-3-540-28640-0_41

Source

Lecture Notes in Computer Science > Document Analysis Systems VI > Internet Documents > 438-441

Abstract

With the ubiquity of the Web, the volume of Web documents continues to grow rapidly. Since the Web is a vast source of information, extracting useful information from Web documents is important.

HTML (Hypertext Markup Language), a format for the visual rendering of Web documents, defines the <table> tag for representing a table. However, most existing HTML documents use <table> tags to control the formatting layout of a document. As a prerequisite for information extraction from the Web, it is therefore necessary to determine whether <table> tags are used to present genuine tables or not.

Generally, a table is a facility for presenting relational information structurally and concisely. This paper defines a table as an array of relational data. Specifically, following previous work, we regard a table that relates an attribute to its value as a genuine table. In this paper, the set of attribute cells and the set of value cells are defined as the attribute area and the value area, respectively.
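The attribute area and value area defined above can be illustrated with a minimal Python sketch. It is not the method of the paper, which this abstract does not describe. As a simplifying assumption, it treats <th> cells as the attribute area and <td> cells as the value area, and it flags a table as possibly genuine only when both areas are non-empty. The names `TableCellCollector`, `looks_genuine`, and `classify_tables` are hypothetical and introduced only for illustration.

```python
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Collect cell text for each outermost <table> element.

    Assumption for illustration only: <th> cells form the attribute
    area and <td> cells form the value area.
    """

    def __init__(self):
        super().__init__()
        self.tables = []        # one dict per outermost <table>
        self._depth = 0         # nesting depth of <table> elements
        self._cell_tag = None   # "th" or "td" while inside a cell
        self._buffer = []       # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
            if self._depth == 1:
                self.tables.append({"attribute_area": [], "value_area": []})
        elif tag in ("th", "td") and self._depth == 1:
            self._cell_tag = tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "table" and self._depth > 0:
            self._depth -= 1
        elif tag == self._cell_tag and self._depth == 1:
            text = "".join(self._buffer).strip()
            key = "attribute_area" if tag == "th" else "value_area"
            self.tables[-1][key].append(text)
            self._cell_tag = None

    def handle_data(self, data):
        # Text inside nested tables (depth > 1) is ignored.
        if self._cell_tag is not None and self._depth == 1:
            self._buffer.append(data)


def looks_genuine(table):
    """Naive heuristic: both an attribute area and a value area exist."""
    return bool(table["attribute_area"]) and bool(table["value_area"])


def classify_tables(html):
    """Return a list of (table, is_possibly_genuine) pairs."""
    parser = TableCellCollector()
    parser.feed(html)
    parser.close()
    return [(table, looks_genuine(table)) for table in parser.tables]
```

Many real pages mark header cells with <td> rather than <th>, so this heuristic would miss them; it only shows how the two areas relate to the markup.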

Most previous work on table identification in HTML documents is tied to a specific domain or requires large amounts of training data and time. This paper presents an efficient method for identifying tables in HTML documents prior to extracting information from the Web.

Identifiers

Series ISSN: 0302-9743
Series e-ISSN: 1611-3349
Book ISBN: 978-3-540-23060-1
Book e-ISBN: 978-3-540-28640-0
DOI: 10.1007/978-3-540-28640-0_41

Authors

Yeon-Seok Kim

Yonsei University, Dept. Computer Science, Seoul, Korea

Kyong-Ho Lee

Yonsei University, Dept. Computer Science, Seoul, Korea

Additional information

Data set: Springer

Publisher

Springer Berlin Heidelberg

chapter
