Query-based text extraction algorithm for web pages.
Main Author: | |
---|---|
Other Authors: | |
Format: | Theses and Dissertations |
Published: | 2008 |
Subjects: | |
Online Access: | http://hdl.handle.net/10356/1632 |
Institution: | Nanyang Technological University |
Summary: | The objective of this research is to develop a query-based text extraction algorithm that automatically generates an abstract from a Web document. The algorithm was derived from a study of a sample of 60 Web pages, chosen from 5 different subject areas and retrieved using the AltaVista search engine. It is based on sentence weight (computed through a simple calculation), cue words, the location of the sentence, and the application of canned abstracts. To test the new algorithm, a total of 50 Web pages (from 10 different subject areas) were retrieved from the Internet using the AltaVista search engine. The abstracts of these Web pages were then generated by hand by simulating the new algorithm. |
---|
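The summary names the ingredients of the algorithm (sentence weight, cue words, sentence location) but this record does not give the actual formula. The following is only a rough sketch of how a query-based sentence scorer of this general kind might look, not the thesis's method. The cue-word list, the numeric weights, the location rule, and all function names are assumptions made for illustration, and the canned-abstract step is left out because the record does not describe it.

```python
import re

# Hypothetical cue words: the record does not list the ones used in the thesis.
CUE_WORDS = {"objective", "conclusion", "summary", "results", "propose"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def score_sentence(sentence, index, total, query_terms):
    """Combine query overlap, cue words, and location into one weight.

    The weights (2.0, 1.0, 1.0) are assumed values, not taken from the thesis.
    """
    words = tokenize(sentence)
    if not words:
        return 0.0
    query_hits = sum(1 for w in words if w in query_terms)
    cue_hits = sum(1 for w in words if w in CUE_WORDS)
    # Assumed location rule: favor the first and last sentence of the page.
    location_bonus = 1.0 if index in (0, total - 1) else 0.0
    return 2.0 * query_hits + 1.0 * cue_hits + location_bonus


def extract_abstract(text, query, max_sentences=2):
    """Return the top-scoring sentences, kept in their original order."""
    query_terms = set(tokenize(query))
    sentences = split_sentences(text)
    scored = [
        (score_sentence(s, i, len(sentences), query_terms), i, s)
        for i, s in enumerate(sentences)
    ]
    # Highest score first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:max_sentences]
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))


if __name__ == "__main__":
    page = (
        "Solar panels convert sunlight into electricity. "
        "Many people like cats. "
        "The weather was nice yesterday. "
        "In conclusion, solar energy is growing fast."
    )
    print(extract_abstract(page, "solar energy"))
```

Selected sentences are returned in document order rather than score order, so the extract reads as a passage and not as a ranked list.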