Information extraction for elegislation
Main Authors: | , , , |
---|---|
Format: | text |
Language: | English |
Published: | Animo Repository 2010 |
Subjects: | |
Online Access: | https://animorepository.dlsu.edu.ph/etd_bachelors/11062 |
Institution: | De La Salle University |
Summary: | Information extraction (IE) is the process of transforming unstructured information in documents into a structured database. This technology allows narrower search results for documents stored in a Document Management System (DMS). An IE system was developed to augment a Blue Ribbon Committee (BRC) DMS for the eParticipation Project. IE architectures were studied and related tools were identified to develop the IE system specifically for the BRC. The IE system is composed of 7 minor modules (Sentence Splitter, Tokenizer, Cross Reference, Part of Speech Tagger, Unknown Word, Named Entity Recognition and Preparser), 3 major modules (Semantic Tagger, CoReference Resolution and Template Filler), and 2 external modules (Search and Evaluation). Through constant communication with the Blue Ribbon Committee, the study was able to gather documents that helped in creating the system. The output is extracted according to the client's preferences and already meets the standards requested by the Blue Ribbon Committee. Overall, the system showed favorable results in the actual testing phase, scoring 95.42%; when the initial format of the documents was followed, the system would be 100% accurate. Upon seeing the system, the main stakeholders remarked that it was already beyond their expectations and that they were very pleased with the outcome.
Parts of the system could still be improved: training the POS Tagger and the Named Entity Recognition module on the documents being fed, updating the library used to open Word document files, adding documents and templates to the system's process, adding image recognition, updating the web crawler to cover more sources, and improving the search ranking algorithm. |
---|
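
The modular architecture described in the summary can be illustrated with a minimal sketch in Python. This is not the thesis implementation: the stage functions, the shared dictionary, the sample sentence, and the single `report_no` template slot are all hypothetical, chosen only to show how a sentence splitter, a tokenizer, and a template filler might be chained so that each module consumes the output of the one before it.

```python
import re
from typing import Callable, Dict, List

# Hypothetical stage signature: each stage reads and extends a shared dict.
Stage = Callable[[Dict], Dict]


def split_sentences(doc: Dict) -> Dict:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", doc["text"].strip())
    doc["sentences"] = [s for s in parts if s]
    return doc


def tokenize(doc: Dict) -> Dict:
    # Words and individual punctuation marks become separate tokens.
    doc["tokens"] = [re.findall(r"\w+|[^\w\s]", s) for s in doc["sentences"]]
    return doc


def fill_template(doc: Dict) -> Dict:
    # Toy template with one invented slot: a "Committee Report N" reference.
    match = re.search(r"Committee Report (\d+)", doc["text"])
    doc["template"] = {"report_no": match.group(1) if match else None}
    return doc


def run_pipeline(text: str, stages: List[Stage]) -> Dict:
    # Run the stages in order, passing the growing document record along.
    doc: Dict = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline(
    "The committee filed Committee Report 12 on Monday. "
    "It was adopted on second reading.",
    [split_sentences, tokenize, fill_template],
)
print(result["template"])
```

In a fuller system, the remaining modules named in the summary (for example part-of-speech tagging or coreference resolution) would presumably be further stages in the same list, which is why the sketch keeps each stage behind one uniform signature.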