Graph of words for document classification
Main Author: | |
---|---|
Other Authors: | |
Format: | Final Year Project |
Language: | English |
Published: | 2018 |
Subjects: | |
Online Access: | http://hdl.handle.net/10356/75040 |
Institution: | Nanyang Technological University |
Summary: | This final year project (FYP) implements and experimentally studies a novel framework for large-scale classification of text documents. Under this framework, documents are first converted from sentences into graphs-of-words, so that the original classification problem becomes a graph classification problem and the representation learning (RL) model subgraph2vec can be applied. However, as with many other RL-based methods, efficiency is a serious problem because NLP datasets generally have very large vocabularies. The project therefore proposes a hash-embedding version of subgraph2vec that significantly reduces the memory required during training, making the system more efficient without harming the quality of the resulting representations. The approach is evaluated in terms of time, required memory, accuracy, and F1 score on benchmark datasets from three domains (the first two are graph classification tasks and the last is document classification). In the experiments, the proposed approach outperforms other RL-based methods and achieves results comparable to the state-of-the-art method. Finally, the project introduces a semi-supervised version of the method and observes significant improvements on a sentiment analysis task. |
---|
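
The conversion of a document into a graph-of-words, as mentioned in the summary, can be illustrated with a minimal Python sketch. The record does not describe how the project builds its graphs, so everything below is an assumption made for illustration: whitespace tokenization, an undirected graph, a sliding co-occurrence window, and the hypothetical names `build_graph_of_words` and `window`.

```python
from collections import defaultdict


def build_graph_of_words(text, window=3):
    """Build an undirected co-occurrence graph from a piece of text.

    Assumption (not from the source record): two words are linked when they
    appear within `window` consecutive tokens of each other, and the edge
    weight counts how often that happens.
    """
    tokens = text.lower().split()
    edges = defaultdict(int)
    for i, word in enumerate(tokens):
        # Link the current word to the next (window - 1) tokens.
        for neighbor in tokens[i + 1 : i + window]:
            if neighbor != word:  # skip self-loops
                # Sort the pair so (a, b) and (b, a) map to the same edge.
                edge = tuple(sorted((word, neighbor)))
                edges[edge] += 1
    return set(tokens), dict(edges)


nodes, edges = build_graph_of_words(
    "graph of words for document classification", window=2
)
```

Each document then becomes one graph (a node set plus weighted edges), which is the kind of input a graph classification method such as subgraph2vec would work on. A larger `window` produces denser graphs, and the right value would have to be chosen experimentally.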