Fuzzy bag-of-words model for document representation

One key issue in text mining and natural language processing is how to effectively represent documents using numerical vectors. One classical model is the Bag-of-Words (BoW). In a BoW-based vector representation of a document, each element denotes the normalized number of occurrences of a basis term...

Bibliographic Details
Main Authors: Zhao, Rui, Mao, Kezhi
Other Authors: School of Electrical and Electronic Engineering
Format: Article
Language: English
Published: 2020
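Subjects: Engineering::Electrical and electronic engineering; Document Classification; Document Representation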
Online Access: https://hdl.handle.net/10356/142434
Institution: Nanyang Technological University
id sg-ntu-dr.10356-142434
record_format dspace
citation Zhao, R., & Mao, K. (2018). Fuzzy bag-of-words model for document representation. IEEE Transactions on Fuzzy Systems, 26(2), 794-804. doi:10.1109/TFUZZ.2017.2690222
type Journal Article
issn 1063-6706
doi 10.1109/TFUZZ.2017.2690222
scopus_id 2-s2.0-85045022736
rights © 2017 IEEE. All rights reserved.
timestamp 2020-06-22T04:39:47Z
institution Nanyang Technological University
building NTU Library
country Singapore
collection DR-NTU
language English
topic Engineering::Electrical and electronic engineering
Document Classification
Document Representation
description One key issue in text mining and natural language processing is how to effectively represent documents using numerical vectors. One classical model is the Bag-of-Words (BoW). In a BoW-based vector representation of a document, each element denotes the normalized number of occurrences of a basis term in the document. To count the occurrences of a basis term, BoW conducts exact word matching, which can be regarded as a hard mapping from words to the basis term. The BoW representation suffers from intrinsic extreme sparsity, high dimensionality, and an inability to capture the high-level semantic meaning behind text data. To address these issues, we propose a new document representation method named fuzzy Bag-of-Words (FBoW) in this paper. FBoW adopts a fuzzy mapping based on the semantic correlation among words, quantified by cosine similarity between word embeddings. Since semantic word matching is used instead of exact string matching, FBoW can encode more semantics into the numerical representation. In addition, we propose to use word clusters instead of individual words as basis terms and develop fuzzy Bag-of-WordClusters (FBoWC) models. Three variants are proposed under the FBoWC framework, based on three different similarity measures between word clusters and words: FBoWC mean, FBoWC max, and FBoWC min. Document representations learned by the proposed FBoW and FBoWC are dense and able to encode high-level semantics. The task of document categorization is used to evaluate the representations learned by the proposed FBoW and FBoWC methods. Results on seven real-world document classification datasets, in comparison with six document representation learning methods, show that FBoW and FBoWC achieve the highest classification accuracies.
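The difference between the hard mapping of BoW and the fuzzy mapping described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 2-D embeddings, the use of raw (unnormalized) counts, and the clipping of negative cosine similarities to zero are all assumptions made for illustration. The cluster variants are approximated by aggregating word-to-member similarities with mean, max, or min, which is one plausible reading of the abstract.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean(xs):
    return sum(xs) / len(xs)


def bow(tokens, basis):
    """Hard mapping: a token counts only toward the basis term it matches exactly."""
    counts = Counter(tokens)
    return [float(counts[t]) for t in basis]


def fbow(tokens, basis, emb):
    """Fuzzy mapping: each token contributes to every basis term in proportion
    to its embedding similarity. Clipping at zero is an assumption."""
    counts = Counter(w for w in tokens if w in emb)
    return [
        sum(c * max(0.0, cosine(emb[t], emb[w])) for w, c in counts.items())
        for t in basis
    ]


def fbowc(tokens, clusters, emb, agg=mean):
    """Cluster variant: basis terms are word clusters; the word-to-cluster
    similarity aggregates word-to-member similarities with agg (mean, max, min)."""
    counts = Counter(w for w in tokens if w in emb)
    vec = []
    for members in clusters:
        total = 0.0
        for w, c in counts.items():
            sims = [cosine(emb[m], emb[w]) for m in members]
            total += c * max(0.0, agg(sims))
        vec.append(total)
    return vec


# Hypothetical toy embeddings, chosen only to make the arithmetic easy to follow.
emb = {"cat": [1.0, 0.0], "kitten": [0.8, 0.6], "car": [0.0, 1.0]}
basis = ["cat", "car"]
clusters = [["cat", "kitten"], ["car"]]
doc = ["kitten", "kitten"]

hard = bow(doc, basis)                    # no exact match, so all zeros
fuzzy = fbow(doc, basis, emb)             # dense: "kitten" partially matches both terms
by_mean = fbowc(doc, clusters, emb, mean)
by_max = fbowc(doc, clusters, emb, max)
by_min = fbowc(doc, clusters, emb, min)
```

In this toy case the document never mentions "cat", so the BoW vector is all zeros, while the fuzzy variant assigns "kitten" a weight of roughly 0.8 per occurrence toward "cat". This is the sense in which the abstract calls the resulting representations dense.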