Product name recognition and normalization in internet forums

Bibliographic Details
Main Author: Yao, Yangjie
Other Authors: Sun Aixin
Format: Theses and Dissertations
Language: English
Published: 2014
Subjects: DRNTU::Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/61814
Institution: Nanyang Technological University
id sg-ntu-dr.10356-61814
record_format dspace
spelling sg-ntu-dr.10356-61814 2023-03-04T00:46:06Z Product name recognition and normalization in internet forums Yao, Yangjie Sun Aixin School of Computer Engineering DRNTU::Engineering::Computer science and engineering MASTER OF ENGINEERING (SCE) 2014-10-27T06:37:29Z 2014-10-27T06:37:29Z 2014 2014 Thesis Yao, Y. (2014). Product name recognition and normalization in internet forums. Master’s thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/61814 10.32657/10356/61814 en 69 p. application/pdf
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic DRNTU::Engineering::Computer science and engineering
spellingShingle DRNTU::Engineering::Computer science and engineering
Yao, Yangjie
Product name recognition and normalization in internet forums
description Collecting user feedback on products is a common practice that helps product providers better understand consumers' concerns and requirements and improve their products or marketing strategies. Dedicated review sites (e.g., Epinions, Amazon, CNET reviews) offer a relatively straightforward source, because user feedback about a specific product is usually well organized in a list. Collecting user feedback from Internet forums, by contrast, is challenging. One reason is that feedback about a product is often spread across different discussion threads. More importantly, users often mention product names with a large number of name variations. On the other hand, Internet forums cover feedback from many more users, so feedback on a more comprehensive set of aspects can be obtained. We propose a method named Gren to recognize and normalize mobile phone names in Internet forums. Instead of directly recognizing phone names in sentences, as most named entity recognition (NER) tasks do, we first generate candidate names. The candidate names capture short forms, spelling variations, and nicknames of products, but are not noise free. To predict whether a candidate name mention in a sentence indeed refers to a specific phone model, we develop a CRF-based name recognizer. The CRF (Conditional Random Field) model is trained on a large set of sentences obtained in a semiautomatic manner with minimal manual labeling effort. Lastly, a rule-based name normalization component maps each recognized name to its formal form. For evaluation, we randomly selected 20 threads related to 20 mobile phones from an Internet forum. Each thread contains about 100 posts. We manually labeled the mobile phone name mentions in these posts and mapped the true mentions to their formal names. In total, about 4000 sentences were manually labeled, containing about 1000 phone name mentions. Evaluated on the labeled data, Gren outperforms all baseline methods. Specifically, it achieves precision of 0.918 and recall of 0.875 with the best feature setting. Compared to Stanford NER, which is considered a strong baseline, a 134% improvement in recall is observed. We also provide a detailed analysis of the intermediate results obtained by each of the three components in Gren and observe that features from Brown clustering are the most effective. Removing them results in the largest degradation in F1, from 0.896 to 0.804. We further draw two implications for NER tasks from our observations. First, if candidate named entities can be pre-generated, a large number of training examples may be produced at very low manual annotation cost. Second, if we can segment the sentences and pre-generate the text chunks, we are able to rewrite the sentences. The rewriting lets us take the words surrounding a candidate named entity as its context in a more natural manner.
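The reported F1 can be cross-checked against the reported precision and recall. The short sketch below assumes F1 is the usual harmonic mean of precision and recall (the record does not state the formula); the function name `f1_score` is illustrative and not taken from the thesis.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract for Gren's best feature setting.
precision, recall = 0.918, 0.875
f1 = f1_score(precision, recall)
print(round(f1, 3))  # 0.896, matching the F1 quoted in the abstract
```

Under this assumption the three figures are mutually consistent: 2 x 0.918 x 0.875 / (0.918 + 0.875) rounds to 0.896.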
author2 Sun Aixin
author_facet Sun Aixin
Yao, Yangjie
format Theses and Dissertations
author Yao, Yangjie
author_sort Yao, Yangjie
title Product name recognition and normalization in internet forums
title_short Product name recognition and normalization in internet forums
title_full Product name recognition and normalization in internet forums
title_fullStr Product name recognition and normalization in internet forums
title_full_unstemmed Product name recognition and normalization in internet forums
title_sort product name recognition and normalization in internet forums
publishDate 2014
url https://hdl.handle.net/10356/61814
_version_ 1759853761629519872