Name-alias Relationship Identification in Thai News Articles: A Comparison of Co-occurrence Matrix Construction Methods

Named entity disambiguation is one of the most challenging tasks in natural language processing. In many Thai news categories, referential ambiguity is often found, i.e., in addition to its formal names, an entity is often referred to by other names, called name aliases. Name co-occurrence informati...

Bibliographic Details
Main Authors: Thawatchai Suwanapong, Thanaruk Theeramunkong, Ekawit Nantajeewarawat
Format: Journal article
Language: English
Published: Science Faculty of Chiang Mai University 2019
Online Access: http://it.science.cmu.ac.th/ejournal/dl.php?journal_id=8506
http://cmuir.cmu.ac.th/jspui/handle/6653943832/63997
Institution: Chiang Mai University
id th-cmuir.6653943832-63997
record_format dspace
spelling th-cmuir.6653943832-63997 2019-05-07T09:59:42Z 2017 0125-2526 Eng
institution Chiang Mai University
building Chiang Mai University Library
country Thailand
collection CMU Intellectual Repository
language English
description Named entity disambiguation is one of the most challenging tasks in natural language processing. In many Thai news categories, referential ambiguity is often found, i.e., in addition to its formal names, an entity is often referred to by other names, called name aliases. Name co-occurrence information is very useful for name-alias relationship identification, and it is usually represented by a co-occurrence matrix in the vector space model. Traditionally, a co-occurrence matrix is constructed by multiplying a weighted name-by-document matrix, possibly normalized, and its transpose. This paper proposes an alternative co-occurrence matrix construction method using association measures. The effects of association measures are investigated by comparing their use with the traditional co-occurrence matrix construction method. Various complementary factors are considered in the comparison, e.g., weighting schemes, a normalization process, and linkage functions for hierarchical clustering. Two collections of Thai news articles, 1,000 articles in the domain of football and 1,000 articles in the domain of politics, are used in experiments. The experimental results show that co-occurrence matrix construction using association measures yields the highest performance in both news domains.
format Journal article
author Thawatchai Suwanapong
Thanaruk Theeramunkong
Ekawit Nantajeewarawat
title Name-alias Relationship Identification in Thai News Articles: A Comparison of Co-occurrence Matrix Construction Methods
publisher Science Faculty of Chiang Mai University
publishDate 2019
url http://it.science.cmu.ac.th/ejournal/dl.php?journal_id=8506
http://cmuir.cmu.ac.th/jspui/handle/6653943832/63997
_version_ 1681425999409971200
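
The description above contrasts two ways of building a name co-occurrence matrix: multiplying a weighted, possibly normalized, name-by-document matrix by its transpose, and computing an association measure for each pair of names. The following is a minimal sketch of that contrast, not the authors' implementation. It assumes NumPy, raw occurrence counts as the weighting scheme, row-wise L2 normalization, and the Jaccard coefficient as the association measure; the record does not say which measures the paper uses. The function names, the toy matrix, and the name labels are all hypothetical.

```python
import numpy as np


def cooccurrence_by_product(W, normalize=True):
    """Traditional construction: C = W @ W.T for a names-by-documents matrix W."""
    W = np.asarray(W, dtype=float)
    if normalize:
        # Assumed normalization: scale each name's row to unit L2 length.
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        norms[norms == 0] = 1.0  # leave all-zero rows unchanged
        W = W / norms
    return W @ W.T


def cooccurrence_by_jaccard(W):
    """Association-measure construction, with Jaccard as an assumed example."""
    B = (np.asarray(W) > 0).astype(float)  # 1 if the name occurs in the document
    both = B @ B.T                         # documents containing both names
    df = B.sum(axis=1)                     # documents containing each name
    either = df[:, None] + df[None, :] - both
    return np.divide(both, either, out=np.zeros_like(both), where=either > 0)


# Hypothetical toy data: 3 names (rows) counted over 4 documents (columns).
names = ["name_a", "alias_a", "name_b"]
W = np.array([
    [2, 0, 1, 0],
    [1, 0, 3, 0],
    [0, 1, 0, 2],
])

C_product = cooccurrence_by_product(W)
C_jaccard = cooccurrence_by_jaccard(W)
print(np.round(C_product, 3))
print(np.round(C_jaccard, 3))
```

Either matrix could then be passed to a hierarchical clustering routine to group a formal name with its aliases; the choice of linkage function is one of the factors the description says the paper compares.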