Time expression and named entity analysis and recognition

This dissertation presents our analysis of intrinsic characteristics of time expressions and named entities, and our use of these characteristics to design algorithms to recognize time expressions and named entities from unstructured text. Regarding time expressions, we analyze four diverse dataset...

Bibliographic Details
Main Author: Zhong, Xiaoshi
Other Authors: Erik Cambria
Format: Thesis-Doctor of Philosophy
Language: English
Published: Nanyang Technological University 2020
Subjects: Engineering::Computer science and engineering
Online Access: https://hdl.handle.net/10356/142924
Institution: Nanyang Technological University
id sg-ntu-dr.10356-142924
record_format dspace
institution Nanyang Technological University
building NTU Library
continent Asia
country Singapore
content_provider NTU Library
collection DR-NTU
language English
topic Engineering::Computer science and engineering
description This dissertation presents our analysis of the intrinsic characteristics of time expressions and named entities, and our use of these characteristics to design algorithms that recognize time expressions and named entities in unstructured text. Regarding time expressions, we analyze four diverse datasets and find five characteristics common to them. Firstly, most time expressions are very short. Secondly, most time expressions contain at least one time-related word that can distinguish time expressions from common text. Thirdly, only a small group of words is used to express time information. Fourthly, words in time expressions demonstrate similar syntactic behaviour. Finally, time expressions are formed with a loose structure. Based on these five characteristics, we propose two methods to model time expressions. The first is a type-based method termed SynTime. SynTime defines three main syntactic token types, namely time token, modifier, and numeral, to group time-related token regular expressions, and designs a small set of general heuristic rules to recognize time expressions. These heuristic rules depend only on token types and are independent of specific tokens; therefore, SynTime is independent of specific domains and of specific text types that consist of specific tokens. Our second method is a learning-based method termed TOMN. TOMN defines a constituent-based tagging scheme with four tags, namely T, M, N, and O, indicating four types of constituent words of time expressions. In modeling, TOMN assigns each word a TOMN tag under conditional random fields (CRFs) with minimal features. Essentially, our TOMN scheme overcomes the problem of inconsistent tag assignment caused by conventional position-based tagging schemes (e.g., the BIO and BILOU schemes). Experimental results on three datasets demonstrate the efficiency, effectiveness, and robustness of SynTime and TOMN against four state-of-the-art methods.
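The contrast between position-based and constituent-based tagging described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the mini-lexicons, the regular expressions, the helper names `bio_tags` and `tomn_tags`, and the reading of T, M, N, and O as time token, modifier, numeral, and outside are all assumptions made for the example.

```python
import re

# Hypothetical mini-lexicons for illustration only; the actual lexicons
# used in the thesis are not given in this record.
TIME_TOKENS = {"september", "monday", "year", "week", "today"}
MODIFIERS = {"last", "next", "this", "ago", "early"}
YEAR_RE = re.compile(r"^(19|20)\d{2}$")   # assumption: 4-digit years count as time tokens
NUMERAL_RE = re.compile(r"^\d+$")


def bio_tags(words, span):
    """Position-based tags: B for the first word of the expression, I for the rest."""
    start, end = span  # end is exclusive
    tags = []
    for i in range(len(words)):
        if i == start:
            tags.append("B")
        elif start < i < end:
            tags.append("I")
        else:
            tags.append("O")
    return tags


def tomn_tags(words, span):
    """Constituent-based tags: inside the expression, tag each word by its type."""
    start, end = span
    tags = []
    for i, word in enumerate(words):
        w = word.lower()
        if not (start <= i < end):
            tags.append("O")
        elif w in TIME_TOKENS or YEAR_RE.match(w):
            tags.append("T")
        elif w in MODIFIERS:
            tags.append("M")
        elif NUMERAL_RE.match(w):
            tags.append("N")
        else:
            tags.append("O")  # assumption: unrecognized words fall back to O
    return tags


s1 = "He left in September 2006".split()   # time expression: words 3..4
s2 = "He left last September".split()      # time expression: words 2..3

# Under BIO the same word gets different tags depending on its position.
print(bio_tags(s1, (3, 5)), bio_tags(s2, (2, 4)))
# Under the constituent-based scheme it gets the same tag in both sentences.
print(tomn_tags(s1, (3, 5)), tomn_tags(s2, (2, 4)))
```

In this sketch, "September" is tagged B in the first sentence and I in the second under BIO, while the type-based scheme tags it T both times, which is one way to read the "inconsistent tag assignment" that the abstract attributes to position-based schemes.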
Regarding named entities, we analyze two benchmark datasets and find three characteristics common to them. Firstly, most named entities contain uncommon words, which mainly appear in named entities and rarely appear in common text. Secondly, named entities are mainly made up of proper nouns. Thirdly, named entities are formed with a loose structure. These three characteristics motivate us to design a CRF-based learning method termed UGTO to model named entities. Like TOMN, UGTO defines a constituent-based tagging scheme with four tags, namely U, G, T, and O, indicating four types of constituent words of named entities, namely uncommon words, generic modifiers, trigger words, and words outside named entities. In modeling, our UGTO scheme models named entities under a CRF framework with minimal features. Experiments on two diverse benchmark datasets show that UGTO performs more effectively than two representative state-of-the-art methods. When analyzing time expressions and named entities, we discover that their length, in terms of the number of words, follows a family of power-law distributions. Furthermore, we find that these power-law distributions widely appear in the length-frequency of entities in seventeen languages (e.g., Chinese, English, and German) and in different types of entities (e.g., named entities, time expressions, and aspect terms). We explain this linguistic phenomenon by the principle of least effort in communication and the preference for short entities, and justify our explanation by a stochastic process, in which the probabilities are derived from real-world datasets, that reproduces power-law distributions in the length-frequency of generated entities.
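As a rough illustration of what a power law in length-frequency means, the sketch below counts entities by length and fits the simplest form, f(L) = c · L^(−α), by least squares in log-log space. The function names, the single-exponent form, and the fitting method are assumptions for this example; the abstract refers to a family of power-law distributions and does not say how they were fitted. The counts used here are synthetic, not results from the thesis.

```python
import math
from collections import Counter


def length_frequency(entities):
    """Count how many entities have each length, measured in words."""
    return Counter(len(entity.split()) for entity in entities)


def fit_power_law(freq):
    """Least-squares fit of log f = log c - alpha * log L; returns (alpha, c)."""
    lengths = sorted(freq)
    xs = [math.log(length) for length in lengths]
    ys = [math.log(freq[length]) for length in lengths]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Counting lengths for a few example entities.
freq = length_frequency(["New York", "Singapore", "last week"])

# Synthetic counts constructed to follow f(L) = 1000 * L**-2 exactly,
# used only to check that the fit recovers the exponent.
synthetic = {length: 1000 * length ** -2.0 for length in range(1, 9)}
alpha, c = fit_power_law(synthetic)
print(round(alpha, 6), round(c, 3))
```

Log-log regression is a convenient first check but a crude estimator on real counts, where noisy tails can bias the slope; a maximum-likelihood fit would be the more careful choice.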
author2 Erik Cambria
author_facet Erik Cambria
Zhong, Xiaoshi
format Thesis-Doctor of Philosophy
author Zhong, Xiaoshi
author_sort Zhong, Xiaoshi
title Time expression and named entity analysis and recognition
publisher Nanyang Technological University
publishDate 2020
url https://hdl.handle.net/10356/142924
_version_ 1683493251138453504
spelling sg-ntu-dr.10356-142924 2020-10-28T08:40:54Z
Time expression and named entity analysis and recognition
Zhong, Xiaoshi
Erik Cambria
Jagath C Rajapakse
School of Computer Science and Engineering
Centre for Computational Intelligence
cambria@ntu.edu.sg, ASJagath@ntu.edu.sg
Engineering::Computer science and engineering
Doctor of Philosophy
2020-07-13T02:58:49Z 2020-07-13T02:58:49Z 2020
Thesis-Doctor of Philosophy
Zhong, X. (2020). Time expression and named entity analysis and recognition. Doctoral thesis, Nanyang Technological University, Singapore.
https://hdl.handle.net/10356/142924
10.32657/10356/142924
en
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
application/pdf
Nanyang Technological University