Detecting novel and interested topics from open sources based on deep neural network and natural language processing techniques

Bibliographic Details
Main Author: Ma, Shuting
Other Authors: Mao Kezhi
Format: Thesis-Master by Coursework
Language: English
Published: Nanyang Technological University 2022
Online Access: https://hdl.handle.net/10356/157271
Institution: Nanyang Technological University
Description
Summary: Piracy is one of the factors threatening the security of coastal countries. During the COVID-19 pandemic, piracy incidents have become more frequent than usual, challenging the safety of residents and social stability. At the same time, news reports on piracy incidents published in open sources are a valuable resource for piracy research. With the maturing of artificial intelligence technology and the continuing development of Natural Language Processing (NLP), how to make sound use of these open-source text materials for analysis has become an important research direction. This project first introduces possible applications of NLP to piracy news materials. Relevant piracy news materials were collected from open sources, labeled, and cleaned to form a new dataset on this topic. Four mainstream text classification models, textCNN, Bi-LSTM, Transformer, and BERT, are introduced theoretically and tested in practice, and BERT is finally selected as the base model. To address the imbalanced data classification problem, this project proposes and explores a variety of methods that combine deep learning and machine learning. On the one hand, data resampling is used to improve the balance of the dataset. On the other hand, with BERT chosen as the classifier, a Costive-SVM is constructed in a fully connected layer with Triplet Loss to separate the labels of positive and negative samples. After fine-tuning, the performance of the model improves, and the over-fitting problem encountered during optimization is resolved as well. Overall, the F1 score improved from 0.46 to 0.87.
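
The summary names three building blocks without detailing them: data resampling, Triplet Loss, and the F1 score. The sketch below illustrates one common form of each in plain Python. It is not the thesis's implementation. Random oversampling is assumed here only because the summary does not say which resampling method was used, and the Euclidean distance, the margin value, and all function names (`random_oversample`, `triplet_loss`, `f1_score`) are hypothetical choices made for illustration.

```python
import math
import random


def random_oversample(texts, labels, seed=0):
    """Duplicate samples at random until every class is as large as the biggest one.

    Assumption: the thesis may use a different resampling scheme.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in by_label.values())

    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on embedding vectors.

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is. Euclidean distance is an assumption.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In a setup like the one the summary describes, the embeddings passed to `triplet_loss` would come from the classifier's final layers, and F1 would be computed on a held-out set that has not been resampled, so that the score reflects the original class imbalance.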