Refined Urdu Lexicon Development K-Means Clustering Based Computational Model Using Colloquial Romanized Urdu Dataset

Urdu is among the most widely used languages in the world for verbal and written communication. Due to the lack of optimized and user-friendly native Urdu-script support on various platforms, it is mostly written in Romanized script in soft form. In our research, we have developed a refined Urdu lexicon...

Bibliographic Details
Main Authors:	Baseer, F., Jaafar, J., Aziz, I.B.A., Habib, A.
Format:	Conference or Workshop Item
Published:	Institute of Electrical and Electronics Engineers Inc. 2020
Online Access:	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85097536620&doi=10.1109%2fICCI51257.2020.9247814&partnerID=40&md5=1b1f615b9f333e079497762ef059e259 http://eprints.utp.edu.my/29859/
Institution:	Universiti Teknologi Petronas

id	my.utp.eprints.29859
record_format	eprints
spelling	my.utp.eprints.29859 2022-03-25T02:58:20Z Refined Urdu Lexicon Development K-Means Clustering Based Computational Model Using Colloquial Romanized Urdu Dataset Baseer, F. Jaafar, J. Aziz, I.B.A. Habib, A. Urdu is among the most widely used languages in the world for verbal and written communication. Due to the lack of optimized and user-friendly native Urdu-script support on various platforms, it is mostly written in Romanized script in soft form. In our research, we have developed a refined Urdu lexicon using tokens with the highest frequency of occurrence in the data set. This data set is a raw corpus of colloquial Urdu written in Romanized script. The corpus was collected from volunteer participants who used this language as a mode of communication on the Internet and in text messaging. The raw corpus is passed through a series of steps such as Preprocessing, Tokenization and Annotation before it is passed to the computationally intensive subsequent steps. Edit Distance and K-means Clustering techniques are used for identification of candidate tokens and their potential selection/inclusion in the refined lexicon. We have also identified the most commonly used tokens, candidate tokens and other lingual attributes from the data collected. Based on this analysis, we have proposed a computational model for refined colloquial Romanized Urdu lexicon development. © 2020 IEEE. Institute of Electrical and Electronics Engineers Inc. 2020 Conference or Workshop Item NonPeerReviewed https://www.scopus.com/inward/record.uri?eid=2-s2.0-85097536620&doi=10.1109%2fICCI51257.2020.9247814&partnerID=40&md5=1b1f615b9f333e079497762ef059e259 Baseer, F. and Jaafar, J. and Aziz, I.B.A. and Habib, A. (2020) Refined Urdu Lexicon Development K-Means Clustering Based Computational Model Using Colloquial Romanized Urdu Dataset. In: UNSPECIFIED. http://eprints.utp.edu.my/29859/
institution	Universiti Teknologi Petronas
building	UTP Resource Centre
collection	Institutional Repository
continent	Asia
country	Malaysia
content_provider	Universiti Teknologi Petronas
content_source	UTP Institutional Repository
url_provider	http://eprints.utp.edu.my/
description	Urdu is among the most widely used languages in the world for verbal and written communication. Due to the lack of optimized and user-friendly native Urdu-script support on various platforms, it is mostly written in Romanized script in soft form. In our research, we have developed a refined Urdu lexicon using tokens with the highest frequency of occurrence in the data set. This data set is a raw corpus of colloquial Urdu written in Romanized script. The corpus was collected from volunteer participants who used this language as a mode of communication on the Internet and in text messaging. The raw corpus is passed through a series of steps such as Preprocessing, Tokenization and Annotation before it is passed to the computationally intensive subsequent steps. Edit Distance and K-means Clustering techniques are used for identification of candidate tokens and their potential selection/inclusion in the refined lexicon. We have also identified the most commonly used tokens, candidate tokens and other lingual attributes from the data collected. Based on this analysis, we have proposed a computational model for refined colloquial Romanized Urdu lexicon development. © 2020 IEEE.
format	Conference or Workshop Item
author	Baseer, F. Jaafar, J. Aziz, I.B.A. Habib, A.
spellingShingle	Baseer, F. Jaafar, J. Aziz, I.B.A. Habib, A. Refined Urdu Lexicon Development K-Means Clustering Based Computational Model Using Colloquial Romanized Urdu Dataset
author_facet	Baseer, F. Jaafar, J. Aziz, I.B.A. Habib, A.
author_sort	Baseer, F.
title	Refined Urdu Lexicon Development K-Means Clustering Based Computational Model Using Colloquial Romanized Urdu Dataset
title_sort	refined urdu lexicon development k-means clustering based computational model using colloquial romanized urdu dataset
publisher	Institute of Electrical and Electronics Engineers Inc.
publishDate	2020
url	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85097536620&doi=10.1109%2fICCI51257.2020.9247814&partnerID=40&md5=1b1f615b9f333e079497762ef059e259 http://eprints.utp.edu.my/29859/
_version_	1738657025565392896
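
The steps named in the description above (tokenization, edit distance, K-means clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not say which features the K-means step clusters on, so the sketch assumes token frequency as a one-dimensional feature, and it assumes that a low-frequency token within edit distance 1 of a high-frequency token is a spelling variant of it. All function names, the threshold, and the sample messages are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def kmeans_1d(values, k=2, max_iter=100):
    """Lloyd's algorithm on scalars; returns (labels, centroids)."""
    ordered = sorted(values)
    # Deterministic start: k evenly spaced order statistics (assumes k >= 2).
    centroids = [float(ordered[i * (len(ordered) - 1) // (k - 1)])
                 for i in range(k)]
    labels = []
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                      for v in values]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels, centroids


def build_lexicon(messages, max_distance=1):
    """Map each high-frequency token to its low-frequency near-variants.

    Assumption: frequency is the clustering feature, and max_distance=1
    is an illustrative threshold, not a value taken from the paper.
    """
    counts = Counter(t for m in messages for t in tokenize(m))
    tokens = list(counts)
    labels, centroids = kmeans_1d([counts[t] for t in tokens], k=2)
    frequent = max(range(2), key=lambda c: centroids[c])
    heads = [t for t, lab in zip(tokens, labels) if lab == frequent]
    lexicon = {h: [] for h in heads}
    for t, lab in zip(tokens, labels):
        if lab == frequent:
            continue
        near = [h for h in heads if edit_distance(t, h) <= max_distance]
        if near:  # attach the candidate token to its closest head
            lexicon[min(near, key=lambda h: edit_distance(t, h))].append(t)
    return lexicon


# Hypothetical Romanized Urdu messages, for illustration only.
SAMPLE_MESSAGES = ["kya haal hai", "kya kar rahe ho", "kya hai",
                   "kia haal hai", "hai kya"]

if __name__ == "__main__":
    print(build_lexicon(SAMPLE_MESSAGES))
```

On the sample messages, "kya" and "hai" fall into the high-frequency cluster and the variant spelling "kia" is attached to "kya". A real corpus would likely need a richer feature set and a tuned distance threshold.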

Similar Items