Romanized urdu Corpus development (RUCD) model: Edit-distance based most frequent unique unigram extraction approach using real-time interactive dataset

Faisal Baseer; Asad Habib; Jawad Ashraf

doi:10.1109/INTECH.2016.7845117

Source

2016 Sixth International Conference on Innovative Computing Technology (INTECH), pp. 513-518

Abstract

Urdu ranks very high among the languages used for communication in Southern Asia. Despite its large following, it clearly lacks computational support, which is why it is written in Romanized Urdu script. Although a lot of Romanized Urdu data is available online, a refined corpus is still lacking. In our research, we propose a refined Romanized Urdu corpus built from the tokens with the highest frequency of occurrence in a data set collected from volunteer participants who used the language interactively as a mode of communication. The raw corpus is passed through a series of steps such as Preprocessing, Tokenization and Annotation before it is passed to the computationally intensive subsequent steps. "Edit Distance" and "K-means Clustering" techniques are used to identify candidate tokens and to decide on their selection/inclusion in the refined lexicon. We have also identified the most commonly used tokens, candidate tokens and other lingual attributes from the collected data. Based on this analysis, we propose a computational model for the development of a refined colloquial Romanized Urdu lexicon.

Identifiers

Book e-ISBN: 978-1-5090-2000-3
DOI: 10.1109/INTECH.2016.7845117

Authors

Baseer, Faisal

Institute of Information Technology (IIT), Kohat University of Science and Technology (KUST) Kohat, Pakistan

Habib, Asad

Institute of Information Technology (IIT), Kohat University of Science and Technology (KUST) Kohat, Pakistan

Ashraf, Jawad

Institute of Information Technology (IIT), Kohat University of Science and Technology (KUST) Kohat, Pakistan

Keywords

Sun; Internet; Electronic mail; Data mining; Feeds; Postal services; Market research; Natural Language Engineering; Urdu Corpus Development; Colloquial Urdu Corpus; Romanized Urdu Corpus; Computational Lexeme Extraction

Additional information

Data set: ieee

Publisher

IEEE
