Urdu ranks among the most widely used languages for communication in South Asia. Despite its large following, it lacks computational support, which is why it is written in Romanized Urdu script. Although a large amount of Romanized Urdu data is available online, the language still lacks a refined corpus. In this research, we propose a refined Romanized Urdu corpus built from the tokens with the highest frequency of occurrence in a data set collected from volunteer participants who used the language interactively as a mode of communication. The raw corpus passes through a series of steps, namely preprocessing, tokenization, and annotation, before it is handed to the computationally intensive subsequent steps. "Edit Distance" and "K-means Clustering" techniques are used to identify candidate tokens and to select them for potential inclusion in the refined lexicon. We also identify the most commonly used tokens, candidate tokens, and other linguistic attributes in the collected data. Based on this analysis, we propose a computational model for developing a refined colloquial Romanized Urdu lexicon.
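
To make the role of edit distance in candidate-token identification concrete, the following is a minimal sketch and not the model proposed in this work. It assumes that spelling variants of one word lie within a small Levenshtein distance of the most frequent spelling, and it uses a simple greedy grouping rather than K-means clustering. The function names, the `max_distance` threshold, and the sample tokens are hypothetical and chosen only for illustration.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def group_variants(tokens, max_distance=1):
    """Greedy grouping (an illustrative assumption, not K-means):
    visit tokens from most to least frequent, attaching each one to
    the first existing group head within max_distance, or starting a
    new group with the token as its head."""
    groups = {}
    for token, _count in Counter(tokens).most_common():
        for head in groups:
            if edit_distance(token, head) <= max_distance:
                groups[head].append(token)
                break
        else:
            groups[token] = [token]
    return groups


# Hypothetical sample of Romanized Urdu tokens with spelling variants.
sample = ["nahi", "nahi", "nahi", "nahin", "nhi", "kya", "kya", "kia"]
print(group_variants(sample))
```

In this sketch the most frequent spelling of each group acts as the candidate for the refined lexicon, and the remaining members are treated as its variants. A fixed distance threshold is a crude choice, since short tokens with different meanings can also differ by a single character.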