E-mail address categorization based on semantics of surnames

Veluru, S., Rahulamathavan, Y., Viswanath, P., Longley, P. & Rajarajan, M. (2013). E-mail address categorization based on semantics of surnames. Proceedings of the 2013 IEEE Symposium on Computational Intelligence and Data Mining, CIDM 2013 - 2013 IEEE Symposium Series on Computational Intelligence, SSCI 2013, pp. 222-229. doi: 10.1109/CIDM.2013.6597240

Abstract

Surname (family name) analysis is used in geography to understand population origins, migration, identity, social norms and cultural customs, some of which are thought to have evolved over generations. Surnames exhibit good statistical properties that can be used to extract information from a names data set, such as the automatic detection of ethnic or community groups. An e-mail address often contains a surname as a substring, and this containment may be full or partial. The objective of this paper is to categorize e-mail addresses based on the semantics of surnames. This is achieved in two phases. The first phase deals with surname representation and clustering: a vector space model is proposed on which latent semantic analysis is performed, and clustering is done using the average-linkage method. In the second phase, an e-mail address is categorized as belonging to one of the categories discovered in the first phase. This requires substring matching, which is done efficiently using a suffix tree data structure. We perform an experimental evaluation on the 500 most frequently occurring surnames in India and the United Kingdom, and we categorize the e-mail addresses that have these surnames as substrings.
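
The second phase described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the first phase has already produced a mapping from surnames to cluster labels (the labels and the `surname_to_cluster` name here are hypothetical), it uses a naive uncompressed suffix trie in place of a true suffix tree, it handles only full containment of a surname, and it resolves multiple matches by taking the longest surname, which is an assumption rather than a rule stated in the paper.

```python
from typing import Dict, Optional


def build_suffix_trie(text: str) -> dict:
    """Build a naive suffix trie: every suffix of `text` becomes a root path.

    This takes O(n^2) time and space; a real suffix tree compresses the
    paths and can be built in linear time.
    """
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie: dict, pattern: str) -> bool:
    """Return True if `pattern` is a substring of the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def categorize_email(
    address: str, surname_to_cluster: Dict[str, str]
) -> Optional[str]:
    """Return the cluster of the longest surname found in the local part.

    Only the part before '@' is searched. Returns None when no known
    surname occurs as a full substring.
    """
    local_part = address.split("@", 1)[0].lower()
    trie = build_suffix_trie(local_part)
    matches = [s for s in surname_to_cluster if contains(trie, s.lower())]
    if not matches:
        return None
    # Assumption: prefer the longest matching surname.
    return surname_to_cluster[max(matches, key=len)]


# Hypothetical output of the clustering phase.
surname_to_cluster = {"patel": "cluster_A", "smith": "cluster_B", "pate": "cluster_B"}

print(categorize_email("r.patel82@example.com", surname_to_cluster))
print(categorize_email("info@example.com", surname_to_cluster))
```

Indexing the e-mail address once and then querying each surname keeps every lookup proportional to the surname's length, which is the property that motivates a suffix-based index for this kind of substring matching.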

Item Type: Article
Additional Information: © 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.
Uncontrolled Keywords: Vector space model, latent semantic analysis, surnames, average link clustering method, suffix tree
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Z Bibliography. Library Science. Information Resources > Z665 Library Science. Information Science
Divisions: School of Engineering & Mathematical Sciences
URI: http://openaccess.city.ac.uk/id/eprint/2913
