Abstract—Out-of-vocabulary words are a significant
challenge for cross-language information retrieval. Names of
people constitute a large portion of out-of-vocabulary words, and
several methods exist for matching names that are
written in different languages. Some of these methods convert
names to phonetic codes, such as Soundex; others transliterate
names from one language to another. We propose a technique to
map characters automatically from different languages into
English, without human intervention and without prior
knowledge of the language. This technique can provide a
statistical or phonetic model that can be used later for
comparing names or transliterating names across languages.
The method also generates Soundex codes for the source
language based on English Soundex codes. We implement this
technique for five languages: Arabic, Russian, Urdu, Hindi, and
Persian. Five Soundex tables are provided as the result of this
technique.
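As a rough illustration of how Soundex codes for a source language could be derived from English Soundex codes, the sketch below implements standard American Soundex and then projects its digit table through a character mapping. The mapping `RUSSIAN_TO_ENGLISH` is a small hand-written assumption used only for illustration; it is not the automatically learned mapping the paper describes, and the helper name `derive_soundex_table` is likewise hypothetical.

```python
# English (American) Soundex digit classes.
ENGLISH_SOUNDEX = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        ENGLISH_SOUNDEX[ch] = digit


def soundex(name):
    """Return the 4-character American Soundex code of an English name."""
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = ENGLISH_SOUNDEX.get(letters[0], "")
    for ch in letters[1:]:
        digit = ENGLISH_SOUNDEX.get(ch, "")
        if digit and digit != prev:
            code += digit
        # H and W do not separate letters with the same code; vowels do.
        if ch not in "HW":
            prev = digit
    return (code + "000")[:4]


# ASSUMPTION: a tiny hand-written character mapping for illustration only.
RUSSIAN_TO_ENGLISH = {"Р": "R", "Б": "B", "Т": "T", "М": "M", "Л": "L", "А": "A"}


def derive_soundex_table(char_map):
    """Give each source character the Soundex digit of its English counterpart.

    Characters mapped to uncoded English letters (vowels, H, W, Y) get "".
    """
    return {src: ENGLISH_SOUNDEX.get(eng.upper(), "") for src, eng in char_map.items()}


russian_table = derive_soundex_table(RUSSIAN_TO_ENGLISH)
```

With a table like `russian_table`, names written in the source script could be coded directly and compared against English Soundex codes. How the character mapping itself is obtained is the subject of the paper and is not reproduced here.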
Index Terms—CLIR, data linkage, IR, name matching.
The authors are with the Florida Institute of Technology, Melbourne, FL
32901 USA (e-mail: malshuaili1994@my.fit.edu, mcarvalho@cs.fit.edu).
Cite: Mazin Al-Shuaili and Marco Carvalho, "Character Mapping for Cross-Language," International Journal of Future Computer and Communication, vol. 5, no. 1, pp. 18-22, 2016.