Punjabi Machine Transliteration
M. G. Abbas Malik
Department of Linguistics
Denis Diderot, University of Paris 7
Paris, France
abbas.malik@gmail.com
Abstract
Machine transliteration transcribes a word
written in one script into the script of another
language with approximate phonetic equivalence.
It is useful for machine translation,
cross-lingual information retrieval, and
multilingual text and speech processing.
Punjabi Machine Transliteration (PMT)
is a special case of machine transliteration:
it converts a word from Shahmukhi
(based on the Arabic script) to Gurmukhi
(derived from Landa, Shardha and Takri,
old scripts of the Indian subcontinent),
the two scripts of Punjabi, irrespective
of the type of word.
The Punjabi Machine Transliteration
System uses transliteration rules (character
mappings and dependency rules) for
transliteration of Shahmukhi words into
Gurmukhi. The PMT system can transliterate
every word written in Shahmukhi.
1 Introduction
Punjabi is the mother tongue of more than 110
million people in Pakistan (66 million) and India (44
million), and of many millions in America, Canada
and Europe. It has been written for centuries in two
mutually incomprehensible scripts, Shahmukhi and
Gurmukhi. Punjabis from Pakistan are
unable to comprehend Punjabi written in Gurmukhi,
and Punjabis from India are unable to
comprehend Punjabi written in Shahmukhi. In
contrast, they have no difficulty understanding
each other's speech. The Punjabi
Machine Transliteration (PMT) system is an
effort to bridge the written communication gap
between the two scripts for the benefit of the millions
of Punjabis around the globe.
Transliteration refers to phonetic translation
across two languages with different writing systems
(Knight & Graehl, 1998), such as Arabic to
English (Nasreen & Leah, 2003). Most prior
work has been done for Machine Translation
(MT) (Knight & Leah, 97; Paola & Sanjeev,
2003; Knight & Stall, 1998) from English to
other major languages of the world such as Arabic
and Chinese, for cross-lingual information retrieval
(Pirkola et al., 2003), for the development
of multilingual resources (Yan et al., 2003; Kang
& Kim, 2000) and for the development of
cross-lingual applications.
PMT is a special kind of machine transliteration.
It converts a Shahmukhi word into a Gurmukhi
word irrespective of the type
of the word. It not only preserves the phonetics
of the transliterated word but, in contrast to usual
transliteration, also preserves its meaning.
The two scripts are discussed and compared.
Based on this comparison and analysis, character
mappings between Shahmukhi and Gurmukhi are
drawn and transliteration rules are discussed.
Finally, the architecture and process of the PMT
system are discussed. The system was applied to
Unicode-encoded Punjabi text specially designed for
testing, and the results were compiled and analyzed.
The PMT system will provide a basis for
Cross-Scriptural Information Retrieval (CSIR) and
Cross-Scriptural Application Development
(CSAD).
2 Punjabi Machine Transliteration
According to Paola (2003), “When writing a foreign
name in one’s native language, one tries to
preserve the way it sounds, i.e. one uses an orthographic
representation which, when read
aloud by the native speaker of the language,
sounds as it would when spoken by a speaker of
the foreign language – a process referred to as
Transliteration”. Usually, transliteration refers
to the phonetic translation of a word of some
specific type (proper nouns, technical terms, etc.)
across languages with different writing systems.
Native speakers may not understand the meaning
of the transliterated word.
PMT is a special type of Machine Transliteration
in which a word is transliterated across two
different writing systems used for the same language.
It is independent of the type of
the word. It preserves both the phonetics and
the meaning of the transliterated word.
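The idea of rule-based transliteration between the two scripts can be
illustrated with a minimal Python sketch. It is not the PMT system itself:
the names CHAR_MAP, ASPIRATED and transliterate are hypothetical, the table
covers only five consonants, and the single dependency rule shown (a
consonant followed by do-chashmi he, U+06BE, yields one aspirated Gurmukhi
consonant) is an assumed example of what such a rule might look like.
Vowels and diacritics are not handled.

```python
# Illustrative sketch only; tables and names are assumptions, not the PMT rules.

# Hypothetical character mappings: Shahmukhi consonant -> Gurmukhi consonant.
CHAR_MAP = {
    "\u0628": "\u0A2C",  # Arabic BEH  -> Gurmukhi BA  (b)
    "\u067E": "\u0A2A",  # Arabic PEH  -> Gurmukhi PA  (p)
    "\u06A9": "\u0A15",  # Arabic KEHEH -> Gurmukhi KA (k)
    "\u0644": "\u0A32",  # Arabic LAM  -> Gurmukhi LA  (l)
    "\u0645": "\u0A2E",  # Arabic MEEM -> Gurmukhi MA  (m)
}

# Hypothetical dependency rule: consonant + do-chashmi he (U+06BE)
# is written as a single aspirated consonant in Gurmukhi.
DO_CHASHMI_HE = "\u06BE"
ASPIRATED = {
    "\u0628": "\u0A2D",  # BEH + HE   -> Gurmukhi BHA (bh)
    "\u067E": "\u0A2B",  # PEH + HE   -> Gurmukhi PHA (ph)
    "\u06A9": "\u0A16",  # KEHEH + HE -> Gurmukhi KHA (kh)
}


def transliterate(word: str) -> str:
    """Apply the dependency rule first, then fall back to plain mappings."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if nxt == DO_CHASHMI_HE and ch in ASPIRATED:
            out.append(ASPIRATED[ch])  # two source characters, one target
            i += 2
        else:
            out.append(CHAR_MAP.get(ch, ch))  # unmapped characters pass through
            i += 1
    return "".join(out)
```

For example, transliterate("\u06A9\u06BE\u0644") consumes KEHEH and
do-chashmi he together and returns "\u0A16\u0A32" (KHA followed by LA).
Checking the context-dependent rule before the one-to-one table is the
design choice that lets a two-character sequence map to a single character.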
3 Scripts of Punjabi
3.1 Shahmukhi
Shahmukhi derives its character set from the
Arabic alphabet. It is a right-to-left script and the
shape assumed by a character in a word is
context-sensitive, i.e. the shape of a character differs
depending on whether the
character is at the beginning, in the middle or at
the end of the word. Normally, it is written in
Nastalique, a highly complex writing system that
is cursive and context-sensitive. A sentence illustrating
Shahmukhi is given below:
X}Z Ìáââ y6– ÌÐâ< 6ڻ– ~@