Abstract: This paper gives an overview and an 
evaluation of Web pages of Asian languages on the 
Web, in particular of those languages that have not been 
focused on so far. The authors have collected over 100 
million Asian Web pages downloaded from 42 Asian 
country domains, identified the languages based on N-
gram statistics and analyzed their language properties. 
Primarily the number of pages written in each language 
measures the presence of a language. The survey reveals 
that the digital language divide exists at a serious 
level in the region. The state of multilingualism and 
the dominating presence of cross-border languages, 
English in particular, are analyzed. The paper sheds 
light on script and encoding issues of Asian language 
texts on the Web. In order to promote language resource 
collection and sharing, authors have a vision of creating 
an observation-collection instrument for Asian language 
resources on the Web. The results of the survey show 
the feasibility of this vision, and provide us with a better 
idea of the steps needed to realize that vision.

Keywords: Asian languages, Data Mining, Web 
Statistics, Language Identification, Standards, 
Multilingualism, Encoding, Web as Corpus, Digital 
Language Divide.

INTRODUCTION 

Since the early days of Web development, various 
attempts have been made to grasp the language 
distribution of the Web. An estimate of language 
distribution in terms of Internet users’ languages has 
been regularly reported by a marketing research group 
[1]. Estimates of the distribution of the Web documents 
are compiled by various groups, each with a different 
scope and focus. The work of Alis Technologies and the 
Internet Society [2] is among the earliest. Network and 
Development  Foundation  (FUNREDES)  compiles  a
regular report focused on the Romance language group 
[3], and Online Computer Library Center’s (OCLC) 
Web Characterization Project [4] covers large number 
of European languages. Most of these surveys have 

An Analysis of Asian Language Web Pages 

S. T. Nandasara1*, Shigeaki Kodama2, Chew Yew Choong3, Rizza Caminero4, Ahmed 
Tarcan5, Hammam Riza6, Robin Lee Nagano7, Yoshiki Mikami8
1 University of Colombo School of Computing, Colombo, Sri Lanka
2, 3, 4, 8 Nagaoka University of Technology, Nagaoka, Niigata, Japan
5 Dicle University, Diyarbakir, 21280, Turkey
6 IPTEKnet, BPPT, Indonesia
7 Miskolc University, Miskolc, Hungary
stn@ucsc.cmb.ac.lk, kodamas@kjs.nagaokaut.ac.jp, yewchoong@hotmail.com, rhyze1018@yahoo.com, 
tarcan@dicle.edu.tr, hammam@iptek.net.id, nagano.robin@chello.hu, mikami@kjs.nagaokaut.ac.jp

Revised: 24 September 2008; Accepted: 24 July 2008

evolved along with the multilingual search engines like 
Inktomi, Yahoo, Google, Alltheweb, etc. The language-
specific search capability of the search engines has 
provided a means of surveying for researchers. Although 
these surveys have given us fairly good pictures about 
European language presence on the Web, far less 
attention has been paid to Asian languages, among them 
“less computerized languages” in particular. 

This ignorance may arise partly from the technical 
difficulties of language identification of Asian languages 
and partly from “commercial value” of Asian languages 
that has been low. With the exceptions of Chinese, 
Japanese, Korean, Thai, Malay, Turkish, Arabic, and 
Hebrew, nothing is known about the extent of the 
presence of Asian languages on the Web. We felt a strong 
need to implement an independent survey instrument 
to observe the activity level of those languages. The 
UNESCO report, presented to the Tunis phase of the 
World Summit on the Information Society, “Measuring 
Language Diversity on the Internet” [5], shares exactly 
the same concerns as we do.

In response to this, the Language Observatory 
(LO) project was launched in 2003 under the sponsorship 
of the Japan Science and Technology Agency (JST) and 
has been implemented in collaboration with several 
international partners who have common interests with 
us [6]. After a few years of development work, the LO 
team has trained our own language identification engine 
to cover more than three hundred languages of the world, 
and has acquired the capability to collect terabyte size 
Web documents from the Internet. The paper is based on 
the preliminary survey results of this project. 

In addition, we have begun a sister project, 
the Asian Language Resource Network project by the 
sponsorship of Japanese Ministry of Education, Culture, 
Sports, Science and Technology (MEXT) from 2005. 
We  find  a  synergy   between  those  two  projects:  the 
observation instrument for Asian languages can work 
as a language resource collection instrument as well. 
We have a vision of integrating the two projects as an 
observation-collection instrument for Asian language 
resources on the Web.

The International Journal on Advances in ICT for Emerging Regions 2008 01 (01) : 12 - 23

* corresponding author


OBJECTIVES 

The objectives of this paper are firstly to give an overview 
for Asian languages on the Web, in particular for those 
languages that have been ignored up to now. Through 
this study, we have tried to spotlight the presence of 
Asian languages. Here the presence of a language is 
measured primarily by the number of pages written 
in each language and is supplemented by additional 
indicators like pages-per-capita to give an indication 
of the relative intensity of Web authorship. In terms of 
language coverage, we discovered 55 Asian languages. 
Chinese, Japanese and Korean are excluded from the 
analysis because the presence of these languages can be 
relatively easily measured by using existing commercial 
search engines. 

Secondly, the paper tries to describe the state of 
multilingualism in Asian country domains. The state of 
multilingualism can be defined at various levels, from 
a personal or document level to a societal level. In this 
study, we show a multiple language presence in each 
country domain. To give an overview of cross-border 
languages is a part of these efforts.

Thirdly, the paper tries to shed light on script 
and encoding issues of Asian languages. The paper 
tries to answer questions like; to what extent is UCS/
Unicode employed for Asian languages? What scripts 
are actually used to represent a specific language? To 
what extent are locally developed encodings used? Most 
European languages are written in only one script, Latin, 
Cyrillic or Greek. Some Asian languages, however, are 
written in a variety of scripts. This is most notable in 
the central Asian region, where the same language can 
be presented in Cyrillic, Arabic and Latin. In addition, 
each script is presented in various encoding schemes. 
While UCS/Unicode is expected to play a pivotal role 
in the promotion of multilingual document processing 
on the Web, its actual implementation on the Web seems 
still very much limited. Instead, various local legacy 
encoding schemes are employed. In order to promote 
language resource collection and development, due 
attention should be paid to script and encoding variety 
issues, particularly in this case, where it leads to a chaotic 
situation of encoding as observed in Asian language 
documents on the Web.

And lastly, the section three discusses the datathe section three discusses the data 
collection using UbiCrawler, language identification 
process including creating training data sets and 
analytical methodologies. The Asian language presence 
on the Web is discussed in “Asian Language Presence 
on the Web” section. The state of multilingualism and 
the presence of cross-border languages are discussed 
in “Multilingualism in the Asian Web” section  and 
script and encoding issues are discussed in “ Script and 
Encoding Issues” section . Finally discussions on future 
research areas and conclusion are given in “Descussion” 
and “Conclusion”  sections respectively.

METHODOLOGY 

Web Pages Collected 

We use a Web crawler that works by downloading Web 
pages from the Internet. While downloading, it traces 
links within pages and recursively crawls to gather those 
newly discovered pages. The collection of downloaded 
Web pages is then passed to the language identification 
engine and the language properties of the pages are 
identified.[7].

The latest Asia crawl (excluding China, Japan and 
Korea) focused on Web pages in 42 country domains in 
Asia. The crawl was begun from a seed file containing 
13,286 URLs . The list of ccTLDs  (country code TopTopop 
Level Domains) contains ae, af, a�, bd, bh, bn, bt, cy,evel Domains) contains ae, af, a�, bd, bh, bn, bt, cy,Domains) contains ae, af, a�, bd, bh, bn, bt, cy,omains) contains ae, af, a�, bd, bh, bn, bt, cy,s) contains ae, af, a�, bd, bh, bn, bt, cy,) contains ae, af, a�, bd, bh, bn, bt, cy, 
id, il, in, iq, ir, jo, kg, kh, kw, kz, la, lb, lk, mm, mn, mv, 
my, np, om, ph, pk, ps, qa, sa, sg, sy, th, tj, tm, tp, tr, uz, 
vn and ye. Web pages outside of these ccTLDs were not 
crawled. The crawl was performed using a decentralized, 
parallel crawler called UbiCrawler [8][9]. The crawler is 
configured to stop tracing further links at a depth of 8 
and to download a maximum of 50,000 pages per site. 
The crawler waits 30 seconds for http header responds 
before giving up.

The Asia crawl started from 5th July 2006 at 
11:00hrs and ended on 19th July 2006 at 19:03hrs 
without any problem. We downloaded 107,141,679 Web 
pages in total, 652,710,237,381 bytes in si�e. 

UbiCrawler supports the Robot ExclusionbiCrawler supports the Robot Exclusion 
standard and we fully respect it at all Web sites. The 
crawler is configured to check and analy�e Robots.txt 
on every new Web site. If a Web site indicates Web 
robots are not welcome, our crawler will not download 
that Web site. The latest Asia crawl discovered 45,348 
Robots.txt files.

Further, Web sites and their contents change Web sites and their contents change 
over time. Most search engines have accumulated their 
databases to have longer (in time) coverage. This means 
that in the database, there might be many obsolete Web 
sites and pages. Because the pages downloaded during 
one short period of time in our study accurately reflect 
the “current” status of Web sites.

Lastly, while search engines generally cache all 
types of files, we only crawl for html and text files, both 
static and dynamic. Although there are many documents  
available in PDF format, we excluded PDF files because 
of technical difficulties in handling PDF for Language 
Identification Module (LIM). 

Language Identification Process 

The Language Identification Module (LIM) developed 
for the Language Observatory Project (LOP) [10] can 
simultaneously detect the triplet of Language, Script 
and Encoding (LSE) scheme (LSE is used below for 
this triplet) for each document. The identification is 
based on the n-gram statistics of documents. A natural 
language model, which assumes that the probability of 
the  next  word  depends  on  the  previous   few   words  

13 An Analysis of Asian Language Web Pages

The International Journal on Advances in ICT for Emerging Regions 01 (01) October 2008


is generally known as an N-gram model, and a series 
of N characters (or N bytes) can be referred to as an “N 
– gram” as well. The advantages of the n-gram approach 
are that it does not require a special dictionary or word 
frequency list for each language, and it can detect 
encoding scheme.

LIM consists of two components. First, the 
training component accumulates sets of shift-codons 
from the training data. The term “shift-codon” is derived 
from the genetic term “codon”, a sequence of three 
nucleotides. Shift-codons are, as the naming implies, 
three byte strings extracted from the first position, the 
second position, (n-2)-th position of a training data 
(n is the length of the training data). The set of shift-
codons thus created are stored with the LSE tags into 
the reference database. The source of training data is 
translations of the Universal Declaration of Human 
Rights  (UDHR) provided by the United Nation’s Office 
of Higher Commissioner for Human Rights.

The second component, the identification 
component, produces shift-codons of the target data and 
then compares them with all sets of shift-codons stored in 
the reference database. After comparison, the component 
calculates the matching ratio of the shift-codons of the 
target text to those of the training text (the number of 
matched codons of the target document divided by the 
total number of codons). Then the component returns the 
LSE that shows the highest matching ratio as a result. 
The component returns “Below Threshold” when the 
highest matching ratio is below a given threshold, and 
returns “No Match” when  no single codon of the target 
document matches with those of stored reference data. 
The component returns “Short” or “Empty” when the 
byte length of the target document is not enough to be 
identified or no content is found on the target document 
after removing HTML tags.

There are two data sets used in the language 
identifier (LI). First, it is the data set that we use to 
train the LI; we called it the Training corpus (TC). The 
second data set is the Validation corpus (VC), which 
contains 500 multilingual Web pages that manually 
checked by users to confirm their actual language, script 
and encoding (LSE). The purpose of this corpus is to 
ensure the accuracy of the LI. Since we already know 
the correct LSE of the VC, every time we made changes 
to the LI, we can perform an experiment against the VC 
and find out how the changes affect the accuracy rate.

The language identification engine LI has been 
trained in more than 200 languages of the world (345 in 
terms of LSEs) at the time of this survey. Among them, 
62 languages are spoken in Asia and total of 98 different 
encodings for Asian language scripts have been trained. 
Missing Asian languages from the UDHR listing are 
Zhuang, Yi, Hmong (including its various dialects), 
Shan, Karen, Oriya, Divehi, D�ongkha (Bhutanese), 
etc. 

Languages  selected here are official or nationally 
recognized  languages in  respective Asian countries. 
Training   data   sets  are   based  on   the   Universal 
Declaration  of  Human  Rights  document,   which  has 

been converted into each language and into commonly 
used encoding schemes including UTF-8. Table 1 is 
the complete list of the Asian languages targeted in 
this survey, classified by language family. Additional 
information for the languages is also listed, vi�: the 
script(s) for the language and the encodings we trained 
LIM over.

ASIAN LANGUAGE PRESENCE ON THE WEB 

Introduction to Asian Languages 

We can list several language families on the Asian 
continent: Austroasiatic, Austronesian, Dravidian, Indo-
Iranian, Mongolian, Semitic, Sino-Tibetan, Thai-Kadai, 
Turkic, and Tungus. Some of these language families are 
not firmly established and could be regrouped into the 
larger language groups or could be divided into smaller 
sub-groups. For example, the Turkic, Mongolian, and 
Tungus language families can be regrouped into larger 
language family Altaic, and the Indo-Iranian language 
family can be divided into the Indo-Aryan, Iranian, 
and Kafiri. There are some isolated languages around 
the Asian continent, e.g. Korean, Japanese, Ainu, and 
Burushaski. Some European languages – English, 
Russian, French, and Portuguese – are also used in the 
region as official languages, and from the mixture of 
an indigenous language and an introduced language, 
pidgins or creoles have emerged.

Among those language families, Sino-Tibetan 
has the largest number of speakers, estimated at 1.2 
billion. Next comes Indo-Iranian, with at least 700 
million speakers in India, and more than 200 million 
people in Pakistan, Bangladesh, Iran and other 
South and MiddleEast Asian countries. Malay in the 
Austronesian language family has around 250 million 
speakers in Indonesia, Malaysia, Brunei, Singapore, the 
southern Philippines, and Thailand. Tamil, a Dravidian 
family,  has about 200 million speakers in India. Semitic 
includes a language of many speakers, that is, Arabic, 
the number of which is estimated to be about 200 
million. Other language families have a relatively small 
number of speakers. Among the isolated languages, 
Japanese has the largest number of speakers with about 
125 million and Korean follows with about 75 million. 
When we describe the Asian languages, we cannot avoid 
mentioning the diversity of scripts they use. Contrasted 
with Western Europe, the diversity is outstanding. In 
Southeast and South Asian countries, many scripts that 
come from the Brahmi script are used, and in the East 
and Near East Asian countries, Han�i script and some 
other indigenous scripts are used. Latin, Arabic and 
Cyrillic script are also used with some additional letters 
and diacritical marks.

Web Presence by Country 

In Figure 1, the colouring of map is based on the number 
of  Web  pages  per  1000  population,  as  this  is  the 
reflection of the degree of presence of a country  on  the

S.T. Nandasara, Shigeaki Kodama, Chew Yew Choong, Razza Caminero 14
Ahmed Tarcan, Hammam Riza, Robin Lee Nagano, Yoshiki Mikami

October 2008  The International Journal on Advances in ICT for Emerging Regions 01 (01) 


Table 1: List of Language/Script/Encoding[1] trained, grouped by language family

   [Austronesian]    [Indo-Iranian]    [Dravidian]
Achehnese/Latin/Latin1 Assamese/Bengali/UTF-8 Kannada/Kannada/UTF-8
Balinese/Latin/Latin1 Balochi/Arabic/UTF-8 Tamil/Tamil/UTF-8
Bikol/Bicolano/Latin/Latin1 Bengali/Bengali/UTF-8 Tamil/Tamil/Vikata
Bugisnese/Latin/Latin1 Bhojpuri/Devanagari/Agra Tamil/Tamil/Shree
Cebuano/Latin/Latin1 Dari/Arabic/UTF-8 Tamil/Tamil/Kumudam
Filipino/Latin/Latin1 Farsi/Persian/Arabic/UTF-8 Tamil/Tamil/Amudham
Hiligaynon/Latin/Latin1 Gujarati/Gujarati/UTF-8 Telugu/Telugu/UTF-8
Indonesian/Latin/Latin1 Hindi/Devanagari/UTF-8 Telugu/Telugu/TLW
Javanese/Latin/Latin1 Hindi/Devanagari/Naidunia Telugu/Telugu/Shree
Kapampangan/Latin/Latin1 Hindi/Devanagari/Arjun
Iloko/Latin/Latin1 Hindi/Devanagari/Shusha    [Semitic]
Madurese/Latin/Latin1 Hindi/Devanagari/Shivaji Arabic/Arabic/UTF-8
Malay/Latin/Latin1 Hindi/Devanagari/Sanskrit Arabic/Arabic/Arabic
Minangkabau/Latin/Latin1 Hindi/Devanagari/Kiran Hebrew/Hebrew/UTF-8
Sundanese/Latin/Latin1 Hindi/Devanagari/Hungama Hebrew/Hebrew/Hebrew
Tetun/Latin/Latin1 Hindi/Devanagari/Shree
Waray/Latin/Latin1 Hindi/Devanagari/KrutiDev    [Turcic]

Kashimiri/Devanagari/UTF-8 Abkha�/Latin/UTF-8
   [Austro-Asiatic] Kurdish/Latin/UTF-8 Abkha�/Cyrillic/8859-5

Hmong/Latin/Latin1 Magahi/Devanagari/UTF-8 Abkha�/Cyrillic/Abkh
Khmer/Khmer/UTF-8 Magahi/Devanagari/Agra A�eri /Latin/A�.Times
Vietnamese/Latin/UTF-8 Marathi/Devanagari/KrutiDev A�eri /Cyrillic/A�.Times
Vietnamese/Latin/TCVN Marathi/Devanagari/Shivaji Ka�akh/Cyrillic/8859-5
Vietnamese/Latin/VIQR Marathi/Devanagari/Kiran Ka�akh/Arabic/UTF-8
Vietnamese/Latin/VPS Marathi/Devanagari/Shree Tatar/Latin/Latin1

Nepali/Devanagari/UTF-8 Turkish/Latin/UTF-8
   [Sino-Tibetan] Osetin/Arabic/UTF-8 Turkish/Latin/Turkish

Burmese/Burmese/UTF-8 Osetin/Cyrillic/UTF-8 Uighur/Latin/UTF-8
Chinese/Han�i/GB2312 Pashtu/Arabic/UTF-8 Uighur/Latin/Latin1
Chinese/Han�i/UTF-8 Punjabi/Arabic/UTF-8 U�bek/Latin/Latin1
Hani/Latin/Latin Sanskrit/Devanagari/UTF-8
Tamang/Devanagari/UTF-8 Saraiki /Arabic/UTF-8    [Thai-Kidai]
Tibetan/Tibetan/UTF-8 Sinhala/Sinhala/UTF-8 Lao/Lao/UTF-8

Sinhala/Sinhala/Kaputa Thai/Thai/TIS620
   [Mongolian] Sinhala/Sinhala/Metta Thai/Thai/UTF-8

Mongolian/Cyrillic/UTF-8 Tajiki/Arabic/UTF-8 Zhuang/Latin/Latin1
Mongolian/Cyrillic/8859-5 Urdu/Arabic/UTF-8  

                                  
    [1] Local proprietary encodings are shown in this table by names of font. font. 

15 An Analysis of Asian Language Web Pages

The International Journal on Advances in ICT for Emerging Regions 01 (01) October 2008


Web. This shows that Israel is the highest (3757 pages 
per 1000 population) in the rank and Singapore and 
Cyprus follow, respectively. The population data was 
obtained from the CIA World Factbook (estimates as of 
July 2006).

Figure 1 shows that Ka�akhstan and A�erbaijan 
respectively have the highest Web page size per 1000 
population among Central Asian countries. Figure 1 
also shows that Cambodia, Afghanistan, Pakistan, India, 
Syria, Yemen, Bangladesh, and the last, Myanmar, have 
the least number of pages presence on the Web (between 
5 (4.54%) to 0 (0.35%) pages per 1000 population). It 
is worth noting that Myanmar, the neighboring country 
to Thailand, has the least (0.35%) among all the Asian 
countries. 

The presence on the Web of each Asian country 
is given at ccTLD level in Table 2. In Table 2, ranking 
is based on the percentage of Web presence against the 
total Web pages in the region. This shows that Israel 
(28.88%), Thailand (11.72%) and Turkey (10.61%) with 
a higher number of language presence on the internet at 
ccTLD level. Table 2 was tabulated using the Number of 
Web Pages collected by the crawler engine.

Web Presence by Language 

Fourth column of Table  3  shows  the  total  number  of

Figure 1: Presence of Web Pages by Country in the Asian Region
Table 2: Percentage of Web Pages on the Internet at 

ccTLD Level

ccTLD
% of 
Web 
Pages

ccTLD
% of 
Web 
Pages

ccTLD
% of 
Web 
Pages

il 28.88 ae 0.87 lk 0.13
th 11.72 kg 0.69 bn 0.09
tr 10.61 pk 0.69 ps 0.08

my 6.41 cy 0.59 tm 0.08
kz 6.01 mn 0.37 kh 0.06
sg 5.39 np 0.37 kw 0.06
id 5.36 lb 0.32 qa 0.05
vn 4.19 jo 0.27 sy 0.05
in 3.98 bh 0.23 bt 0.04
ir 3.75 tj 0.22 mv 0.03
ph 2.55 bd 0.19 ye 0.03
uz 2.13 la 0.14 mm 0.02
az 2.10 om 0.14 tp 0.01
sa 0.98 af 0.13 iq 0.00

 
Web pages identified by the survey. The data shown in 
the third column of the table is the speaker population of 
that language with statistics taken from the UDHR 

S.T. Nandasara, Shigeaki Kodama, Chew Yew Choong, Razza Caminero 16
Ahmed Tarcan, Hammam Riza, Robin Lee Nagano, Yoshiki Mikami

October 2008  The International Journal on Advances in ICT for Emerging Regions 01 (01) 


Web site. In principle, all Asian languages listed in first 
column in Table 3 are considered as local languages. The 
ranking is based on the number of pages. Table 3 shows 
that Hebrew, Thai, Turkish, Vietnamese, Arabic, Tatar, 
Farsi, Javanese, Indonesian, Malay, Sudanese, Hindi, 
Dari,   Uzbek   and  Mongolian   have  a   relatively high 
presence on the Web. The highest number is for Hebrew, 
and the second highest for Thai. The fifth column gives 
the number of pages per 1000 speakers of each language. 
An almost identical ranking is observed in both the 
number of pages and the pages per population.

A high degree of “divide” in terms of usage level of  
languages can be observed among the Asian languages. 
The number of Hebrew pages per 1000 speakers is 28 
times higher than that of the Malay language (ranked 
tenth in Table 3), 300 times higher than Pushtu (ranked 
20th), and 3,000 times higher than Gujarati (ranked 
50th). The speakers’ population of languages is said to 
follow Zipf’s Law - the n-th ranked language speaker is 
one n-th of the population of the top ranked language. 
But if we measure the size of a language by the number 
of pages written in the respective language, the relative 
size of the 1st, 10th, 20th and 50th ranked language in 
Table 3 becomes a series of 1, 0.036, 0.0035, 0.0001. 
Our observation suggests that the number of Web pages 
written in each language follows a progressive power 
law curve. The situation evidenced here can be well 
described as a “Digital Language Divide”.

MULTILINGUALISM IN THE ASIAN WEB 

Multilingualism by Country Domain

The most recent version of Ethnologue [11] lists close 
to seven thousand languages around the world. More 
than 2600 of them are spoken in the Asian region. 
This indicates that a large scale linguistic diversity is 
observable in Asia. Among the 2600s’, only around 
51 languages are recognized by Asian governments as 
official or national language(s) of the country and other 
languages have been recognized as  languages for home 
use. 

Through the survey, a rich diversity of written 
pages was found in the country with the richest diversity 
of languages in the region, i.e. Indonesia. It is interesting 
to note that there are a significantly larger number of 
pages in Javanese compared to either Indonesian 
or Malay. The major language found in Indonesia, 
Malaysia, Brunei, Singapore, Southern Thailand and 
Phillipines can be categorized into a single root Malay 
language spoken in different dialects. This surprising 
result shows two things: Javanese has a dominating Web 
presence in Indonesia. The lesser Sundanese, Madurese, 
Achehnese and Buginese languages are found to be of 
great importance to Indonesia’s local language diversity 
on the Internet (see Table 3).

Table 3: Number of Web Pages Collected from Asian ccTLDs, by Language

Language Script Speaker population Total number of pages
No. of pages per 
1000 speakers

Hebrew Hebrew 4,612,000 11,957,314 18.08
Thai Thai 21,000,000 7,752,785 11.72
Turkish Latin 59,000,000 3,959,328 5.99
Vietnamese Latin 66,897,000 2,006,469 3.03
Arabic Arabic 280,000,000 1,671,122 2.53
Tatar Latin 7,000,000 1,575,442 2.38
Farsi Latin 33,000,000 1,293,880 1.96
Javanese Latin 75,000,000 1,267,981 1.92
Indonesian Latin 140,000,000 866,238 1.31
Malay Latin 17,600,000 432,784 0.65
Sundanese Latin 27,000,000 217,298 0.33
Hindi & others Devanagari 182,000,000 119,948 0.18
Dari Arabic 7,000,000 107,963 0.16
Uzbek Latin 18,386,000 57,212 0.09
Mongolian Cyrillic 2,330,000 51,140 0.08
Ka�akh Arabic 8,000,000 48,652 0.07
Madurese Latin 10,000,000 47,246 0.07
Uighur Latin 7,464,000 46,399 0.07
Kashmiri Arabic 4,381,000 41,876 0.06
Pushtu Arabic 9,585,000 41,479 0.06
Balochi Arabic 1,735,000 36,497 0.06

17 An Analysis of Asian Language Web Pages

The International Journal on Advances in ICT for Emerging Regions 01 (01) October 2008


Tibetan Tibetan 1,254,000 1,454 0.00
Cebuano Latin 15,230,000 1,107 0.00
Telugu Telugu 73,000,000 1,072 0.00
Saraiki Arabic 15,020,000 1,036 0.00
Lao Lao 4,000,000 799 0.00
Gujarati Gujarati 44,000,000 765 0.00
Pashto Arabic 9,585,000 259 0.00
Kannada Kannada 33,663,000 164 0.00
Urdu Arabic 54,000,000 70 0.00
Khmer Khmer 7,063,200 65 0.00
Hani Latin 747,000 63 0.00
Asian Languages total (A) 33,838,551 (51.2%)
Other Languages total (B) 32,293,912 (48.8%)
Identified pages total (A + B) 66,132,463 (61.7%)
Unidentified pages total (C) 41,009,216 (38.3%)
Matching ratio below threshold [1] 5,701,765 (5.3%)
Empty pages 273,187 (0.3%)
No matching pages 9,386 (0.0%)
Duplicated pages [2] 35,024,878 (32.7%)
Total downloaded Pages (A + B + C) 107,141,679 (100%)

Language Script Speaker population Total number of pages
No. of pages per 
1000 speakers

Turkmen Latin 5,397,500 32,156 0.05
Minangkabau Latin 6,500,000 20,766 0.03
Bikol Latin 4,000,000 18,509 0.03
Kyrgy� Arabic 2,631,420 15,606 0.02
Balinese Latin 3,800,000 14,584 0.02
Punjabi Arabic 25,700,000 14,544 0.02
Sindhi Arabic 19,675,000 12,945 0.02
Achehnese Latin 3,000,000 11,102 0.02
Sinhala Sinhala 13,218,000 10,770 0.02
Kapampangan Latin 2,000,000 10,094 0.02
Iloko Latin 8,000,000 9,180 0.01
Bengali & Assamese Bengali 196,000,000 8,590 0.01
Filipino Latin 14,850,000 5,511 0.01
Waray Latin 3,000,000 5,426 0.01
Bugisnese Latin 3,500,000 3,533 0.00
Burmese Burmese 31,000,000 3,285 0.00
Kurdish Latin 20,000,000 3,135 0.00
Tajiki Arabic 4,380,000 2,430 0.00
Azeri Cyrillic / Latin 13,869,000 3,767 0.00
Tamil Tamil 62,000,000 2,025 0.00
Hiligaynon Latin 7,000,000 1,935 0.00
Dhivehi Thaana 250,000 1,858 0.00
Bhojpuri Devanagari 25,000,000 1,756 0.00

[1] The threshold is set at 20% in this survey; [2] Almost one-third of the pages were found to be an exact copy of another page. We 
excluded duplicate pages from the language identification process.

S.T. Nandasara, Shigeaki Kodama, Chew Yew Choong, Razza Caminero 18
Ahmed Tarcan, Hammam Riza, Robin Lee Nagano, Yoshiki Mikami

October 2008  The International Journal on Advances in ICT for Emerging Regions 01 (01) 


Figure 2: Cross-border languages presence in Asian countries grouped by region 

GCC stands for the Gulf Coopeation Coucil, which consists of Bahrain, Kuwait, �man, �atar, Saudi Arabia and UAE Kuwait, �man, �atar, Saudi Arabia and UAEKuwait, �man, �atar, Saudi Arabia and UAE

Cross-Border Languages and their Dominance

Another aspect of the multilingualism in the region is 
the overwhelming presence of cross-border languages 
on the Web. Here we define two categories of languages. 
The first category is “local languages”, which are 
officially recogni�ed language(s) and home speakers’ 
languages of the state. The second category is “cross-
border languages”, such as English, French, Russian and 
Arabic, which are used as a language of communication 
among the peoples of different nations.

Arabic can be categorized in two ways. In the 
Middle East region, Arabic is recogni�ed as an official 
language in many countries, but it also functions as an 
important cross-border language. So we treat Arabic in 
two ways depending on the context of analysis; if it is 
an official language, it is counted as a local language 
in Figure 2, and if not, then as an ‘other cross-border 
language’ [12]. Figure 2 shows the relative share of 
these categories of language in each country domain. 
Countries are grouped by sub-region. We found that 
each sub-region shows clear characteristics in terms of 
the weight and the choice of cross-border languages.

In West Asia, two cross-border languages, English 
and Arabic, dominate the Web. Almost 99% of Web 
pages are written in these two cross-border languages, 

except in Cyprus, Iran and Israel. Local languages show 
a majority in several countries, such as in Israel (62.0% 
in Hebrew), Turkey (50.7% in Turkish) and Iran (50.6% 
in Farsi, Dari, Pashtu and Balochi). If we treat Arabic as 
a local language, the share of local languages becomes 
more than half in most countries in the region. A quite 
unique case is Cyprus, where Greek (36.6%) plays a key 
role.

In South Asia, the dominance of English is 
outstanding. Relatively high share of local languages is 
found only in Nepal (22.4% in Hindi or Nepali), India 
(21.7%   in  various   Indian  languages),  the  Maldives 

19 An Analysis of Asian Language Web Pages

The International Journal on Advances in ICT for Emerging Regions 01 (01) October 2008

(8.2% in Divehi) and Sri Lanka (5.1% in Sinhala).
In the Central Asia, there are two cross-border 

languages, English and Russian. As will be discussed 
in section “ The Waro Alphabets in Central Asia”, 
Russian dominates in Ka�akhstan (88.8% in Russian), 
Kyrgy�stan (86.3% in Russian), and U�bekistan (70.4% 
in Russian), while English dominates in Turkmenistan 
(94.4% in English). Tajikistan (45.9% in English and 
44.6% in Russian) has an equal balance of the two 
languages. Local languages show a substantially higher 
share (35.2% Mongolian) only in Mongolia.

In South East Asia, the situation is rather different. 
The share of local languages is far higher than in other 
sub-regions. Among them, local languages have a major 
share in Vietnam (69.8% in Vietnamese), Thailand  


(64.0% in Thai) and Indonesia (58.7% in various local 
languages including Javanese, Indonesia, Sundanese, 
Balinese, etc.). English dominance is also observed 
frequently in this sub-region.

SCRIPT AND ENCODING ISSUES 

Script Diversity in Asia

Asia is especially rich in scripts. The five basic scripts: 
Ideographic, Brahmi, Latin, Arabic and Cyrillic grew 
up in the region, each largely separated by mountains, 
ocean or deserts. In East Asia, the influence of Chinese 
ideographic script (han�i) is remarkable. In South Asia, in 
and around the Indian Subcontinent and in the continental 
part of Southeast Asia, scripts originating from Brahmi-
script are influential. The islands of Southeast Asia and 
Australasia have mostly adopted Latin scripts (some 
islands in the region still use Brahmi-originating scripts 
such as the Balinese script, or aksara Bali). In Central 
Asia, historically languages were written in the Arabic 
script under the influence of the Ottoman Empire but 
later transformed into Cyrillic. Lastly in the western part 
of Asia, Arabic script is widely used not only by Arabic 
speakers but also by non-Arabic speakers.

Table 4: Number of Pages in Domains of Central Asian 
Republics

(a) English, Russian, and Arabic pages by country

Country
English 

(A)
Russian 

(B)
Arabic 

(C)
(A + B + 

C)
Azerbaijan 553,168 534,913 3,081 1,091,162
Ka�akhstan 263,125 2,234,674 106 2,497,905
Kyrgy�stan 42,167 403,080 55 445,302
Tajikistan 48,300 45,178 27 93,505
Turkmenistan 1,398,708 5,922 4,004 1,408,634
Uzbekistan 255,782 922,188 15 1,177,985
Total 2,561,250 4,145,955 7,288 6,714,493

(b) Official Language pages by Script

Language
Latin 
(A)

Cyrillic 
(B)

Arabic (C)
(A + B + 

C)
Azeri 726 2,315 n/a 3,041
Ka�akh n/a 48,522 130 48,652
Kyrgy� n/a 12,680 2,962 15,642
Tajiki n/a n/a 2,430 2,430
Turkmen 32,156 n/a n/a 32,156
Uzbek 57,212 n/a n/a 57,212
Total 90,094 63,517 5,522 159,133

“n/a” means pages are not yet found, but it does not mean non-
existence of pages.

The War of Alphabets in Central Asia

As the Turkic language border extends from Europe 
to China, covering 12 million square kilometres, the 

languages  are  written in several scripts.  In Turkey the 
Latin alphabet has been used since 1928. In Central 
Asian republics, the Cyrillic script has been in use from 
about the same time. In some areas of Afghanistan 
Arabic script is used. It is said that Turkic languages 
such as Uyghur and Ka�akh are written even in Chinese 
script. Now the region is in a transition period. As an 
interesting example of script diversity, let us discuss this 
sub-region.

Since 1990 Turkey has invited thousands of 
students from new republics in the central Asia to Turkish 
universities by offering scholarships. The Turkish 
Education Minister Köksal Toptan, while attending a 
conference of the education ministers of Turkic republics 
and communities in Bishkek in September 1993, said, 
“the most important factor which will secure our unity 
and develop our language is a common alphabet” [13]. 
Azerbaijan, Turkmenistan, Uzbekistan, Tatarstan, and 
Gagauz have an agreement to complete the transition 
into Latin script before 2010.

But in reality, in symposiums and meetings 
between Turkic republics in Central Asia Russian is 
nearly the sole tool of communication between the Turkic 
peoples in the central Asia. “The local languages are 
used exclusively in indigenous film-making, scholarly 
publication, and in local trade and commerce” [13] 
Ka�akhstan and Kyrgy�stan have a significant Russian 
population. This fact increases again the influence of 
Russian language and the Cyrillic script as well. China 
and Iran are the other important actors in this sub-region: 
Kyrgy�stan and Ka�akhstan share borders with China, 
and Iran has an important influence on A�erbaijan and 
Turkmenistan.

Although our survey results can provide only a 
limited picture of this situation, they do make it clear that 
the choice of script used to write local languages seems 
influenced substantially by the script of the dominating 
language in the country (Table 4(a), 4(b)).

The Existence of Multiple Encodings

Indian language Web sites heavily rely on unique 
encodings or proprietary extensions of existing standard 
encodings [14]. One survey had found 24 such local 
encodings for Hindi alone, and 15 for Tamil, 14 for 
Marathi, 10 for Malayalam, and so on. The total number 
of these local encodings reaches well over 50 [15]. The 
existence of multiple local encodings is not specific 
only to Indian languages, but is widely seen in other 
languages which use non-Latin scripts or Latin script 
with significant extensions and/or additional diacritics. 
Vietnamese is a typical example of the latter. 

To  resolve this problem for scripts of the languages 
around the world, the International Organization for 
Standardi�ation (ISO) and the International Electro-
technical Commission (IEC) made efforts to develop 
a single comprehensive universal character set. The 
first version of “The Universal Multiple Octet Coded 
Character Set (UCS)” was published in 1991. Later the 

S.T. Nandasara, Shigeaki Kodama, Chew Yew Choong, Razza Caminero 20
Ahmed Tarcan, Hammam Riza, Robin Lee Nagano, Yoshiki Mikami

October 2008  The International Journal on Advances in ICT for Emerging Regions 01 (01) 


of UTF-8 (96.4%). It is followed by Mongolian (95.5%), 
Hindi and other Devanagari-based languages (78.4%), 
and Sinhala (44.5%). Hebrew (12.3%), Thai (2.7%), 
Burmese (0.7%), and Turkish (0.5%) are relatively or 
extremely low. These estimates of UTF-8 penetration 
should be considered as overestimated, because many 
local encodings are still missing from our training data.

DISCUSSION 

In this section, we will discuss issues from the viewpoint 
of how to reali�e the vision of an integrated observation-
collection instrument for Asian language resources from 
the Web. First we will discuss the availability and quality 
of language resources, and then we will focus on our 
agenda in the technical domains, how to deal with plural 
scripts and encodings and how to create efficient and 
workable solutions for collecting language resources.

Overall Assessment

When measured by the number of pages or by pages-
per-capita, most of the Asian languages are far less 
represented on the Web than European languages. This 
is not a surprising result, but their presence is even more 
limited than expected. Hindi and Bengali, for example, 
with almost four hundred million speakers between 
them – larger than the total population of the European 
Union – have only one hundred thousand or so pages 
on the Web. The degree of difference between them and 
European language representation is in the order of tens 
of thousands or hundreds of thousands. “The digital 
language divide” does definitely exist at a worrisome 
level.

When measured by volume of text, a one million 
document set contains roughly 5 gigabytes of text, 
assuming 5000 bytes as the average page size. But only 
ten Asian languages have above this amount of language 
resources with the remainder being far smaller.

When we evaluate the quality of documents 
as language resources, such factors as the variety of 
content category, language quality, and variety of style 
of documents should be evaluated. At this moment, we 
cannot tell much about these points. But at least one 
point can be mentioned here. It is likely that content 
category, quality and style of languages are biased, at 
least in “smaller” languages. The bias might stem from 
the specialization of usage in a multilingual environment. 
Multilingualism is the norm in most parts of Asia. In a 
multilingual environment, there is often specialization 
in discourse situations. For example, English for the 
occupational domain is an official language for public 
or educational domains and other local languages 
for personal domains. When such specialization is 
apparent, the language contents on the Web also may 
show specialization depending upon the domains of the 
language’s specialization. The outstanding dominance 
of cross-border languages in many country domains 
suggests that the specialization domains left for the local 
languages might be relatively limited.

work of  ISO / IEC and that of the Unicode Consortium 
became integrated and synchronized. The most 
recent version of the Unicode Standard (The Unicode 
Consortium, 2005) assigns a unique identifier to each 
of 97,720 characters (including 70,207 ideographic 
characters defined by national and industry standards of 
China, Japan, Korea, Taiwan, Vietnam and Singapore). 
But it was expected by the Unicode Consortium, that 
encoding based on UCS/Unicode, whose most commonly 
used form is UTF-8 (UTF-Unicode Transformation 
Format), would be used in parallel with the above-
mentioned local encodings.

Taking this plurality into account, we have 
tried to collect training data encoded in these local 
proprietary codes and in UTF-8. As shown in Table 5, 
we have trained our language identification engine LIM 
by using 9 encodings for Hindi, 4 encodings for Tamil 
and 2 encodings for Telugu. Also we have included 3 
encodings for Vietnamese and 2 encodings for Sinhala. 
This is still not sufficient to match the plurality in the 
real world, but we believe that this is the first ever 
attempt to identify actual usage of local encodings in the 
Web space.

As a result, we found that the use of UTF-8 in 
the Asian region is extremely low. Table 5 shows to 
what extent UTF-8 is used in selected languages (for 
other languages, we do not have sufficient training data 
prepared in different encodings). The table shows that 
Vietnamese is found to be the highest in the penetration 

Table 5: The Penetration of UTF-8 Encoding in 
Selected Languages

Language UTF-8 encodeddocuments
Document 
encoded

otherwise

Examples 
of other 

encodings 
found [1]

Vietnamese 1,934,392(96.4%) 72,077(3.6%) TCVN, VIQR, VPS
Mongolian 48,834(95.5%) 2,300 (4.5%) Latin-Cyrillic
Hindi, 
Bhojpuri, 
Magahi, 
Marathi, 
Nepali, 
Sanskrit, 
Tamang

81,800(78.4%) 22,544 (21.6%)

Agra, Arjun, 
Kiran, Kruti, 
Hungama, 
Naidunia, 
Shivaji, Shree, 
Shusha

Sinhala 4,793(44.5%) 5,977(55.5%) Metta, Kaputa

Arabic 400,933(24.0%) 1,270,189 (76.0%) Latin-Arabic

Telugu 178(16.6%) 894(83.4%) Shree, TLH

Tamil 566(14.9%) 3,232 (85.1%)
Amudham, 
Kumudam, 
Shree, Vikatan

Hebrew 1,468,344(12.3%) 10,488,970 (87.7%) Latin-Hebrew

Thai 207,901(2.7%) 7,544,884 (97.3%) TIS 620

Burmese 24(0.7%) 3,261(99.3%) WinResearcher

Turkish 20,591(0.5%) 3,938,737 (99.5%) Latin-Turkish
[1] Local proprietary encodings are shown in this table byLocal proprietary encodings are shown in this table byocal proprietary encodings are shown in this table by 

names of font (families).

21 An Analysis of Asian Language Web Pages

The International Journal on Advances in ICT for Emerging Regions 01 (01) October 2008


How to cope with the Growing Web

The current survey does not cover Web pages placed 
under generic domains like com, org or net. Many local 
language news sites, blog pages and chat-rooms are 
hosted in generic domains, whose size is almost ten times 
larger than the entire country code domains. Therefore, 
considering the growing speed of the Web, the question 
of how to implement an efficient crawler becomes a 
key issue in our vision. The current study consumes 
almost 652 gigabytes of disk storage and consumes 50 
to 80 Mbps bandwidth for almost one week. A simple 
calculation tells us that 65 terabytes of disc storage and 
100 weeks would be needed to collect the entire Web (10 
billion pages is assumed here). It seems impossible for 
most non-commercial entities.

In this context, several studies and attempts have 
been made in the field of language-focused crawling 
[14][16]. One of the assumptions behind the design of 
this approach is that pages written in a specific language 
may have a high likelihood of being linked to pages in 
the same language. We need to verify this assumption. 
A graph analysis to reveal the structure of sub-graphs 
of Web pages written in the same language should be 
tackled.

In the same context, a distributed crawling 
approach coupled with proximity-based allocation 
of tasks has been explored [17]. An advantage of this 
approach is the possibility of combining free-resources 
from any possible participant, and proximity-based 
allocation of tasks can improve the speed of crawling 
by reducing response time from an assigned server to a 
target  server.  The  Language  Observatory is offering a 
server to an experiment to test this approach, designed 
by Thai Computational Linguistics Laboratory (TCL).

CONCLUSION 

A detection technique for natural languages and 
their encoding schemes can also be used as an online 
language, script, and encoding scheme identifier and to 
develop tools such as multilingual search engines.  It 
will be difficult to install the shift-codon trained data 
into a Web browser due to the large amounts of shift-
codon required. However, online detection service 
and crawling for specific language groups could be 
implemented with limited knowledge, since the server 
manages the knowledge.

The survey presented, in spite of its limitations, 
is probably the first comprehensive survey of Asian 
languages on the Web. The results revealed the existence 
of a worrisome level of digital language divide and 
the dominance of cross-border languages in the Asian 
domains. Through the survey, an estimate of the size of 
language resources on the Web is given. Also the extent 
of plurality in scripts and encodings of Asian language 
documents is indicated. It may be premature to confirm 
the feasibility of a “Web as Corpus” scenario for Asian 
languages in a conclusive manner. 

Finally,  the  survey  has  identified  points  to be 

aware of and has given directions that can benefit anybody 
who tries to create a language resource collection.

Acknowledgement 

The study was made possible by the sponsorship of 
the Japan Science and Technology Agency through 
its RISTEX program and by the sponsorship of the 
Japanese Ministry of Education, Culture, Sports, Science 
and Technology (MEXT) through the Asian Language 
Resource Network project. Moreover, the authors 
would like to thank Chea Sok Huor (PAN Locali�ation, 
Cambodia), Valaxay Dalaloy (STEA, Lao), Huynh 
Quyet Thang (Hanoi University of Technology) and 
Eedenebat Chuluun (Mongolian University of Science 
and Technology) for their valuable advice and for 
providing us with training texts.

References 

1. Global Reach, (2006). Global Internet Statistics, 
August 20, 2006, http://global-reach.bi�/globstats/
index.php3

  
2. Alis, (1997, June). Technologies and the Internet 

Society’s survey Web Languages Hit Parade.  http://
alis.isoc.org/palmares.en.html

 
3. FUNREDES report. (2006). �bservatory on the 

linguistic and cultural diversity of the Internet, 
http://funredes.org/LC/english/medidas/sintesis.
htm

 
4. O’Neill E.T., Lavoie B.F., Bennett R. (2003, April). 

Trends in the Evolution of the Public Web 1998 - 
2002, D-Lib Magazine, Volume 9

 
5. Paolillo J., Pimienta D., Prado D. (2005). Measuring 

Linguistic Diversity on the Internet, UNESC� 
Institute for Statistics, Montreal Canada 

 
6. Mikami Y., Zavarsky P., Ro�an M.Z., Su�uki I., 

Takahashi M., Maki T., Ni�an Ayob I. Boldi P., 
Santini M., Vigna S. The Language �bservatory 
Project (L�P), www 2005, Proceedings, Chiba, 
Japan                       

 
7. Caminero R.C., Zavarsky P., Mikami Y. (2006), 

Status of  the African Web. WWW 2006, 
Proceedings, 869-870

 
8. Boldi P., Codenotti B., Santini M., & Vigna S. 

(2002). UbiCrawler: A Scalable Fully Distributed 
eb Crawler. Technical Report, University degli 
Studi di Milano, Departmento di Scienze 
dell’Informazione

 
9. Boldi, P., Codenotti, B., Santini, M., & Vigna, S. 

(2004). UbiCrawler:  A Scalable Fully  Distributed 

S.T. Nandasara, Shigeaki Kodama, Chew Yew Choong, Razza Caminero 22
Ahmed Tarcan, Hammam Riza, Robin Lee Nagano, Yoshiki Mikami

October 2008  The International Journal on Advances in ICT for Emerging Regions 01 (01) 


23 An Analysis of Asian Language Web Pages

The International Journal on Advances in ICT for Emerging Regions 01 (01) October 2008

 Web Crawler, Software: Practice & Experience. 
Vol. 34, No. 8, pp. 711-726

 
10. Su�uki I., Mikami Y., Ohsato A. (2002). A Language  

and Character Set Determination Method Based 
on N-gram Statistics. ACM Transactions on Asian 
Language Information Processing, Vol. 1. No. 3,  
 pp. 270-279

  
11. Ethnologue. (2005). Language of the World, SIL 

International 2005, 15th Edition
 
12. Extra, Guus, and Gorter D, (Eds.). (2001). The Other 

Languages of Europe: Demographic, Sociolinguistic 
and Educational Perspectives Multilingual Matters 
  

13. Bruce P. (1998, May). Turkey and Iran in Former 
Soviet Central Asia and A�erbaijan: The Battle for 

 Influence that Never Happened, Eisenhower 
Institute’s Center for Political and Strategic Studies, 
Volume 2 

 
14. Pingali P., Jagarlamudi J., Varma J. W.  (2006). 

Indian language IR from multiple character 
encodings. WWW 2006, Proceedings, pp. 801-809

15. Rohra A. and Ananda P. (2005). Collecting 
Language  Corpora: Indian Languages, The Second 
Language �bservatory Workshop Proceedings, 
Tokyo University of Foreign Studies, Tokyo  

16.  Somboonviwat K., Tamura T., and Kitsuregawa 
M. (2005). Simulation study of language specific  
Webcrawling, Proceedings of the SW�D’05

 
17. Tongchim S., Srichaivattana P.,Kruengkrai C.,  

Sornlertlamvanich V. and Isahara H., (2006).  
Collaborative Web Crawler over High-speed 
Research Network. Proceedings, KICSS 2006, 
Ayutthaya, Thailand