Monday, September 30, 2002
Unicode - Changing the way we use Indian Languages on Computer
By kshemankar @ 12:00 PM

- By Kshemankar Bhore, AJS Industries, Nashik, India

About the Author

Kshemankar Bhore is Director, Kshem Software, Nashik, India and manages the Information Technology division at AJS Industries Nashik, India. AJS Industries is the authorized distributor for Deshweb.com Pvt Ltd’s Aksharamala Range of Products.

Correspondence:
1, Swastik Niwas, Adj Arihant Hospital, Near Nirmala Convent,
Gangapur Road, Nashik, MS, India PIN: 422 013.
Tel: +91-253-2570743
Email: ajs@kshemsoftware.com

Legal Notices

The information presented in this article represents the current views of Kshemankar Bhore on the issues discussed as of the date of publication of this document. Kshemankar Bhore cannot guarantee the accuracy of any information after the publication date of the article. Product and company names mentioned here may be trademarks of their respective owners.

Introduction

Indian Languages have been used on computers for over 15 years now. CDAC introduced Indian Languages on computers with popular products like GIST, iLeap and ISM Office. Soon many other companies entered the market with their own products. Today, there are over fifty companies offering Indian Language solutions for your computer. Despite so many solution providers and products, the usage of Indian Languages on computers has been mostly limited to DTP and intra-office communication.

Existing Technology and its limitations

Before the emergence of Unicode, almost all the products were based on Indian language fonts with ASCII or hacked encoding. ASCII is an encoding standard for English and defines 128 characters, numbered 0 to 127; the 8-bit fonts used on PCs extend this to at most 256 values, numbered 0 to 255. The Indian Language (henceforth referred to as Indic) solutions mapped the graphics (images of characters) of Indian Languages (like Marathi or Tamil) to these numbers. Thus we had at most 256 graphics for each Indian Language, which is far fewer than the characters and conjuncts these languages need. Also, there was no standardization in India as to which character is mapped to which number.

All of this has some serious disadvantages, which are listed below.

Problems with the representation of Complex Conjuncts

The limited number of character mapping points forced compromises in the representation of conjuncts. Complex conjuncts, such as those found in Sanskrit, could not be represented properly.

Problems with exchange of data

In the absence of standardization in mapping Indic characters to the ASCII range, each vendor mapped them differently. This created a barrier to the exchange of Indic data from one user to another. Anyone who needs to read or edit the Indic data must have the particular font in which the data was written.

Dependency on a particular vendor

Since the data written in one particular font cannot be viewed or edited using a font from a different vendor, you become dependent on the particular software. It is extremely difficult and cumbersome to move to different Indic software keeping your old data usable.

Problems in Using Basic Functionalities on your computer

We have seen that all the solutions just mapped graphics of Indic characters to the ASCII range. Since the ASCII range is for English characters, the Operating System understands your Indic text as English data only. For example:

Text with Shivaji Font    Same text with Verdana Font    The OS understands it as
मराठी                     marazI                         marazI
हिंदी                      ihMdI                          ihMdI
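A rough illustration of the same point in Python: text typed for an ASCII-mapped font is stored as ordinary Latin letters. The sample string is the article's own "marazI"; the variable names are only illustrative.

```python
import unicodedata

# Text typed with an ASCII-mapped Indic font is stored as plain Latin letters.
legacy_text = "marazI"

for ch in legacy_text:
    # Every stored character is a Latin letter as far as the system knows.
    print(ch, hex(ord(ch)), unicodedata.name(ch))

# The whole string fits in 7-bit ASCII, so no Indic information is stored at all.
is_plain_ascii = all(ord(ch) < 128 for ch in legacy_text)
print(is_plain_ascii)
```

Only the font makes these letters look like Devanagari on screen; change the font and the Latin letters reappear.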

Due to this we cannot use basic functionalities available in office suites (such as MS Office), like sorting, find and replace, and Indic numbered bullets, for the reasons explained below.

Sorting

If you need to arrange the above two words in ascending Marathi character order, you should get मराठी, हिंदी, because म comes before ह in the Devanagari alphabet. But since the operating system understands the text as English, you get हिंदी, मराठी (stored as ihMdI, marazI). The reason for this incorrect output is that ‘i’ comes before ‘m’ in English character order.
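The sorting problem can be reproduced in a few lines of Python. This is a minimal sketch: it sorts by raw character codes, which is a simplification of real language-aware collation, and the Devanagari spellings of the two sample words are supplied here for illustration.

```python
# The two sample words as a legacy font stores them (Latin letters).
legacy_words = ["marazI", "ihMdI"]

# The same two words as Unicode Devanagari text.
unicode_words = ["मराठी", "हिंदी"]

# Latin order puts 'i' before 'm', so the Hindi word wrongly comes first.
legacy_sorted = sorted(legacy_words)
print(legacy_sorted)

# Devanagari code points follow the alphabet closely enough that the
# letter ma (U+092E) sorts before the letter ha (U+0939).
unicode_sorted = sorted(unicode_words)
print(unicode_sorted)
```

Production software would normally rely on locale-aware collation rather than plain code-point order, but the contrast between the two results is the same.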

Find and Replace

The Find and Replace dialog box can display text only in a predefined English font. So even if you type Indic text there, you will see English characters only. This makes it very difficult to type the exact Indic word you want to find or replace.

Although solutions to the above problems are available from some vendors, they are costly and resource intensive, and implementing them in your software adds to its cost. You also need to rewrite your existing software around these solutions to incorporate the above functionalities.

Problems using Indic contents on the Internet

Indic web pages suffer from many issues, such as incorrect spacing and missing characters, when displayed in some browsers.

Problems with Localization of software

It is certainly not possible to have software which has its menu in Indic Language, or shows and processes day and date in Indic Language, using the ASCII based Indic fonts. The reasons for it are not discussed here as they are too technical.

UNICODE is the solution

With the rapid spread of computers and the Internet across all countries in the world, a need arose for a standard and reliable way to display text in each country's languages. So the leading IT solution providers came together and formed a non-profit organization called the Unicode Consortium. The Unicode Consortium developed a standard called Unicode for the representation of all the written languages of the world.

Features of Unicode

As mentioned earlier, Unicode is a universal standard for the representation of text on computers. It provides a consistent way of displaying multilingual text. Unicode currently defines over 65,000 characters, and its code space (U+0000 to U+10FFFF) has room for over a million in total.
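The "over a million" figure can be checked with a little arithmetic. The sketch below assumes the standard Unicode code space of 17 planes of 65,536 code points each; the constant names are only illustrative.

```python
import sys

# Unicode code points run from U+0000 to U+10FFFF: 17 planes of 65,536 each.
PLANES = 17
POINTS_PER_PLANE = 0x10000
code_space_size = PLANES * POINTS_PER_PLANE
print(code_space_size)  # 1114112

# Python exposes the highest valid code point as sys.maxunicode.
print(hex(sys.maxunicode))  # 0x10ffff
```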

Working of Unicode

As we have seen earlier, with ASCII-based fonts all the characters, irrespective of their language, are mapped to the same small range of numbers (0 to 255). Unicode changes this by assigning a unique number to each character of every language.

For example, let’s take the letter ka in Devanagari (क) and in Gujarati (ક). In Unicode the two are mapped to different numbers:

Character    Unicode Number
क            U+0915
ક            U+0A95
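These two numbers can be inspected directly in Python using the standard unicodedata module. This is only a small sketch; the variable names are illustrative.

```python
import unicodedata

devanagari_ka = "\u0915"  # the Devanagari letter at U+0915
gujarati_ka = "\u0a95"    # the Gujarati letter at U+0A95

for ch in (devanagari_ka, gujarati_ka):
    # Print the code point in U+XXXX form with its official Unicode name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# The two letters sound alike but are distinct characters.
print(devanagari_ka == gujarati_ka)
```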

This standardization of characters is done after carefully studying the specialties of the particular language. Also Unicode with OpenType Font Technology provides the facility to represent extremely complex conjuncts using grammar rules of the particular language.

For example:

Combination of characters    Actual representation using grammar rules
? + ?? + ? + ?? + ? + ?      ??????

The Devanagari font that comes with the Windows 2000 and Windows XP operating systems contains over 500 glyphs and is capable of displaying complex Sanskrit conjuncts correctly.
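As a small illustration of how a conjunct is stored, the sketch below uses the common Devanagari conjunct formed from ka, the virama (halant) and ssa. The choice of conjunct is an assumption made here for illustration, not an example taken from the article.

```python
import unicodedata

# One conjunct, stored as three code points: KA, VIRAMA (halant), SSA.
conjunct = "\u0915\u094d\u0937"

for ch in conjunct:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Three code points in storage; a suitable OpenType font draws them
# as a single joined shape.
print(len(conjunct))
```

The text itself holds only the simple letters; joining them into the conjunct shape is left to the font and the rendering engine.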

Benefits of Unicode

After understanding the working of Unicode, let us see its benefits against the limitations of ASCII.

Better Portability

As Unicode assigns a unique number to each character of an Indic language, text written in these languages becomes practically font independent. That means the Devanagari letter क is always mapped to the number U+0915, irrespective of the vendor of the font. Thus Devanagari text written on one computer using ‘Unicode Devanagari font A’ is still readable on another computer that has only ‘Unicode Devanagari font B’. This allows you to share documents, send emails or chat with friends in your own language without worrying about whether the recipient has the font in which the text was written.

With Unicode, the email I received is readable in Marathi, but in a different Unicode based font present on my computer. This shows how easy it is to exchange data in Unicode based Indian Languages without depending upon a particular font or software.
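A minimal sketch of why the font no longer matters: what travels between computers is only the encoded code points. UTF-8 is assumed here as the encoding (the article does not name one), and the sample word is supplied for illustration.

```python
# A Devanagari word as Unicode text; no font name is stored anywhere in it.
message = "मराठी"

# What travels in an email or a file is just the encoded code points.
wire_bytes = message.encode("utf-8")
print(wire_bytes.hex(" "))

# The receiver decodes the same code points, whichever Unicode font is installed.
received = wire_bytes.decode("utf-8")
print([f"U+{ord(ch):04X}" for ch in received])
print(received == message)
```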

Thus we eliminate the following limitations of ASCII:

  • Problems with representation of complex conjuncts
  • Problems with exchange of data
  • Dependency on a particular vendor

True Indic Language Support

When we write text in Indic Languages using the Unicode Standard, the computer (the operating system) understands it as text in that language. That means, if we write the letter क in Devanagari, the computer understands it as क only and not as some other character, as happens with ASCII. Moreover, with Unicode the computer can also apply the script's rules for ordering and combining characters. This gives us the following benefits:

Sorting

With Unicode the computer now understands your language. This allows us to sort data in the character order of your language. The screenshots of one such sorting for Devanagari data are given below:

Apart from sorting you can now use the ‘find and replace’ facility to find and replace any Unicode based Indic language text.
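Find and replace on Unicode text needs nothing special in a Unicode-aware language. A short sketch with Python's built-in string methods follows; the sample sentence is made up for illustration.

```python
# A short Marathi phrase stored as Unicode text.
sentence = "मराठी आणि हिंदी"
target = "हिंदी"
replacement = "मराठी"

# Find: the search term is typed and matched in Devanagari itself.
found_at = sentence.find(target)
print(found_at)

# Replace: works the same way as for English text.
replaced = sentence.replace(target, replacement)
print(replaced)
```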

Localized software development

Support for Unicode is now built into the operating system. Newer operating systems like Windows 2000 and Windows XP have locales for many Indic Languages. Due to this, you can now have numbers, dates, currency etc. displayed in your own Indian Language. You can also name files in your own language, so a file can carry a name written entirely in your own script, with only the ‘.doc’ extension in Latin letters.

Unicode allows you to display multilingual text in a single ActiveX control (list box, combo box etc.). This cannot be achieved using ASCII-based Indian Language text.
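Unicode gives each script its own digits, which is part of what makes localized numbers and dates possible. The sketch below is not the Windows locale mechanism; it is a hypothetical helper, to_devanagari_digits, that swaps ASCII digits for the Devanagari digits U+0966 to U+096F.

```python
# Map the ASCII digits 0-9 to the Devanagari digits U+0966..U+096F.
TO_DEVANAGARI_DIGITS = {ord(str(d)): chr(0x0966 + d) for d in range(10)}


def to_devanagari_digits(text: str) -> str:
    """Replace ASCII digits with Devanagari digits; leave everything else alone."""
    return text.translate(TO_DEVANAGARI_DIGITS)


# The article's publication date, written with Devanagari digits.
date_text = to_devanagari_digits("30-09-2002")
print(date_text)

# Python's int() already understands Unicode decimal digits.
year = int(to_devanagari_digits("2002"))
print(year)
```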

(Above screenshots are of a sample application developed in Visual Basic 6.0, by Deshweb.com (P) Ltd.).

This support also enables software developers to create true Indian Language software solutions without using any costly third-party tools. Vendors have already started bringing out software such as dictionaries and encyclopedias in Indian Languages. Microsoft Office XP now has a spell checker for Hindi and Marathi[1]. Microsoft is even slated to bring out a Hindi version of Windows in early 2003.

Tools for Typing in Unicode

Since Unicode contains a large number of characters for every language, you need software called Input Method Editors (IMEs) to type Unicode-based Indic text. Currently, Microsoft and Deshweb.com Pvt Ltd provide such tools.

Tools from Microsoft

With Windows 2000 and Windows XP, Microsoft provides IMEs for typing in Hindi, Marathi, Gujarati etc. The limitation here is that the keyboard layouts follow typewriter layouts and hence are difficult to use for ordinary users who have not learned to type on them. Also, these IMEs are supported only on Windows 2000 and Windows XP.

Aksharamala from Deshweb.com Pvt Ltd

Aksharamala is an intuitive input mechanism for typing text in Indian Languages into software applications. With strong support for Unicode as well as ASCII-based fonts, Aksharamala comes in Professional, Developer and Enterprise editions to address the needs of diverse market segments. The software is hotkey-enabled and allows the user to key in text in Indian languages in compatible Windows-based software applications. Above all, Aksharamala runs on Windows 98, ME, 2000 and XP.

Aksharamala allows the user to:

  • Create multi-lingual documents
  • Send email (Using both online and offline mail software)
  • Chat using MSN Messenger
  • Create web content
  • Develop / localize software
  • Sort / search content
  • Store / Retrieve data in / from compatible databases
  • Develop Workflow processes
  • Convert content from ASCII to Unicode format
  • Use diverse keyboard layouts like Phonetic, DOE, Remington etc.

And much more, all in Indian languages! The languages supported are Assamese, Bengali, Gujarati, Gurumukhi, Hindi, Kannada, Malayalam, Marathi, Sanskrit, Telugu, Tamil and other Indian languages.
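The "Convert content from ASCII to Unicode format" item in the list above can be sketched roughly as follows. This is not how Aksharamala does it. The mapping is an assumption inferred from the article's sample word "ihMdI" and covers only its five keys; a real converter needs the font vendor's full table. The one interesting step is reordering: legacy fonts store the short-i sign before its consonant, while Unicode stores it after.

```python
# Partial, assumed mapping for the sample word "ihMdI" only (not a full font table).
LEGACY_TO_UNICODE = {
    "h": "\u0939",  # DEVANAGARI LETTER HA
    "M": "\u0902",  # DEVANAGARI SIGN ANUSVARA
    "d": "\u0926",  # DEVANAGARI LETTER DA
    "I": "\u0940",  # DEVANAGARI VOWEL SIGN II
}
PRE_BASE_I = "i"          # legacy key for short i, typed before the consonant
SHORT_I_MATRA = "\u093f"  # DEVANAGARI VOWEL SIGN I, stored after the consonant


def legacy_to_unicode(text: str) -> str:
    """Convert legacy-font text to Unicode, moving short i after its consonant."""
    out = []
    pending_i = False
    for ch in text:
        if ch == PRE_BASE_I:
            # Hold the short-i sign until the consonant it belongs to arrives.
            pending_i = True
            continue
        out.append(LEGACY_TO_UNICODE[ch])
        if pending_i:
            out.append(SHORT_I_MATRA)
            pending_i = False
    return "".join(out)


converted = legacy_to_unicode("ihMdI")
print(converted)
print([f"U+{ord(ch):04X}" for ch in converted])
```

Characters outside the tiny table raise a KeyError here, which is deliberate in a sketch: silently passing unknown keys through would hide gaps in the mapping.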

Aksharamala contains a rich suite of products that empowers users and developers alike for the pervasive use of Indian languages on computers. Aksharamala product suite components:

  • For companies, organizations and individuals interested in enabling their websites to accept and process Indian languages, Deshweb offers its IE Companion Bar components software.
  • For developers seeking to build or customize software applications to allow input and processing of Indian language content, Deshweb provides a developer's toolkit as part of the Developer edition or the Enterprise edition (for advanced development needs).
  • For the individual user wanting to send email / chat / create documents / PowerPoint presentations / databases / web content in Indian languages the Aksharamala client software does the needful.

The Aksharamala software and its components are designed to exacting standards and with the intention of sufficiently meeting the diverse needs of users.

The products offered by Deshweb are highly useful to several market segments. To name a few:

  • Government e-governance initiatives can immensely benefit the people they serve if transactions can be carried out in the local Indian languages.
  • Educational institutions can build Indian Language content for easy distribution, testing online and for encouraging the use of the languages themselves.
  • Banks and other institutions can increase their reach, productivity and profitability by using electronic means of communication in Indian languages.
  • Companies can incorporate Indian language use in their software to increase their market reach to include the non-English speaking populace, without re-writing the complete software.
  • The lawyer community can efficiently share documents with their clients.

Conclusion

Unicode has been implemented in software and operating systems over the last couple of years, so you need to update some components and software to use Unicode-based Indian Languages on your computer. One way of doing this is to install the latest version of Internet Explorer (version 6). In fact, I recommend that users always install the new version of Internet Explorer as and when it comes out, as it updates your operating system with the latest components and keeps it compatible with new technologies. Also, the latest versions of many popular software packages now support Unicode, and many more will do so in the near future. Microsoft’s Office XP fully supports all Indian Languages.

Undoubtedly, Unicode has changed the way we use Indian Languages on Computer. We can now look beyond DTP and intra office communication for usage of Indian Languages. It’s high time that we all start using Indian Languages on computer in its true sense.


[1] Proofing tools for Hindi and Marathi are available separately from Microsoft.

Comments
By Anonymous @ Monday, February 13, 2006 8:40 PM
sir i want the unicode for all words and alphabets for english to hindi translation and translitration purpose.
