Unicode
From Caliach Vision V4.0000, the character system known as Unicode is fully supported in both the program and the data. When you upgrade to this version from an older one, your datafiles are converted automatically and cannot be used by an older version thereafter. Take a backup of all datafiles, custom libraries and other site-specific files before attempting to use them in Caliach Vision V4 or later.
What is Unicode?
Unicode is a character representation system that supports the majority of world languages and the display of special characters, including scientific and mathematical symbols.
Unicode provides a mechanism for representing characters or symbols used in many of the languages in the world, as well as in scientific and technical environments. The Unicode standard is maintained by the Unicode Consortium (www.unicode.org), which sets the standards for Unicode and promotes its worldwide use. They define Unicode as: “a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world.” In the context of database computing, Unicode allows the seamless exchange and processing of character data across different platforms, software products and programming environments.
The Unicode Consortium provides information and resources concerning Unicode, including the standard definition and maintenance, character code tables, a locale identifier repository, and lists of Unicode-enabled products. At the time of writing, the last major version update of Unicode was version 6.1, which is capable of representing more than 110,000 characters covering 100 scripts, used in many different languages throughout the world. Many operating systems and software products have adopted Unicode, which is now universally accepted as the standard for character representation. For example, the latest versions of Windows 7 and Mac OS X, as well as all varieties of Linux, offer Unicode support. All web standards, such as the latest versions of HTML and XML, support Unicode, as do the latest versions of Internet Explorer and all Mozilla-based browsers. In addition, most mainstream software packages such as Open Office support Unicode.
The example above shows Thai, English and Arabic characters mixed in a Marketing company name. Of course, the font being used must be capable of representing the characters involved.
Together with the display of multiple scripts in Caliach Vision, the use of Unicode encoding affects the sort order of dynamic data, for example, in lists or reports, as well as the querying and retrieval of data from databases.
How does Unicode work?
Before Unicode, characters were represented in computer memory by a single byte of 8 bits, which allows only 256 different character combinations. This system derives from the early development of computers, when memory was expensive and in short supply and computing was English-dominated. The first 128 values are reserved for special control characters and Latin letters, numbers and punctuation, and are known as ASCII. The upper 128 were made available for language-specific special characters, such as French accented characters, and are known as extended character sets. Different extended character sets, such as those for French and Thai, could not co-exist: the program had to decide on one or the other. With the spread of computing around the world chaos ensued, and Unicode was developed.
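The clash between extended character sets can be illustrated with a short Python sketch. It decodes the same single byte under two legacy code pages. The codec names cp1252 and cp874 are Python's names for the Windows Latin 1 and Thai code pages; they are used here only for illustration and are not part of Caliach Vision.

```python
# One byte, two meanings: the value 0xC1 read under two legacy code pages.
raw = bytes([0xC1])

as_latin = raw.decode("cp1252")  # Windows Latin 1 code page
as_thai = raw.decode("cp874")    # Windows Thai code page

# The byte is identical; only the assumed character set differs.
print(f"cp1252: {as_latin} (U+{ord(as_latin):04X})")
print(f"cp874:  {as_thai} (U+{ord(as_thai):04X})")
```

A file written under one code page and read under the other would therefore show the wrong characters, which is the problem Unicode removes.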
The developers of the Unicode standard realised that for it to work it had to accommodate the huge volume of existing legacy data, the large number of potential characters to represent, and the need for efficiency in storage and memory use. This meant having a multi-byte representation for each of the numerous characters, but with a variable length to make it efficient in storage. In practice they developed several character encoding standards to accommodate different needs and situations. UTF-8 and UTF-16 are common, the former being the most common. (UTF stands for Unicode Transformation Format.) In memory, Caliach Vision holds everything in UTF-32, which is a 4-byte format. This is a fixed-width standard that is very fast to process. When storing data externally, however, Caliach Vision uses UTF-8, in which a character takes 1, 2, 3 or 4 bytes depending on the character.
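The difference between the fixed-width and variable-width formats can be seen in a minimal Python sketch. The sample characters are chosen here for illustration, and Python's own codecs stand in for whatever Caliach Vision does internally.

```python
# Sample characters: Latin a, pound sign, Vietnamese A with circumflex
# and acute, and an emoji from outside the Basic Multilingual Plane.
samples = ["a", "\u00a3", "\u1ea4", "\U0001f600"]

def byte_lengths(ch):
    """Return (UTF-8 length, UTF-32 length) in bytes for one character."""
    # "utf-32-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-32-le"))

for ch in samples:
    utf8_len, utf32_len = byte_lengths(ch)
    print(f"U+{ord(ch):04X}  UTF-8: {utf8_len} byte(s)  UTF-32: {utf32_len} bytes")
```

UTF-32 always costs 4 bytes per character, which makes indexing simple, while UTF-8 keeps mostly-Latin text compact.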
For example, this help page is stored in UTF-8 and because most of the characters are Latin they use only 1 byte each. But other characters such as £Ấø± use more than one byte. Technically the system uses the upper-most bits of each byte to tell the program how the bytes combine before it can pick the character to display. Bytes 00000000 to 01111111 are single-byte characters and, for backward compatibility, are identical to ASCII characters. If the first byte starts with 110, 1110 or 11110, it begins a sequence of two, three or four bytes respectively, and each following byte of the sequence starts with 10. Rather than seeing things in ones and zeros, it is more convenient to use hexadecimal to represent what are known as code points. This splits a byte in half and each half is represented by 0 to 9 then A to F. So single-byte characters are 00 to 7F. By convention, a Unicode code point is written as U+ followed by its hexadecimal value, from U+0000 to U+10FFFF, which is the upper limit of the standard.
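The bit patterns can be inspected with a minimal Python sketch. It shows how the leading bits of the first byte of a UTF-8 sequence give the number of bytes in that sequence, and how a code point is written in U+ notation. The function names are hypothetical helpers invented for this illustration.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as 8-digit binary strings."""
    return [f"{b:08b}" for b in ch.encode("utf-8")]

def sequence_length(lead):
    """Number of bytes in a UTF-8 sequence, read from its first byte."""
    if lead < 0b10000000:      # 0xxxxxxx: single byte, same as ASCII
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("10xxxxxx is a continuation byte, not a first byte")

def code_point(ch):
    """Format a character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"

pound = "\u00a3"
print(code_point(pound), utf8_bits(pound))
```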
Unicode permits any international script character to coexist with any other. Under Unicode the letter “a” can coexist with the East-Asian character “㌀”. For more information on the Unicode system see http://en.wikipedia.org/wiki/Unicode.
TIP: For languages such as Thai and Lao, which do not visually separate words, the entire text becomes one word in the Keyword system. You can, however, insert the special Zero Width Space (ZWSP) character in the text. This character is U+200B (you can find it in the Windows Character Map utility for the Arial font, using Advanced view and Go to Unicode 200B). On the Mac use Option-200B with the Unicode Hex Input keyboard layout selected. This character, invisible when viewed, is seen by the Keyword system, which breaks the text at that point. For more on this see the Keyword System Settings topic.
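The effect of the Zero Width Space can be sketched in Python. The function split_keywords is a hypothetical stand-in invented here; the actual rules used by the Keyword system are not shown in this topic, so treat the splitting logic as an assumption.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, invisible when displayed

def split_keywords(text):
    """Split text on ordinary whitespace and on Zero Width Spaces.

    Assumption: this only approximates how a keyword extractor might
    treat ZWSP as a word boundary.
    """
    words = []
    for chunk in text.split():  # str.split() does not treat ZWSP as whitespace
        words.extend(part for part in chunk.split(ZWSP) if part)
    return words

without_zwsp = "สวัสดีครับ"
with_zwsp = "สวัสดี" + ZWSP + "ครับ"

print(split_keywords(without_zwsp))  # one long word
print(split_keywords(with_zwsp))     # two separate words
```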
Sort Order and Search
Searching and sorting of data involves comparison of character strings. Previously the values compared were in the range 0 to 255, but with the introduction of Unicode a much larger range is possible. Omnis Studio uses comparison of the UTF-8 values of the strings to achieve this.
This does mean that in some instances of character use the sort order may not follow national linguistic expectations. For instance the characters aADZàÀ will, when sorted, come out in the order AaDZÀà as they will be sorted by their underlying Unicode value.
NOTE: As a consequence of this Caliach have deprecated the use of National character fields (used in a few places before, such as customer name and part description) and now all Character fields are of the subtype kSimplechar.
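Value-based ordering can be explored with a minimal Python sketch. It sorts the sample characters in two ways: by raw code point, and by a case-folded key with the code point as tie-breaker. The case-folded key is an assumption made here because it reproduces the order quoted above; it is not confirmed as the comparison Omnis Studio performs.

```python
# a, A, D, Z, a-grave, A-grave (pre-composed forms)
chars = list("aADZ\u00e0\u00c0")

# Strict code point order: all unaccented capitals come before lower case.
by_code_point = "".join(sorted(chars))

# Comparing UTF-8 bytes gives the same result as comparing code points.
by_utf8_bytes = "".join(sorted(chars, key=lambda c: c.encode("utf-8")))

# Assumed key: fold case first, then break ties on the code point.
by_folded = "".join(sorted(chars, key=lambda c: (c.casefold(), c)))

print(by_code_point)
print(by_folded)
```

In both orderings the accented letters sort after Z, which is where the result departs from, say, French dictionary order.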
Extreme Character for Searches
When you use the search windows in Caliach Vision to select data for a report, global change, or other purpose, and you do not enter a value in the to boxes, the program automatically places an extreme limit character there and uses that to search against. This is typically a high-value character that is higher than any ever used in the data, so that all records are included in the search. However, this can be problematic if you are using non-English characters. For this reason there is an Extreme Character control setting in File — System Manager — Company Details, Localisation tab pane. You can set the Hex Unicode value of an extreme character that is appropriate for the characters you use. The default value is 221E.
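Why the choice of extreme character matters can be sketched in Python. The function in_range is a hypothetical helper that mimics a from/to search by plain code point comparison; it is an illustration under that assumption, not the program's search code.

```python
DEFAULT_EXTREME = chr(0x221E)  # the default setting 221E, the infinity sign

def in_range(value, low="", high="", extreme=DEFAULT_EXTREME):
    """Return True if value falls between low and high.

    An empty high limit is replaced by the extreme character, as the
    search windows are described as doing.
    """
    upper = high if high else extreme
    return low <= value <= upper

print(in_range("Zebra Ltd"))                        # Latin name: included
print(in_range("\u0e01\u0e02"))                     # Thai name: included
print(in_range("\u3300 Ltd"))                       # starts above U+221E: missed
print(in_range("\u3300 Ltd", extreme=chr(0xFFFF)))  # higher limit: included
```

Data that begins with a character above the extreme value falls outside an open-ended search, so the setting should sit above every character in use.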
Character Normalisation
Unicode allows a significant number of characters to be represented by more than one sequence of code points. For example, consider the letter E with circumflex and dot below, a character that occurs in Vietnamese (Ệ). This character has five possible representations in Unicode:
1. U+0045 Latin capital letter E
   U+0302 combining circumflex accent
   U+0323 combining dot below
2. U+0045 Latin capital letter E
   U+0323 combining dot below
   U+0302 combining circumflex accent
3. U+00CA Latin capital letter E with circumflex
   U+0323 combining dot below
4. U+1EB8 Latin capital letter E with dot below
   U+0302 combining circumflex accent
5. U+1EC6 Latin capital letter E with circumflex and dot below
A character represented by more than one code point is referred to as a composite character. A character represented by a single code point is referred to as a pre-composed character.
As far as the end-user is concerned, any of the above constructions of the visual character will look the same. But a database system sees them quite differently. The construction will affect sorting and comparisons, and will consume different amounts of space. In 1 above, what appears as one character to the user will consume 3 characters in the field when stored.
There is a process that deals with this, known as canonicalization. It converts composite characters into their pre-composed form or, where none is available, into a standard composite form, so that the stored text is consistent and the results of comparisons are what the user expects. The process typically breaks composite characters down into their elements (canonical de-composition) and then re-combines them into a normalised form (canonical composition). Within Caliach Vision, this is performed automatically for all indexed fields after the user enters a value (a Customer Name, for example). As Keyword search typically extracts words from non-indexed fields, the Keyword system performs this normalisation before it extracts the words.
TIP: Pasting from the clipboard into a field always normalises the incoming text, so one trick to force normalisation is to select all the text entered into a field, copy and then paste.
TIP: You can perform a wholesale normalisation of all non-indexed character field data in the File — Advanced — Re-unite Missing Relatives using the nfc() Normalise function.
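Normalisation can be demonstrated outside the program with Python's standard unicodedata module. This is a sketch of the general technique (Unicode normalisation form NFC), used here as a stand-in; it does not show how the nfc() function is implemented in Caliach Vision.

```python
import unicodedata

# The five representations of E with circumflex and dot below.
forms = [
    "E\u0302\u0323",   # E + circumflex + dot below
    "E\u0323\u0302",   # E + dot below + circumflex
    "\u00ca\u0323",    # E with circumflex + dot below
    "\u1eb8\u0302",    # E with dot below + circumflex
    "\u1ec6",          # fully pre-composed
]

for number, text in enumerate(forms, start=1):
    composed = unicodedata.normalize("NFC", text)
    points = " ".join(f"U+{ord(c):04X}" for c in composed)
    print(f"{number}: {len(text)} code point(s) before, NFC gives {points}")

# Before normalisation the five strings all compare as different.
distinct_before = len(set(forms))
distinct_after = len({unicodedata.normalize("NFC", t) for t in forms})
print(distinct_before, distinct_after)
```

After normalisation all five strings are the same single code point, so comparisons, sorting and storage size no longer depend on how the character was typed.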
Import/Export and Report File Encoding
Each character of text stored in a computer is represented by a binary code. Various coding systems have existed in the past and several Unicode coding systems exist. When importing or exporting data in and out of Caliach Vision some encoding must be made to either convert the importing data into the standard used within the program or encode exported text data into a form suitable for reading by other programs.
Typically, data saved to external text files, such as the Terminal.inf file or archive files, use the UTF-8 Unicode standard.
Importing, exporting and the writing of report data to text files and the port are controlled in File — Preferences — User Settings by three settings that map onto the corresponding session preferences of Omnis Studio. If you are importing data generated by a program with a specific text encoding that cannot be automatically identified by a BOM, you will need to set the Import Encoding specifically. Likewise, if you are exporting data for use with a program that uses a particular encoding regime, you will need to set the Export Encoding.
Encoding for Import when no BOM: The encoding used for imported data when importing from port, or when the import file does not have a Unicode Byte Order Marker (BOM). Any of the kUniType… constants, except kUniTypeAuto, kUniTypeCharacter, kUniTypeBinary and the kUniTypeUTF32… values.
Encoding for Exporting: The encoding used for exporting data and printing to port or text file. Any of the kUniType… constants, except kUniTypeAuto, kUniTypeCharacter and kUniTypeBinary.
Set Byte Order Mark at the start of exported text: If set, and the Exporting preference identifies a Unicode encoding, a Unicode BOM is output at the start of the output file.
kUniType… name | Description |
---|---|
kUniTypeAuto | The encoding is automatically detected from the data, using the Unicode BOM if present; possible encodings are identified by some of the other kUniType… constants |
kUniTypeUTF8 | The data is Unicode character data encoded using UTF-8 |
kUniTypeUTF16BE | The data is Unicode character data encoded using UTF-16BE |
kUniTypeUTF16LE | The data is Unicode character data encoded using UTF-16LE |
kUniTypeUTF16 | The data is Unicode character data encoded using UTF-16LE if the machine is little-endian, or UTF-16BE if the machine is big-endian |
kUniTypeUTF32BE | The data is Unicode character data encoded using UTF-32BE |
kUniTypeUTF32LE | The data is Unicode character data encoded using UTF-32LE |
kUniTypeUTF32 | The data is Unicode character data encoded using UTF-32LE if the machine is little-endian, or UTF-32BE if the machine is big-endian |
kUniTypeNativeCharacters | Each byte in the data represents a character in the Latin 1 character set for the machine (Ansi on Windows, MacRoman on the Mac, ISO-8859-1 on Linux) |
kUniTypeAnsiThai | Each byte in the data represents a character in the ANSI-874 Thai code page |
kUniTypeAnsiCentralEuropean | Each byte in the data represents a character in the ANSI-1250 Central European code page |
kUniTypeAnsiCyrillic | Each byte in the data represents a character in the ANSI-1251 Cyrillic code page |
kUniTypeAnsiLatin1 | Each byte in the data represents a character in the ANSI-1252 Latin 1 code page |
kUniTypeAnsiGreek | Each byte in the data represents a character in the ANSI-1253 Greek code page |
kUniTypeAnsiTurkish | Each byte in the data represents a character in the ANSI-1254 Turkish code page |
kUniTypeAnsiHebrew | Each byte in the data represents a character in the ANSI-1255 Hebrew code page |
kUniTypeAnsiArabic | Each byte in the data represents a character in the ANSI-1256 Arabic code page |
kUniTypeAnsiBaltic | Each byte in the data represents a character in the ANSI-1257 Baltic code page |
kUniTypeAnsiVietnamese | Each byte in the data represents a character in the ANSI-1258 Vietnamese code page |
kUniTypeISO8859_1 | Each byte in the data represents a character in the ISO-8859-1 character set (Latin-1 Western European) |
kUniTypeISO8859_2 | Each byte in the data represents a character in the ISO-8859-2 character set (Latin-2 Central European) |
kUniTypeISO8859_3 | Each byte in the data represents a character in the ISO-8859-3 character set (Latin-3 South European) |
kUniTypeISO8859_4 | Each byte in the data represents a character in the ISO-8859-4 character set (Latin-4 North European) |
kUniTypeISO8859_5 | Each byte in the data represents a character in the ISO-8859-5 character set (Latin/Cyrillic) |
kUniTypeISO8859_6 | Each byte in the data represents a character in the ISO-8859-6 character set (Latin/Arabic) |
kUniTypeISO8859_7 | Each byte in the data represents a character in the ISO-8859-7 character set (Latin/Greek) |
kUniTypeISO8859_8 | Each byte in the data represents a character in the ISO-8859-8 character set (Latin/Hebrew) |
kUniTypeISO8859_9 | Each byte in the data represents a character in the ISO-8859-9 character set (Latin-5 Turkish) |
kUniTypeISO8859_10 | Each byte in the data represents a character in the ISO-8859-10 character set (Latin-6 Nordic) |
kUniTypeISO8859_11 | Each byte in the data represents a character in the ISO-8859-11 character set (Latin/Thai) |
kUniTypeISO8859_13 | Each byte in the data represents a character in the ISO-8859-13 character set (Latin-7 Baltic Rim) |
kUniTypeISO8859_14 | Each byte in the data represents a character in the ISO-8859-14 character set (Latin-8 Celtic) |
kUniTypeISO8859_15 | Each byte in the data represents a character in the ISO-8859-15 character set (Latin-9) |
kUniTypeISO8859_16 | Each byte in the data represents a character in the ISO-8859-16 character set (Latin-10 South-Eastern European) |
kUniTypeOEM | Each byte in the data represents a character in the OEM character set (MS-DOS code page 437) |
kUniTypeBinary | The data is binary |
kUniTypeCharacter | The data is Omnis character data (use this constant in conjunction with the uniconv() function to identify a source or destination variable of type character) |
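The interplay of the three settings can be sketched in Python: a BOM, when present, decides the import encoding; otherwise a configured fallback is used; and on export a BOM is written only for Unicode encodings. The function names and the use of Python codec names (for example cp874 in place of kUniTypeAnsiThai) are assumptions made for this illustration and are not Omnis Studio calls.

```python
import codecs

# Longer BOMs are tested first: the UTF-32LE BOM begins with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]
_BOM_FOR = {encoding: bom for bom, encoding in _BOMS}

def decode_import(data, fallback="cp1252"):
    """Decode imported bytes, trusting a BOM if there is one.

    fallback plays the role of the "Encoding for Import when no BOM" setting.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(fallback)

def encode_export(text, encoding="utf-8", write_bom=True):
    """Encode text for export, optionally prefixed by a BOM.

    Non-Unicode encodings have no BOM, so none is written for them.
    """
    bom = _BOM_FOR.get(encoding, b"") if write_bom else b""
    return bom + text.encode(encoding)

exported = encode_export("\u00a3 \u0e01")
print(exported[:3] == codecs.BOM_UTF8, decode_import(exported))
```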
Conversion of Libraries and Data
When you open a Library (including Ad Hoc reports) or Data previously run under an old non-Unicode version of Omnis Studio or Caliach Vision, the program will automatically convert the Library or Datafile to Unicode. This is an irreversible process and will expand the file sizes, so back up the files before opening them.
Libraries
When you open a library, Omnis Studio V5.2+ will convert the library to Unicode. Once converted, it cannot be opened by previous versions of Omnis Studio.
Data Files
Datafiles can be converted using two alternative methods.
- Full: whereby a full conversion of the Character-based data in your Omnis data file takes place. The existing indexes are dropped and a new index of your data is built.
- Quick: whereby the indexes are dropped and rebuilt, but the Character data in your Omnis data file is not converted. This is OK for files containing only 7 bit data. Omnis does not check that the file contains only 7 bit data, so it is your responsibility to know whether or not it is safe to run this conversion process.
You are strongly recommended to use the Full conversion.
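What "only 7 bit data" means can be shown with a small Python sketch. It checks a sample of bytes, for instance text exported from the old system, for any value of 128 or above. The helper names are hypothetical, and the sketch does not read an Omnis data file directly; it only illustrates the test you would need to be confident about before choosing Quick.

```python
def is_seven_bit(data):
    """True if every byte is below 128, i.e. plain ASCII."""
    return all(byte < 0x80 for byte in data)

def first_high_byte(data):
    """Offset of the first byte of 128 or above, or None if there is none."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset
    return None

plain = b"Widget 10mm"
accented = "Caf\u00e9".encode("cp1252")  # e-acute is the single byte 0xE9

print(is_seven_bit(plain), first_high_byte(plain))
print(is_seven_bit(accented), first_high_byte(accented))
```

Any accented letter, currency symbol such as the pound sign, or non-Latin script in the old data fails this test, which is why the Full conversion is the safe choice.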
NOTE: The Omnis Studio Unicode conversion converts extended characters (ASCII values of 128 or greater) where the data is held in conventional character fields, but it cannot convert data held in complex fields such as Objects or Lists. Such data may need manual correction. Complex objects are mostly associated with Settings, so check your settings after conversion and correct them manually if needed. Examples include custom feature or report descriptions, ScratchPad column titles, and Navigator function titles and help. WorkFlow Alert Checks and Triggers, Document Configuration designs and Graph data are also complex.
Compiled in Program Version 5.10. Help data last modified 31 Dec 2014 00:49:00.00. No class.