Understanding key concepts

Connection methods

There are basically two types of connections: serial and TCP/IP. You need to find out which type you have. If you don't know which it is, the easiest thing is to ask someone who knows. Just in case you don't have anyone to turn to, here are some tips. If you have a choice between the two, select TCP/IP. It will allow you to use a wider range of software. If you use the Internet at home as well as in your office, you probably have to use two different methods of connection.

Encoding methods for Japanese

While it is relatively easy to deal with alphabetic languages like English, German, and Spanish on the computer, non-alphabetic languages like Japanese and Chinese pose a special challenge. Since these languages use thousands of distinct characters, it is impossible to adopt a one-byte-per-character scheme like ASCII: a single byte can accommodate only 256 distinct characters, even including the so-called extended ASCII characters (ones with accents etc.).

As for Japanese, the solution adopted by most computer platforms is the use of two-byte encoding schemes such as Shift-JIS (Shift Japanese Industry Standards) and EUC (Extended Unix Code). In the following example, the character 'store, shop', sandwiched between 1-byte alphabetic characters a and b, is internally represented in Shift-JIS as character No. 147 (for which there is no standard graphical representation) followed by character No. 88, which is the character X. (A hyphen represents a character boundary.)
graphical output: ab
internal coding: a-#147-X-b
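The character numbers in this example can be checked with a short Python sketch. It assumes the character in question is 店 and relies on Python's built-in shift_jis codec; neither Python nor the variable names are part of the original discussion.

```python
# The internal coding from the example, as decimal character numbers:
# a, No. 147, X (No. 88), b.
sjis = bytes([97, 147, 88, 98])

# Decode with Python's built-in Shift-JIS codec.
text = sjis.decode("shift_jis")

# Four bytes in, three characters out: bytes 147 and 88 form one character.
print(len(sjis), len(text))
print(text)

# Going the other way gives back the same four bytes.
assert text.encode("shift_jis") == sjis
```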

These schemes, however, are 8-bit systems: to represent a large number of characters in two consecutive bytes, they rely on byte values above 127, which need the eighth bit. These character representations, therefore, will not work reliably with e-mail, which has traditionally allowed only 7-bit characters to be transmitted. Although so-called "8-bit clean" mail systems are on the rise, it is uncertain when they will become the world standard.
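To illustrate why the eighth bit matters, here is a minimal Python sketch. The helper names are hypothetical, and clearing the eighth bit is only one assumed way a 7-bit mail system might mistreat the data; real systems may behave differently.

```python
def is_7bit_clean(data: bytes) -> bool:
    """Return True if no byte needs the eighth (high) bit."""
    return all(byte < 0x80 for byte in data)


def strip_high_bit(data: bytes) -> bytes:
    """Simulate a 7-bit channel that clears the eighth bit of every byte."""
    return bytes(byte & 0x7F for byte in data)


# Shift-JIS bytes: a, No. 147, X (No. 88), b.
sjis = bytes([97, 147, 88, 98])

print(is_7bit_clean(sjis))      # False, because 147 is above 127
print(is_7bit_clean(b"ab"))     # True, plain alphabetic text is unaffected

# After stripping, No. 147 has become No. 19, so the 2-byte character is lost.
print(list(strip_high_bit(sjis)))
```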

For the purpose of e-mail exchange, the accepted standard is the 7-bit JIS encoding. It is a so-called modal scheme which employs a special code for turning on the 2-byte encoding mode and another for turning it off. The two codes are implemented as escape sequences as follows:
start 2-byte encoding: ESC-$-B
end 2-byte encoding: ESC-(-J
*ESC represents character No. 27, the escape character, for which there is no standard graphical representation.

Let us return to the above example and see how it would be encoded using this scheme.
graphical output: ab
internal coding: a-ESC-$-B-E-9-ESC-(-J-b

Notice that the JIS code for 店 is equivalent to E and 9 put together. Obviously, however, we want them to come out not as E and 9 but as 店. The ESC-$-B sequence (called the shift-in or kanji-in code) ensures that whatever follows it is understood as (JIS-encoded) 2-byte characters. Since the next character, b, is a 1-byte alphabetic character, it is necessary at that point to turn off the 2-byte mode by inserting the ESC-(-J sequence (shift-out or kanji-out), ensuring that whatever follows it will be interpreted as 1-byte characters.
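The same example can be assembled by hand in Python, as a rough sketch. The constant names are made up for illustration, and the iso2022_jp codec from Python's standard library stands in for whatever decoder a mail program would actually use.

```python
ESC = b"\x1b"             # character No. 27
KANJI_IN = ESC + b"$B"    # start 2-byte encoding
KANJI_OUT = ESC + b"(J"   # end 2-byte encoding

# a-ESC-$-B-E-9-ESC-(-J-b, built from the pieces described above.
jis = b"a" + KANJI_IN + b"E9" + KANJI_OUT + b"b"

# Every byte fits in 7 bits, so the text can travel through 7-bit e-mail.
assert all(byte < 0x80 for byte in jis)

# With the escape sequences intact, E and 9 are read as one 2-byte character.
text = jis.decode("iso2022_jp")
print(len(text))

# That character is the same one Shift-JIS stores as No. 147 followed by X.
print(list(text[1].encode("shift_jis")))
```

Without KANJI_IN and KANJI_OUT, the same E and 9 would simply be read as two alphabetic characters.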

The Escape problem

Even from the simple example above, it is obvious that these escape sequence codes are crucial in 7-bit JIS encoding because without them it would be impossible to interpret characters correctly. The problem that many people experience using non-Japanese-ready host computers outside Japan is that the escape character (character No. 27), being a non-printable control code, is either deleted or replaced by another printable character. We have found the following alteration patterns through our survey of various host computers, all of which of course render the text unreadable.
ESC >> (deleted)
ESC >> space
ESC >> "
ESC >> ^[
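The effect of these four patterns can be simulated with a small Python sketch. The corrupt function and the pattern names are hypothetical; they only imitate what such a host computer appears to do to the escape character.

```python
ESC = b"\x1b"

# The four alteration patterns listed above, as replacements for ESC.
PATTERNS = {
    "deleted": b"",
    "space": b" ",
    "double quote": b'"',
    "caret-bracket": b"^[",
}


def corrupt(data: bytes, replacement: bytes) -> bytes:
    """Replace every escape character, as a non-Japanese-ready host might."""
    return data.replace(ESC, replacement)


# The earlier example: a, kanji-in, E9, kanji-out, b.
jis = b"a\x1b$BE9\x1b(Jb"

for name, replacement in PATTERNS.items():
    damaged = corrupt(jis, replacement)
    # No ESC is left, so the decoder treats everything as 1-byte characters.
    print(name, "->", damaged.decode("iso2022_jp"))
```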

We will be referring to this problem as the escape problem and use the word corrupted to describe this kind of unreadable text, following Lunde (1993). Here is an example of corrupted Japanese text.

^[$B;d$NL>A0$O?<ED=_$G$9!#$3$NF|K\8l!"FI$a$^$7$?^[(J
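For the ^[ pattern in particular, the escape characters can sometimes be put back by hand. The following Python sketch assumes that every ^[ directly followed by $B or (J used to be an escape character; the function name is hypothetical, and the approach does nothing for text in which ESC was deleted or replaced in some other way.

```python
import re

# A literal "^[" counts only where one of the two escape sequences follows.
CARET_ESC = re.compile(r"\^\[(?=\$B|\(J)")


def restore_escapes(text: str) -> str:
    """Put ESC (character No. 27) back where it was rewritten as ^[."""
    return CARET_ESC.sub("\x1b", text)


# The corrupted example from above (a raw string keeps the backslash as is).
corrupted = r'^[$B;d$NL>A0$O?<ED=_$G$9!#$3$NF|K\8l!"FI$a$^$7$?^[(J'

repaired = restore_escapes(corrupted)

# With the escape sequences restored, the text decodes as 7-bit JIS again.
readable = repaired.encode("ascii").decode("iso2022_jp")
print(readable)
```

The lookahead keeps the sketch from touching a ^[ that is not followed by a known escape sequence, but it can still guess wrong if those characters happen to occur inside the 2-byte text itself.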

As mentioned above, most personal computers use Shift-JIS for Japanese, which means that some mechanism is needed to translate 7-bit JIS-coded text into Shift-JIS text. This is handled by e-mail or code conversion software, and there are a number of programs, such as NCSA Telnet-J, NinjaTerm, nkf, and JConv, which are capable of performing this translation. When the Japanese text is corrupted, however, these programs are helpless.
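The translation step itself can be sketched in a few lines of Python. This is not how any of the programs named above are implemented; jis_to_sjis is a hypothetical helper built on the codecs in Python's standard library.

```python
def jis_to_sjis(data: bytes) -> bytes:
    """Translate 7-bit JIS (ISO-2022-JP) bytes into Shift-JIS bytes."""
    return data.decode("iso2022_jp").encode("shift_jis")


# Intact text: the escape sequences tell the converter where the kanji is.
intact = b"a\x1b$BE9\x1b(Jb"
print(list(jis_to_sjis(intact)))

# Corrupted text (ESC deleted): nothing marks E9 as a 2-byte character,
# so the bytes pass through unchanged and stay unreadable.
damaged = b"a$BE9(Jb"
print(jis_to_sjis(damaged))
```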

Technical References


Lunde, Ken (1993). Understanding Japanese Information Processing. O'Reilly & Associates, Inc.

Lunde, Ken. CJK info. document (ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf)