Understanding key concepts
Connection methods
There are basically two types of connections: serial and TCP/IP. You need
to find out which type you have. If you don't know which it is, the easiest
thing is to ask someone who knows. Just in case you don't have anyone to
turn to, here are some tips.
- If you can browse the Web with Netscape or Mosaic, you have a TCP/IP connection.
- If your computer is directly wired rather than connected through a phone
line via a modem, you are likely to have a TCP/IP-type connection.
- If you are connected through a modem and names such as PPP and SLIP
show up when you make a connection, you have a TCP/IP-type connection. (Thus,
a modem alone does not determine the type of connection you have.)
- An Internet connection is a TCP/IP connection. If you are connected
to the Internet through an Internet provider, you have a TCP/IP connection.
- If you have to use Kermit or X-modem for downloading and uploading files,
you have a serial connection.
- If you use a communication program (terminal emulation program) that
doesn't support TCP/IP, you have a serial connection.
If you have a choice between the two, select TCP/IP. It will allow you to
use a wider range of software. If you use the Internet at home as well
as in your office, you probably have to use two different methods of connection.
Encoding methods for Japanese
While it is relatively easy to deal with alphabetic languages like English,
German, and Spanish on the computer, use of non-alphabetic languages like
Japanese and Chinese poses a special challenge. Since there are thousands
of distinct characters used in these languages, it is impossible to adopt
a one-byte-per-character scheme such as ASCII: a single byte can distinguish
at most 256 characters, even counting the so-called extended ASCII
characters (ones with accents etc.).
As for Japanese, the solution adopted by most computer platforms is the
use of two-byte encoding schemes such as Shift-JIS (Shift Japanese Industry
Standards) and EUC (Extended Unix Code). In the following example, the character
'store, shop', sandwiched between 1-byte alphabetic characters a
and b, is internally represented in Shift-JIS as character No. 147
(for which there is no standard graphical representation) followed by character
No. 88, which is the character X. (A hyphen represents a character boundary.)
graphical output: a
b
internal coding: a-#147-X-b
These schemes, however, are 8-bit systems: they use all 8 bits of each byte,
16 bits across two consecutive bytes, to represent a large number of characters.
These character representations therefore do not survive e-mail transport,
which has traditionally allowed only 7-bit characters to be transmitted.
Although so-called "8-bit clean" mail systems are on the rise, it is uncertain
when they will become the world standard.
For the purpose of e-mail exchange, the accepted standard is the 7-bit JIS
encoding. It is a so-called modal scheme which employs a special code for
turning on the 2-byte encoding mode and another for turning it off. The
two codes are implemented as escape sequences as follows:
start 2-byte encoding: ESC-$-B
end 2-byte encoding: ESC-(-J
*ESC represents character No. 27, the escape character, for which there
is no standard graphical representation.
Let us return to the above example and see how it would be encoded using
this scheme.
graphical output: a
b
internal coding: a-ESC-$-B-E-9-ESC-(-J-b
Notice that the JIS code for 店 is equivalent to E and 9 put together.
However, obviously we do not want them to come out as E and 9 but as 店. The
ESC-$-B sequence (called the shift-in or kanji-in code) ensures that whatever
follows it is understood as (JIS-encoded) 2-byte characters. Since
the character after 店 is a 1-byte alphabetic character, it is necessary, at
that point, to turn off the 2-byte mode by inserting the ESC-(-J sequence
(shift-out or kanji-out), ensuring that whatever follows it will be interpreted
as 1-byte characters.
The Escape problem
Even from the simple example above, it is obvious that these escape sequence
codes are crucial in 7-bit JIS encoding because without them it would be
impossible to interpret characters correctly. The problem that many people
experience using non-Japanese-ready host computers outside Japan is that
the escape character (character No. 27), being a non-printable control code,
is either deleted or replaced by another printable character. We have found
the following alteration patterns through our survey of various host computers,
all of which of course render the text unreadable.
ESC >> (deleted)
ESC >> space
ESC >> "
ESC >> ^[
We refer to this problem as the escape problem and, following Lunde (1993),
use the word corrupted to describe this kind of unreadable text.
Here is an example of corrupted Japanese text.
^[$B;d$NL>A0$O?<ED=_$G$9!#$3$NF|K\8l!"FI$a$^$7$?^[(J
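For the last alteration pattern (ESC replaced by the two printable characters ^ and [), the damage can sometimes be undone by putting the escape character back. The following is a minimal sketch, not a general repair tool: the function name restore_escapes is hypothetical, it handles only the ^[ pattern, and it recognizes only the two escape sequences described in this document (ESC-$-B and ESC-(-J). A ^[ that happens to occur inside genuine text in front of $B or (J would be converted by mistake.

```python
import re

ESC = "\x1b"  # character No. 27

# Match "^[" only when it is directly followed by "$B" (kanji-in)
# or "(J" (kanji-out), so that other uses of "^[" are left alone.
_CARET_BRACKET = re.compile(r"\^\[(?=\$B|\(J)")

def restore_escapes(corrupted: str) -> str:
    """Put ESC back where it was replaced by the two characters ^ and [."""
    return _CARET_BRACKET.sub(ESC, corrupted)

# The corrupted sample from the text (a raw string, because it contains "\").
corrupted = r'^[$B;d$NL>A0$O?<ED=_$G$9!#$3$NF|K\8l!"FI$a$^$7$?^[(J'

repaired = restore_escapes(corrupted)
readable = repaired.encode("ascii").decode("iso2022_jp")
print(readable)  # Japanese text, if the terminal can display it
```

The other alteration patterns are harder to reverse. When ESC is deleted outright or replaced by a space or a quotation mark, the same lookahead idea can be tried on $B and (J themselves, at a greater risk of false matches.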
As mentioned above, personal computers invariably use Shift-JIS for Japanese,
which means that some mechanism is needed to translate 7-bit JIS-coded text
into Shift-JIS text. This is handled by e-mail or code conversion software,
and there are a number of programs such as NCSA Telnet-J, NinjaTerm, nkf,
and JConv which are capable of performing this translation. When the Japanese
text is corrupted, however, these programs are helpless.
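The translation step itself is simple when the escape sequences are intact. The sketch below uses Python's built-in codecs and a hypothetical helper named jis_to_sjis; it does not describe how NCSA Telnet-J, NinjaTerm, nkf, or JConv work internally. It also shows why corrupted text defeats such a conversion: with ESC deleted, nothing marks the 2-byte section, so the bytes are taken for ordinary 1-byte characters and pass through unchanged.

```python
def jis_to_sjis(jis_bytes: bytes) -> bytes:
    """Convert 7-bit JIS (ISO-2022-JP) bytes to Shift-JIS bytes."""
    return jis_bytes.decode("iso2022_jp").encode("shift_jis")

# Intact 7-bit JIS text: a, kanji-in, E 9, kanji-out, b.
intact = b"a\x1b$BE9\x1b(Jb"
print(list(jis_to_sjis(intact)))   # [97, 147, 88, 98]

# The same text after a host has deleted both escape characters.
corrupted = b"a$BE9(Jb"
print(jis_to_sjis(corrupted))      # b'a$BE9(Jb'
```

In the first case the converter produces the Shift-JIS bytes 147 and 88 for the Japanese character; in the second it has no way of knowing that E and 9 were ever meant to be anything other than the letters E and 9.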
Technical References
Lunde, Ken (1993). "Understanding Japanese Information Processing." O'Reilly
& Associates, Inc.
Lunde, Ken. CJK info. document (ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf)