Notes on using UTF-8 and locales on Linux systems.
For some time now, the default shell environments shipped with many Linux distributions have used UTF-8 locales (UTF-8 being an encoding of Unicode). This can be a bit confusing, especially for those accustomed to the old-style ASCII sorting order.
The starting point for documentation on the issue is the locale(1) man page. The locale command issued without arguments will provide a summary of your current environment.
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
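For my purposes, two of these environment variables have more impact on my day-to-day work than the others. LC_CTYPE specifies “character classification and case conversion,” in other words, which byte sequences count as characters and how those characters are classified; in practice it also selects the character encoding, such as UTF-8. LC_COLLATE influences sorting order.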
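To make the effect of LC_CTYPE concrete, here is a minimal sketch that counts the characters in a single “é” under two locales. The helper name count_chars is hypothetical, and C.UTF-8 is an assumption about which UTF-8 locale is installed; substitute any UTF-8 name that locale -a reports.

```shell
#!/bin/sh
# count_chars is a hypothetical helper: $1 = locale name, $2 = text.
# LC_ALL is used because it overrides LC_CTYPE and LANG for one command.
count_chars() {
    printf '%s' "$2" | LC_ALL="$1" wc -m | tr -d '[:space:]'
}

# "é" encoded in UTF-8 is the two bytes 0xC3 0xA9 (octal 303 251).
e_acute=$(printf '\303\251')

echo "C locale:     $(count_chars C "$e_acute") character(s)"
# C.UTF-8 is an assumption; any installed UTF-8 locale should behave the same.
echo "UTF-8 locale: $(count_chars C.UTF-8 "$e_acute") character(s)"
```

In the C locale every byte is its own character, so the first line reports 2. With a working UTF-8 locale the second line should report 1; if the named locale is not installed, wc typically falls back to C behavior and reports 2 again.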
If you're accustomed to ASCII sorting, then the results of ls or sort might initially be confusing in a modern locale. Take a look at the difference the locale makes in these two directory listings. The first uses the old-time raw ASCII sort order. In the second, however, the locale “knows” that 'C' and 'c' are the same letter and that leading dots shouldn't influence the sorting order.
$ LC_COLLATE="C" ls -a
. .. .CCC .ccc AAA BBB aaa bbb
$ LC_COLLATE="en_US" ls -a
. .. aaa AAA bbb BBB .ccc .CCC
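The same comparison can be reproduced with sort, without creating any files. This is only a sketch: the function name names is made up for the example, and en_US.UTF-8 is an assumption that only takes effect if that locale appears in the output of locale -a. LC_ALL is used instead of LC_COLLATE so that nothing else in the environment can override it.

```shell
#!/bin/sh
# The six names from the listings above, one per line.
names() {
    printf '%s\n' aaa AAA bbb BBB .ccc .CCC
}

echo "C collation:"
names | LC_ALL=C sort

# en_US.UTF-8 is an assumption; if it is not installed, sort will
# usually fall back to the C ordering shown above.
echo "en_US.UTF-8 collation:"
names | LC_ALL=en_US.UTF-8 sort
```

The C run orders the names by raw byte value: the dot sorts before the uppercase letters, and those sort before the lowercase ones.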
Since I prefer the old sorting order, the first item of business was to alter LC_COLLATE in my shell environment. It appears I could achieve my desired results by setting it to “POSIX” or “C.” (A null value only works when LANG is itself unset or “C,” because an empty LC_COLLATE falls back to LANG.) I use “C” because that's what the Fedora init scripts use.
# in my .profile script
LC_COLLATE="C"
export LC_COLLATE
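Whether that setting actually takes effect depends on the other locale variables: POSIX gives a non-empty LC_ALL priority over LC_COLLATE, and LANG is only the fallback. A minimal sketch of that lookup order follows; effective_collate is a hypothetical helper name, not a standard command.

```shell
#!/bin/sh
# effective_collate is a hypothetical helper that mirrors the POSIX lookup
# order for the collation category: LC_ALL, then LC_COLLATE, then LANG.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Nothing set: the default is implementation-defined, usually POSIX.
        printf 'POSIX\n'
    fi
}

echo "Effective collation locale: $(effective_collate)"
```

If this prints something other than “C” after you log in again, look for a later startup file or a system default that sets LC_ALL.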
The full explanation of getting Unicode characters to display correctly in xterm windows is somewhat lengthy, but a quick-start recipe is pretty easy.
Download and save to disk Markus Kuhn's UTF-8 demo text file.
wget http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
Make sure your LC_CTYPE environment variable is set to use UTF-8 locale-specific characters.
LC_CTYPE="en_US.UTF-8"
export LC_CTYPE
Invoke xterm using an ISO 10646-1 typeface.
xterm -fn -misc-fixed-medium-r-normal--14-130-75-75-c-70-iso10646-1
Take a peek at the test file in a Unicode-capable application like less.
less UTF-8-demo.txt
Test your local man installation if you'd like to see larger blocks of text.
# Greek's always fun
LANG=el_GR.UTF-8 man man
# German is widely supported
LANG=de_DE.UTF-8 man man
# As is Spanish
LANG=es_ES.UTF-8 man man
# How about something more exotic, like Hebrew or Korean?
LANG=he_IL.UTF-8 man man
LANG=ko_KR.UTF-8 man man
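If the demo file or the man pages show garbage, it may help to confirm that the shell really is running with a UTF-8 character set before blaming the font. The sketch below relies on the standard locale charmap query; is_utf8_charmap is a hypothetical helper name, and the list of spellings it accepts is an assumption about what different systems report.

```shell
#!/bin/sh
# is_utf8_charmap is a hypothetical helper: succeeds when $1 names UTF-8.
is_utf8_charmap() {
    case "$1" in
        UTF-8|utf-8|UTF8|utf8) return 0 ;;
        *) return 1 ;;
    esac
}

# `locale charmap` prints the character set of the current LC_CTYPE.
charmap=$(locale charmap 2>/dev/null) || charmap=''

if is_utf8_charmap "$charmap"; then
    echo "Current locale uses UTF-8."
else
    echo "Current locale uses '${charmap:-unknown}', not UTF-8; check LC_ALL, LC_CTYPE and LANG."
fi
```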
Here's a little script that'll print the locale-specific names of all the days and months for all the UTF-8 locales available on your system. It'll allow you to see the locales for which you do and don't have local font support.
#!/bin/bash
LANG=C
for loc in $(locale -a | grep utf8 | sort); do
    echo "Locale: $loc"
    # Aug 1, 2004 was a Sunday, Aug 7 a Saturday
    for n in $(seq 1 7); do
        LANG="$loc" date +"%A (%a)" -d 2004/8/${n}
    done
    for n in $(seq 1 12); do
        LANG="$loc" date +"%B (%b)" -d 2004/${n}/1
    done
    echo
done
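To try just one or two locales rather than all of them, it can be handy to check first whether a given locale is installed. The following is a sketch, not a standard tool: normalize_locale_name and have_locale are hypothetical helper names, and the comparison is a heuristic that assumes locale -a may spell the codeset as “utf8” where users write “UTF-8.”

```shell
#!/bin/sh
# Heuristic: compare locale names in lowercase with hyphens removed, so
# that "en_US.UTF-8" and "en_US.utf8" are treated as the same name.
normalize_locale_name() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -d '-'
}

# have_locale is a hypothetical helper: succeeds if $1 is in `locale -a`.
have_locale() {
    wanted=$(normalize_locale_name "$1")
    locale -a 2>/dev/null | tr '[:upper:]' '[:lower:]' | tr -d '-' |
        grep -qxF "$wanted"
}

# The two locale names below are only examples.
for loc in el_GR.UTF-8 ko_KR.UTF-8; do
    if have_locale "$loc"; then
        LC_ALL="$loc" date +"%A, %B"
    else
        echo "$loc is not installed"
    fi
done
```

LC_ALL is used here instead of LANG so that an LC_TIME already set in the environment cannot mask the locale being tested.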
You might also try saving the script's output to a file and then viewing that file with a web browser. On many of my systems, the browsers have better UTF-8 support than xterm and its system font.
A great starting place for UTF-8/Linux information is Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux. Markus is also the author of the helpful unicode(7) and utf-8(7) man pages that are found on many Linux systems.
Other helpful pages include Using UTF-8 with Gentoo, The Unicode HOWTO at the Linux Documentation Project, and Jan Stumpel's UTF-8 on Linux.
This article is licensed under a Creative Commons License.