Notes on UTF-8 and locales

Paul Heinlein

Initial publication: July 19, 2004
Most recent revision: March 23, 2005

Notes on using UTF-8 and locales on Linux systems.


Table of Contents

Introduction
Old-time sorting
UTF-8 fonts in xterms
Testing UTF-8 support with GNU date
Useful links

Introduction

For some time now, the default shell environments shipped with many Linux distributions have used UTF-8 locale information. (UTF-8 is one encoding of Unicode, though the two names are often used interchangeably.) This can be a bit confusing, especially for those accustomed to the old-style ASCII sorting order.

The starting point for documentation on the issue is the locale(1) man page. The locale command issued without arguments will provide a summary of your current environment.

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Two of these environment variables have more impact on my day-to-day work than the others. LC_CTYPE specifies “character classification and case conversion,” in other words, which character encoding is in use and, by extension, which glyphs a terminal font needs to supply. LC_COLLATE influences sorting order.
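
To see what a given LC_CTYPE setting means in practice, the locale command can report the character map behind it. The sketch below wraps that in a small function; the name charmap_for is my own and not part of any tool. For a UTF-8 locale it should print UTF-8, assuming the locale is installed.

```shell
# Hypothetical helper: print the character encoding implied by an LC_CTYPE value.
# LC_ALL is emptied for the call so that it cannot override LC_CTYPE.
charmap_for() {
  LC_ALL= LC_CTYPE="$1" locale charmap
}

charmap_for C
# Try a UTF-8 locale as well, if it is installed on your system:
#   charmap_for en_US.UTF-8
```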

Old-time sorting

If you're accustomed to ASCII sorting, then the results of ls or sort might initially be confusing in a modern locale. Take a look at the difference the locale makes in these two directory listings. The first uses the old-time raw ASCII sort order. In the second, however, the locale “knows” that 'C' and 'c' are the same letter and that leading dots shouldn't influence the sorting order.

$ LC_COLLATE="C" ls -a
.  ..  .CCC  .ccc  AAA  BBB  aaa  bbb
$ LC_COLLATE="en_US" ls -a
.  ..  aaa  AAA  bbb  BBB  .ccc  .CCC
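
The same difference shows up with sort. Here is a minimal sketch using four made-up names; sort_c is a hypothetical function name. Under the C locale, comparison is by raw byte value, so every uppercase ASCII letter sorts before every lowercase one.

```shell
# Sort four sample names by raw byte value (the C locale).
sort_c() {
  printf '%s\n' aaa BBB AAA bbb | LC_ALL=C sort
}

sort_c
# Prints AAA, BBB, aaa, bbb (one per line).
# To compare, swap LC_ALL=C for LC_ALL=en_US.UTF-8, assuming that locale is installed.
```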

Since I prefer the old sorting order, my first item of business was to alter LC_COLLATE in my shell environment. Setting it to either “POSIX” or “C” gives the desired result. (A null value works only if LANG is also unset, because an empty LC_COLLATE falls back to LANG.) I use “C” because that's what the Fedora init scripts use.

# in my .profile script
LC_COLLATE="C"
export LC_COLLATE
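
If the setting doesn't seem to take effect, it helps to remember the order in which the variables are consulted: LC_ALL wins if it is set and non-empty, then the category variable, then LANG. The function below is only a sketch of that lookup order; effective_collate is a name I made up, and it inspects the environment rather than asking the C library.

```shell
# Hypothetical helper: report which value governs collation.
# Precedence: LC_ALL, then LC_COLLATE, then LANG, then the POSIX default.
# An empty variable is treated the same as an unset one.
effective_collate() {
  printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-POSIX}}}"
}

effective_collate
```

If this prints something other than C after you have edited .profile, check whether LC_ALL is set somewhere else in your startup files.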

UTF-8 fonts in xterms

The full explanation of getting Unicode characters to display correctly in xterm windows is somewhat lengthy, but a quick-start recipe is pretty easy.

  • Download and save to disk Markus Kuhn's UTF-8 demo text file.

    wget http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
    
  • Make sure your LC_CTYPE environment variable is set to use UTF-8 locale-specific characters.

    LC_CTYPE="en_US.UTF-8"
    export LC_CTYPE
    
  • Invoke xterm using an ISO 10646-1 typeface.

    xterm -fn -misc-fixed-medium-r-normal--14-130-75-75-c-70-iso10646-1
    
  • Take a peek at the test file in a Unicode-capable application like less.

    less UTF-8-demo.txt
    
  • Test your local man installation if you'd like to see larger blocks of text.

    # Greek's always fun
    LANG=el_GR.UTF-8 man man
    
    # German is widely supported
    LANG=de_DE.UTF-8 man man
    
    # As is Spanish
    LANG=es_ES.UTF-8 man man
    
    # How about something more exotic, like Hebrew or Korean?
    LANG=he_IL.UTF-8 man man
    LANG=ko_KR.UTF-8 man man
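
Not every system has all of these locales installed, and if one is missing, man will usually just fall back to English. A rough way to check first is to look the name up in the output of locale -a. The helper names normalize and have_locale below are my own. The normalization step is there because glibc lists its UTF-8 locales in the form el_GR.utf8.

```shell
# Lowercase the input and drop hyphens, so "el_GR.UTF-8" and "el_GR.utf8" compare equal.
normalize() {
  tr '[:upper:]' '[:lower:]' | sed 's/-//g'
}

# Hypothetical helper: succeed if the named locale appears in `locale -a`.
have_locale() {
  want=$(printf '%s\n' "$1" | normalize)
  locale -a 2>/dev/null | normalize | grep -qxF "$want"
}

# Usage:
#   have_locale el_GR.UTF-8 && LANG=el_GR.UTF-8 man man
```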
    

Testing UTF-8 support with GNU date

Here's a little script that'll print the locale-specific names of all the days and months for all the UTF-8 locales available on your system. It'll allow you to see the locales for which you do and don't have local font support.

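#!/bin/bash
# Start from a known locale so the locale list itself sorts predictably.
export LC_ALL=C
# glibc lists UTF-8 locales as "xx_YY.utf8"; other systems may use "xx_YY.UTF-8".
for loc in $(locale -a | grep -iE 'utf-?8' | sort); do
  echo "Locale: $loc"
  # Aug 1, 2004 was a Sunday, Aug 7 a Saturday
  for n in $(seq 1 7); do
    # LC_ALL overrides LANG and any LC_TIME inherited from the environment
    LC_ALL="$loc" date +"%A (%a)" -d "2004/8/${n}"
  done
  for n in $(seq 1 12); do
    LC_ALL="$loc" date +"%B (%b)" -d "2004/${n}/1"
  done
  echo
done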

You might also try saving the script's output to a file and then viewing that file with a web browser. On many of my systems, the browsers have better UTF-8 support than xterm and its system font.
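
A browser may guess the wrong encoding for a bare text file, so one option is to wrap the output in a minimal HTML page that declares UTF-8. What follows is a sketch only: to_html is a hypothetical function, the title is assumed to be plain text with no markup characters, and the script and output file names in the usage comment are placeholders.

```shell
# Hypothetical helper: wrap text from stdin in a minimal UTF-8 HTML page.
# $1 is the page title (assumed to contain no &, < or >).
to_html() {
  printf '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
  printf '<title>%s</title></head>\n<body><pre>\n' "$1"
  # Escape the three characters that are special in HTML text.
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
  printf '</pre></body></html>\n'
}

# Usage (file names are placeholders):
#   ./utf8-dates.sh | to_html "UTF-8 locale test" > utf8-dates.html
```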

Useful links

A great starting place for UTF-8/Linux information is Markus Kuhn's UTF-8 and Unicode FAQ for Unix/Linux. Markus is also the author of the helpful unicode(7) and utf-8(7) man pages that are found on many Linux systems.

Other helpful pages include Using UTF-8 with Gentoo, The Unicode HOWTO at the Linux Documentation Project, and Jan Stumpel's UTF-8 on Linux.

This article is licensed under a Creative Commons License.