Special characters on this Website

Any attempt to present information from a wide array of non-English languages that use the Roman alphabet inevitably requires the use of so-called "special characters" in order to be accurate.  This page describes how we do that on this Website, starting from the basic English language in which all text is written.  After explaining the basics, we touch on problems of computer representation before delving into what we can or cannot do with non-English languages in various parts of this Website.

NOTE:  This essay over-simplifies the uses of characters in human languages.  For a painfully exhaustive treatment of that subject, see the English Wikipedia article on the English alphabet and the hundreds of linked articles in that online encyclopedia.

The basics of English language representation

Spelling any native word in the English language requires only "ordinary" Latin (or Roman) letters.  There are 26 of these, each of which has an UPPER CASE VERSION and a lower case version.  In "alphabetic order," these characters are as follows:
  Upper case:  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
  Lower case:  a b c d e f g h i j k l m n o p q r s t u v w x y z

Expressing numbers concisely requires the use of ordinary Arabic digits.  In "numeric order," these digits are as follows:
  Numerals:  0 1 2 3 4 5 6 7 8 9

Combining English words into sentences and including some minimal indications of expression requires the use of "ordinary" punctuation characters.  In no particular order, they are as follows:
  Punctuation:  . , ; : ' " ! ? ( ) /

Finally, there are several other common non-linguistic characters that appear on most American-made typewriters and computer keyboards.  In no particular order, they are as follows:
  Special:  ` ~ @ # $ % ^ & * - _ = + \ | [ ] { } < >

The character shapes that you see above, collectively called "glyphs," are examples of how the English-trained human eye recognizes and distinguishes characters.  These glyphs can have many stylistic variations, as well as many variations in size, and still be recognizably the same characters.  (You may have noticed this already, comparing the glyphs in the lists above with those which are used in the text that you are now reading.)  A set of glyphs with a particular style and size is called a font; fonts of similar style make up a font family; and similar font families can be grouped into categories (such as serif or sans-serif).  But for the purposes of the present discussion, we are not concerned with styles or fonts — merely with the idea of distinct characters used for linguistic and related purposes.

Representing characters on computers

Computers don't understand glyphs — they only process bits and bytes.  However, their output devices can translate groups of bits (i.e., bytes) into glyphs for us to see.  Thus every computer manufacturer has had a system for encoding common glyphs as binary numbers, and for displaying those binary numbers as the corresponding glyphs on various output devices (printers, monitors, etc.).

The earliest machine-independent American standard for conversion of bytes to glyphs is ASCII — the American Standard Code for Information Interchange.  It uses 7-bit bytes, which correspond to decimal numbers 0 through 127.  ASCII defines the human meaning of each of those 128 numbers — the 94 visible glyphs shown in the lists above plus "space" plus 33 control characters, each of which has a particular function in controlling various aspects of computer-related hardware.  Looking at this in the opposite direction, ASCII defines how each of these common glyphs and control functions is to be encoded as a binary number for storage within and transmission between computers.
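
The arithmetic in the preceding paragraph can be checked with a few lines of code.  What follows is only an illustrative sketch in Python (a language that is not otherwise used in this essay); the variable names are our own invention.

```python
import string

# Codes 0-31 and 127 are the control characters; 32 is "space".
control = [n for n in range(128) if n < 32 or n == 127]
# Codes 33-126 are the visible glyphs.
visible = [chr(n) for n in range(33, 127)]

print(len(control), "control characters")   # 33 control characters
print(len(visible), "visible glyphs")       # 94 visible glyphs
print("".join(visible))

# The visible glyphs are exactly the letters, digits, punctuation and
# other special characters listed earlier on this page.
assert set(visible) == set(string.ascii_letters + string.digits + string.punctuation)

# Each character corresponds to one 7-bit number, and vice versa.
print(ord("A"), format(ord("A"), "07b"))    # 65 1000001
print(chr(97))                              # a
```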

Almost all modern computers use ASCII as the basis for encoding the most commonly used characters, and this is quite sufficient for handling the English language.  However, because such computers typically are constructed to use 8-bit bytes, there is an obvious opportunity to define the association of 128 more numbers (decimal 128 through 255) with glyphs, thus enabling computers to handle information in languages that require various diacritical marks attached to or associated with the ordinary Latin letters, as well as digraphs and additional punctuation marks and special characters.  Unfortunately, these "extended ASCII" characters have not been defined consistently among the various computer manufacturers, resulting in translation difficulties when information is transferred from one computer to another.  These difficulties were eventually addressed in two different ways.
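
To illustrate the inconsistency, the following Python sketch reads one and the same byte under three well-known 8-bit conventions.  The three conventions chosen here (ISO Latin-1, the classic Macintosh set, and the original IBM PC set) are merely examples of our own choosing; there have been many others.

```python
# One byte above 127, interpreted under three different 8-bit conventions.
raw = bytes([0xE9])
conventions = ["latin-1", "mac_roman", "cp437"]

decoded = {name: raw.decode(name) for name in conventions}
for name, ch in decoded.items():
    print(f"{name:10} byte 0xE9 displays as {ch}")

# Conversely, the same letter is stored as a different byte in each convention.
encoded = {name: "é".encode(name) for name in conventions}
for name, b in encoded.items():
    print(f"{name:10} é is stored as 0x{b.hex().upper()}")
# latin-1 stores it as 0xE9, mac_roman as 0x8E, cp437 as 0x82
```

A file written under one of these conventions and read under another will therefore show the wrong glyph wherever an accented letter was intended, even though every plain ASCII character comes through unharmed.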

One way arose out of a desire for a machine-independent standard that would encompass all known characters in all human-readable languages, plus the entire spectrum of character-like glyphs that have been used for special purposes in printing documents of many kinds.  This desire led to the development of Unicode, which contains encoding definitions for well over a hundred thousand different characters.  Each Unicode character is associated with a numeric value, called a code point, which can be expressed as either a decimal number or a hexadecimal number.  (How those numbers are handled within a computer is not our concern here.)
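
As a small illustration of code points, this Python sketch prints the decimal and hexadecimal values for a few characters.  The sample characters are our own choice.

```python
# Show the Unicode code point of each sample character in both notations.
for ch in "AÇç€":
    cp = ord(ch)
    print(f"{ch}  decimal {cp:5d}  hexadecimal {cp:04X}")

# Decimal and hexadecimal are just two ways of writing the same number.
assert chr(8364) == chr(0x20AC) == "€"
```

Note that the code point for "A" is 65, just as in ASCII; the first 128 Unicode code points coincide with the ASCII codes.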

The other way arose out of the development of the World Wide Web and the concept of a Web browser — a software program that can display information retrieved from a distant computer (a Web server) in a consistent human-readable form regardless of how that information is otherwise encoded in the host computer or the visitor's computer.  The HyperText Markup Language (HTML) which is used to present readable pages to Web visitors is based strictly on the graphical characters defined by ASCII.  Glyphs for characters that are not included in ASCII are defined in HTML by character references, of which there are two types — "character entity references" and "numeric character references"; both are short strings of ASCII characters in a special format.

Character entity references use predefined entity names, which are composed of a few ASCII letters in which case is significant.  Because of the requirement for predefinition, the number of characters that have associated entity names is relatively small; most are the combinations of letters with accents or diacritical marks as found in the most widely used Western languages, but various important currency marks (e.g., €, £ and ¥) and other common symbols are included.  In use, these entity names are always framed as character strings that begin with an ampersand (&) and end with a semicolon (;).  In addition, three ordinary ASCII characters (<, >, &) have equivalent entity names, so that they can be used either as HTML markup codes or as normal glyphs.  Those entity names are, respectively, "&lt;", "&gt;" and "&amp;".  (You don't need to remember them, but they make good illustrations of the concept of entity names.)  As another example of the use of entity names to represent non-English characters, consider Ç and ç — capital C and lower case c with a cedilla.  Those characters are not part of ASCII; they were displayed in the preceding sentence by using the entity names "&Ccedil;" and "&ccedil;".  (Those entity names are displayed here by nesting the entity name for ampersand. ;-)
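
The behavior of entity names can be tried out with the "html" module of Python's standard library, as in the following sketch.  This is only a demonstration of the concept; it is not how this Website is actually produced.

```python
import html
from html.entities import name2codepoint

# Entity names are case-sensitive: Ccedil and ccedil are different characters.
print(html.unescape("&Ccedil; and &ccedil;"))                # Ç and ç
print(name2codepoint["Ccedil"], name2codepoint["ccedil"])    # 199 231

# The three markup characters are replaced by their entity names.
print(html.escape("<b> & </b>", quote=False))   # &lt;b&gt; &amp; &lt;/b&gt;

# "Nesting" the ampersand entity makes an entity name itself visible.
print(html.escape("&Ccedil;"))                  # &amp;Ccedil;
```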

Numeric character references, by contrast, are not predefined; instead, they are formed with the numeric value, or code point, of any character or symbol that is included in Unicode (of which there are well over a hundred thousand).  They are similar in general appearance to character entity references, since they begin with an ampersand and end with a semicolon.  But between those framing characters is a numeric value which is preceded by either "#" (if the value is expressed in decimal digits) or "#x" (if the value is expressed in hexadecimal digits).

As an example, the Euro currency mark (€) can be coded in HTML as  &euro; (€)  or  &#8364; (€)  or  &#x20AC; (€).  If you do not see the Euro sign (like the capital letter C overlaid with the two bars of an equal sign) in all four pairs of parentheses in the preceding sentence, then your Web browser must be an old one that does not handle all of the varieties of Unicode and entity names.  Incidentally, the first Euro sign in that sentence was not coded at all; it was expressed directly in the native character set of the computer which prepared this page.  That works because of two facts:  Firstly, the HTML for this page includes an assertion that it is encoded with the character set UTF-8, which is one of the standard methods for defining the correspondence between Unicode code points and the internal workings of computers.  And secondly, UTF-8 is one of the internal coding methods of the computer on which this article was written.
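
The equivalence of those forms can be sketched in Python as follows.  The helper function numeric_refs is a hypothetical name of our own, not part of any library.

```python
import html

def numeric_refs(ch):
    """Return the decimal and hexadecimal numeric character references for ch."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:X};"

dec_ref, hex_ref = numeric_refs("€")
print(dec_ref, hex_ref)                 # &#8364; &#x20AC;

# The entity name and both numeric forms all decode to the same character.
forms = ["&euro;", dec_ref, hex_ref]
assert all(html.unescape(f) == "€" for f in forms)

# A Euro sign typed directly into a UTF-8 file is stored as three bytes.
utf8 = "€".encode("utf-8")
print(" ".join(f"{b:02x}" for b in utf8))   # e2 82 ac
```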

The complexities of non-English language representation

Because the historical events involved in the development of carillons and closely related tower bell instruments took place primarily in Western Europe (and eventually in North America, where the author of this essay has always lived), information about that development and those instruments is readily expressed in English, and relevant place names are either in English or in a few other languages that also use the Latin alphabet.  As indicated above, the English language uses only characters that can be encoded and displayed with ASCII, which is so widely used that no translation (or to be more accurate, transcoding) is required.  That is the case for the several computers used to develop and maintain this Website, so all English-language information presented here should always display correctly (aside from human-caused typographical errors).

Those other languages (primarily French and German) use a relatively small number of diacritical marks combined with various characters of the Latin alphabet.  Since all of those combined characters are contained in most of the various "extended ASCII" encoding conventions, it should be possible to manage them properly on our computers in order to express information in those languages correctly.  However, some special translation efforts are required when transferring information between computers, because the different computers follow different conventions for associating bytes with glyphs.  This has some impact on whether or not you will see in our Webpages just what we intended for you to see.

Extended ASCII characters are encoded in our database computer so as to print correctly on a Hewlett-Packard LaserJet III-P printer that gave good service for many years but no longer exists.  The database extraction program on that computer produces output files on disk in three forms — print files destined to become PDFs (in which extended ASCII has been transcoded from the HP convention to the Macintosh convention), HTML files destined to become Webpages (in which extended ASCII has been converted to HTML entities) and XML files destined to drive the display of regional locator maps (which must be fixed by hand before uploading). 
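
The general idea of transcoding from one extended ASCII convention to another can be sketched in Python as shown below.  This is NOT the actual extraction program; it ASSUMES that the printer convention in question is the one that Python calls "hp_roman8", and the function name and the sample place name are merely illustrative.

```python
def transcode(data, source="hp_roman8", target="mac_roman"):
    """Reinterpret bytes written under one 8-bit convention in another.

    The default codec names are assumptions made for this illustration.
    """
    return data.decode(source).encode(target)

name = "Besançon"                      # arbitrary sample with one accented letter
hp_bytes = name.encode("hp_roman8")    # as it might be stored for the printer
mac_bytes = transcode(hp_bytes)        # as it would be needed on a Macintosh

print(hp_bytes.hex(), mac_bytes.hex())

# The stored bytes differ, but the text they represent is the same.
assert hp_bytes != mac_bytes
assert mac_bytes.decode("mac_roman") == name
```

Only the byte that stands for the accented letter changes; the plain ASCII letters are stored identically under both conventions.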

For languages that require Latin-based characters that are not included in extended ASCII (e.g., Polish, Maltese and Czech), those characters cannot be incorporated into the database.  Instead, Anglicized versions of those characters (or the names that would contain them) are used in the database, and may be manually replaced by the proper versions before files are uploaded to the Website.  For names which are natively expressed in languages that do not use Latin-based characters (e.g., Russian, Japanese, Chinese), we make no attempt to record or display such names.  Instead, we use either an English equivalent name or an English transliteration of the native name (e.g., Moscow or Moskva for the capital city of Russia).
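
Our Anglicized versions are chosen by hand, but one possible way to approximate the idea by machine is sketched below in Python.  The function name, the small fallback table and the sample names are all hypothetical, introduced only for this illustration.

```python
import unicodedata

# A few letters have no accent that can be separated, so they need an
# explicit replacement.  This table is illustrative, not exhaustive.
FALLBACK = {"Ł": "L", "ł": "l", "Ħ": "H", "ħ": "h"}

def anglicize(text):
    """Replace accented Latin letters with their plain ASCII base letters."""
    out = []
    for ch in text:
        ch = FALLBACK.get(ch, ch)
        # Split the letter from its diacritical marks, then drop the marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

print(anglicize("Dvořák"))   # Dvorak
print(anglicize("Łódź"))     # Lodz
```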

The Links section of each site data page is not derived from the database, but is entirely constructed by hand (with copy-and-paste of various source material, of course).  The same holds true for the pages that list great bells.  As such, it is fairly easy to incorporate HTML entities in order to display non-ASCII characters properly.
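
Although we do this by hand, the substitution itself is mechanical, as the following Python sketch suggests.  The function name to_html_ascii is hypothetical, and the sketch deliberately leaves the markup characters (<, > and &) alone.

```python
import html
from html.entities import codepoint2name

def to_html_ascii(text):
    """Rewrite non-ASCII characters as HTML character references.

    An entity name is used where one is predefined; otherwise a decimal
    numeric character reference is used.
    """
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            parts.append(ch)                          # plain ASCII passes through
        elif cp in codepoint2name:
            parts.append(f"&{codepoint2name[cp]};")   # e.g. &ccedil;
        else:
            parts.append(f"&#{cp};")                  # e.g. &#321;
    return "".join(parts)

print(to_html_ascii("Besançon"))   # Besan&ccedil;on
print(to_html_ascii("Łódź"))       # &#321;&oacute;d&#378;

# A Web browser reverses the substitution when it displays the page.
assert html.unescape(to_html_ascii("Łódź")) == "Łódź"
```

The result contains nothing but ASCII characters, so it survives transfer between computers regardless of which extended ASCII convention each one follows.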

Summary

What you will see on this Website falls into three broad categories, as follows:

In hyperlinks to information elsewhere on the Web, it is occasionally necessary to use numeric character references in order to be sure that Web browsers can handle those links properly.

In spite of our best efforts to cope with the complexities of encoding non-English languages as outlined above, it is possible that this Website includes some characters that we have failed to translate or transcode properly, or for which we have not used the correct HTML entity.  Therefore:

If you see an incorrect character on any page,
  whether it is a typographic error,
    a misplaced diacritical mark,
      a malformed entity name,
        or something else,
please use the email link at the bottom of that page to tell us about it!

Otherwise, we might not notice it for a long time, and it could cause confusion for other visitors who are not as perceptive as you are.
Thanks in advance for your help in making this Website more useful for all its visitors.



This page was created 2020/03/03 and last revised 2024/01/11.

Please send comments or questions about this page to csz_stl@swbell.net