At the beginning of an XML document, the XML declaration can optionally declare the document’s encoding format. This typically looks something like this:
<?xml version="1.0" encoding="UTF-8"?>
Sometimes you’ll see the encoding as “UTF-8” or “UTF-16” (all caps), sometimes as “utf-8” or “utf-16” (lowercase). Which is correct? Or are both correct? The short answer is that the uppercase variant is preferred, but both are allowed, though that does not ensure that both variants are widely supported. This suggests the following recommended approach:
Be forgiving when reading, strict when writing. When consuming XML, you are fully standards-compliant if you match the encoding name case-insensitively. When producing XML, you are still standards-compliant if you generate the uppercase encoding name, and your output is more likely to be readable by potential consumers.
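As a rough illustration of that rule, here is a minimal Python sketch. The helper names encodings_match and make_declaration are made up for this post and do not come from any XML library; uppercasing the name on output is simply the convention argued for here, not something the XML Recommendation demands.

```python
def encodings_match(declared: str, expected: str) -> bool:
    """Reading side: compare encoding names without regard to case."""
    return declared.strip().casefold() == expected.strip().casefold()


def make_declaration(encoding: str = "UTF-8") -> str:
    """Writing side: always emit the uppercase spelling of the name."""
    return f'<?xml version="1.0" encoding="{encoding.upper()}"?>'


if __name__ == "__main__":
    # A forgiving reader treats both spellings as the same encoding.
    print(encodings_match("utf-8", "UTF-8"))
    # A strict writer emits uppercase even when handed lowercase.
    print(make_declaration("utf-16"))
```

The reader accepts whatever case it is given; the writer never passes a lowercase name along.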
Often the journey is more interesting than the destination when it comes to deciphering Internet standards; read on for the gory details.
At first blush, the lowercase usage appears consistent with XHTML (which requires element and attribute names to be lowercase), but does this convention apply to the XML declaration (which looks like a processing instruction and is metadata, not content)?
According to the W3C Recommendation for Extensible Markup Language (XML) 1.0 (Fourth Edition) section 4.3.3 Character Encoding in Entities:
“XML processors SHOULD match character encoding names in a case-insensitive way and SHOULD either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown.”
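To make the quoted behavior concrete, here is a small Python sketch of "match case-insensitively, otherwise treat as unknown". It leans on Python's own codec registry via codecs.lookup, which is only a stand-in for the IANA registry (the two lists are not identical), and resolve_encoding is a hypothetical helper name.

```python
import codecs
from typing import Optional


def resolve_encoding(name: str) -> Optional[str]:
    """Return a canonical codec name, or None if the name is unknown.

    codecs.lookup ignores case, so "UTF-8" and "utf-8" resolve to the
    same codec. Python's registry is an assumption standing in for IANA.
    """
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


if __name__ == "__main__":
    for candidate in ("UTF-8", "utf-8", "Utf-16", "x-no-such-encoding"):
        print(candidate, "->", resolve_encoding(candidate))
```

Returning None for an unrecognized name mirrors the "treat it as unknown" branch of the quoted sentence; what a real processor does next (reject the document, fall back, and so on) is outside this sketch.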
Looking up the values in the Internet Assigned Numbers Authority (IANA) registry for the official spellings of the encoding values, you will find “UTF-8” and “UTF-16” – listed in uppercase. IANA also cross-references RFC 3629, which uses all caps as well. And all of the examples around the XML Recommendation seem to use uppercase exclusively.
So the uppercase versions appear to be the “right” ones.
But are the lowercase versions actually wrong? They might be. The meaning of the word “SHOULD” in the above quoted text is governed by RFC 2119 where it is defined to mean:
“… that there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course.”
If the writers of the XML specification had wanted to insist that processors always treat this in a case-insensitive manner, they would have used the RFC 2119 word “MUST” instead.
So a processor can choose to ignore the part of the XML Recommendation where case-insensitive processing is suggested, and still be within the standard. A processor must always support uppercase; further, a processor supporting only uppercase is perfectly legal. Even if an uppercase-only processor seems unlikely, I’m going to standardize on all caps when I create XML files.
What about other character encodings, or what if one is not specified?
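A character encoding need not be explicitly specified. If there is no encoding declaration, no byte order mark, and no encoding information from an external source such as an HTTP header, UTF-8 is the default.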
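A consumer can apply that default with a few lines of code. The following Python sketch pulls the encoding name out of the XML declaration if one is present and falls back to UTF-8 otherwise. It is deliberately simplified: it assumes the document starts with ASCII-compatible bytes and has no byte order mark, so it will not detect UTF-16 input, and the name declared_encoding is hypothetical.

```python
import re

# Matches an encoding pseudo-attribute inside a leading XML declaration.
_ENCODING_RE = re.compile(
    rb"""^<\?xml\s[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._\-]*)["']"""
)


def declared_encoding(document: bytes, default: str = "UTF-8") -> str:
    """Return the declared encoding name, or the default if none is given."""
    match = _ENCODING_RE.match(document)
    if match is None:
        # No declaration, or a declaration without an encoding.
        return default
    return match.group(1).decode("ascii")


if __name__ == "__main__":
    print(declared_encoding(b'<?xml version="1.0" encoding="iso-8859-1"?><a/>'))
    print(declared_encoding(b'<?xml version="1.0"?><a/>'))
    print(declared_encoding(b"<a/>"))
```

The returned name keeps whatever case the document used, so it should still be compared case-insensitively before use.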
UTF-8 and UTF-16 are “universally” supported by XML parsers (the standard requires it); ISO-8859-1 is also often supported, but that character set is less complete (e.g., the euro symbol is missing).
“The root element can be preceded by an optional XML declaration. This element states what version of XML is in use (normally 1.0); it may also contain information about character encoding and external dependencies.
The specification requires that processors of XML support the pan-Unicode character encodings UTF-8 and UTF-16 (UTF-32 is not mandatory). The use of more limited encodings, such as those based on ISO/IEC 8859, is acknowledged and is widely used and supported.”
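The euro symbol makes a handy test case for the difference between the pan-Unicode encodings and a more limited one. This Python sketch uses only the built-in str.encode; can_encode is a made-up helper name.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character in text exists in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    euro = "\u20ac"  # the euro sign
    for name in ("UTF-8", "UTF-16", "ISO-8859-1"):
        print(name, can_encode(euro, name))
```

UTF-8 and UTF-16 can represent the character, while ISO-8859-1 cannot, which is one practical reason to prefer the Unicode encodings when producing XML.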
These details and conventions are important to anyone generating XML files, such as bloggers and podcasters publishing in the RSS and Atom formats.
In summary, if you are producing XML files, it is best to output uppercase “UTF-8” and “UTF-16” since that is always known to be supported. If you are consuming XML files, it is advisable to accept both uppercase and lowercase variants since both are permissible within a strict interpretation of the, uh, “letter” of the standards. And if you are consuming XML files, be sure to handle the case where the optional encoding is not specified at all; the default value is “UTF-8” if nothing else is specified.
Also of interest:
- RFC 3629
Note added 29-March-2011:
Above it states “Be forgiving when reading, strict when writing.” This is similar to Postel’s Law (aka Robustness Principle), which states:
Be conservative in what you send; be liberal in what you accept.
Since mine is consistent with this, while also more specific, I will consider it simply an appropriate specialization of Postel’s Law and leave it as is.