One issue many international Webmasters face is how to properly manage documents written in languages containing accented and other special, non-English, characters. Does it matter how the special characters are written? Do HTML documents need to contain both accented and non-accented words to be found in search engines?
Continuing our series on website internationalization for search engine visibility, we’ll take a look at how special characters can be specified in a document and how these characters are managed by search engines such as Google, Yahoo, Ask and Microsoft’s MSN.
In the early days of computing, engineers mapped each of the letters of the Latin alphabet used by the English language to a specific numeric code. This mapping became known as the ASCII character set. Unfortunately, no provision was made for accented and other special characters found in the many languages which share the Roman alphabet.
Eventually various computer hardware manufacturers added support for the special characters, each using a different mapping system. Unfortunately, these mappings are not generally compatible from one system to another. This problem is sometimes seen today when strange characters appear in text files, messages and web pages viewed on computer systems different from those on which they were written.
Tips for Inserting Special Characters in HTML Documents
Websites containing pages in languages other than English need to pay particular attention to how special characters are managed. Correct character management impacts both site usability and search engine optimization.
Several approaches are available, all of which are compatible with search engines, as we will see later. They can be grouped as:
- Avoid the use of special characters.
- Insert characters directly from the keyboard.
- Use HTML Entity References.
Avoid the use of Special Characters.
Instead of using an accented character, the accent is placed after the character, i.e. sara’ or sara` instead of sarà. This approach is often seen in Italy. While this approach is fine, use of accented characters can give a document a more professional look.
Insert Characters Directly from the Keyboard.
Website content is often copied into HTML from word processing software, such as OpenOffice Writer, or inserted directly in an HTML form. In these situations, special characters will often be stored using an encoding specific to the operating system. If the correct character encoding is not specified in the web page or by the web server, a user on a different operating system may end up seeing lots of strange characters.
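The effect is easy to reproduce. The following Python sketch is only an illustration: the helper name misread and the choice of the windows-1252 (cp1252) and cp437 code pages are assumptions made for the example. It stores text as bytes under one encoding and reads those bytes back under another.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one character set and decode the bytes with another."""
    raw_bytes = text.encode(written_as)
    # errors="replace" substitutes U+FFFD for any byte the reader cannot map.
    return raw_bytes.decode(read_as, errors="replace")


# Written and read with the same encoding: the text survives.
same_system = misread("sarà", "cp1252", "cp1252")

# Written as windows-1252 but read as an old DOS code page: the accented
# letter comes back as a different character.
other_system = misread("sarà", "cp1252", "cp437")

# Plain ASCII letters are shared by both code pages, so they are unaffected.
plain_ascii = misread("sara", "cp1252", "cp437")
```

Only the accented letter is damaged, which is why the problem often goes unnoticed on pages written mostly in unaccented text.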
The solution is to ensure that web pages specify the character encoding used in the page. The best approach is to do this at the web server level. Apache provides the AddCharset directive for this purpose. A lesser approach is to add a meta tag in the HTML page:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252" />
This tag should appear in the <head> section, before other tags such as the <title>, which may contain special characters. Microsoft’s developer network lists the common character set values.
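To check which encoding a page actually declares, something like the following Python sketch could be used. It relies only on the standard library. The names MetaCharsetFinder and declared_charset are hypothetical, and the sketch looks only at the http-equiv form of the meta tag shown above, not at the headers sent by the web server.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remember the charset of the first Content-Type meta tag seen."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta" or self.charset is not None:
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() == "content-type":
            # Reuse the standard library's header parsing for the
            # "text/html; charset=..." value.
            header = Message()
            header["Content-Type"] = attributes.get("content", "")
            self.charset = header.get_content_charset()


def declared_charset(html_text: str):
    """Return the declared charset in lowercase, or None if none is found."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    finder.close()
    return finder.charset
```

A page with no such tag yields None, which is a hint that browsers will have to guess the encoding.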
Use HTML Entity References.
The best approach is to use special notation to specify special characters in HTML. This notation uses basic ASCII characters to refer to special characters, removing the problems associated with getting an HTML document’s character set encoding right. The basic special notation is called a numeric character reference. Every special character is specified using a prefix composed of the ampersand and number/hash sign, &#, a 3 or 4 digit number to indicate the character of interest, and a semicolon ; as a suffix. Thus, è is represented as &#232;. Some of the numeric character references have corresponding character entity values, e.g. è can be written as &egrave; (egrave replaces #232). Similarly, é can be written either as &#233; or as &eacute;.
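As an illustration of the numeric notation described above, the following Python sketch converts every non-ASCII character in a string to its numeric character reference. The function name to_numeric_references is made up for the example. It uses the standard xmlcharrefreplace error handler and does not escape markup characters such as & or <, which would need separate handling.

```python
import html


def to_numeric_references(text: str) -> str:
    """Replace each non-ASCII character with a decimal reference like &#232;."""
    # The "xmlcharrefreplace" handler emits &#NNN; for anything ASCII cannot hold.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


encoded = to_numeric_references("perché è")

# html.unescape reverses the conversion, as a browser would when rendering.
decoded = html.unescape(encoded)
```

The result contains only ASCII characters, so it displays the same way regardless of the character set declared for the page.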
While character entity values are much easier to remember and to read, we strongly recommend the use of numeric character references to avoid several potential problems:
- Not all of the character entity values which are part of the HTML 4.0 standard are recognized by all of the software and programs used on the World Wide Web. This is particularly true of newer symbols such as that for the Euro, € (&euro;).
- Much HTML content is being used in XML format files, such as blog and sitemap RSS feeds. The XML standard only recognizes 5 character entities (&quot;, &amp;, &apos;, &lt;, &gt;), one of which, &apos;, is not even part of the HTML 4.0 standard.
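Existing content that already uses character entity values can be converted mechanically. The sketch below is one possible approach in Python, with the hypothetical function name named_to_numeric. It relies on the standard library's name2codepoint table of HTML 4 entity names and leaves any name it does not recognize untouched.

```python
import re
from html.entities import name2codepoint

# Matches references such as &egrave; but not numeric ones such as &#232;
_NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup: str) -> str:
    """Rewrite named character entities as decimal numeric references."""

    def replace(match):
        codepoint = name2codepoint.get(match.group(1))
        if codepoint is None:
            # Unknown name: keep the original text rather than guess.
            return match.group(0)
        return f"&#{codepoint};"

    return _NAMED_ENTITY.sub(replace, markup)


converted = named_to_numeric("caff&egrave; 5 &euro;")
```

Because the output uses only numeric references, the same text can be reused in an XML feed without relying on entity names that XML does not define.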
Search Engine Preferences
Search engines are designed to process any type of HTML available on the World Wide Web. As long as your website users see the right characters on Windows, Macintosh and Linux computers, you can be fairly certain search engines will not have any difficulty with how you have used special characters in your HTML. Yahoo does seem to have trouble processing some of the newer characters in the HTML 4.0 standard, such as the left and right angle quotation marks, « and ». However, this problem is limited to Yahoo and is independent of the use of numeric or character entity references.
What about Special Characters and Search Engine Queries?
If you’re still digesting the above look at how special characters can be specified in an HTML document, you’ll be relieved to know that search engines hide all of this complexity when a user performs a search.
In general, all of the major search engines will correctly return results for words containing special characters, even if a user did not type the special character! To illustrate this concept, we will consider a specific example.
After the German spelling reform, is street Strasse, or Straße? You need not worry. Each of the major search engines recognizes both variants. You can easily verify this by noting that both variants are highlighted in the search results.
| Search Engine | Simple ASCII | Special Character |
|---|---|---|
| Ask.com | Strasse | Straße |
| Ask Deutschland | Strasse | Straße |
| Google.de | Strasse | Straße |
| Google.com | Strasse | Straße |
| MSN | Strasse | Straße |
| MSN Deutschland | Strasse | Straße |
| Yahoo! Deutschland | Strasse | Straße |
| Yahoo! | Strasse | Straße |
Still not convinced? Compare Google searches for attivita and attività, the Italian word for activity. Both queries will probably list the Ministero delle Attività Produttive as the top result.
Behind the scenes search engines have mapped accented and special characters to their plain ASCII equivalents, where possible. Thus ö is usually equivalent to oe, à to a, etc.
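How each search engine does this is not public, so the following Python sketch is only an assumption about what such a mapping could look like, not a description of any engine's implementation. The function name fold_to_ascii and the small table of German expansions are invented for the example.

```python
import unicodedata

# Characters whose conventional plain spelling is more than "drop the accent".
# This table is illustrative and far from complete.
SPECIAL_EXPANSIONS = {"ß": "ss", "ä": "ae", "ö": "oe", "ü": "ue"}


def fold_to_ascii(word: str) -> str:
    """Lowercase a word and map accented letters to plain equivalents."""
    # Compose first so that "o" + combining diaeresis is seen as "ö".
    composed = unicodedata.normalize("NFC", word).lower()
    expanded = "".join(SPECIAL_EXPANSIONS.get(ch, ch) for ch in composed)
    # Decompose the rest and drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", expanded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Both spellings of each word end up with the same key.
street_keys = {fold_to_ascii("Strasse"), fold_to_ascii("Straße")}
activity_keys = {fold_to_ascii("attivita"), fold_to_ascii("attività")}
```

Indexing and querying on the folded key is what lets both spellings retrieve the same documents. It also means that meta and metà collapse to the same key, which is why a separate mechanism is needed when the accent changes the meaning.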
Slightly different emphasis may be given to words with and without special characters based on a combination of factors including the user’s search language. A user’s search language can be detected from the user’s search interface language and the country variant of the search engine being used, e.g. www.google.it or it.ask.com.
You can usually specify the language of your search interface and the number of results to return. Yahoo also has a Show Instant Search results feature, similar to Google’s Suggest. Each of the major search engines, Google, Yahoo, Ask and Microsoft MSN, supports search interface personalization.
Disambiguation: meta vs. metà
There are many cases where an accent or special character changes a word’s meaning, such as the Italian word meta. Without an accent, meta means goal or aim. With an accent, metà means half or middle. Fortunately, you can specify your exact intent in Google by using an advanced search operator as a prefix to the word. To specify that you mean metà and not meta, just prefix metà with a +, i.e. +metà. Yahoo says it supports exact character search: just place the word in double quotes, e.g. “straßen” or “strassen”. Unfortunately, it doesn’t seem to really work. Try “metà”.
Related Resources in this Website
You may be interested in the other articles we have written on website localization and search engine optimization.
- How Search Engines Detect Html Document Language.
- Search Engines and Site Localization: UK and US English Dialect Considerations for Site Internationalization
What’s your experience?
What experience have you had resolving internationalization issues?
Contact Us with feedback on your experience or to let us help you with your Search Engine Optimization and Web Analytics needs.
2 responses so far ↓
1 Andrea // Mar 2, 2009 at 9:43:00
I know that this is an old post, but I’ve ran into it just now and I’ve found this extremely wrong sentence “Instead of using an accented character, the accent is placed after the character, i.e. sara’ or sara` instead of sarà. This approach is often seen in Italy. While this approach is fine, use of accented characters can give a document a more professional look.”.
I’m Italian and I can assure you that the use of “apostrophes” as a replacement for accents is wrong! This is just typical behavior of lazy people.
Also consider that not all accents are at the end of the word and that there is a difference between accent acute and accent grave, that you can’t simplify with a mere apostrophe.
For instance, how would you write “Vìola”? “Vi`ola”?
So, next time, before writing wrong information, make some serious investigation.
2 sean // Mar 2, 2009 at 15:23:21
Andrea, you raise an interesting point.
In principle, you’re right, accented characters should be used instead of placing an apostrophe after the letter in question. Yet search engines must deal with what they find in the real world, especially when a particular usage is fairly common.
As you will surely know, Italians often use an apostrophe instead of the typographically correct accented character. This is especially so with E’ instead of È. So, linguistically right or wrong, the search engines cannot and generally don’t ignore what is going on in the wild.
In the end, the point of this article was to get people to use numeric html entities as the preferred approach; perhaps I could have been clearer on this point.