One issue many international Webmasters face is how to properly manage documents written in languages containing accented and other special, non-English, characters. Does it matter how the special characters are written? Do HTML documents need to contain both accented and non-accented words to be found in search engines?
Continuing our series on website internationalization for search engine visibility, we'll take a look at how special characters can be specified in a document and how these characters are managed by search engines such as Google, Yahoo, Ask and Microsoft's MSN.
In the early days of computing, engineers mapped each of the letters of the Latin alphabet used by the English language to a specific numeric code. This mapping became known as the ASCII character set. Unfortunately, no provision was made for accented and other special characters found in the many languages which share the Roman alphabet.
Eventually various computer hardware manufacturers added support for the special characters, each using a different mapping system. Unfortunately, these mappings are not generally compatible from one system to another. This problem is sometimes seen today when strange characters appear in text files, messages and web pages viewed on computer systems different from those on which they were written.
Websites containing pages in languages other than English need to pay particular attention to how special characters are managed. Correct character management impacts both site usability and search engine optimization.
Several approaches are available, all of which are compatible with search engines, as we will see later. They can be grouped as:
Instead of using an accented character, the accent is placed after the character, e.g. sara' or sara` instead of sarà. This approach is often seen in Italy. While this approach is fine, use of accented characters can give a document a more professional look.
Often website content is copied into HTML from word processing software, such as OpenOffice Writer, or directly inserted in an HTML form. In these situations, special characters will often be specified by an operating-system-specific encoding. If the correct character encoding is not specified in the web page or by the web server, a user on a different operating system may end up seeing lots of strange characters.
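The effect is easy to reproduce. The following minimal sketch, written in Python purely for illustration, saves the Italian word sarà using the Windows code page and then reads the same bytes back with the classic Macintosh encoding, much as a browser might when no character encoding is declared. The choice of these two encodings is an assumption made for the example.

```python
# Illustrative sketch: the same bytes read with two different encodings.
text = "sarà"

# Bytes as a Windows word processor might save them.
raw = text.encode("windows-1252")

# A reader that assumes the classic Macintosh encoding sees a different character.
misread = raw.decode("mac_roman")

# A reader that is told the right encoding recovers the original text.
correct = raw.decode("windows-1252")

print(repr(misread))
print(repr(correct))
```

The bytes never change; only the mapping used to interpret them does, which is why declaring the encoding matters.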
The solution is to ensure that web pages specify the character encoding used in the page. The best approach is to do this at the web server level. Apache provides the AddCharset directive for this purpose. A lesser approach is to add a meta tag in the HTML page:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252" />
This tag should appear in the <head> section, before other tags such as the <title>, which may contain special characters. Microsoft's developer network lists the common character set values.
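To see why the position of the tag matters, consider how a program reading the page might use it. The sketch below is a simplified assumption of that process, not a description of how any particular browser or search engine works: the hypothetical function decode_page looks for a charset declaration in the first 1024 bytes (an arbitrary window chosen for the example) and decodes the whole page accordingly.

```python
import re

# Matches charset=... inside a meta tag; ASCII-only, so it can run on raw bytes.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_-]+)', re.IGNORECASE)

def decode_page(raw: bytes, fallback: str = "ascii") -> str:
    """Decode raw HTML bytes using the charset declared near the top of the page."""
    match = META_CHARSET.search(raw[:1024])  # only look near the top of the page
    charset = match.group(1).decode("ascii") if match else fallback
    # Undecodable bytes become the replacement character instead of raising.
    return raw.decode(charset, errors="replace")

page = (
    '<head><META HTTP-EQUIV="Content-Type" '
    'CONTENT="text/html; charset=windows-1252" />'
    "<title>Attività</title></head>"
).encode("windows-1252")

print(decode_page(page))
```

Without the declaration, the accented letter in the title cannot be decoded reliably and is replaced by a placeholder character.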
The best approach is to use special notation to specify special characters in HTML. This notation uses basic ASCII characters to refer to special characters, removing the problems associated with getting an HTML document's character set encoding right. The basic special notation is called a numeric character reference. Every special character is specified using a prefix composed of the ampersand and number/hash sign, &#, a 3 or 4 digit number to indicate the character of interest, and a semicolon ; as a suffix. Thus, è is represented as &#232;. Some of the numeric character references have corresponding character entity references, i.e. è can be written as &egrave; (egrave replaces #232). Similarly é can be written as either &#233; or &eacute;.
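Converting existing text to this notation does not have to be done by hand. As a minimal sketch, assuming the content is available as a Python string, the standard library can produce numeric character references and can also confirm that the numeric and named forms denote the same character. The function name to_numeric_references is hypothetical.

```python
import html

def to_numeric_references(text: str) -> str:
    """Replace every non-ASCII character with its numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_numeric_references("è"))      # &#232;
print(to_numeric_references("sarà"))   # sar&#224;

# Both notations decode to the same character.
print(html.unescape("&#232;") == html.unescape("&egrave;"))  # True
```

The output contains only ASCII characters, so it displays correctly whatever character encoding the page declares.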
While character entity values are much easier to remember and to read, we strongly recommend the use of numeric character references to avoid several potential problems:
Search engines are designed to process any type of HTML available on the World Wide Web. As long as your website users see the right characters on Windows, Macintosh and Linux computers, you can be fairly certain search engines will not have any difficulty with how you have used special characters in your HTML. Yahoo does seem to have trouble processing some of the newer characters in the HTML 4.0 standard such as the left and right angle quotation marks, « and ». However, this problem is limited to Yahoo and is independent of the use of numeric or character entity references.
If you're still digesting the above look at how special characters can be specified in an HTML document, you'll be relieved to know that search engines hide all of this complexity when a user performs a search.
In general, all of the major search engines will correctly return results for words containing special characters, even if a user did not type the special character! To illustrate this concept, we will consider a specific example.
After the German spelling reform, is street Strasse, or Straße? You need not worry. Each of the major search engines recognizes both variants. You can easily verify this by noting that both variants are highlighted in the search results.
| Search Engine | Simple ASCII | Special Character |
|---|---|---|
| Ask.com | Strasse | Straße |
| Ask Deutschland | Strasse | Straße |
| Google.de | Strasse | Straße |
| Google.com | Strasse | Straße |
| MSN | Strasse | Straße |
| MSN Deutschland | Strasse | Straße |
| Yahoo! Deutschland | Strasse | Straße |
| Yahoo! | Strasse | Straße |
Still not convinced? Compare Google searches for attivita and attività, the Italian word for activity. Both queries will probably list the Ministero delle Attività Produttive as the top result.
Behind the scenes search engines have mapped accented and special characters to their plain ASCII equivalents, where possible. Thus ö is usually equivalent to oe, à to a, etc.
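The idea can be illustrated with a short sketch. This is not how any particular search engine is implemented; it is an assumption-laden example in Python showing one common way to fold accented and special characters to plain ASCII. The function name fold_to_ascii and the small transliteration table are hypothetical.

```python
import unicodedata

# Hypothetical transliteration table for characters with conventional
# multi-letter ASCII spellings; real search engines use their own rules.
TRANSLITERATIONS = {"ß": "ss", "ä": "ae", "ö": "oe", "ü": "ue"}

def fold_to_ascii(word: str) -> str:
    """Map accented and special characters to plain ASCII where possible."""
    # Use precomposed characters so the table lookup below can match them.
    word = unicodedata.normalize("NFC", word)
    replaced = "".join(TRANSLITERATIONS.get(ch, ch) for ch in word)
    # Decompose remaining accented letters, then drop the combining accents.
    decomposed = unicodedata.normalize("NFKD", replaced)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_to_ascii("Straße"))    # Strasse
print(fold_to_ascii("attività"))  # attivita
```

If both the indexed words and the user's query are folded in the same way, Straße and Strasse end up matching each other.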
Slightly different emphasis may be given to words with and without special characters based on a combination of factors including the user's search language. A user's search language can be detected from the user's search interface language and the country variant of the search engine being used, i.e. www.google.it or it.ask.com.
You can usually specify the language of your search interface and the number of results to return. Yahoo also has a Show Instant Search results feature, similar to Google's Suggest. Each of the major search engines, Google, Yahoo, Ask and Microsoft MSN, supports search interface personalization.
There are many cases where an accent or special character changes a word's meaning, such as in the case of the Italian word meta. Meta without an accent means goal or aim. With an accent, metà means half or middle. Fortunately, you can specify your exact intent in Google by using an advanced search operator as a prefix to the word. To specify you mean metà and not meta, just prefix metà with a +, i.e. +metà. Yahoo says it supports exact character search: just place the word in double quotes, i.e. "straßen" or "strassen". Unfortunately, it doesn't seem to really work. Try "metà".
What experience have you had resolving internationalization issues?