Gabi und Sascha

Internationalized (I18N) URIs are a reliable source of encoding problems across different environments. The German alphabet has the umlauts (ä, ö, ü) and the Eszett (ß). Other languages have accents (French, for example) and many other extensions to the Latin character set. In a weblog it is nice to have a readable, descriptive URI rather than an opaque timestamp. It also gives additional information in a search engine result, for example on Google. To get one, you can take the title of a weblog entry, remove all characters that are not letters or digits, and append an .html suffix. But if the title contains umlauts or accented characters, the URI becomes hard to read, because RFC 3986 does not allow non-ASCII characters in a URI. They must be written as one or more triplets, each consisting of a percent sign (%) followed by the two-digit hexadecimal value of an octet. For example, the space character is encoded as %20.
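To make the readability problem concrete, here is a minimal sketch of percent-encoding, assuming the title is encoded as UTF-8. The class name PercentEncodingDemo and the helper percentEncode are hypothetical names chosen for this illustration and are not part of any weblog software.

```java
import java.nio.charset.StandardCharsets;

public class PercentEncodingDemo {

    // Hypothetical helper: keeps ASCII letters, digits and "-._~" as they are
    // and writes every other UTF-8 octet as a %XX triplet.
    static String percentEncode(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // treat the byte as an unsigned octet
            boolean unreserved = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~';
            if (unreserved) {
                sb.append((char) v);
            } else {
                sb.append('%').append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Grüße aus Berlin", written with Unicode escapes so the source file stays ASCII
        System.out.println(percentEncode("Gr\u00FC\u00DFe aus Berlin"));
        // prints Gr%C3%BC%C3%9Fe%20aus%20Berlin
    }
}
```

In UTF-8 each umlaut takes two octets, so a single ü already turns into six characters in the URI.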

One solution for weblog software is to transform special characters such as umlauts, the Eszett, or accented characters into plain ASCII characters. All this needs is a transformation table that maps every problematic character to plain ASCII (in this case, basic Latin) characters. It makes no sense to transform characters from the Greek or Cyrillic character sets into Latin characters this way.

Here is a simple Java code example:


   // Characters to replace and their ASCII substitutes, matched by index.
   static final char[] ORIG = {'ß', 'ä', 'ö', 'ü' /* ... */};
   static final String[] REPLACE = {"ss", "ae", "oe", "ue" /* ... */};

   public String latinBasedToAsciiTransformation(final String title) {

       final char[] titleArray = title.toCharArray();
       final int titleLength = title.length();
       final StringBuilder sb = new StringBuilder(titleLength);

       for (int i = 0; i < titleLength; i++) {
           boolean replaced = false;
           for (int j = 0; j < ORIG.length; j++) {
               if (titleArray[i] == ORIG[j]) {
                   sb.append(REPLACE[j]);
                   replaced = true;
                   break;
               }
           }
           if (!replaced) {
               sb.append(titleArray[i]);
           }
       }

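       String newTitle = sb.toString();
       // Drop everything except ASCII letters, digits and underscores.
       newTitle = newTitle.replaceAll("[^a-zA-Z0-9_]", "");
       // Collapse runs of underscores, then trim them from both ends.
       newTitle = newTitle.replaceAll("_+", "_");
       newTitle = newTitle.replaceAll("^_*", "");
       newTitle = newTitle.replaceAll("_*$", "");

       return newTitle;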
   }
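For accented characters, the table could possibly be complemented by Unicode decomposition instead of listing every letter by hand. The following is only a sketch of that different technique, not the table approach above. It assumes java.text.Normalizer from the JDK is available, and the names AccentStripper and stripAccents are hypothetical.

```java
import java.text.Normalizer;

public class AccentStripper {

    // Hypothetical helper: decomposes each accented letter into its base
    // letter plus combining marks (NFD), then removes the combining marks.
    static String stripAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // "Café à la crème", written with Unicode escapes so the source file stays ASCII
        System.out.println(stripAccents("Caf\u00E9 \u00E0 la cr\u00E8me"));
        // prints Cafe a la creme
    }
}
```

This would turn ä into a rather than ae and leaves ß untouched, so the German replacements from the table would have to run first.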

Later this week I'll write a simple LatinBasedTitlePermalinkProvider for the Pebble weblog software.

The main idea for this solution came to me on a drive from Berlin to Nordenham on Good Friday 2005. It is my own idea. But what if some company somewhere had the same idea and now holds a patent on it? As far as I am concerned it is open source, and I have neither the time nor the money to check for such a patent.