For some time, Delphi has had a little-known type called UTF8String. It was little-known because it didn’t really work as advertised. Try this in Delphi 2007:
var S: UTF8String;
begin
  S := 'Tiburón';
  WriteLn(Length(S));
end.
Though S is declared as UTF8String, Delphi 2007 stores the string using the default Windows code page rather than UTF-8, so Length(S) returns 7 (one byte per character in a single-byte code page). That’s because in Delphi 2007, you’ll find this declaration in System.pas:
type UTF8String = type string;
This means that in Delphi 2007, there’s really no difference between UTF8String and AnsiString. In Delphi 2009, however, you’ll find this declaration:
type UTF8String = type AnsiString(65001);
65001 is the code page number for UTF-8 on the Windows platform. You can declare your own string types this way using any code page understood by the WideCharToMultiByte() and MultiByteToWideChar() API calls. For example, if you assign a UnicodeString to a UTF8String, WideCharToMultiByte(65001) is called to convert the string from UTF-16 to UTF-8. This is no different from Delphi 2007 (or 2009) calling WideCharToMultiByte(0) when you assign a WideString to an AnsiString.
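To make the runtime conversion concrete, here is a minimal sketch for Delphi 2009 or later. The program and variable names are made up for illustration, and it assumes the source file is saved in an encoding the compiler reads correctly, so that the ó survives as a single precomposed character.

```delphi
program Utf8AssignDemo;  // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

var
  U: UnicodeString;  // UTF-16
  S: UTF8String;     // AnsiString(65001)
begin
  U := 'Tiburón';
  // Assigning UTF-16 to UTF-8 converts at runtime. The explicit cast
  // avoids the compiler's implicit string cast warning.
  S := UTF8String(U);
  WriteLn(Length(U));  // UTF-16 code units: 7
  WriteLn(Length(S));  // UTF-8 bytes: 8, because the ó takes two
end.
```

Length() counts code units of the string's own encoding, not characters, which is why the two numbers differ for the same text.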
In Delphi 2009, the code snippet at the top of this post will convert “Tiburón” to UTF-8 at compile time. At runtime, 8 bytes are loaded directly into S. There will be no call to WideCharToMultiByte() at runtime for this literal assignment. The accented ó takes up two bytes when encoded as UTF-8. Length(S) will return 8.
You can easily declare your own typed AnsiStrings in Delphi 2009. If UTF8String is too modern for you, try this:
type EBCDICString = type AnsiString(37);
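As a rough sketch of what such a type does, the program below assigns a literal to an EBCDICString and dumps its bytes. The program name is made up for illustration, and it assumes Delphi 2009 or later on a Windows system where code page 37 is installed.

```delphi
program EbcdicDemo;  // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

type
  EBCDICString = type AnsiString(37);

var
  E: EBCDICString;
  I: Integer;
begin
  E := 'ABC';  // converted to code page 37 on assignment
  for I := 1 to Length(E) do
    Write(Ord(E[I]), ' ');  // in code page 37, 'A' is $C1 (193), not 65
  WriteLn;
end.
```

Assigning E back to a UnicodeString should round-trip the text, since the conversion is driven by the code page attached to the type.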