C string preprocessor for Emoji and Asian texts

The preprocessor is a text string converter inside IconEdit with an input side and an output side. The input converter finds the C strings in a source file and builds a font from the characters they use. The output converter rewrites the C strings so that they use that font on a left-to-right display. The converted C strings can be used directly by the compiler and the display.

Input converter for hexadecimal characters, Asiatic alphabets and classic 8-bit texts

The input reader converts C text strings to internal 16-bit Unicode.

  • Convert UTF-8 hexadecimal text strings to 16-bit Unicode.
  • Convert UTF-16 hexadecimal numbers in strings to 16-bit Unicode.
  • Convert UTF-32 hexadecimal numbers for emoji in strings to high plane Unicode.
  • Combine surrogate characters to find high plane characters such as emoji.
  • Move high plane characters to the private area in 16-bit Unicode.
  • Find combinations of characters, ligatures, and diacritics to make combined characters.
  • Find and add Arabic presentation characters.
  • Convert classic 8-bit encoded text strings to 16-bit Unicode.
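The surrogate step in the list above follows the standard UTF-16 rule, which can be sketched in C. This is a minimal illustration only, not IconEdit's own code; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* True if u is a UTF-16 high (leading) surrogate, U+D800..U+DBFF. */
static int is_high_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

/* True if u is a UTF-16 low (trailing) surrogate, U+DC00..U+DFFF. */
static int is_low_surrogate(uint16_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a surrogate pair into one high plane code point (above U+FFFF).
   Each surrogate carries 10 bits of the value minus 0x10000. */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    assert(is_high_surrogate(hi) && is_low_surrogate(lo));
    return 0x10000u
         + ((uint32_t)(hi - 0xD800u) << 10)
         + (uint32_t)(lo - 0xDC00u);
}
```

For example, the pair 0xD83D, 0xDE03 combines to U+1F603, the smiley used in the example strings of this page.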

After the input conversion, IconEdit creates all characters.

This example reads a CPP file that contains only these two lines:

L"Smiley স্মাইলি \U0001F603 !"

L"Cable Car ಕೇಬಲ್ ಕಾರು \U0001F6A1"

The input converter ignores anything outside the double quotes:

High Plane Emoji in 16-bit Font

Above is the resulting font, optimized for the text strings. The input converter moves high plane and combined characters to the private area U+E700 to U+F8FF. IconEdit always orders the characters in the font by their Unicode value. The new Unicode character value is shown above each character.
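The move into the private area can be pictured as a small lookup table. The sketch below is an assumption about how such a remap could work, not a description of IconEdit internals: the type `PrivateMap` and the function `to_16bit` are hypothetical, and slots are handed out in first-come order, which may differ from the order IconEdit uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PRIVATE_FIRST 0xE700u  /* start of the private range named in the text */
#define PRIVATE_LAST  0xF8FFu  /* end of the private range named in the text */
#define PRIVATE_SLOTS (PRIVATE_LAST - PRIVATE_FIRST + 1u)

/* Hypothetical remap table: slot i holds the original code point that is
   now represented by the 16-bit value PRIVATE_FIRST + i. */
typedef struct {
    uint32_t original[PRIVATE_SLOTS];
    size_t   used;
} PrivateMap;

/* Return a 16-bit value for cp. Characters that already fit in 16 bits
   pass through unchanged. High plane characters get a private slot,
   reusing the slot if the character was seen before.
   Returns 0 if the private range is exhausted. */
static uint16_t to_16bit(PrivateMap *map, uint32_t cp)
{
    assert(map != NULL);
    if (cp <= 0xFFFFu)
        return (uint16_t)cp;
    for (size_t i = 0; i < map->used; ++i) {
        if (map->original[i] == cp)
            return (uint16_t)(PRIVATE_FIRST + i);
    }
    if (map->used == PRIVATE_SLOTS)
        return 0;
    map->original[map->used] = cp;
    return (uint16_t)(PRIVATE_FIRST + map->used++);
}
```

With an empty table, U+1F603 would map to 0xE700 and a later U+1F6A1 to 0xE701, while plain letters such as 'A' keep their value. Combined characters could be handled the same way if each combination were given its own key.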

High Plane Emoji in 16-bit Text

This is how the text looks with the font. Only the text inside the string is drawn with the font; the rest is there for orientation.

Output converter for hexadecimal characters

The output converter writes the internal 16-bit Unicode as C text strings to a file. Texts with right-to-left characters are prepared for left-to-right displays. The output file is linked to the font, and the two should be used together by the compiler.

Convert the input text L"Smiley স্মাইলি \U0001F603 !" to one of the following output formats:

  • L"Smiley ই  !" Pure Unicode with all characters written as 16-bit Unicode. The private characters are not defined in Unicode but shown as block characters.
  • L"Smiley \xE702ই\xE703 \xE700 !" Pure Unicode with private characters as 16-bit hexadecimal. This makes the file easier to read for humans but makes no difference to the compiler.
  • L"Smiley \xE702\x0987\xE703 \xE700 !" UTF-16 hexadecimal for old editors that cannot read Unicode. This is still Unicode to the compiler.
  • "Smiley \xEE\x9C\x82\xE0\xA6\x87\xEE\x9C\x83 \xEE\x9C\x80 !" UTF-8 hexadecimal for old compilers that cannot understand Unicode strings. To the compiler, this is a classic 8-bit text. Use the UTF-8 option in the RAMTEX driver library to display the text as Unicode.
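The UTF-8 hexadecimal format can be reproduced with the standard UTF-8 encoding rule and C `\x` byte escapes. The following is a minimal sketch under that assumption; `utf8_encode16` and `write_utf8_hex` are hypothetical names, not functions of IconEdit or the RAMTEX library.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Encode one 16-bit code point as UTF-8. Returns the byte count (1..3).
   High plane characters are assumed to be in the private area already. */
static int utf8_encode16(uint16_t cp, unsigned char out[3])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Write the body of an 8-bit C string literal for a zero-terminated
   16-bit text. Printable ASCII is copied, everything else becomes \xHH.
   Returns the number of characters written, or -1 if dst is too small. */
static int write_utf8_hex(const uint16_t *text, char *dst, size_t dst_size)
{
    size_t pos = 0;
    int after_hex = 0;  /* the previous piece was a \xHH escape */

    if (dst_size == 0)
        return -1;
    for (; *text != 0; ++text) {
        unsigned char bytes[3];
        int n = utf8_encode16(*text, bytes);
        for (int i = 0; i < n; ++i) {
            unsigned char b = bytes[i];
            char piece[8];
            if (b >= 0x20 && b < 0x7F) {
                if (b == '"' || b == '\\')
                    snprintf(piece, sizeof piece, "\\%c", b);
                else if (after_hex && isxdigit(b))
                    /* Split the literal so the hex escape cannot
                       swallow a following hex digit. */
                    snprintf(piece, sizeof piece, "\"\"%c", b);
                else
                    snprintf(piece, sizeof piece, "%c", b);
                after_hex = 0;
            } else {
                snprintf(piece, sizeof piece, "\\x%02X", b);
                after_hex = 1;
            }
            size_t len = strlen(piece);
            if (pos + len + 1 > dst_size)
                return -1;
            memcpy(dst + pos, piece, len);
            pos += len;
        }
    }
    dst[pos] = '\0';
    return (int)pos;
}
```

The literal is split with `""` after an escape because a C hexadecimal escape continues for as long as hex digits follow, so a letter such as `a` directly after `\xA9` would otherwise be read as part of the escape.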

Memory consumption for different output formats

UTF-16 hexadecimal and pure Unicode always use 2 bytes of ROM space per character.

UTF-8 hexadecimal takes up a different amount of ROM space per character depending on language and alphabet:

  • 1.0 byte per character: American English.
  • 1.1 - 1.3 bytes per character: Other languages written with the Latin alphabet.
  • 2.0 - 2.2 bytes per character: Other European and Middle Eastern languages except Arabic.
  • 2.6 - 2.9 bytes per character: Arabic and South Asiatic languages.
  • 3.0 bytes per character: Chinese, Japanese, and Korean.
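These averages follow from the UTF-8 size rule: a 16-bit character below U+0080 takes 1 byte, one below U+0800 takes 2 bytes, and all others take 3 bytes. A minimal sketch for estimating the ROM use of a given text is shown here; the function names are hypothetical and the string terminator is not counted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of UTF-8 bytes needed for one 16-bit code point. */
static size_t utf8_size16(uint16_t cp)
{
    if (cp < 0x80)
        return 1;
    if (cp < 0x800)
        return 2;
    return 3;
}

/* Total UTF-8 bytes for a zero-terminated 16-bit text.
   The character count is stored in *chars; the same text stored as
   16-bit Unicode would take 2 * (*chars) bytes. */
static size_t utf8_total(const uint16_t *text, size_t *chars)
{
    size_t bytes = 0;
    size_t count = 0;

    assert(text != NULL && chars != NULL);
    for (; *text != 0; ++text) {
        bytes += utf8_size16(*text);
        ++count;
    }
    *chars = count;
    return bytes;
}
```

Dividing the returned byte total by the character count gives the bytes-per-character figure for a specific text, which can then be compared with the fixed 2 bytes of the 16-bit formats.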

Trace characters through the process with the mouse help and blue marks

Blue marks can be set by the mouse and follow the character through all windows. Use mouse help in all windows to see how the character is created:

High Plane Emoji in 16-bit Font

High Plane Emoji in 16-bit Text


Both mouse help and blue marks can be turned off and on at any time.

Other editing and conversion functions