  1. /**
  2. \page unicode Unicode and UTF-8 Support
  3. This chapter explains how FLTK handles international
  4. text via Unicode and UTF-8.
  5. Unicode support was only recently added to FLTK and is
  6. still incomplete. This chapter is Work in Progress, reflecting
  7. the current state of Unicode support.
  8. \section unicode_about About Unicode, ISO 10646 and UTF-8
  9. The summary of Unicode, ISO 10646 and UTF-8 given below is
  10. deliberately brief, and provides just enough information for
  11. the rest of this chapter.
  12. For further information, please see:
  13. - http://www.unicode.org
  14. - http://www.iso.org
  15. - http://en.wikipedia.org/wiki/Unicode
  16. - http://www.cl.cam.ac.uk/~mgk25/unicode.html
  17. - http://www.apps.ietf.org/rfc/rfc3629.html
  18. \par The Unicode Standard
  19. The Unicode Standard was originally developed by a consortium of mainly
  20. US computer manufacturers and developers of multi-lingual software.
It has now become a de facto standard for character encoding,
  22. and is supported by most of the major computing companies in the world.
  23. Before Unicode, many different systems, on different platforms,
  24. had been developed for encoding characters for different languages,
  25. but no single encoding could satisfy all languages.
  26. Unicode provides access to over 100,000 characters
  27. used in all the major languages written today,
  28. and is independent of platform and language.
  29. Unicode also provides higher-level concepts needed for text processing
  30. and typographic publishing systems, such as algorithms for sorting and
  31. comparing text, composite character and text rendering, right-to-left
  32. and bi-directional text handling.
  33. <i>There are currently no plans to add this extra functionality to FLTK.</i>
  34. \par ISO 10646
The International Organization for Standardization (ISO) had also
  36. been trying to develop a single unified character set.
  37. Although both ISO and the Unicode Consortium continue to publish
  38. their own standards, they have agreed to coordinate their work so
  39. that specific versions of the Unicode and ISO 10646 standards are
  40. compatible with each other.
  41. The international standard ISO 10646 defines the
  42. <b>Universal Character Set</b> (UCS)
  43. which contains the characters required for almost all known languages.
  44. The standard also defines three different implementation levels specifying
  45. how these characters can be combined.
  46. <i>There are currently no plans for handling the different implementation
  47. levels or the combining characters in FLTK.</i>
  48. In UCS, characters have a unique numerical code and an official name,
  49. and are usually shown using 'U+' and the code in hexadecimal,
  50. e.g. U+0041 is the "Latin capital letter A".
  51. The UCS characters U+0000 to U+007F correspond to US-ASCII,
  52. and U+0000 to U+00FF correspond to ISO 8859-1 (Latin1).
  53. ISO 10646 was originally designed to handle a 31-bit character set
from U+00000000 to U+7FFFFFFF, but the current idea is that 21 bits
  55. will be sufficient for all future needs, giving characters up to
  56. U+10FFFF. The complete character set is sub-divided into \e planes.
  57. <i>Plane 0</i>, also known as the <b>Basic Multilingual Plane</b>
(BMP), ranges from U+0000 to U+FFFF and consists of the most commonly
  59. used characters from previous encoding standards. Other planes
  60. contain characters for specialist applications.
  61. \todo
  62. Do we need this info about planes?
  63. The UCS also defines various methods of encoding characters as
  64. a sequence of bytes.
  65. UCS-2 encodes Unicode characters into two bytes,
  66. which is wasteful if you are only dealing with ASCII or Latin1 text,
  67. and insufficient if you need characters above U+00FFFF.
  68. UCS-4 uses four bytes, which lets it handle higher characters,
  69. but this is even more wasteful for ASCII or Latin1.
  70. \par UTF-8
  71. The Unicode standard defines various UCS Transformation Formats.
  72. UTF-16 and UTF-32 are based on units of two and four bytes.
UCS characters requiring more than 16 bits are encoded using
  74. "surrogate pairs" in UTF-16.
  75. UTF-8 encodes all Unicode characters into variable length
  76. sequences of bytes. Unicode characters in the 7-bit ASCII
  77. range map to the same value and are represented as a single byte,
  78. making the transformation to Unicode quick and easy.
  79. All UCS characters above U+007F are encoded as a sequence of
  80. several bytes. The top bits of the first byte are set to show
the length of the byte sequence, and subsequent bytes are
always in the range 0x80 to 0xBF. This combination provides
  83. some level of synchronisation and error detection.
  84. <table summary="Unicode character byte sequences" align="center">
  85. <tr>
  86. <td>Unicode range</td>
  87. <td>Byte sequences</td>
  88. </tr>
  89. <tr>
  90. <td><tt>U+00000000 - U+0000007F</tt></td>
  91. <td><tt>0xxxxxxx</tt></td>
  92. </tr>
  93. <tr>
  94. <td><tt>U+00000080 - U+000007FF</tt></td>
  95. <td><tt>110xxxxx 10xxxxxx</tt></td>
  96. </tr>
  97. <tr>
  98. <td><tt>U+00000800 - U+0000FFFF</tt></td>
  99. <td><tt>1110xxxx 10xxxxxx 10xxxxxx</tt></td>
  100. </tr>
  101. <tr>
  102. <td><tt>U+00010000 - U+001FFFFF</tt></td>
  103. <td><tt>11110xxx 10xxxxxx 10xxxxxx 10xxxxxx</tt></td>
  104. </tr>
  105. <tr>
  106. <td><tt>U+00200000 - U+03FFFFFF</tt></td>
  107. <td><tt>111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx</tt></td>
  108. </tr>
  109. <tr>
  110. <td><tt>U+04000000 - U+7FFFFFFF</tt></td>
  111. <td><tt>1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx</tt></td>
  112. </tr>
  113. </table>
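The bit patterns in the table above can be sketched in code as follows. This is a minimal illustration only, not FLTK's %fl_utf8encode(): the name \c utf8_encode_sketch is hypothetical, the function stops at U+10FFFF (the first four rows of the table), and it makes no attempt to reject UTF-16 surrogate values.

```cpp
#include <cassert>

// Hypothetical helper, not part of FLTK: encodes one UCS value into buf
// using the 1- to 4-byte patterns from the table above.
// Returns the number of bytes written, or 0 if ucs is above U+10FFFF.
int utf8_encode_sketch(unsigned ucs, char *buf) {
  assert(buf != nullptr);
  if (ucs < 0x80) {                       // 0xxxxxxx
    buf[0] = static_cast<char>(ucs);
    return 1;
  }
  if (ucs < 0x800) {                      // 110xxxxx 10xxxxxx
    buf[0] = static_cast<char>(0xC0 | (ucs >> 6));
    buf[1] = static_cast<char>(0x80 | (ucs & 0x3F));
    return 2;
  }
  if (ucs < 0x10000) {                    // 1110xxxx 10xxxxxx 10xxxxxx
    buf[0] = static_cast<char>(0xE0 | (ucs >> 12));
    buf[1] = static_cast<char>(0x80 | ((ucs >> 6) & 0x3F));
    buf[2] = static_cast<char>(0x80 | (ucs & 0x3F));
    return 3;
  }
  if (ucs <= 0x10FFFF) {                  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    buf[0] = static_cast<char>(0xF0 | (ucs >> 18));
    buf[1] = static_cast<char>(0x80 | ((ucs >> 12) & 0x3F));
    buf[2] = static_cast<char>(0x80 | ((ucs >> 6) & 0x3F));
    buf[3] = static_cast<char>(0x80 | (ucs & 0x3F));
    return 4;
  }
  return 0;                               // outside the range handled here
}
```

For example, U+20AC (the euro sign) falls in the third row and is written as the three bytes 0xE2 0x82 0xAC.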
  114. Moving from ASCII encoding to Unicode will allow all new FLTK
  115. applications to be easily internationalized and used all
over the world. By choosing UTF-8 encoding, FLTK remains
largely source-code compatible with previous iterations of the
library.
  119. \section unicode_in_fltk Unicode in FLTK
  120. \todo
  121. Work through the code and this documentation to harmonize
  122. the [<b>OksiD</b>] and [<b>fltk2</b>] functions.
  123. FLTK will be entirely converted to Unicode using UTF-8 encoding.
  124. If a different encoding is required by the underlying operating
  125. system, FLTK will convert the string as needed.
  126. It is important to note that the initial implementation of
  127. Unicode and UTF-8 in FLTK involves three important areas:
  128. - provision of Unicode character tables and some simple related functions;
  129. - conversion of char* variables and function parameters from single byte
  130. per character representation to UTF-8 variable length sequences;
  131. - modifications to the display font interface to accept general
  132. Unicode character or UCS code numbers instead of just ASCII or Latin1
  133. characters.
  134. The current implementation of Unicode / UTF-8 in FLTK will impose
  135. the following limitations:
  136. - An implementation note in the [<b>OksiD</b>] code says that all functions
  137. are LIMITED to 24 bit Unicode values, but also says that only 16 bits
  138. are really used under linux and win32.
  139. <b>[Can we verify this?]</b>
  140. - The [<b>fltk2</b>] %fl_utf8encode() and %fl_utf8decode() functions are
  141. designed to handle Unicode characters in the range U+000000 to U+10FFFF
  142. inclusive, which covers all UTF-16 characters, as specified in RFC 3629.
  143. <i>Note that the user must first convert UTF-16 surrogate pairs to UCS.</i>
  144. - FLTK will only handle single characters, so composed characters
  145. consisting of a base character and floating accent characters
  146. will be treated as multiple characters;
  147. - FLTK will only compare or sort strings on a byte by byte basis
  148. and not on a general Unicode character basis;
- FLTK will not handle right-to-left or bi-directional text.
  150. \todo
  151. Verify 16/24 bit Unicode limit for different character sets?
  152. OksiD's code appears limited to 16-bit whereas the FLTK2 code
  153. appears to handle a wider set. What about illegal characters?
  154. See comments in %fl_utf8fromwc() and %fl_utf8toUtf16().
  155. \section unicode_illegals Illegal Unicode and UTF-8 sequences
  156. Three pre-processor variables are defined in the source code that
  157. determine how %fl_utf8decode() handles illegal UTF-8 sequences:
  158. - if ERRORS_TO_CP1252 is set to 1 (the default), %fl_utf8decode() will
  159. assume that a byte sequence starting with a byte in the range 0x80
  160. to 0x9f represents a Microsoft CP1252 character, and will instead
  161. return the value of an equivalent UCS character. Otherwise, it
  162. will be processed as an illegal byte value as described below.
  163. - if STRICT_RFC3629 is set to 1 (not the default!) then UTF-8
  164. sequences that correspond to illegal UCS values are treated as
  165. errors. Illegal UCS values include those above U+10FFFF, or
  166. corresponding to UTF-16 surrogate pairs. Illegal byte values
  167. are handled as described below.
  168. - if ERRORS_TO_ISO8859_1 is set to 1 (the default), the illegal
  169. byte value is returned unchanged, otherwise 0xFFFD, the Unicode
  170. REPLACEMENT CHARACTER, is returned instead.
  171. %fl_utf8encode() is less strict, and only generates the UTF-8
  172. sequence for 0xFFFD, the Unicode REPLACEMENT CHARACTER, if it is
  173. asked to encode a UCS value above U+10FFFF.
  174. Many of the [<b>fltk2</b>] functions below use %fl_utf8decode() and
  175. %fl_utf8encode() in their own implementation, and are therefore
  176. somewhat protected from bad UTF-8 sequences.
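The strictest combination described above (STRICT_RFC3629 set to 1, with ERRORS_TO_CP1252 and ERRORS_TO_ISO8859_1 both set to 0) can be approximated by the sketch below. It is an illustration under those assumptions and not the actual %fl_utf8decode() source; the name \c utf8_decode_strict_sketch is hypothetical, and the choice to consume exactly one byte on error is an assumption of this sketch.

```cpp
#include <cassert>

// Hypothetical strict decoder, not FLTK's fl_utf8decode().
// Decodes the sequence starting at p (which must be below end) and stores
// the number of bytes consumed in *len. Any illegal, overlong, truncated,
// surrogate, or out-of-range sequence yields U+FFFD and consumes one byte.
unsigned utf8_decode_strict_sketch(const char *p, const char *end, int *len) {
  assert(p != nullptr && len != nullptr && p < end);
  const unsigned kReplacement = 0xFFFD;
  const unsigned char c = static_cast<unsigned char>(*p);
  *len = 1;
  if (c < 0x80) return c;                 // plain ASCII
  int n;
  unsigned ucs, min;
  if (c >= 0xC2 && c < 0xE0)      { n = 2; ucs = c & 0x1F; min = 0x80; }
  else if ((c & 0xF0) == 0xE0)    { n = 3; ucs = c & 0x0F; min = 0x800; }
  else if (c >= 0xF0 && c < 0xF5) { n = 4; ucs = c & 0x07; min = 0x10000; }
  else return kReplacement;               // trailing byte or invalid lead byte
  if (end - p < n) return kReplacement;   // sequence runs past end
  for (int i = 1; i < n; ++i) {
    const unsigned char t = static_cast<unsigned char>(p[i]);
    if ((t & 0xC0) != 0x80) return kReplacement;  // not a 10xxxxxx byte
    ucs = (ucs << 6) | (t & 0x3F);
  }
  if (ucs < min) return kReplacement;                       // overlong form
  if (ucs > 0x10FFFF) return kReplacement;                  // beyond Unicode
  if (ucs >= 0xD800 && ucs <= 0xDFFF) return kReplacement;  // surrogate
  *len = n;
  return ucs;
}
```

Consuming a single byte on error lets a caller resynchronise at the next byte rather than skipping data that may itself begin a valid sequence.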
  177. The [<b>OksiD</b>] %fl_utf8len() function assumes that the byte it is
  178. passed is the first byte in a UTF-8 sequence, and returns the length
  179. of the sequence. Trailing bytes in a UTF-8 sequence will return -1.
  180. - \b WARNING:
%fl_utf8len() cannot distinguish between single
  182. bytes representing Microsoft CP1252 characters 0x80-0x9f and
  183. those forming part of a valid UTF-8 sequence. You are strongly
  184. advised not to use %fl_utf8len() in your own code unless you
  185. know that the byte sequence contains only valid UTF-8 sequences.
  186. - \b WARNING:
Some of the [<b>OksiD</b>] functions below still use %fl_utf8len() in
  188. their implementations. These may need further validation.
  189. Please see the individual function description for further details
  190. about error handling and return values.
  191. \section unicode_fltk_calls FLTK Unicode and UTF-8 functions
  192. This section currently provides a brief overview of the functions.
  193. For more details, consult the main text for each function via its link.
  194. int fl_utf8locale()
  195. \b FLTK2
  196. <br>
  197. \par
  198. \p %fl_utf8locale() returns true if the "locale" seems to indicate
  199. that UTF-8 encoding is used.
  200. \par
<i>It is highly recommended that you change your system so this does return
true!</i>
  203. int fl_utf8test(const char *src, unsigned len)
  204. \b FLTK2
  205. <br>
  206. \par
  207. \p %fl_utf8test() examines the first \p len bytes of \p src.
  208. It returns 0 if there are any illegal UTF-8 sequences;
  209. 1 if \p src contains plain ASCII or if \p len is zero;
  210. or 2, 3 or 4 to indicate the range of Unicode characters found.
  211. int fl_utf_nb_char(const unsigned char *buf, int len)
  212. \b OksiD
  213. <br>
  214. \par
Returns the number of UTF-8 characters in the first \p len bytes of \p buf.
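\par
One common way to count characters in well-formed UTF-8 is to count every byte that is not a trailing (10xxxxxx) byte. The sketch below shows that idea only; \c utf8_count_chars_sketch is a hypothetical name, the function assumes valid input, and it is not claimed to match how %fl_utf_nb_char() is implemented.

```cpp
#include <cassert>

// Hypothetical helper, not part of FLTK: counts the characters in the
// first len bytes of buf by counting bytes that start a sequence.
// Assumes buf holds valid UTF-8; stray trailing bytes are not detected.
int utf8_count_chars_sketch(const unsigned char *buf, int len) {
  assert(len >= 0 && (buf != nullptr || len == 0));
  int count = 0;
  for (int i = 0; i < len; ++i) {
    if ((buf[i] & 0xC0) != 0x80) ++count;   // skip 10xxxxxx trailing bytes
  }
  return count;
}
```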
  216. int fl_unichar_to_utf8_size(Fl_Unichar)
  217. <br>
  218. int fl_utf8bytes(unsigned ucs)
  219. <br>
  220. \par
  221. Returns the number of bytes needed to encode \p ucs in UTF-8.
  222. int fl_utf8len(char c)
  223. \b OksiD
  224. <br>
  225. \par
  226. If \p c is a valid first byte of a UTF-8 encoded character sequence,
  227. \p %fl_utf8len() will return the number of bytes in that sequence.
  228. It returns -1 if \p c is not a valid first byte.
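\par
The relationship between the first byte and the sequence length can be sketched as below. This is an assumption-laden illustration, not the %fl_utf8len() source: the name \c utf8_seq_len_sketch is hypothetical, and this version only recognises the 1- to 4-byte forms permitted by RFC 3629, whereas the real function may also report the longer legacy forms.

```cpp
// Hypothetical helper, not FLTK's fl_utf8len(): returns the length of the
// UTF-8 sequence that begins with byte c, or -1 if c cannot begin one.
// It does not reject the overlong lead bytes 0xC0 and 0xC1.
int utf8_seq_len_sketch(char c) {
  const unsigned char u = static_cast<unsigned char>(c);
  if (u < 0x80) return 1;    // 0xxxxxxx: ASCII
  if (u < 0xC0) return -1;   // 10xxxxxx: trailing byte, not a first byte
  if (u < 0xE0) return 2;    // 110xxxxx
  if (u < 0xF0) return 3;    // 1110xxxx
  if (u < 0xF8) return 4;    // 11110xxx
  return -1;                 // 5- and 6-byte lead bytes are not handled here
}
```

Note that a byte such as 0x93 returns -1 here whether it is a trailing byte of a valid sequence or a lone CP1252 character, which is the ambiguity the warning in the previous section describes.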
  229. unsigned int fl_nonspacing(unsigned int ucs)
  230. \b OksiD
  231. <br>
  232. \par
  233. Returns true if \p ucs is a non-spacing character.
  234. <b>[What are non-spacing characters?]</b>
  235. const char* fl_utf8back(const char *p, const char *start, const char *end)
  236. \b FLTK2
  237. <br>
  238. const char* fl_utf8fwd(const char *p, const char *start, const char *end)
  239. \b FLTK2
  240. <br>
  241. \par
  242. If \p p already points to the start of a UTF-8 character sequence,
  243. these functions will return \p p.
  244. Otherwise \p %fl_utf8back() searches backwards from \p p
  245. and \p %fl_utf8fwd() searches forwards from \p p,
  246. within the \p start and \p end limits,
  247. looking for the start of a UTF-8 character.
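\par
The searching described above amounts to skipping trailing (10xxxxxx) bytes. The following sketch shows only that core idea; \c utf8_back_sketch and \c utf8_fwd_sketch are hypothetical names with simplified parameter lists, and unlike the FLTK functions they do not validate the sequence they land on.

```cpp
#include <cassert>

// True if the byte at p is a UTF-8 trailing byte (10xxxxxx).
static bool is_trailing_byte(const char *p) {
  return (static_cast<unsigned char>(*p) & 0xC0) == 0x80;
}

// Hypothetical helper: moves p backwards, no further than start, until it
// no longer points at a trailing byte.
const char *utf8_back_sketch(const char *p, const char *start) {
  assert(start != nullptr && p >= start);
  while (p > start && is_trailing_byte(p)) --p;
  return p;
}

// Hypothetical helper: moves p forwards, no further than end, until it
// no longer points at a trailing byte.
const char *utf8_fwd_sketch(const char *p, const char *end) {
  assert(p != nullptr && p <= end);
  while (p < end && is_trailing_byte(p)) ++p;
  return p;
}
```

Helpers of this kind are typically used to keep a text cursor from stopping in the middle of a multi-byte character.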
  248. unsigned int fl_utf8decode(const char *p, const char *end, int *len)
  249. \b FLTK2
  250. <br>
  251. int fl_utf8encode(unsigned ucs, char *buf)
  252. \b FLTK2
  253. <br>
  254. \par
  255. \p %fl_utf8decode() attempts to decode the UTF-8 character that starts
  256. at \p p and may not extend past \p end.
  257. It returns the Unicode value, and the length of the UTF-8 character sequence
  258. is returned via the \p len argument.
  259. \p %fl_utf8encode() writes the UTF-8 encoding of \p ucs into \p buf
  260. and returns the number of bytes in the sequence.
  261. See the main documentation for the treatment of illegal Unicode
  262. and UTF-8 sequences.
  263. unsigned int fl_utf8froma(char *dst, unsigned dstlen, const char *src, unsigned srclen)
  264. \b FLTK2
  265. <br>
  266. unsigned int fl_utf8toa(const char *src, unsigned srclen, char *dst, unsigned dstlen)
  267. \b FLTK2
  268. <br>
  269. \par
  270. \p %fl_utf8froma() converts a character string containing single bytes
  271. per character (i.e. ASCII or ISO-8859-1) into UTF-8.
  272. If the \p src string contains only ASCII characters, the return value will
  273. be the same as \p srclen.
  274. \par
\p %fl_utf8toa() converts a string containing UTF-8 characters into
single byte characters. UTF-8 characters that do not correspond to ASCII
or ISO-8859-1 characters (U+00FF and below) are replaced with '?'.
  278. \par
  279. Both functions return the number of bytes that would be written, not
  280. counting the null terminator.
\p dstlen provides a means of limiting the number of bytes written,
so setting \p dstlen to zero is a means of measuring how much storage
would be needed before doing the real conversion.
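\par
The "measure first, then convert" pattern can be illustrated with a stand-alone ISO-8859-1 to UTF-8 converter. This is a sketch that mimics the calling convention described above, not the %fl_utf8froma() source; \c latin1_to_utf8_sketch is a hypothetical name, and its truncation behaviour (never writing a partial sequence) is an assumption of this sketch.

```cpp
#include <cassert>

// Hypothetical helper, not part of FLTK: converts srclen bytes of
// ISO-8859-1 text to UTF-8. Writes at most dstlen bytes including the
// null terminator, and returns the number of bytes a full conversion
// needs, not counting the terminator.
unsigned latin1_to_utf8_sketch(char *dst, unsigned dstlen,
                               const char *src, unsigned srclen) {
  assert(dst != nullptr || dstlen == 0);
  unsigned needed = 0;    // size of the complete result
  unsigned written = 0;   // bytes actually stored in dst
  bool full = false;      // set once a sequence no longer fits
  for (unsigned i = 0; i < srclen; ++i) {
    const unsigned char c = static_cast<unsigned char>(src[i]);
    const unsigned n = (c < 0x80) ? 1 : 2;
    if (!full && written + n < dstlen) {   // keep room for the terminator
      if (n == 1) {
        dst[written] = static_cast<char>(c);
      } else {
        dst[written] = static_cast<char>(0xC0 | (c >> 6));
        dst[written + 1] = static_cast<char>(0x80 | (c & 0x3F));
      }
      written += n;
    } else {
      full = true;
    }
    needed += n;
  }
  if (dstlen > 0) dst[written] = '\0';
  return needed;
}
```

A caller would first pass a \p dstlen of zero to obtain the required size, allocate that many bytes plus one for the terminator, and then call the function again with the real buffer.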
  284. char* fl_utf2mbcs(const char *src)
  285. \b OksiD
  286. <br>
  287. \par
  288. converts a UTF-8 string to a local multi-byte character string.
  289. <b>[More info required here!]</b>
  290. unsigned int fl_utf8fromwc(char *dst, unsigned dstlen, const wchar_t *src, unsigned srclen)
  291. \b FLTK2
  292. <br>
  293. unsigned int fl_utf8towc(const char *src, unsigned srclen, wchar_t *dst, unsigned dstlen)
  294. \b FLTK2
  295. <br>
  296. unsigned int fl_utf8toUtf16(const char *src, unsigned srclen, unsigned short *dst, unsigned dstlen)
  297. \b FLTK2
  298. <br>
  299. \par
  300. These routines convert between UTF-8 and \p wchar_t or "wide character"
  301. strings.
  302. The difficulty lies in the fact \p sizeof(wchar_t) is 2 on Windows
  303. and 4 on Linux and most other systems.
  304. Therefore some "wide characters" on Windows may be represented
  305. as "surrogate pairs" of more than one \p wchar_t.
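\par
The surrogate pair arithmetic used by UTF-16 for characters above U+FFFF can be sketched as follows. The names \c to_surrogates_sketch and \c from_surrogates_sketch are hypothetical and are not FLTK functions; the sketch assumes its inputs are already known to be in range.

```cpp
#include <cassert>

// Hypothetical helper: splits a UCS value in U+10000..U+10FFFF into a
// high (0xD800..0xDBFF) and low (0xDC00..0xDFFF) surrogate.
void to_surrogates_sketch(unsigned ucs, unsigned short *hi, unsigned short *lo) {
  assert(hi != nullptr && lo != nullptr);
  assert(ucs >= 0x10000 && ucs <= 0x10FFFF);
  const unsigned v = ucs - 0x10000;            // 20 significant bits remain
  *hi = static_cast<unsigned short>(0xD800 | (v >> 10));    // top 10 bits
  *lo = static_cast<unsigned short>(0xDC00 | (v & 0x3FF));  // bottom 10 bits
}

// Hypothetical helper: recombines a valid surrogate pair into a UCS value.
unsigned from_surrogates_sketch(unsigned short hi, unsigned short lo) {
  assert(hi >= 0xD800 && hi <= 0xDBFF);
  assert(lo >= 0xDC00 && lo <= 0xDFFF);
  return 0x10000 + (((hi & 0x3FFu) << 10) | (lo & 0x3FFu));
}
```

For example, U+1F600 becomes the pair 0xD83D 0xDE00, so a single character occupies two 16-bit elements on Windows.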
  306. \par
  307. \p %fl_utf8fromwc() converts from a "wide character" string to UTF-8.
Note that \p srclen is the number of \p wchar_t elements in the source
string, and on Windows this might be larger than the number of characters.
  310. \p dstlen specifies the maximum number of \b bytes to copy, including
  311. the null terminator.
  312. \par
  313. \p %fl_utf8towc() converts a UTF-8 string into a "wide character" string.
  314. Note that on Windows, some "wide characters" might result in "surrogate
  315. pairs" and therefore the return value might be more than the number of
  316. characters.
  317. \p dstlen specifies the maximum number of \b wchar_t elements to copy,
  318. including a zero terminating element.
  319. <b>[Is this all worded correctly?]</b>
  320. \par
  321. \p %fl_utf8toUtf16() converts a UTF-8 string into a "wide character"
  322. string using UTF-16 encoding to handle the "surrogate pairs" on Windows.
\p dstlen specifies the maximum number of <b>unsigned short</b> elements to copy,
  324. including a zero terminating element.
  325. <b>[Is this all worded correctly?]</b>
  326. \par
  327. These routines all return the number of elements that would be required
  328. for a full conversion of the \p src string, including the zero terminator.
  329. Therefore setting \p dstlen to zero is a way of measuring how much storage
  330. would be needed before doing the real conversion.
  331. unsigned int fl_utf8from_mb(char *dst, unsigned dstlen, const char *src, unsigned srclen)
  332. \b FLTK2
  333. <br>
  334. unsigned int fl_utf8to_mb(const char *src, unsigned srclen, char *dst, unsigned dstlen)
  335. \b FLTK2
  336. <br>
  337. \par
  338. These functions convert between UTF-8 and the locale-specific multi-byte
  339. encodings used on some systems for filenames, etc.
  340. If fl_utf8locale() returns true, these functions don't do anything useful.
  341. <b>[Is this all worded correctly?]</b>
  342. int fl_tolower(unsigned int ucs)
  343. \b OksiD
  344. <br>
  345. int fl_toupper(unsigned int ucs)
  346. \b OksiD
  347. <br>
  348. int fl_utf_tolower(const unsigned char *str, int len, char *buf)
  349. \b OksiD
  350. <br>
  351. int fl_utf_toupper(const unsigned char *str, int len, char *buf)
  352. \b OksiD
  353. <br>
  354. \par
  355. \p %fl_tolower() and \p %fl_toupper() convert a single Unicode character
  356. from upper to lower case, and vice versa.
  357. \p %fl_utf_tolower() and \p %fl_utf_toupper() convert a string of bytes,
  358. some of which may be multi-byte UTF-8 encodings of Unicode characters,
  359. from upper to lower case, and vice versa.
  360. \par
  361. Warning: to be safe, \p buf length must be at least \p 3*len
  362. [for 16-bit Unicode]
  363. int fl_utf_strcasecmp(const char *s1, const char *s2)
  364. \b OksiD
  365. <br>
  366. int fl_utf_strncasecmp(const char *s1, const char *s2, int n)
  367. \b OksiD
  368. <br>
  369. \par
  370. \p %fl_utf_strcasecmp() is a UTF-8 aware string comparison function that
  371. converts the strings to lower case Unicode as part of the comparison.
\p %fl_utf_strncasecmp() only compares the first \p n characters [bytes?]
  373. \section unicode_system_calls FLTK Unicode versions of system calls
  374. - int fl_access(const char* f, int mode)
  375. \b OksiD
  376. - int fl_chmod(const char* f, int mode)
  377. \b OksiD
  378. - int fl_execvp(const char* file, char* const* argv)
  379. \b OksiD
- FILE* fl_fopen(const char* f, const char* mode)
  381. \b OksiD
  382. - char* fl_getcwd(char* buf, int maxlen)
  383. \b OksiD
  384. - char* fl_getenv(const char* name)
  385. \b OksiD
  386. - char fl_make_path(const char* path) - returns char ?
  387. \b OksiD
  388. - void fl_make_path_for_file(const char* path)
  389. \b OksiD
  390. - int fl_mkdir(const char* f, int mode)
  391. \b OksiD
  392. - int fl_open(const char* f, int o, ...)
  393. \b OksiD
  394. - int fl_rename(const char* f, const char* t)
  395. \b OksiD
  396. - int fl_rmdir(const char* f)
  397. \b OksiD
  398. - int fl_stat(const char* path, struct stat* buffer)
  399. \b OksiD
  400. - int fl_system(const char* f)
  401. \b OksiD
  402. - int fl_unlink(const char* f)
  403. \b OksiD
  404. \par TODO:
  405. \li more doc on unicode, add links
  406. \li write something about filename encoding on OS X...
  407. \li explain the fl_utf8_... commands
  408. \li explain issues with Fl_Preferences
  409. \li why FLTK has no Fl_String class
  410. \htmlonly
  411. <hr>
  412. <table summary="navigation bar" width="100%" border="0">
  413. <tr>
  414. <td width="45%" align="LEFT">
  415. <a class="el" href="advanced.html">
  416. [Prev]
  417. Advanced FLTK
  418. </a>
  419. </td>
  420. <td width="10%" align="CENTER">
  421. <a class="el" href="index.html">[Index]</a>
  422. </td>
  423. <td width="45%" align="RIGHT">
  424. <a class="el" href="enumerations.html">
  425. FLTK Enumerations
  426. [Next]
  427. </a>
  428. </td>
  429. </tr>
  430. </table>
  431. \endhtmlonly
  432. */