c++ - Unicode, UTF-8, UTF-16 and UTF-32 questions


I have read a lot about Unicode, ASCII, code pages, the history, the invention of UTF-8, UTF-16 (UCS-2), UTF-32 (UCS-4) and how to use them, but I still have some questions. I tried hard to find answers but couldn't, so I hope you can help me.

1 - The Unicode standard encodes characters and specifies a code point for each character, e.g. U+0000. Imagine I have a file that contains those code points (\u0000); at which point in my application am I going to use them?

This might be a silly question, but I really don't know at which point in my application I would use them. I'm creating an application that can read a file containing code points using the \u escape, and I know I can read it and decode it, but then comes the next question.

2 - To which character set (code page) do I need to convert it? I saw some C++ libraries that use names like utf8_to_unicode or utf8-to-utf16 and utf8_decode, and that confuses me.

I don't know what answers will appear to this; someone might say: you need to convert it to the code page you are going to use. But what if the application needs to be internationalized?

3 - I was wondering: in C++, if I try to display non-ASCII characters on the terminal I get garbled output. My question is: what makes the characters get displayed with the correct fonts?

#include <iostream>

int main()
{
    std::cout << "ö" << std::endl;

    return 0;
}

The output (on Windows):

├Â

4 - At which part of the process does encoding come in? Does it encode the text, take the code point and try to find the equivalent character in the fonts?

5 - WebKit is the engine for rendering web pages in web browsers. If I specify the charset as UTF-8 it works nicely with all characters, but if I specify another charset it doesn't, no matter which font I'm using. Why does that happen?

<html>
<head>
    <meta charset="iso-8859-1">
</head>
<body>
    <p>ö</p>
</body>
</html>

The output:

ö

It works when using:

<meta charset="utf-8"> 

6 - Imagine I read a file and encode it so I have the code points, and then I need to save the file again. Do I need to save it encoded (\u0000), or do I need to decode it first to transform it back into characters and then save it?

7 - Why is the word "Unicode" a bit overloaded and sometimes understood to mean UTF-16? (source)

That's all for now. Thanks in advance.

I'm creating an application that can read a file containing code points using the \u escape, and I know I can read it and decode it, but then comes the next question.

If you're writing a program that processes some kind of custom escapes, such as \uxxxx, it's entirely up to you when you convert these escapes to Unicode code points.
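For example, a minimal sketch of turning a \uXXXX escape into a code point might look like this (the helper parse_u_escape is only an illustration, not part of any library):

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>

// Parses the four hex digits that follow "\u" and returns the code point.
std::uint32_t parse_u_escape(const std::string& text, std::size_t pos)
{
    // text[pos] is expected to point at the first hex digit after "\u".
    return static_cast<std::uint32_t>(
        std::stoul(text.substr(pos, 4), nullptr, 16));
}

int main()
{
    std::string input = "\\u00F6";              // the six literal characters \u00F6
    std::uint32_t cp = parse_u_escape(input, 2);
    std::cout << "code point: U+" << std::hex << cp << '\n';   // prints U+f6
}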

To which character set (code page) do I need to convert it?

That depends on what you want to do. If you're using some other library that requires a specific code page, then it's up to you to convert your data from one encoding to the encoding required by that library. If you don't have hard requirements imposed by such third-party libraries, there may be no reason to do any conversion at all.
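As a sketch of what such a conversion can look like: if a third-party library on Windows wants UTF-16 wide strings, you could use the Win32 API MultiByteToWideChar roughly like this (the wrapper function utf8_to_utf16 is my own name, purely for illustration):

#include <string>
#include <windows.h>

std::wstring utf8_to_utf16(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();

    // First call asks how many UTF-16 code units are needed.
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring utf16(len, L'\0');

    // Second call performs the actual conversion into the buffer.
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()),
                        &utf16[0], len);
    return utf16;
}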

I was wondering: in C++, if I try to display non-ASCII characters on the terminal I get garbled output.

This is because the various layers of the technology stack use different encodings. From the sample output you give, "├Â", I can see what's happening: the compiler is encoding the string literal as UTF-8, and the console is decoding it using Windows code page 850. Normally, when there are encoding problems with the console you can fix them by setting the console output code page to the correct value, but unfortunately passing UTF-8 through std::cout has its own unique problems. Using printf instead worked for me in VS2012:

#include <cstdio>
#include <windows.h>

int main()
{
    // Tell the console to interpret the output bytes as UTF-8.
    SetConsoleOutputCP(CP_UTF8);
    std::printf("%s\n", "ö");
}

Hopefully Microsoft fixes the C++ libraries, if they haven't already done so in VS 14.

At which part of the process does encoding come in? Does it encode the text, take the code point and try to find the equivalent character in the fonts?

Bytes of data are meaningless unless you know the encoding. Encoding matters in all parts of the process.

I don't understand the second part of the question here.

If I specify the charset as UTF-8 it works nicely with all characters, but if I specify another charset it doesn't, no matter which font I'm using. Why does that happen?

What's going on here is that when you write charset="iso-8859-1" you are supposed to actually convert the document to that encoding. You're not doing that; instead you're leaving the document encoded as UTF-8.

As a little exercise, say you have a file that contains the following two bytes:

0xC3 0xB6

Using the information on UTF-8 encoding and decoding, what code point do those bytes decode to?

Now, using this 8859-1 code page, what do those same bytes decode to?
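If you want to check your answers after working through the bit layout by hand, here is a small sketch (my own, handling only this two-byte case): a two-byte UTF-8 sequence 110xxxxx 10yyyyyy decodes to the code point xxxxxyyyyyy, whereas in ISO-8859-1 every byte is its own code point.

#include <cstdint>
#include <iostream>

int main()
{
    std::uint8_t b1 = 0xC3, b2 = 0xB6;

    // UTF-8 interpretation: strip the prefix bits and combine.
    std::uint32_t utf8_cp = ((b1 & 0x1Fu) << 6) | (b2 & 0x3Fu);
    std::cout << "UTF-8:      one code point, U+"
              << std::hex << utf8_cp << '\n';          // U+f6, which is 'ö'

    // ISO-8859-1 interpretation: two separate code points.
    std::cout << "ISO-8859-1: two code points, U+" << std::hex
              << +b1 << " and U+" << +b2 << '\n';      // U+c3 'Ã' and U+b6 '¶'
}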

As another exercise, save two copies of your HTML document, one using charset="iso-8859-1" and one using charset="utf-8". Now use a hex editor and examine the contents of both files.

Imagine I read a file and encode it so I have the code points, and then I need to save the file again. Do I need to save it encoded (\u0000), or do I need to decode it first to transform it back into characters and then save it?

This depends on the program that will need to read the file. If that program expects non-ASCII characters to be escaped, then you have to save the file that way. But escaping characters with \u is not a normal thing to do; you only see it done in a few places, such as JSON data and C++ source code.
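For illustration only, writing a code point back out in that escaped form could look like the hypothetical helper below (this is not a standard API, and it only covers code points in the Basic Multilingual Plane, which fit in four hex digits):

#include <cstdint>
#include <cstdio>

// Prints a BMP code point in the \uXXXX escaped form.
void print_escaped(std::uint32_t code_point)
{
    // %04X prints the code point as four upper-case hex digits.
    std::printf("\\u%04X", static_cast<unsigned int>(code_point));
}

int main()
{
    print_escaped(0x00F6);   // prints \u00F6 for 'ö'
    std::printf("\n");
}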

Why is the word "Unicode" a bit overloaded and sometimes understood to mean UTF-16?

Largely because Microsoft uses the term that way, for historical reasons: when they added Unicode support, they named the options and settings "Unicode", and the encoding they supported was UTF-16.

