What Unicode characters can be rendered in the Command Window?

I would like to display "blackboard"/"double-struck"/"open-face" characters in the Command Window. The following examples are failed attempts to show the number "1", whose code point is U+1D7D9. Is there a reference or listing of which Unicode characters can be rendered in the Command Window?
>> sprintf('\x1D7D9')
Warning: The hex value specified is outside the range of the character set.
ans =
1×0 empty char array
>> char(hex2dec('1D7D9'))
ans =
'[]'

 Accepted Answer

The maximum character code that (currently) can be used in MATLAB is:
+char(Inf)
ans = 65535
The value you are attempting to convert is above this; we can see that it gets clamped to the highest code:
num = hex2dec('1D7D9')
num = 120793
+char(num)
ans = 65535
The MATLAB documentation states that it uses UTF-16 for storing characters:
This was written by someone who does not realize that UTF-16 is by definition a variable-length encoding and can therefore use one or two 16-bit code units per character. MATLAB uses exactly one 16-bit code unit, so the documentation is incorrect, or at least misleading.
I remember somewhere it used to state the maximum supported char code, but cannot find it right now.

15 Comments

Thanks. I submitted a bug report, which has case number 06393603.
Hi, I work at MathWorks and I wanted to let you all know that we have forwarded this feedback to the relevant development team. We will consider this request and it may be included in a future release of MATLAB.
MATLAB does use UTF16... it is just difficult to prove it because there is so little interface for it.
For example:
U+1D7D9 Mathematical Double-Struck Digit One https://www.compart.com/en/unicode/U+1D7D9 has a UTF-8 encoding of 0xF0 0x9D 0x9F 0x99
B = [0xF0 0x9D 0x9F 0x99]
B = 1×4
240 157 159 153
Bu = native2unicode(B,'utf8')
Bu = '𝟙'
dec2hex(Bu + 0)
ans = 2×4 char array
'D835' 'DFD9'
which is the UTF16 encoding.
So if you manage to import characters with sufficiently large code points using native2unicode() then it will be UTF16 that is stored.
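The D835/DFD9 pair above is just standard UTF-16 surrogate arithmetic, so it can be cross-checked independently of MATLAB. A minimal sketch in Python (nothing MATLAB-specific is assumed; this is the encoding rule from the Unicode standard):

```python
def utf16_surrogates(cp):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    assert cp > 0xFFFF
    cp -= 0x10000                  # leaves a 20-bit value
    high = 0xD800 + (cp >> 10)     # high surrogate carries the top 10 bits
    low = 0xDC00 + (cp & 0x3FF)    # low surrogate carries the bottom 10 bits
    return high, low

hi, lo = utf16_surrogates(0x1D7D9)
print(hex(hi), hex(lo))  # 0xd835 0xdfd9
```

This reproduces exactly the 'D835' 'DFD9' pair that dec2hex(Bu + 0) shows, which is what confirms the stored data is UTF-16.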
"MATLAB does use UTF16... it is just difficult to prove it because there is so little interface for it."
Then perhaps I should modify my answer to "For no obvious reason MATLAB prevents us from converting values above 65535, even though it is claimed that it can store them. For no obvious reason TMW does not document this, and for no obvious reason TMW forces us to work in terms of bytes when defining Unicode characters (as if we are still in the early 1990s and TMW aren't quite sure how to handle this new-fangled Unicode thing...)".
What is the point in the documentation claiming that it can store them if it does not let us convert them?
Hopefully the answer is not to "use bytes".
B = [0xF0 0x9D 0x9F 0x99];
Bu = native2unicode(B,'utf8');
text(0.1, 0.1, Bu, 'interpreter', 'none')
text(0.1, 0.2, Bu, 'interpreter', 'tex')
text(0.1, 0.3, Bu, 'Interpreter', 'latex')
Warning: Error updating Text.

String scalar or character vector must have valid interpreter syntax:
𝟙
Huh, so it really is recognizing the UTF-16, at least for 'none' and 'tex'. But as usual LaTeX craps out on anything outside of 0 to 254.
I speculate that MathWorks wanted to preserve the semantics that char() of a scalar numeric value always returns a scalar character. The UTF-16-encoded variable Bu in my example requires a pair of char values.
That said, they sure could have made UTF16 easier to deal with!
I already had some evidence about 6 years ago that UTF16 was being used internally, but for whatever reason I was having trouble understanding how it was working. I saw some results that went beyond the Standard Model but couldn't figure out at the time what the results meant... I don't know why I overlooked using native2unicode() for testing.
"I speculate that Mathworks wanted to preserve the semantics that char() of a scalar numeric value always returns a scalar character"
But the Unicode character U+1D7D9 that the OP asked about is one scalar character, not two as you wrote (you are confusing the encoding with the number of characters). How it is stored in memory (normalization form, encoding, etc.) is an implementation detail for the computer to manage, not for me, the user, to fight with bytes over by calling the obscure NATIVE2UNICODE. Sure, Unicode is not simple: it involves checking canonical equivalence, normalization forms, and so forth, but computers are good at doing these kinds of things (other languages manage this; just not MATLAB).
I thank my lucky stars for the developers of Python 3, who finally realized that a Unicode character is a Unicode character is a Unicode character, and abandoned the half-baked approach that TMW follows ("we still write our code with fond memories of when characters had exactly seven bits ... what do you mean, there are users who speak other languages who might like to use their computers?")
Unicode has existed for more than thirty years. Catch up, TMW.
I will just point out that whether or not the native2unicode approach discussed above will successfully render the double-struck "1" depends on the choice of command window font. For example, "Monospaced" fails to render, but "STIX Two Math" succeeds. When it fails, it returns a pair of empty strings.
I speculate that Mathworks wanted to preserve the semantics that char() of a scalar numeric value always returns a scalar result while using a fixed size storage per character and avoiding the overkill of assigning 4 bytes per location.
Indexing into an array whose entries are variable-width requires either a "marginal index" (a table mapping index position to byte offset), or else scanning on every access from the beginning of the byte representation, accumulating the width of each entry until the desired location is found. That is not efficient.
Variable-width characters are fine if all you are doing with them is using entire strings or concatenating entire strings, but are a performance loss otherwise.
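To make that cost concrete, here is a sketch in Python of why indexing into variable-width UTF-8 is linear-time: every lookup must walk the bytes from the start, because each character may occupy one to four bytes. (Purely illustrative of the general argument, not of MATLAB internals; it assumes well-formed UTF-8 and does no validation.)

```python
def utf8_char_at(b, n):
    """Return the n-th character (0-based) of well-formed UTF-8 bytes.
    There is no constant stride, so all preceding characters must be scanned."""
    def width(lead):
        # Lead-byte ranges determine how many bytes the character occupies.
        return 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    i = 0
    for _ in range(n):          # skip the first n characters, byte by byte
        i += width(b[i])
    return b[i:i + width(b[i])].decode('utf8')

s = 'ab\U0001D7D9c'.encode('utf8')   # 1 + 1 + 4 + 1 = 7 bytes for 4 characters
print(utf8_char_at(s, 2))            # the double-struck one
print(utf8_char_at(s, 3))            # 'c'
```

A fixed-width representation turns the same lookup into a single multiply-and-add, which is presumably why a 16-bit fixed width was chosen.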
Latex seems to work (though with a warning) if there isn't a typo. Or, maybe it is falling back to 'none' because of the warning.
B = [0xF0 0x9D 0x9F 0x99];
Bu = native2unicode(B,'utf8');
text(0.1, 0.1, Bu, 'Interpreter', 'none')
text(0.1, 0.2, Bu, 'Interpreter', 'tex')
text(0.1, 0.3, Bu, 'Interpreter', 'latex')
Warning: Error updating Text.

String scalar or character vector must have valid interpreter syntax:
𝟙
Ah, yes, sadly my cross-checking is not so great at 4 in the morning...
"I speculate that Mathworks wanted to preserve the semantics that char() of a scalar numeric value always returns a scalar result while using a fixed size storage per character and avoiding the overkill of assigning 4 bytes per location."
That closely matches my speculation.
The problem is that this does not match user expectations (mine or the OP's) or the documentation. Either
  • the documentation needs to be updated to make it clear that CHAR is limited to 16 bits** and that any larger character code requires special magic incantations. Perhaps a warning would help too, rather than just silently mapping larger values to 65535.
  • or the character type is updated so that it works as younger users expect (whether fixed bytes or a "marginal index" is irrelevant to me, the user).
** Currently the closest it gets is the statement "However, the integers from 0 to 65535 also correspond to Unicode® characters", which is an odd statement to make about Unicode: it is exactly as true as stating "the integers from 0 to N also correspond to Unicode® characters" for any 0<=N<=MaxUnicodeCharCode. Factually true, but not helpful (and not actually a statement about CHAR, the function).
Semantically, it might not have been bad for string() to have been a container for variable-width characters. It might not have been possible to maintain the interface that string_object{1} refers to a char vector stored at string_object(1) but probably something reasonable could have been designed.
@Walter Roberson: perhaps rather than variable width, it might be possible to use fixed-width. This would allow any one char array to have a constant stride (giving efficient linear indexing as you state) and implement the entire code point range without special incantations. Here are two ways this could work invisibly to the user:
  1. Apparently Unicode does not even use all 32 bits of UTF-32: only 21 bits per character are required to cover every character. Perhaps three bytes could be used, which is not a big step up from the two bytes currently used. Benefit: fixed, constant everything, linear access. Cost: a 50% increase in memory (given modern computer memory, would anyone even notice? Who uses 25 GiB char arrays?).
  2. Store the stride in the array header (e.g. 2 bits), then store the char array using 8, 16, or 32 bits per character depending on the array content. This would require checking the content (slow), plus some copy-on-write process, similar to MATLAB's existing promotion of a real array to complex when a complex value is written into it, to widen the array when "larger" characters are written. Benefit: minimal memory for any one array (it could even decrease memory, as most(?) text uses only basic Latin). Cost: checking array content.
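For what it is worth, option 2 is essentially what CPython adopted (PEP 393): each string object records its own stride of 1, 2, or 4 bytes per character, chosen once at creation from the widest character it contains. A quick sketch of the observable effect (exact byte counts are a CPython implementation detail, so only the ordering matters here):

```python
import sys

ascii_s = 'abcd'             # fits in 1 byte per character
bmp_s = 'abc\u045E'          # one BMP character forces 2 bytes per character
astral_s = 'abc\U0001D7D9'   # one astral character forces 4 bytes per character

# Same character count, increasing memory as the required stride widens:
print(sys.getsizeof(ascii_s), sys.getsizeof(bmp_s), sys.getsizeof(astral_s))

# Indexing stays constant-time, and no surrogate pairs leak to the user:
print(len(astral_s), astral_s[3])
```

So the "checking array content" cost is evidently considered acceptable in practice by at least one mainstream language.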
However... this very interesting manifesto:
makes the case that it is in any case a mistake to think of Unicode characters as simply an extension of the ASCII set, because of e.g. combining characters, which by definition must be considered equivalent to the corresponding precomposed character. This necessarily breaks the correspondence between the concept "character" and something that can be simply linearly indexed, making the constant-stride requirement rather moot. Perhaps it is not even meaningful to require one: if we loop over two character vectors, one containing a precomposed "latin letter with diacritic" character and the other a "latin letter" character followed by a combining diacritic, what is the expected behavior from MATLAB? From both the user and Unicode perspectives, they are equivalent.
One major difference MATLAB has to the supporting examples given in that manifesto is that MATLAB tends to be used to analyze file content, less often to process document/file text itself. So the argument based on meta-data (e.g. XML tags) used throughout that manifesto is perhaps less relevant for MATLAB (than for a general programming language which is used e.g. to write an XML parser).
Although fixed-width solves some things (linear indexing) it only gives the illusion of Unicode support: perhaps variable-width encoding backed by an appropriate Unicode library (for equivalences, combining characters, case conversions etc) is the only robust solution to full Unicode support. Currently MATLAB fails my basic Unicode compliance tests, e.g. for canonical equivalence:
one = sprintf('\x45E') % CYRILLIC SMALL LETTER SHORT U
one = 'ў'
numel(one)
ans = 1
two = sprintf('\x443\x306') % CYRILLIC SMALL LETTER U & COMBINING BREVE
two = 'ў'
numel(two)
ans = 2
strcmp(one,two) % not Unicode compliant
ans = logical
0
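For comparison, the same compliance test in Python. To be fair to MATLAB, Python's == also compares raw code points, exactly like strcmp above; the difference is that the stdlib unicodedata module makes the canonical-equivalence check a one-liner:

```python
import unicodedata

one = '\u045E'        # U+045E CYRILLIC SMALL LETTER SHORT U (precomposed)
two = '\u0443\u0306'  # U+0443 CYRILLIC SMALL LETTER U + U+0306 COMBINING BREVE

print(one == two)     # False: raw code-point comparison, just like strcmp
# Canonical equivalence requires normalizing both sides to the same form first:
print(unicodedata.normalize('NFC', one) == unicodedata.normalize('NFC', two))  # True
```

As far as I know, MATLAB ships no equivalent of normalize(), which is the gap being complained about here.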
I wonder to what extent customers expect the Text Analytics Toolbox to cope with such things... and if so, what MathWorks has implemented.


More Answers (0)


Release

R2023a
