TBH Gvim and most editors did the same with prompts on saving, but you could certainly change that under Emacs with M-x customize, and Emacs handled weirdly encoded files out of the box.
Alright, but don't leave us hanging: what does Python 3 use for (2) that you say I was badly off on? (Or, in actuality, never thought about or meant to make claims about.) Now we still can't make good choices for performance!

https://stackoverflow.com/questions/1838170/what-is-internal... says Python 3.3 picks either a one-, two-, or four-byte representation depending on which is the smallest one that can represent all characters in a string. If you have one character in the string that requires >2 bytes to represent, it'll make every character take 4 bytes in memory so that you can have O(1) lookups on arbitrary offsets. The more you know :)
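A quick way to see that promotion in practice is a small sketch like the following (an added illustration, standard library only; exact byte counts vary by CPython version and platform):

```python
import sys

# All-ASCII text: 1 byte per character (plus a fixed object header).
ascii_s = "a" * 1000
# One character above U+00FF but at or below U+FFFF forces 2 bytes per character.
ucs2_s = "a" * 999 + "\u0394"      # GREEK CAPITAL LETTER DELTA
# One astral character (> U+FFFF) forces 4 bytes per character for the whole string.
ucs4_s = "a" * 999 + "\U0001F600"  # grinning face emoji

for label, s in [("1-byte", ascii_s), ("2-byte", ucs2_s), ("4-byte", ucs4_s)]:
    print(label, sys.getsizeof(s))
```

On a typical CPython 3.x build the three sizes come out at roughly 1, 2, and 4 bytes per character plus a fixed header, which is exactly the trade-off described above: constant-time indexing at the cost of widening the whole string.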
Pre-3.3, the format used for representing `str` objects in memory depended on whether you used a "narrow" (UTF-16) or "wide" (UTF-32) build of Python. Fortunately, wide and narrow builds were abandoned in Python 3.3, with a new way of representing strings: current Python uses a one-byte representation if all codepoints fit in Latin-1 (ASCII strings being the common case), UCS-2 (UTF-16 without surrogate pairs) if there is no codepoint higher than U+FFFF, and UTF-32 otherwise.

See this article for a good overview of the history of strings in Python: https://tenthousandmeters.com/blog/python-behind-the-scenes-...
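For contrast with the old narrow builds, where an astral character was stored as a surrogate pair that leaked into `len()` and indexing, here is a tiny check that passes on any current CPython (an editorial illustration, not part of the comment above):

```python
s = "\U0001F926"  # FACE PALM, a codepoint above U+FFFF

# On a pre-3.3 "narrow" build this string had length 2 (a surrogate pair);
# with the PEP 393 representation it is a single codepoint.
assert len(s) == 1
assert ord(s[0]) == 0x1F926

# Indexing is still O(1): the string is simply widened to 4 bytes per codepoint.
print(hex(ord(s[0])))
```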
Since Java 9, the Java runtime does something similar ("compact strings"): if a string contains only characters in ISO-8859-1, it is stored as one byte per character; otherwise the usual UTF-16 representation (two bytes per char) is used.
Yeah, I started writing about what you found (the answer to (2) for Python) and I realised that's a huge rabbit hole I was venturing down and decided to stop short and post, so, apologies I guess.
> And many other popular programming languages, including Node.js, Go, Rust, and Java uses UTF-8 by default.

Oh, I missed Java moving from UTF-16 to UTF-8.
The PyUnicode object is what represents a str. If the UTF-8 bytes are ever requested, a UTF-8 copy is created on demand and cached as part of the PyUnicode, being freed when the PyUnicode itself is freed. Separately from that, the codepoints making up the string are stored in a straightforward array allowing random access. The size of each codepoint can be 1, 2, or 4 bytes: when you create a PyUnicode you have to specify the maximum codepoint value, which is rounded up to 127, 255, 65535, or 1,114,111, and that determines whether 1, 2, or 4 bytes are used. If the maximum codepoint value is 127 then that array representation can be used as the UTF-8 directly. So the answer to your question is that many strings are stored as UTF-8 because all their codepoints are <= 127.

Separately from that, advancing through strings should not be done by codepoints anyway. A user-perceived character (aka grapheme cluster) is made up of one or more codepoints. For example, an e with an accent could be the e codepoint followed by a combining accent codepoint. The phoenix emoji is really the bird emoji, a zero width joiner, and then the fire emoji. Some writing systems used by hundreds of millions of people work similarly, with consonants plus combining marks to represent vowels. The facepalm emoji with skin-tone and gender modifiers (🤦🏼‍♂️) is 5 codepoints. There is a good blog post diving into this and how various languages report its "length": https://hsivonen.fi/string-length/

Source: I've just finished implementing Unicode TR29, which covers this, for a Python C extension.
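To make the codepoints-versus-graphemes point concrete, here is a small stdlib-only sketch (an added illustration, not the parent's extension); it lists the codepoints in a few strings and shows that `len()` counts codepoints, not user-perceived characters:

```python
import unicodedata

def dump(s):
    # Show each codepoint and its name; len() counts codepoints, not graphemes.
    print(f"{s!r}: len={len(s)}")
    for ch in s:
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")

dump("e\u0301")                              # 'e' + COMBINING ACUTE ACCENT: 2 codepoints, 1 grapheme
dump("\N{LATIN SMALL LETTER E WITH ACUTE}")  # precomposed 'é': 1 codepoint
dump("\U0001F926\U0001F3FC\u200D\u2642\uFE0F")  # facepalm + skin tone + ZWJ + male sign + VS-16: 5 codepoints
```

Actually segmenting text into grapheme clusters per UAX #29 / TR29 needs either a third-party library (for example the `regex` module's `\X` pattern) or an implementation like the one the parent describes.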
At this point nothing ought to be inserting BOMs in utf-8. It's not recommended, and I think choking on it is reasonable behaviour these days.
It's the behavior when using the default `Encoding.UTF8` static. You have to create your own instance as `new UTF8Encoding(false)` if you don't want a BOM.
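For comparison, on the Python side of the BOM question (an editorial aside): the `utf-8` codec never writes a BOM and never strips one, while `utf-8-sig` does both, so it is the usual workaround for reading files produced by BOM-writing tools.

```python
data_with_bom = b"\xef\xbb\xbfhello"

# Plain utf-8 keeps the BOM as a visible U+FEFF character...
print(repr(data_with_bom.decode("utf-8")))      # '\ufeffhello'
# ...while utf-8-sig strips it on decode and writes it on encode.
print(repr(data_with_bom.decode("utf-8-sig")))  # 'hello'
print("hello".encode("utf-8-sig"))              # b'\xef\xbb\xbfhello'
```

`open(path, encoding="utf-8-sig")` behaves the same way for text files.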
It's a different overload; the encoding is not an optional parameter: https://learn.microsoft.com/en-us/dotnet/api/system.io.file.... Enforcing the overload with the highest arity sounds like a really terrible rule to have.

Culture sensitivity is strictly different from a locale: it does not act like a C locale (which is unsound) but simply follows the delimiter/date/currency/etc. formats when parsing and formatting. It is also in many places considered undesirable, since it introduces environment-dependent behavior where it is not expected; hence the analyzer will suggest specifying the invariant culture, or you can set it project-wide through the InvariantGlobalization property (to avoid CultureInfo.InvariantCulture spam). This is still orthogonal to text encoding, however.
Nice. Now the only thing we need is for JS to switch to UTF-8. But of course JS can't improve, because unlike any other programming language, we need to be compatible with code written in 1995.
In addition to ApiFunctionA and ApiFunctionW, introduce ApiFunction8? (times the whole API surface) Introduce a #define UNICODE_NO_REALLY_ALL_UNICODE_WE_MEAN_IT_THIS_TIME?
Yes: https://learn.microsoft.com/en-us/windows/win32/sbscs/applic...

> On Windows 10, this element forces a process to use UTF-8 as the process code page. For more information, see Use the UTF-8 code page. On Windows 10, the only valid value for activeCodePage is UTF-8.

> This element was first added in Windows 10 version 1903 (May 2019 Update). You can declare this property and target/run on earlier Windows builds, but you must handle legacy code page detection and conversion as usual. This element has no attributes.
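If you want to verify at runtime that the manifest setting took effect, a minimal Windows-only sketch is to ask Win32 directly through `ctypes` (GetACP returns the process's active ANSI code page; 65001 is the UTF-8 code page):

```python
import ctypes

# GetACP() returns the process's active ANSI code page.
# 65001 means the UTF-8 code page is active, e.g. via the activeCodePage manifest element.
acp = ctypes.windll.kernel32.GetACP()
print("active code page:", acp, "(UTF-8)" if acp == 65001 else "")
```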
Hm TIL, I thought that the string encoding argument to .decode() and .encode() was required, but now I see it defaults to "utf-8". Did that change at some point?
> ChatGPT4 says it's always been that way since the beginning of Python3

This is not a reliable way to look up information. It doesn't know when it's wrong.
> In 3.1 it was the default encoding of string (the type str I guess).

No, what was used was whatever sys.getdefaultencoding() returned, which was already UTF-8 in 3.1 (I checked the source code).

At that time, the format used for representing `str` objects in memory depended on whether you used a "narrow" (UTF-16) or "wide" (UTF-32) build of Python. Fortunately, wide and narrow builds were abandoned in Python 3.3, with a new way of representing strings: current Python uses a one-byte representation if all codepoints fit in Latin-1 (ASCII strings being the common case), UCS-2 (UTF-16 without surrogate pairs) if there is no codepoint higher than U+FFFF, and UTF-32 otherwise. But that did not exist in 3.1, where you could either use the "narrow" build of Python (which used UTF-16) or the "wide" build (which used UTF-32).

See this article for a good overview of the history of strings in Python: https://tenthousandmeters.com/blog/python-behind-the-scenes-...
I also appreciate that they did not attempt to tackle filesystem encoding here, which is a separate issue that drives me nuts, but separately.
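For the decode/encode default discussed above and the filesystem-encoding aside here, all the relevant defaults are inspectable from Python itself; a small illustrative snippet (output naturally depends on platform, Python version, and settings such as UTF-8 mode):

```python
import locale
import sys

print(sys.getdefaultencoding())   # 'utf-8' on Python 3: the default for str.encode()/bytes.decode()
print(b"caf\xc3\xa9".decode())    # no argument needed: decodes as UTF-8 -> 'café'
print(sys.getfilesystemencoding())  # how str filenames are translated to bytes for OS calls
print(locale.getpreferredencoding(False))  # what open() uses when no encoding= is passed (unless UTF-8 mode is on)
```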