(comments)

Original link: https://news.ycombinator.com/item?id=40168242

In the late 1990s, Linux programs often did not fully support multi-byte UTF-8 characters outside the Basic Multilingual Plane (BMP), the plane that contains the standard Latin alphabet and other commonly used characters. However, because UTF-8 can encode any possible character, it became the dominant modern encoding. Early systems such as Rhapsody used the line feed (LF) rather than the carriage return (CR) for line breaks, and by the mid-2000s macOS, UNIX, and virtually every system other than Windows had adopted LF. Handling of multi-byte characters on Linux varied widely, though it was adequate in editors such as Vim and Emacs. Code pages other than UTF-8 were widespread, causing problems for programs such as Git during file I/O. In Python, the storage format of a string depends on the maximum code point value specified when it is created; since most code points fall within the first 127 values, most strings are stored in a form that is also valid UTF-8. This simplifies string operations, which can treat multi-byte sequences as single units called graphemes. UTF-8's widespread adoption has made it the default encoding for reading text files across many platforms, including Python. Although UTF-8 has become the de facto standard, understanding each platform's quirks remains essential for handling text data effectively.


Original text


Default text file encoding being platform-dependent always drove me nuts. This is a welcome change.

I also appreciate that they did not attempt to tackle filesystem encoding here, which is a separate issue that drives me nuts, but separately.
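
A minimal sketch of what "platform-dependent" meant in practice for Python (notes.txt is just a placeholder name): open() without an explicit encoding used the locale's preferred encoding, so the same script behaved differently per machine.

    import locale

    # 'UTF-8' on most Linux/macOS setups, often 'cp1252' on Windows
    print(locale.getpreferredencoding(False))

    with open("notes.txt", "w", encoding="utf-8") as f:   # explicit encoding: portable
        f.write("café €\n")

    with open("notes.txt") as f:                          # implicit encoding: platform/locale-dependent
        print(f.read())    # may mojibake or raise UnicodeDecodeError under a cp1252 locale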



With system-default code pages on Windows, it's not only platform-dependent, it's also System Locale dependent.

Windows badly dropped the ball here by not providing a simple opt-in way to make all the Ansi functions (TextOutA, etc) use the UTF-8 code page, until many many years later with the manifest file. This should have been a feature introduced in NT4 or Windows 98, not something that's put off until midway through Windows 10's development cycle.
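
As a rough check from userland (Windows only): GetACP() is the Win32 call that reports the process's active ANSI code page, and 65001 means the process is running with the UTF-8 code page, for example via the manifest setting mentioned above. A sketch:

    import ctypes

    if hasattr(ctypes, "windll"):
        acp = ctypes.windll.kernel32.GetACP()
        print("Active code page:", acp, "(UTF-8)" if acp == 65001 else "")
    else:
        print("Not on Windows; ANSI code pages do not apply.")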



I suspect that is a symptom of Microsoft being an enormously large organization. Coordinating a change like this that cuts across all apps, services and drivers is monumental. Honestly it is quite refreshing to see them do it with Copilot integration across all things MS. I don’t use it though, just admire the valiant effort and focus it takes to pull off something like this.

Of course - goes without saying, only works when the directive comes from all the way at the top. Otherwise there will be just too many conflicting incentives for any real change to happen.

While I am on this topic - I want to mention Apple. It is absolutely bonkers how they have done exactly this countless times. Like changing your entire platform architecture! It could have been like opening a can of worms but they knew what they were doing. Kudos to them.

Also..(sorry, this is becoming a long post) civil and industrial engineering firms routinely pull off projects like that. But the point I wanted to emphasize is that it’s very uncommon in tech which prides on having decentralized and semi-autonomous teams vs centralized and highly aligned teams.



> While I am on this topic - I want to mention Apple. It is absolutely bonkers how they have done exactly the is countless times. Like changing your entire platform architecture! It could have been like opening a can of worms but they knew what they were doing. Kudos to them.

Apple has a walled garden approach to managing their ecosystem, and within the confines of their garden they just do what's necessary. AFAIK, Apple doesn't care about the possibility to run binaries from the '90s on a modern stack.

Edit: even though it's expensive, it's possible to conduct such ecosystem-wide changes if you hold all the cards in your hand. Microsoft was able to reengineer the graphical subsystem somewhere between XP and 8. Doing something like this is magnitudes more difficult on Linux (Wayland says hi). Google could maybe do it within their Android corner, but they generally give a sh*t about backwards compatibility.



Historically it made sense, when most software was local-only, and text files were expected to be in the local encoding. Not just platform-dependent, but user’s preferred locale-dependent. This is also how the C standard library operates.

For example, on Unix/Linux, using iso-8859-1 was common when using Western-European languages, and in Europe it became common to switch to iso-8859-15 after the Euro was introduced, because it contained the € symbol. UTF-8 only began to work flawlessly in the later aughts. Debian switched to it as the default with the Etch release in 2007.
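
The Euro sign is a handy illustration of why iso-8859-15 mattered; in today's Python:

    print("€".encode("iso-8859-15"))   # b'\xa4'            (the Euro slot added in latin-9)
    print("€".encode("utf-8"))         # b'\xe2\x82\xac'    (three bytes in UTF-8)
    try:
        "€".encode("iso-8859-1")
    except UnicodeEncodeError as e:
        print("latin-1 cannot encode €:", e)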



It's still not that uncommon to see programs on Linux not understanding multibyte UTF-8.

It's also true that essentially nothing on Linux supports the UTF-8 byte order mark. Yes, it's meaningless for UTF-8, but it is explicitly allowed in the specifications. Since Microsoft tends to always include a BOM in any flavor of Unicode, this means Linux often chokes on valid UTF-8 text files from Windows systems.



The BOM cases are at best a consequence of trying to use poor quality Windows software to do stuff it's not suited to. It's true that in terms of Unicode text it's valid for a UTF-8 string to have a BOM, but just because that's true in the text itself doesn't magically change file formats which long pre-dated that.

Most obviously shebang (the practice of writing #!/path/to/interpreter at the start of a script) is specifically defined on those first two bytes. It doesn't make any sense to have a BOM here because that's not the format, and inventing a new rule later which says you can do it doesn't make that true, any more than in 2024 the German government can decide Germany didn't invade Poland in 1939; that's not how Time's Arrow works.



> poor quality Windows software to do stuff it's not suited to

Depends how wide your definition of "poor quality" is. All powershell files (ps1, psm1, psd1) are assumed to be in the local charset unless they have a byte order mark, in which case they're treated as whatever the BOM says.



Interestingly, Python is one of those programs.

You need to use the special "utf-8-sig" encoding for that, which is not prominently advertised anywhere in the documentation (but it is stated deep inside the "Unicode HOWTO").

I never understood why ignoring this special character requires a totally separate encoding.



> I never understood why ignoring this special character requires a totally separate encoding.

Because the BOM is indistinguishable from the "real" UTF-8 encoding of U+FEFF (zero-width no-break space). Trimming that codepoint in the UTF-8 decoder means that some strings like "\uFEFF" can't be safely round-tripped; adding it in the encoder is invalid in many contexts.
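
A quick illustration of the two behaviours:

    # 'utf-8' keeps a leading U+FEFF (so strings round-trip exactly);
    # 'utf-8-sig' strips one leading BOM on decode and adds one on encode.
    data = b"\xef\xbb\xbfhello"
    print(data.decode("utf-8"))       # '\ufeffhello'      (BOM kept as a codepoint)
    print(data.decode("utf-8-sig"))   # 'hello'            (BOM stripped)
    print("hi".encode("utf-8-sig"))   # b'\xef\xbb\xbfhi'  (BOM added)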



Really? In my experience it's pretty rare for Linux programs not to understand any multibyte utf-8 (which would be anything that isn't ascii). What is somewhat common is failing on code points outside the basic multilingual plane (codepoints that don't fit in 16 bits).


> Not just platform-dependent, but user’s preferred locale-dependent.

Historically it made sense to be locale-dependent, but even then it was annoying to be platform-dependent.

One is not a subset of the other.



Not sure what you mean by that with regard to encodings. The C APIs were explicitly designed to abstract from that, and together with libraries like iconv it was rather straightforward. You only needed to be aware that there is a difference between internal and external encoding, and maybe decide between char and wchar_t.


By which time XP was already in the middle of releasing, so it was too late to get Windows on board.

It's too bad, with a bit more planning and an earlier realization that Unicode cannot in fact fit into 16 bits then Windows might have used UTF-8 internally.



Unless I’m mistaken, Rhapsody (released 1997) used LF, not CR. At that point it was pretty clear Mac was moving towards Unix through NeXTSTEP, meaning every OS except windows would be using LF. Microsoft would’ve had around 6 years before the release of XP, and probably would’ve had time to start the transition with Win2K at the end of 1999.


Linux was definitely not uniformly UTF-8 twenty years ago. It was one of the many available locales, but it was still common to use other encodings, and plenty of software didn't handle multibyte well in general.


My experience was that brittleness around text encoding in Emacs (versions 22 and 23 or so) was a constant source of annoyance for years.

IIRC, the main way this brittleness bit me was that every time a buffer containing a non-ASCII character was saved, Emacs would engage me in a conversation (which I found tedious and distracting) about what coding system I would like to use to save the file, and I never found a sane way to configure it to avoid such conversations even after spending hours learning about how Emacs does coding systems: I simply had to wait (a year or 3) for a new version of Emacs in which the code for saving buffers worked better.

I think some people like engaging in these conversations with their computers even though the conversations are very boring and repetitive and that such conversation-likers are numerous among Emacs users or at least Emacs maintainers.



TBH Gvim and most editors did the same on saving prompts, but for sure you could edit that under Emacs with M-x configure, and Emacs supported weirdly encoded files on the spot.


A different one that just bit me the other day was implicitly changing line endings. Local testing on my corporate laptop all went according to plan. Deploy to linux host and downstream application cannot consume it because it requires CRLF.

Just one of those stupid little things you have to remember from time to time. Although, why does newly written software require a specific line terminator is a valid question.
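
In Python, one way to take the guesswork out of this is to pin the line terminator with open()'s newline parameter; a sketch using a hypothetical report.csv:

    # In text mode, "\n" is translated to the platform default unless newline= is given.
    with open("report.csv", "w", encoding="utf-8", newline="\r\n") as f:
        f.write("a,b\n")            # written to disk as 'a,b\r\n' on any platform

    with open("report.csv", "rb") as f:
        print(f.read())             # b'a,b\r\n'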



Not relying on flaky system defaults is a good thing. These things have a way of turning around and being different than what you assume them to be. A few years ago I was dealing with Ubuntu and some init.d scripts. One issue I ran into was that some script we used to launch Java (this was before docker) was running as root (bad, I know) and with a shell that did not set UTF-8 as the default, as would be completely normal for regular users. And of course that revealed some bad APIs that we were using in Java that use the OS default. Most of these things have variants that allow you to set the encoding at this point, and a lot of static code checkers will warn you if you use the wrong one. But of course it only takes one place for this to start messing up content.

These days it's less of an issue but I would simply not rely on the os to get this right ever for this. Most uses of encodings other than UTF-8 are extremely likely to be unintentional at this point. And if it is intentional, you should be very explicit about it and not rely on weird indirect configuration through the OS that may or may not line up.

So, good change. Anything that breaks over this is probably better off with the simple fix added. And it's not worth leaving everything else as broken as it is with content corruption bugs just waiting to happen.
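
Since Python 3.10 (PEP 597) there is also an opt-in EncodingWarning that flags exactly this kind of reliance on the locale default; a sketch (config.ini is just a placeholder name):

    # Run with:  python -X warn_default_encoding script.py
    # or set:    PYTHONWARNDEFAULTENCODING=1
    import warnings
    warnings.simplefilter("error", EncodingWarning)   # EncodingWarning is a builtin in 3.10+

    open("config.ini", "w", encoding="utf-8").close()   # fine: encoding is explicit
    open("config.ini").close()                          # raises EncodingWarning when the flag is on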



I was using .gitignore generated by an aliased touch function in powershell. Despite my best efforts, I could not get git to respect its gitignore. Figured out the touched text file was utf-16 and basically not respected at all. Lesson learned: I changed a system default to utf-8 but just rely on my text editor now.


The following heuristic has become increasingly true over the last couple of decades: If you have some kind of "charset" configuration anywhere, and it's not UTF-8, it's wrong.

Python 2 was charset agnostic, so it always worked, but the change in Python 3 was not only an improvement. How do you tell a Python 3 script from a Python 2 script?

* If it contains the string "utf-8", it's Python3.

* If it only works if your locale is C.UTF-8, it's Python3.

Needless to say, I welcome this change. The way I understand it, it would "repair" Python 3.



You may be thinking of strings where the u"" prefix was made obsolete in python3. Then again, trying on Python 2.7 just now, typing "éķů" results in it printing the UTF-8 bytes for those characters, so I don't actually know what that u prefix ever did; but one of the big py2-to-3 changes was strings having an encoding and byte strings being for byte sequences without encodings.

This change seems to be about things like open('filename', mode='r') mainly on Windows where the default encoding is not UTF-8 and so you'd have to specify open('filename', mode='r', encoding='UTF-8')



Python has two types of strings: byte strings (every character is in the range of 0-255) and Unicode strings (every character is a Unicode codepoint). In Python 2.x, "" maps to a byte string and u"" maps to a Unicode string; in Python 3.x, "" maps to a unicode string and b"" maps to a byte string.

If you typed in "éķů" in Python 2.7, what you get is a string consisting of the hex chars 0xC3 0xA9 0xC4 0xB7 0xC5 0xAF, which if you printed it out and displayed it as UTF-8--the default of most terminals--would appear to be éķů. But "éķů"[1] would return a byte string of \xa9 which isn't valid UTF-8 and would likely display as garbage.

If you instead had used u"éķů", you'd instead get a string of three Unicode code points, U+00E9 U+0137 U+016F. And u"éķů"[1] would return u"ķ", which is a valid Unicode character.
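
The same distinction survives in Python 3, just with the defaults flipped; a small sketch:

    # Indexing a str yields a character; indexing the encoded bytes yields a
    # single byte that may sit in the middle of a multi-byte sequence.
    s = "éķů"
    b = s.encode("utf-8")
    print(s[1])        # 'ķ'
    print(b)           # b'\xc3\xa9\xc4\xb7\xc5\xaf'
    print(b[1])        # 169  (0xA9, the second byte of 'é', not a character)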



> strings having an encoding and byte strings being for byte sequences without encodings

You got it kind of backwards. `str` are sequence of unicode codepoints (not UTF-8, which is a specific encoding for unicode codepoints), without reference to any encoding. `bytes` are arbitrary sequence of octets. If you have some `bytes` object that somehow stands for text, you need to know that it is text and what its encoding is to be able to interpret it correctly (by decoding it to `str`).

And, if you got a `str` and want to serialize it (for writing or transmitting), you need to choose an encoding, because different encodings will generate different `bytes`.

As an example:

  >>> "évènement".encode("utf-8")
  b'\xc3\xa9v\xc3\xa8nement'
  >>> "évènement".encode("latin-1")
  b'\xe9v\xe8nement'



> `str` are sequence of unicode codepoints (not UTF-8, which is a specific encoding for unicode codepoints)

It’s worse than that, actually: UTF-8 is a specific encoding for sequences of Unicode scalar values (which means: code points minus the surrogate range U+D800–U+DFFF). Since str is a sequence of Unicode code points, this means you can make strings that cannot be encoded in any standard encoding:

  >>> '\udead'.encode('utf-16')
  Traceback (most recent call last):
    File "", line 1, in 
  UnicodeEncodeError: 'utf-16' codec can't encode character '\udead' in position 0: surrogates not allowed
  >>> '\ud83d\ude41'.encode('utf-8')
  Traceback (most recent call last):
    File "", line 1, in 
  UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Python 3’s strings are a tragedy. They seized defeat from the jaws of victory.
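
For what it's worth, CPython does expose an escape hatch: the 'surrogatepass' error handler lets lone surrogates through (this is CPython-specific behaviour layered on the codec, not standard UTF-8). A small sketch:

    print('\udead'.encode('utf-8', 'surrogatepass'))         # b'\xed\xba\xad'
    print(b'\xed\xba\xad'.decode('utf-8', 'surrogatepass'))  # '\udead' round-trips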


> `str` are sequence of unicode codepoints [...] without reference to any encoding

I guess I see it from the programmer's perspective: to handle bytes coming from the disk/network as a string, I need to specify an encoding, so they are (to me) byte sequences with an encoding assigned. Didn't realize strings don't have an encoding in Python's internal string handling but are, instead, something like an array of integers pointing to unicode code points. Not sure if this viewpoint means I am getting it backwards but I can see how that was phrased poorly on my part!



There are two distinct questions here, to which implementations can provide different answers

1. Interface: How can I interact with "string" values, what kind of operations can I perform versus what can't be done ? Methods and Operators provided go here.

2. Representation: What is actually stored (in memory) ? Layout goes here.

So you may have understood (1) for Python, but you were badly off on (2). Now, at some level this doesn't matter, but, for performance obviously the choice of what you should do will depend on (2). Most obviously, if the language represents strings as UTF-8 bytes, then "encoding" a string as UTF-8 will be extremely cheap. Whereas, if the language represents them as UTF-16 code units, the UTF-8 encoding operation will be a little slower.



Alright, but don't leave us hanging: what does Python3 use for (2) that you say I was badly off on? (Or, in actuality, never thought about or meant to make claims about.) Now we still can't make good choices for performance!

https://stackoverflow.com/questions/1838170/what-is-internal... says Python3.3 picks either a one-, two-, or four-byte representation depending on which is the smallest one that can represent all characters in a string. If you have one character in the string that requires >2 bytes to represent, it'll make every character take 4 bytes in memory such that you can have O(1) lookups on arbitrary offsets. The more you know :)
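
A rough way to see the 1/2/4-byte representations from Python itself (the exact byte counts are an implementation detail and vary across CPython versions and platforms):

    import sys
    ascii_s  = "a" * 100
    bmp_s    = "a" * 99 + "\u20ac"      # one char above U+00FF forces 2 bytes per char
    astral_s = "a" * 99 + "\U0001f600"  # one char above U+FFFF forces 4 bytes per char
    print(sys.getsizeof(ascii_s), sys.getsizeof(bmp_s), sys.getsizeof(astral_s))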



Pre-python 3.2, the format used for representing `str` objects in memory depended on if you used a "narrow" (UTF-16) or "wide" (UTF-32) build of Python.

Fortunately, wide and narrow builds were abandoned in Python 3.2, with a new way of representing strings: current Python will use ASCII if there's no non-ASCII char, UCS-2 (UTF-16 without surrogate pairs) if there is no codepoint higher than U+FFFF, and UTF-32 otherwise.

See this article for a good overview of the history of strings in Python: https://tenthousandmeters.com/blog/python-behind-the-scenes-...



Since Java 9, the Java JRE does something similar: if a string contains only characters in ISO-8859-1 then it is stored as such, else the usual UTF-16 storage format (two bytes per character) is used.


Yeah, I started writing about what you found (the answer to (2) for Python) and I realised that's a huge rabbit hole I was venturing down and decided to stop short and post, so, apologies I guess.


The Python source code is utf-8 by default in Python 3. But that says nothing about the character encoding used to save to a file, which is locale-dependent by default.

    # string literals create str objects using utf-8 by default
    Path("filenames use their own encoding").write_text("file content encoding uses yet another encoding")

The corresponding encodings are:

- utf-8 [tokenize.open]
- sys.getfilesystemencoding() [os.fsencode]
- locale.getpreferredencoding() [open]
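
These defaults can also be inspected at runtime; a small sketch (printed values depend on platform, locale, and Python version):

    import locale, sys

    print(sys.getfilesystemencoding())         # filenames, os.fsencode/os.fsdecode
    print(locale.getpreferredencoding(False))  # what open() uses by default pre-PEP 686
    # Python source files themselves default to UTF-8 (readable via tokenize.open).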



> And many other popular programming languages, including Node.js, Go, Rust, and Java uses UTF-8 by default.

Oh, I missed Java moving from UTF-16 to UTF-8.



With Java, the default encoding when converting bytes to strings was originally platform independent, but now it's UTF-8. UTF-16 and latin-1 encodings are (still*) used internally by the String class, and the JVM uses a modified UTF-8 encoding like it always has.

* The String class originally only used UTF-16 encoding, but since Java 9 it also uses a single-byte-per-character latin-1 encoding when possible.



Depends on what you define as "file I/O", though. NTFS filenames are UTF-16 (or rather UCS2). As far as file contents, there isn't really a standard, but FWIW for a long time most Windows apps (Notepad being the canonical example), when asked to save anything as "Unicode", would save it as UTF-16.


I'm talking about the default behavior of Microsoft's C runtime (MSVCRT.DLL) that everyone is/was using.

UTF-16 text files are rather rare, as is using Notepad's UTF-16 options. The only semi-common use I know of is *.reg files saved from regedit. One issue with UTF-16 is that it has two different serializations (BE and LE), and hence generally requires a BOM to disambiguate.
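
Python's codecs show the same trade-off: the byte-order-generic 'utf-16' codec emits a BOM so a reader can tell LE from BE, while the order-specific variants do not:

    print("hi".encode("utf-16"))     # b'\xff\xfeh\x00i\x00' on little-endian machines: BOM first
    print("hi".encode("utf-16-le"))  # b'h\x00i\x00'         no BOM, order fixed by the codec name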



Then you're talking about the C stdlib, which, yeah, is meant to use the locale-specific encoding on any platform, so it's not really a Windows thing specifically. But even then someone could use the CRT but call wfopen() rather than fopen() etc - this was actually not uncommon for Windows software precisely because it let you handle Unicode without having to work with Win32 API directly.

Microsoft's implementation of fopen() also supports "ccs=..." to open Unicode text files in Unicode, and interestingly "ccs=UNICODE" will get you UTF-16LE, not UTF-8 (but you can do "ccs=UTF-8"). .NET also has this weird naming quirk where Encoding.Unicode is UTF-16, although there at least UTF-8 is the default for all text I/O classes like StreamReader if you don't specify the encoding. Still, many people didn't know better, and so some early .NET software would use UTF-16 for text I/O for no reason other than its developers believing that Encoding.Unicode is obviously what they are supposed to be using to "support Unicode", and so explicitly passing it everywhere.



Is the internal encoding in CPython UTF-8 yet?

You can index through Python strings with a subscript, but random access is rare enough that it's probably worthwhile to lazily index a string when needed. If you just need to advance or back up by 1, you don't need an index. So an internal representation of UTF-8 is quite possible.



The PyUnicode object is what represents a str. If the UTF-8 bytes are ever requested, then a bytes object is created on demand and cached as part of the PyUnicode, being freed when the PyUnicode itself is freed.

Separately from that, the codepoints making up the string are stored in a straightforward array allowing random access. The size of each codepoint can be 1, 2, or 4 bytes. When you create a PyUnicode you have to specify the maximum codepoint value which is rounded up to 127, 255, 65535, or 1,114,111. That determines if 1, 2, or 4 bytes is used.

If the maximum codepoint value is 127 then that array representation can be used for the UTF-8 directly. So the answer to your question is that many strings are stored as UTF-8 because all the codepoints are <= 127.

Separately from that, advancing through strings should not be done by codepoints anyway. A user perceived character (aka grapheme cluster) is made up of one or more codepoints. For example an e with an accent could be the e codepoint followed by a combining accent codepoint. The phoenix emoji is really the bird emoji, a zero width joiner, and then fire emoji. Some writing systems used by hundreds of millions of people are similar to having consonants, with combining marks to represent vowels.

This emoji - 🤦🏼‍♂️ - is 5 codepoints. There is a good blog post diving into it and how various languages report its "length". https://hsivonen.fi/string-length/

Source: I've just finished implementing Unicode TR29 which covers this for a Python C extension.
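
A quick way to see the codepoint-versus-perceived-character gap from plain Python (full grapheme segmentation per TR29 needs more machinery, e.g. the third-party regex module's \X, so this only shows the codepoint side):

    import unicodedata
    s = "e\u0301"                                  # 'e' + U+0301 COMBINING ACUTE ACCENT
    print(s, len(s))                               # renders as one character but is 2 codepoints
    print(len(unicodedata.normalize("NFC", s)))    # 1 codepoint after composition: 'é'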



At this point nothing ought to be inserting BOMs in utf-8. It's not recommended, and I think choking on it is reasonable behaviour these days.


Only reason I used it was to force MSVC to understand my u8"" literals. Should've forced /utf-8 in our build system, in retrospect.

For UTF-16/32, knowing the endianness doesn't seem to be a frivolous functionality. And in fact, having to use heuristics-based detection via uchardet is a big mess, some kind of header should have been standardized since the start.
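
For the common cases, a BOM check covers a lot of ground before falling back to heuristics; a sketch of such sniffing in Python (guess_encoding is a hypothetical helper name):

    import codecs

    def guess_encoding(raw: bytes) -> str:
        # Check the longer UTF-32 BOMs before UTF-16, since BOM_UTF32_LE starts with BOM_UTF16_LE.
        for bom, name in [(codecs.BOM_UTF8, "utf-8-sig"),
                          (codecs.BOM_UTF32_LE, "utf-32-le"), (codecs.BOM_UTF32_BE, "utf-32-be"),
                          (codecs.BOM_UTF16_LE, "utf-16-le"), (codecs.BOM_UTF16_BE, "utf-16-be")]:
            if raw.startswith(bom):
                return name
        return "utf-8"   # fall back to assuming UTF-8

    print(guess_encoding(b"\xff\xfeh\x00i\x00"))   # 'utf-16-le'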



Some editors used them to help detect UTF-8 encoded files. Since they are also valid zero-width no-break space characters, they also served as a nice easter egg for people who ended up editing their Linux shell scripts with a Windows text editor.


An attempt to store the encoding needed to decode the data with the data, rather than requiring the reader to know it somehow. Your program wouldn't have to care if its source data had been encoded as UTF-8, UTF-16, UTF-32 or some future standard. The usual sort of compromise that comes out of committees, in this case where every committee member wanted to be able to spit their preferred in-memory Unicode string representation to disk with no encoding overhead.


Some algorithms can operate much easier if they can assume that multibyte or variable byte characters don't exist. The BOM means that you don't have to scan the entire document to know if you can do that.


It's the behavior when using the default `Encoding.UTF8` static. You have to create your own instance as `new UTF8Encoding(false)` if you don't want a BOM.


This is true for `UTF8Encoding` used as an encoder (e.g. within transcoding stream, not often used today).

Other APIs, however, like File.WriteAllText, do not write BOM unless you explicitly pass encoding that does so (by returning non-empty preamble).



I actually did not know that File.WriteAllText/new StreamWriter defaulted to UTF-8 without BOM if no encoding was specified. I always passed in an encoding to those functions, and "Encoding.UTF8" has a BOM by default. Without specifying any encoding, I just assumed it would pick your system locale, because all the default String <-> Number conversion functions will indeed do that.

There are some coding standards for C# that mandate passing in the maximum number of parameters to a function, and never allow you to use the default parameter to be used. Sometimes this is a big win (prevents all that Current Culture nonsense when converting between numbers and strings, you need Invariant Culture almost all the time), and other times introduces bugs (Using the wrong value when creating Message Boxes to put them on the logon desktop instead of the user's screen).



It's a different overload. Encoding is not an optional parameter: https://learn.microsoft.com/en-us/dotnet/api/system.io.file....

Enforcing an overload of the highest arity of arguments sounds like a really terrible rule to have.

Culture-sensitivity is strictly different to locale as it does not act like a C locale (unsound) but simply follows delimiters/dates/currency/etc. format for parsing and formatting.

It is also in many places considered to be undesirable as it introduces environment-dependent behavior where it is not expected hence the analyzer will either suggest you to specify invariant culture or alternatively you can specify that in the project through InvariantGlobalization prop (to avoid CultureInfo.InvariantCulture spam). This is still orthogonal to text encoding however.



Indeed it would, but since codecs are only used for files that are semantically text, and in such files BOM is basically a legacy no-op marker, it's not actually a problem. Naive code using text I/O APIs would also have this issue with line endings, for example, so there's precedent for not providing the perfect roundtrip experience (that's what bytes I/O is for).


On UTF-8, the Linux framebuffer should have had good UTF-8 support (a proper one, not 256/512 glyphs) long ago. Even GNU Hurd has had a better 'terminal console' with UTF-8 support since 2007 or so. It's 2024.


Nice. Now the only thing we need is JS to switch to UTF-8. But of course JS can't improve, because unlike any other programming language, we need to be compatible with code written in 1995.


This is about when you ask Python to open a file "as text", what encoding it will use by default. The internal representation of strings is a different matter and, like JavaScript, Python doesn't "just use UTF-8" for that.


> Additionally, many Python developers using Unix forget that the default encoding is platform dependent. They omit to specify encoding="utf-8" when they read text files encoded in UTF-8

"forget" or possibly simply aren't made well enough aware? I genuinely thought that python would only use UTF-8 for everything unless you explicitly ask it to do otherwise.



It actually depends!

`bytes.decode` (and `str.encode`) have used UTF-8 as a default since at least Python 3.

However, the default encoding used for decoding file names is `sys.getfilesystemencoding()`, which is also UTF-8 on Windows and macOS, but will vary with the locale on Linux (specifically with CODESET).

Finally, `open` will directly use `locale.getencoding()`.



In addition to ApiFunctionA and ApiFunctionW, introduce ApiFunction8? (times whole API surface)

Introduce a #define UNICODE_NO_REALLY_ALL_UNICODE_WE_MEAN_IT_THIS_TIME ?



Yes: https://learn.microsoft.com/en-us/windows/win32/sbscs/applic...

> On Windows 10, this element forces a process to use UTF-8 as the process code page. For more information, see Use the UTF-8 code page. On Windows 10, the only valid value for activeCodePage is UTF-8.

> This element was first added in Windows 10 version 1903 (May 2019 Update). You can declare this property and target/run on earlier Windows builds, but you must handle legacy code page detection and conversion as usual. This element has no attributes.



You're thinking of the global setting that is enabled by the user and applies to all apps that operate in terms of "current code page" - if enabled, that codepage becomes 65001 (UTF-8).

However, on Win10+, apps themselves can explicitly opt into UTF-8 for all non-widechar Win32 APIs regardless of the current locale/codepage.



Hm TIL, I thought that the string encoding argument to .decode() and .encode() was required, but now I see it defaults to "utf-8". Did that change at some point?


> ChatGPT4 says it's always been that way since the beginning of Python3

This is not a reliable way to look up information. It doesn't know when it's wrong.



> In 3.1 it was the default encoding of string (the type str I guess).

No, what was used was whatever sys.getdefaultencoding() returned, which was already UTF-8 in 3.1 (I checked the source code).

At that time, the format used for representing `str` objects in memory depended on if you used a "narrow" (UTF-16) or "wide" (UTF-32) build of Python.

Fortunately, wide and narrow builds were abandoned in Python 3.2, with a new way of representing strings: current Python will use ASCII if there's no non-ASCII char, UCS-2 (UTF-16 without surrogate pairs) if there is no codepoint higher than U+FFFF, and UTF-32 otherwise. But that did not exist in 3.1, where you could either use the "narrow" build of Python (which used UTF-16) or the "wide" build (which used UTF-32).

See this article for a good overview of the history of strings in Python: https://tenthousandmeters.com/blog/python-behind-the-scenes-...



The simple thing to remember is that for all versions of Python going back 12 years, there's no such thing as "default encoding of string". A Python string is defined as a sequence of 32-bit Unicode codepoints, and that is how Python code perceives it in all respects. How it is stored internally is an implementation detail that does not affect you.


You're right, the docs just say "Unicode codepoints", and standard facilities like "\U..." or chr() will refuse anything above U+10FFFF. However I'm not sure that still holds true when third-party native modules are in the picture.
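
The U+10FFFF ceiling is enforced by the core APIs at least:

    print(repr(chr(0x10FFFF)))   # '\U0010ffff', the highest valid codepoint
    try:
        chr(0x110000)
    except ValueError as e:
        print(e)                 # chr() arg not in range(0x110000)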