I don’t know how Python 3.10’s strings work internally. Does it choose between 8-bit, 16-bit, and 32-bit storage per character at runtime?

For example:

for line in open('read1.py'):
    print(line)

Can the line string be an 8-bit, 16-bit, or 32-bit character string depending on the iteration? Is a line 8-bit by default, becoming a 32-bit string only if it contains an emoji?

  • nachtigall@feddit.de
    2 years ago

    Python strings are UTF-8 encoded by default. UTF-8 is a variable-width encoding in which each character takes between one and four bytes.

    A decoder first checks the leading bit of the first byte: if it is 0, the character is a single-byte (8-bit) ASCII character. A two-byte (16-bit) character always starts with 110, and its second byte starts with 10. A three-byte (24-bit) character starts with 1110, and the following bytes again start with 10. The longest, four-byte (32-bit) character starts with 11110, and its three following bytes each start with 10.

    The Wikipedia page explains and visualizes it quite nicely.
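
The byte prefixes described above can be checked directly from Python. This is a minimal sketch; the helper name `utf8_bit_patterns` is made up for illustration, and it only shows what `str.encode("utf-8")` produces, not how the interpreter stores strings in memory.

```python
def utf8_bit_patterns(ch: str) -> list[str]:
    """Return each UTF-8 byte of ch as an 8-character binary string."""
    return [format(byte, "08b") for byte in ch.encode("utf-8")]


# One sample per encoded length: 1, 2, 3 and 4 bytes.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", utf8_bit_patterns(ch))
```

The first byte of each result should start with 0, 110, 1110, or 11110 respectively, and every following byte with 10.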

    • vi21 (OP)
      2 years ago

      If they used UTF-8 internally, they wouldn’t need four versions of the split function, one per storage kind:

              case PyUnicode_1BYTE_KIND:
                  if (PyUnicode_IS_ASCII(self))
                      return asciilib_split_whitespace(
                          self,  PyUnicode_1BYTE_DATA(self),
                          len1, maxcount
                          );
                  else
                      return ucs1lib_split_whitespace(
                          self,  PyUnicode_1BYTE_DATA(self),
                          len1, maxcount
                          );
              case PyUnicode_2BYTE_KIND:
                  return ucs2lib_split_whitespace(
                      self,  PyUnicode_2BYTE_DATA(self),
                      len1, maxcount
                      );
              case PyUnicode_4BYTE_KIND:
                  return ucs4lib_split_whitespace(
                      self,  PyUnicode_4BYTE_DATA(self),
                      len1, maxcount
                      );
      

      https://github.com/python/cpython/blob/1402d2ceca8ccef8c3538906b3f547365891d391/Objects/unicodeobject.c#L9757
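
The per-kind dispatch in that snippet can also be observed from Python code. The sketch below is CPython-specific and rests on an assumption: that `sys.getsizeof` grows linearly with the character buffer, so the size difference between a long and a short string of the same character reveals the bytes used per character. The helper name `bytes_per_char` is hypothetical.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate storage per character by comparing a 1-char and an (n+1)-char string."""
    short = ch
    long = ch * (n + 1)
    # Both strings share the same header size, so the difference is n * width.
    return (sys.getsizeof(long) - sys.getsizeof(short)) // n


samples = {
    "ASCII": "a",
    "Latin-1": "\u00e9",     # é
    "BMP": "\u20ac",         # €
    "emoji": "\U0001F600",   # 😀
}
widths = {name: bytes_per_char(ch) for name, ch in samples.items()}
for name, width in widths.items():
    print(f"{name:8} {width} byte(s) per character")
```

On CPython 3.3 and later this should report 1, 1, 2, and 4 bytes. The width is chosen per string object from its widest character, so in the loop from the question a line containing an emoji would be stored with 4 bytes per character, while a pure ASCII line in the same file would still use 1.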