Interning in CPython


Original title: Interning in CPython
Article link: https://www.white-winds.com/post/interning in cpython

CPython has an interesting, little-known property that can drive newcomers crazy. An example speaks louder than words:

>>> x = 255
>>> y = 255
>>> x == y
True
>>> x is y
True



What is an assignment in Python

The general syntax of a simple assignment in Python looks as follows:

<expression left> = <expression right>


For example:

x = "Dummy string"

Now we can use the variable x to refer to this string elsewhere in the program.

In CPython, we can even check explicitly where the object that x refers to lives in memory, using the built-in id() function.

>>> id(x)
140400709562064

This should be read as follows: the variable x points to a value that is stored in system memory at address 140400709562064. In C terms, this is a pointer. If the pointers of two variables are equal, they point to the same address in memory. The critical word here is "points", not "contains" or "equals", because a variable is just a name that points to an object in memory. There can be many such names, but the actual value lives in one place in memory.

For example, we can assign the value stored at 140400709562064 to two more variables:

>>> z = y = x
>>> id(x), id(y), id(z)
(140400709562064, 140400709562064, 140400709562064)


We see that the object in memory has not changed at all: it is the same object, and three variables now point to it.


So the assignment operation in Python does not copy the object. It only creates a reference, or pointer, to the actual object stored somewhere in RAM.

>>> x += "!"  # this statement creates a new independent object
>>> x
'Dummy string!'
>>> y  # the old one remains unharmed
'Dummy string'
>>> id(x), id(y)
(140534150379248, 140400709562064)

Now, if we try to change the original object, a new object is created in another place in memory, and the variable x points to it. The original object remains untouched. This is because strings in Python are immutable, so each change creates a new object.
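The same behaviour can be sketched as a small script instead of a REPL session. This is only an illustration, and the helper names same_before and same_after are made up for it: assignment binds a second name to the same object, while += on a string rebinds the name to a new object.

```python
x = "Dummy string"
y = x                      # y is bound to the same object as x; nothing is copied
same_before = x is y       # identity check before the change

x += "!"                   # str is immutable, so x is rebound to a new object
same_after = x is y        # the two names now refer to different objects

print(same_before, same_after, repr(x), repr(y))
```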

Immutable object assignment


With immutable types, everything is simple: any change produces a new object in memory. But let's imagine that the expression on the right is a mutable object, like a list:

>>> x = [1, 2, 3]
>>> x
[1, 2, 3]


If we try to change the original mutable object, a very confusing situation arises:

>>> y = x  # both variables point to the same object
>>> y
[1, 2, 3]
>>> id(x), id(y)
(140534150306864, 140534150306864)
>>> x.append(4)
>>> x
[1, 2, 3, 4]
>>> y
[1, 2, 3, 4]
>>> id(x), id(y)
(140534150306864, 140534150306864)


So when we change a mutable object through one variable, we are not creating a copy of it, as happens with immutable objects. We are changing the original object that the variable points to, and every other name bound to that object sees the change.
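If an independent list is what we actually want, one common option is to copy it explicitly. The sketch below contrasts an alias with a shallow copy; the names original, alias, and snapshot are chosen only for this example.

```python
original = [1, 2, 3]
alias = original            # a second name for the same list object
snapshot = original.copy()  # a new list object with the same elements (shallow copy)

original.append(4)          # mutates the one object shared by original and alias

print(alias)     # the alias sees the appended element
print(snapshot)  # the copy does not
```

Note that list.copy() is shallow: nested mutable elements would still be shared between the two lists.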


All right, now with the basics in mind, let's go to our example at the beginning of the post.


Going back to the example at the beginning of the post, we created two objects with the value 255 and might expect them to be separate objects in memory:

>>> x = 255
>>> y = 255
>>> x == y
True
>>> x is y
True

Here we compare the values of the objects using the == operator, while the is operator compares the identity of two objects. The is operator evaluates to True if the variables on either side point to the same object, and False otherwise.
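To see the two operators disagree, it helps to use a type that is never cached. The sketch below uses a hypothetical Point dataclass (not part of the original post), for which == compares field values while is compares identity.

```python
from dataclasses import dataclass


@dataclass
class Point:
    # dataclass generates __eq__, which compares the fields
    x: int
    y: int


p = Point(1, 2)
q = Point(1, 2)   # equal value, but a separately created object
r = p             # another name for the same object as p

values_equal = p == q   # value comparison
same_object = p is q    # identity comparison
alias_same = r is p     # identity comparison between two names for one object

print(values_equal, same_object, alias_same)
```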


But how can x and y literally be the same object?

CPython actually applies an optimization to small integers (-5 to 256 inclusive). These objects are loaded into the interpreter's memory when the interpreter starts up, which results in a small internal cache. Every time we try to create an integer object in this range, CPython automatically refers to these cached objects instead of creating a new integer object. Because of this, variables with the same value point to the same object, and x is y evaluates to True.

This is called interning: reusing existing objects on demand instead of creating new ones. As you can see from the code above, both x and y refer to the same object in memory. CPython does not create a new object for y but reuses the object that x already refers to. All of this happens because of integer interning.

A similar example with a number greater than 256 works as expected:

>>> x = 257
>>> y = 257
>>> x == y
True
>>> x is y
False

The reason for this optimization strategy is simple: integers between -5 and 256 are used more often than others, so it makes sense to keep them permanently in memory. CPython preloads them on startup to save both time and memory.
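The cached range can also be probed at runtime instead of being taken on trust. The sketch below is an illustration only: is_cached and cached_range are hypothetical helpers, and the result is a CPython implementation detail that other interpreters or future versions are free to change. Integers are built from strings so that the compiler cannot merge equal literals.

```python
def is_cached(n):
    """Return True if two independently created ints equal to n are one object."""
    text = str(n)
    # Both ints are created at runtime; they are the same object only if
    # the interpreter hands out a shared, cached instance for this value.
    return int(text) is int(text)


def cached_range(start=-50, stop=2000):
    """Return the smallest and largest cached value found in [start, stop)."""
    cached = [n for n in range(start, stop) if is_cached(n)]
    return cached[0], cached[-1]


low, high = cached_range()
print(low, high)
```

On the CPython versions this post discusses, the reported bounds should match the -5 to 256 range described above.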



The same principle is also applied for strings:

>>> a = "the"
>>> b = "the"
>>> a == b
>>> a is b

String interning is a very useful mechanism in CPython that allows strings to be compared much faster. Unfortunately, newcomers can sometimes be confused by the unexpected results it causes.


As we have already discussed, the rules for implicit string interning can differ between versions, and relying on them can lead to unexpected errors. The rules can change quickly because the language is constantly evolving, yet developers still need to ensure the stability and robustness of their programs. Not to mention that remembering the rules places an unnecessary burden on developers.

For example, before Python 3.7, peephole optimization was used to intern strings, and strings longer than 20 characters were not interned. Later, the algorithm was changed to use the AST optimizer, and the threshold length became 4096 instead of 20.

# version used: Python 3.6.8
>>> x = "K"*20
>>> y = "K"*20
>>> x == y, x is y
(True, True)
>>> x2 = "K"*21
>>> y2 = "K"*21
>>> x2 == y2, x2 is y2
(True, False)

# version used: Python 3.9.5
>>> x = "K"*4096
>>> y = "K"*4096
>>> x == y, x is y
(True, True)
>>> x2 = "K"*4097
>>> y2 = "K"*4097
>>> x2 == y2, x2 is y2
(True, False)

So, if for some reason we need to intern a string, we can do so explicitly with the intern() function from the built-in sys module.


Let's see how it works with a simple example:

>>> import sys
>>> x = sys.intern("K"*4097)
>>> y = sys.intern("K"*4097)
>>> x == y, x is y
(True, True)

As shown above, with sys.intern() we can make Python intern a string no matter what the implicit rules are.

In practice, if we need to compare several long strings and the same value may appear many times, explicitly calling sys.intern() is the recommended way to speed up the comparisons.
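Strings that are assembled at runtime are a typical case where the implicit rules do not apply. The following sketch (variable names are illustrative) builds the same text twice with str.join and then interns both results explicitly.

```python
import sys

parts = ["inter", "ning", "!"]
a = "".join(parts)               # built at runtime, so not interned implicitly
b = "".join(parts)               # equal text, but a separate object
distinct_before = a is not b

a = sys.intern(a)                # returns the canonical object for this text
b = sys.intern(b)                # returns that same canonical object
same_after = a is b

print(distinct_before, same_after)
```

Note that the return value of sys.intern() has to be kept: the function returns the interned object rather than changing the string passed in.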


Why do developers need interning?

String interning in CPython is a mechanism for storing only one copy of a string value in memory. If multiple string variables have the same value, CPython can intern them implicitly so that they all refer to the same object in memory.


There are several advantages to this mechanism:

  1. Saving memory space.
  2. Fast comparisons.
  3. Fast dictionary lookups.


The first advantage is obvious because it takes less space to store only one copy than to store all copies. The second and third are because if two strings refer to the same object, they are unambiguously equal and there is no need to compare their characters one by one.
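The memory argument can be made concrete by counting distinct objects before and after interning. This is a rough sketch, assuming CPython, with sample text and variable names invented for the example; tokens produced by str.split are separate objects even when their text is equal.

```python
import sys

text = "spam eggs spam eggs spam"
words = text.split()                        # each token is a separate str object
distinct_before = len({id(w) for w in words})

interned = [sys.intern(w) for w in words]   # equal tokens collapse to one object
distinct_after = len({id(w) for w in interned})

print(distinct_before, distinct_after)
```

After interning, comparing two tokens with is gives the same answer as comparing them with ==, which is what makes the identity shortcut in comparisons and dictionary lookups possible.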


Nothing is perfect: interning and maintaining a pool of cached objects also takes time. If a string is very long and will never be compared to others, interning it just costs extra time. So use this functionality wisely.

The interning mechanism is a hidden gem in CPython. CPython interns some strings implicitly, but developers can also use it explicitly, and when used properly it can noticeably speed up an application.




There is more to it.

You can access this memory if you want and change the value. Say, change the literal 4 to the value 5. I'll probably be damned by many people for this post, but here is some sample code:

>>> import ctypes
>>> ctypes.memmove(id(4) + 24, id(5) + 24, 8)
>>> print(2 * 2) # 5

You can also replace 0 with 1. Have fun debugging!