Interning in CPython

Original article

Title: Interning in CPython
Link: https://luminousmen.com/post/interning-in-cpython
Author: luminousmen

CPython has an interesting property that not every Python developer knows about, and it can drive newcomers crazy. An example speaks louder than words:

>>> x = 255
>>> y = 255
>>> x == y
True
>>> x is y
True

Whaaat?

What is an assignment in Python

The general syntax of a simple assignment in Python looks like this:

<expression left> = <expression right>

For example:

x = "Dummy string"

Now we can use the variable x to refer to our string elsewhere in the program.

In CPython, we can even check the memory address of the object that x refers to, using the built-in id() function.

>>> id(x)
140400709562064

This should be read as follows: the variable x points to a value stored in memory at address 140400709562064. In C terms, x behaves like a pointer. If two variables hold equal pointers, they point to the same address in memory. The critical word here is "points", not "contains" or "equals": a variable is just a name that points to an object in memory, and many such names can exist while the actual value lives in one place in memory.

For example, we can bind two more variables to the value stored at 140400709562064:

>>> z = y = x
>>> id(x), id(y), id(z)
(140400709562064, 140400709562064, 140400709562064)

The object in memory has not changed at all: it is the same object, and three variables now point to it.

[Figure: Assignment]

So the assignment operation in Python does not copy the object; it only creates a reference (a pointer) to the actual object stored somewhere in memory.
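The same point can be checked in a script rather than an interactive session. This is a minimal sketch with illustrative variable names: assignment binds additional names to one object, which both the is operator and id() confirm.

```python
# Assignment binds a new name to an existing object; nothing is copied.
x = "Dummy string"
y = x          # y now refers to the very same object as x
z = y          # and so does z

same_object = x is y and y is z
same_id = id(x) == id(y) == id(z)

print(same_object)  # True
print(same_id)      # True
```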

>>> x += "!"  # this statement creates a new independent object
>>> x
'Dummy string!'
>>> y  # the old one remains unharmed
'Dummy string'
>>> id(x), id(y)
(140534150379248, 140400709562064)

Now, if we change the value through the original variable, a new object is created elsewhere in memory and x points to it, while the original object remains untouched. This is because strings in Python are immutable, so every change creates a new object.

[Figure: Immutable object assignment]

With immutable types, everything is simple: a change almost always creates a new object in memory. But imagine that the expression on the right is a mutable object, such as a list:

>>> x = [1, 2, 3]
>>> x
[1, 2, 3]

If we modify the original mutable object, a very confusing situation arises:

>>> y = x  # both variables point to the same object
>>> y
[1, 2, 3]
>>> id(x), id(y)
(140534150306864, 140534150306864)
>>> x.append(4)
>>> x
[1, 2, 3, 4]
>>> y
[1, 2, 3, 4]
>>> id(x), id(y)
(140534150306864, 140534150306864)

So when we modify a mutable object through a variable, no copy is created, unlike with immutable objects; we change the original object itself, and every variable that points to it sees the change.
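When an independent list is actually wanted, one common option is to copy it explicitly. The sketch below uses copy.copy from the standard library; the names original, alias and shallow are only illustrative, and a shallow copy is assumed to be enough because the list holds plain integers.

```python
import copy

original = [1, 2, 3]
alias = original               # a second name for the same list
shallow = copy.copy(original)  # a new list object with the same items

original.append(4)

print(alias)    # [1, 2, 3, 4]  (the alias sees the change)
print(shallow)  # [1, 2, 3]     (the copy does not)
print(alias is original, shallow is original)  # True False
```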

All right, with the basics in mind, let's return to the example from the beginning of the post.

Integer interning

In the example at the beginning of the post, we created what looked like two separate objects with the value 255 and expected them to be different objects in memory:

>>> x = 255
>>> y = 255
>>> x == y
True
>>> x is y
True

Here the == operator compares the values of the objects, while the is operator compares their identity. The is operator evaluates to True if the variables on both sides point to the same object, and False otherwise.
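The difference between the two operators is easiest to see with objects that are never cached. In this small sketch (the names a, b and c are illustrative), two list displays always produce two distinct objects, even when their contents are equal.

```python
a = [1, 2, 3]
b = [1, 2, 3]   # equal contents, but a separately created list
c = a           # another name for the first list

print(a == b)   # True:  same value
print(a is b)   # False: different objects
print(a is c)   # True:  same object
```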

But how are they literally the same object?

CPython actually has an optimization for small integers (-5 to 256 inclusive). These objects are loaded into the interpreter's memory at startup, forming a small internal cache. Every time we try to create an integer object in this range, CPython refers to the cached object instead of creating a new one. Because of this, variables with the same value point to the same object, and x is y evaluates to True.

This is called interning: reusing existing objects on demand instead of creating new ones. As the code above shows, both x and y refer to the same object in memory, because CPython does not create a new object for y but reuses the one x already refers to. All of this happens because of integer interning.

A similar example with a number greater than 256, entered line by line in the interactive interpreter, behaves as originally expected:

>>> x = 257
>>> y = 257
>>> x == y
True
>>> x is y
False

The reason for this optimization is simple: integers between -5 and 256 are used most often, so it makes sense to keep them permanently in memory. CPython preloads them at startup to save both time and memory.
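The cached range can be probed empirically. The sketch below is an illustration, not part of the original post: the helper is_cached is a hypothetical name, and it relies on CPython-specific behavior. It builds each value at run time with int(str(n)) so that the compiler cannot merge the two operands into one constant. On the CPython versions the article discusses it is expected to report -5 and 256; other versions or implementations may report something else.

```python
def is_cached(n: int) -> bool:
    """Return True if two runtime-created ints with value n are one object."""
    # int(str(n)) builds the value at run time, so the two operands are
    # created independently instead of sharing one compile-time constant.
    return int(str(n)) is int(str(n))


# Scan a window around the range quoted in the article.
cached = [n for n in range(-50, 1500) if is_cached(n)]

# Smallest and largest value in the window that behave as cached.
print(min(cached), max(cached))
```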

String interning

The same principle also applies to strings:

>>> a = "the"
>>> b = "the"
>>> a == b
True
>>> a is b
True

String interning is a very useful mechanism in CPython that makes string comparison much faster. Unfortunately, the unexpected results it can produce sometimes confuse newcomers.

The rules for implicit string interning are an implementation detail and can differ between versions, so relying on them can lead to unexpected errors. The language is constantly evolving and the rules can change quickly, yet developers still need to keep their programs stable and robust. Not to mention that memorizing the rules is an unnecessary burden.

For example, before Python 3.7, peephole optimization was used to intern strings, and strings longer than 20 characters were not interned. The algorithm was then changed to the AST optimizer, and the threshold length became 4096 instead of 20.

# version used: Python 3.6.8
>>> x = "K"*20
>>> y = "K"*20
>>> x == y, x is y
(True, True)
>>> x2 = "K"*21
>>> y2 = "K"*21
>>> x2 == y2, x2 is y2
(True, False)

# version used: Python 3.9.5
>>> x = "K"*4096
>>> y = "K"*4096
>>> x == y, x is y
(True, True)
>>> x2 = "K"*4097
>>> y2 = "K"*4097
>>> x2 == y2, x2 is y2
(True, False)

So if for some reason we need a string to be interned, we can do it explicitly with the intern() function from the built-in sys module.

Let's see how it works with a simple example:

>>> import sys
>>> x = sys.intern("K"*4097)
>>> y = sys.intern("K"*4097)
>>> x == y, x is y
(True, True)

As shown above, with intern() we can make Python intern strings regardless of the implicit rules.
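The same holds for strings that are assembled at run time, which the implicit rules generally do not cover. In this minimal sketch the names parts, a and b are illustrative, and only the result of the sys.intern() comparison is guaranteed.

```python
import sys

parts = ["inter", "ning"]
a = "".join(parts)   # built at run time
b = "".join(parts)   # equal value, built separately

print(a == b)   # True
print(a is b)   # typically False in CPython, but not guaranteed either way
print(sys.intern(a) is sys.intern(b))  # True
```

sys.intern() returns the canonical object for a given value, so the program has to keep and use the returned reference for the identity check to hold.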

In practice, if we need to compare many long strings and the same value may appear many times, interning them explicitly with intern() is the recommended way to speed up the comparisons.

Why do developers need interning?

String interning in CPython is a mechanism for storing only one copy of a string value in memory. When several string variables hold the same value and CPython interns it implicitly, they all refer to the same object in memory.

There are several advantages to this mechanism:

  1. Saving memory space.
  2. Fast comparisons.
  3. Fast dictionary lookups.

The first advantage is obvious: storing one copy takes less space than storing every copy. The second and third follow because two strings that refer to the same object are unambiguously equal, so there is no need to compare their characters one by one.
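The memory argument can be illustrated with a short sketch. The data here is made up: the list raw simulates values arriving at run time (for example, parsed from a file), so repeated text would normally end up in separately allocated objects. After sys.intern(), each distinct value is backed by exactly one object.

```python
import sys

# Build the strings at run time so that repeated values are separate objects.
raw = ["".join(["status-", str(i % 3)]) for i in range(9)]
interned = [sys.intern(s) for s in raw]

distinct_values = len(set(raw))                    # number of different texts
distinct_objects = len({id(s) for s in interned})  # number of objects kept

print(distinct_values, distinct_objects)  # 3 3
```

Because equal interned strings are the same object, an identity check is enough to establish equality, which is the basis of the comparison and dictionary-lookup advantages described above.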

Nothing is perfect, though: interning and maintaining a pool of cached objects also takes time. If a string is very long and will never be compared with others, interning it only costs extra time. So use this functionality wisely.

The interning mechanism is a hidden gem in CPython. CPython interns some strings implicitly, but developers can also use it explicitly, and when used properly it can make a big difference to the speed of an application.

P.S.

There is more to it.

You can access this memory if you want and change the value. Say, make the literal 4 hold the value 5. I'll probably be damned by many people for this post, but here is some sample code:

>>> import ctypes
>>> # Assumes a 64-bit CPython build where an int's digits start at byte offset 24.
>>> ctypes.memmove(id(4) + 24, id(5) + 24, 8)
>>> print(2 * 2) # 5

You can also replace 0 with 1. Have fun debugging!