Strings, bytes, runes and characters in Go

Original article

Title: Strings, bytes, runes and characters in Go
Link: https://go.dev/blog/strings
Author: Rob Pike

Introduction

The previous blog post explained how slices work in Go, using a number of examples to illustrate the mechanism behind their implementation. Building on that background, this post discusses strings in Go. At first, strings might seem too simple a topic for a blog post, but to use them well requires understanding not only how they work, but also the difference between a byte, a character, and a rune, the difference between Unicode and UTF-8, the difference between a string and a string literal, and other even more subtle distinctions.

One way to approach this topic is to think of it as an answer to the frequently asked question, “When I index a Go string at position n, why don’t I get the nth character?” As you’ll see, this question leads us to many details about how text works in the modern world.

An excellent introduction to some of these issues, independent of Go, is Joel Spolsky’s famous blog post, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Many of the points he raises will be echoed here.

What is a string?

Let’s start with some basics.

In Go, a string is in effect a read-only slice of bytes. If you’re at all uncertain about what a slice of bytes is or how it works, please read the previous blog post; we’ll assume here that you have.

It’s important to state right up front that a string holds arbitrary bytes. It is not required to hold Unicode text, UTF-8 text, or any other predefined format. As far as the content of a string is concerned, it is exactly equivalent to a slice of bytes.

Here is a string literal (more about those soon) that uses the \xNN notation to define a string constant holding some peculiar byte values. (Of course, bytes range from hexadecimal values 00 through FF, inclusive.)

    const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"
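As a rough illustration of the claim that a string is a read-only sequence of arbitrary bytes, the sketch below inspects the sample constant. The helper name describe is hypothetical (it is not part of the original post), and utf8.ValidString comes from the standard unicode/utf8 package, which the post introduces later.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

// describe is a hypothetical helper: it reports the byte length of s
// and whether those bytes happen to form valid UTF-8.
func describe(s string) (n int, valid bool) {
	return len(s), utf8.ValidString(s)
}

func main() {
	n, valid := describe(sample)
	fmt.Println(n, valid) // 8 false

	// Converting to []byte copies the data, so the copy can be modified
	// while the string itself stays unchanged.
	b := []byte(sample)
	b[2] = '!'
	fmt.Println(b[2], sample[2]) // 33 61

	// sample[2] = '!' // would not compile: string contents are read-only
}
```

The commented-out assignment is the only place where the string and the byte slice behave differently here: the contents are the same kind of data, but only the slice can be written to.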

Printing strings

Because some of the bytes in our sample string are not valid ASCII, not even valid UTF-8, printing the string directly will produce ugly output. The simple print statement

    fmt.Println(sample)

produces this mess (whose exact appearance varies with the environment):

��=� ⌘

To find out what that string really holds, we need to take it apart and examine the pieces. There are several ways to do this. The most obvious is to loop over its contents and pull out the bytes individually, as in this for loop:

    for i := 0; i < len(sample); i++ {
        fmt.Printf("%x ", sample[i])
    }

As implied up front, indexing a string accesses individual bytes, not characters. We’ll return to that topic in detail below. For now, let’s stick with just the bytes. This is the output from the byte-by-byte loop:

bd b2 3d bc 20 e2 8c 98

Notice how the individual bytes match the hexadecimal escapes that defined the string.

A shorter way to generate presentable output for a messy string is to use the %x (hexadecimal) format verb of fmt.Printf. It just dumps out the sequential bytes of the string as hexadecimal digits, two per byte.

    fmt.Printf("%x\n", sample)

Compare its output to that above:

bdb23dbc20e28c98

A nice trick is to use the “space” flag in that format, putting a space between the % and the x. Compare the format string used here to the one above,

    fmt.Printf("% x\n", sample)

and notice how the bytes come out with spaces between, making the result a little less imposing:

bd b2 3d bc 20 e2 8c 98

There’s more. The %q (quoted) verb will escape any non-printable byte sequences in a string so the output is unambiguous.

    fmt.Printf("%q\n", sample)

This technique is handy when much of the string is intelligible as text but there are peculiarities to root out; it produces:

"\xbd\xb2=\xbc ⌘"

If we squint at that, we can see that buried in the noise is one ASCII equals sign, along with a regular space, and at the end appears the well-known Swedish “Place of Interest” symbol. That symbol has Unicode value U+2318, encoded as UTF-8 by the bytes after the space (hex value 20): e2 8c 98.

If we are unfamiliar with or confused by strange values in the string, we can use the “plus” flag to the %q verb. This flag causes the output to escape not only non-printable sequences, but also any non-ASCII bytes, all while interpreting UTF-8. The result is that it exposes the Unicode values of properly formatted UTF-8 that represents non-ASCII data in the string:

    fmt.Printf("%+q\n", sample)

With that format, the Unicode value of the Swedish symbol shows up as a \u escape:

"\xbd\xb2=\xbc \u2318"

These printing techniques are good to know when debugging the contents of strings, and will be handy in the discussion that follows. It’s worth pointing out as well that all these methods behave exactly the same for byte slices as they do for strings.

Here’s the full set of printing options we’ve listed, presented as a complete program you can run (and edit):

package main

import "fmt"

func main() {
    const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

    fmt.Println("Println:")
    fmt.Println(sample)

    fmt.Println("Byte loop:")
    for i := 0; i < len(sample); i++ {
        fmt.Printf("%x ", sample[i])
    }
    fmt.Printf("\n")

    fmt.Println("Printf with %x:")
    fmt.Printf("%x\n", sample)

    fmt.Println("Printf with % x:")
    fmt.Printf("% x\n", sample)

    fmt.Println("Printf with %q:")
    fmt.Printf("%q\n", sample)

    fmt.Println("Printf with %+q:")
    fmt.Printf("%+q\n", sample)
}

[Exercise: Modify the examples above to use a slice of bytes instead of a string. Hint: Use a conversion to create the slice.]

[Exercise: Loop over the string using the %q format on each byte. What does the output tell you?]
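One possible approach to the two exercises is sketched below; it is not the only answer, and the helper name quoteEachByte is an assumption made for this illustration. The sketch relies on the fact that fmt treats a []byte like a string for %x and %q, and that %q applied to a single byte formats it as a quoted character literal.

```go
package main

import "fmt"

const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

// quoteEachByte is a hypothetical helper that formats every byte of s
// with %q, one result per byte.
func quoteEachByte(s string) []string {
	out := make([]string, 0, len(s))
	for i := 0; i < len(s); i++ {
		out = append(out, fmt.Sprintf("%q", s[i]))
	}
	return out
}

func main() {
	// First exercise: a conversion yields a byte slice with the same contents,
	// and the verbs behave as they did for the string.
	b := []byte(sample)
	fmt.Printf("% x\n", b) // bd b2 3d bc 20 e2 8c 98
	fmt.Printf("%+q\n", b) // "\xbd\xb2=\xbc \u2318"

	// Second exercise: each byte is formatted on its own, as if its value
	// were a code point, so the three bytes of the symbol are no longer
	// decoded together as one UTF-8 sequence.
	for _, q := range quoteEachByte(sample) {
		fmt.Print(q, " ")
	}
	fmt.Println()
}
```

Run on its own, the second loop shows that a single byte carries no UTF-8 context: only a sequence of bytes can be decoded as UTF-8.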

UTF-8 and string literals

As we saw, indexing a string yields its bytes, not its characters: a string is just a bunch of bytes. That means that when we store a character value in a string, we store its byte-at-a-time representation. Let’s look at a more controlled example to see how that happens.

Here’s a simple program that prints a string constant with a single character three different ways, once as a plain string, once as an ASCII-only quoted string, and once as individual bytes in hexadecimal. To avoid any confusion, we create a “raw string”, enclosed by back quotes, so it can contain only literal text. (Regular strings, enclosed by double quotes, can contain escape sequences as we showed above.)

package main

import "fmt"

func main() {
    const placeOfInterest = `⌘`

    fmt.Printf("plain string: ")
    fmt.Printf("%s", placeOfInterest)
    fmt.Printf("\n")

    fmt.Printf("quoted string: ")
    fmt.Printf("%+q", placeOfInterest)
    fmt.Printf("\n")

    fmt.Printf("hex bytes: ")
    for i := 0; i < len(placeOfInterest); i++ {
        fmt.Printf("%x ", placeOfInterest[i])
    }
    fmt.Printf("\n")
}

The output is:

plain string: ⌘
quoted string: "\u2318"
hex bytes: e2 8c 98

which reminds us that the Unicode character value U+2318, the “Place of Interest” symbol ⌘, is represented by the bytes e2 8c 98, and that those bytes are the UTF-8 encoding of the hexadecimal value 2318.

It may be obvious or it may be subtle, depending on your familiarity with UTF-8, but it’s worth taking a moment to explain how the UTF-8 representation of the string was created. The simple fact is: it was created when the source code was written.

Source code in Go is defined to be UTF-8 text; no other representation is allowed. That implies that when, in the source code, we write the text

`⌘`

the text editor used to create the program places the UTF-8 encoding of the symbol ⌘ into the source text. When we print out the hexadecimal bytes, we’re just dumping the data the editor placed in the file.

In short, Go source code is UTF-8, so the source code for the string literal is UTF-8 text. If that string literal contains no escape sequences, which a raw string cannot, the constructed string will hold exactly the source text between the quotes. Thus by definition and by construction the raw string will always contain a valid UTF-8 representation of its contents. Similarly, unless it contains UTF-8-breaking escapes like those from the previous section, a regular string literal will also always contain valid UTF-8.

Some people think Go strings are always UTF-8, but they are not: only string literals are UTF-8. As we showed in the previous section, string values can contain arbitrary bytes; as we showed in this one, string literals always contain UTF-8 text as long as they have no byte-level escapes.

To summarize, strings can contain arbitrary bytes, but when constructed from string literals, those bytes are (almost always) UTF-8.

Code points, characters, and runes

We’ve been very careful so far in how we use the words “byte” and “character”. That’s partly because strings hold bytes, and partly because the idea of “character” is a little hard to define. The Unicode standard uses the term “code point” to refer to the item represented by a single value. The code point U+2318, with hexadecimal value 2318, represents the symbol ⌘. (For lots more information about that code point, see its Unicode page.)

To pick a more prosaic example, the Unicode code point U+0061 is the lower case Latin letter ‘A’: a.

But what about the lower case grave-accented letter ‘A’, à? That’s a character, and it’s also a code point (U+00E0), but it has other representations. For example we can use the “combining” grave accent code point, U+0300, and attach it to the lower case letter a, U+0061, to create the same character à. In general, a character may be represented by a number of different sequences of code points, and therefore different sequences of UTF-8 bytes.

The concept of character in computing is therefore ambiguous, or at least confusing, so we use it with care. To make things dependable, there are normalization techniques that guarantee that a given character is always represented by the same code points, but that subject takes us too far off the topic for now. A later blog post will explain how the Go libraries address normalization.

“Code point” is a bit of a mouthful, so Go introduces a shorter term for the concept: rune. The term appears in the libraries and source code, and means exactly the same as “code point”, with one interesting addition.

The Go language defines the word rune as an alias for the type int32, so programs can be clear when an integer value represents a code point. Moreover, what you might think of as a character constant is called a rune constant in Go. The type and value of the expression

'⌘'

is rune with integer value 0x2318.
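A short sketch of what the alias means in practice follows; the constant name placeOfInterest is reused here purely for illustration. Because rune and int32 are the same type, fmt reports the type as int32, and a rune can be assigned to an int32 variable without a conversion.

```go
package main

import "fmt"

// placeOfInterest is an untyped rune constant with value 0x2318.
const placeOfInterest = '⌘'

func main() {
	r := placeOfInterest // r takes the default type of a rune constant: rune

	// %T prints int32 because rune is only an alias for that type.
	fmt.Printf("%T %#x %c\n", r, r, r) // int32 0x2318 ⌘

	// No conversion is needed between rune and int32.
	var n int32 = r
	fmt.Println(n) // 8984
}
```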

To summarize, here are the salient points:

  • Go source code is always UTF-8.
  • A string holds arbitrary bytes.
  • A string literal, absent byte-level escapes, always holds valid UTF-8 sequences.
  • Those sequences represent Unicode code points, called runes.
  • No guarantee is made in Go that characters in strings are normalized.
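The difference between bytes and runes in the points above can be checked with a few lines of code. This is a sketch only; the helper name counts is hypothetical. It contrasts len, which counts bytes, with utf8.RuneCountInString, which counts runes, and shows that indexing a string and indexing a []rune conversion return different things.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts is a hypothetical helper returning the number of bytes
// and the number of runes in s.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	fmt.Println(counts("日本語")) // 9 3

	// Each invalid byte is counted as one rune of width 1.
	fmt.Println(counts("\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98")) // 8 6

	// Indexing the string yields a byte; indexing a []rune yields a code point.
	const nihongo = "日本語"
	fmt.Printf("%x\n", nihongo[1])         // 97
	fmt.Printf("%c\n", []rune(nihongo)[1]) // 本
}
```

Converting to []rune allocates and decodes the whole string, so it is a convenience for small inputs rather than a substitute for iterating over the string.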

Range loops

Besides the axiomatic detail that Go source code is UTF-8, there’s really only one way that Go treats UTF-8 specially, and that is when using a for range loop on a string.

We’ve seen what happens with a regular for loop. A for range loop, by contrast, decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index of the loop is the starting position of the current rune, measured in bytes, and the code point is its value. Here’s an example using yet another handy Printf format, %#U, which shows the code point’s Unicode value and its printed representation:

    const nihongo = "日本語"
    for index, runeValue := range nihongo {
        fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
    }

The output shows how each code point occupies multiple bytes:

U+65E5 '日' starts at byte position 0
U+672C '本' starts at byte position 3
U+8A9E '語' starts at byte position 6

[Exercise: Put an invalid UTF-8 byte sequence into the string. (How?) What happens to the iterations of the loop?]
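One way to explore this exercise is to range over the sample string from the printing section, which was built with \x escapes and contains bytes that are not valid UTF-8. In the sketch below the helper name runeStarts is hypothetical. For each byte that cannot be decoded, the loop yields the replacement rune U+FFFD (utf8.RuneError) and advances by a single byte.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const sample = "\xbd\xb2\x3d\xbc\x20\xe2\x8c\x98"

// runeStarts is a hypothetical helper that returns the byte offset
// at which each iteration of a range loop over s begins.
func runeStarts(s string) []int {
	var starts []int
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	for i, r := range sample {
		// Note: a correctly encoded U+FFFD in the input would also match here;
		// this sample contains none, so the check is enough for the sketch.
		if r == utf8.RuneError {
			fmt.Printf("invalid byte %#x at byte position %d\n", sample[i], i)
			continue
		}
		fmt.Printf("%#U starts at byte position %d\n", r, i)
	}

	// Three invalid bytes, '=', ' ', and the three-byte symbol: six iterations.
	fmt.Println(runeStarts(sample)) // [0 1 2 3 4 5]
}
```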

Libraries

Go’s standard library provides strong support for interpreting UTF-8 text. If a for range loop isn’t sufficient for your purposes, chances are the facility you need is provided by a package in the library.

The most important such package is unicode/utf8, which contains helper routines to validate, disassemble, and reassemble UTF-8 strings. Here is a program equivalent to the for range example above, but using the DecodeRuneInString function from that package to do the work. The return values from the function are the rune and its width in UTF-8-encoded bytes.

    const nihongo = "日本語"
    for i, w := 0, 0; i < len(nihongo); i += w {
        runeValue, width := utf8.DecodeRuneInString(nihongo[i:])
        fmt.Printf("%#U starts at byte position %d\n", runeValue, i)
        w = width
    }

Run it to see that it performs the same. The for range loop and DecodeRuneInString are defined to produce exactly the same iteration sequence.

Look at the documentation for the unicode/utf8 package to see what other facilities it provides.

Conclusion

To answer the question posed at the beginning: Strings are built from bytes so indexing them yields bytes, not characters. A string might not even hold characters. In fact, the definition of “character” is ambiguous and it would be a mistake to try to resolve the ambiguity by defining that strings are made of characters.

There’s much more to say about Unicode, UTF-8, and the world of multilingual text processing, but it can wait for another post. For now, we hope you have a better understanding of how Go strings behave and that, although they may contain arbitrary bytes, UTF-8 is a central part of their design.