原文标题：Lock-free multithreading with atomic operations
本文链接：https://www.white-winds.com/post/lock-free multithreading with atomic operations
Synchronizing threads at a lower level.
本文已由 Federica Rinaldi 仔细校对。由衷感谢！
This article has been carefully proofread by Federica Rinaldi. Thank you!
The Greek word "atom" (ἄτομος; atomos) means uncuttable. A task performed by a computer is said to be atomic when it is not divisible anymore: it can't be broken into smaller steps.
Atomicity is an important property of multithreaded operations: since they are indivisible, there is no way for a thread to slip through an atomic operation concurrently performed by another one. For example, when a thread atomically writes on shared data no other thread can read the modification half-complete. Conversely, when a thread atomically reads from shared data, it sees the value as it appeared at a single moment in time. In other words, there is no risk of data races.
在上篇文章中已经介绍了所谓的同步原语，它们是线程同步最常用的工具。它们为多线程对共享数据的操作提供原子性。这是怎么做到的？ 它们只允许单个线程执行并发部分的工作，在第一个线程完成前，其他线程则被操作系统阻止。基本原理是阻塞的线程不会对其他线程造成坏的影响。由于它们能够使线程冻结，这种同步原语也被称为阻塞机制（blocking mechanisms）。
In the previous chapter I have introduced the so-called synchronization primitives, the most common tools for thread synchronization. They are used, among other things, to provide atomicity to operations that deal with data shared across multiple threads. How? They simply allow a single thread to do its concurrent job, while others are blocked by the operating system until the first one has finished. The rationale is that a blocked thread does no harm to others. Given their ability to freeze threads, such synchronization primitives are also known as blocking mechanisms.
- 几乎无法控制哪个线程将休眠 ——这通常取决于操作系统选择要阻止哪个线程。这可能会导致一个糟糕的情况，被称为优先级反转（priority inversion）：正在执行一项非常重要的任务的线程被另一个具有较低优先级的线程所阻塞。
Any blocking mechanism seen in the previous chapter will work great for the vast majority of your applications. They are fast and reliable if used correctly. However, they introduce some drawbacks that you might want to take into account:
- they block other threads — a dormant thread simply waits for the wakeup signal, doing nothing: it could be wasting precious time;
- they could hang your application — if a thread holding a lock to a synchronization primitive crashes for whatever reason, the lock itself will be never released and the waiting threads will get stuck forever;
- you have little control over which thread will sleep — it's usually up to the operating system to choose which thread to block. This could lead to an unfortunate event known as priority inversion: a thread that is performing a very important task gets blocked by another one with a lower priority.
Most of the time you don't care about these issues as they won't affect the correctness of your programs. On the other hand, sometimes having threads always up and running is desirable, especially if you want to take advantage of multi-processor/multi-core hardware. Or maybe you can't afford a system that could get stuck on a dead thread. Or again, the priority inversion problem looks too dangerous to ignore.
Lock-free programming to the rescue
好消息是：在以防止出现上述第 1、2 和 3 点的条件下，还有另一种方法可以控制多线程应用程序中的并发任务。被称为无锁编程（lock-free programming ，lockless programming），它是一种在多个线程之间安全共享更改数据的技术，而且没有锁和解锁的开销。
The good news: there is another way to control concurrent tasks in your multithreaded app, in order to prevent points 1), 2) and 3) seen above. Known as lock-free programming or lockless programming, it's a technique to safely share changing data between multiple threads without the cost of locking and unlocking them.
The bad news: this is low-level stuff. Way lower than using the traditional synchronization primitives like mutexes and semaphores: this time we will get closer to the metal. Despite this, I find lock-free programming a good mental challenge and a great opportunity to better understand how a computer actually works.
无锁编程依赖于 CPU 直接执行的原子操作，它被称为原子指令（atomic instructions）。作为无锁编程的基础，在本文的其余部分中将首先介绍原子指令，然后展示如何利用它们进行并发控制。 开始吧！
Lock-free programming relies upon atomic instructions, operations performed directly by the CPU that occur atomically. Being the foundation of lock-free programming, in the rest of this article I will introduce atomic instructions first, then I will show you how to leverage them for concurrency control. Let's get started!
What are atomic instructions?
想想计算机执行的某个操作，例如在屏幕上显示图片。这样的操作由许多较小的操作组成：将文件读入内存，解压缩图像，点亮屏幕上的像素等等。 如果不断地放大某个子任务，也就是说如果把它分解成越来越小的部分，最后就是由处理器执行的最小的、对人类可见的操作，称为机器指令（machine instruction），是由硬件直接执行的命令。
Think of any action performed by a computer, say for example displaying a picture on your screen. Such operation is made of many smaller ones: read the file into memory, de-compress the image, light up pixels on the screen and so on. If you recursively zoom into one of those sub-tasks, that is if you break it down into smaller and smaller pieces, you will eventually reach a dead end. The smallest, visible to a human operation performed by a processor is called machine instruction, a command executed by the hardware directly.
1.Multiple layers of a computer program. Dashed lines are software, solid lines are hardware.
取决于 CPU 的架构，一些机器指令是原子的，也就是说它们是在一个单一的、不可切割的、不可中断的步骤中执行的；另外一些不是原子的：处理器以更小的操作形式在内部完成更多工作，这被称为微操作（micro-operations）。对于前一类：原子指令是不能进一步分解的 CPU 操作。更具体地说，原子指令可以分为两大类：存储与加载（store and load）以及读取-修改-写入（read-modify-write，RMW）。
Depending on the CPU architecture, some machine instructions are atomic, that is they are performed in a single, uncuttable and uninterruptible step. Some others are not atomic instead: the processor does more work under the hood in form of even smaller operations, known as micro-operations. Let's focus on the former category: an atomic instruction is a CPU operation that cannot be further broken down. More specifically, atomic instructions can be grouped into two major classes: store and load and read-modify-write (RMW).
Store and load atomic instructions
The building blocks any processor operates on: they are used to write (store) and read (load) data in memory. Many CPU architectures guarantee that these operations are atomic by nature, under some circumstances. For example, processors that implement the x86 architecture feature the
MOVinstruction, which reads bytes from memory and gives them to the CPU. This operation is guaranteed to be atomic if performed on aligned data, that is information stored in memory in a way that makes it easy for the CPU to read it in a single shot.
Read-modify-write (RMW) atomic instructions
一些更复杂的操作不能单独使用简单的存储和加载来执行。例如，增加内存中的值将至少需要三个原子的加载和存储指令组合，从而使结果整体上不是原子的。 读取-修改-写入指令通过在一个原子步骤中执行多个操作来填补空白。有很多这类指令，某些 CPU 架构提供了全集，而另一些 CPU 仅提供一个子集。举几例：
- test-and-set — 将 1 写入内存位置并返回旧值，原子步骤；
- fetch-and-add — 增加内存中的一个值并返回旧值，原子步骤；
- compare-and-swap (CAS) — 将内存位置的内容与给定值进行比较，如果它们相等，将该内存位置的内容修改为新的给定值。
Some more complex operations can't be performed with simple stores and loads alone. For example, incrementing a value in memory would require a mixture of at least three atomic load and store instructions, making the outcome non-atomic as a whole. Read-modify-write instructions fill the gap by giving you the ability to compute multiple operations in one atomic step. There are many instructions in this class. Some CPU architectures provide them all, some others only a subset. To name a few:
- test-and-set — writes 1 to a memory location and returns the old value in a single, atomic step;
- fetch-and-add — increments a value in memory and returns the old value in a single, atomic step;
- compare-and-swap (CAS) — compares the content of a memory location with a given value and, if they are equal, modifies the contents of that memory location to a new given value.
All these instructions perform multiple things in memory in a single, atomic step. This is an important property that makes read-modify-write instructions suitable for lock-free multithreading operations. We will see why in few paragraphs.
Three levels of atomic instructions
上面看到的所有指令都属于硬件：它们要求与 CPU 直接对话。以这种方式工作显然很费力且不可移植，因为某些指令可能在不同的体系结构中具有不同的名称。 某些操作甚至可能不存在于一些处理器型号中！ 因此，除非正在为特定机器编写非常底层的代码，否则不太可能接触这些东西。
All the instructions seen above belong to the hardware: they require you to talk directly to the CPU. Working this way is obviously difficult and non-portable, as some instructions might have different name across different architectures. Some operations might not even exist across different processor models! So it is unlikely you will touch these things, unless you are working on very low-level code for a specific machine.
Climbing up to the software level, many operating systems provide their own versions of atomic instructions. Let's call them atomic operations, since we are abstracting away from their physical machine counterpart. For example, in Windows you may find the Interlocked API, a set of functions that handle variables in an atomic manner. MacOS does the same with its OSAtomic.h header. They surely conceal the hardware implementation, but you are still bound to a specific environment.
java.util.concurrent.atomic 包；C++ 提供了
std::atomic 头文件； Haskell 有
Data.Atomics 包等等。一般来说，如果一个编程语言处理多线程，它很可能会有对原子操作的支持。这种方式取决于编译器（如果它是编译型语言）或虚拟机（如果它是解释型语言）来找到实现原子操作的最佳指令，无论该指令是来自底层操作系统 API 还是直接来自硬件。
The best way to perform portable atomic operations is to rely upon the ones provided by the programming language of choice. In Java for example you will find the
java.util.concurrent.atomicpackage; C++ provides the
std::atomicheader; Haskell has the
Data.Atomicspackage and so on. Generally speaking, it is likely to find support for atomic operations if a programming language deals with multithreading. This way is up to the compiler (if it's a compiled language) or the virtual machine (if it's an interpreted language) to find the best instructions for implementing atomic operations, whether from the underlying operating system API or directly from the hardware.
2.Hierarchy of atomic instructions and operations. Dashed lines are software, solid lines are hardware.
例如，GCC（一种 C++ 编译器）通常将 C++ 原子操作和对象直接转换为机器指令。It also tries to emulate a specific operation that doesn't map directly to the hardware with other atomic machine instructions if available. 最坏的情况是：在不提供原子操作的平台上，它可能依赖于其他的阻塞策略，当然这些策略并不是无锁的。
For example, GCC — a C++ compiler — usually transforms C++ atomic operations and objects straight into machine instructions. It also tries to emulate a specific operation that doesn't map directly to the hardware with other atomic machine instructions if available. The worst-case scenario: on a platform that doesn't provide atomic operations it may rely upon other blocking strategies, which wouldn't be lock-free, of course.
Leveraging atomic operations in multithreading
Let's now see how atomic operations are used. Consider incrementing a simple variable, a task that is not atomic by nature as it is made of three different steps — read the value, increment it, store the new value back. Traditionally, you would regulate the operation with a mutex (pseudocode):
mutex = initialize_mutex() x = 0 reader_thread() mutex.lock() print(x) mutex.unlock() writer_thread() mutex.lock() x++ mutex.unlock()
The first thread that acquires the lock makes progress, while others sit and wait in line until it has finished.
Conversely, the lock-free approach introduces a different pattern: threads are free to run without any impediment, by employing atomic operations. For example:
x = 0 reader_thread() print(load(x)) writer_thread() fetch_and_add(x, 1)
I assume that
load()are atomic operations based on the corresponding hardware instructions. As you may notice, nothing is locked here. Multiple threads that call those functions concurrently can all make progress. The atomicity of
load()makes sure that no reader thread will read the shared value half-complete, as well as no writer thread will damage it with a partial write thanks to
Atomic operations in the real world
Now, this example reveals us an important property of atomic operations: they work only with primitive types — booleans, chars, shorts, ints and so on. On the other hand, actual programs require synchronization for more complex structures like arrays, vectors, objects, vectors of arrays, objects containing arrays, ... . How can we guarantee atomicity on such convoluted entities with simple operations based on primitive types?
无锁编程迫使你跳出通常的基于同步原语的框架来进行思考。不使用互斥锁或信号量来保护共享资源。取而代之的是，通过构建无锁算法（lock-free algorithms）或无锁数据结构（lock-free data structures），基于原子操作来确定多个线程将如何访问共享数据。
Lock-free programming forces you to think out of the box of the usual synchronization primitives. You don't protect a shared resource with atomic operations directly, as you would do with a mutex or a semaphore. Rather, you build lock-free algorithms or lock-free data structures, based on atomic operations to determine how multiple threads will access your data.
例如，前面看到的 fetch-and-add 操作可用于制作基本的信号量，如此可以使用它来调节线程。这并不奇怪，所有传统的阻塞同步实体都是基于原子操作的。
For example, the fetch-and-add operation seen before can be used to make a rudimentary semaphore that, in turn, you would employ to regulate threads. Not surprisingly all the traditional, blocking synchronization entities are based on atomic operations.
People have written countless lock-free data structures like Folly's AtomicHashMap, the Boost.Lockfree library, multi-producer/multi-consumer FIFO queues or algorithms like read-copy-update (RCU) and Shadow Paging to name a few. Writing these atomic weapons from scratch is hard, let alone making them work correctly. This is why most of the time you may want to employ existing, battle-tested algorithms and structures instead of rolling your owns.
The compare-and-swap (CAS) loop
在更接近现实场景的应用程序中，比较和交换循环（compare-and-swap loop）（又名 CAS 循环），无论是使用现有数据结构还是正在编写从头开始的算法，它可能是无锁编程中最常见的策略。它基于相应的 compare-and-swap 原子操作，这个原子操作有一个很好的属性：它支持多个写入器。这是并发算法的一个重要特征，尤其是在复杂系统中使用时。
Moving closer to real-world applications, the compare-and-swap loop (a.k.a. CAS loop) is probably the most common strategy in lock-free programming, whether you are using existing data structures or are writing algorithms from the ground up. It is based on the corresponding compare-and-swap atomic operation and has a nice property: it supports multiple writers. This is an important feature of a concurrent algorithm especially when used in complex systems.
The CAS loop is interesting also because it introduces a recurring pattern in lock-free code, as well as some theoretical concepts to reason about. Let's take a closer look.
一个 CAS 循环在起作用¶
A CAS loop in action
操作系统或编程语言提供的 CAS 函数可能如下所示：
A CAS function provided by an operating system or a programming language might look like this:
boolean compare_and_swap(shared_data, expected_value, new_value);
shared_data.value == expected_value ，该函数才用新值替换当前值（并返回
It takes in input a reference/pointer to some shared data, the expected value it currently takes on and the new value you want to apply. The function replaces the current value with the new one (and returns
true) only if the value hasn't changed, that is if
shared_data.value == expected_value.
在 CAS 循环中，核心的想法是反复尝试比较和交换，直到操作成功。在每次循环中，向 CAS 函数提供引用/指针、期望值和新值，以便处理同时执行相同操作的任何其他编写器线程，这是必要的：如果另一个线程同时更改了数据，即如果共享数据不再与预期值匹配，则 CAS 函数将失败。支持多个写入器！
In a CAS loop the idea is to repeatedly trying to compare and swap until the operation is successful. On each iteration you feed the CAS function with the reference/pointer, the expected value and the desired one. This is necessary in order to cope with any other writer thread that is doing the same thing concurrently: the CAS function fails if another thread has changed the data in the meantime, that is if the shared data no longer matches the expected value. Multiple writers support!
假设我们想用 CAS 循环复制前面代码片段中的 fetch-and-add 算法。它看起来大致像这样（伪代码）：
Suppose we want to replicate the fetch-and-add algorithm seen in the previous snippet with a CAS loop. It would look roughly like this (pseudocode):
x = 0 reader_thread() print(load(x)) writer_thread() temp = load(x) // (1) while(!compare_and_swap(x, temp, temp + 1)) // (2)
在 (1) 中，算法加载共享数据的现有值，然后尝试将其与新值交换，直到成功 (2)，即直到 CAS 函数返回
In (1) the algorithm loads the existing value of the shared data, then it tries to swap it with the new value until success (2), that is until the CAS function returns
The swapping paradigm
As said before, the CAS loop introduces a recurring pattern in many lock-free algorithms:
- create a local copy of the shared data;
- modify the local copy as needed;
- when ready, update the shared data by swapping it with the local copy created before.
第 3) 点是关键：交换是通过原子操作以原子方式执行的。 肮脏的工作由编写者线程在本地完成，然后仅在准备好时发布。 这样，另一个线程只能在两种状态下观察共享数据：旧状态或新状态。 由于原子交换，没有半完成或损坏的更新。
Point 3) is the key: the swap is performed atomically through an atomic operation. The dirty job is done locally by the writer thread and then published only when ready. This way another thread can observe the shared data only in two states: either the old one, or the new one. No half-complete or corrupted updates, thanks to the atomic swap.
This is also philosophically different from the locking approach: in a lock-free algorithm threads get in touch only during that tiny atomic swap, running undisturbed and unaware of others for the rest of the time. The point of contact between threads is now shrinked down and limited to the duration of the atomic operation.
A gentle form of locking
上面看到的 spin until success 策略被用于许多无锁算法中，被称为 spinlock：一个简单的循环，线程反复尝试执行某事直到成功。 这是一种温和的锁定形式，其中线程启动并运行 - 操作系统不会强制睡眠，尽管在循环结束之前没有任何进展。 互斥锁或信号量中使用的常规锁要昂贵得多，因为挂起/唤醒周期需要大量工作。
The spin until success strategy seen above is employed in many lock-free algorithms and is called spinlock: a simple loop where the thread repeatedly tries to perform something until successful. It's a form of gentle lock where the thread is up and running — no sleep forced by the operating system, although no progress is made until the loop is over. Regular locks employed in mutexes or semaphores are way more expensive, as the suspend/wakeup cycle requires a lot of work under the hood.
The ABA problem
第 (1) 行和第 (2) 行中的指令确实是原子的，但又是不同的。 一旦在 (1) 中读取，另一个线程可能会从裂缝中溜走并更改共享数据的值。 具体来说，它可以将初始值“A”转换为另一个值“B”，然后在 (2) 中开始比较和交换操作之前将其带回“A”。 运行 CAS 循环的线程不会注意到更改并成功执行交换。 这被称为 ABA 问题：如果您的算法像上述那样简单，有时您可以轻松忽略它，有时您想阻止它，因为它会在您的程序中引入细微的错误。 幸运的是，为此有 几种解决方法。
Instructions in lines (1) and (2) are atomic indeed, yet distinct. Another thread might slip through the cracks and change the value of the shared data once has been read in (1). Specifically, it could turn the initial value, say
A, into another value, say
B, and then bring it back to
Aright before the compare and swap operation has started in (2). The thread that is running the CAS loop wouldn't notice the change and perform the swap successfully. This is known as the ABA problem: sometimes you can easily ignore it if you algorithm is simple as the one above, sometimes you want to prevent it instead as it would introduce subtle bugs in your programs. Luckily there are several workarounds for this.
您可以在 CAS 循环中交换任何内容¶
You can swap anything inside a CAS loop
CAS 循环通常用于交换指针，这是 compare-and-swap 操作支持的类型。 当您想要修改复杂的数据集合（如类或数组）时，这很有用：只需创建本地副本，根据需要对其进行修改，然后在准备好时将指向本地数据的指针与指向全局数据的指针交换。 这样全局数据将指向为本地副本分配的内存，其他线程将看到最新信息。
The CAS loop is often used to swap pointers, a type supported by the compare-and-swap operation. This is useful when you want to modify a complex collection of data like a class or an array: just create the local copy, modify it as needed and then when ready swap a pointer to the local data with a pointer to the global data. This way global data will point to the memory allocated for the local copy and other threads will see up-to-date information.
此技术允许您成功同步非原始实体，但很难使其正常工作。 如果在交换之后，读取器线程仍在读取旧指针怎么办？ 如何正确删除以前的副本而不产生危险的悬空指针？ 工程师们再次找到了许多解决方案，例如使用支持 [垃圾收集](https://en.wikipedia.org/wiki/Garbage_collection_(computer_science) 的语言或 基于纪元的内存回收、危险指针 或 [引用计数](https://en. wikipedia.org/wiki/Reference_counting）。
This technique allows you to successfully synchronize non-primitive entities, yet is difficult to make it work correctly. What if, after the swap, a reader thread is still reading the old pointer? How to properly delete the previous copy without generating dangerous dangling pointers? Once again engineers have found many solutions such as using a language that supports [garbage collection](https://en.wikipedia.org/wiki/Garbage_collection_(computer_science) or esoteric techniques like epoch-based memory reclamation, hazard pointers or reference counting.
无锁 vs 无等待¶
Lock-freedom vs. wait-freedom
Every algorithm or data structure based on atomic operations can be clustered into two groups: lock-free or wait-free. This is an important distinction when you have to evaluate the impact of atomic-based tools on the performance of your program.
无锁算法允许剩余的线程继续做有用的工作，即使其中一个暂时忙。 换句话说，至少有一个线程总是取得进展。 CAS 循环是无锁的完美示例，因为如果 CAS 循环的单次迭代失败，通常是因为其他线程成功修改了共享资源。 但是，无锁算法可能会花费不可预测的时间来旋转，尤其是当有许多线程竞争相同的资源时：从技术上讲，当争用很高时。 将其推向极限，无锁算法在 CPU 资源上的效率可能远低于使阻塞线程进入睡眠状态的互斥锁。
Lock-free algorithms allow the remaining threads to continue doing useful work even if one of them is temporarily busy. In other words, at least one thread always makes progress. The CAS loop is a perfect example of lock-free because if a single iteration of the CAS loop fails, it’s usually because some other thread has modified the shared resource successfully. However, a lock-free algorithm might spend an unpredictable amount of time just spinning, especially when there are many threads competing for the same resource: technically speaking, when the contention is high. Pushing it to the limits, a lock-free algorithm could be far less efficient with CPU resources than a mutex that puts blocked threads to sleep.
另一方面，在无等待算法（无锁算法的一个子集）中，任何线程都可以在有限的数量或步骤中完成其工作，而不管其他线程的执行速度或工作负载水平如何。 本文中基于 fetch-and-add 操作的第一个片段是一个无等待算法的示例：没有循环，没有重试，只是不受干扰的流程。 此外，无等待算法是容错的：没有线程可以通过其他进程的故障或速度的任意变化来阻止完成操作。 这些属性使无等待算法适用于复杂的 实时系统，其中并发代码的可预测行为是必须的。
On the other hand in wait-free algorithms, a subset of lock-free ones, any thread can complete its work in a finite number or steps, regardless of the execution speed or the workload level of others. The first snippet in this article based on the fetch-and-add operation is an example of a wait-free algorithm: no loops, no retries, just undisturbed flow. Also, wait-free algorithms are fault-tolerant: no thread can be prevented from completing an operation by failures of other processes, or by arbitrary variations in their speed. These properties make wait-free algorithms suitable for complex real-time systems where the predictable behavior of concurrent code is a must.
3.Wait-free algorithms are a subset of lock-free ones.
无等待是并发代码非常需要的属性，但很难获得。 总而言之，无论您是构建阻塞、无锁还是无等待算法，黄金法则都是始终对代码进行基准测试并衡量结果。 有时，一个好的旧互斥锁可以胜过更高级的同步原语，尤其是在并发任务复杂度很高的情况下。
Wait-freedom is a highly desired property of concurrent code, yet very difficult to obtain. All in all, whether you are building a blocking, a lock-free or a wait-free algorithm the golden rule is to always benchmark your code and measure the results. Sometimes a good old mutex can outperform fancier synchronization primitives, especially when the concurrent task complexity is high.
原子操作是无锁编程的必要部分，即使在单处理器机器上也是如此。 如果没有原子性，线程可能会在事务中途中断，可能导致状态不一致。 在本文中，我只是触及了皮毛：只要将多核/多处理器添加到等式中，就会出现一个新的问题世界。顺序一致性和内存屏障等主题是难题的关键部分，如果您想充分利用无锁算法，则不能忽视这些主题。我将在下一集中介绍它们。
Atomic operations are a necessary part of lock-free programming, even on single-processor machines. Without atomicity, a thread could be interrupted halfway through the transaction, possibly leading to an inconsistent state. In this article I have just scratched the surface: a new world of problems opens up as soon as you add multicores/multiprocessors to the equation. Topics like sequential consistency and memory barriers are critical pieces of the puzzle and can't be overlooked if you want to get the best out of your lock-free algorithms. I will cover them all in the next episode.
Preshing on Programming - An Introduction to Lock-Free Programming
Preshing on Programming - Atomic vs. Non-Atomic Operations
Preshing on Programming - You Can Do Any Kind of Atomic Read-Modify-Write Operation
StackOverflow - Do I need a mutex for reading?
StackOverflow - Will atomic operations block other threads?
StackOverflow - Spinlock vs Busy wait
GCC Wiki - Atomics
GCC Wiki - Memory model synchronization modes
Threading Building Blocks - Atomic Operations
Just Software Solutions - Definitions of Non-blocking, Lock-free and Wait-free
Wikipedia - Compare-and-swap
Wikipedia - Read-modify-write
Wikipedia - Test-and-set
Tyler Neely - Fear and Loathing in Lock-Free Programming
Jason Gregory - Game Engine Architecture, Third Edition
AA.VV. - Evaluating the Cost of Atomic Operations on Modern Architectures
Herb Sutter, CppCon 2014 - Lock-Free Programming (or, Juggling Razor Blades), Part 1
Herb Sutter, CppCon 2014 - Lock-Free Programming (or, Juggling Razor Blades), Part 2
Maurice Herlihy - Wait-free synchronization
Fedor Pikus, CppCon 2014 - Read, Copy, Update, then what? RCU for non-kernel programmers
1024cores - Lockfree Algorithms
Microsoft - Lockless Programming Considerations for Xbox 360 and Microsoft Windows
Brian Goetz, IBM - Going Atomic