<?xml version="1.0" encoding="ANSI_X3.4-1968" standalone="no"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=ANSI_X3.4-1968" /><title>Avoiding Locks: Read Copy Update</title><meta name="generator" content="DocBook XSL Stylesheets V1.75.2" /><link rel="home" href="index.html" title="Unreliable Guide To Locking" /><link rel="up" href="ch09.html" title="Chapter 9. Locking Speed" /><link rel="prev" href="ch09.html" title="Chapter 9. Locking Speed" /><link rel="next" href="ch09s03.html" title="Per-CPU Data" /></head><body><div class="navheader"><table width="100%" summary="Navigation header"><tr><th colspan="3" align="center">Avoiding Locks: Read Copy Update</th></tr><tr><td width="20%" align="left"><a accesskey="p" href="ch09.html">Prev</a> </td><th width="60%" align="center">Chapter 9. Locking Speed</th><td width="20%" align="right"> <a accesskey="n" href="ch09s03.html">Next</a></td></tr></table><hr /></div><div class="sect1" title="Avoiding Locks: Read Copy Update"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a id="efficiency-read-copy-update"></a>Avoiding Locks: Read Copy Update</h2></div></div></div><p> There is a special method of read/write locking called Read Copy Update. Using RCU, the readers can avoid taking a lock altogether: as we expect our cache to be read more often than updated (otherwise the cache is a waste of time), it is a candidate for this optimization. </p><p> How do we get rid of read locks? Getting rid of read locks means that writers may be changing the list underneath the readers. That is actually quite simple: we can read a linked list while an element is being added if the writer adds the element very carefully. For example, adding <span class="symbol">new</span> to a single linked list called <span class="symbol">list</span>: </p><pre class="programlisting"> new->next = list->next; wmb(); list->next = new; </pre><p> The <code class="function">wmb()</code> is a write memory barrier. It ensures that the first operation (setting the new element's <span class="symbol">next</span> pointer) is complete and will be seen by all CPUs, before the second operation is (putting the new element into the list). This is important, since modern compilers and modern CPUs can both reorder instructions unless told otherwise: we want a reader to either not see the new element at all, or see the new element with the <span class="symbol">next</span> pointer correctly pointing at the rest of the list. </p><p> Fortunately, there is a function to do this for standard <span class="structname">struct list_head</span> lists: <code class="function">list_add_rcu()</code> (<code class="filename">include/linux/list.h</code>). </p><p> Removing an element from the list is even simpler: we replace the pointer to the old element with a pointer to its successor, and readers will either see it, or skip over it. </p><pre class="programlisting"> list->next = old->next; </pre><p> There is <code class="function">list_del_rcu()</code> (<code class="filename">include/linux/list.h</code>) which does this (the normal version poisons the old object, which we don't want). </p><p> The reader must also be careful: some CPUs can look through the <span class="symbol">next</span> pointer to start reading the contents of the next element early, but don't realize that the pre-fetched contents is wrong when the <span class="symbol">next</span> pointer changes underneath them. Once again, there is a <code class="function">list_for_each_entry_rcu()</code> (<code class="filename">include/linux/list.h</code>) to help you. Of course, writers can just use <code class="function">list_for_each_entry()</code>, since there cannot be two simultaneous writers. </p><p> Our final dilemma is this: when can we actually destroy the removed element? Remember, a reader might be stepping through this element in the list right now: if we free this element and the <span class="symbol">next</span> pointer changes, the reader will jump off into garbage and crash. We need to wait until we know that all the readers who were traversing the list when we deleted the element are finished. We use <code class="function">call_rcu()</code> to register a callback which will actually destroy the object once the readers are finished. </p><p> But how does Read Copy Update know when the readers are finished? The method is this: firstly, the readers always traverse the list inside <code class="function">rcu_read_lock()</code>/<code class="function">rcu_read_unlock()</code> pairs: these simply disable preemption so the reader won't go to sleep while reading the list. </p><p> RCU then waits until every other CPU has slept at least once: since readers cannot sleep, we know that any readers which were traversing the list during the deletion are finished, and the callback is triggered. The real Read Copy Update code is a little more optimized than this, but this is the fundamental idea. </p><pre class="programlisting"> --- cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100 +++ cache.c.rcupdate 2003-12-11 17:55:14.000000000 +1100 @@ -1,15 +1,18 @@ #include <linux/list.h> #include <linux/slab.h> #include <linux/string.h> +#include <linux/rcupdate.h> #include <linux/mutex.h> #include <asm/errno.h> struct object { - /* These two protected by cache_lock. */ + /* This is protected by RCU */ struct list_head list; int popularity; + struct rcu_head rcu; + atomic_t refcnt; /* Doesn't change once created. */ @@ -40,7 +43,7 @@ { struct object *i; - list_for_each_entry(i, &cache, list) { + list_for_each_entry_rcu(i, &cache, list) { if (i->id == id) { i->popularity++; return i; @@ -49,19 +52,25 @@ return NULL; } +/* Final discard done once we know no readers are looking. */ +static void cache_delete_rcu(void *arg) +{ + object_put(arg); +} + /* Must be holding cache_lock */ static void __cache_delete(struct object *obj) { BUG_ON(!obj); - list_del(&obj->list); - object_put(obj); + list_del_rcu(&obj->list); cache_num--; + call_rcu(&obj->rcu, cache_delete_rcu, obj); } /* Must be holding cache_lock */ static void __cache_add(struct object *obj) { - list_add(&obj->list, &cache); + list_add_rcu(&obj->list, &cache); if (++cache_num > MAX_CACHE_SIZE) { struct object *i, *outcast = NULL; list_for_each_entry(i, &cache, list) { @@ -85,6 +94,7 @@ obj->popularity = 0; atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ spin_lock_init(&obj->lock); + INIT_RCU_HEAD(&obj->rcu); spin_lock_irqsave(&cache_lock, flags); __cache_add(obj); @@ -104,12 +114,11 @@ struct object *cache_find(int id) { struct object *obj; - unsigned long flags; - spin_lock_irqsave(&cache_lock, flags); + rcu_read_lock(); obj = __cache_find(id); if (obj) object_get(obj); - spin_unlock_irqrestore(&cache_lock, flags); + rcu_read_unlock(); return obj; } </pre><p> Note that the reader will alter the <em class="structfield"><code>popularity</code></em> member in <code class="function">__cache_find()</code>, and now it doesn't hold a lock. One solution would be to make it an <span class="type">atomic_t</span>, but for this usage, we don't really care about races: an approximate result is good enough, so I didn't change it. </p><p> The result is that <code class="function">cache_find()</code> requires no synchronization with any other functions, so is almost as fast on SMP as it would be on UP. </p><p> There is a furthur optimization possible here: remember our original cache code, where there were no reference counts and the caller simply held the lock whenever using the object? This is still possible: if you hold the lock, noone can delete the object, so you don't need to get and put the reference count. </p><p> Now, because the 'read lock' in RCU is simply disabling preemption, a caller which always has preemption disabled between calling <code class="function">cache_find()</code> and <code class="function">object_put()</code> does not need to actually get and put the reference count: we could expose <code class="function">__cache_find()</code> by making it non-static, and such callers could simply call that. </p><p> The benefit here is that the reference count is not written to: the object is not altered in any way, which is much faster on SMP machines due to caching. </p></div><div class="navfooter"><hr /><table width="100%" summary="Navigation footer"><tr><td width="40%" align="left"><a accesskey="p" href="ch09.html">Prev</a> </td><td width="20%" align="center"><a accesskey="u" href="ch09.html">Up</a></td><td width="40%" align="right"> <a accesskey="n" href="ch09s03.html">Next</a></td></tr><tr><td width="40%" align="left" valign="top">Chapter 9. Locking Speed </td><td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td><td width="40%" align="right" valign="top"> Per-CPU Data</td></tr></table></div></body></html>