Download this file

xapDocCopyCrash.txt    139 lines (110 with data), 6.7 kB

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
= The case of the bad Xapian::Document copy
== How things were supposed to work
Coming from the link:threadingRecoll.html[threading *Recoll*] page,
you may remember that the third stage of the
processing pipeline breaks up text into terms, producing a *Xapian*
document (+Xapian::Document+) which is finally processed by the last stage,
the index updater.
What happens in practise is that the main routine in this stage has a local
+Xapian::Document+ object, automatically allocated on the stack, which it
updates appropriately and then copies into a task object which is placed on
the input queue for the last stage.
The text-splitting routine then returns, and its local +Xapian::Document+
object is (implicitely) deleted while the stack unwinds.
The idea is that the *copy* of the document which is on the queue should be
unaffected, it is independant of the original and will further be processed
by the index update thread, without interaction with the text-splitting one.
At no point do multiple threads access the +Xapian::Document+ data, so
there should be no problem.
== The problem
Most *Xapian* objects are reference-counted, which means that the object
itself is a small block of house-keeping variables. The actual data is
allocated on the heap through eventual calls to new/malloc, and is shared
by multiple copies of the object. This is the case for +Xapian::Document+
This is aboundantly documented, and users are encouraged to use copies
instead of passing pointers around (copies are cheap because only a small
block of auxiliary data is actually duplicated). This in general makes
memory management easier.
This is well-known, and it would not appear to be a problem in the above
case as the +Xapian::Document+ actual data is never accessed by multiple
threads.
The problem is that the reference counter which keeps track of the object
usage and triggers actual deletion when it goes to zero is accessed by two
threads:
- It is decremented while the first local object is destroyed during the
stack unwind in the first thread
- It is also updated by the last stage thread, incremented if copies are
made, then decremented until it finally goes down to 0 when we are done
with the object, at which point the document data is unallocated.
As the counter is not protected in any way against concurrent access, the
actual sequence of events is undefined and at least two kinds of problems
may occur: double deletion of the data, or accesses to already freed heap
data (potentially thrashing other threads allocations, or reading modified
data).
A relatively simple fix for this would be to use atomic test-and-set
operations for the counter (which is what the GNU +std::string+ does). But
the choice made by *Xapian* to let the application deal with all
synchronization issues is legitimate and documented, nothing to complain
about here. I just goofed.
Because the counter test and update operations are very fast, and occur
among a lot of processing from the final stage thread, the chances of
concurrent access are low, which is why the problem manifests itself very
rarely. Depending on thread scheduling and all manners of semi-random
conditions, it is basically impossible to reproduce reliably.
== The fix
The implemented fix was trivial: the upstream thread allocates the initial
+Xapian::Document+ on the heap, copies the pointer to the queue object, and
forgets about it. The index-updating thread peruses the object then
+delete+'s it. Real easy.
An alternative solution would have been to try and use locking to protect
the counter updates. The only place where such locking operations could
reasonably occur is inside the +Xapian::Document+ refcounted pointer
object, which we can't modify. Otherwise, we would have to protect the
_whole scopes of existence_ of the Xapian::Document object in any routine
which creates/copies or (implicitely) deletes it, which would cause many
problems and/or contention issues
== Why did I miss this ?
The mechanism of the crashes is simple enough, quasi-obvious.
How on earth could I miss this problem while writing the code ?
For the sake of anecdote, my first brush with atomicity for updates of
reference counters was while debugging a System V release 4 kernel VFS file
system module, at the time when SVR4 got a preemptive kernel with SVR4-MP,
circa 1990... I ended up replacing a +counter+++ with +atomic_add()+ after
a set of _interesting_ debugging sessions interspersed with kernel crashes
and +fsck+ waits. This should have left some memories. So what went wrong ?
Here follow a list of possible reasons:
- Reasoning by analogy: std::string are safe to use in this way. The other
objects used in the indexing pipe are also safe. I just used
+Xapian::Document+ in the same way without thinking further.
- Probably not how I would do it: faced with designing +Xapian::Document+,
(not clever enough to do this anyway), I'd probably conclude that not
wanting to deal with full-on concurrency is one thing, not protecting the
reference counters is another, and going too far.
- The problem was not so easily visible because the object deletion is
implicitely performed during the stack unwind: this provides no clue, no
specific operation to think about.
- Pure lazyness.
As a conclusion, a humble request to library designers: when an
interface works counter to the reasonable expectations of at least some of
the users (for example because it looks like, but works differently, than a
standard library interface), it is worth it to be very specific in the
documentation and header file comments about the gotcha's. Saving people
from their own deficiencies is a worthy goal.
Here, a simple statement that the reference count was not mt-safe
(admittedly redundant with the general statement that the *Xapian* library
does not deal with threads), would have got me thinking and avoided the
error.
++++
<h2 id="comments">Comments</h2>
<div id="disqus_thread"></div>
<script type="text/javascript">
var disqus_shortname = 'lesbonscomptes';
(function() {
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the <a href="http://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
<a href="http://disqus.com" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
++++