Child: [c9c315] (diff)

Download this file

threadingRecoll.txt    409 lines (314 with data), 18.4 kB

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
= Converting Recoll indexing to multithreading
:Author: Jean-François Dockès
:Email: jfd@recoll.org
:Date: 2012-12-03
== Abstract
This relates lessons learned while modifying *Recoll* indexing to be
multithreaded. I am by no means a threaded applications expert, so that a
few of the observations I made whole doing this may be of use to other
novices.
== Introduction
http://www.recoll.org[*Recoll*] is a document indexing application, it
allows you to find documents by specifying search terms.
The documents need to be _indexed_ for searches to be fast. In a nutshell,
we convert the different document formats to text, then split the text into
terms and remember where those occur. This is a time-consuming operation.
Up to version 1.18 *Recoll* indexing is single-threaded: routines which
call each other sequentially.
In most personal indexer contexts, it is also CPU-bound. There is a lot of
conversion work necessary for turning those PDF (or other) files into
appropriately cleaned up pure text, then split it into terms and update the
index. Given the relatively modest amount of data, and the speed of
storage, I/O issues are secondary.
Looking at the _CPU idle_ *top* output stuck at 75% on my quad-core CPU,
while waiting for the indexing to finish, was frustrating, and I was
tempted to find a way to keep those other cores at temperature and shorten
the waiting.
For some usages, the best way to accomplish this may be to just partition
the index and independantly start indexing on different configurations,
using multiple processes to better utilize the available processing power.
This is not an universal solution though, as it is complicated to set up,
not optimal in general for indexing performance, and not always optimal
either at query time.
The most natural way to improve indexing times is to increase CPU
utilization by using multiple threads inside an indexing process.
Something similar had been done with earlier versions of the *Recoll* GUI,
which had an internal indexing thread. This had been a frequent source of
trouble though, and linking the GUI and indexing process lifetimes was a
bad idea, so, in recent versions, the indexing is always performed by an
external process. Still, this experience had put in light most of the
problem areas, and prepared the code for further work.
It should be noted that, as `recollindex` is both _nice_'d and _ionice_'d
as a lowest priority process, it will only use free computing power on the
machine, and will step down as soon as anything else wants to work.
****
The only case where you may notice that the indexing is at work
is when the machine is short on memory and things (such as
your Web browser) get swapped-out while you are not actively using
them. You then notice a long delay when you want to start, because they
need to be swapped back in. There is little which can be done about
this. Setting _idxflushmb_ to a low value may help in some cases (depending
on the document sizes). May I also suggest in this case that, if your
machine can take more memory, it may be a good idea to procure some, as
memory is nowadays quite cheap, and memory-starved machines are not fun.
****
In general, augmenting the machine utilisation by `recollindex` just does
not change its responsiveness. My PC has a an Intel Pentium Core i5 750 (4
cores, no hyperthreading), which is far from being a high performance CPU
(nowadays...), and I often forget that I am running indexing tests, it is
just not noticeable. The machine does have a lot of memory though (12GB).
== The Recoll indexing processing flow
image::nothreads.png["Basic flow", float="right"]
There are 4 main steps in the `recollindex` processing pipeline:
. Find the file
. Convert it to text
. Process the text (split, strip etc.) and create a *Xapian* document
. Update the index
The first step, walking the file system (or some other data source), is
usually much faster than the others, and we just leave it alone to be
performed by the main thread. It outputs file names (and the associated
*POSIX* _stat_ data).
The last step, *Xapian* index updating, can only be single-threaded.
The first idea is to change the indexing pipeline so that each step is
performed by an independant worker thread, passing its output to the next
thread, in assembly-line fashion.
In order to achieve this, we need to decouple the different phases. They
are normally linked by procedure calls, which we replace with a job
control object: the 'WorkQueue'.
=== The WorkQueue
The _WorkQueue_ object is implemented by a reasonably simple class, which
manages an input queue on which client append jobs, and a set of worker
threads, which retrieve and perform the jobs, and whose lifetime are
managed by the _WorkQueue_ object. The
https://bitbucket.org/medoc/recoll/src/f06f3aba912045d6ad52e9a0fd930b95e363fd10/src/utils/workqueue.h?at=default[implementation] is straightforward with
*POSIX* threads synchronization functions and C++ *STL* data structures.
In practise it proved quite simple to modify existing code to create a job
object and put it on the queue, instead of calling the downstream routine
with the job parameters, _while keeping the capacity to call the downstream
routine directly_. The kind of coupling is determined either by compilation
flags (for global disabling/enabling of multithreading), or according to
configuration data, which allows experimenting with different threads
arrangements just by changing parameters in a file, without recompiling.
Each _WorkQueue_ accepts two parameters: the length of the input queue
(before a client will block when trying to add a job), and the number of
worker threads. Both parameters can be set in the *Recoll* configuration
file for each of the three queues used in the indexing pipeline. Setting
the queue length to -1 will disable the corresponding queue (using a direct
call instead).
unfloat::[]
== The Assembly Line
image::assembly.png["Assembly line", float="right"]
So the first idea is to create 3 explicit threads to manage the file
conversion, the term generation, and the *Xapian* index update. The first
thread prepares a file, passes it on to the term generation thread, and
immediately goes back to work on the next file, etc.
The presumed advantage of this method is that the different stages, which
perform disjointed processing, should share little, so that we can hope to
minimize the changes necessitated by the threads interactions.
However some changes to the code were needed to make this work (and a few
bugs were missed, which only became apparent at later stages, confirming
that the _low interaction_ idea was not completely false).
=== Converting to multithreading: what to look for
I am probably stating the obvious here, but when preparing a program for
multi-threading, problems can only arise where non-constant data is
accessed by different threads.
Once you have solved the core problems posed by the obvious data that needs
to be shared, you will be left to deal with less obvious, hidden,
interactions inside the program.
Classically this would concern global or static data, but in a C++ program,
class members will be a concern if a single object can be accessed by
several threads.
Hunting for static data inside a program of non trivial size is not always
obvious. Two approaches can be used: hunting for the _static_ keyword in
source code, or looking at global and static data symbols in *nm* output.
Once found, there are mostly three types of static/global data:
* Things that need to be eliminated: for example, routines can be made
reentrant by letting the caller supply a storage buffer instead of using
an internal static one (which was a bad idea in the first place
anyway).
* Things that need to be protected: sometimes, the best approach is just
to protect the access with a mutex lock. It is trivial to encapsulate
the locks in C++ objects to use the "Resource Acquisition is
Initialization" idiom, easily making sure that locks are freed when
exiting the critical section. A very basic
https://bitbucket.org/medoc/recoll/src/f06f3aba9120/src/utils/ptmutex.h?at=default[example of implementation]
can be found in the *Recoll* source code.
* Things which can stay: this is mostly initialization data such as value
tables which are computed once, and then stay logically constant during
program execution. In order to be sure of a correct single-threaded
initialization, it is best to explicitly initialize the modules or
functions that use this kind of data in the main thread when the program
starts.
=== Assembly line approach: the results
Unfortunately, the assembly line approach yields very modest improvements
when used inside *Recoll* indexing. The reason, is that this method needs
stages of equivalent complexity to be efficient. If one of the stages
dominates the others, its thread will be the only one active at any time,
and little will be gained.
What is especially problematic is that the balance between tasks need not
only exist on average, but also for the majority of individual jobs.
For *Recoll* indexing, even if the data preparation and index update steps
are often of the same order of magnitude _on average_, their balance
depends a lot on the kind of data being processed, so that things are
usually unbalanced at any given time: the index update thread is mostly
idle while processing PDF files, and the data preparation has little to do
when working on HTML or plain text.
In practice, very modest indexing time improvements from 5% to 15% were
achieved with this method.
[[recoll.idxthreads.multistage]]
== The next step: multi-stage parallelism
image::multipara.png["Multi-stage parallelism", float="right"]
Given the limitations of the assembly line approach, the next step in the
transformation of *Recoll* indexing was to enable full parallelism wherever
possible.
Of the four processing steps (see figures), two are not candidates for
parallelization:
* File system walking is so fast compared to the other steps that using
several threads would make no sense (it would also quite probably become
IO bound if we tried anyway).
* The *Xapian* library index updating code is not designed for
multi-threading and must stay protected from multiple accesses.
The two other steps are good candidates.
Most of the work to make *Recoll* code reentrant had been performed for the
previous transformation. Going full-parallel only implied protecting the
data structures that needed to be shared by the threads performing a given
processing step.
Just for the anecdotic value, a list of the elements that needed mutexes:
- Filter subprocesses cache: some file conversion subprocesses may be
expensive (starting a Python process is no piece of cake), so they are
cached for reuse after they are done translating a file. The shared cache
needs protection.
- Status updates: an object used to update the current file name and indexing
status to a shared file.
- Missing store: the list of missing helper programs
- The readonly *Xapian* database object: a Xapian::Database object which is
used for checking the validity of current index data against a file's
last modification date.
- Document existence map: a bit array used to store an existence bit about
every document, and purge the disappeared at the end of the indexing
pass. This is accessed both from the file conversion and database update
code, so it also needed protection in the previous assembly line
approach.
- Mbox offsets cache. Used to store the offsets of individual messages
inside *mbox* files.
- *iconv* control blocks: these are cached for reuse in several places, and
need protection. Actually, it might be better in multithreading context
to just suppress the reuse and locking. Rough tests seem to indicate that
the impact on overall performance is small, but this might change with
higher parallelism (or not...).
The *Recoll* configuration also used to be managed by a single shared
object, which is mutable as values may depend on what area of the
file-system we are exploring, so that the object is stateful and updated as
we change directories. The choice made here was to duplicate the object
where needed (each indexing thread gets its own). This gave rise to the
sneakiest bug in the whole transformation (see further down).
Having a dynamic way to define the threads configuration makes it easy to
experiment. For example, the following data defines the configuration that
was finally found to be best overall on my hardware:
thrQSizes = 2 2 2
thrTCounts = 4 2 1
This is using 3 queues of depth 2, 4 threads working on file conversion, 2
on text splitting and other document processing, and 1 on Xapian updating
(no choice here).
unfloat::[]
== Bench results
So the big question after all the work: was it worth it ? I could only get
a real answer when the program stopped crashing, so this took some time and
a little faith, but the answer is positive, as far as I'm
concerned. Performance has improved significantly and this was a fun
project.
.Results on a variety of file system areas:
[options="header", width="70%"]
|=======================
|Area |Seconds before |Seconds after| Percent Improvement| Speed Factor
|home |12742 | 6942 | 46%| 1.8
|mail |2700 | 1563 | 58% | 1.7
|projets | 5022 | 1970 | 61% | 2.5
|pdf | 2164 | 770 | 64% | 2.8
|otherhtml | 5593 | 4014| 28% | 1.4
|=======================
.Characteristics of the data
[options="header", width="70%"]
|=======================
|Area | Files MB | Files | DB MB | Documents
|home | 64106 | 44897 | 1197 | 104797
|mail | 813 | 232 | 663 | 47267
|projets | 2056 | 34504 | 549 | 40281
|pdf | 1123 | 1139 | 111 | 1139
|otherhtml | 3442 | 223007 | 2080 | 221890 |
|=======================
_home_ is my home directory. The high megabyte value is due to a number of
very big and not indexed *VirtualBox* images. Otherwise, it's a wide
mix of source files, email, miscellaneous documents and ebooks.
_mail_ is my mail directory, full of *mbox* files.
_projets_ mostly holds source files, and a number of documents.
_pdf_ holds random *pdf* files harvested on the internets. The performance
is quite spectacular, because most of the processing time goes to
converting them to text, and this is done in parallel. Probably could be
made a bit faster with more cores, until we hit the *Xapian* update speed
limit.
_otherhtml_ holds myriad of small html files, mostly from
*wikipedia*. The improvement is not great here because a lot of time is
spent in the single-threaded *Xapian* index update.
The tests were made with queue depths of 2 on all queues, and 4 threads
working on the file conversion step, 2 on the term generation.
== A variation: linear parallelism
Once past the assembly-line idea, another possible transformation would be
to get rid of the two downstream queues, and just create a job for each
file and let it go to the end (using a mutex to protect accesses to the
writable *Xapian* database).
With the current *Recoll* code, this can be defined by the following
parameters (one can also use a deeper front queue, this changes little):
thrQSizes = 2 -1 -1
thrTCounts = 4 0 0
In practise, the performance is close to the one for the multistage
version.
If we were to hard-code this approach, this would be a simpler
modification, necessitating less changes to the code, but it has a slight
inconvenient: when working on a single big multi-document file, no
parallelism at all can be obtained. In this situation, the multi-stage
approach brings us back to the assembly-line behaviour, so the improvements
are not great, but they do exist.
== Miscellany
=== The big gotcha: my stack dump staring days
Overall, debugging the modified program was reasonably
straightforward. Data access synchronization issues mostly provoke dynamic
data corruption, which can be beastly to debug. I was lucky enough that
most crashes occurred in the code that was actually related to the
corrupted data, not in some randomly located and unrelated dynamic memory
user, so that the issues were reasonably easy to find.
One issue though kept me working for a few days. The indexing process kept
crashing randomly at an interval of a few thousands documents, segfaulting
on a bad pointer. An access to the configuration data structure seemed to
be involved, but, as each thread was supposed to have its own copy, I was
out of ideas.
After reviewing all the uses for the configuration data (there are quite a
few), the problem was finally revealed to lie with the filter process
cache. Each filter structure stored in the cache stores a pointer to a
configuration structure. This belonged to the thread which initially
created the filter. But the filter would often be reused by a different
thread, with the consequence that the configuration object was now accessed
and modified by two unsynchronized threads... Resetting the config pointer
at the time of filter reuse was the
https://bitbucket.org/medoc/recoll/commits/943de4b78818079b0eb6ffd0fcbdfdd0746b4a40[ridiculously
simple (almost)single-line fix] to this evasive problem.
Looking at multi-threaded stack dumps is mostly fun for people with several
heads, which is unfortunately not my case, so I was quite elated when this
was over.
=== Fork performance issues
On a quite unrelated note, something that I discovered while evaluating the
program performance is that forking a big process like `recollindex` can be
quite expensive. Even if the memory space of the forked process is not
copied (it's Copy On Write, and we write very little before the following
exec), just duplicating the memory maps can be slow when the process uses a
few hundred megabytes.
I modified the single-threaded version of `recollindex` to use *vfork*
instead of *fork*, but this can't be used with multiple threads (no
modification of the process memory space is allowed in the child between
*vfork* and *exec*, so we'd have to have a way to suspend all the threads
first).
I did not implement a solution to this issue, and I don't think
that a simple one exists. The workaround is to use modest *Xapian* flush
values to prevent the process from becoming too big.
A longer time solution would be to implement a small slave process to do
the executing of ephemeral external commands.