Parent: [12682b] (diff)

Download this file

forkingRecoll.txt    225 lines (182 with data), 10.6 kB

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
= Recoll command execution performance
:Author: Jean-François Dockès
:Email: jfd@recoll.org
:Date: 2015-05-22
== Abstract
== Introduction
The Recoll indexer, *recollindex*, is a big process which executes many
others, mostly for extracting text from documents. Some of the executed
processes are quite short-lived, and the time used by the process execution
machinery can actually dominate the time used to translate data. This
document explores possible approaches to improving performance without
adding excessive complexity or damaging reliability.
Studying fork/exec performance is not exactly a new venture, and there are
many texts which address the subject. While researching, though, I found
out that not so many were accurate and that a lot of questions were left as
an exercise to the reader.
== Issues with fork
The traditional way for a Unix process to start another is the
+fork()+/+exec()+ system call pair.
+fork()+ duplicates the process address space and resources (open files
etc.), then duplicates the thread of execution, ending up with 2 mostly
identical processes.
+exec()+ then replaces part of the newly executing process with an address
space initialized from an executable file, inheriting some of the resources
under various conditions.
This was all fine with the small processes of the first Unix systems, but
as time progressed, processes became bigger and the copy-before-discard
operation was found to waste significant resources. It was optimized using
two methods (at very different points in time):
- The first approach was to supplement +fork()+ with the +vfork()+ call, which
is similar but does not duplicate the address space: the new process
thread executes in the old address space. The old thread is blocked
until the new one calls +exec()+ and frees up access to the memory
space. Any modification performed by the child thread persists when
the old one resumes.
- The more modern approach, which cohexists with +vfork()+, was to replace
the full duplication of the memory space with duplication of the page
descriptors only. The pages in the new process are marked copy-on-write
so that the new process has write access to its memory without
disturbing its parent. This approach was supposed to make +vfork()+
obsolete, but the operation can still be a significant resource consumer
for big processes mapping a lot of memory, so that +vfork()+ is still
around. Programs can have big memory spaces not only because they have
huge data segments (rare), but just because they are linked to many
shared libraries (more common).
NOTE: Orders of magnitude: a *recollindex* process will easily grow into a
few hundred of megabytes of virtual space. It executes the small and
efficient *antiword* command to extract text from *ms-word* files. While
indexing multiple such files, *recollindex* can spend '60% of its CPU time'
doing `fork()`/`exec()` housekeeping instead of useful work (this is on Linux,
where `fork()` uses copy-on-write).
Apart from the performance cost, another issue with +fork()+ is that a big
process can fail executing a small command because of the temporary need to
allocate twice its address space. This is a much discussed subject which we
will leave aside because it generally does not concern *recollindex*, which
in typical conditions uses a small portion of the machine virtual memory,
so that a temporary doubling is not an issue.
The Recoll indexer is multithreaded, which may introduce other issues. Here
is what happens to threads during the +fork()+/+exec()+ interval:
- +fork()+:
* The parent process threads all go on their merry way.
* The child process is created with only one thread active, duplicated
from the one which called +fork()+
- +vfork()+
* The parent process thread calling +vfork()+ is suspended, the others
are unaffected.
* The child is created with only one thread, as for +fork()+.
This thread shares the memory space with the parent ones, without
having any means to synchronize with them (pthread locks are not
supposed to work across processes): caution needed !
NOTE: for a multithreaded program using the classical pipe method to
communicate with children, the sequence between the `pipe()` call and the
parent `close()` of the unused side is a candidate for a critical section:
if several threads can interleave in there, children process may inherit
descriptors which 'belong' to other `fork()`/`exec()` operations, which may
in turn be a problem or not depending on how descriptor cleanup is
performed in the child (if no cleanup is performed, pipes may remain open
at both ends which will prevents seeing EOFs etc.). Thanks to StackExchange
user Celada for explaining this to me.
For multithreaded programs, both +fork()+ and +vfork()+ introduce possibilities
of deadlock, because the resources held by a non-forking thread in the
parent process can't be released in the child because the thread is not
duplicated. This used to happen from time to time in *recollindex* because
of an error logging call performed if the +exec()+ failed after the +fork()+
(e.g. command not found).
With +vfork()+ it is also possible to trigger a deadlock in the parent by
(inadvertently) modifying data in the child. This could happen just
link:http://www.oracle.com/technetwork/server-storage/solaris10/subprocess-136439.html[because
of dynamic linker operation] (which, seriously, should be considered a
system bug).
In general, the state of program data in the child process is a semi-random
snapshot of what it was in the parent, and the official word about what you
can do is that you can only call
link:http://man7.org/linux/man-pages/man7/signal.7.html[async-safe library
functions] between +fork()+ and +exec()+. These are functions which are
safe to call from a signal handler because they are either reentrant or
can't be interrupted by a signal. A notable missing entry in the list is
`malloc()`.
These are normally not issues for programs which only fork to execute
another program (but the devil is in the details as demonstrated by the
logging call issue...).
One of the approaches often proposed for working around this mine-field is
to use an auxiliary small process to execute any command needed by the main
one. The small process can just use +fork()+/+exec()+ with no performance
issues. This has the inconvenient of complicating communication a lot if
data needs to be transferred one way or another.
////
Passing descriptors around
http://stackoverflow.com/questions/909064/portable-way-to-pass-file-descriptor-between-different-processes
http://www.normalesup.org/~george/comp/libancillary/
http://stackoverflow.com/questions/28003921/sending-file-descriptor-by-linux-socket/
The process would then be:
- Tell slave to fork/exec cmd (issue with cmd + args format)
- Get fds
- Tell slave to wait, recover status.
////
== The posix_spawn() Linux non-event
Given the performance issues of `fork()` and tricky behaviour of `vfork()`,
a "simpler" method for starting a child process was introduced by Posix:
`posix_spawn()`.
The `posix_spawn()` function is a black box, externally equivalent to a
`fork()`/`exec()` sequence, and has parameters to specify the usual
house-keeping performed at this time (file descriptors and signals
management etc.). Hiding the internals gives the system a chance to
optimize the performance and avoid `vfork()` pitfalls like the `ld.so`
lockup described in the Oracle article.
The Linux posix_spawn() is implemented by a `fork()`/`exec()` pair by default.
`vfork()` is used either if specified by an input flag or no
signal/scheduler/process_group changes are requested. There must be a
reason why signal handling changes would preclude `vfork()` usage, but I
could not find it (signal handling data is stored in the kernel task_struct).
The Linux glibc `posix_spawn()` currently does nothing that user code could
not do. Still, using it would probably be a good future-proofing idea, but
for a significant problem: there is no way to specify closing all open
descriptors bigger than a specified value (closefrom() equivalent). This is
available on Solaris and quite necessary in fact, because we have no way to
be sure that all open descriptors have the CLOEXEC flag set.
So, no `posix_spawn()` for us (support was implemented inside
*recollindex*, but the code is normally not used).
== The chosen solution
The previous version of +recollindex+ used to use +vfork()+ if it was running
a single thread, and +fork()+ if it ran multiple ones.
After another careful look at the code, I could see few issues with
using +vfork()+ in the multithreaded indexer, so this was committed.
The only change necessary was to get rid of an implementation of the
lacking Linux +closefrom()+ call (used to close all open descriptors above a
given value). The previous Recoll implementation listed the +/proc/self/fd+
directory to look for open descriptors but this was unsafe because of of
possible memory allocations in +opendir()+ etc.
== Test results
.Indexing 12500 small .doc files
[options="header"]
|===============================
|call |real |user |sys
|fork |0m46.025s |0m26.574s |0m39.494s
|vfork |0m18.223s |0m17.753s |0m1.736s
|spawn/fork| 0m45.726s|0m27.082s| 0m40.575s
|spawn/vfork|0m18.915s|0m18.681s|0m3.828s
|recoll 1.18|1m47.589s|0m21.537s|0m29.458s
|================================
No surprise here, given the implementation of +posix_spawn()+, it gets the
same times as the +fork()+/+vfork()+ options.
The tests were performed on an Intel Core i5 750 (4 cores, 4 threads).
It would be painful to play it safe and discard the 60% reduction in
execution time offered by using +vfork()+, so this was adopted for Recoll
1.21. To this day, no problems were discovered, but, still crossing
fingers...
The last line in the table is just for the fun: *recollindex* 1.18
(single-threaded) needed almost 6 times as long to process the same
files...
////
Objections to vfork:
sigaction locks
https://bugzilla.redhat.com/show_bug.cgi?id=193631
Is Linux vfork thread-safe ? Quoting interesting comments from Solaris
implementation: No answer to the issues cited though.
https://sourceware.org/bugzilla/show_bug.cgi?id=378
Aussi:
http://blog.famzah.net/2009/11/20/fork-gets-slower-as-parent-process-use-more-memory/
http://blog.famzah.net/2009/11/20/a-much-faster-popen-and-system-implementation-for-linux/
Avec un workaround basĂŠ sur clone (donc linux-only). Tried it but crashes.
////