Error Buddy
Do you have an error message from your application? Then find the answer with Error Buddy. You can search over 40000 source code files and troubleshooting documents using our beta lucene/nutch search interface or if you prefer, search as normal using google. With LXR technology you can drill right down into the line of source code where it came from with full cross-referencing.
If after searching you didn't get your ideal answer, or you are still unclear what the error means, you can choose to post that question to the community forums following the link included in the search results.
[1.6]001 Intro 002 ----- 003 This describes an adaptive, stable, natural mergesort, modestly called 004 timsort (hey, I earned it <wink>). It has supernatural performance on many 005 kinds of partially ordered arrays (less than lg(N!) comparisons needed, and 006 as few as N-1), yet as fast as Python's previous highly tuned samplesort 007 hybrid on random arrays. 008 009 In a nutshell, the main routine marches over the array once, left to right, 010 alternately identifying the next run, then merging it into the previous 011 runs "intelligently". Everything else is complication for speed, and some 012 hard-won measure of memory efficiency. 013 014 015 Comparison with Python's Samplesort Hybrid 016 ------------------------------------------ 017 + timsort can require a temp array containing as many as N//2 pointers, 018 which means as many as 2*N extra bytes on 32-bit boxes. It can be 019 expected to require a temp array this large when sorting random data; on 020 data with significant structure, it may get away without using any extra 021 heap memory. This appears to be the strongest argument against it, but 022 compared to the size of an object, 2 temp bytes worst-case (also expected- 023 case for random data) doesn't scare me much. 024 025 It turns out that Perl is moving to a stable mergesort, and the code for 026 that appears always to require a temp array with room for at least N 027 pointers. (Note that I wouldn't want to do that even if space weren't an 028 issue; I believe its efforts at memory frugality also save timsort 029 significant pointer-copying costs, and allow it to have a smaller working 030 set.) 031 032 + Across about four hours of generating random arrays, and sorting them 033 under both methods, samplesort required about 1.5% more comparisons 034 (the program is at the end of this file). 035 036 + In real life, this may be faster or slower on random arrays than 037 samplesort was, depending on platform quirks. Since it does fewer 038 comparisons on average, it can be expected to do better the more 039 expensive a comparison function is. OTOH, it does more data movement 040 (pointer copying) than samplesort, and that may negate its small 041 comparison advantage (depending on platform quirks) unless comparison 042 is very expensive. 043 044 + On arrays with many kinds of pre-existing order, this blows samplesort out 045 of the water. It's significantly faster than samplesort even on some 046 cases samplesort was special-casing the snot out of. I believe that lists 047 very often do have exploitable partial order in real life, and this is the 048 strongest argument in favor of timsort (indeed, samplesort's special cases 049 for extreme partial order are appreciated by real users, and timsort goes 050 much deeper than those, in particular naturally covering every case where 051 someone has suggested "and it would be cool if list.sort() had a special 052 case for this too ... and for that ..."). 053 054 + Here are exact comparison counts across all the tests in sortperf.py, 055 when run with arguments "15 20 1". 056 057 First the trivial cases, trivial for samplesort because it special-cased 058 them, and trivial for timsort because it naturally works on runs. Within 059 an "n" block, the first line gives the # of compares done by samplesort, 060 the second line by timsort, and the third line is the percentage by 061 which the samplesort count exceeds the timsort count: 062 063 n \sort /sort =sort 064 ------- ------ ------ ------ 065 32768 32768 32767 32767 samplesort 066 32767 32767 32767 timsort 067 0.00% 0.00% 0.00% (samplesort - timsort) / timsort 068 069 65536 65536 65535 65535 070 65535 65535 65535 071 0.00% 0.00% 0.00% 072 073 131072 131072 131071 131071 074 131071 131071 131071 075 0.00% 0.00% 0.00% 076 077 262144 262144 262143 262143 078 262143 262143 262143 079 0.00% 0.00% 0.00% 080 081 524288 524288 524287 524287 082 524287 524287 524287 083 0.00% 0.00% 0.00% 084 085 1048576 1048576 1048575 1048575 086 1048575 1048575 1048575 087 0.00% 0.00% 0.00% 088 089 The algorithms are effectively identical in these cases, except that 090 timsort does one less compare in \sort. 091 092 Now for the more interesting cases. lg(n!) is the information-theoretic 093 limit for the best any comparison-based sorting algorithm can do on 094 average (across all permutations). When a method gets significantly 095 below that, it's either astronomically lucky, or is finding exploitable 096 structure in the data. 097 098 n lg(n!) *sort 3sort +sort %sort ~sort !sort 099 ------- ------- ------ ------- ------- ------ ------- -------- 100 32768 444255 453096 453614 32908 452871 130491 469141 old 101 448885 33016 33007 50426 182083 65534 new 102 0.94% 1273.92% -0.30% 798.09% -28.33% 615.87% %ch from new 103 104 65536 954037 972699 981940 65686 973104 260029 1004607 105 962991 65821 65808 101667 364341 131070 106 1.01% 1391.83% -0.19% 857.15% -28.63% 666.47% 107 108 131072 2039137 2101881 2091491 131232 2092894 554790 2161379 109 2057533 131410 131361 206193 728871 262142 110 2.16% 1491.58% -0.10% 915.02% -23.88% 724.51% 111 112 262144 4340409 4464460 4403233 262314 4445884 1107842 4584560 113 4377402 262437 262459 416347 1457945 524286 114 1.99% 1577.82% -0.06% 967.83% -24.01% 774.44% 115 116 524288 9205096 9453356 9408463 524468 9441930 2218577 9692015 117 9278734 524580 524633 837947 2916107 1048574 118 1.88% 1693.52% -0.03% 1026.79% -23.92% 824.30% 119 120 1048576 19458756 19950272 19838588 1048766 19912134 4430649 20434212 121 19606028 1048958 1048941 1694896 5832445 2097150 122 1.76% 1791.27% -0.02% 1074.83% -24.03% 874.38% 123 124 Discussion of cases: 125 126 *sort: There's no structure in random data to exploit, so the theoretical 127 limit is lg(n!). Both methods get close to that, and timsort is hugging 128 it (indeed, in a *marginal* sense, it's a spectacular improvement -- 129 there's only about 1% left before hitting the wall, and timsort knows 130 darned well it's doing compares that won't pay on random data -- but so 131 does the samplesort hybrid). For contrast, Hoare's original random-pivot 132 quicksort does about 39% more compares than the limit, and the median-of-3 133 variant about 19% more. 134 135 3sort, %sort, and !sort: No contest; there's structure in this data, but 136 not of the specific kinds samplesort special-cases. Note that structure 137 in !sort wasn't put there on purpose -- it was crafted as a worst case for 138 a previous quicksort implementation. That timsort nails it came as a 139 surprise to me (although it's obvious in retrospect). 140 141 +sort: samplesort special-cases this data, and does a few less compares 142 than timsort. However, timsort runs this case significantly faster on all 143 boxes we have timings for, because timsort is in the business of merging 144 runs efficiently, while samplesort does much more data movement in this 145 (for it) special case. 146 147 ~sort: samplesort's special cases for large masses of equal elements are 148 extremely effective on ~sort's specific data pattern, and timsort just 149 isn't going to get close to that, despite that it's clearly getting a 150 great deal of benefit out of the duplicates (the # of compares is much less 151 than lg(n!)). ~sort has a perfectly uniform distribution of just 4 152 distinct values, and as the distribution gets more skewed, samplesort's 153 equal-element gimmicks become less effective, while timsort's adaptive 154 strategies find more to exploit; in a database supplied by Kevin Altis, a 155 sort on its highly skewed "on which stock exchange does this company's 156 stock trade?" field ran over twice as fast under timsort. 157 158 However, despite that timsort does many more comparisons on ~sort, and 159 that on several platforms ~sort runs highly significantly slower under 160 timsort, on other platforms ~sort runs highly significantly faster under 161 timsort. No other kind of data has shown this wild x-platform behavior, 162 and we don't have an explanation for it. The only thing I can think of 163 that could transform what "should be" highly significant slowdowns into 164 highly significant speedups on some boxes are catastrophic cache effects 165 in samplesort. 166 167 But timsort "should be" slower than samplesort on ~sort, so it's hard 168 to count that it isn't on some boxes as a strike against it <wink>. 169 170 + Here's the highwater mark for the number of heap-based temp slots (4 171 bytes each on this box) needed by each test, again with arguments 172 "15 20 1": 173 174 2**i *sort \sort /sort 3sort +sort %sort ~sort =sort !sort 175 32768 16384 0 0 6256 0 10821 12288 0 16383 176 65536 32766 0 0 21652 0 31276 24576 0 32767 177 131072 65534 0 0 17258 0 58112 49152 0 65535 178 262144 131072 0 0 35660 0 123561 98304 0 131071 179 524288 262142 0 0 31302 0 212057 196608 0 262143 180 1048576 524286 0 0 312438 0 484942 393216 0 524287 181 182 Discussion: The tests that end up doing (close to) perfectly balanced 183 merges (*sort, !sort) need all N//2 temp slots (or almost all). ~sort 184 also ends up doing balanced merges, but systematically benefits a lot from 185 the preliminary pre-merge searches described under "Merge Memory" later. 186 %sort approaches having a balanced merge at the end because the random 187 selection of elements to replace is expected to produce an out-of-order 188 element near the midpoint. \sort, /sort, =sort are the trivial one-run 189 cases, needing no merging at all. +sort ends up having one very long run 190 and one very short, and so gets all the temp space it needs from the small 191 temparray member of the MergeState struct (note that the same would be 192 true if the new random elements were prefixed to the sorted list instead, 193 but not if they appeared "in the middle"). 3sort approaches N//3 temp 194 slots twice, but the run lengths that remain after 3 random exchanges 195 clearly has very high variance. 196 197 198 A detailed description of timsort follows. 199 200 Runs 201 ---- 202 count_run() returns the # of elements in the next run. A run is either 203 "ascending", which means non-decreasing: 204 205 a0 <= a1 <= a2 <= ... 206 207 or "descending", which means strictly decreasing: 208 209 a0 > a1 > a2 > ... 210 211 Note that a run is always at least 2 long, unless we start at the array's 212 last element. 213 214 The definition of descending is strict, because the main routine reverses 215 a descending run in-place, transforming a descending run into an ascending 216 run. Reversal is done via the obvious fast "swap elements starting at each 217 end, and converge at the middle" method, and that can violate stability if 218 the slice contains any equal elements. Using a strict definition of 219 descending ensures that a descending run contains distinct elements. 220 221 If an array is random, it's very unlikely we'll see long runs. If a natural 222 run contains less than minrun elements (see next section), the main loop 223 artificially boosts it to minrun elements, via a stable binary insertion sort 224 applied to the right number of array elements following the short natural 225 run. In a random array, *all* runs are likely to be minrun long as a 226 result. This has two primary good effects: 227 228 1. Random data strongly tends then toward perfectly balanced (both runs have 229 the same length) merges, which is the most efficient way to proceed when 230 data is random. 231 232 2. Because runs are never very short, the rest of the code doesn't make 233 heroic efforts to shave a few cycles off per-merge overheads. For 234 example, reasonable use of function calls is made, rather than trying to 235 inline everything. Since there are no more than N/minrun runs to begin 236 with, a few "extra" function calls per merge is barely measurable. 237 238 239 Computing minrun 240 ---------------- 241 If N < 64, minrun is N. IOW, binary insertion sort is used for the whole 242 array then; it's hard to beat that given the overheads of trying something 243 fancier. 244 245 When N is a power of 2, testing on random data showed that minrun values of 246 16, 32, 64 and 128 worked about equally well. At 256 the data-movement cost 247 in binary insertion sort clearly hurt, and at 8 the increase in the number 248 of function calls clearly hurt. Picking *some* power of 2 is important 249 here, so that the merges end up perfectly balanced (see next section). We 250 pick 32 as a good value in the sweet range; picking a value at the low end 251 allows the adaptive gimmicks more opportunity to exploit shorter natural 252 runs. 253 254 Because sortperf.py only tries powers of 2, it took a long time to notice 255 that 32 isn't a good choice for the general case! Consider N=2112: 256 257 >>> divmod(2112, 32) 258 (66, 0) 259 >>> 260 261 If the data is randomly ordered, we're very likely to end up with 66 runs 262 each of length 32. The first 64 of these trigger a sequence of perfectly 263 balanced merges (see next section), leaving runs of lengths 2048 and 64 to 264 merge at the end. The adaptive gimmicks can do that with fewer than 2048+64 265 compares, but it's still more compares than necessary, and-- mergesort's 266 bugaboo relative to samplesort --a lot more data movement (O(N) copies just 267 to get 64 elements into place). 268 269 If we take minrun=33 in this case, then we're very likely to end up with 64 270 runs each of length 33, and then all merges are perfectly balanced. Better! 271 272 What we want to avoid is picking minrun such that in 273 274 q, r = divmod(N, minrun) 275 276 q is a power of 2 and r>0 (then the last merge only gets r elements into 277 place, and r < minrun is small compared to N), or q a little larger than a 278 power of 2 regardless of r (then we've got a case similar to "2112", again 279 leaving too little work for the last merge to do). 280 281 Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a 282 power of 2, or if that isn't possible, is close to, but strictly less than, 283 a power of 2. This is easier to do than it may sound: take the first 6 284 bits of N, and add 1 if any of the remaining bits are set. In fact, that 285 rule covers every case in this section, including small N and exact powers 286 of 2; merge_compute_minrun() is a deceptively simple function. 287 288 289 The Merge Pattern 290 ----------------- 291 In order to exploit regularities in the data, we're merging on natural 292 run lengths, and they can become wildly unbalanced. That's a Good Thing 293 for this sort! It means we have to find a way to manage an assortment of 294 potentially very different run lengths, though. 295 296 Stability constrains permissible merging patterns. For example, if we have 297 3 consecutive runs of lengths 298 299 A:10000 B:20000 C:10000 300 301 we dare not merge A with C first, because if A, B and C happen to contain 302 a common element, it would get out of order wrt its occurence(s) in B. The 303 merging must be done as (A+B)+C or A+(B+C) instead. 304 305 So merging is always done on two consecutive runs at a time, and in-place, 306 although this may require some temp memory (more on that later). 307 308 When a run is identified, its base address and length are pushed on a stack 309 in the MergeState struct. merge_collapse() is then called to see whether it 310 should merge it with preceding run(s). We would like to delay merging as 311 long as possible in order to exploit patterns that may come up later, but we 312 like even more to do merging as soon as possible to exploit that the run just 313 found is still high in the memory hierarchy. We also can't delay merging 314 "too long" because it consumes memory to remember the runs that are still 315 unmerged, and the stack has a fixed size. 316 317 What turned out to be a good compromise maintains two invariants on the 318 stack entries, where A, B and C are the lengths of the three righmost not-yet 319 merged slices: 320 321 1. A > B+C 322 2. B > C 323 324 Note that, by induction, #2 implies the lengths of pending runs form a 325 decreasing sequence. #1 implies that, reading the lengths right to left, 326 the pending-run lengths grow at least as fast as the Fibonacci numbers. 327 Therefore the stack can never grow larger than about log_base_phi(N) entries, 328 where phi = (1+sqrt(5))/2 ~= 1.618. Thus a small # of stack slots suffice 329 for very large arrays. 330 331 If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the 332 freshness-in-cache reason), and the new run replaces the A,B or B,C entries; 333 e.g., if the last 3 entries are 334 335 A:30 B:20 C:10 336 337 then B is merged with C, leaving 338 339 A:30 BC:30 340 341 on the stack. Or if they were 342 343 A:500 B:400: C:1000 344 345 then A is merged with B, leaving 346 347 AB:900 C:1000 348 349 on the stack. 350 351 In both examples, the stack configuration after the merge still violates 352 invariant #2, and merge_collapse() goes on to continue merging runs until 353 both invariants are satisfied. As an extreme case, suppose we didn't do the 354 minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2, 355 and 2. Nothing would get merged until the final 2 was seen, and that would 356 trigger 7 perfectly balanced merges. 357 358 The thrust of these rules when they trigger merging is to balance the run 359 lengths as closely as possible, while keeping a low bound on the number of 360 runs we have to remember. This is maximally effective for random data, 361 where all runs are likely to be of (artificially forced) length minrun, and 362 then we get a sequence of perfectly balanced merges (with, perhaps, some 363 oddballs at the end). 364 365 OTOH, one reason this sort is so good for partly ordered data has to do 366 with wildly unbalanced run lengths. 367 368 369 Merge Memory 370 ------------ 371 Merging adjacent runs of lengths A and B in-place is very difficult. 372 Theoretical constructions are known that can do it, but they're too difficult 373 and slow for practical use. But if we have temp memory equal to min(A, B), 374 it's easy. 375 376 If A is smaller (function merge_lo), copy A to a temp array, leave B alone, 377 and then we can do the obvious merge algorithm left to right, from the temp 378 area and B, starting the stores into where A used to live. There's always a 379 free area in the original area comprising a number of elements equal to the 380 number not yet merged from the temp array (trivially true at the start; 381 proceed by induction). The only tricky bit is that if a comparison raises an 382 exception, we have to remember to copy the remaining elements back in from 383 the temp area, lest the array end up with duplicate entries from B. But 384 that's exactly the same thing we need to do if we reach the end of B first, 385 so the exit code is pleasantly common to both the normal and error cases. 386 387 If B is smaller (function merge_hi, which is merge_lo's "mirror image"), 388 much the same, except that we need to merge right to left, copying B into a 389 temp array and starting the stores at the right end of where B used to live. 390 391 A refinement: When we're about to merge adjacent runs A and B, we first do 392 a form of binary search (more on that later) to see where B[0] should end up 393 in A. Elements in A preceding that point are already in their final 394 positions, effectively shrinking the size of A. Likewise we also search to 395 see where A[-1] should end up in B, and elements of B after that point can 396 also be ignored. This cuts the amount of temp memory needed by the same 397 amount. 398 399 These preliminary searches may not pay off, and can be expected *not* to 400 repay their cost if the data is random. But they can win huge in all of 401 time, copying, and memory savings when they do pay, so this is one of the 402 "per-merge overheads" mentioned above that we're happy to endure because 403 there is at most one very short run. It's generally true in this algorithm 404 that we're willing to gamble a little to win a lot, even though the net 405 expectation is negative for random data. 406 407 408 Merge Algorithms 409 ---------------- 410 merge_lo() and merge_hi() are where the bulk of the time is spent. merge_lo 411 deals with runs where A <= B, and merge_hi where A > B. They don't know 412 whether the data is clustered or uniform, but a lovely thing about merging 413 is that many kinds of clustering "reveal themselves" by how many times in a 414 row the winning merge element comes from the same run. We'll only discuss 415 merge_lo here; merge_hi is exactly analogous. 416 417 Merging begins in the usual, obvious way, comparing the first element of A 418 to the first of B, and moving B[0] to the merge area if it's less than A[0], 419 else moving A[0] to the merge area. Call that the "one pair at a time" 420 mode. The only twist here is keeping track of how many times in a row "the 421 winner" comes from the same run. 422 423 If that count reaches MIN_GALLOP, we switch to "galloping mode". Here 424 we *search* B for where A[0] belongs, and move over all the B's before 425 that point in one chunk to the merge area, then move A[0] to the merge 426 area. Then we search A for where B[0] belongs, and similarly move a 427 slice of A in one chunk. Then back to searching B for where A[0] belongs, 428 etc. We stay in galloping mode until both searches find slices to copy 429 less than MIN_GALLOP elements long, at which point we go back to one-pair- 430 at-a-time mode. 431 432 A refinement: The MergeState struct contains the value of min_gallop that 433 controls when we enter galloping mode, initialized to MIN_GALLOP. 434 merge_lo() and merge_hi() adjust this higher when galloping isn't paying 435 off, and lower when it is. 436 437 438 Galloping 439 --------- 440 Still without loss of generality, assume A is the shorter run. In galloping 441 mode, we first look for A[0] in B. We do this via "galloping", comparing 442 A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding 443 the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1]. This takes at most 444 roughly lg(B) comparisons, and, unlike a straight binary search, favors 445 finding the right spot early in B (more on that later). 446 447 After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1 448 consecutive elements, and a straight binary search requires exactly k-1 449 additional comparisons to nail it. Then we copy all the B's up to that 450 point in one chunk, and then copy A[0]. Note that no matter where A[0] 451 belongs in B, the combination of galloping + binary search finds it in no 452 more than about 2*lg(B) comparisons. 453 454 If we did a straight binary search, we could find it in no more than 455 ceiling(lg(B+1)) comparisons -- but straight binary search takes that many 456 comparisons no matter where A[0] belongs. Straight binary search thus loses 457 to galloping unless the run is quite long, and we simply can't guess 458 whether it is in advance. 459 460 If data is random and runs have the same length, A[0] belongs at B[0] half 461 the time, at B[1] a quarter of the time, and so on: a consecutive winning 462 sub-run in B of length k occurs with probability 1/2**(k+1). So long 463 winning sub-runs are extremely unlikely in random data, and guessing that a 464 winning sub-run is going to be long is a dangerous game. 465 466 OTOH, if data is lopsided or lumpy or contains many duplicates, long 467 stretches of winning sub-runs are very likely, and cutting the number of 468 comparisons needed to find one from O(B) to O(log B) is a huge win. 469 470 Galloping compromises by getting out fast if there isn't a long winning 471 sub-run, yet finding such very efficiently when they exist. 472 473 I first learned about the galloping strategy in a related context; see: 474 475 "Adaptive Set Intersections, Unions, and Differences" (2000) 476 Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro 477 478 and its followup(s). An earlier paper called the same strategy 479 "exponential search": 480 481 "Optimistic Sorting and Information Theoretic Complexity" 482 Peter McIlroy 483 SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp 484 467-474, Austin, Texas, 25-27 January 1993. 485 486 and it probably dates back to an earlier paper by Bentley and Yao. The 487 McIlory paper in particular has good analysis of a mergesort that's 488 probably strongly related to this one in its galloping strategy. 489 490 491 Galloping with a Broken Leg 492 --------------------------- 493 So why don't we always gallop? Because it can lose, on two counts: 494 495 1. While we're willing to endure small per-merge overheads, per-comparison 496 overheads are a different story. Calling Yet Another Function per 497 comparison is expensive, and gallop_left() and gallop_right() are 498 too long-winded for sane inlining. 499 500 2. Galloping can-- alas --require more comparisons than linear one-at-time 501 search, depending on the data. 502 503 #2 requires details. If A[0] belongs before B[0], galloping requires 1 504 compare to determine that, same as linear search, except it costs more 505 to call the gallop function. If A[0] belongs right before B[1], galloping 506 requires 2 compares, again same as linear search. On the third compare, 507 galloping checks A[0] against B[3], and if it's <=, requires one more 508 compare to determine whether A[0] belongs at B[2] or B[3]. That's a total 509 of 4 compares, but if A[0] does belong at B[2], linear search would have 510 discovered that in only 3 compares, and that's a huge loss! Really. It's 511 an increase of 33% in the number of compares needed, and comparisons are 512 expensive in Python. 513 514 index in B where # compares linear # gallop # binary gallop 515 A[0] belongs search needs compares compares total 516 ---------------- ----------------- -------- -------- ------ 517 0 1 1 0 1 518 519 1 2 2 0 2 520 521 2 3 3 1 4 522 3 4 3 1 4 523 524 4 5 4 2 6 525 5 6 4 2 6 526 6 7 4 2 6 527 7 8 4 2 6 528 529 8 9 5 3 8 530 9 10 5 3 8 531 10 11 5 3 8 532 11 12 5 3 8 533 ... 534 535 In general, if A[0] belongs at B[i], linear search requires i+1 comparisons 536 to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons. 537 The advantage of galloping is unbounded as i grows, but it doesn't win at 538 all until i=6. Before then, it loses twice (at i=2 and i=4), and ties 539 at the other values. At and after i=6, galloping always wins. 540 541 We can't guess in advance when it's going to win, though, so we do one pair 542 at a time until the evidence seems strong that galloping may pay. MIN_GALLOP 543 is 7, and that's pretty strong evidence. However, if the data is random, it 544 simply will trigger galloping mode purely by luck every now and again, and 545 it's quite likely to hit one of the losing cases next. On the other hand, 546 in cases like ~sort, galloping always pays, and MIN_GALLOP is larger than it 547 "should be" then. So the MergeState struct keeps a min_gallop variable 548 that merge_lo and merge_hi adjust: the longer we stay in galloping mode, 549 the smaller min_gallop gets, making it easier to transition back to 550 galloping mode (if we ever leave it in the current merge, and at the 551 start of the next merge). But whenever the gallop loop doesn't pay, 552 min_gallop is increased by one, making it harder to transition back 553 to galloping mode (and again both within a merge and across merges). For 554 random data, this all but eliminates the gallop penalty: min_gallop grows 555 large enough that we almost never get into galloping mode. And for cases 556 like ~sort, min_gallop can fall to as low as 1. This seems to work well, 557 but in all it's a minor improvement over using a fixed MIN_GALLOP value. 558 559 560 Galloping Complication 561 ---------------------- 562 The description above was for merge_lo. merge_hi has to merge "from the 563 other end", and really needs to gallop starting at the last element in a run 564 instead of the first. Galloping from the first still works, but does more 565 comparisons than it should (this is significant -- I timed it both ways). 566 For this reason, the gallop_left() and gallop_right() functions have a 567 "hint" argument, which is the index at which galloping should begin. So 568 galloping can actually start at any index, and proceed at offsets of 1, 3, 569 7, 15, ... or -1, -3, -7, -15, ... from the starting index. 570 571 In the code as I type it's always called with either 0 or n-1 (where n is 572 the # of elements in a run). It's tempting to try to do something fancier, 573 melding galloping with some form of interpolation search; for example, if 574 we're merging a run of length 1 with a run of length 10000, index 5000 is 575 probably a better guess at the final result than either 0 or 9999. But 576 it's unclear how to generalize that intuition usefully, and merging of 577 wildly unbalanced runs already enjoys excellent performance. 578 579 ~sort is a good example of when balanced runs could benefit from a better 580 hint value: to the extent possible, this would like to use a starting 581 offset equal to the previous value of acount/bcount. Doing so saves about 582 10% of the compares in ~sort. However, doing so is also a mixed bag, 583 hurting other cases. 584 585 586 Comparing Average # of Compares on Random Arrays 587 ------------------------------------------------ 588 [NOTE: This was done when the new algorithm used about 0.1% more compares 589 on random data than does its current incarnation.] 590 591 Here list.sort() is samplesort, and list.msort() this sort: 592 593 """ 594 import random 595 from time import clock as now 596 597 def fill(n): 598 from random import random 599 return [random() for i in xrange(n)] 600 601 def mycmp(x, y): 602 global ncmp 603 ncmp += 1 604 return cmp(x, y) 605 606 def timeit(values, method): 607 global ncmp 608 X = values[:] 609 bound = getattr(X, method) 610 ncmp = 0 611 t1 = now() 612 bound(mycmp) 613 t2 = now() 614 return t2-t1, ncmp 615 616 format = "%5s %9.2f %11d" 617 f2 = "%5s %9.2f %11.2f" 618 619 def drive(): 620 count = sst = sscmp = mst = mscmp = nelts = 0 621 while True: 622 n = random.randrange(100000) 623 nelts += n 624 x = fill(n) 625 626 t, c = timeit(x, 'sort') 627 sst += t 628 sscmp += c 629 630 t, c = timeit(x, 'msort') 631 mst += t 632 mscmp += c 633 634 count += 1 635 if count % 10: 636 continue 637 638 print "count", count, "nelts", nelts 639 print format % ("sort", sst, sscmp) 640 print format % ("msort", mst, mscmp) 641 print f2 % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp) 642 643 drive() 644 """ 645 646 I ran this on Windows and kept using the computer lightly while it was 647 running. time.clock() is wall-clock time on Windows, with better than 648 microsecond resolution. samplesort started with a 1.52% #-of-comparisons 649 disadvantage, fell quickly to 1.48%, and then fluctuated within that small 650 range. Here's the last chunk of output before I killed the job: 651 652 count 2630 nelts 130906543 653 sort 6110.80 1937887573 654 msort 6002.78 1909389381 655 1.80 1.49 656 657 We've done nearly 2 billion comparisons apiece at Python speed there, and 658 that's enough <wink>. 659 660 For random arrays of size 2 (yes, there are only 2 interesting ones), 661 samplesort has a 50%(!) comparison disadvantage. This is a consequence of 662 samplesort special-casing at most one ascending run at the start, then 663 falling back to the general case if it doesn't find an ascending run 664 immediately. The consequence is that it ends up using two compares to sort 665 [2, 1]. Gratifyingly, timsort doesn't do any special-casing, so had to be 666 taught how to deal with mixtures of ascending and descending runs 667 efficiently in all cases.
Testing
