diff options
Diffstat (limited to 'doc/todo/smarter_sorting.mdwn')
-rw-r--r-- | doc/todo/smarter_sorting.mdwn | 141 |
1 files changed, 141 insertions, 0 deletions
diff --git a/doc/todo/smarter_sorting.mdwn b/doc/todo/smarter_sorting.mdwn new file mode 100644 index 000000000..901e143a7 --- /dev/null +++ b/doc/todo/smarter_sorting.mdwn @@ -0,0 +1,141 @@ +I benchmarked a build of a large wiki (my home wiki), and it was spending +quite a lot of time sorting; `CORE::sort` was called only 1138 times, but +still flagged as the #1 time sink. (I'm not sure I trust NYTProf fully +about that FWIW, since it also said 27238263 calls to `cmp_age` were +the #3 timesink, and I suspect it may not entirely accurately measure +the overhead of so many short function calls.) + +`pagespec_match_list` currently always sorts *all* pages first, and then +finds the top M that match the pagespec. That's innefficient when M is +small (as for example in a typical blog, where only 20 posts are shown, +out of maybe thousands). + +As [[smcv]] noted, It could be flipped, so the pagespec is applied first, +and then sort the smaller matching set. But, checking pagespecs is likely +more expensive than sorting. (Also, influence calculation complicates +doing that.) + +Another option, when there is a limit on M pages to return, might be to +cull the M top pages without sorting the rest. + +> The patch below implements this. +> +> But, I have not thought enough about influence calculation. +> I need to figure out which pagespec matches influences need to be +> accumulated for in order to determine all possible influences of a +> pagespec are known. +> +> The old code accumulates influences from matching all successful pages +> up to the num cutoff, as well as influences from an arbitrary (sometimes +> zero) number of failed matches. New code does not accumulate influences +> from all the top successful matches, only an arbitrary group of +> successes and some failures. +> +> Also, by the time I finished this, it was not measuarably faster than +> the old method. At least not with a few thousand pages; it +> might be worth revisiting this sometime for many more pages? [[done]] +> --[[Joey]] + +<pre> +diff --git a/IkiWiki.pm b/IkiWiki.pm +index 1730e47..bc8b23d 100644 +--- a/IkiWiki.pm ++++ b/IkiWiki.pm +@@ -2122,36 +2122,54 @@ sub pagespec_match_list ($$;@) { + my $num=$params{num}; + delete @params{qw{num deptype reverse sort filter list}}; + +- # when only the top matches will be returned, it's efficient to +- # sort before matching to pagespec, +- if (defined $num && defined $sort) { +- @candidates=IkiWiki::SortSpec::sort_pages( +- $sort, @candidates); +- } +- ++ # Find the first num matches (or all), before sorting. + my @matches; +- my $firstfail; + my $count=0; + my $accum=IkiWiki::SuccessReason->new(); +- foreach my $p (@candidates) { +- my $r=$sub->($p, %params, location => $page); ++ my $i; ++ for ($i=0; $i < @candidates; $i++) { ++ my $r=$sub->($candidates[$i], %params, location => $page); + error(sprintf(gettext("cannot match pages: %s"), $r)) + if $r->isa("IkiWiki::ErrorReason"); + $accum |= $r; + if ($r) { +- push @matches, $p; ++ push @matches, $candidates[$i]; + last if defined $num && ++$count == $num; + } + } + ++ # We have num natches, but they may not be the best. ++ # Efficiently find and add the rest, without sorting the full list of ++ # candidates. ++ if (defined $num && defined $sort) { ++ @matches=IkiWiki::SortSpec::sort_pages($sort, @matches); ++ ++ for ($i++; $i < @candidates; $i++) { ++ # Comparing candidate with lowest match is cheaper, ++ # so it's done before testing against pagespec. ++ if (IkiWiki::SortSpec::cmptwo($candidates[$i], $matches[-1], $sort) < 0 && ++ $sub->($candidates[$i], %params, location => $page) ++ ) { ++ # this could be done less expensively ++ # using a binary search ++ for (my $j=0; $j < @matches; $j++) { ++ if (IkiWiki::SortSpec::cmptwo($candidates[$i], $matches[$j], $sort) < 0) { ++ splice @matches, $j, $#matches-$j+1, $candidates[$i], ++ @matches[$j..$#matches-1]; ++ last; ++ } ++ } ++ } ++ } ++ } ++ + # Add simple dependencies for accumulated influences. +- my $i=$accum->influences; +- foreach my $k (keys %$i) { +- $depends_simple{$page}{lc $k} |= $i->{$k}; ++ my $inf=$accum->influences; ++ foreach my $k (keys %$inf) { ++ $depends_simple{$page}{lc $k} |= $inf->{$k}; + } + +- # when all matches will be returned, it's efficient to +- # sort after matching ++ # Sort if we didn't already. + if (! defined $num && defined $sort) { + return IkiWiki::SortSpec::sort_pages( + $sort, @matches); +@@ -2455,6 +2473,12 @@ sub sort_pages { + sort $f @_ + } + ++sub cmptwo { ++ $a=$_[0]; ++ $b=$_[1]; ++ $_[2]->(); ++} ++ + sub cmp_title { + IkiWiki::pagetitle(IkiWiki::basename($a)) + cmp +</pre> + +This would be bad when M is very large, and particularly, of course, when +there is no limit and all pages are being matched on. (For example, an +archive page shows all pages that match a pagespec specifying a creation +date range.) Well, in this case, it *does* make sense to flip it, limit by +pagespe first, and do a (quick)sort second. (No influence complications, +either.) + +> Flipping when there's no limit implemented, and it knocked 1/3 off +> the rebuild time of my blog's archive pages. --[[Joey]] + +Adding these special cases will be more complicated, but I think the best +of both worlds. --[[Joey]] |