Sorting Related Articles by Relevance

Sort By Frequency On Array

To make the code easier to write and test, I decided to roll a `sort_by_frequency` method that I could add as an instance method to `Array`. I initially started with the tests inside the existing plugin tests, but, it soon became difficult when trying to do




1234
assert_equal 3, article.related_articles.size1.upto(3) do |n|  assert_equal thoughtworks_articles[n], article.related_articles[n]end

... since Rails would dump an inspection of the Articles that didn’t match, making it rather difficult to figure out what’s going on.

So, I started simpler, and added a new test into the plugin’s test directory, and wrote the following




123456
def test_can_sort_by_frequency_for_numbers  unsorted = [2, 1, 2, 3, 4, 3, 4, 3, 4, 4]  expected = [4, 4, 4, 4, 3, 3, 3, 2, 2, 1]      assert_equal expected, unsorted.sort_by_frequencyend

Running the test showed that the `Array` class didn’t contain a `sort_by_frequency` method, so I went and added a module and (like with the `related_articles` method) injected into the Array class




Array.send :include, SortByFrequency

To get the test up and failing meaningfully, I also then just returned `self` inside `sorty_by_frequency`. Finally, a failing test – we’re expecting an array sorted by the frequency the object occurred inside the original array. Instead, we’re just getting the original array back.

So, to implement I wrote




1234
def sort_by_frequency  histogram = inject(Hash.new(0)) {|hash, element| hash[element] += 1; hash}  sort {|x, y| (histogram[x] <=> histogram[y]) * - 1}end

The code works as follows. Firstly, it creates a histogram – which records the number of times the value inside the array has occurred. The histogram is implemented as a `Hash` which uses `Enumerable’s` `inject` method to create itself.

Using `inject` reminds you of just how painful dealing with arrays, lists, sets, and collections is in other languages (read Java, C# and others).

It works by passing the parameter to `inject` as the first argument to the block. The second time it’s called, it passes the result of the block’s previous execution. So, the `hash` is retained across block invocations – it’s nice to avoid having to declare the `histogram` separately! The second parameter to the block is the element from the array – from our original test, the numbers.

Once the histogram has been created, it’s time to then sort ourselves based on it’s contents. To do so, we use `Enumerable`’s `sort` method which uses the result from the block to determine how to sort the array. We compare the frequency counts inside the histogram for both sides of the sort (`x` and `y` parameters to the block being elements from our array). We then multiply the result of the comparison by -1, this way, it will invert the results so the most frequent are first.

If we run the test again we’ll get a green-ish bar (love TextMate’s new colouring of test results). That’s given me some confidence that we’ve written the sort code correctly, but I just want to make sure the method works fine with Typo article’s also. So, let’s also add a test




12345678910
def test_can_sort_articles  a1 = Article.new()  a2 = Article.new()  a3 = Article.new()  unsorted = [a2, a1, a3, a2, a3, a3]  expected = [a3, a3, a3, a2, a2, a1]  assert_equal expected, unsorted.sort_by_frequencyend

Now all we need to do is go to our `related_articles` tests, and add a test to show we’re sorting related articles by relevance.

Sorting Related Articles by Relevance

To test that we can order the results by the number of tags they share, we need to create some test articles. Rather than adding them through Rails’ fixtures (which I’ve found to be a bit of a pain) instead we’ll create them inside our test.

We’ll create four articles. The first article will be what we consider to be the article we’re currently viewing. The others will have a different number of shared tags each. My test looks as follows




123456789101112131415161718192021
def test_can_order_by_number_of_overlapping_tags  article = create_valid_article(:keywords => 'thoughtworks agile development')      # Create a sliding scale of articles, all of which share the 'thoughtworks'  # tag. 2 will share the 'agile' tag, and 1 will share the development tag.  thoughtworks_articles = []  1.upto(3) do |n|    if n == 1      thoughtworks_articles << create_valid_article(:keywords => 'thoughtworks agile development')    elsif n < 3      thoughtworks_articles << create_valid_article(:keywords => 'thoughtworks agile')    else      thoughtworks_articles << create_valid_article(:keywords => 'thoughtworks')    end  end      assert_equal 3, article.related_articles.size  1.upto(3) do |n|    assert_equal thoughtworks_articles[n], article.related_articles[n]  endend

If we run the test now, it will fail so we need to change the code as follows




1234567891011121314
def related_articles(*options)  related_articles = []  self.tags.each do |tag|    tag.articles.each {|article| related_articles << article if can_add_related?(article, related_articles)}  end      related_articles = related_articles.sort_by_frequency.uniq  ...endprivatedef can_add_related?(article, related_articles)  article.published and article != selfend

Note that we’ve also changed `can_add_related?` from the previous iteration (from the code I put in a previous post), since we now need to have an array containing duplicate entries – the number of times an article appears is used as the index to the array. We exclude duplicate entries once we’ve sorted, using the `uniq` method.

Running the tests now and we’ll see the tests pass!

Refactoring

My last post attracted a few suggestions from Piers Cawley (one of the Typo developers), so I thought I’d take a look to try and learn a few things.

Firstly, I changed my related articles test to the following




123456
privateVALID_ARTICLE_DEFAULTS = {:user_id => 1, :body => 'foo', :title => 'bar', :keywords => 'testing1', :published => true}def create_valid_article(*args)  options = Hash === args.last ? args.pop : {}  Article.create! options.reverse_merge(VALID_ARTICLE_DEFAULTS)end

The change uses Rails’ core extensions for Hash – `reverse_merge` which replaces anything inside our `VALID_ARTICLES_DEFAULTS` with those set on our `options` argument. It also uses Rails’ helper method `create` which both creates the in-memory object, and saves it to the database. Typo has a number of filters which handle when the `ActiveRecord` object is created/saved so the `create` method can be seen a factory method that sets up the `Article` in a good state.

I’m also not so keen on the length and structure of `related_articles` in `RelatedArticlesSearch` module. Let’s see if we can make some small refactorings to improve the code.

Firstly, let’s extract out a method for the finding of all related articles before we do the rest of the processing




1234567891011121314151617181920
def related_articles(*options)  related_articles = find_related_articles.sort_by_frequency.uniq      if options.length == 0    related_articles  else    params = options[0]    count = params[:limit] - 1    related_articles[0..count]  endend   privatedef find_related_articles  related_articles = []  self.tags.each do |tag|    tag.articles.each {|article| related_articles << article if can_add_related?(article, related_articles)}  end  related_articles    end

We introduced the `find_related_articles` method – to do the database search. Re-run the tests to make sure we can run and we’re happy to go on. Cool, let’s see if we can reduce the number of lines inside our `find_related_articles` method. How about we use the `inject` method again to create our Array




12345678910111213141516171819
def related_articles(*options)  related_articles = find_related_articles.sort_by_frequency.uniq      if options.length == 0    return related_articles  end      params = options[0]  count = params[:limit] - 1  related_articles[0..count]endprivatedef find_related_articles  self.tags.inject(Array.new) do |related_articles, tag|    tag.articles.each {|article| related_articles << article if can_add_related?(article)}    related_articles  endend

Neat. So, we create our result array when we start iterating over our `tags`, and then add items to it as we go. Although this works, we could probably wrap up the appending of items to the array using one of the other Ruby iterators.




123456
privatedef find_related_articles  self.tags.inject(Array.new) do |related_articles, tag|    related_articles + tag.articles.collect {|article| article if can_add_related?(article)}.compact  endend

This time we use the `collect` method to return an Array of elements we’d like to add to our other array. Before we append them using the `+` operator, we call `compact` to remove any `nil` entries first. Much nicer.

Now, let’s turn ourselves to the `related_articles` method, let’s first get rid of the large conditional and instead do




1234567891011
def related_articles(*options)  related_articles = find_related_articles.sort_by_frequency.uniq      if options.length == 0    return related_articles  end      params = options[0]  count = params[:limit] - 1  related_articles[0..count]end

Looks a little neater, but that `:limit` stuff still looks messy. How about we use




12345678
def related_articles(*options)  params = Hash === options.last ? options.pop : {}      related_articles = find_related_articles.sort_by_frequency.uniq      count = nil != params[:limit] ? params[:limit] : related_articles.size      related_articles[0..count - 1]end

Better, but still reads a little weirdly. Let’s try a little more, perhaps we should go with




12345678
def related_articles(*options)  params = Hash === options.last ? options.pop : {}      related_articles = find_related_articles.sort_by_frequency.uniq  count = params[:limit] ||= related_articles.size      related_articles[0..count - 1]end

Much nicer.

Getting the Code

The code for the `Article::related_articles` method is available as a Rails plugin, and can be installed by running:

$ script/plugin install http://www.engross.org/svn/typolatest/trunk$ mv vendor/plugins/trunk vendor/plugins/related_articles_search

or, you can also check it out (via Subversion) as follows

$ cd vendor/plugins$ svn co http://www.engross.org/svn/typo/typolatest/trunk related_articles_search

Once you’ve done that, you can download the Sidebar code. This will currently only work with non-edge Typo (i.e. the latest fully released version) as I’ve not written a Sidebar component for the new architecture. I’ll be working on that next :)

To get the Sidebar component you can download the component in a ZIP archive. Then, unzip it inside your `components/plugins/sidebars` directory. You should end up with a `components/plugins/sidebars/related_articles` directory.

Of course, you can also browse the code through Trac at http://dev.engross.org/trac/typo/browser. Hopefully I’ve not made quite such a mess of the code as before, but if I have, please point out where I can improve.