Sort By Frequency On Array
To make the code easier to write and test, I decided to roll a `sort_by_frequency` method that I could add as an instance method to `Array`. I initially started with the tests inside the existing plugin tests, but, it soon became difficult when trying to do
1234 |
assert_equal 3, article.related_articles.size1.upto(3) do |n| assert_equal thoughtworks_articles[n], article.related_articles[n]end |
... since Rails would dump an inspection of the Articles that didn’t match, making it rather difficult to figure out what’s going on.
So, I started simpler, and added a new test into the plugin’s test directory, and wrote the following
123456 |
def test_can_sort_by_frequency_for_numbers unsorted = [2, 1, 2, 3, 4, 3, 4, 3, 4, 4] expected = [4, 4, 4, 4, 3, 3, 3, 2, 2, 1] assert_equal expected, unsorted.sort_by_frequencyend |
Running the test showed that the `Array` class didn’t contain a `sort_by_frequency` method, so I went and added a module and (like with the `related_articles` method) injected into the Array class
|
Array.send :include, SortByFrequency |
To get the test up and failing meaningfully, I also then just returned `self` inside `sorty_by_frequency`. Finally, a failing test – we’re expecting an array sorted by the frequency the object occurred inside the original array. Instead, we’re just getting the original array back.
So, to implement I wrote
1234 |
def sort_by_frequency histogram = inject(Hash.new(0)) {|hash, element| hash[element] += 1; hash} sort {|x, y| (histogram[x] <=> histogram[y]) * - 1}end |
The code works as follows. Firstly, it creates a histogram – which records the number of times the value inside the array has occurred. The histogram is implemented as a `Hash` which uses `Enumerable’s` `inject` method to create itself.
Using `inject` reminds you of just how painful dealing with arrays, lists, sets, and collections is in other languages (read Java, C# and others).
It works by passing the parameter to `inject` as the first argument to the block. The second time it’s called, it passes the result of the block’s previous execution. So, the `hash` is retained across block invocations – it’s nice to avoid having to declare the `histogram` separately! The second parameter to the block is the element from the array – from our original test, the numbers.
Once the histogram has been created, it’s time to then sort ourselves based on it’s contents. To do so, we use `Enumerable`’s `sort` method which uses the result from the block to determine how to sort the array. We compare the frequency counts inside the histogram for both sides of the sort (`x` and `y` parameters to the block being elements from our array). We then multiply the result of the comparison by -1, this way, it will invert the results so the most frequent are first.
If we run the test again we’ll get a green-ish bar (love TextMate’s new colouring of test results). That’s given me some confidence that we’ve written the sort code correctly, but I just want to make sure the method works fine with Typo article’s also. So, let’s also add a test
12345678910 |
def test_can_sort_articles a1 = Article.new() a2 = Article.new() a3 = Article.new() unsorted = [a2, a1, a3, a2, a3, a3] expected = [a3, a3, a3, a2, a2, a1] assert_equal expected, unsorted.sort_by_frequencyend |
Now all we need to do is go to our `related_articles` tests, and add a test to show we’re sorting related articles by relevance.
Sorting Related Articles by Relevance
To test that we can order the results by the number of tags they share, we need to create some test articles. Rather than adding them through Rails’ fixtures (which I’ve found to be a bit of a pain) instead we’ll create them inside our test.
We’ll create four articles. The first article will be what we consider to be the article we’re currently viewing. The others will have a different number of shared tags each. My test looks as follows
123456789101112131415161718192021 |
def test_can_order_by_number_of_overlapping_tags article = create_valid_article(:keywords => 'thoughtworks agile development') # Create a sliding scale of articles, all of which share the 'thoughtworks' # tag. 2 will share the 'agile' tag, and 1 will share the development tag. thoughtworks_articles = [] 1.upto(3) do |n| if n == 1 thoughtworks_articles << create_valid_article(:keywords => 'thoughtworks agile development') elsif n < 3 thoughtworks_articles << create_valid_article(:keywords => 'thoughtworks agile') else thoughtworks_articles << create_valid_article(:keywords => 'thoughtworks') end end assert_equal 3, article.related_articles.size 1.upto(3) do |n| assert_equal thoughtworks_articles[n], article.related_articles[n] endend |
If we run the test now, it will fail so we need to change the code as follows
1234567891011121314 |
def related_articles(*options) related_articles = [] self.tags.each do |tag| tag.articles.each {|article| related_articles << article if can_add_related?(article, related_articles)} end related_articles = related_articles.sort_by_frequency.uniq ...endprivatedef can_add_related?(article, related_articles) article.published and article != selfend |
Note that we’ve also changed `can_add_related?` from the previous iteration (from the code I put in a previous post), since we now need to have an array containing duplicate entries – the number of times an article appears is used as the index to the array. We exclude duplicate entries once we’ve sorted, using the `uniq` method.
Running the tests now and we’ll see the tests pass!
Refactoring
My last post attracted a few suggestions from Piers Cawley (one of the Typo developers), so I thought I’d take a look to try and learn a few things.
Firstly, I changed my related articles test to the following
123456 |
privateVALID_ARTICLE_DEFAULTS = {:user_id => 1, :body => 'foo', :title => 'bar', :keywords => 'testing1', :published => true}def create_valid_article(*args) options = Hash === args.last ? args.pop : {} Article.create! options.reverse_merge(VALID_ARTICLE_DEFAULTS)end |
The change uses Rails’ core extensions for Hash – `reverse_merge` which replaces anything inside our `VALID_ARTICLES_DEFAULTS` with those set on our `options` argument. It also uses Rails’ helper method `create` which both creates the in-memory object, and saves it to the database. Typo has a number of filters which handle when the `ActiveRecord` object is created/saved so the `create` method can be seen a factory method that sets up the `Article` in a good state.
I’m also not so keen on the length and structure of `related_articles` in `RelatedArticlesSearch` module. Let’s see if we can make some small refactorings to improve the code.
Firstly, let’s extract out a method for the finding of all related articles before we do the rest of the processing
1234567891011121314151617181920 |
def related_articles(*options) related_articles = find_related_articles.sort_by_frequency.uniq if options.length == 0 related_articles else params = options[0] count = params[:limit] - 1 related_articles[0..count] endend privatedef find_related_articles related_articles = [] self.tags.each do |tag| tag.articles.each {|article| related_articles << article if can_add_related?(article, related_articles)} end related_articles end |
We introduced the `find_related_articles` method – to do the database search. Re-run the tests to make sure we can run and we’re happy to go on. Cool, let’s see if we can reduce the number of lines inside our `find_related_articles` method. How about we use the `inject` method again to create our Array
12345678910111213141516171819 |
def related_articles(*options) related_articles = find_related_articles.sort_by_frequency.uniq if options.length == 0 return related_articles end params = options[0] count = params[:limit] - 1 related_articles[0..count]endprivatedef find_related_articles self.tags.inject(Array.new) do |related_articles, tag| tag.articles.each {|article| related_articles << article if can_add_related?(article)} related_articles endend |
Neat. So, we create our result array when we start iterating over our `tags`, and then add items to it as we go. Although this works, we could probably wrap up the appending of items to the array using one of the other Ruby iterators.
123456 |
privatedef find_related_articles self.tags.inject(Array.new) do |related_articles, tag| related_articles + tag.articles.collect {|article| article if can_add_related?(article)}.compact endend |
This time we use the `collect` method to return an Array of elements we’d like to add to our other array. Before we append them using the `+` operator, we call `compact` to remove any `nil` entries first. Much nicer.
Now, let’s turn ourselves to the `related_articles` method, let’s first get rid of the large conditional and instead do
1234567891011 |
def related_articles(*options) related_articles = find_related_articles.sort_by_frequency.uniq if options.length == 0 return related_articles end params = options[0] count = params[:limit] - 1 related_articles[0..count]end |
Looks a little neater, but that `:limit` stuff still looks messy. How about we use
12345678 |
def related_articles(*options) params = Hash === options.last ? options.pop : {} related_articles = find_related_articles.sort_by_frequency.uniq count = nil != params[:limit] ? params[:limit] : related_articles.size related_articles[0..count - 1]end |
Better, but still reads a little weirdly. Let’s try a little more, perhaps we should go with
12345678 |
def related_articles(*options) params = Hash === options.last ? options.pop : {} related_articles = find_related_articles.sort_by_frequency.uniq count = params[:limit] ||= related_articles.size related_articles[0..count - 1]end |
Much nicer.
Getting the Code
The code for the `Article::related_articles` method is available as a Rails plugin, and can be installed by running:
$ script/plugin install http://www.engross.org/svn/typolatest/trunk$ mv vendor/plugins/trunk vendor/plugins/related_articles_search
or, you can also check it out (via Subversion) as follows
$ cd vendor/plugins$ svn co http://www.engross.org/svn/typo/typolatest/trunk related_articles_search
Once you’ve done that, you can download the Sidebar code. This will currently only work with non-edge Typo (i.e. the latest fully released version) as I’ve not written a Sidebar component for the new architecture. I’ll be working on that next :)
To get the Sidebar component you can download the component in a ZIP archive. Then, unzip it inside your `components/plugins/sidebars` directory. You should end up with a `components/plugins/sidebars/related_articles` directory.
Of course, you can also browse the code through Trac at http://dev.engross.org/trac/typo/browser. Hopefully I’ve not made quite such a mess of the code as before, but if I have, please point out where I can improve.