Successfully recommending relevant content to web users is critical to business success for many web properties. A variety of techniques are used, with the largest sites and companies employing highly sophisticated technology to optimize their recommendations. In this blog post we discuss how to build an effective recommendation system from first principles in Ruby.

## Introduction

We adapt these algorithms, formulate them in Ruby and apply them to a whimsical dataset containing class scores for different students at Hogwarts School of Witchcraft and Wizardry. We use the data to generate predicted scores for the classes which each student has not taken. Whilst the dataset is whimsical, the algorithms presented are sound, and could legitimately be applied to any real dataset.

Let's get started!

### Introducing the Hogwarts data

Our dataset is defined in a static Ruby hash named CLASS_SCORES. The full data can be seen on GitHub, but we reproduce a small extract below to illustrate the format:

    
```ruby
CLASS_SCORES = {
  "Harry Potter" => {
    "Alchemy" => nil,
    "Arithmancy" => nil,
    "Astronomy" => 6,
    "Care of Magical Creatures" => 8,
    "Charms" => 5,
    "Defence Against the Dark Arts" => 10,
    "Divination" => 4,
    "Herbology" => 7,
    "Muggle Studies" => nil,
    "Potions" => 2,
    "Study of Ancient Runes" => nil,
    "Transfiguration" => 7,
  },
  "Hermione Granger" => {
    "Alchemy" => 9,
    "Arithmancy" => 10,
    "Astronomy" => 9,
    "Care of Magical Creatures" => 9,
    "Charms" => 10,
    "Defence Against the Dark Arts" => 9,
    "Divination" => 1,
    "Herbology" => 10,
    "Muggle Studies" => 10,
    "Potions" => 3,
    "Study of Ancient Runes" => 9,
    "Transfiguration" => 10,
  },
  # …
```



The algorithm will use this Ruby hash, CLASS_SCORES, but it will be useful to consider the representation of the data in tabular form. An extract (three students only) looks like this:

| Class | Harry Potter | Hermione Granger | Ginny Weasley |
|---|---|---|---|
| Alchemy | – | 9 | 8 |
| Arithmancy | – | 10 | – |
| Astronomy | 6 | 9 | – |
| Care of Magical Creatures | 8 | 9 | 8 |
| Charms | 5 | 10 | 9 |
| Defence Against the Dark Arts | 10 | 9 | 9 |
| Divination | 4 | 1 | – |
| Herbology | 7 | 10 | 9 |
| Muggle Studies | – | 10 | 9 |
| Potions | 2 | 3 | 3 |
| Study of Ancient Runes | – | 9 | – |
| Transfiguration | 7 | 10 | 8 |

In this format each row of the table represents the scores in a particular class and each column represents the scores for a single student. So we can consider a single student to be represented by a vector (or array) of scores, e.g. Ginny Weasley can be represented by the vector [8, nil, nil, 8, 9, 9, nil, 9, 9, 3, nil, 8]. There are two points worth mentioning here:

1. The order is obviously important. The 3 in Ginny's array is at index 9 because it relates to her score in Potions class, so the index of a value is meaningful. As long as two student vectors use the same indexing then they can be directly compared.
2. There are nil values in the vector for any classes that the student does not take.
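For example, Ginny's vector can be read straight out of the hash (shown here with only her entry, for brevity):

```ruby
# Ginny's entry in CLASS_SCORES (reproduced from the full dataset)
CLASS_SCORES = {
  "Ginny Weasley" => {
    "Alchemy" => 8,
    "Arithmancy" => nil,
    "Astronomy" => nil,
    "Care of Magical Creatures" => 8,
    "Charms" => 9,
    "Defence Against the Dark Arts" => 9,
    "Divination" => nil,
    "Herbology" => 9,
    "Muggle Studies" => 9,
    "Potions" => 3,
    "Study of Ancient Runes" => nil,
    "Transfiguration" => 8,
  },
}

# Because every student's hash lists the classes in the same order,
# reading out the values yields a directly comparable vector
ginny = CLASS_SCORES["Ginny Weasley"].values
# => [8, nil, nil, 8, 9, 9, nil, 9, 9, 3, nil, 8]
```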

That 3 in Potions is really bugging Ginny, so she is considering switching this subject for one where she might achieve a better score. There are four subjects which she is not currently studying: Arithmancy, Astronomy, Divination and Study of Ancient Runes. Can we advise her which subject she would be best to switch to?

### Finding similar students

The first approach Ginny might attempt is to find another student with similar interests and see if they could recommend an alternative class to her. How do we find a similar student? Well, we can use the distance metrics discussed in our previous post to compare Ginny's vector to that of each of the other students; we can then identify the student who is closest to Ginny and take their recommendation.

To drive this process we will introduce a Recommender class which will be parameterized by the class_scores data structure and the distance_measure to use in calculations:

    
```ruby
require_relative './metrics'

class Recommender

  # Expose readers for the attributes used by the methods below
  attr_reader :class_scores, :distance_measure

  def initialize(class_scores = {}, distance_measure = Metrics.method(:euclidean_distance))
    @class_scores = class_scores
    @distance_measure = distance_measure
  end
```



The class_scores attribute is just the Hash of scores we introduced in the previous section. The distance_measure is a method (callable) that should take two equal-length vectors and return a distance between them. We can reuse many of the implementations discussed in our previous post, but the current class will default to using the euclidean_distance when no other measure is specified. I reproduce the implementation of this particular measure here, for convenience:



```ruby
def self.euclidean_distance(a, b)
  return unless a.any? && (a.size == b.size)    # We need two vectors of equal length to calculate this distance

  diff_squared = (0..a.size - 1).reduce(0) do |sum, i|
    sum + (a[i] - b[i])**2
  end
  Math.sqrt(diff_squared)
end
```
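Later in this post we will also pass in Metrics.method(:manhattan_distance). In case you don't have the previous post to hand, a sketch of a Manhattan distance in the same style (wrapped in its Metrics module so the snippet is self-contained) might look like:

```ruby
module Metrics
  def self.manhattan_distance(a, b)
    return unless a.any? && (a.size == b.size)    # Again, we need two equal-length vectors

    (0..a.size - 1).reduce(0) do |sum, i|
      sum + (a[i] - b[i]).abs    # Sum the absolute differences along each dimension
    end
  end
end

Metrics.manhattan_distance([0, 0], [3, 4])    # => 7
```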



With the Recommender class set up in this way we can now look at how we might find the other students most similar to Ginny.

The following code snippet from the Recommender class shows how we can implement a top_k_matches method. The method takes a single student name (the target student) and the number, k, of matches we want to retrieve. It returns the k other students who are deemed to resemble our target student most closely, according to the distance metric being considered. The implementation is below:

    
```ruby
# …

def top_k_matches(person, k = class_scores.size)
  class_scores.keys.inject({}) do |scores, other_person|
    next scores if other_person == person
    scores[other_person] = similarity(person, other_person)
    scores
  end.compact.sort_by do |_key, value|
    -1 * value
  end[0..k - 1].to_h
end

private

# The similarity score lies in the closed interval [0, 1].
# A score of 1 indicates that the items are identical.
# A score of 0 indicates that they are infinitely distant.
def similarity(person, other_person)
  1.0 / (1.0 + distance(person, other_person))
end

def distance(person, other)
  # Must only include class dimensions where both students have scores
  person_vec, other_vec = class_scores[person].values.each_with_index.map do |score, i|
    next unless score && class_scores[other].values[i]
    [score, class_scores[other].values[i]]
  end.compact.transpose

  # Return an arbitrarily large value if there are no overlapping scores,
  # effectively mimicking a distance of infinity (note that transposing an
  # empty array leaves both vectors as nil, hence the to_a guard)
  return 10_000 if [person_vec, other_vec].any? { |vec| vec.to_a.empty? }

  distance_measure.call(person_vec, other_vec)
end
end    # class Recommender
```



The top_k_matches method itself is pretty straightforward. It loops through each of the students in class_scores and calculates the similarity between that student and the target student, taking care not to compare the target student with themselves. The resulting similarities are ordered and we return the top k most similar.

    
```ruby
def top_k_matches(person, k = class_scores.size)
  class_scores.keys.inject({}) do |scores, other_person|
    next scores if other_person == person    # Don't compare person to themselves
    scores[other_person] = similarity(person, other_person)
    scores
  end.compact.sort_by do |_key, value|
    -1 * value    # Order by decreasing similarity score
  end[0..k - 1].to_h    # Extract the top k, and output as a hash
end
```



Perhaps more interesting are the manipulations we have to carry out to determine the similarity scores once we have the two student vectors. The distance measures introduced in the previous post generally require two equal-length, numeric arrays. In this respect, the presence of nil values in the arrays would make it impossible to calculate a distance.

What do the nil values represent again? A student's vector has a nil value at any index whose class they do not study. So if one student has a score for a particular subject and the other student does not study that subject, i.e. has a nil, then we cannot say much about the similarity, or difference, of the students based on that subject. As a result, let's omit the element at each index where either, or both, of our vectors has a nil value. This is what we achieve in the first part of the distance method:

    
```ruby
def distance(person, other)
  # Must only include class dimensions where both students have scores
  person_vec, other_vec = class_scores[person].values.each_with_index.map do |score, i|
    next unless score && class_scores[other].values[i]
    [score, class_scores[other].values[i]]
  end.compact.transpose

  # Return an arbitrarily large value if there are no overlapping scores,
  # effectively mimicking a distance of infinity (note that transposing an
  # empty array leaves both vectors as nil, hence the to_a guard)
  return 10_000 if [person_vec, other_vec].any? { |vec| vec.to_a.empty? }

  distance_measure.call(person_vec, other_vec)
end
```



The input vectors, person and other, are transformed into shorter vectors, person_vec and other_vec, where we have sliced out any index for which either student has a nil entry. We can then pass these shorter arrays into our established distance functions to determine the distance between the two students in this new mutual subspace.
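To make that slicing concrete, here is a toy example with two hypothetical four-element vectors; any index where either vector holds a nil is dropped before the distance calculation:

```ruby
a = [nil, 6, 8, 2]
b = [9, 9, nil, 3]

# Keep only the indices where BOTH vectors have a score, then transpose
# the surviving [a, b] pairs back into two equal-length, nil-free vectors
a_vec, b_vec = a.each_index.map { |i| [a[i], b[i]] if a[i] && b[i] }.compact.transpose
# a_vec => [6, 2]
# b_vec => [9, 3]
```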

One other complication we must consider is when the two students share no classes in common. In such cases person_vec and other_vec will end up empty, and we cannot determine a distance between two empty vectors. Since the students share no common classes, we will assume that they are very dissimilar, and so return an arbitrarily large distance (to approximate a distance of infinity).

So far we have determined the distance between our two students, but what we actually want is a similarity score. To convert our distance into a similarity we can use the following:

    
```ruby
def similarity(person, other_person)
  1.0 / (1.0 + distance(person, other_person))
end
```



Hopefully you can see that if the two students are identical then the distance should evaluate to zero, in which case the similarity will equal 1. By contrast, when the distance evaluates to a very large value then the similarity will tend towards 0. In this way, the similarity measure will range between 0 and 1.
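A couple of quick checks illustrate the two extremes:

```ruby
# The distance-to-similarity conversion, as a standalone lambda
similarity = ->(distance) { 1.0 / (1.0 + distance) }

similarity.call(0)         # identical students: distance 0 gives similarity 1.0
similarity.call(10_000)    # our "infinite" distance gives a similarity close to 0
```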

With the code described we can now calculate the top 3 most similar students to Ginny:

    
```ruby
require_relative './hogwarts_class_scores'
require_relative './recommender'

recommender = Recommender.new(CLASS_SCORES)
puts recommender.top_k_matches("Ginny Weasley", 3)
# {"Ron Weasley"=>0.2612038749637414, "Hermione Granger"=>0.25, "Cho Chang"=>0.18660549686337075}
```



OK cool! So it looks like her brother Ron is the best match. Ginny could ask Ron for his advice, and he will likely recommend that she study Divination. But would you trust Ron's advice?

### Recommending classes

The approach in the previous section might work fairly well to help Ginny choose which class to switch to. But because she is asking only one other student, there is the possibility that the student she chooses (Ron, in this case) has very peculiar tastes, and his recommendation ends up not being a great fit.

What is more, Ron himself does not study Arithmancy or Ancient Runes, so he cannot make an informed recommendation about subjects he has not experienced himself.

To overcome these shortcomings we need a new approach. Rather than taking the recommendation from a single student, we could instead take recommendations from many (or all) students who have studied a subject and calculate an average score for that subject. It would also be sensible to give greater weight to the recommendations of students who are more similar to Ginny, so we weight each student's contribution by their similarity to Ginny. The Ruby implementation for this calculation is shown below:

    
```ruby
def get_recommendations(person)
  class_scores[person].inject({}) do |memo, (class_name, score)|
    next memo if score    # We only want to recommend classes for which we have no score registered
    total_similarity = 0
    memo[class_name] = class_scores.keys.inject(0) do |avg, other_person|
      next avg unless class_scores[other_person][class_name]    # We are only interested in students who have a score for this class
      similarity = similarity(person, other_person)    # The similarity between students will be the weighting factor
      total_similarity += similarity    # Keep a running total of similarity to divide by at the end
      avg + class_scores[other_person][class_name] * similarity    # Tally the weighted contributions
    end
    memo[class_name] = memo[class_name] / total_similarity    # Divide by the sum of the weights
    memo
  end
end
```



We loop over the class scores for the target student, but we only concern ourselves with the classes for which the student has no score. For each of these classes we loop over all the other students who have registered a score in that class, and calculate a weighted average of their scores for the class, where the weighting factor is the similarity to the target student. That's it.
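As a concrete, hypothetical example: suppose two other students scored 10 and 6 in a class Ginny has not taken, with similarities to Ginny of 0.25 and 0.5 respectively. The predicted score would then be:

```ruby
# Hypothetical class scores and similarities, for illustration only
scores       = { "Student A" => 10,   "Student B" => 6 }
similarities = { "Student A" => 0.25, "Student B" => 0.5 }

weighted_sum     = scores.sum { |student, score| score * similarities[student] }
total_similarity = similarities.values.sum
predicted_score  = weighted_sum / total_similarity
# (10 * 0.25 + 6 * 0.5) / 0.75 => 7.33 (to 2 d.p.)
```

Note how the more similar student (weight 0.5) pulls the prediction closer to their own score of 6 than a simple average of 8 would be.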

Let's see what recommendations we now generate for Ginny using this weighted average:

    
```ruby
require_relative './hogwarts_class_scores'
require_relative './recommender'
require_relative './metrics'

recommender = Recommender.new(CLASS_SCORES, Metrics.method(:euclidean_distance))
puts recommender.get_recommendations("Ginny Weasley")
# {"Arithmancy"=>7.492996951956632, "Astronomy"=>7.566161526023427, "Divination"=>4.168756010451042, "Study of Ancient Runes"=>8.094049420209119}
```



Based on this model, Ginny would be best advised to switch to Study of Ancient Runes, as we estimate that she will achieve a score of about 8.09.

Perhaps Euclidean distance is not the best distance measure in this higher-dimensional space (see the last reference below, for example)? Not to fear, our algorithm should work equally well with other distance measures. Let's consider the same calculation based on the manhattan_distance:

    
```ruby
recommender = Recommender.new(CLASS_SCORES, Metrics.method(:manhattan_distance))
puts recommender.get_recommendations("Ginny Weasley")
# {"Arithmancy"=>7.653125000000001, "Astronomy"=>7.508417508417509, "Divination"=>4.235938089845225, "Study of Ancient Runes"=>8.184615384615384}
```



So we see slightly different numbers in our predicted scores, but the same conclusion: the best option for Ginny is to switch to Study of Ancient Runes.

## Conclusion

That was our simple recommendation system in Ruby; these algorithms can be used to generate recommendations from any user-item dataset. In this case, our recommendation system is driven by the idea of similar users, and the technique is known as user-based filtering. Perhaps you can appreciate that we could easily invert our dataset and, instead, consider our items (in our case, the classes) to be represented by vectors of individual student scores. We could then proceed to look for similar classes rather than similar students, and use these similar classes to generate our recommendations. Such an approach is known as item-based filtering, and in some cases this approach will make more sense.
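A sketch of that inversion, using a tiny two-student, two-class extract of the data for illustration:

```ruby
CLASS_SCORES = {
  "Harry Potter"     => { "Charms" => 5,  "Potions" => 2 },
  "Hermione Granger" => { "Charms" => 10, "Potions" => 3 },
}

# Flip student => class scores into class => student scores,
# so each class becomes a vector of student scores
inverted = Hash.new { |hash, key| hash[key] = {} }
CLASS_SCORES.each do |student, classes|
  classes.each { |class_name, score| inverted[class_name][student] = score }
end
# inverted => {"Charms"=>{"Harry Potter"=>5, "Hermione Granger"=>10},
#              "Potions"=>{"Harry Potter"=>2, "Hermione Granger"=>3}}
```

The same Recommender class could then be instantiated with this inverted hash to find similar classes rather than similar students.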

There are many shortcomings in the model we have presented. Perhaps the most significant is the need to calculate the distance between our target user and every other user in our database. As the number of users increases this may become impractical. Also, if we have a completely new user with few or no scores, we cannot make any meaningful recommendations.

Finally our dataset is relatively dense, in that each student studies a pretty large subset of available classes. In such dense datasets it is possible to calculate meaningful user distances. But imagine a context like movie recommendations, where each user has rated only a very small subset of the available items. In such cases our dataset is sparse, and the idea of distance between our user vectors becomes a little more suspect.

There are various other recommendation models which address these different problems, but further exploration of these algorithms will have to wait for a future post.

I hope you enjoyed this article and found it useful. If you would like to be kept up-to-date when we publish new articles please subscribe.

## References

1. Programming Collective Intelligence by Toby Segaran, available on Amazon
2. Previous article on implementing distance measures in ruby
3. GitHub repo with the algorithm implementation as discussed in this post
4. Blog post ranking Hogwarts classes
5. Blog post discussing when Euclidean distance is not a good fit
