Successfully recommending relevant content to web users is critical to business success for many web properties. A variety of techniques are used, with the largest sites and companies employing highly sophisticated technology to optimize their recommendations. In this blog post we discuss how to build an effective recommendation system from first principles in Ruby.

## Introduction

We adapt these algorithms, formulate them in Ruby and apply them to a whimsical dataset containing class scores for different students at Hogwarts School of Witchcraft and Wizardry. We use the data to generate predicted scores for the classes which each student has not taken. Whilst the dataset is whimsical, the algorithms presented are sound, and could legitimately be applied to any real dataset.

Let's get started!

### Introducing the Hogwarts data

Our dataset is defined in a static Ruby hash named CLASS_SCORES. The full data can be seen on GitHub, but we reproduce a small extract below to illustrate the format:

    
```ruby
CLASS_SCORES = {
  "Harry Potter" => {
    "Alchemy" => nil,
    "Arithmancy" => nil,
    "Astronomy" => 6,
    "Care of Magical Creatures" => 8,
    "Charms" => 5,
    "Defence Against the Dark Arts" => 10,
    "Divination" => 4,
    "Herbology" => 7,
    "Muggle Studies" => nil,
    "Potions" => 2,
    "Study of Ancient Runes" => nil,
    "Transfiguration" => 7,
  },
  "Hermione Granger" => {
    "Alchemy" => 9,
    "Arithmancy" => 10,
    "Astronomy" => 9,
    "Care of Magical Creatures" => 9,
    "Charms" => 10,
    "Defence Against the Dark Arts" => 9,
    "Divination" => 1,
    "Herbology" => 10,
    "Muggle Studies" => 10,
    "Potions" => 3,
    "Study of Ancient Runes" => 9,
    "Transfiguration" => 10,
  },
  # …
```



The algorithm will use this Ruby hash, CLASS_SCORES, but it will be useful to consider the representation of the data in tabular form. An extract (three students only) looks like this:

| Class | Harry Potter | Hermione Granger | Ginny Weasley |
|---|---|---|---|
| Alchemy | – | 9 | 8 |
| Arithmancy | – | 10 | – |
| Astronomy | 6 | 9 | – |
| Care of Magical Creatures | 8 | 9 | 8 |
| Charms | 5 | 10 | 9 |
| Defence Against the Dark Arts | 10 | 9 | 9 |
| Divination | 4 | 1 | – |
| Herbology | 7 | 10 | 9 |
| Muggle Studies | – | 10 | 9 |
| Potions | 2 | 3 | 3 |
| Study of Ancient Runes | – | 9 | – |
| Transfiguration | 7 | 10 | 8 |

In this format each row of the table represents the scores in a particular class and each column represents the scores for a single student. So we can consider a single student to be represented by a vector (or array) of scores, e.g. Ginny Weasley can be represented by the vector [8, nil, nil, 8, 9, 9, nil, 9, 9, 3, nil, 8]. There are two points worth mentioning here:

1. The order is obviously important. The 3 in Ginny's array is at index 9 because it relates to her score in Potions class, so the index of a value is meaningful. As long as two student vectors use the same indexing then they can be directly compared.
2. There are nil values in the vector for any classes that the student does not take.
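For example, Ginny's vector can be read straight out of the hash (shown here with only her entry, for brevity):

```ruby
# Ginny's entry in CLASS_SCORES (reproduced from the full dataset)
CLASS_SCORES = {
  "Ginny Weasley" => {
    "Alchemy" => 8,
    "Arithmancy" => nil,
    "Astronomy" => nil,
    "Care of Magical Creatures" => 8,
    "Charms" => 9,
    "Defence Against the Dark Arts" => 9,
    "Divination" => nil,
    "Herbology" => 9,
    "Muggle Studies" => 9,
    "Potions" => 3,
    "Study of Ancient Runes" => nil,
    "Transfiguration" => 8,
  },
}

# Because every student's hash lists the classes in the same order,
# reading out the values yields a directly comparable vector
ginny = CLASS_SCORES["Ginny Weasley"].values
# => [8, nil, nil, 8, 9, 9, nil, 9, 9, 3, nil, 8]
```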

That 3 in Potions is really bugging Ginny, so she is considering switching this subject for one where she might achieve a better score. There are four subjects which she is not currently studying: Arithmancy, Astronomy, Divination and Study of Ancient Runes. Can we advise her which subject she would be best to switch to?

### Finding similar students

The first approach Ginny might attempt is to find another student with similar interests and see if they could recommend an alternative class to her. How do we find a similar student? Well, we can use the distance metrics discussed in our previous post to compare Ginny's vector to that of each of the other students; we can then identify the student who is closest to Ginny and take their recommendation.

To drive this process we will introduce a Recommender class which will be parameterized by the class_scores data structure and the distance_measure to use in calculations:

    
```ruby
require_relative './metrics'

class Recommender

  # Expose readers for the attributes used by the methods below
  attr_reader :class_scores, :distance_measure

  def initialize(class_scores = {}, distance_measure = Metrics.method(:euclidean_distance))
    @class_scores = class_scores
    @distance_measure = distance_measure
  end
```



The class_scores attribute is just the Hash of scores we introduced in the previous section. The distance_measure is a method (callable) that should take two equal-length vectors and return a distance between them. We can reuse many of the implementations discussed in our previous post, but the current class will default to using the euclidean_distance when no other measure is specified. I reproduce the implementation of this particular measure here, for convenience:



```ruby
def self.euclidean_distance(a, b)
  return unless a.any? && (a.size == b.size)    # We need two vectors of equal length to calculate this distance

  diff_squared = (0..a.size - 1).reduce(0) do |sum, i|
    sum + (a[i] - b[i])**2
  end
  Math.sqrt(diff_squared)
end
```
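Later in this post we will also pass in Metrics.method(:manhattan_distance). In case you don't have the previous post to hand, a sketch of a Manhattan distance in the same style (wrapped in its Metrics module so the snippet is self-contained) might look like:

```ruby
module Metrics
  def self.manhattan_distance(a, b)
    return unless a.any? && (a.size == b.size)    # Again, we need two equal-length vectors

    (0..a.size - 1).reduce(0) do |sum, i|
      sum + (a[i] - b[i]).abs    # Sum the absolute differences along each dimension
    end
  end
end

Metrics.manhattan_distance([0, 0], [3, 4])    # => 7
```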



With the Recommender class set up in this way we can now look at how we might find the other students most similar to Ginny.

The following code snippet from the Recommender class shows how we can implement a top_k_matches method. The method takes a single student name (the target student) and the number, k, of matches we want to retrieve. It returns the k other students who are deemed to resemble our target student most closely, according to the distance metric being considered. The implementation is below:

    
```ruby
# …

def top_k_matches(person, k = class_scores.size)
  class_scores.keys.inject({}) do |scores, other_person|
    next scores if other_person == person
    scores[other_person] = similarity(person, other_person)
    scores
  end.compact.sort_by do |_key, value|
    -1 * value
  end[0..k - 1].to_h
end

private

# The similarity score lies in the closed interval [0, 1].
# A score of 1 indicates that the items are identical.
# A score of 0 indicates that they are infinitely distant.
def similarity(person, other_person)
  1.0 / (1.0 + distance(person, other_person))
end

def distance(person, other)
  # Must only include class dimensions where both students have scores
  person_vec, other_vec = class_scores[person].values.each_with_index.map do |score, i|
    next unless score && class_scores[other].values[i]
    [score, class_scores[other].values[i]]
  end.compact.transpose

  # Return an arbitrarily large value if there are no overlapping scores,
  # effectively mimicking a distance of infinity (note that transposing an
  # empty array leaves both vectors as nil, hence the to_a guard)
  return 10_000 if [person_vec, other_vec].any? { |vec| vec.to_a.empty? }

  distance_measure.call(person_vec, other_vec)
end
end    # class Recommender
```



The top_k_matches method itself is pretty straightforward. It loops through each of the students in class_scores and calculates the similarity between that student and the target student, taking care not to compare the target student with themselves. The resulting similarities are ordered and we return the top k most similar.

    
```ruby
def top_k_matches(person, k = class_scores.size)
  class_scores.keys.inject({}) do |scores, other_person|
    next scores if other_person == person    # Don't compare person to themselves
    scores[other_person] = similarity(person, other_person)
    scores
  end.compact.sort_by do |_key, value|
    -1 * value    # Order by decreasing similarity score
  end[0..k - 1].to_h    # Extract the top k, and output as a hash
end
```



Perhaps more interesting are the manipulations we have to carry out to determine the similarity scores once we have the two student vectors. The distance measures introduced in the previous post generally require two equal-length, numeric arrays. In this respect, the presence of nil values in the arrays would make it impossible to calculate a distance.

What do the nil values represent again? A student's vector has a nil value at any index whose class they do not study. So if one student has a score for a particular subject and the other student does not study that subject, i.e. has a nil, then we cannot say much about the similarity, or difference, of the students based on that subject. As a result, let's omit the element at each index where either, or both, of our vectors has a nil value. This is what we achieve in the first part of the distance method:

    
```ruby
def distance(person, other)
  # Must only include class dimensions where both students have scores
  person_vec, other_vec = class_scores[person].values.each_with_index.map do |score, i|
    next unless score && class_scores[other].values[i]
    [score, class_scores[other].values[i]]
  end.compact.transpose

  # Return an arbitrarily large value if there are no overlapping scores,
  # effectively mimicking a distance of infinity (note that transposing an
  # empty array leaves both vectors as nil, hence the to_a guard)
  return 10_000 if [person_vec, other_vec].any? { |vec| vec.to_a.empty? }

  distance_measure.call(person_vec, other_vec)
end
```



The input vectors, person and other, are transformed into shorter vectors, person_vec and other_vec, where we have sliced out any index for which either student has a nil entry. We can then pass these shorter arrays into our established distance functions to determine the distance between the two students in this new mutual subspace.
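To make that slicing concrete, here is a toy example with two hypothetical four-element vectors; any index where either vector holds a nil is dropped before the distance calculation:

```ruby
a = [nil, 6, 8, 2]
b = [9, 9, nil, 3]

# Keep only the indices where BOTH vectors have a score, then transpose
# the surviving [a, b] pairs back into two equal-length, nil-free vectors
a_vec, b_vec = a.each_index.map { |i| [a[i], b[i]] if a[i] && b[i] }.compact.transpose
# a_vec => [6, 2]
# b_vec => [9, 3]
```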

One other complication we must consider is when the two students share no classes in common. In such cases person_vec and other_vec will end up empty, and we cannot determine a distance between two empty vectors. Since the students share no common classes, we will assume that they are very dissimilar, and so return an arbitrarily large distance (to approximate a distance of infinity).

So far we have determined the distance between our two students, but what we actually want is a similarity score. To convert our distance into a similarity we can use the following:

    
```ruby
def similarity(person, other_person)
  1.0 / (1.0 + distance(person, other_person))
end
```



Hopefully you can see that if the two students are identical then the distance should evaluate to zero, in which case the similarity will equal 1. By contrast, when the distance evaluates to a very large value then the similarity will tend towards 0. In this way, the similarity measure will range between 0 and 1.
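A couple of quick checks illustrate the two extremes:

```ruby
# The distance-to-similarity conversion, as a standalone lambda
similarity = ->(distance) { 1.0 / (1.0 + distance) }

similarity.call(0)         # identical students: distance 0 gives similarity 1.0
similarity.call(10_000)    # our "infinite" distance gives a similarity close to 0
```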

With the code described we can now calculate the top 3 most similar students to Ginny:

    
```ruby
require_relative './hogwarts_class_scores'
require_relative './recommender'

recommender = Recommender.new(CLASS_SCORES)
puts recommender.top_k_matches("Ginny Weasley", 3)
# {"Ron Weasley"=>0.2612038749637414, "Hermione Granger"=>0.25, "Cho Chang"=>0.18660549686337075}
```



OK cool! So it looks like her brother Ron is the best match. Ginny could ask Ron for his advice, and he will likely recommend that she study Divination. But would you trust Ron's advice?

### Recommending classes

The approach in the previous section might work fairly well to help Ginny choose which class to switch to. But because she is asking only one other student, there is the possibility that the student she chooses (Ron, in this case) has very peculiar tastes, and his recommendation ends up not being a great fit.

What is more, Ron himself does not study Arithmancy or Ancient Runes, so he cannot make an informed recommendation about subjects he has not experienced himself.

To overcome these shortcomings we need a new approach. Rather than taking the recommendation from a single student, we could instead take recommendations from many (or all) students who have studied a subject and calculate an average score for that subject. It would also be sensible to give greater weight to the recommendations of students who are more similar to Ginny, so we weight each student's contribution by their similarity to Ginny. The Ruby implementation for this calculation is shown below:

    
```ruby
def get_recommendations(person)
  class_scores[person].inject({}) do |memo, (class_name, score)|
    next memo if score    # We only want to recommend classes for which we have no score registered
    total_similarity = 0
    memo[class_name] = class_scores.keys.inject(0) do |avg, other_person|
      next avg unless class_scores[other_person][class_name]    # We are only interested in students who have a score for this class
      similarity = similarity(person, other_person)    # The similarity between students will be the weighting factor
      total_similarity += similarity    # Keep a running total of similarity to divide by at the end
      avg + class_scores[other_person][class_name] * similarity    # Tally the weighted contributions
    end
    memo[class_name] = memo[class_name] / total_similarity    # Divide by the sum of the weights
    memo
  end
end
```



We loop over the class scores for the target student, but we only concern ourselves with the classes for which the student has no score. For each of these classes we loop over all the other students who have registered a score in that class, and calculate a weighted average of their scores for the class, where the weighting factor is the similarity to the target student. That's it.
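As a concrete, hypothetical example: suppose two other students scored 10 and 6 in a class Ginny has not taken, with similarities to Ginny of 0.25 and 0.5 respectively. The predicted score would then be:

```ruby
# Hypothetical class scores and similarities, for illustration only
scores       = { "Student A" => 10,   "Student B" => 6 }
similarities = { "Student A" => 0.25, "Student B" => 0.5 }

weighted_sum     = scores.sum { |student, score| score * similarities[student] }
total_similarity = similarities.values.sum
predicted_score  = weighted_sum / total_similarity
# (10 * 0.25 + 6 * 0.5) / 0.75 => 7.33 (to 2 d.p.)
```

Note how the more similar student (weight 0.5) pulls the prediction closer to their own score of 6 than a simple average of 8 would be.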

Let's see what recommendations we now generate for Ginny using this weighted average:

    
```ruby
require_relative './hogwarts_class_scores'
require_relative './recommender'
require_relative './metrics'

recommender = Recommender.new(CLASS_SCORES, Metrics.method(:euclidean_distance))
puts recommender.get_recommendations("Ginny Weasley")
# {"Arithmancy"=>7.492996951956632, "Astronomy"=>7.566161526023427, "Divination"=>4.168756010451042, "Study of Ancient Runes"=>8.094049420209119}
```



Based on this model, Ginny would be best advised to switch to Study of Ancient Runes, as we estimate that she will achieve a score of about 8.09.

Perhaps Euclidean distance is not the best distance measure in this higher-dimensional space (see the last reference below, for example)? Not to fear, our algorithm should work equally well with other distance measures. Let's consider the same calculation based on the manhattan_distance:

    
```ruby
recommender = Recommender.new(CLASS_SCORES, Metrics.method(:manhattan_distance))
puts recommender.get_recommendations("Ginny Weasley")
# {"Arithmancy"=>7.653125000000001, "Astronomy"=>7.508417508417509, "Divination"=>4.235938089845225, "Study of Ancient Runes"=>8.184615384615384}
```



So we see slightly different numbers in our predicted scores, but the same conclusion: the best option for Ginny is to switch to Study of Ancient Runes.

## Conclusion

That was our simple recommendation system in Ruby; these algorithms can be used to generate recommendations from any user-item dataset. In this case, our recommendation system is driven by the idea of similar users, and the technique is known as user-based filtering. Perhaps you can appreciate that we could easily invert our dataset and, instead, consider our items (in our case, the classes) to be represented by vectors of individual student scores. We could then proceed to look for similar classes rather than similar students, and use these similar classes to generate our recommendations. Such an approach is known as item-based filtering, and in some cases this approach will make more sense.
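A sketch of that inversion, using a tiny two-student, two-class extract of the data for illustration:

```ruby
CLASS_SCORES = {
  "Harry Potter"     => { "Charms" => 5,  "Potions" => 2 },
  "Hermione Granger" => { "Charms" => 10, "Potions" => 3 },
}

# Flip student => class scores into class => student scores,
# so each class becomes a vector of student scores
inverted = Hash.new { |hash, key| hash[key] = {} }
CLASS_SCORES.each do |student, classes|
  classes.each { |class_name, score| inverted[class_name][student] = score }
end
# inverted => {"Charms"=>{"Harry Potter"=>5, "Hermione Granger"=>10},
#              "Potions"=>{"Harry Potter"=>2, "Hermione Granger"=>3}}
```

The same Recommender class could then be instantiated with this inverted hash to find similar classes rather than similar students.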

There are many shortcomings in the model we have presented. Perhaps the most significant is the need to calculate the distance between our target user and every other user in our database. As the number of users increases this may become impractical. Also, if we have a completely new user with few or no scores, we cannot make any meaningful recommendations.

Finally our dataset is relatively dense, in that each student studies a pretty large subset of available classes. In such dense datasets it is possible to calculate meaningful user distances. But imagine a context like movie recommendations, where each user has rated only a very small subset of the available items. In such cases our dataset is sparse, and the idea of distance between our user vectors becomes a little more suspect.

There are various other recommendation models which address these different problems, but further exploration of these algorithms will have to wait for a future post.

I hope you enjoyed this article and found it useful. If you would like to be kept up-to-date when we publish new articles please subscribe.

## References

1. Programming Collective Intelligence by Toby Segaran, available on Amazon
2. Previous article on implementing distance measures in ruby
3. GitHub repo with the algorithm implementation as discussed in this post
4. Blog post ranking Hogwarts classes
5. Blog post discussing when Euclidean distance is not a good fit
