Many social networks, including Facebook and LinkedIn, have suggest-a-friend features that prompt users with a list of likely friends. This is a great feature because it allows these sites to notify current users when one of their (likely) friends joins the site.

It's also good for the social networks because it allows them to improve their copy of the social graph. But how do they do it?

The Problem

Let's state the problem precisely, first. You have a graph, G, which is a representation of relationships that exist in the real world.

Recall that a graph is a collection of nodes and edges between those nodes. For social networks these nodes represent people and the edges represent relationships, e.g., "friends with," "married to," etc.

In Facebook's case, people are friends independent of their status on Facebook. This means that in Facebook's copy of the social graph is inherently inaccurate. If Tim and Jane are friends in reality but not on Facebook that means Facebook's copy of the social graph is missing an edge between the nodes representing Tim and Jane

So, predicting friends on a social network is the same as predicting likely but missing edges on the abstract graph representing that social network.

The Solution

The solution should come in the form an ordered list of likely edges, with the most likely edge at top. For Facebook this means that for each person you should get a list of people most likely to be their friends.

Mathematically this means that for each pair of nodes, X and Y, you should have a function f(X,Y) that gives you the likelihood that there's a missing edge between those two nodes.

Here are give ways of constructing such a function.

Notation

Warning: there's some math ahead. I'm going to designate users by capital letters A, B, C, etc. The set of all people who are friends with a person will be designated by F(A), F(B), F(C), etc. That means, for example, |F(A)| is the number of friends of A.

Common Neighbors / Mutual Friends

The common neighbors, or mutual friends, metric is almost certainly the way Facebook produces their suggest-a-friend list since Facebook already calculates the number of mutual friends and displays it right on people's profiles.

Let's say we have three Facebook users, Alex, Kim, and James, none of whom are friends on the site. If Alex and Kim have 50 friends in common on Facebook but Alex and James have only 10 then this metric says that Alex and Kim are more likely to be friends than Alex and James.

Mathematically the common neighbor metric can be expressed as

The downside to this metric is that it favors people with a lot of friends. For example, I've never once met Robert Scoble, but I work in the tech world and he befriends everyone he comes in contact with. It's likely we'll have lots of mutual friends just by virtue of the fact that he has lots of friends in general.

Shortest Path

There is no one "shortest path" metric. Rather, it's a family of metrics.

The idea is to use degrees of separation to measure affinity. People who are fewer degrees of separation apart are more likely to be friend. There are lots of ways to parlay this into a metric.

One way would be to measure the number of "short" paths between two users, A and B, where "short" can be defined however we like. The common neighbors metric is a special case of this, measuring the number of length-one paths between two users. But we could just as easily measure paths with length no greater than two, for example.

Another metric is to measure average path lengths. Take two users, A and B, and take all the shortest paths from A to B. The larger the average shortest path length the less likely the two people are friends.

Degree Product

The degree product is the most simple and also the most naive metric. The idea behind it is that if a person has a lot of friends in general they're more likely to be friends with any given person.

The degree product of two people, A and B, is the product of the number of their friends. If A has 10 friends and B has 20 then their degree product is 200.

Mathematically it can be expressed as

The downside to this is that there's no locality. Robert Scoble is just as likely to be friends with me as he is with any other person having the same number of friends that I do.

On the other hand, because there's no locality, it's easy to compute and doesn't require understanding much about the graph itself.

Jaccard Coefficient

The Jaccard coefficient or Jaccard index is similar to the common neighbors metric except it takes into account the fact that the latter is more likely to suggest people with lots of friends in general.

It does this by normalizing the common neighbors metric by the number of friends two people have altogether. The larger the coefficient the more likely two people are to be friends.

Mathematically the Jaccard coefficient can be expressed as

As a concrete example consider our three Facebook users, Alex, Kim, and James, again. Let's say Alex has 200 friends, Kim has 1,000, and James has 100. Furthermore, let's say that Kim has 70 friends in common with Alex and James has only 30.

Friends w/ Alex CN score Jaccard Coeff.
Kim 1000 70 70 0.06
James 100 30 30 0.11

As you can see, even though Alex and Kim have more friends in common, the Jacard coefficient between Alex and James is larger.

Other Methods

Predicting missing links in networks goes beyond the suggest-a-friend feature. It can be used to predict connections in terrorist networks, likely predator-prey relationships in an ecosystem, and even the relationships between metabolic pathways in a complex organism.

For example, team at the Santa Fe institute recently published a paper in nature called "Hierarchical structure and the prediction of missing links in networks." You can read about it and download the associated code at Aaron Clauset's homepage.

Their method has the downsize of being very computationally intensive but is also much more accurate at predicting missing links than the other methods I've outlined above. Explaining it would be an article in itself, so I'll just leave the links up for now.

Drop me a line if you're interested in such an article or if you want me to forward you the original Nature article.

6 Comments

  1. Matt Rubens May 27th, 2008 / 2:24 pm

    Here’s the problem: this algorithm only predicts if someone is in your social graph — it doesn’t predict whether you like them or not. In fact, I would argue that past a certain point, having a lot of mutual friends negatively correlates with the probability that I want to friend someone.

    I’ve joked that Facebook should rename their feature “Ex-girlfriends and people you hate”…

  2. Jesse May 27th, 2008 / 2:26 pm

    This is definitely true, although sometime in the last few weeks Facebook added a feedback feature so you can banish people from your suggest-a-friend list.

    I’d say I banish about 90% of the people who appear.

  3. SG May 27th, 2008 / 2:33 pm

    Networks like Facebook have more data than just the ‘friend of’ edges. They could use the interests, favorite movie/books, groups etc to better predict if two people not yet directly friends have enough in common to ‘like’ each other.

  4. Matt Rubens May 27th, 2008 / 2:34 pm

    Yeah, me too. And there’s something cathartic about the banishing…

    It’s interesting to think about how you could improve the algorithms to take like/dislike into account though. For example, if both the other person and I have been on the site for a while and have large/stable friend graphs, we probably don’t like each other. But if either of us are new to the site, it’s probably worth making the suggestion.

  5. Jesse May 27th, 2008 / 2:41 pm

    SG,

    I don’t see how interests listed on Facebook have any bearing with whether or not I’m friends with them.

    Besides, improvements in the algorithm would come from less locality, not more. As in the HRG approach you’d want to start learning about relationships between cohorts and components, not between individual people.

    Facebook’s feature is “good enough,” anyhow. The main goal, IMO, isn’t to improve their copy of the social graph. Rather, it’s to (1) improve the experience for people just joining the site as they add their first seed friends and (2) reengage inactive users by emailing people when potential friends join the site.

    Also, remember, more data usually beats better algorithms.

  6. name August 21st, 2008 / 7:09 pm

    4B64Gb Hello!,

Leave a Reply