I was writing some code for Adonomics when I came across the need to select a random, weighted element from an array. I looked around the web but all the solutions I found were cumbersome or involved creating an auxiliary array whose size was proportional to the sum of the weights. Unacceptable!

The Math

So, of course, I rolled my own. Here's the idea:

Given an array of values result.png and and array of weights result-1.png such that result-2.png write a function which takes as its input W and produces the ith key with probability wi.

That is, define a function f such that result-3.png

I'm going to call our little function w_rand, which looks like PHPese to me.

As an example we might have result-4.png and result-5.png

w_rand(W) should return 0 (the first index) with a 15% probability, 1 with a 35% probability, 2 with a 20% probability, etc.

The Code

Let's get down to bussiness!

function w_rand($weights) {
    $r = mt_rand(1,1000);
    $offset = 0;
    foreach ($weights as $k => $w) {
        $offset += $w*1000;
        if ($r <= $offset) {
            return $k;
        }
    }
}

I like this function because it only makes one random call and doesn't require any additional memory beyond what the array of weights takes.

The Details

How does this function work? Many other such functions work by creating an auxiliary array and filling it with the same elements in proportion to their probability.

For example, if you had an array A = [a, b] and weights W = [0.30, 0.70], it would create an auxiliary array [a, a, a, b, b, b, b, b, b, b]. You then select an element randomly from this array, and that gets you what you're looking for.

The downside to this is that this requires both more processing power. You also shouldn't need to care about the values — what if the values are huge strings?

Instead, imagine a number line from 1 to 100. We pick a random number in that range. Using the example W = [0.30, 0.70], we'd want to return 0 if our random number is between 1-30 and 1 if our random number is between 31-100.

See? No extra memory required. The above case, with just two weights, is easy enough to do with a single if-else statement, but what about the case where you have three, four, or more weights?

Take W = [0.30, 0.20, 0.50]. We pick a random integer from 1-100. If the number is between 1-30 we want to return 0, if it between 31-50 then return 1, and if its between 51-100 return 2.

It's easy to make this into a loop: keep a running total T of the weights so far and return the first key such that N ≤ T.

Why does this work? Slice up the number line, 1-100, according to the weights. If our random number N is in the first slice return that key. If our random number is in the first or second slice (i.e., is less than the sum of the first and second weights) then it must be in the second slice since we already know it's not in the first. Continue this way until you reach the end.

Ta da! I hope someone finds this useful.

Download

Download the w_rand test script.

7 Comments

  1. Chris February 7th, 2008 / 4:26 pm

    I’m glad you’re still blogging. It’s been a while since your previous post. I’ve enjoyed reading your posts on programming (especially the one on MySQL improvements). Thanks for writing these useful posts.

  2. Jesse February 7th, 2008 / 4:33 pm

    Chris,

    Yeah, I’m trying to get back into it. Ever since I sold Adonomics I’ve been working more than full-time on it. Between that and a desire for an active social life I don’t have much time for blogging.

    I’m going to try to write one article at least once every other week, though.

  3. Mike February 7th, 2008 / 6:13 pm

    Great post. Now I’m trying to remember what I learned in my courses on probability in college… seems like there is a probability distribution for this, or something like that… what’s the difference between a PDF and a CDF again? In any case, I think your method is essentially the same thing a stats class would have you do. :P

  4. Jesse February 7th, 2008 / 6:18 pm

    Mike,

    X = f(W) is the random variable in question, if you want to make it even more formal. The specific properties of X, e.g., its CDF, are determined entirely by W.

  5. Random Weighted Elements in PHP | David Bisset: Web Designer, Coder, Wordpress Guru February 7th, 2008 / 7:53 pm

    […] horrible in math. So I hope I never have to figure out how to select a random, weighted element from an array in PHP. But if I do, I now have something to refer to! Thanks Jesse! Tags: […]

  6. Jeremy April 26th, 2008 / 4:59 pm

    Brilliant, and superbly clean code and explanation!

  7. Venkat Koduru May 11th, 2008 / 6:18 am

    Hmmm. I like your function, but here’s what I’m looking to do…

    I want to generate a random number from 0 to 6000. I want their to be the highest probability (40% chance) of getting between 100 and 2000, inclusive. I want the probability of getting a number from 99 to 0, to gradually decrease, so that it would be almost impossible to get 0. Just to be clear the probability of getting 98 should be less than that of 99, the probability of getting 97 should be less than the probability of getting 98, etc. Similarly, I want the probability of getting a number from 2001 to 6000 (again including 2001 and 6000) to gradually decrease, so that it’s almost impossible to get 6000.

    Anyone know how I could do this?

Leave a Reply