February 7th, 2008

Random Weighted Elements in PHP

I was writing some code for Adonomics when I came across the need to select a random, weighted element from an array. I looked around the web but all the solutions I found were cumbersome or involved creating an auxiliary array whose size was proportional to the sum of the weights. Unacceptable!

The Math

So, of course, I rolled my own. Here’s the idea:

Given an array of values result.png and and array of weights result-1.png such that result-2.png write a function which takes as its input W and produces the ith key with probability wi.

That is, define a function f such that result-3.png

I’m going to call our little function w_rand, which looks like PHPese to me.

As an example we might have result-4.png and result-5.png

w_rand(W) should return 0 (the first index) with a 15% probability, 1 with a 35% probability, 2 with a 20% probability, etc.

The Code

Let’s get down to bussiness!

function w_rand($weights) {
    $r = mt_rand(1,1000);
    $offset = 0;
    foreach ($weights as $k => $w) {
        $offset += $w*1000;
        if ($r <= $offset) {
            return $k;
        }
    }
}

I like this function because it only makes one random call and doesn’t require any additional memory beyond what the array of weights takes.

The Details

How does this function work? Many other such functions work by creating an auxiliary array and filling it with the same elements in proportion to their probability.

For example, if you had an array A = [a, b] and weights W = [0.30, 0.70], it would create an auxiliary array [a, a, a, b, b, b, b, b, b, b]. You then select an element randomly from this array, and that gets you what you’re looking for.

The downside to this is that this requires both more processing power. You also shouldn’t need to care about the values — what if the values are huge strings?

Instead, imagine a number line from 1 to 100. We pick a random number in that range. Using the example W = [0.30, 0.70], we’d want to return 0 if our random number is between 1-30 and 1 if our random number is between 31-100.

See? No extra memory required. The above case, with just two weights, is easy enough to do with a single if-else statement, but what about the case where you have three, four, or more weights?

Take W = [0.30, 0.20, 0.50]. We pick a random integer from 1-100. If the number is between 1-30 we want to return 0, if it between 31-50 then return 1, and if its between 51-100 return 2.

It’s easy to make this into a loop: keep a running total T of the weights so far and return the first key such that N ≤ T.

Why does this work? Slice up the number line, 1-100, according to the weights. If our random number N is in the first slice return that key. If our random number is in the first or second slice (i.e., is less than the sum of the first and second weights) then it must be in the second slice since we already know it’s not in the first. Continue this way until you reach the end.

Ta da! I hope someone finds this useful.

Download

Download the w_rand test script.
  • I'm glad you're still blogging. It's been a while since your previous post. I've enjoyed reading your posts on programming (especially the one on MySQL improvements). Thanks for writing these useful posts.
  • Chris,

    Yeah, I'm trying to get back into it. Ever since I sold Adonomics I've been working more than full-time on it. Between that and a desire for an active social life I don't have much time for blogging.

    I'm going to try to write one article at least once every other week, though.
  • Great post. Now I'm trying to remember what I learned in my courses on probability in college... seems like there is a probability distribution for this, or something like that... what's the difference between a PDF and a CDF again? In any case, I think your method is essentially the same thing a stats class would have you do. :P
  • Mike,

    X = f(W) is the random variable in question, if you want to make it even more formal. The specific properties of X, e.g., its CDF, are determined entirely by W.
  • Jeremy
    Brilliant, and superbly clean code and explanation!
  • Hmmm. I like your function, but here's what I'm looking to do...

    I want to generate a random number from 0 to 6000. I want their to be the highest probability (40% chance) of getting between 100 and 2000, inclusive. I want the probability of getting a number from 99 to 0, to gradually decrease, so that it would be almost impossible to get 0. Just to be clear the probability of getting 98 should be less than that of 99, the probability of getting 97 should be less than the probability of getting 98, etc. Similarly, I want the probability of getting a number from 2001 to 6000 (again including 2001 and 6000) to gradually decrease, so that it's almost impossible to get 6000.

    Anyone know how I could do this?
  • Kareem
    Thank you very much for this awesome trick, i was working on a small application, and i was using the array method, but i didn't like it, as you said it takes lots of resources if there was a big set of data, which is the case for me.
    Wish you the best.
    Regards.
  • pj
    I'm going to implement this for my python program, thanks!
  • I've been looking for something which does exactly this for a very long time now, and this is by far more elegant than anything I've been able to make so far. Thanks!
  • This is great - a nice, tight function that does exactly what I was looking for - with one possible exception. Does this work well if you are drawing from an array with, say, 200 elements? Does adjusting the weight to the thousandths instead of the hundredths result with any degree of accuracy?
  • Thank you! Exactly what I was looking for.
  • blueskiwi
    awesome, you rock, this is just what I needed, thanks!!
  • blueskiwi
    for anyone wondering... you need to adjust the random range and the 1000 multiplier according to your weights

    ie the min and max rand value should (I think) correspond to the smallest possible weight value and the sum of all your weight values respectively (?)

    and if you use a multiplier that would be...

    min rand value = smallest possible weight * multiplier
    max rand value = sum of all weight values in set * multiplier
  • niabi
    This is great, looking for something similar, the only thing that can't quite work on my side, is what if i have about 1,000 elements and I want certain ones to have more weight than others?

    I would have to write
    A = [1,2-999,1000]

    but what about the weight? it would be kind of hard I guess to divide 100 / 1,000 and just give weight to certain items...

    any solution for this kind of problem?

    something like: in my sql i have 1,000 items, in the table i have 3 rows, ID, NAME, WEIGHT then the items that have a higher number (1,2,3,4 ...) have more chance of showing up, 2 has double the chance of 1 and 3 double the chance of 2 but triple the chance of 1 and so on...

    any idea?
  • Richard Lynch
    As far as I can tell, the choice of 1000 within the function is fairly arbitrary.

    I replaced it with 10, 100, 10000, and 42 and got seemingly equally random results (just eyeballing them).

    I was careful, however, to provide an array with weights between 0 and 1, and whose sum was 1.
    I.e., $ws = array(array(0.1,1),array(0.2,2),array(0.3,3),array(0.4,4));

    But I think you'll be in trouble no matter what if your array input does not give a full 100% values.

    I'd have to test more to be sure, but it's easier to just validate the input. :-)
blog comments powered by Disqus