Sum elements of an array from given labels, in Python, without a for loop

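‘For loops’ are slow in Python. That’s why it’s usually better to use numpy or scipy functions. Here is a simple example: a gives a label for each element of b, and we want to sum the elements of b that share the same label and store each sum in c at the index given by that label: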

import numpy
import scipy.ndimage
a = numpy.asarray([0,0,0,4,4,4,3,3])
b = numpy.random.uniform(size=len(a))
c = numpy.zeros(5)
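uni_a = numpy.unique(a)  # labels actually present in a
# scipy.ndimage.measurements is a deprecated namespace; scipy.ndimage.sum is the same function
c[uni_a] = scipy.ndimage.sum(b,labels=a, index=uni_a)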
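To make the result easier to check by hand, here is the same computation with fixed values in b instead of random ones. The values 1.0 to 8.0 are chosen only for illustration and are not part of the original example:

```python
import numpy
import scipy.ndimage

# Fixed values (an assumption, for illustration) so the sums can be verified by hand.
a = numpy.asarray([0, 0, 0, 4, 4, 4, 3, 3])
b = numpy.asarray([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
c = numpy.zeros(5)

uni_a = numpy.unique(a)  # labels present in a: 0, 3 and 4
c[uni_a] = scipy.ndimage.sum(b, labels=a, index=uni_a)

# label 0: 1+2+3 = 6, label 3: 7+8 = 15, label 4: 4+5+6 = 15
# labels 1 and 2 never appear in a, so c[1] and c[2] stay at 0
print(c)
```

The entries of c for labels that do not occur in a are simply left at their initial value of zero.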

Now some benchmarks:

%pylab inline
import scipy.ndimage
import timeit

Initialization of the arrays:

size = 100
a = numpy.random.randint(0,10, size=size)
b = numpy.random.uniform(size=len(a))
c = numpy.zeros(max(a)+1)

First the code with ‘for loop’:

for k,i in enumerate(a):
    c[i] += b[k]

And now the code without ‘for loop’:

uni_a = numpy.unique(a)
c2 = numpy.zeros(max(a)+1)
c2[uni_a] = scipy.ndimage.sum(b,labels=a, index=uni_a)

And the result is the same:

numpy.allclose(c, c2)  # compare with a tolerance, since these are floating point sums

out: True
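As a side note, numpy alone can produce the same sums with numpy.bincount and its weights argument. This is an alternative technique, not the scipy approach used in this post, and the name c3 is only introduced here for the comparison:

```python
import numpy

size = 100
a = numpy.random.randint(0, 10, size=size)
b = numpy.random.uniform(size=len(a))

# Reference result with the explicit loop
c = numpy.zeros(a.max() + 1)
for k, i in enumerate(a):
    c[i] += b[k]

# bincount adds up the weights (here b) for each integer label in a;
# minlength guarantees one entry per label from 0 to a.max()
c3 = numpy.bincount(a, weights=b, minlength=a.max() + 1)

# Compare with a tolerance rather than ==, since these are floating point sums
same = numpy.allclose(c, c3)
print(same)
```

Note that numpy.bincount only accepts non-negative integer labels, which is the case here.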

Now we’ll benchmark the execution time. First with the loop, for arrays of 10 to 99 elements (that is, 10 to 99 iterations in the loop):

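def time_loop(size=100):
    elapsed = []
    for l in range(100):
        # fresh random labels and values for each run
        a = numpy.random.randint(0,10, size=size)
        b = numpy.random.uniform(size=len(a))
        c = numpy.zeros(max(a)+1)
        # time only the summation, not the array setup
        start_time = timeit.default_timer()
        for k,i in enumerate(a):
            c[i] += b[k]
        elapsed.append(timeit.default_timer() - start_time)
    # note: only the first three of the 100 timings are averaged
    return numpy.mean(elapsed[:3])

elapsed = []
for s in range(10,100):
    elapsed.append(time_loop(s))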

And without the loop:

def time_noloop(size=100):
    elapsed = []
    for l in range(100):
        # fresh random labels and values for each run
        a = numpy.random.randint(0,10, size=size)
        b = numpy.random.uniform(size=len(a))
        c = numpy.zeros(max(a)+1)
        # time only the summation, not the array setup
        start_time = timeit.default_timer()
        uni_a = numpy.unique(a)
        c[uni_a] = scipy.ndimage.sum(b,labels=a, index=uni_a)
        elapsed.append(timeit.default_timer() - start_time)
    # note: only the first three of the 100 timings are averaged
    return numpy.mean(elapsed[:3])

elapsed_noloop = []
for s in range(10,100):
    elapsed_noloop.append(time_noloop(s))

plot(range(10,100),elapsed,label='for loop')
plot(range(10,100),elapsed_noloop, label='no for loop')
xlabel('number of loop iterations (array size)')
ylabel('execution time (s)')
legend()  # show the labels given to plot()

And the result shows a clear advantage for the scipy function when the number of iterations is more than 50.
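If you want to reproduce this kind of comparison outside a notebook, a minimal sketch using timeit.repeat could look like the following. The helper names loop_sum, noloop_sum and best_time are hypothetical, and the repeat and number values are arbitrary choices; the timings will of course depend on your machine and library versions:

```python
import timeit

import numpy
import scipy.ndimage


def loop_sum(a, b):
    """Sum b per label in a with an explicit Python loop."""
    c = numpy.zeros(a.max() + 1)
    for k, i in enumerate(a):
        c[i] += b[k]
    return c


def noloop_sum(a, b):
    """Same sums, computed by scipy.ndimage.sum instead of a loop."""
    c = numpy.zeros(a.max() + 1)
    uni_a = numpy.unique(a)
    c[uni_a] = scipy.ndimage.sum(b, labels=a, index=uni_a)
    return c


def best_time(func, size, repeat=5, number=20):
    """Hypothetical helper: best per-call time (in seconds) of func on random data."""
    a = numpy.random.randint(0, 10, size=size)
    b = numpy.random.uniform(size=size)
    # timeit.repeat returns one total time per repetition of `number` calls
    timings = timeit.repeat(lambda: func(a, b), repeat=repeat, number=number)
    return min(timings) / number


for size in (10, 50, 100):
    t_loop = best_time(loop_sum, size)
    t_noloop = best_time(noloop_sum, size)
    print(size, t_loop, t_noloop)
```

Taking the minimum over several repetitions is a common way to reduce the influence of other processes on the measurement; it is a different statistic from the mean used in the functions above.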


If you want to ask me a question or leave me a message, add @bougui505 to your comment.