Sum elements of an array by label in Python, without a for loop
Explicit for loops are slow in Python, so it is usually better to use vectorized numpy or scipy functions.
Here is a simple example that sums the elements of b into c, where a gives, for each element of b, its index in c:
import numpy
import scipy.ndimage
a = numpy.asarray([0,0,0,4,4,4,3,3])
b = numpy.random.uniform(size=len(a))
c = numpy.zeros(5)
uni_a = numpy.unique(a)
c[uni_a] = scipy.ndimage.sum(b, labels=a, index=uni_a)

Now some benchmarks:
%pylab inline
import scipy.ndimage
import timeit

Initialization of the arrays:
size = 100
a = numpy.random.randint(0,10, size=size)
b = numpy.random.uniform(size=len(a))
c = numpy.zeros(max(a)+1)

First the code with a for loop:
for k, i in enumerate(a):
    c[i] += b[k]

And now the code without a for loop:
uni_a = numpy.unique(a)
c2 = numpy.zeros(max(a)+1)
c2[uni_a] = scipy.ndimage.sum(b, labels=a, index=uni_a)

And the result is the same:
(c == c2).all()
out: True
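The check above compares floating-point sums exactly, which can in general be sensitive to the order of summation. The following is a minimal sketch, not part of the original post, that uses small fixed inputs so the sums can be verified by hand and compares with numpy.allclose instead. It also shows two pure-numpy alternatives, numpy.bincount and numpy.add.at, which are assumptions of this sketch rather than the method used in the text.

```python
import numpy

# Fixed inputs so the expected sums can be checked by hand.
a = numpy.asarray([0, 0, 0, 4, 4, 4, 3, 3])  # target index in c for each element of b
b = numpy.asarray([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Reference result with the explicit loop.
c_loop = numpy.zeros(a.max() + 1)
for k, i in enumerate(a):
    c_loop[i] += b[k]

# Alternative 1: bincount with weights sums b grouped by the values in a.
c_bincount = numpy.bincount(a, weights=b, minlength=a.max() + 1)

# Alternative 2: unbuffered in-place addition, which handles repeated indices.
c_add_at = numpy.zeros(a.max() + 1)
numpy.add.at(c_add_at, a, b)

print(c_loop.tolist())  # [6.0, 0.0, 0.0, 15.0, 15.0]
print(numpy.allclose(c_loop, c_bincount), numpy.allclose(c_loop, c_add_at))
```

Labels that never occur in a (here 1 and 2) keep a sum of zero, which matches the behaviour of the loop version.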
Now we benchmark the execution time of the loop version for array sizes from 10 to 99, which is also the number of loop iterations:
def time_loop(size=100):
    elapsed = []
    # Repeat the measurement 100 times on fresh random data.
    for l in range(100):
        a = numpy.random.randint(0, 10, size=size)
        b = numpy.random.uniform(size=len(a))
        c = numpy.zeros(max(a)+1)
        start_time = timeit.default_timer()
        for k, i in enumerate(a):
            c[i] += b[k]
        elapsed.append(timeit.default_timer() - start_time)
    # Keep the mean of the three fastest runs to reduce noise.
    elapsed.sort()
    return numpy.mean(elapsed[:3])
elapsed = []
for s in range(10, 100):
    elapsed.append(time_loop(s))

And without the loop:
def time_noloop(size=100):
    elapsed = []
    # Repeat the measurement 100 times on fresh random data.
    for l in range(100):
        a = numpy.random.randint(0, 10, size=size)
        b = numpy.random.uniform(size=len(a))
        c = numpy.zeros(max(a)+1)
        start_time = timeit.default_timer()
        uni_a = numpy.unique(a)
        c[uni_a] = scipy.ndimage.sum(b, labels=a, index=uni_a)
        elapsed.append(timeit.default_timer() - start_time)
    # Keep the mean of the three fastest runs to reduce noise.
    elapsed.sort()
    return numpy.mean(elapsed[:3])
elapsed_noloop = []
for s in range(10, 100):
    elapsed_noloop.append(time_noloop(s))

plot(range(10, 100), elapsed, label='for loop')
plot(range(10, 100), elapsed_noloop, label='no for loop')
grid()
legend()
xlabel('array size (number of loop iterations)')
ylabel('execution time (s)')

The result shows a clear advantage for the scipy function once the number of iterations exceeds about 50.
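The timing functions above collect 100 manual measurements and average the three fastest. As a hedged alternative, the sketch below uses timeit.repeat from the standard library and keeps the minimum. The helper names best_time, sum_loop and sum_ndimage are hypothetical and introduced only for illustration, and the array size of 1000 is an arbitrary choice, so the numbers it prints are not the ones reported in this post.

```python
import timeit

import numpy
import scipy.ndimage


def best_time(func, repeat=5, number=100):
    """Return the best per-call time in seconds over `repeat` runs (hypothetical helper)."""
    times = timeit.repeat(func, repeat=repeat, number=number)
    return min(times) / number


def sum_loop(a, b):
    # Explicit Python loop, as in the loop version of the text.
    c = numpy.zeros(a.max() + 1)
    for k, i in enumerate(a):
        c[i] += b[k]
    return c


def sum_ndimage(a, b):
    # Labeled sum with scipy.ndimage, as in the no-loop version of the text.
    c = numpy.zeros(a.max() + 1)
    uni_a = numpy.unique(a)
    c[uni_a] = scipy.ndimage.sum(b, labels=a, index=uni_a)
    return c


# Seeded generator so both functions are timed on the same reproducible data.
rng = numpy.random.default_rng(0)
a = rng.integers(0, 10, size=1000)
b = rng.uniform(size=a.size)

t_loop = best_time(lambda: sum_loop(a, b))
t_noloop = best_time(lambda: sum_ndimage(a, b))
print(f"for loop: {t_loop:.2e} s, no for loop: {t_noloop:.2e} s")
```

Taking the minimum rather than a mean is a common choice for micro-benchmarks, since slower runs mostly reflect interference from other processes rather than the code being measured.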
