python pandas performance optimization
- Crude loop should never be used.
iterrowsis a bit faster.
apply()is even faster. It takes advantage of internal optimization such as iterators in Cython.
- Vectorization over Pandas series is even faster.Instead of passing individual scalar values into function, we can pass entire series.
- The fastest is vectorization with Numpy arrays. When possible, we should use numpy
ufuncfeature. Simply converting from the pandas representation to a numpy representation via the
Series.valuesfield yields an almost full order of magnitude performance improvement.
python pandas boolean evaluation
orperform a single boolean evaluation on an entire object, while
|perform multiple boolean evaluations on the content (the individual bits or bytes) of an object. For boolean numpy arrays, the latter is nearly always the desired operation.
x=np.arange(10) (x > 4) & (x < 8) # right (x > 4) and (x < 8) # error