python pandas performance optimization
- A crude Python loop over the rows should never be used.
- `iterrows` is a bit faster.
- `apply()` is even faster. It takes advantage of internal optimizations such as iterators in Cython.
- Vectorization over pandas Series is faster still. Instead of passing individual scalar values into a function, we can pass entire Series.
- The fastest is vectorization with NumPy arrays. When possible, we should use NumPy ufuncs (universal functions). Simply converting from the pandas representation to a NumPy representation via the `Series.values` attribute (or `Series.to_numpy()` in newer pandas versions) can yield almost a full order of magnitude performance improvement.
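The five approaches in the list above can be sketched side by side as follows. This is a minimal illustration, not a benchmark: the DataFrame, its column names `a` and `b`, and the function `combine` are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: two numeric columns.
df = pd.DataFrame({
    "a": np.arange(1000, dtype=float),
    "b": np.arange(1000, dtype=float) * 2.0,
})


def combine(a, b):
    """Hypothetical computation; works on scalars, Series and arrays alike."""
    return a * 2.0 + b


# 1. Crude loop with positional indexing (slowest).
loop_result = [combine(df["a"].iloc[i], df["b"].iloc[i]) for i in range(len(df))]

# 2. iterrows(): yields (index, row) pairs, one row at a time.
iterrows_result = [combine(row["a"], row["b"]) for _, row in df.iterrows()]

# 3. apply() along rows (axis=1): still one call per row.
apply_result = df.apply(lambda row: combine(row["a"], row["b"]), axis=1)

# 4. Vectorization over pandas Series: a single call on whole columns.
series_result = combine(df["a"], df["b"])

# 5. Vectorization over NumPy arrays: the same call without pandas overhead.
numpy_result = combine(df["a"].values, df["b"].values)
```

All five variants compute the same values; only the cost per row differs. To compare timings on your own data, each variant can be wrapped in `timeit.timeit` (or `%timeit` in IPython).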
python pandas boolean evaluation
`and` and `or` perform a single boolean evaluation on an entire object, while `&` and `|` perform multiple boolean evaluations on the content (the individual bits or bytes) of an object. For boolean NumPy arrays, the latter is nearly always the desired operation.

    x = np.arange(10)
    (x > 4) & (x < 8)    # right: element-wise boolean array
    (x > 4) and (x < 8)  # error: ValueError, the truth value of an array is ambiguous
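The same rule applies when filtering a pandas DataFrame. The sketch below uses a made-up DataFrame with a single hypothetical column `a` to show `&`, `|`, and the error raised by `and`.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"a": range(10)})

# Element-wise AND: each comparison needs its own parentheses,
# because & binds more tightly than > and <.
both = df[(df["a"] > 4) & (df["a"] < 8)]

# Element-wise OR.
either = df[(df["a"] < 2) | (df["a"] > 7)]

# `and` asks for the truth value of a whole Series, which is ambiguous.
try:
    df[(df["a"] > 4) and (df["a"] < 8)]
    raised = False
except ValueError:
    raised = True
```

Here `both` keeps the rows where `a` is 5, 6 or 7, `either` keeps the rows where `a` is 0, 1, 8 or 9, and `raised` ends up `True`.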
using numba in python
- Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions.
- Numba tutorial 1
- Numba tutorial 2
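As a small illustration of the "few annotations" mentioned above, here is a sketch of a math-heavy loop compiled with Numba's `njit` decorator. The function `sum_of_squares` and the test data are made up for the example, and the `ImportError` fallback is an assumption added only so the snippet still runs (without any speedup) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback (assumption for this sketch): plain Python, no JIT compilation.
    def njit(func):
        return func


@njit
def sum_of_squares(values):
    """Explicit loop over a 1-D float array; Numba compiles it on first call."""
    total = 0.0
    for v in values:
        total += v * v
    return total


data = np.arange(1000, dtype=np.float64)
result = sum_of_squares(data)
```

The first call includes compilation time, so any timing comparison should measure a second call.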