pandas is a Python library for data manipulation and analysis. It’s one of the standard tools in data science and engineering, and covers a wide range of work, from loading and cleaning data to reshaping and aggregating it.

Dataframes

The core data structure that pandas provides is the dataframe. Some core dataframe methods:

  • We can load in a new dataframe from:
    • A CSV file with: pd.read_csv()
      • Note that large (on the order of 10k+ rows) pandas dataframes can take up a huge amount of memory. With the chunksize parameter, pd.read_csv() returns an iterator that yields the file as a series of smaller dataframes of chunksize rows each, instead of loading everything at once, which keeps memory usage down. Note that this means work has to be done chunk by chunk rather than on a single contiguous dataframe. [1]
    • A NumPy array with: pd.DataFrame(arr)
  • df.iloc[row, col] to index a dataframe by integer position. Raw array-style indexing is not performant.
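
To make the loading and indexing methods above concrete, here is a minimal sketch. The CSV contents, column names and chunk size are made up for illustration, and an in-memory string stands in for a real file on disk.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# With chunksize, read_csv returns an iterator of dataframes
# rather than one dataframe, so we aggregate chunk by chunk.
total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()
    n_chunks += 1

# Build a dataframe from a NumPy array (column names are optional).
arr = np.arange(6).reshape(3, 2)
df = pd.DataFrame(arr, columns=["a", "b"])

# iloc indexes by integer position: row 1, column 0.
cell = df.iloc[1, 0]
```

Because each chunk is its own dataframe, anything that needs the whole dataset at once (a global sort, for example) has to be restructured or done after combining partial results.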

For row/column manipulations, specify axis=0 for rows and axis=1 for columns.

  • df.drop() to remove a row/column.
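
As a small sketch of how the axis argument changes what df.drop() removes (the dataframe and its labels here are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

# axis=0 drops by row label.
without_row = df.drop("y", axis=0)

# axis=1 drops by column label.
without_col = df.drop("b", axis=1)

# drop() returns a new dataframe by default; df itself is unchanged.
```

Note that drop() takes labels, not integer positions, so the row is referred to by its index label "y".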

Parallelisation

pandas operations are single-threaded by default, with no built-in mechanism to easily parallelise them. Without changing our code, we can use cuDF (requires Python >=3.9), a CUDA-based package that runs pandas operations on an NVIDIA GPU. Note that this is only valid for Linux systems (or WSL).

In our import statements:

try:
    # IPython/Jupyter magic: run this before pandas is imported
    %load_ext cudf.pandas
except ModuleNotFoundError:
    print('cuDF not installed, defaulting to regular pandas')
import pandas as pd

Supported pandas operations will then run on the GPU via CUDA, if available, and anything cuDF doesn't support falls back to regular pandas on the CPU.
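
The %load_ext line only works inside IPython or Jupyter. For a plain Python script, a rough equivalent is sketched below. It assumes the cudf.pandas.install() entry point described in the RAPIDS docs, and the USING_CUDF flag is just a name made up for this example.

```python
try:
    import cudf.pandas  # only importable if cuDF is installed

    # Needs to be called before pandas is imported.
    cudf.pandas.install()
    USING_CUDF = True  # hypothetical flag, for illustration only
except ImportError:
    print("cuDF not installed, defaulting to regular pandas")
    USING_CUDF = False

import pandas as pd

# The rest of the code is ordinary pandas either way.
df = pd.DataFrame({"a": [1, 2, 3]})
total = int(df["a"].sum())
```

The same docs also describe running an unmodified script with python -m cudf.pandas script.py, which avoids touching the imports at all.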

Modern alternatives to pandas (like Polars) are multithreaded by default and cover most of the same features, though through their own API rather than as a drop-in replacement. Using Polars means you 1) avoid the problems with cuDF and, more broadly, CUDA and 2) typically perform better across machines even without a GPU.

Resources

Footnotes

  1. From this LinkedIn post by Khuyen Tran.