pandas is a Python library for data manipulation and analysis. It’s one of the standard tools in data science and engineering, used for loading, cleaning, reshaping, and analysing tabular data.
Dataframes
The core data structure that pandas provides is the dataframe. Some core dataframe methods:
- We can load in a new dataframe from:
  - A CSV file with `pd.read_csv()`.
    - Note that large (on the order of 10k+) pandas dataframes can take up a huge amount of memory. Passing the `chunksize` parameter makes `pd.read_csv()` return an iterator that yields the file in chunks of that many rows, so only one chunk has to be held in memory at a time. The trade-off is that the data is no longer available as a single contiguous dataframe, so any work has to be done chunk by chunk. [1]
  - A NumPy array with `pd.DataFrame(arr)`.
- Use `df.iloc[row, col]` to index a dataframe by position. Raw array-style indexing is not performant.
- For row/column manipulations, specify `axis=0` for rows and `axis=1` for columns.
  - Use `df.drop()` to remove a row/column.
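As a rough illustration of the methods listed above, here is a small sketch. The in-memory CSV, the column names, and the chunk size of 4 are made up for the example; a real file would be read from disk and be far larger.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(10))

# With chunksize, read_csv returns an iterator of dataframes
# instead of one dataframe, so we aggregate chunk by chunk.
total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += int(chunk["value"].sum())
    n_chunks += 1

# Build a dataframe from a NumPy array (column names are optional).
arr = np.arange(12).reshape(4, 3)
df = pd.DataFrame(arr, columns=["a", "b", "c"])

# Positional indexing: row 1, column 2.
cell = df.iloc[1, 2]

# axis=0 drops rows (by index label), axis=1 drops columns (by name).
without_first_row = df.drop(0, axis=0)
without_b = df.drop("b", axis=1)
```

Because each chunk is its own dataframe, anything that needs the whole dataset at once (a global sort, for instance) has to be restructured as a per-chunk computation plus a final combine step, as with the running `total` here.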
Parallelisation
pandas operations are single-threaded by default, and there is no built-in mechanism to easily parallelise them. Without otherwise changing our code, we can use cuDF (requires Python >=3.9), a CUDA-based dataframe package that runs pandas operations on an NVIDIA GPU. Note that this only works on Linux systems (or WSL).
In our import statements:
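A minimal sketch of such an import block, assuming the `cudf.pandas` accelerator mode is the mechanism meant here. The `try`/`except` fallback to plain pandas is an addition for machines where cuDF is not installed:

```python
# Enable the cuDF accelerator before pandas is imported.
# Assumption: cuDF's `cudf.pandas` mode is the intended mechanism.
try:
    import cudf.pandas

    cudf.pandas.install()
except ImportError:
    # cuDF is not installed; plain CPU pandas is used instead.
    pass

import pandas as pd

# The rest of the code is written against the usual pandas API.
df = pd.DataFrame({"a": [1, 2, 3]})
column_sum = int(df["a"].sum())
```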
And all pandas operations will use CUDA, if available.
Modern alternatives to pandas (like Polars) are multithreaded by default and cover most of the same functionality, although through their own API rather than as a drop-in replacement. Using Polars means you 1) avoid the problems with cuDF, and with CUDA more broadly, and 2) generally get better performance across machines, even without a GPU.
Footnotes
1. From a LinkedIn post by Khuyen Tran.