cuDF is a Python GPU DataFrame library that provides functionality similar to a pandas DataFrame. It can be used for loading, joining, aggregating, filtering, and otherwise manipulating large datasets by leveraging GPU programming models. Because cuDF exposes a pandas-like API, developers and data scientists do not need to dive into the CUDA programming model. It is part of the RAPIDS suite, which uses the NVIDIA CUDA® programming model to expose high-bandwidth memory speed and GPU parallelism.
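
Because the API mirrors pandas, most pandas code needs little more than an import change. A minimal sketch (the column names and values here are illustrative, not taken from the benchmark below):

import cudf

# A small GPU DataFrame; replacing `cudf` with `pandas` gives the same result on CPU.
gdf = cudf.DataFrame({"src": [0, 1, 2], "dst": [1, 2, 0]})
print(gdf.head())
print(gdf.groupby("src").count())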

Install using conda

  • Create and activate new conda environment:
conda create --name gpu_env
conda activate gpu_env
  • Install packages:
conda install -c rapidsai -c nvidia -c numba -c conda-forge cudf=22.06 python=3.9 cudatoolkit=11.2
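
After installation, a quick check (a minimal sketch; the printed version will depend on the packages actually installed) confirms that cuDF imports correctly and can run a small computation on the GPU:

import cudf

print(cudf.__version__)   # e.g. 22.06.01
s = cudf.Series([1, 2, 3])
print(s.sum())            # runs on the GPU; prints 6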

cuDF vs Pandas DataFrame performance comparison

The following sections describe the system configuration, the dataset, the benchmark code, and the benchmarking results. The results are generated with Python's timeit module; each reported time is the total for 100 repetitions of the operation (number=repeat in the code below), not the time for a single run.
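
Every timing below follows the same timeit pattern: the statement is executed `number` times and the total elapsed time is returned. A minimal sketch of that pattern (the timed function here is a placeholder, not one of the benchmark operations):

import timeit

repeat = 100

def operation():
    # placeholder for the cuDF or pandas call being measured
    return sum(range(1000))

total = timeit.timeit('operation()', number=repeat, globals=globals())
print(f"total for {repeat} runs: {total:.6f}s")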

System configuration

  • GPU information:
    • NVIDIA A100-SXM4-40GB
    • Driver Version: 470.129.06
    • CUDA Version: 11.4
  • Hardware information:
    • Total Memory: 1.0T
    • CPU name: AMD EPYC 7742 64-Core Processor
    • CPU(s): 256
  • OS information:
    • Operating System: Ubuntu 20.04.4 LTS
    • Kernel: Linux 5.4.0-121-generic
    • Architecture: x86-64
  • Python package information:
    • Python version: 3.9.13
    • Conda version: conda 4.13.0
    • cuda-python: 11.7.0
    • cudatoolkit: 11.2.72
    • cudf: 22.06.01
    • pandas: 1.4.3

Dataset

The California road network dataset (Leskovec 2009) has the following properties:

  • Nodes: 1,965,206
  • Edges: 2,766,607
  • File size: 84 MB
  • Matrix size: 5,533,214 x 2 (source, destination)
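
The raw file is a tab-separated edge list, as read by the benchmark script below. A quick shape check with pandas (a minimal sketch, assuming the file is stored at ../data/data_5533214.txt, the path used in the benchmark script) confirms the 5,533,214 x 2 matrix:

import pandas as pd

# Load the tab-separated edge list and confirm its dimensions.
edges = pd.read_csv("../data/data_5533214.txt", sep="\t", header=None,
                    names=["source", "destination"])
print(edges.shape)  # expected: (5533214, 2)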

Benchmarks

The benchmark code (performance_comparison.py) measures execution times for a set of operations for both cuDF and Pandas DataFrames on the same dataset in the environment described above:

import re
import pandas as pd
import cudf
import timeit


def display_time(time_took, message):
    print(f"{message}: {time_took:.6f}s")


def get_read_csv(filename, method='cudf'):
    column_names = ['column 1', 'column 2']
    # The number of rows to read is encoded in the filename, e.g. data_5533214.txt.
    n = int(re.search(r'\d+|$', filename).group())
    if method == 'df':
        return pd.read_csv(filename, sep='\t', header=None,
                           names=column_names, nrows=n)
    return cudf.read_csv(filename, sep='\t', header=None,
                         names=column_names, nrows=n)


def get_reverse(relation):
    # Swap the two columns while keeping the original column names.
    column_names = ['column 1', 'column 2']
    reverse_relation = relation[relation.columns[::-1]]
    reverse_relation.columns = column_names
    return reverse_relation


def get_merge(relation_1, relation_2):
    # Inner join the two relations on their first column.
    column_names = ['column 1', 'column 2']
    return relation_1.merge(relation_2, on=column_names[0],
                            how="inner",
                            suffixes=('_relation_1', '_relation_2'))


def get_drop(result):
    # Drop the join column, remove duplicate rows, and restore the column names.
    column_names = ['column 1', 'column 2']
    temp = result.drop([column_names[0]], axis=1).drop_duplicates()
    temp.columns = column_names
    return temp


def get_concat(relation_1, relation_2, method='cudf'):
    if method == 'df':
        return pd.concat([relation_1, relation_2], ignore_index=True)
    return cudf.concat([relation_1, relation_2], ignore_index=True)


if __name__ == "__main__":
    dataset = "../data/data_5533214.txt"
    repeat = 100

    cudf_csv_read = timeit.timeit('get_read_csv(dataset)',
                                  number=repeat,
                                  globals=globals())
    display_time(cudf_csv_read, "CUDF read csv")
    relation_1 = get_read_csv(dataset)

    cudf_reverse_df = timeit.timeit('get_reverse(relation_1)',
                                    number=repeat,
                                    globals=globals())
    display_time(cudf_reverse_df, "CUDF reverse dataframe")
    relation_2 = get_reverse(relation_1)

    cudf_merge_df = timeit.timeit('get_merge(relation_1, relation_2)',
                                  number=repeat,
                                  globals=globals())
    display_time(cudf_merge_df, "CUDF merge dataframes")
    result = get_merge(relation_1, relation_2)

    cudf_drop = timeit.timeit('get_drop(result)',
                              number=repeat,
                              globals=globals())
    display_time(cudf_drop, "CUDF drop rows")
    result = get_drop(result)

    cudf_concat = timeit.timeit('get_concat(relation_1, relation_2)',
                                number=repeat,
                                globals=globals())
    display_time(cudf_concat, "CUDF concat relations")
    result = get_concat(relation_1, relation_2)
    print(f"CUDF final result length: {len(result)}")

    print("\n")
    method = 'df'

    pandas_csv_read = timeit.timeit('get_read_csv(dataset, method)',
                                    number=repeat,
                                    globals=globals())
    display_time(pandas_csv_read, "Pandas read csv")
    relation_1 = get_read_csv(dataset, method)

    pandas_reverse_df = timeit.timeit('get_reverse(relation_1)',
                                      number=repeat,
                                      globals=globals())
    display_time(pandas_reverse_df, "Pandas reverse dataframe")
    relation_2 = get_reverse(relation_1)

    pandas_merge_df = timeit.timeit('get_merge(relation_1, relation_2)',
                                    number=repeat,
                                    globals=globals())
    display_time(pandas_merge_df, "Pandas merge dataframes")
    result = get_merge(relation_1, relation_2)

    pandas_drop = timeit.timeit('get_drop(result)',
                                number=repeat,
                                globals=globals())
    display_time(pandas_drop, "Pandas drop rows")
    result = get_drop(result)

    pandas_concat = timeit.timeit('get_concat(relation_1, relation_2, method)',
                                  number=repeat,
                                  globals=globals())
    display_time(pandas_concat, "Pandas concat relations")
    result = get_concat(relation_1, relation_2, method)
    print(f"Pandas final result length: {len(result)}")

Running python performance_comparison.py on the California road network (Leskovec 2009) produces the following output:

CUDF read csv: 7.532238s
CUDF reverse dataframe: 0.031103s
CUDF merge dataframes: 2.354040s
CUDF drop rows: 4.165711s
CUDF concat relations: 0.345340s
CUDF final result length: 11066428

Pandas read csv: 67.287993s
Pandas reverse dataframe: 1.622508s
Pandas merge dataframes: 80.349599s
Pandas drop rows: 218.142479s
Pandas concat relations: 2.469050s
Pandas final result length: 11066428

cuDF shows significant performance gains over Pandas for the same dataset on the system configuration described above:

Operation               cuDF (s)     Pandas DF (s)    Speedup
Read CSV                7.532238     67.287993        8.9x
Reverse DF              0.031103     1.622508         52.2x
Merge DFs               2.354040     80.349599        34.1x
Drop column and rows    4.165711     218.142479       52.4x
Concat DFs              0.345340     2.469050         7.1x
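
The speedup column is simply the Pandas time divided by the cuDF time for each operation. A minimal sketch of that calculation from the timings above:

cudf_times = {"Read CSV": 7.532238, "Reverse DF": 0.031103,
              "Merge DFs": 2.354040, "Drop column and rows": 4.165711,
              "Concat DFs": 0.345340}
pandas_times = {"Read CSV": 67.287993, "Reverse DF": 1.622508,
                "Merge DFs": 80.349599, "Drop column and rows": 218.142479,
                "Concat DFs": 2.469050}

for op, cudf_t in cudf_times.items():
    print(f"{op}: {pandas_times[op] / cudf_t:.1f}x")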

Acknowledgement

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Reference

Leskovec, J., Lang, K. J., Dasgupta, A., & Mahoney, M. W. (2009). Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1), 29–123.
