cuDF is a Python GPU DataFrame library that provides functionality similar to a pandas DataFrame. It can be used for loading, joining, aggregating, filtering, and otherwise manipulating large datasets by leveraging GPU programming models. Because cuDF exposes a pandas-like API, developers and data scientists do not need to dive into the CUDA programming model. It is part of the RAPIDS suite, which uses the NVIDIA CUDA® programming model to expose high-bandwidth memory speed and GPU parallelism.
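
Because the API mirrors pandas, most pandas code needs little more than an import change. A minimal sketch (the column names and values here are illustrative, not taken from the benchmark below):

import cudf

# A small GPU DataFrame; replacing `cudf` with `pandas` gives the same result on CPU.
gdf = cudf.DataFrame({"src": [0, 1, 2], "dst": [1, 2, 0]})
print(gdf.head())
print(gdf.groupby("src").count())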

Install using conda

  • Create and activate new conda environment:
conda create --name gpu_env
conda activate gpu_env
  • Install packages:
conda install -c rapidsai -c nvidia -c numba -c conda-forge cudf=22.06 python=3.9 cudatoolkit=11.2
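
After installation, a quick check (a minimal sketch; the printed version will depend on the packages actually installed) confirms that cuDF imports correctly and can run a small computation on the GPU:

import cudf

print(cudf.__version__)   # e.g. 22.06.01
s = cudf.Series([1, 2, 3])
print(s.sum())            # runs on the GPU; prints 6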

cuDF vs Pandas DataFrame performance comparison

The following sections describe the system configuration, the dataset, the benchmark code, and the benchmarking results. The results are generated with Python's timeit module; each reported time is the total for 100 repetitions of the operation (number=repeat in the code below), not the time for a single run.
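
Every timing below follows the same timeit pattern: the statement is executed `number` times and the total elapsed time is returned. A minimal sketch of that pattern (the timed function here is a placeholder, not one of the benchmark operations):

import timeit

repeat = 100

def operation():
    # placeholder for the cuDF or pandas call being measured
    return sum(range(1000))

total = timeit.timeit('operation()', number=repeat, globals=globals())
print(f"total for {repeat} runs: {total:.6f}s")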

System configuration

  • GPU information:
    • NVIDIA A100-SXM4-40GB
    • Driver Version: 470.129.06
    • CUDA Version: 11.4
  • Hardware information:
    • Total Memory: 1.0T
    • CPU name: AMD EPYC 7742 64-Core Processor
    • CPU(s): 256
  • OS information:
    • Operating System: Ubuntu 20.04.4 LTS
    • Kernel: Linux 5.4.0-121-generic
    • Architecture: x86-64
  • Python package information:
    • Python version: 3.9.13
    • Conda version: conda 4.13.0
    • cuda-python: 11.7.0
    • cudatoolkit: 11.2.72
    • cudf: 22.06.01
    • pandas: 1.4.3

Dataset

The California road network dataset (Leskovec 2009) has the following properties:

  • Nodes: 1,965,206
  • Edges: 2,766,607
  • File size: 84 MB
  • Matrix size: 5,533,214 x 2 (source, destination)
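
The raw file is a tab-separated edge list, as read by the benchmark script below. A quick shape check with pandas (a minimal sketch, assuming the file is stored at ../data/data_5533214.txt, the path used in the benchmark script) confirms the 5,533,214 x 2 matrix:

import pandas as pd

# Load the tab-separated edge list and confirm its dimensions.
edges = pd.read_csv("../data/data_5533214.txt", sep="\t", header=None,
                    names=["source", "destination"])
print(edges.shape)  # expected: (5533214, 2)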

Benchmarks

The benchmark code (performance_comparison.py) measures execution times for a set of operations for both cuDF and Pandas DataFrames on the same dataset in the environment described above:

import re
import pandas as pd
import cudf
import timeit


def display_time(time_took, message):
    print(f"{message}: {time_took:.6f}s")


def get_read_csv(filename, method='cudf'):
    column_names = ['column 1', 'column 2']
    # The number of rows to read is encoded in the filename, e.g. data_5533214.txt.
    n = int(re.search(r'\d+|$', filename).group())
    if method == 'df':
        return pd.read_csv(filename, sep='\t', header=None,
                           names=column_names, nrows=n)
    return cudf.read_csv(filename, sep='\t', header=None,
                         names=column_names, nrows=n)


def get_reverse(relation):
    # Swap the two columns while keeping the original column names.
    column_names = ['column 1', 'column 2']
    reverse_relation = relation[relation.columns[::-1]]
    reverse_relation.columns = column_names
    return reverse_relation


def get_merge(relation_1, relation_2):
    # Inner join the two relations on their first column.
    column_names = ['column 1', 'column 2']
    return relation_1.merge(relation_2, on=column_names[0],
                            how="inner",
                            suffixes=('_relation_1', '_relation_2'))


def get_drop(result):
    # Drop the join column, remove duplicate rows, and restore the column names.
    column_names = ['column 1', 'column 2']
    temp = result.drop([column_names[0]], axis=1).drop_duplicates()
    temp.columns = column_names
    return temp


def get_concat(relation_1, relation_2, method='cudf'):
    if method == 'df':
        return pd.concat([relation_1, relation_2], ignore_index=True)
    return cudf.concat([relation_1, relation_2], ignore_index=True)


if __name__ == "__main__":
    dataset = "../data/data_5533214.txt"
    repeat = 100

    cudf_csv_read = timeit.timeit('get_read_csv(dataset)',
                                  number=repeat,
                                  globals=globals())
    display_time(cudf_csv_read, "CUDF read csv")
    relation_1 = get_read_csv(dataset)

    cudf_reverse_df = timeit.timeit('get_reverse(relation_1)',
                                    number=repeat,
                                    globals=globals())
    display_time(cudf_reverse_df, "CUDF reverse dataframe")
    relation_2 = get_reverse(relation_1)

    cudf_merge_df = timeit.timeit('get_merge(relation_1, relation_2)',
                                  number=repeat,
                                  globals=globals())
    display_time(cudf_merge_df, "CUDF merge dataframes")
    result = get_merge(relation_1, relation_2)

    cudf_drop = timeit.timeit('get_drop(result)',
                              number=repeat,
                              globals=globals())
    display_time(cudf_drop, "CUDF drop rows")
    result = get_drop(result)

    cudf_concat = timeit.timeit('get_concat(relation_1, relation_2)',
                                number=repeat,
                                globals=globals())
    display_time(cudf_concat, "CUDF concat relations")
    result = get_concat(relation_1, relation_2)
    print(f"CUDF final result length: {len(result)}")

    print("\n")
    method = 'df'

    pandas_csv_read = timeit.timeit('get_read_csv(dataset, method)',
                                    number=repeat,
                                    globals=globals())
    display_time(pandas_csv_read, "Pandas read csv")
    relation_1 = get_read_csv(dataset, method)

    pandas_reverse_df = timeit.timeit('get_reverse(relation_1)',
                                      number=repeat,
                                      globals=globals())
    display_time(pandas_reverse_df, "Pandas reverse dataframe")
    relation_2 = get_reverse(relation_1)

    pandas_merge_df = timeit.timeit('get_merge(relation_1, relation_2)',
                                    number=repeat,
                                    globals=globals())
    display_time(pandas_merge_df, "Pandas merge dataframes")
    result = get_merge(relation_1, relation_2)

    pandas_drop = timeit.timeit('get_drop(result)',
                                number=repeat,
                                globals=globals())
    display_time(pandas_drop, "Pandas drop rows")
    result = get_drop(result)

    pandas_concat = timeit.timeit('get_concat(relation_1, relation_2, method)',
                                  number=repeat,
                                  globals=globals())
    display_time(pandas_concat, "Pandas concat relations")
    result = get_concat(relation_1, relation_2, method)
    print(f"Pandas final result length: {len(result)}")

Running python performance_comparison.py on the California road network (Leskovec 2009) produces the following output:

CUDF read csv: 7.532238s
CUDF reverse dataframe: 0.031103s
CUDF merge dataframes: 2.354040s
CUDF drop rows: 4.165711s
CUDF concat relations: 0.345340s
CUDF final result length: 11066428

Pandas read csv: 67.287993s
Pandas reverse dataframe: 1.622508s
Pandas merge dataframes: 80.349599s
Pandas drop rows: 218.142479s
Pandas concat relations: 2.469050s
Pandas final result length: 11066428

cuDF shows significant performance gains over Pandas for the same dataset on the system configuration described above:

Operation               cuDF (s)     Pandas DF (s)    Speedup
Read CSV                7.532238     67.287993        8.9x
Reverse DF              0.031103     1.622508         52.2x
Merge DFs               2.354040     80.349599        34.1x
Drop column and rows    4.165711     218.142479       52.4x
Concat DFs              0.345340     2.469050         7.1x
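
The speedup column is simply the Pandas time divided by the cuDF time for each operation. A minimal sketch of that calculation from the timings above:

cudf_times = {"Read CSV": 7.532238, "Reverse DF": 0.031103,
              "Merge DFs": 2.354040, "Drop column and rows": 4.165711,
              "Concat DFs": 0.345340}
pandas_times = {"Read CSV": 67.287993, "Reverse DF": 1.622508,
                "Merge DFs": 80.349599, "Drop column and rows": 218.142479,
                "Concat DFs": 2.469050}

for op, cudf_t in cudf_times.items():
    print(f"{op}: {pandas_times[op] / cudf_t:.1f}x")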

Acknowledgement

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Reference

Leskovec, J., Lang, K. J., Dasgupta, A., & Mahoney, M. W. (2009). Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1), 29–123.
