cuDF is a Python GPU DataFrame library that offers functionality similar to the Pandas DataFrame. It can be used to load, join, aggregate, filter, and otherwise manipulate large datasets on the GPU. Because cuDF provides a pandas-like API, developers and data scientists do not need to work with the CUDA programming model directly. It is part of the RAPIDS suite, which builds on the NVIDIA CUDA® programming model to expose GPU parallelism and high-bandwidth memory speed.
Install using conda
- Create and activate a new conda environment:
conda create --name gpu_env
conda activate gpu_env
- Install packages:
conda install -c rapidsai -c nvidia -c numba -c conda-forge cudf=22.06 python=3.9 cudatoolkit=11.2
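After installation, a quick check from Python can confirm that the packages are visible in the active environment. The sketch below uses only the standard library. The helper name `is_installed` is hypothetical and not part of cuDF, and the check only confirms that a package can be located, not that a working GPU and driver are present.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be located in the active environment."""
    # find_spec looks the package up without importing it, so this check
    # does not need a GPU to be present.
    return importlib.util.find_spec(package_name) is not None


for name in ("cudf", "pandas"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If `cudf` is reported as missing, the most likely cause is that the `gpu_env` environment is not the active one.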
cuDF vs Pandas DataFrame performance comparison
The following sections describe the system configuration, the dataset, the benchmark code, and the benchmark results.
The results were generated with Python's timeit module.
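When reading timeit numbers, it helps to remember that `timeit.timeit` returns the total time for all `number` executions, not the time of a single call. A minimal sketch with a stand-in workload (the function `workload` is hypothetical and unrelated to the benchmark operations):

```python
import timeit


def workload():
    """Stand-in for a benchmarked operation (hypothetical example)."""
    return sum(range(10_000))


repeat = 100
# timeit.timeit runs the statement `number` times and returns the total
# elapsed time in seconds for all runs combined.
total_seconds = timeit.timeit("workload()", number=repeat, globals=globals())
per_call_seconds = total_seconds / repeat

print(f"total for {repeat} runs: {total_seconds:.6f}s")
print(f"mean per run: {per_call_seconds:.6f}s")
```

Dividing the total by the number of runs gives a mean per-call time, which is easier to compare across benchmarks that use different repeat counts.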
System configuration
- GPU information:
  - NVIDIA A100-SXM4-40GB
  - Driver Version: 470.129.06
  - CUDA Version: 11.4
- Hardware information:
  - Total Memory: 1.0T
  - CPU name: AMD EPYC 7742 64-Core Processor
  - CPU(s): 256
- OS information:
  - Operating System: Ubuntu 20.04.4 LTS
  - Kernel: Linux 5.4.0-121-generic
  - Architecture: x86-64
- Python package information:
  - Python version: 3.9.13
  - Conda version: conda 4.13.0
  - cuda-python: 11.7.0
  - cudatoolkit: 11.2.72
  - cudf: 22.06.01
  - pandas: 1.4.3
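Part of this information can be collected from within Python, which may help when reproducing the benchmark on another machine. This is a sketch using only the standard library. It does not query the GPU or driver, and the dictionary keys are arbitrary labels chosen for illustration.

```python
import os
import platform

# Collect basic host details; GPU and driver details are not covered here.
system_info = {
    "python_version": platform.python_version(),
    "operating_system": platform.system(),
    "kernel_release": platform.release(),
    "architecture": platform.machine(),
    "logical_cpus": os.cpu_count(),
}

for key, value in system_info.items():
    print(f"{key}: {value}")
```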
Dataset
The California road network dataset (Leskovec 2009) has the following properties:
- Nodes: 1965206
- Edges: 2766607
- File size: 84M
- Matrix size: 5533214 x 2 (source - destination)
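Each row of the file holds one source and destination pair separated by a tab. The sketch below parses a tiny sample in that layout with the standard library. The sample values are invented for illustration and are not taken from the dataset.

```python
import csv
import io

# Invented three-row sample in the same tab-separated two-column layout.
sample = "0\t1\n1\t0\n1\t2\n"

reader = csv.reader(io.StringIO(sample), delimiter="\t")
edges = [(int(source), int(destination)) for source, destination in reader]

print(edges)
print(f"matrix size: {len(edges)} x 2")

# The row count listed above is exactly twice the edge count, which is
# consistent with each edge being stored once per direction (an assumption;
# the text does not state it).
rows_match_doubled_edges = (2 * 2766607 == 5533214)
print(rows_match_doubled_edges)
```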
Benchmarks
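The benchmark code (performance_comparison.py) measures the execution time of a set of operations for both cuDF and Pandas DataFrames on the same dataset in the environment described above. Each operation is executed 100 times, and the reported time is the total over all runs: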
import re
import pandas as pd
import cudf
import timeit


def display_time(time_took, message):
    print(f"{message}: {time_took:.6f}s")


def get_read_csv(filename, method='cudf'):
    column_names = ['column 1', 'column 2']
    # The row count is encoded in the file name (e.g. data_5533214.txt).
    # A raw string avoids an invalid-escape warning for \d.
    n = int(re.search(r'\d+|$', filename).group())
    if method == 'df':
        return pd.read_csv(filename, sep='\t', header=None,
                           names=column_names, nrows=n)
    return cudf.read_csv(filename, sep='\t', header=None,
                         names=column_names, nrows=n)


def get_reverse(relation):
    # Swap the two columns, then restore the original column names.
    column_names = ['column 1', 'column 2']
    reverse_relation = relation[relation.columns[::-1]]
    reverse_relation.columns = column_names
    return reverse_relation


def get_merge(relation_1, relation_2):
    # Inner join of both relations on the first column.
    column_names = ['column 1', 'column 2']
    return relation_1.merge(relation_2, on=column_names[0],
                            how="inner",
                            suffixes=('_relation_1', '_relation_2'))


def get_drop(result):
    # Drop the join column, then remove duplicate rows.
    column_names = ['column 1', 'column 2']
    temp = result.drop([column_names[0]], axis=1).drop_duplicates()
    temp.columns = column_names
    return temp


def get_concat(relation_1, relation_2, method='cudf'):
    if method == 'df':
        return pd.concat([relation_1, relation_2], ignore_index=True)
    return cudf.concat([relation_1, relation_2], ignore_index=True)


if __name__ == "__main__":
    dataset = "../data/data_5533214.txt"
    # timeit.timeit returns the total time in seconds for `repeat` runs.
    repeat = 100

    # cuDF measurements
    cudf_csv_read = timeit.timeit('get_read_csv(dataset)',
                                  number=repeat,
                                  globals=globals())
    display_time(cudf_csv_read, "CUDF read csv")
    relation_1 = get_read_csv(dataset)
    cudf_reverse_df = timeit.timeit('get_reverse(relation_1)',
                                    number=repeat,
                                    globals=globals())
    display_time(cudf_reverse_df, "CUDF reverse dataframe")
    relation_2 = get_reverse(relation_1)
    cudf_merge_df = timeit.timeit('get_merge(relation_1, relation_2)',
                                  number=repeat,
                                  globals=globals())
    display_time(cudf_merge_df, "CUDF merge dataframes")
    result = get_merge(relation_1, relation_2)
    cudf_drop = timeit.timeit('get_drop(result)',
                              number=repeat,
                              globals=globals())
    display_time(cudf_drop, "CUDF drop rows")
    result = get_drop(result)
    cudf_concat = timeit.timeit('get_concat(relation_1, relation_2)',
                                number=repeat,
                                globals=globals())
    display_time(cudf_concat, "CUDF concat relations")
    result = get_concat(relation_1, relation_2)
    print(f"CUDF final result length: {len(result)}")

    print("\n")

    # Pandas measurements
    method = 'df'
    pandas_csv_read = timeit.timeit('get_read_csv(dataset, method)',
                                    number=repeat,
                                    globals=globals())
    display_time(pandas_csv_read, "Pandas read csv")
    relation_1 = get_read_csv(dataset, method)
    pandas_reverse_df = timeit.timeit('get_reverse(relation_1)',
                                      number=repeat,
                                      globals=globals())
    display_time(pandas_reverse_df, "Pandas reverse dataframe")
    relation_2 = get_reverse(relation_1)
    pandas_merge_df = timeit.timeit('get_merge(relation_1, relation_2)',
                                    number=repeat,
                                    globals=globals())
    display_time(pandas_merge_df, "Pandas merge dataframes")
    result = get_merge(relation_1, relation_2)
    pandas_drop = timeit.timeit('get_drop(result)',
                                number=repeat,
                                globals=globals())
    display_time(pandas_drop, "Pandas drop rows")
    result = get_drop(result)
    pandas_concat = timeit.timeit('get_concat(relation_1, relation_2, method)',
                                  number=repeat,
                                  globals=globals())
    display_time(pandas_concat, "Pandas concat relations")
    result = get_concat(relation_1, relation_2, method)
    print(f"Pandas final result length: {len(result)}")
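To make the timed operations concrete, the sketch below replays the same steps (reverse, merge, drop, concat) on a three-row toy relation using Pandas only, so it runs without a GPU. The toy values are invented for illustration. The calls mirror those in the benchmark functions above.

```python
import pandas as pd

column_names = ["column 1", "column 2"]

# Invented toy relation: three (source, destination) rows.
relation_1 = pd.DataFrame({"column 1": [1, 2, 2], "column 2": [2, 3, 4]})

# Reverse: swap the two columns, then restore the original column names.
relation_2 = relation_1[relation_1.columns[::-1]]
relation_2.columns = column_names

# Merge: inner join on the first column of both relations.
merged = relation_1.merge(relation_2, on=column_names[0], how="inner",
                          suffixes=("_relation_1", "_relation_2"))

# Drop: remove the join column, then remove duplicate rows.
dropped = merged.drop([column_names[0]], axis=1).drop_duplicates()
dropped.columns = column_names

# Concat: stack both relations; the length is the sum of both inputs.
combined = pd.concat([relation_1, relation_2], ignore_index=True)

print(dropped.values.tolist())
print(len(combined))
```

On the toy data, the drop step leaves the rows [3, 1] and [4, 1], and the concatenated result has 6 rows. The same concat step explains the final result length in the full benchmark: 5533214 rows plus their 5533214 reversed rows give 11066428.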
Running python performance_comparison.py on the California road network (Leskovec 2009) dataset produces the following result:
CUDF read csv: 7.532238s
CUDF reverse dataframe: 0.031103s
CUDF merge dataframes: 2.354040s
CUDF drop rows: 4.165711s
CUDF concat relations: 0.345340s
CUDF final result length: 11066428
Pandas read csv: 67.287993s
Pandas reverse dataframe: 1.622508s
Pandas merge dataframes: 80.349599s
Pandas drop rows: 218.142479s
Pandas concat relations: 2.469050s
Pandas final result length: 11066428
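On this dataset and system configuration, cuDF is faster than Pandas for every measured operation, with speedups ranging from 7.1x (concat) to 52.4x (drop column and rows). All times are totals over 100 runs.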
Operation | cuDF (s) | Pandas DF (s) | Speedup |
---|---|---|---|
Read CSV | 7.532238 | 67.287993 | 8.9x |
Reverse DF | 0.031103 | 1.622508 | 52.2x |
Merge DFs | 2.354040 | 80.349599 | 34.1x |
Drop column and rows | 4.165711 | 218.142479 | 52.4x |
Concat DFs | 0.345340 | 2.469050 | 7.1x |
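The speedup column is the Pandas time divided by the cuDF time. The short sketch below recomputes it from the measured totals in the table. The dictionary layout is just one convenient way to hold the numbers.

```python
# Measured totals in seconds, copied from the benchmark output: (cuDF, Pandas).
timings = {
    "Read CSV": (7.532238, 67.287993),
    "Reverse DF": (0.031103, 1.622508),
    "Merge DFs": (2.354040, 80.349599),
    "Drop column and rows": (4.165711, 218.142479),
    "Concat DFs": (0.345340, 2.469050),
}

# Speedup = Pandas time / cuDF time, rounded to one decimal place.
speedups = {
    operation: round(pandas_seconds / cudf_seconds, 1)
    for operation, (cudf_seconds, pandas_seconds) in timings.items()
}

for operation, speedup in speedups.items():
    print(f"{operation}: {speedup}x")
```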
Acknowledgement
This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
Reference
- cuDF’s documentation
- Documentation on cuDF Drop
- Documentation on cuDF Drop Duplicates
- Documentation on cuDF concatenate
- California road network dataset
- (Leskovec 2009) J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Mathematics 6(1), 29-123, 2009.