[Investigation/Benchmarking] The use of CuPy over NumPy where applicable

Syres
Veteran
Posts: 2902
Joined: Thu Aug 09, 2018 11:14 am

[Investigation/Benchmarking] The use of CuPy over NumPy where applicable

Post by Syres »

I've done a few searches and cannot find any past investigation into using CuPy as an alternative to NumPy where an NVIDIA CUDA card is available. Are my search criteria flawed?
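
To make the "CuPy where a CUDA card is available, NumPy otherwise" idea concrete, here is a minimal sketch of a fallback import. The helper name get_array_module is hypothetical (it is not part of FreeCAD, NumPy or CuPy), and the device check assumes that a CuPy install without a usable GPU should be skipped rather than treated as an error.

```python
import importlib


def get_array_module(candidates=("cupy", "numpy")):
    """Return the first importable, usable array module from `candidates`.

    Hypothetical helper for illustration only.
    """
    for name in candidates:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # not installed, try the next candidate
        if name == "cupy":
            try:
                # Importing CuPy can succeed on a machine with no usable GPU,
                # so ask the CUDA runtime how many devices it can see.
                if module.cuda.runtime.getDeviceCount() < 1:
                    continue
            except Exception:
                continue  # driver/runtime problem: fall back to the next one
        return module
    raise ImportError("none of %r could be imported" % (candidates,))
```

Code that sticks to the API shared by both libraries could then do xp = get_array_module() and call xp.ones(...), xp.sum(...) and so on without caring which backend it got.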

On a preliminary check I can only find 13 Python files that contain import numpy. My plan is to note down all the steps for upgrading libraries to specific versions and whatever else is required, initially on Windows and then, if successful, on Linux. I need to benchmark some 'tough' operations as is and then with CuPy, but I've only got a GTX670, so I'm not expecting huge gains. I may be able to get hold of a GTX1080ti in a couple of weeks' time, which should make a noticeable difference.
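
For anyone wanting to repeat the file count, a rough sketch of such a search follows. The function name and the regular expression are my own assumptions; it only catches lines that start with "import numpy" or "from numpy", so something like "import os, numpy" would be missed.

```python
import re
from pathlib import Path

# Matches lines beginning with "import numpy ..." or "from numpy ...".
NUMPY_IMPORT = re.compile(r"^\s*(?:import\s+numpy\b|from\s+numpy\b)", re.MULTILINE)


def files_importing_numpy(root):
    """Return a sorted list of .py files under `root` that import numpy."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file, skip it
        if NUMPY_IMPORT.search(text):
            hits.append(path)
    return hits
```

Pointing it at a source checkout, e.g. len(files_importing_numpy("path/to/source")), gives the number of candidate files to review.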

Re: [Investigation/Benchmarking] The use of CuPy over NumPy where applicable

Post by Syres »

Progress Report 1:

Installation steps (Windows-specific). Python must be 3.6.x or later (I used 3.6.9):

1) Upgrade pip to version 20.1.1 or newer
2) Upgrade setuptools to version 46.4.0 or newer
3) Upgrade numpy to version 1.18.4 or newer
4) Install the CUDA Toolkit (https://developer.nvidia.com/cuda-downl ... e=exelocal). I used 10.2; whichever version you choose, only one can be resident at any one time. It's a 2.1 GB download, and I used Express Install with no other deviation from Next, Next, Next...
5) Upgrade the NVIDIA driver to one that supports at least that CUDA version
6) pip install cupy-cuda102 (note: the CuPy package must match the version of the CUDA Toolkit, and it's a long install compared to what I'm used to, so expect minutes of hourglass before the install completes successfully)
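
A quick way to check that the steps above worked is to ask CuPy which CUDA runtime it was built against and how many devices it can see. This is a sketch rather than an official check; the two function names are made up for illustration, while cp.cuda.runtime.runtimeGetVersion() and cp.cuda.runtime.getDeviceCount() are the CuPy calls being relied on.

```python
def decode_cuda_version(version):
    """Convert CUDA's integer version (1000*major + 10*minor) to (major, minor)."""
    return version // 1000, (version % 1000) // 10


def describe_cupy_install():
    """Return a small dict describing the CuPy install, or None if unusable."""
    try:
        import cupy as cp
    except ImportError:
        return None  # CuPy is not installed
    try:
        runtime = decode_cuda_version(cp.cuda.runtime.runtimeGetVersion())
        devices = cp.cuda.runtime.getDeviceCount()
    except Exception:
        return None  # installed, but the driver or runtime is not usable
    return {"cupy": cp.__version__, "cuda_runtime": runtime, "devices": devices}


print(describe_cupy_install())
```

With the cupy-cuda102 package the cuda_runtime entry should come back as (10, 2); None suggests either a missing package or a driver problem (step 5).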

Found a benchmark site and customised it for 'nicer' output:

Code: Select all

import numpy as np
import cupy as cp
import time

# Side length of the cubic test array; 625 gives 625**3 = 244,140,625 elements
array_side = 625
total_array_size = array_side * array_side * array_side
print('Total Array to be Benchmarked: '+str(total_array_size))

### Numpy and CPU: create an array of ones (float64 by default)
s = time.time()
x_cpu = np.ones((array_side,array_side,array_side))
e = time.time()
print('Numpy & CPU operation to create array took '+str(e - s))

### CuPy and GPU: same array, allocated in GPU memory
s = time.time()
x_gpu = cp.ones((array_side,array_side,array_side))
# GPU work is asynchronous, so wait for it to finish before stopping the clock
cp.cuda.Stream.null.synchronize()
e = time.time()
print('CuPy & GPU operation to create array took '+str(e - s))

### Numpy and CPU: three in-place element-wise operations
s = time.time()
x_cpu *= 5
x_cpu *= x_cpu
x_cpu += x_cpu
e = time.time()
print('Numpy & CPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took '+str(e - s))

### CuPy and GPU: the same three operations on the GPU array
s = time.time()
x_gpu *= 5
x_gpu *= x_gpu
x_gpu += x_gpu
cp.cuda.Stream.null.synchronize()
e = time.time()
print('CuPy & GPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took '+str(e - s))

print('Finished\n\n')
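
One caveat with timing a single run using time.time(): the clock can be fairly coarse (a very short GPU step may read as 0.0), and the first CuPy call also pays one-off costs such as CUDA start-up and kernel compilation. A possible refinement, offered as a sketch and not what was used for the results in this thread, is a small helper (the name best_of is hypothetical) that does a warm-up call and keeps the best of several runs measured with time.perf_counter():

```python
import time


def best_of(fn, repeats=5, sync=None):
    """Return the best wall-clock time in seconds of `fn` over `repeats` runs.

    `sync` is an optional callable that blocks until queued work has
    finished, e.g. cp.cuda.Stream.null.synchronize for CuPy.
    """
    def run():
        fn()
        if sync is not None:
            sync()

    run()  # warm-up call, not timed
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best
```

For the GPU case this would be called as best_of(lambda: cp.ones((625, 625, 625)), sync=cp.cuda.Stream.null.synchronize), and for the CPU case as best_of(lambda: np.ones((625, 625, 625))).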
Hardware used for benchmarking:

Intel Core i5-3570K CPU @ 3.4GHz
24Gb DDR3 memory
NVidia Geforce GTX670 with 2048Mb GDDR5 memory

Results using different array sizes, up to the point where my GTX670 ran out of memory:

Code: Select all

Total Array to be Benchmarked: 1000000
Numpy & CPU operation to create array took 0.002000093460083008
CuPy & GPU operation to create array took 0.021001100540161133
Numpy & CPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.004000425338745117
CuPy & GPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.0009999275207519531
Finished


Total Array to be Benchmarked: 15625000
Numpy & CPU operation to create array took 0.029001712799072266
CuPy & GPU operation to create array took 0.02300119400024414
Numpy & CPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.06800389289855957
CuPy & GPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.006000518798828125
Finished


Total Array to be Benchmarked: 125000000
Numpy & CPU operation to create array took 0.22901320457458496
CuPy & GPU operation to create array took 0.09800553321838379
Numpy & CPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.5420310497283936
CuPy & GPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.6530373096466064
Finished


Total Array to be Benchmarked: 244140625
Numpy & CPU operation to create array took 0.9410536289215088
CuPy & GPU operation to create array took 0.21601223945617676
Numpy & CPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 1.076061725616455
CuPy & GPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.24701428413391113
Finished
So there is a speed benefit depending on the size of the array. I'll carry out the same tests on a Windows 10 i5 box with a GTX1050 and report back.
sliptonic wrote: Thought you might be interested
Edit: Corrected CPU of Win10 box
Edit2: Add line for NVidia driver upgrade
Last edited by Syres on Thu May 28, 2020 3:44 pm, edited 1 time in total.

Re: [Investigation/Benchmarking] The use of CuPy over NumPy where applicable

Post by Syres »

Progress Report 2:

Repeated install steps on Win10 box but also needed to upgrade the NVidia Driver (which I thought the Toolkit did but apparently not).

Hardware used for benchmarking:

Intel i5-6400 @ 2.70GHz
16Gb DDR3 memory
GTX 1050Ti with 4096Mb GDDR5 memory

The same benchmark code as before gave the following results:

Code: Select all

Total Array to be Benchmarked: 1000000
Numpy & CPU operation to create array took 0.004010438919067383
CuPy & GPU operation to create array took 0.013570308685302734
Numpy & CPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.0019960403442382812
CuPy & GPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.0
Finished


Total Array to be Benchmarked: 15625000
Numpy & CPU operation to create array took 0.0688173770904541
CuPy & GPU operation to create array took 0.011492729187011719
Numpy & CPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.02895045280456543
CuPy & GPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.007977724075317383
Finished


Total Array to be Benchmarked: 125000000
Numpy & CPU operation to create array took 0.5505313873291016
CuPy & GPU operation to create array took 0.13450002670288086
Numpy & CPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.23137784004211426
CuPy & GPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.06183266639709473
Finished


Total Array to be Benchmarked: 244140625
Numpy & CPU operation to create array took 0.9773850440979004
CuPy & GPU operation to create array took 0.8108229637145996
Numpy & CPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.46982836723327637
CuPy & GPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.8389742374420166
Finished
Conclusion: my GTX670 is a much better card than I thought!! If you have a super-fast AMD Threadripper or an Intel i9, you're possibly going to need at least a 1080Ti to beat the CPU. On older hardware, a decent CUDA GPU with CuPy would be at least 4x and up to 10x faster, as long as the GPU's memory capacity is sufficient for the size of the arrays being manipulated.
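
The ratios behind that estimate can be recomputed from the two Windows runs. The sketch below copies the multiply/add timings from the outputs above, rounded to four decimal places; the dictionary layout and the speedup helper are just for illustration. A GPU time of 0.0 is treated as "too fast for the timer" and not as an infinite speed-up.

```python
# (array elements, CPU seconds, GPU seconds) for the multiply/add step,
# copied from the two Windows result listings and rounded to 4 decimals.
RESULTS = {
    "GTX670": [
        (1000000, 0.0040, 0.0010),
        (15625000, 0.0680, 0.0060),
        (125000000, 0.5420, 0.6530),
        (244140625, 1.0761, 0.2470),
    ],
    "GTX 1050 Ti": [
        (1000000, 0.0020, 0.0),
        (15625000, 0.0290, 0.0080),
        (125000000, 0.2314, 0.0618),
        (244140625, 0.4698, 0.8390),
    ],
}


def speedup(cpu_seconds, gpu_seconds):
    """Return the CPU/GPU time ratio, or None if the GPU time is not measurable."""
    if gpu_seconds <= 0:
        return None
    return cpu_seconds / gpu_seconds


for card, rows in RESULTS.items():
    print(card)
    for elements, cpu, gpu in rows:
        ratio = speedup(cpu, gpu)
        label = "n/a" if ratio is None else "%.1fx" % ratio
        print("  %11d elements: %s" % (elements, label))
```

Ratios below 1.0 mark the runs where the GPU was slower than the CPU, which happened once on each card.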

I'll look into the Linux installation routine next, but it may be a few days before I give an update; real-world work calls.

Re: [Investigation/Benchmarking] The use of CuPy over NumPy where applicable

Post by Syres »

Progress Report 3:

Installation steps on Linux (minimum Python version 3.5.1):

Steps 1 to 3 are the same as on Windows
4) sudo apt install nvidia-cuda-toolkit (note: on Linux you don't need to know the specific CUDA version at this point in the process)
5) nvcc --version (to get the CUDA version; mine is 9.1)
6) pip install cupy-cuda91 (replace 91 with whatever your version is)
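
Steps 5 and 6 can be tied together with a few lines of Python. This is only a sketch: the function name is hypothetical, the sample text assumes the usual "release X.Y" wording in the nvcc --version output, and the resulting name follows the cupy-cudaXY pattern used in this thread (cupy-cuda102, cupy-cuda91). Newer CuPy releases name their packages differently, so check the CuPy install guide for current versions.

```python
import re


def cupy_package_for(nvcc_output):
    """Guess the cupy-cudaXY package name from `nvcc --version` output.

    Returns None if no "release X.Y" string is found.
    """
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    major, minor = match.groups()
    return "cupy-cuda" + major + minor


# Sample line in the format nvcc normally prints (assumed, not from this thread).
sample = "Cuda compilation tools, release 9.1, V9.1.85"
print(cupy_package_for(sample))
```

To use the real output, pass in the text returned by subprocess.run(["nvcc", "--version"], ...) in place of the sample string.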


The output from the benchmarking is:

Code: Select all

Total Array to be Benchmarked: 1000000
Numpy & CPU operation to create array took 0.0014395713806152344
CuPy & GPU operation to create array took 0.3502366542816162
Numpy & CPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.0034983158111572266
CuPy & GPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.6336479187011719
Finished


Total Array to be Benchmarked: 15625000
Numpy & CPU operation to create array took 0.025122404098510742
CuPy & GPU operation to create array took 0.007489442825317383
Numpy & CPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.06829309463500977
CuPy & GPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.00591588020324707
Finished


Total Array to be Benchmarked: 125000000
Numpy & CPU operation to create array took 0.16916680335998535
CuPy & GPU operation to create array took 0.012218236923217773
Numpy & CPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.5224223136901855
CuPy & GPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.04356884956359863
Finished


Total Array to be Benchmarked: 229220928
Numpy & CPU operation to create array took 0.3057739734649658
CuPy & GPU operation to create array took 0.0193636417388916
Numpy & CPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.9581973552703857
CuPy & GPU operation to multiple the array by 5, multiple the array by itself and add the array to itself took 0.07995939254760742
Finished
Note that the same hardware as in the Windows 7 test cannot handle the largest array; I had to reduce it slightly (to 612 x 612 x 612 = 229220928 elements). I'm not sure whether this is purely down to the CUDA/driver version or something else.
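
Some back-of-the-envelope arithmetic shows how close these arrays are to the card's limit. The sketch assumes 8 bytes per element (np.ones and cp.ones default to float64) and that the GTX670's 2048Mb means 2048 MiB; how much of that is actually free depends on the driver and whatever else is using the GPU, which I have not measured.

```python
def array_bytes(side, itemsize=8):
    """Bytes needed for a side x side x side array of `itemsize`-byte elements."""
    return side ** 3 * itemsize


# Assumption: 2048 MiB of GPU memory in total, not all of it free.
CARD_BYTES = 2048 * 1024 ** 2

for side in (612, 625):
    size = array_bytes(side)
    share = size / CARD_BYTES
    print("side %d: %d bytes, %.1f%% of the card" % (side, size, 100 * share))
```

Because the benchmark works in place, the array itself is the main allocation, so a difference of a few percent in available memory between the Windows and Linux setups would be enough to explain why 625 fits on one and not the other.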

Edit: Corrected minimum Python version for Linux CuPy
Kunda1
Veteran
Posts: 13434
Joined: Thu Jan 05, 2017 9:03 pm

Re: [Investigation/Benchmarking] The use of CuPy over NumPy where applicable

Post by Kunda1 »

Great thread! Thank you for doing this and presenting your results :D