Speed up FC with SIMD NEON instructions when compiling on ARM (Raspberry PI4)
Forum rules
Be nice to others! Respect the FreeCAD code of conduct!
Hi mates,
I'm not (at all) an expert in compiling, and I did not find any relevant thread on the forum about NEON instructions, so I'm asking the compiling experts:
- 1: Are NEON instructions already enabled in gcc when compiling FreeCAD on an ARM platform?
- 2: If not, do you think it's possible to enable NEON when compiling?
- 3: Does it make sense for a huge code base with a lot of dependencies like FreeCAD?
AFAIK NEON instructions could increase CPU performance and let FreeCAD run much faster on ARM platforms with the ARMv8-A instruction set, like the Raspberry Pi 4.
Thanks for your opinions and advice.
Last edited by -alex- on Thu Jan 27, 2022 3:15 pm, edited 2 times in total.
Re: Enable NEON instructions when compiling on ARM (Raspberry PI4)
I think you could try to use "readelf -A" to check that. Raspbian provides freecad.
1. No idea, use readelf. Probably not.
2. Plenty of info here, it's worth trying [1] if you're into compiling on rpi4.
3. Run "time FreeCAD -t 0" for both versions and you'll have an idea whether it makes any difference.
[1] https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html
Reference thread: about FreeCAD on rpi4 https://forum.freecadweb.org/viewtopic. ... &start=170
Re: Enable NEON instructions when compiling on ARM (Raspberry PI4)
Thanks for your help, much appreciated.
Do you mean something like this?
Code: Select all
pi@raspberrypi:~ $ cd freecad-build/bin/
pi@raspberrypi:~/freecad-build/bin $ readelf -a FreeCAD > readelf_-a_FreeCAD
Code: Select all
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: DYN (Shared object file)
Machine: AArch64
Version: 0x1
Entry point address: 0x7da0
Start of program headers: 64 (bytes into file)
Start of section headers: 95696 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 9
Size of section headers: 64 (bytes)
Number of section headers: 31
Section header string table index: 30
.........
So, does this mean SIMD (NEON) is not enabled in the FreeCAD binary, or am I wrong?
Last edited by -alex- on Tue Jun 30, 2020 10:13 pm, edited 1 time in total.
Re: Enable NEON instructions when compiling on ARM (Raspberry PI4)
I'm not a specialist, so take your guess, but I think you're right. Do you compile freecad? I can help with that.
P.S. try to run it with LANG or LC_ALL,
Code: Select all
LANG=C readelf -a FreeCAD
or
Code: Select all
LC_ALL=C readelf -a FreeCAD
Re: Enable NEON instructions when compiling on ARM (Raspberry PI4)
That's OK for now. I'm not an expert though, so I'll ask you if I need help; your offer is much appreciated, thank you.
I have replaced the readelf text file above with the English version (my bad, I forgot to delete some French locales; with dpkg-reconfigure locales and en_US.UTF-8 only it was OK...).
On the web some people say SIMD is mandatory with aarch64 (stackoverflow...), and the doc seems to say pretty much the same as far as Armv8-A is concerned: https://gcc.gnu.org/onlinedocs/gcc/AArc ... ions.html
Code: Select all
arch value    Architecture    Includes by default
‘armv8-a’     Armv8-A         ‘+fp’, ‘+simd’
Does this mean SIMD is enabled when gcc compiles FreeCAD on an aarch64 system?...
My English and compiling skills are too low to fully understand all this stuff, but I try....
NEON allows programs to run faster, so I want to check whether NEON instructions are enabled when FreeCAD is compiled on a Raspberry Pi 4.
If they are not, maybe there is additional power available on the RPI4...
Now I have to investigate a bit more about SIMD NEON and the AArch64 architecture; if anyone is familiar with ARM and aarch64, any information is very much appreciated.
Re: Enable NEON instructions when compiling on ARM (Raspberry PI4)
So, I'm still wondering about possible ways to speed up FreeCAD on ARM architectures (or any other) by taking advantage of SIMD (Single Instruction, Multiple Data), i.e. NEON instructions on the RPI4.
IMHO that's an important challenge because ARM devices are more and more popular, powerful and energy efficient.
E.g. with a Raspberry Pi 4 the user can run FreeCAD pretty well, noiselessly, with a power consumption of about 7.5 W. The RPI4 is easy to move, easy to share, convenient for learning. That's a very suitable system for makers.
Actually I have no idea whether SIMD instructions (or vector instructions) are already enabled in the gcc compiler when compiling FreeCAD. But I don't think so; it seems to me that SIMD is not used extensively because there is no topic about it on this forum.
Remember, I'm an end user, not an expert, please be kind. But do not hesitate to give me your opinion and to tell me if I'm wrong.
AFAIK there are several ways to speed up FreeCAD:
1. improve the code: not easy
2. use a low level language: C++ is already used
3. use multi-threading: not easy and already used for some features or libs (Part boolean, Calculix,..?)
4. take advantage of SIMD: that's the point!
Point 4 is maybe a way to speed things up IMHO. But programming SIMD code by hand is difficult, and such code is not portable. In addition, third-party libraries are out of scope.
So, an easier way seems to be using gcc compiler options to enable automatic simdization of the "standard" code, but some people say that's not really efficient.
However, maybe there is a better way for automatic simdization with "partial SIMD parallelism", PArtial VEctorizeR "PARVER".
Here is a thesis "Compiler techniques for improving SIMD parallelism" by Hao Zhou http://unsworks.unsw.edu.au/fapi/datast ... ?view=true
BTW developers will understand this thesis better than I do.
Developers: do you think such a SIMD strategy makes sense for FreeCAD and third-party libs?
Does it make sense to compile some critical libs (OCC, Coin, Gmsh, Calculix) and then the FreeCAD main code with a partial-simdization compiler?
Is it possible to implement such partial vectorizer techniques? Or is it rocket science?
Re: Speedup FC with SIMD NEON instructions when compiling on ARM (Raspberry PI4)
Disclaimer: I am a developer familiar w/ x86, ARM and low-level optimization, however I am not a FreeCAD developer.
Simply compiling FreeCAD on ARM with an appropriate compiler and appropriate compiler flags may enable some level of SIMD NEON usage. However, well designed, built and shipped software usually does not do this. Why? Because when delivering binary packages you can't know at compile time exactly what CPU instructions will be supported by the execution CPU.
Does your ARM support ARMv7t, ARMv8t, Thumb, Thumb2, NEON extensions, etc? There's no way to predict in advance, therefore most compilers, by default, generate code for the lowest common denominator instruction set to guarantee that the application doesn't (1) die with SIGILLs or (2) rely on kernel-level instruction emulation to trap the SIGILL exception and emulate the instruction in software, which is really slow.
The way this is optimally done is with a run-time dispatcher that selects optimal implementations of various performance critical functions for the run-time detected CPU features. This makes for a more complicated compile environment and a lot of test scenarios. There are libraries, such as IPP or Oil or Framewave, that help do some of this for you.
High performance software, such as x265 or x264, libvp8, libvp9 typically use this kind of approach and therefore it makes it more difficult and error prone to port those applications to other architectures. Last I tried x265 on ARM it was non-functional.
Furthermore, it's not always sensible to approach CPU optimization, especially for graphics intensive programs. Again, without knowing where the hot spots internally are in FreeCAD I wouldn't even want to predict that NEON could help make it faster.
In summary, doing what you are asking about properly is not an easy initial lift and takes a bunch of work to maintain. Frankly, it would be better if the main framework library (specifically Qt in this instance) were to do it for us. Given Qt's place as an automotive infotainment library, I wouldn't be surprised if eventually it did. Furthermore, it may not even be sensible to do what you are suggesting as it may not make a noticeable difference.
Completely separately, there's a number of big architectural differences between a Raspberry Pi, an M1 Mac and an x86-64 machine that also impact performance. Your ask presumes that your performance concern is a CPU-bound issue, which there's no particular reason to believe is the case.
Hope this helps understand some of the technical nuances in your observations.
Re: Speedup FC with SIMD NEON instructions when compiling on ARM (Raspberry PI4)
jmdzampieron wrote: ↑Fri Nov 27, 2020 12:55 am Disclaimer: I am a developer familiar w/ x86, ARM and low-level optimization, however I am not a FreeCAD developer.
Thank you for your reply and for giving your feedback on this topic.
jmdzampieron wrote: Simply compiling FreeCAD on ARM with an appropriate compiler and appropriate compiler flags may enable some level of SIMD NEON usage. However, well designed, built and shipped software usually does not do this. Why? Because when delivering binary packages you can't know at compile time exactly what CPU instructions will be supported by the execution CPU.
OK, I get it: that's not a portable approach because it's very architecture-specific.
jmdzampieron wrote: The way this is optimally done is with a run-time dispatcher ..... High performance software, such as x265 or x264, libvp8, libvp9 typically use this kind of approach and therefore it makes it more difficult and error prone to port those applications to other architectures. Last I tried x265 on ARM it was non-functional.
OK, if you didn't succeed, I certainly won't either.
jmdzampieron wrote: Furthermore, it's not always sensible to approach CPU optimization, especially for graphics intensive programs. Again, without knowing where the hot spots internally are in FreeCAD I wouldn't even want to predict that NEON could help make it faster.
That is the point: can we expect some speed-up or not? So, you would rather say "no".
jmdzampieron wrote: In summary, doing what you are asking about properly is not an easy initial lift and takes a bunch of work to maintain. Frankly, it would be better if the main framework library (specifically Qt in this instance) were to do it for us. Given Qt's place as an automotive infotainment library, I wouldn't be surprised if eventually it did. Furthermore, it may not even be sensible to do what you are suggesting as it may not make a noticeable difference.
Hmm, that makes me feel like giving up.
jmdzampieron wrote: Completely separately, there's a number of big architectural differences between a Raspberry Pi, an M1 Mac and an x86-64 machine that also impact performance. Your ask presumes that your performance concern is a CPU-bound issue, which there's no particular reason to believe is the case.
Well, I'm an end user, my observations are pretty basic:
- graphics on the RPI4 with FreeCAD are pretty fast with the Broadcom V3D 4.2 OpenGL driver: OK
- recomputing a medium or large 3D model can be slow
- FreeCAD is single-threaded most of the time
- when I recompute in FreeCAD I can see in the task manager that one CPU is at 25% and the three others are at 0%: that makes me wonder how to improve that
Because on one hand I know FreeCAD will not easily become fully multi-threaded, and on the other hand I read that SIMD could speed up CPU performance, I was wondering about the opportunity to use SIMD when compiling FreeCAD and third-party libraries on the RPI4.
jmdzampieron wrote: Hope this helps understand some of the technical nuances in your observations.
Yes it does, thank you. So, if I understand correctly, NEON instructions are more suitable for very specific algorithms or "small" programs dedicated to a specific platform.
Maybe not so suitable as far as FreeCAD is concerned, right?
But maybe I could try to compile with NEON myself anyway, starting with a third-party lib, e.g. by compiling the Calculix FEM solver with NEON? Do you think that would make sense?
Calculix is CPU-intensive and already multi-threaded, so it's not so bad as it is. But could that be a first step to test whether SIMD NEON speeds it up or not on the RPI4?
Do you think it's a difficult task?
Thank you for your opinion about this last point; do not hesitate to tell me if you think it is a waste of time.
Re: Speedup FC with SIMD NEON instructions when compiling on ARM (Raspberry PI4)
wmayer wrote: ping
I still have no clear idea where the bottlenecks are in FC (FC, OCC, ...), and whether using SIMD would be useful or not.
It seems some libs allow writing SIMD code which is portable across multiple platforms, e.g.:
https://github.com/aff3ct/MIPP
Werner, may I ask your opinion about a SIMD strategy to speed up FreeCAD? Would it make sense?
Others can reply as well, any opinion is welcome.
Re: Speed up FC with SIMD NEON instructions when compiling on ARM (Raspberry PI4)
I don't know much about the low-level CPU stuff. But in order to see if SIMD/NEON is available you can write your own little test application with the compiler options that were suggested by the SO article.
So, put this content to a file main.cpp (see also https://gcc.gnu.org/projects/tree-ssa/v ... ation.html)
Code: Select all
int a[256], b[256], c[256];
void foo () {
    int i;
    for (i=0; i<256; i++){
        a[i] = b[i] + c[i];
    }
}
int main()
{
    foo();
    return 0;
}
and build with
Code: Select all
g++ -O3 main.cpp
or
Code: Select all
g++ -O2 -ftree-vectorize main.cpp
Then run
Code: Select all
readelf -a a.out
When building FreeCAD and you open the GUI version of cmake you should make sure that the build type (CMAKE_BUILD_TYPE) is "Release". When you check the variables CMAKE_CXX_FLAGS_RELEASE or CMAKE_C_FLAGS_RELEASE then they probably already include the option -O3