Speed up FC with SIMD NEON instructions when compiling on ARM (Raspberry PI4)

Post by **-alex-** » Sun Jun 21, 2020 9:33 am

Hi mates,

I'm not (at all) an expert in compiling, and I did not find any relevant thread on the forum about NEON instructions, so I'm asking the compiling experts:

- 1: Are NEON instructions already enabled in gcc when compiling FreeCAD on an ARM platform?
- 2: If not, do you think it's possible to enable NEON when compiling?
- 3: Does it make sense for a huge codebase with a lot of dependencies like FreeCAD?

AFAIK NEON instructions would increase CPU performance and make FreeCAD run much faster on ARM platforms with the ARMv8-A instruction set, like the Raspberry Pi 4.

Thanks for your opinions and advice.

Post by **PrzemoF** » Tue Jun 23, 2020 12:21 pm

I think you could try to use "readelf -A" to check that. Raspbian provides freecad.

1. No idea, use readelf. Probably not.
2. Plenty of info here; it's worth trying [1] if you're into compiling on rpi4.
3. Run "time FreeCAD -t 0" for both versions and you'll have an idea whether it makes any difference.

[1] https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html

Reference thread about FreeCAD on rpi4: https://forum.freecadweb.org/viewtopic. ... &start=170

Post by **-alex-** » Wed Jun 24, 2020 8:34 pm

PrzemoF wrote: ↑Tue Jun 23, 2020 12:21 pm I think you could try to use " readelf -A" to check that. Raspbian provides freecad.

Thanks for your help, much appreciated.
Do you mean something like this?

Code: Select all

pi@raspberrypi:~ $ cd freecad-build/bin/
pi@raspberrypi:~ $ readelf -a FreeCAD > readelf_-a_FreeCAD

Attachment: readelf_-a_FreeCAD_en.txt (93.31 KiB)

(Sorry: the first version was in French because French locales were active when it was generated...) Edit 2020/07/01: English version

Code: Select all

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Shared object file)
  Machine:                           AArch64
  Version:                           0x1
  Entry point address:               0x7da0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          95696 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         9
  Size of section headers:           64 (bytes)
  Number of section headers:         31
  Section header string table index: 30

  .........

There is no string containing "simd", "neon" or "mfpu" in this output.
So, does this mean SIMD (NEON) is not enabled in the FreeCAD binary, or am I wrong?

Post by **PrzemoF** » Thu Jun 25, 2020 6:42 am

I'm not a specialist, so take your guess, but I think you're right. Do you compile freecad? I can help with that.

P.S. Try running it with LANG or LC_ALL set:

Code: Select all

LANG=C readelf -a FreeCAD

or

Code: Select all

LC_ALL=C readelf -a FreeCAD

Post by **-alex-** » Tue Jun 30, 2020 10:50 pm

PrzemoF wrote: ↑Thu Jun 25, 2020 6:42 am Do you compile freecad? I can help with that.

That's OK for now. I'm not an expert though, so I'll ask you if I need help. Your offer is much appreciated, thank you.

P.S. try to run it with LANG or LC_ALL,

I have replaced the readelf text file above with the English version (my bad, I forgot to delete some French locales; with dpkg-reconfigure locales and en_US.UTF-8 only, it was OK...).

On the web some people say SIMD is mandatory on aarch64 (stackoverflow...), and it seems the doc says pretty much the same as far as Armv8-A is concerned: https://gcc.gnu.org/onlinedocs/gcc/AArc ... ions.html

Code: Select all

arch value   Architecture   Includes by default
‘armv8-a’    Armv8-A        ‘+fp’, ‘+simd’

However, on stackoverflow they talk about passing some "-O3 or -O2 -ftree-vectorize" arguments, and that is not clear to me.

Does this mean SIMD is enabled when gcc compiles FreeCAD on an aarch64 system?...
My English and compiling skills are too low to fully understand all this stuff, but I try....
NEON allows programs to run faster, so I want to check whether NEON instructions are enabled when FreeCAD is compiled on a Raspberry Pi 4.
Otherwise maybe there is some additional power available on the RPI4...
Now I have to investigate SIMD NEON and the AArch64 architecture a bit more. If anyone is familiar with ARM and aarch64, any information is very much appreciated.
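One way to approach the question above is to ask the compiler itself which target features it assumes. The following is a minimal sketch, not something taken from FreeCAD: the function name `simd_target_summary` is hypothetical, while `__aarch64__`, `__arm__` and `__ARM_NEON` are predefined macros from the ARM C Language Extensions and `__OPTIMIZE__` is a GCC predefined macro. Note that it only shows whether NEON is *available* to the compiler for this build; whether loops are actually auto-vectorized still depends on the optimization flags discussed in the thread.

```cpp
#include <string>

// Hypothetical helper: summarizes what the compiler assumes about the
// target at compile time. It does not inspect the CPU it runs on.
std::string simd_target_summary() {
    std::string s;
#if defined(__aarch64__)
    s += "target: AArch64\n";
#elif defined(__arm__)
    s += "target: 32-bit ARM\n";
#else
    s += "target: not ARM\n";
#endif
#if defined(__ARM_NEON)
    // Defined when the compiler is allowed to emit NEON instructions.
    s += "NEON: available to the compiler\n";
#else
    s += "NEON: not available to the compiler\n";
#endif
#if defined(__OPTIMIZE__)
    // GCC defines this macro whenever optimization is turned on.
    s += "optimization: on\n";
#else
    s += "optimization: off\n";
#endif
    return s;
}
```

Printing the returned string from a small `main`, built with the same flags as the project, gives a quick answer without reading the readelf output.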

Post by **-alex-** » Thu Nov 19, 2020 12:17 am

So, I'm still wondering about some possibilities to speed up FreeCAD on ARM architectures (or whatever...) and take advantage of SIMD (Single Instruction Multiple Data), i.e. NEON instructions for RPI4.

IMHO that's an important challenge because ARM devices are more and more popular, powerful and energy efficient.
E.g. with a Raspberry Pi 4 the user can run FreeCAD pretty well, noiselessly, with a power consumption of about 7.5 W. The RPI4 is easy to move, easy to share, and convenient for learning. That's a very suitable system for makers.

Actually I have no idea whether SIMD instructions (or vector instructions) are already enabled in the gcc compiler while compiling FreeCAD. But I don't think so; it seems to me that SIMD is not used extensively because there is no topic about it on this forum.

Remember, I'm an end user, not an expert, please be kind. But do not hesitate to give me your opinion and to tell me if I'm wrong.
AFAIK there are several ways to speed up FreeCAD:

1. improve the code: not easy
2. use a low-level language: C++ is already used
3. use multi-threading: not easy and already used for some features or libs (Part boolean, Calculix,..?)
4. take advantage of SIMD: that's the point!

Point 4 is maybe a way to speed things up IMHO. But hand-writing SIMD code is difficult, and such code is not portable. In addition, third party libraries are out of scope.
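To make the portability concern concrete, here is a small sketch of what hand-written NEON code can look like. It is illustrative only and not taken from FreeCAD or any of its libraries; the function name `add_arrays` is hypothetical. The intrinsics `vld1q_s32`, `vaddq_s32` and `vst1q_s32` come from `<arm_neon.h>`, and the block falls back to a plain loop when the compiler does not define `__ARM_NEON`, which is exactly the kind of duplication that makes such code harder to maintain.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: a[i] = b[i] + c[i] for i in [0, n).
void add_arrays(std::int32_t* a, const std::int32_t* b,
                const std::int32_t* c, std::size_t n) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // NEON path: each iteration adds four 32-bit integers at once.
    for (; i + 4 <= n; i += 4) {
        const int32x4_t vb = vld1q_s32(b + i);
        const int32x4_t vc = vld1q_s32(c + i);
        vst1q_s32(a + i, vaddq_s32(vb, vc));
    }
#endif
    // Scalar path: handles the leftover elements, or the whole array
    // when the NEON path above was not compiled in.
    for (; i < n; ++i) {
        a[i] = b[i] + c[i];
    }
}
```

Both paths produce the same result, so the same source builds on x86 and ARM; only the ARM build uses the explicit vector instructions.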

So, an easier way seems to be using gcc compiler options to enable automatic vectorization ("simdization") of the "standard" code, but some people say that's not really efficient.
However, maybe there is a better way to do automatic simdization with "partial SIMD parallelism": the PArtial VEctorizeR, "PARVER".

Here is a thesis "Compiler techniques for improving SIMD parallelism" by Hao Zhou http://unsworks.unsw.edu.au/fapi/datast ... ?view=true
BTW developers will understand this thesis better than me.

Developers: do you think such a SIMD strategy makes sense for FreeCAD and third party libs?
Does it make sense to compile some critical libs (OCC, Coin, Gmsh, Calculix) and then the FreeCAD main code with a partial simdization compiler?
Is it possible to implement such partial vectorizer techniques? Or is it rocket science?

Post by **jmdzampieron** » Fri Nov 27, 2020 12:55 am

Disclaimer: I am a developer familiar w/ x86, ARM and low-level optimization, however I am not a FreeCAD developer.

Simply compiling FreeCAD on ARM with an appropriate compiler and appropriate compiler flags may enable some level of SIMD NEON usage. However, well designed, built and shipped software usually does not do this. Why? Because when delivering binary packages you can't know at compile time exactly what CPU instructions will be supported by the execution CPU.

Does your ARM support ARMv7t, ARMv8t, Thumb, Thumb2, NEON extensions, etc? There's no way to predict this in advance. Therefore most compilers, by default, generate code for the lowest common denominator instruction set to guarantee that the application doesn't (1) die with SIGILLs or (2) rely on kernel level instruction emulation to trap the SIGILL exception and emulate the instruction in software, which is really slow.

The way this is optimally done is with a run-time dispatcher that selects optimal implementations of various performance critical functions for the run-time detected CPU features. This makes for a more complicated compile environment and a lot of test scenarios. There are libraries, such as IPP or Oil or Framewave, that help do some of this for you.
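A run-time dispatcher of the kind described above could be sketched as follows. This is a simplified illustration under stated assumptions, not how any particular library does it: all function names are hypothetical, `sum_wide` is only a stand-in for a real hand-tuned NEON kernel, and the CPU check uses `getauxval(AT_HWCAP)` with `HWCAP_ASIMD`, which is specific to Linux on AArch64. On any other platform the sketch simply selects the baseline.

```cpp
#include <cstddef>
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP
#include <asm/hwcap.h>  // HWCAP_ASIMD
#endif

using SumFn = long (*)(const int*, std::size_t);

// Baseline implementation that runs on any CPU.
long sum_scalar(const int* v, std::size_t n) {
    long s = 0;
    for (std::size_t i = 0; i < n; ++i) s += v[i];
    return s;
}

// Stand-in for a tuned kernel: four independent accumulators, the
// shape a SIMD implementation would typically have.
long sum_wide(const int* v, std::size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    long s = s0 + s1 + s2 + s3;
    for (; i < n; ++i) s += v[i];
    return s;
}

// Asks the kernel at run time whether the CPU reports Advanced SIMD.
bool cpu_has_asimd() {
#if defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP) & HWCAP_ASIMD) != 0;
#else
    return false;  // assumption: no detection outside Linux/AArch64
#endif
}

// Picks an implementation once; callers keep the returned pointer.
SumFn select_sum() {
    return cpu_has_asimd() ? sum_wide : sum_scalar;
}
```

The application would call `select_sum()` once at start-up and use the returned pointer afterwards, so the check is not repeated in the hot loop. Every dispatched function needs at least two implementations and tests for both, which is the maintenance cost mentioned above.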

High performance software, such as x265 or x264, libvp8, libvp9 typically use this kind of approach and therefore it makes it more difficult and error prone to port those applications to other architectures. Last I tried x265 on ARM it was non-functional.

Furthermore, it's not always sensible to approach CPU optimization, especially for graphics intensive programs. Again, without knowing where the hot spots internally are in FreeCAD I wouldn't even want to predict that NEON could help make it faster.

In summary, what you are asking about is, done properly, not an easy initial lift to get going, and it takes a bunch of work to maintain. Frankly, it would be better if the main framework library (specifically Qt in this instance) were to do it for us. Given Qt's place as an automotive infotainment library, I wouldn't be surprised if eventually it did. Furthermore, it may not even be sensible to do what you are suggesting, as it may not make a noticeable difference.

Completely separately, there are a number of big architectural differences between a Raspberry Pi, an M1 Mac and an x86-64 machine that also impact performance. Your ask presumes that your performance concern is a CPU-bound issue, which there's no particular reason to believe is the case.

Hope this helps understand some of the technical nuances in your observations.

Post by **-alex-** » Fri Nov 27, 2020 12:12 pm

jmdzampieron wrote: ↑Fri Nov 27, 2020 12:55 am Disclaimer: I am a developer familiar w/ x86, ARM and low-level optimization, however I am not a FreeCAD developer.

Thank you for your reply and for giving your feedback on this topic.

Simply compiling FreeCAD on ARM with an appropriate compiler and appropriate compiler flags may enable some level of SIMD NEON usage. However, well designed, built and shipped software usually does not do this. Why? Because when delivering binary packages you can't know at compile time exactly what CPU instructions will be supported by the execution CPU.

OK, I get it: that's not a portable approach because it's very architecture specific.

The way this is optimally done is with a run-time dispatcher .....High performance software, such as x265 or x264, libvp8, libvp9 typically use this kind of approach and therefore it makes it more difficult and error prone to port those applications to other architectures. Last I tried x265 on ARM it was non-functional.

OK, if you didn't succeed, I certainly won't either.

Furthermore, it's not always sensible to approach CPU optimization, especially for graphics intensive programs. Again, without knowing where the hot spots internally are in FreeCAD I wouldn't even want to predict that NEON could help make it faster.

That is the point: can we expect some speed-up or not? So you would rather say "no".

In summary, what you are asking about is, done properly, not an easy initial lift to get going, and it takes a bunch of work to maintain. Frankly, it would be better if the main framework library (specifically Qt in this instance) were to do it for us. Given Qt's place as an automotive infotainment library, I wouldn't be surprised if eventually it did. Furthermore, it may not even be sensible to do what you are suggesting, as it may not make a noticeable difference.

Hmm, that makes me feel like giving up.

Completely separately, there are a number of big architectural differences between a Raspberry Pi, an M1 Mac and an x86-64 machine that also impact performance. Your ask presumes that your performance concern is a CPU-bound issue, which there's no particular reason to believe is the case.

Well, I'm an end user, so my observations are pretty basic:
- graphics on the RPI4 with FreeCAD are pretty fast with the Broadcom V3D 4.2 OpenGL driver: OK
- recomputing a medium or large 3D model can be slow
- FreeCAD is single-threaded most of the time
- when I recompute in FreeCAD I can see in the task manager that 1 CPU is at 25% and the 3 other CPUs are at 0%: that makes me wonder how to improve that

Because on one hand I know FreeCAD will not easily become fully multi-threaded, and because on the other hand I read that SIMD could speed up CPU performance, I was wondering about the opportunity to use SIMD when compiling FreeCAD and third party libraries on the RPI4.

Hope this helps understand some of the technical nuances in your observations.

Yes it does, thank you. So, if I understand correctly, NEON instructions are more suitable for very specific algorithms or "small" programs dedicated to a specific platform.
Maybe not so suitable as far as FreeCAD is concerned, right?
But maybe I could try to compile with NEON by myself anyway, starting with a third party lib, e.g. by compiling the Calculix FEM solver with NEON? Do you think that would make sense?
Calculix is CPU-expensive but multi-threaded, so it's not so bad already. But that could be a first step to test whether SIMD NEON speeds it up or not on the RPI4?
Do you think it's a difficult task?
Thank you for your opinion about this last point; do not hesitate to tell me if you think it is a waste of time.

Post by **-alex-** » Thu Jan 27, 2022 2:36 pm

wmayer wrote: ping

I still have no strong idea where the bottlenecks are in FC (FC, OCC,...),
and whether using SIMD would be useful or not so much.
It seems some libs allow writing SIMD code which is portable across multiple platforms. E.g.:
https://github.com/aff3ct/MIPP
Werner, may I ask your opinion about a SIMD strategy to speed up FreeCAD? Would it make sense?

Others can reply as well, any opinion welcome.

Post by **wmayer** » Thu Jan 27, 2022 5:00 pm

I don't know much about the low-level CPU stuff. But in order to see if SIMD/NEON is available you can write your own little test application with the compiler options that were suggested by the SO article.

So, put this content into a file main.cpp (see also https://gcc.gnu.org/projects/tree-ssa/v ... ation.html)

Code: Select all

int a[256], b[256], c[256];
void foo () {
  int i;

  for (i=0; i<256; i++){
    a[i] = b[i] + c[i];
  }
}

int main()
{
    foo();
    return 0;
}

and build with

Code: Select all

g++ -O3 main.cpp

or

Code: Select all

g++ -O2 -ftree-vectorize main.cpp

Then run

Code: Select all

readelf -a a.out

When building FreeCAD, open the GUI version of cmake and make sure that the build type (CMAKE_BUILD_TYPE) is "Release". If you check the variables CMAKE_CXX_FLAGS_RELEASE or CMAKE_C_FLAGS_RELEASE, they probably already include the option -O3.
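To see whether the different flags make a measurable difference, the same kind of loop can be timed. The following is only a rough sketch with hypothetical names (`time_vector_add` and its parameters are not from the thread); it uses `std::chrono::steady_clock` from the standard library and returns a checksum so that the compiler cannot drop the loop entirely. The measured times depend on the machine and the flags, so no expected numbers are given here.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs a[i] = b[i] + c[i] over n elements
// `repeats` times and returns the elapsed time in seconds.
// n must be greater than zero.
double time_vector_add(std::size_t n, int repeats, long long& checksum) {
    std::vector<int> a(n), b(n, 1), c(n, 2);
    checksum = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        const std::size_t k = static_cast<std::size_t>(r) % n;
        b[k] += 1;  // change one input so every pass has to be recomputed
        for (std::size_t i = 0; i < n; ++i) {
            a[i] = b[i] + c[i];
        }
        checksum += a[k];  // use the result so the work is observable
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Building one small program around this function once with `-O2` and once with `-O3`, and calling it with a large `n` and a few hundred repeats, gives a first impression of what auto-vectorization brings for this very simple loop. It says nothing about FreeCAD itself, whose hot spots may look quite different.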
