
Speed up Python on embedded systems

Have you ever considered using Python as a scripting language on an embedded system?
I've been using it on recent projects, although it wasn't my first choice.

Whenever I had to choose a scripting language for embedded use, I had a strong preference for shell/bash or Lua, because they are either built in or designed to have a significantly lower footprint than the alternatives.

Nevertheless, the choice was python3 (the decision was out of my hands).

When putting together the first builds using YOCTO I realized that there are two sides to python.
  • the starting phase, where the app is initializing
  • the execution phase, where the app just processes new data
In the second phase python3 offers a good tradeoff between maintainability of the code and execution speed, so there is nothing to moan about.

Startup is the worst


But the first phase, where the python3 interpreter starts up, is really bad.
So I did some research into where this might be coming from.
To show how bad things are, here is a quick comparison (using perf to average over 10 runs).

A simple test in shell

perf stat -r 10 -B echo "Hello world"     
Performance counter stats for 'echo Hello world' (10 runs):
         37.622565      task-clock (msec)         #    0.881 CPUs utilized            ( +-  4.76% )
                10      context-switches          #    0.274 K/sec                    ( +-  5.02% )
                 0      cpu-migrations            #    0.000 K/sec
                57      page-faults               #    0.002 M/sec                    ( +-  0.67% )
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses
           0.04272 +- 0.00169 seconds time elapsed  ( +-  3.96% )

vs. a simple test in python

perf stat -r 10 -B python3 -c 'print("hello world")'
Performance counter stats for 'python3 -c print("hello world")' (10 runs):
        951.188380      task-clock (msec)         #    0.971 CPUs utilized            ( +-  2.51% )
               138      context-switches          #    0.145 K/sec                    ( +- 14.66% )
                 0      cpu-migrations            #    0.000 K/sec                 
               673      page-faults               #    0.707 K/sec                    ( +-  0.79% )
   <not supported>      cycles                                                     
   <not supported>      instructions                                               
   <not supported>      branches                                                   
   <not supported>      branch-misses                                             
            0.9801 +- 0.0299 seconds time elapsed  ( +-  3.05% )

These are worlds apart: about 0.98 s for python3 versus about 0.04 s for the shell, roughly a factor of 23!
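Where perf is not available on the target, a similar average can be approximated from Python itself. This is only a minimal sketch: the helper name time_startup is made up for illustration, and the numbers include the cost of spawning the child process from the measuring script, so they are not directly comparable to the perf output above.

```python
import statistics
import subprocess
import sys
import time


def time_startup(args, runs=10):
    """Return (mean, stdev) of the wall-clock seconds needed to run args."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the child's output, only the elapsed time is of interest.
        subprocess.run(args, check=True, stdout=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    mean, dev = time_startup([sys.executable, "-c", 'print("hello world")'])
    print("python3: %.4f +- %.4f seconds" % (mean, dev))
```

The sketch avoids f-strings on purpose so that it also runs on a python3.5 interpreter.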


Ideas welcome...

There are some very nice ideas to be found on the web.
They all have their purpose, but they did not work for me, because they mostly require major code changes to gain some speed.
Most of the code I use comes from third parties (PyPI, GitHub), and the chance of getting major changes merged upstream tends to zero, especially for a corner case like an embedded system.

It's all about the I/Os

So what's left to gain at least a little speed? After running strace a few times I noticed massive I/O going on while the interpreter starts up (~1k reads for a simple print()).
There have to be some options to minimize these I/O calls...
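To put a number on the I/O, the strace log can be tallied per syscall. The following is a rough sketch, not the method used for the figure above: count_syscalls and summarize are hypothetical helper names, and the pattern assumes the usual strace line layout of an optional PID prefix followed by the syscall name and an opening parenthesis.

```python
import collections
import re

# Syscall name at the start of a line; when strace follows child processes
# each line carries a PID prefix, which is skipped here.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_0-9]+)\(")


def count_syscalls(lines):
    """Count syscalls by name from an iterable of strace output lines."""
    counts = collections.Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def summarize(path, top=5):
    """Return the most frequent syscalls found in the strace log at path."""
    with open(path, errors="replace") as handle:
        return count_syscalls(handle).most_common(top)
```

A log could be recorded with something like strace -f -o trace.log python3 -c 'print()' (trace.log is just an example name) and then passed to summarize("trace.log").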

RTFM

I started digging deep into the Python documentation and found these command line switches:

-B
Don't write .py[co] files on import. See also PYTHONDONTWRITEBYTECODE.

-S
Disable the import of the module site and the site-dependent manipulations of sys.path that it entails.

-u
Force stdin, stdout and stderr to be totally unbuffered. On systems where it matters, also put stdin, stdout and stderr in binary mode. Note that there is internal buffering in xreadlines(), readlines() and file-object iterators ("for line in sys.stdin") which is not influenced by this option. To work around this, you will want to use "sys.stdin.readline()" inside a "while 1:" loop.
(This wording is quoted from the Python 2 manual; xreadlines() no longer exists in Python 3, where -u is about leaving stdout and stderr unbuffered.)


Hacking the world a little into the right direction

So I started replacing the default interpreter (/usr/bin/python) with a small shell script like this:

#!/bin/sh
/usr/bin/python3.5 -B -u -S "$@"

This heavily hacks the Python import system: with -S the site module is not imported, so "site-packages" is never added to sys.path. The wrapper therefore only works on a system that has nothing installed there. But mine did have packages there...
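What the switches do to a running interpreter can be checked from a child process. This is a small sketch with assumptions: probe is a made-up helper, and it starts sys.executable instead of the /usr/bin/python3.5 path used in the wrapper.

```python
import subprocess
import sys

# One-liner executed in the child: report the two flags set by -S and -B
# and whether the site module was imported during startup.
PROBE = (
    "import sys; "
    "print(sys.flags.no_site, sys.flags.dont_write_bytecode, "
    "'site' in sys.modules)"
)


def probe(*switches):
    """Run a child interpreter with the given switches and report its state."""
    cmd = [sys.executable] + list(switches) + ["-c", PROBE]
    return subprocess.check_output(cmd).decode().split()


if __name__ == "__main__":
    print("default:", probe())
    print("wrapper:", probe("-B", "-u", "-S"))
```

Comparing the two lines shows whether site was loaded, which is exactly the part of startup that the wrapper skips.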

So I needed to copy everything from /usr/lib/python-<version>/site-packages to /usr/lib/python-<version>/ without overwriting any original file, which should work for the vast majority of setups.
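The copy step could look roughly like this in Python. It is a sketch under assumptions, not the code from my build: merge_tree is a hypothetical name, and both directories are passed in by the caller because the real paths depend on the Python version.

```python
import os
import shutil


def merge_tree(src, dst):
    """Copy files from src into dst, skipping anything that already exists."""
    copied = []
    for root, _dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        target_dir = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            target = os.path.join(target_dir, name)
            if os.path.lexists(target):
                continue  # never overwrite a file that is already in dst
            shutil.copy2(os.path.join(root, name), target)
            copied.append(target)
    return copied
```

The function returns the list of files it actually copied, so name clashes can be spotted by comparing it with the contents of the source directory.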

Finally, I once again forced a byte-compilation run with python3 -m compileall, because with -B the interpreter no longer writes the bytecode files itself on import.
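The same byte-compilation can be triggered through the compileall module's Python API. A minimal sketch, assuming the directory to compile is passed in; precompile is a made-up wrapper name.

```python
import compileall


def precompile(path):
    """Byte-compile every .py file below path, ignoring timestamps."""
    # force=True recompiles even when an up-to-date .pyc already exists,
    # quiet=1 prints errors only.
    return bool(compileall.compile_dir(path, force=True, quiet=1))
```

The return value is True only if every file compiled without errors, which makes it easy to fail a build step on broken sources.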

So here are some results of the little hack

Results before patching

perf stat -r 10 -B python3 -c 'print("hello world")'
Performance counter stats for 'python3 -c print("hello world")' (10 runs):
        951.188380      task-clock (msec)         #    0.971 CPUs utilized            ( +-  2.51% )
               138      context-switches          #    0.145 K/sec                    ( +- 14.66% )
                 0      cpu-migrations            #    0.000 K/sec
               673      page-faults               #    0.707 K/sec                    ( +-  0.79% )
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses
            0.9801 +- 0.0299 seconds time elapsed  ( +-  3.05% )

perf stat -r 10 -B python3 -c 'import collections'
Performance counter stats for 'python3 -c import collections' (10 runs):
       1031.347709      task-clock (msec)         #    0.977 CPUs utilized            ( +-  1.97% )
               136      context-switches          #    0.131 K/sec                    ( +-  1.42% )
                 0      cpu-migrations            #    0.000 K/sec
               753      page-faults               #    0.730 K/sec                    ( +-  0.09% )
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses
            1.0560 +- 0.0241 seconds time elapsed  ( +-  2.29% )

perf stat -r 10 -B python3 -c 'import pbr'
Performance counter stats for 'python3 -c import pbr' (10 runs):
        915.234301      task-clock (msec)         #    0.973 CPUs utilized            ( +-  2.32% )
               126      context-switches          #    0.138 K/sec                    ( +-  3.05% )
                 0      cpu-migrations            #    0.000 K/sec
               670      page-faults               #    0.732 K/sec                    ( +-  0.12% )
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses
            0.9409 +- 0.0247 seconds time elapsed  ( +-  2.62% )

Results after patching

perf stat -r 10 -B python3 -c 'print("hello world")'
Performance counter stats for 'python3 -c print("hello world")' (10 runs):
        731.388882      task-clock (msec)         #    0.963 CPUs utilized            ( +-  2.71% )
                93      context-switches          #    0.127 K/sec                    ( +- 20.73% )
                 0      cpu-migrations            #    0.000 K/sec
               630      page-faults               #    0.861 K/sec                    ( +-  0.18% )
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses
            0.7595 +- 0.0242 seconds time elapsed  ( +-  3.19% )

perf stat -r 10 -B python3 -c 'import collections'
Performance counter stats for 'python3 -c import collections' (10 runs):
        924.103968      task-clock (msec)         #    0.977 CPUs utilized            ( +-  2.55% )
                99      context-switches          #    0.107 K/sec                    ( +-  2.84% )
                 0      cpu-migrations            #    0.000 K/sec
               733      page-faults               #    0.793 K/sec                    ( +-  0.12% )
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses
            0.9462 +- 0.0259 seconds time elapsed  ( +-  2.74% )

perf stat -r 10 -B python3 -c 'import pbr'
Performance counter stats for 'python3 -c import pbr' (10 runs):
        720.160245      task-clock (msec)         #    0.975 CPUs utilized            ( +-  2.98% )
                77      context-switches          #    0.107 K/sec                    ( +-  1.43% )
                 0      cpu-migrations            #    0.000 K/sec
               630      page-faults               #    0.874 K/sec                    ( +-  0.15% )
   <not supported>      cycles
   <not supported>      instructions
   <not supported>      branches
   <not supported>      branch-misses
            0.7384 +- 0.0239 seconds time elapsed  ( +-  3.24% )


Conclusion


For the standard-library module collections startup time drops by about 10% (1.056 s down to 0.946 s), which is okay for the effort I put into it. It is really good for the third-party pbr module: here startup time drops by about 21% (0.941 s down to 0.738 s), which is awesome.
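For reference, here is the arithmetic behind such percentages, computed as the relative reduction in elapsed time. That formula is an assumption about how to read the numbers, and time_saved_percent is just an illustrative name; the timings are copied from the perf runs above.

```python
# Elapsed seconds taken from the perf runs above: (before, after).
RESULTS = {
    'print("hello world")': (0.9801, 0.7595),
    "import collections": (1.0560, 0.9462),
    "import pbr": (0.9409, 0.7384),
}


def time_saved_percent(before, after):
    """Relative reduction in elapsed time, in percent."""
    return (before - after) / before * 100.0


for name, (before, after) in RESULTS.items():
    saved = time_saved_percent(before, after)
    print("%-22s %5.1f%% less startup time" % (name, saved))
```

Expressed as a ratio of before to after instead, the same measurements give somewhat larger figures, so it is worth stating which definition is meant.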

I totally agree that the results are far from satisfactory, but at least we gain a little speed for the things we really need to do in Python. The rest has to face a major rewrite in another language if performance matters to you...

Ready-to-use convenience 


To integrate all of this smoothly into my YOCTO-based build I created a little helper class called python-speedups.bbclass, which you can reuse if you like.

Further reading...

If you want to learn more about the startup speed of different languages you may want to have a look here
