Speed up Python on embedded systems

Have you ever considered using Python as a scripting language in an embedded system?
I've been doing so on recent projects, although it wasn't my first choice.

When I have to choose a scripting language for an embedded system, I have always had a strong preference for shell/bash or Lua, because they are either built in or designed to have a significantly lower footprint than the alternatives.

Nevertheless, the choice was python3 (the decision was out of my hands).

When putting together the first builds using YOCTO I realized that there are two sides to Python:

1. the startup phase, where the app is initializing
2. the execution phase, where the app just processes new data

In the 2nd phase python3 offers a good tradeoff between code maintainability and execution speed, so there is nothing to moan about.

Startup is the worst

But the 1st phase, where the python3 interpreter is starting, is really bad.

So I did some research into where this might be coming from.

Just to show how bad things are, here is a quick comparison (using perf to average over 10 runs):

A simple test in shell

perf stat -r 10 -B echo "Hello world"
Performance counter stats for 'echo Hello world' (10 runs):
37.622565 task-clock (msec) # 0.881 CPUs utilized ( +- 4.76% )
10 context-switches # 0.274 K/sec ( +- 5.02% )
0 cpu-migrations # 0.000 K/sec
57 page-faults # 0.002 M/sec ( +- 0.67% )
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
0.04272 +- 0.00169 seconds time elapsed ( +- 3.96% )

vs. a simple test in python

perf stat -r 10 -B python3 -c 'print("hello world")'
Performance counter stats for 'python3 -c print("hello world")' (10 runs):
951.188380 task-clock (msec) # 0.971 CPUs utilized ( +- 2.51% )
138 context-switches # 0.145 K/sec ( +- 14.66% )
0 cpu-migrations # 0.000 K/sec
673 page-faults # 0.707 K/sec ( +- 0.79% )
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
0.9801 +- 0.0299 seconds time elapsed ( +- 3.05% )

These two are worlds apart: roughly a factor of 23 in elapsed time!
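If perf is not available on the target, a rough equivalent of the averaging above can be sketched in Python itself. This is only a minimal sketch: the helper name time_command is my own, and because the harness is itself Python, its numbers include process-spawn overhead and will not match perf exactly.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=10):
    """Run argv `runs` times and return (mean, stdev) of wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the child's output so printing does not distort the timing.
        subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    # sys.executable is the interpreter running this script; swap in any command.
    mean, stdev = time_command([sys.executable, "-c", 'print("hello world")'])
    print("%.4f +- %.4f seconds time elapsed" % (mean, stdev))
```

Only relative comparisons between two such runs on the same board are meaningful here.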

Ideas welcome...

There are some very nice ideas to be found on the web.

They all have their purpose, but they did not work for me, because they mostly require major code changes to gain some speed.
Most of the code I use originates from 3rd parties (PyPI, GitHub), and the chance of getting major code changes into upstream master tends to zero, especially for a corner case like an embedded system.

It's all about the benjamins I/Os

So what's left to gain at least a little speed? After running some straces I noticed that there is massive I/O going on when the interpreter starts up (~1k reads for a simple print()).

There have got to be some options to minimize these I/O calls...
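To put a number on that I/O, the calls in a saved trace can be tallied per syscall. The sketch below assumes a plain strace log with one call per line (for example written with strace -o); the helper name count_syscalls and the sample lines are illustrative and not taken from my actual traces.

```python
import re
from collections import Counter

# Matches the syscall name at the start of a plain strace line, e.g. 'read(3, ...'.
SYSCALL_RE = re.compile(r"^(\w+)\(")


def count_syscalls(lines):
    """Return a Counter mapping syscall name -> number of calls."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:  # skips non-call lines such as '+++ exited with 0 +++'
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Illustrative sample lines (hypothetical, not a real trace).
    sample = [
        'openat(AT_FDCWD, "/usr/lib/python3.5/os.py", O_RDONLY|O_CLOEXEC) = 3',
        'read(3, "import sys\\n"..., 4096) = 4096',
        'read(3, "", 4096) = 0',
        "close(3) = 0",
        "+++ exited with 0 +++",
    ]
    for name, n in count_syscalls(sample).most_common():
        print(name, n)
```

Comparing the read and open counts before and after a change gives a quick indication of whether the change actually reduced the I/O.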

RTFM

I started digging deep into the Python documentation and found these command line switches:

-B

Don't write .py[co] files on import. See also PYTHONDONTWRITEBYTECODE.

-S

Disable the import of the module site and the site-dependent manipulations of sys.path that it entails.

-u

Force stdin, stdout and stderr to be totally unbuffered. On systems where it matters, also put stdin, stdout and stderr in binary mode. Note that there is internal buffering in xreadlines(), readlines() and file-object iterators ("for line in sys.stdin") which is not influenced by this option. To work around this, you will want to use "sys.stdin.readline()" inside a "while 1:" loop.
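To get a feeling for what -S actually skips, one can start a fresh interpreter with and without the switch and look at what ended up in sys.modules. This is a minimal sketch; probe_startup and PROBE are names I made up for the illustration, and the module counts will differ from system to system.

```python
import subprocess
import sys

# One-liner run inside the child interpreter: prints "<module count> <True|False>".
PROBE = "import sys; print(len(sys.modules), 'site' in sys.modules)"


def probe_startup(flags):
    """Start a fresh interpreter with `flags`; return (module_count, site_loaded)."""
    out = subprocess.check_output([sys.executable] + list(flags) + ["-c", PROBE])
    count, site_loaded = out.decode("ascii").split()
    return int(count), site_loaded == "True"


if __name__ == "__main__":
    for flags in ([], ["-S"], ["-B", "-u", "-S"]):
        count, site_loaded = probe_startup(flags)
        label = " ".join(flags) or "(none)"
        print("%-12s modules=%d site=%s" % (label, count, site_loaded))
```

Fewer modules loaded at startup means fewer files to locate and read, which is where the I/O savings should come from.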

Hacking the world a little in the right direction

So I replaced the default interpreter (/usr/bin/python) with a small shell script like this

#!/bin/sh
/usr/bin/python3.5 -B -u -S "$@"

This heavily hacks the Python import system: those switches assume that nothing is installed into "site-packages" on the system. But in my case there is...

So I needed to copy everything from /usr/lib/python-<version>/site-packages to /usr/lib/python-<version>/ without overwriting any original file, which should work for the vast majority of setups.
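The copy step can be sketched in a few lines of Python. This is an illustration under assumptions, not the code from my build: the function name merge_without_overwrite is made up, symlinks are not treated specially, and the real source and destination paths depend on your Python version.

```python
import os
import shutil


def merge_without_overwrite(src_root, dst_root):
    """Copy every file under src_root into dst_root, skipping existing files.

    Returns the list of destination paths that were skipped.
    """
    skipped = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            target = os.path.join(target_dir, name)
            if os.path.exists(target):
                skipped.append(target)  # never clobber an original file
                continue
            shutil.copy2(os.path.join(dirpath, name), target)
    return skipped
```

The returned list is worth a look: every skipped path is a name clash between a 3rd-party package and the standard library, and those are the setups where this hack does not work.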

Finally I ran a forced byte-compilation once more with python3 -m compileall
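The same step can also be driven from Python through the standard compileall module. Since -B stops the interpreter from writing bytecode files on its own, precompiling is what keeps cached bytecode available for the copied modules. A minimal sketch, with precompile being a hypothetical helper name and the directory passed on the command line:

```python
import compileall
import sys


def precompile(lib_dir):
    """Byte-compile every .py file under lib_dir, even if a .pyc already exists."""
    # force=True recompiles regardless of timestamps; quiet=1 prints errors only.
    return bool(compileall.compile_dir(lib_dir, force=True, quiet=1))


if __name__ == "__main__":
    for path in sys.argv[1:]:
        print(path, "ok" if precompile(path) else "had errors")
```

This has to run with the same Python version that is used on the target, otherwise the cached files will not be picked up.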

So here are some results of the little hack

Results before patching

perf stat -r 10 -B python3 -c 'print("hello world")'
Performance counter stats for 'python3 -c print("hello world")' (10 runs):
951.188380 task-clock (msec) # 0.971 CPUs utilized ( +- 2.51% )
138 context-switches # 0.145 K/sec ( +- 14.66% )
0 cpu-migrations # 0.000 K/sec
673 page-faults # 0.707 K/sec ( +- 0.79% )
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
0.9801 +- 0.0299 seconds time elapsed ( +- 3.05% )

perf stat -r 10 -B python3 -c 'import collections'
Performance counter stats for 'python3 -c import collections' (10 runs):
1031.347709 task-clock (msec) # 0.977 CPUs utilized ( +- 1.97% )
136 context-switches # 0.131 K/sec ( +- 1.42% )
0 cpu-migrations # 0.000 K/sec
753 page-faults # 0.730 K/sec ( +- 0.09% )
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
1.0560 +- 0.0241 seconds time elapsed ( +- 2.29% )

perf stat -r 10 -B python3 -c 'import pbr'
Performance counter stats for 'python3 -c import pbr' (10 runs):
915.234301 task-clock (msec) # 0.973 CPUs utilized ( +- 2.32% )
126 context-switches # 0.138 K/sec ( +- 3.05% )
0 cpu-migrations # 0.000 K/sec
670 page-faults # 0.732 K/sec ( +- 0.12% )
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
0.9409 +- 0.0247 seconds time elapsed ( +- 2.62% )

and after patching

perf stat -r 10 -B python3 -c 'print("hello world")'
Performance counter stats for 'python3 -c print("hello world")' (10 runs):
731.388882 task-clock (msec) # 0.963 CPUs utilized ( +- 2.71% )
93 context-switches # 0.127 K/sec ( +- 20.73% )
0 cpu-migrations # 0.000 K/sec
630 page-faults # 0.861 K/sec ( +- 0.18% )
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
0.7595 +- 0.0242 seconds time elapsed ( +- 3.19% )

perf stat -r 10 -B python3 -c 'import collections'
Performance counter stats for 'python3 -c import collections' (10 runs):
924.103968 task-clock (msec) # 0.977 CPUs utilized ( +- 2.55% )
99 context-switches # 0.107 K/sec ( +- 2.84% )
0 cpu-migrations # 0.000 K/sec
733 page-faults # 0.793 K/sec ( +- 0.12% )
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
0.9462 +- 0.0259 seconds time elapsed ( +- 2.74% )

perf stat -r 10 -B python3 -c 'import pbr'
Performance counter stats for 'python3 -c import pbr' (10 runs):
720.160245 task-clock (msec) # 0.975 CPUs utilized ( +- 2.98% )
77 context-switches # 0.107 K/sec ( +- 1.43% )
0 cpu-migrations # 0.000 K/sec
630 page-faults # 0.874 K/sec ( +- 0.15% )
<not supported> cycles
<not supported> instructions
<not supported> branches
<not supported> branch-misses
0.7384 +- 0.0239 seconds time elapsed ( +- 3.24% )

Conclusion

For the standard-library module collections the elapsed time drops by ~10% (1.056 s down to 0.946 s), which is okay for the effort I put into it. It is really good for the 3rd-party pbr module: here the elapsed time drops by ~21% (0.941 s down to 0.738 s), which is awesome.

I totally agree that the results are far from satisfactory, but at least we gain a little speed for the things we really need to do in Python. The rest has to face a major rewrite in another language if performance matters to you...
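For reference, the percentages can be recomputed from the elapsed times that perf reported, expressed as the share of the original time that was saved. The numbers are copied from the runs in this post; reduction_percent is just a helper name for the sketch.

```python
# Elapsed times in seconds, copied from the perf runs in this post: (before, after).
MEASUREMENTS = {
    'print("hello world")': (0.9801, 0.7595),
    "import collections": (1.0560, 0.9462),
    "import pbr": (0.9409, 0.7384),
}


def reduction_percent(before, after):
    """Percentage of the original elapsed time that was saved."""
    return (before - after) / before * 100.0


if __name__ == "__main__":
    for name, (before, after) in MEASUREMENTS.items():
        print("%-22s %.1f%%" % (name, reduction_percent(before, after)))
```

Keep in mind that each measurement carries a spread of about 2-3%, so differences of a percentage point between the cases should not be over-interpreted.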

Ready-to-use convenience

To integrate all this smoothly into my YOCTO-based build I created a little helper class called python-speedups.bbclass, which you can reuse if you like.

Bit-baking with soda
