
The 'a' in compliance stands for... - automation

The 'a' in compliance stands for

  • abstract things that most developers try to avoid
  • artificially created rules (as it's the nature of laws to be sometimes illogical ;-))
  • an absolute must-have if you're working with open-source software.
For me, it additionally stands for
  • automation potential

Introducing the issue


Just think of the following situation:
Somebody you have never worked with before (and don't know personally) tries to contribute a bitbake recipe for a new component.
As we all know, each recipe has a LICENSE entry, which holds an SPDX-compatible representation of the license (or multiple licenses) applied to the source code that the recipe provides build information for.

So how can you be sure that this setting is correct?
Well, you can scan the source code all by yourself and try to figure out under what license it is provided.
For a single source repository, I guess that's fairly easy, but time-consuming.
But what about a recipe that pulls in ~1000 files from different repositories (yeah, NPM, I'm looking at you!)?

Do you want to check all of that manually? I guess not!

That's where it's time to use some automation.

For the following example, we will scan the code base referenced by the recipe and compare the outcome of that step with the string set in LICENSE.

First step: scan the codebase


There are a few tools out there, which are able to get any kind of licensing information from source files, e.g.
While the first one is written in Python, it comes with a miles-long dependency tree (and a bunch of strange bugs, according to its own bug tracker). The last one is a dedicated server (with databases and all that), which doesn't suit my personal approach of doing everything with a batteries-included system (so no external dependencies, everything included in a layer).
That leaves us with lc, which is written in Go (so it brings multiprocessing by design) and doesn't come with a very long list of dependencies.

The output of the tool (run with CSV output) looks like this:

filename              directory  license            confidence  rootlicenses
simple-hello-world.c  .          Zlib               90.62%
LICENSE               test1      GPL-2.0+ AND MIT   100.00%     MIT,GPL-2.0+
LICENSE.gpl2          test1      MIT AND GPL-2.0+   99.48%      MIT,GPL-2.0+
test-1.sh             test1      (MIT OR GPL-2.0+)  100.00%     MIT,GPL-2.0+
LICENSE               test2      GPL-3.0-only       99.68%      GPL-3.0-only
test-2.sh             test2      GPL-3.0-only       100.00%     GPL-3.0-only

Here we can see that parts of the component are licensed under Zlib, while others are licensed under a version of GPL and even others are MIT licensed.

Putting the pieces together


From the above table one can see the following:
  • Files like LICENSE do hold the licensing information, but seem to be misinterpreted by the tool, because in this case
    • LICENSE seems to be the MIT license text
    • LICENSE.gpl2 seems to be the GPL-2.0 license text
  • That usually means that you can choose between MIT or GPL-2.0, and not, as the column "license" suggests, that you need to apply both licenses
So for further processing it might be useful to ignore those files.
That leaves us (in this example) with
Zlib AND GPL-3.0-only AND (MIT OR GPL-2.0+)
which is an SPDX-compatible license string.
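
As a rough illustration of this step, here is a minimal sketch that drops the license text files and joins the remaining expressions with "AND". The rows are copied from the table above; the list of file name prefixes and the helper names (is_license_text, combine) are assumptions made for this example and are not taken from lc or meta-sca.

```python
# Rows copied from the scanner table: (filename, directory, license expression)
ROWS = [
    ("simple-hello-world.c", ".", "Zlib"),
    ("LICENSE", "test1", "GPL-2.0+ AND MIT"),
    ("LICENSE.gpl2", "test1", "MIT AND GPL-2.0+"),
    ("test-1.sh", "test1", "(MIT OR GPL-2.0+)"),
    ("LICENSE", "test2", "GPL-3.0-only"),
    ("test-2.sh", "test2", "GPL-3.0-only"),
]

# Assumption: file names starting with one of these usually hold a license text
LICENSE_TEXT_PREFIXES = ("LICENSE", "LICENCE", "COPYING")


def is_license_text(filename):
    """Return True if the file name looks like a license text file."""
    return filename.upper().startswith(LICENSE_TEXT_PREFIXES)


def combine(rows):
    """AND-combine the unique license expressions of all non-license-text files."""
    terms = []
    for filename, _directory, expression in rows:
        if is_license_text(filename) or not expression:
            continue
        # Naive: an expression that starts with "(" is assumed to be
        # fully parenthesized already; anything else with spaces is wrapped.
        if " " in expression and not expression.startswith("("):
            expression = f"({expression})"
        if expression not in terms:
            terms.append(expression)
    return " AND ".join(terms)


print(combine(ROWS))
```

The terms come out in the order in which the files were scanned, so the result can differ from a hand-written string in ordering only.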

Making it more flexible


As you might have guessed, the order of the different terms is pretty much exchangeable, so a plain string comparison will give you plenty of false positives (mismatches reported although the licensing is the same).
So we need some code that lets us compare these settings independent of their order.

After some googling I found license-expression, a pretty small Python library, which splits an SPDX license string into its components.

So all we have to do is
  • take all found entries of "license" from the above table and combine them all with "AND"
  • compare the result to the setting in LICENSE
Sounds pretty easy, right? Well, it is for the simple cases, but just read on.
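
To make the order-independent comparison a bit more concrete, here is a sketch that uses only the Python standard library and not the license-expression API. It turns each expression into a nested structure of unordered sets and compares those. The function names (split_top_level, strip_outer_parens, canonical, same_license) are made up for this example, and the sketch assumes that tokens are separated by spaces or parentheses and that "AND" binds tighter than "OR".

```python
def split_top_level(expr, operator):
    """Split expr on operator, ignoring occurrences inside parentheses."""
    parts, current, depth = [], [], 0
    for token in expr.replace("(", " ( ").replace(")", " ) ").split():
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
        if token == operator and depth == 0:
            parts.append(" ".join(current))
            current = []
        else:
            current.append(token)
    parts.append(" ".join(current))
    return parts


def strip_outer_parens(expr):
    """Remove parentheses that enclose the whole expression."""
    tokens = expr.split()
    while tokens and tokens[0] == "(" and tokens[-1] == ")":
        depth = 0
        for index, token in enumerate(tokens):
            depth += (token == "(") - (token == ")")
            if depth == 0:
                break
        if index != len(tokens) - 1:
            break  # the first "(" closes before the end, e.g. "( A ) AND ( B )"
        tokens = tokens[1:-1]
    return " ".join(tokens)


def canonical(expr):
    """Return an order-independent representation of the expression."""
    spaced = " ".join(expr.replace("(", " ( ").replace(")", " ) ").split())
    expr = strip_outer_parens(spaced)
    for operator in ("OR", "AND"):  # split on the weaker operator first
        parts = split_top_level(expr, operator)
        if len(parts) > 1:
            return (operator, frozenset(canonical(part) for part in parts))
    return expr  # a single license identifier


def same_license(left, right):
    """Compare two expressions while ignoring the order of their terms."""
    return canonical(left) == canonical(right)


print(same_license("Zlib AND GPL-3.0-only AND (MIT OR GPL-2.0+)",
                   "(GPL-2.0+ OR MIT) AND Zlib AND GPL-3.0-only"))
```

This only removes the dependency on ordering. It does not flatten nested groups such as "A AND (B AND C)", and it does not know that one license choice can satisfy an "OR".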

Equation matching


Now think of a combination like
MIT AND (GPL-2.0 OR BSD-3-Clause)
extracted by the tool.
In the recipe
MIT AND BSD-3-Clause
is set. For a human it amounts to the same, but not for the machine, so it would rightfully complain that the extracted license string is not the same as the one set in the recipe.

To catch these cases, one can take all the literals from the recipe setting and substitute them into the expression created by the scanner: use a "1" for a literal that is set and a "0" for one that is not.

With this in mind 
MIT AND (GPL-2.0 OR BSD-3-Clause)
will turn into 
1 AND (0 OR 1)
If we now replace "AND" by "&" and "OR" by "|", we can use Python's built-in eval* function to get the result of the logical expression.

This result is much less prone to false positives.
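
The substitution idea can also be sketched without eval. The following is not the code used in meta-sca; it swaps eval for a small recursive-descent evaluator built on the standard library, and the names literals and evaluate are assumptions for this example. A literal counts as "1" when it appears in the recipe setting and as "0" otherwise.

```python
import re

# A token is a parenthesis or any run of characters without spaces/parentheses
TOKEN = re.compile(r"\(|\)|[^\s()]+")
OPERATORS = {"(", ")", "AND", "OR"}


def literals(expression):
    """Return the set of license identifiers used in an expression."""
    return {token for token in TOKEN.findall(expression) if token not in OPERATORS}


def evaluate(expression, granted):
    """Evaluate an AND/OR expression; a literal is true if it is in granted."""
    tokens = TOKEN.findall(expression)
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            rhs = parse_and()  # always parse, so no tokens are skipped
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "AND":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in OPERATORS:
            raise ValueError(f"unexpected token {token!r}")
        return token in granted

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


scanned = "MIT AND (GPL-2.0 OR BSD-3-Clause)"
recipe = "MIT AND BSD-3-Clause"
print(evaluate(scanned, literals(recipe)))
```

Because the input is only ever tokenized and compared against a fixed set of operators, a malformed or malicious LICENSE string ends in a ValueError and never reaches the interpreter.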

Further improvements


The SPDX specification offers things like
  • GPL-2.0+ - so any GPL version starting from 2.0
  • GPL-2.0-or-later - which is pretty much the same as the above
Now think of an extracted license string like this
GPL-3.0 AND (MIT AND GPL-2.0+)
Wouldn't it be enough to set
GPL-3.0 AND MIT
in the recipe?
I think that is the case, but IANAL.

So we need to make the tool aware that "GPL-2.0+" can have multiple meanings.
If we "explode" each of these into all their possibilities "OR"ed together, like
GPL-3.0 AND (MIT AND (GPL-2.0 OR GPL-3.0))
then it would be possible for the algorithm to see that the setting chosen is the correct one.
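
A possible way to do this "explosion" is sketched below. The version table KNOWN_VERSIONS is a hand-maintained assumption for this example (it only lists a few GPL and LGPL versions), and explode_or_later is a hypothetical helper name, not something provided by lc or license-expression.

```python
import re

# Assumption: hand-maintained list of released versions per license family
KNOWN_VERSIONS = {
    "GPL": ["1.0", "2.0", "3.0"],
    "LGPL": ["2.0", "2.1", "3.0"],
}

# Matches identifiers such as "GPL-2.0+" or "GPL-2.0-or-later"
OR_LATER = re.compile(
    r"(?<![\w.-])([A-Za-z]+)-(\d+\.\d+)(?:\+|-or-later)(?![\w.-])"
)


def explode_or_later(expression):
    """Replace every "or later" identifier by an OR group of concrete versions."""

    def replace(match):
        family, minimum = match.group(1), match.group(2)
        versions = KNOWN_VERSIONS.get(family)
        if versions is None or minimum not in versions:
            return match.group(0)  # unknown license: leave it untouched
        later = versions[versions.index(minimum):]
        return "(" + " OR ".join(f"{family}-{version}" for version in later) + ")"

    return OR_LATER.sub(replace, expression)


print(explode_or_later("GPL-3.0 AND (MIT AND GPL-2.0+)"))
```

Identifiers that are not in the table are passed through unchanged, so the check falls back to the stricter comparison for them.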

Conclusion


With the help of the tools lc, license-expression and some glue logic it is possible to check the correct licensing of any source code base in YOCTO.

I know you already expected it


You can play around with the resulting code, using my meta-sca from GitHub on your very own YOCTO based project. Please see here and here for more details.

Wait, if you want to do it even more professionally


The approach shown in this blog post is suitable for occasional scanning of a code base to catch licensing issues.
If you want to do this type of scanning more frequently: the standard poky layer ships with built-in support for your own fossology server, which gives you way more performance than my approach, as it uses caching and hashing to avoid scanning an entire code base over and over again.

Be sure to check out this brilliant slide set if you'd like to know more about fossology and YOCTO.

Any thoughts?


If you liked this post or have any additions, comments, and so on, be sure to use the comments below or over at LinkedIn.


* Before you start screaming - I'm well aware of the fact that eval is a security nightmare and shouldn't be used, so suggestions for improvements are welcome.
