The 'a' in compliance stands for... - automation

The 'a' in compliance stands for

abstract things that most developers try to avoid
artificially created rules (as it's in the nature of laws to be sometimes illogical ;-))
an absolute must-have if you're working with open-source software

For me it additionally stands for

automation potential

Introducing the issue

Just think of the following situation:
a random guy you have never worked with before (and don't know personally) tries to contribute a bitbake recipe for a new component.

As we all know, each recipe has a LICENSE entry, which gives an SPDX-compatible representation of the license (or licenses) applied to the source code that the recipe provides build information for.

So how can you be sure that this setting is correct?

Well, you can scan the source code all by yourself and try to figure out under what license it is provided.
For a single source repository, I guess that's fairly easy but time-consuming.
But what about a recipe that pulls in ~1000 files from different repositories (yeah, NPM, I'm looking at you!)?

Do you want to check all of that manually? I guess not!

That's where automation comes in.

For the following example we will scan the code base referenced by the recipe and compare the result of the scan to the string set in LICENSE.

First step: scan the codebase

There are a few tools out there that can extract licensing information from source files, e.g.

While the first one is written in Python, it comes with a miles-long dependency tree (and a bunch of strange bugs, according to its own bug tracker). The last one is a dedicated server (with databases and all that), which doesn't suit my personal approach of doing everything with a batteries-included system (so no external dependencies, everything included in a layer).

That leaves us with lc, which is written in Go (so it supports concurrency by design) and doesn't come with a very long list of dependencies.

The output of the tool (run as CSV output) looks like this

filename	directory	license	confidence	rootlicenses
simple-hello-world.c	.	Zlib	90.62%
LICENSE	test1	GPL-2.0+ AND MIT	100.00%	MIT,GPL-2.0+
LICENSE.gpl2	test1	MIT AND GPL-2.0+	99.48%	MIT,GPL-2.0+
test-1.sh	test1	(MIT OR GPL-2.0+)	100.00%	MIT,GPL-2.0+
LICENSE	test2	GPL-3.0-only	99.68%	GPL-3.0-only
test-2.sh	test2	GPL-3.0-only	100.00%	GPL-3.0-only

Here we can see that parts of the component are licensed under Zlib, others under a version of the GPL, and still others under MIT.
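To work with this output programmatically, it first has to be read into some structure. The following is only a minimal sketch, assuming the output is tab-separated with the header shown above; SAMPLE is copied from the table and read_scan is a hypothetical helper name, not part of lc.

```python
import csv
import io

# Sample copied from the table above; columns are assumed to be tab-separated.
SAMPLE = (
    "filename\tdirectory\tlicense\tconfidence\trootlicenses\n"
    "simple-hello-world.c\t.\tZlib\t90.62%\n"
    "LICENSE\ttest1\tGPL-2.0+ AND MIT\t100.00%\tMIT,GPL-2.0+\n"
    "LICENSE.gpl2\ttest1\tMIT AND GPL-2.0+\t99.48%\tMIT,GPL-2.0+\n"
    "test-1.sh\ttest1\t(MIT OR GPL-2.0+)\t100.00%\tMIT,GPL-2.0+\n"
    "LICENSE\ttest2\tGPL-3.0-only\t99.68%\tGPL-3.0-only\n"
    "test-2.sh\ttest2\tGPL-3.0-only\t100.00%\tGPL-3.0-only\n"
)


def read_scan(text):
    """Return one dict per scanned file (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {
            "path": f"{row['directory']}/{row['filename']}",
            "license": row["license"],
            # "90.62%" -> 90.62
            "confidence": float(row["confidence"].rstrip("%")),
        }
        for row in reader
    ]


for entry in read_scan(SAMPLE):
    print(entry["path"], entry["license"], entry["confidence"])
```

Keeping the confidence as a number makes it easy to drop low-confidence matches later, if that turns out to be useful.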

Putting the pieces together

From the above table one can see the following

Files like LICENSE do hold the licensing information but seem to be misinterpreted by the tool, because in this case

LICENSE seems to be the MIT license text
LICENSE.gpl2 seems to be the GPL-2.0 license text

That usually means you can choose between MIT and GPL-2.0, not, as the "license" column suggests, that you need to apply both licenses.

So for further processing it might be useful to ignore those files.

That leaves us (in this example) with

Zlib AND GPL-3.0-only AND (MIT OR GPL-2.0+)

which is an SPDX-compatible licensing string.
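The filtering and combining step could be sketched roughly like this. It is an illustration only: the name-based check in is_license_text (including the COPYING prefix) is my assumption of a reasonable heuristic, and both function names are hypothetical.

```python
# (filename, license) pairs taken from the table above.
SCAN = [
    ("simple-hello-world.c", "Zlib"),
    ("LICENSE", "GPL-2.0+ AND MIT"),
    ("LICENSE.gpl2", "MIT AND GPL-2.0+"),
    ("test-1.sh", "(MIT OR GPL-2.0+)"),
    ("LICENSE", "GPL-3.0-only"),
    ("test-2.sh", "GPL-3.0-only"),
]


def is_license_text(filename):
    # Assumption: license texts can be recognised by their file name.
    return filename.upper().startswith(("LICENSE", "COPYING"))


def combine(scan):
    """AND-combine the distinct license findings of all remaining files."""
    terms = []
    for filename, license_expr in scan:
        if is_license_text(filename):
            continue  # skip the license texts themselves
        if license_expr not in terms:
            terms.append(license_expr)  # keep first-seen order, no duplicates
    return " AND ".join(terms)


print(combine(SCAN))
```

The terms come out in the order the files were scanned, so the string can differ from the one above in ordering only. A finding that contains an unparenthesized OR would need to be wrapped in parentheses before joining; the sketch does not handle that.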

Making it more flexible

As you might have guessed, the order of the different terms is interchangeable, so a plain string comparison will give you plenty of false positives.
So we need some code that lets us compare these settings independent of their order.

After some googling I found license-expression, a pretty small Python library that splits an SPDX license string into its components.

So all we have to do is

take all entries found in the "license" column of the above table and combine them with AND
compare the result to the setting in LICENSE
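To illustrate what "independent of their order" means, here is a small standard-library sketch. It is not how license-expression works internally; it simply splits both strings at the top-level ANDs and compares the resulting sets. The function names are hypothetical.

```python
def split_top_level(expression, operator="AND"):
    """Split on `operator`, ignoring operators nested inside parentheses."""
    terms, current, depth = [], [], 0
    # Pad parentheses with spaces so they become separate tokens.
    for token in expression.replace("(", " ( ").replace(")", " ) ").split():
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
        if token == operator and depth == 0:
            terms.append(" ".join(current))
            current = []
        else:
            current.append(token)
    terms.append(" ".join(current))
    return terms


def same_terms(left, right):
    """True if both expressions consist of the same top-level AND terms."""
    return set(split_top_level(left)) == set(split_top_level(right))


print(same_terms(
    "Zlib AND GPL-3.0-only AND (MIT OR GPL-2.0+)",
    "(MIT OR GPL-2.0+) AND Zlib AND GPL-3.0-only",
))
```

This only reorders the top level: "(MIT OR GPL-2.0+)" and "(GPL-2.0+ OR MIT)" would still count as different terms, which is one reason to rely on a real parser instead.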

Sounds pretty easy, right? Well, it is for the simple cases, but just read on.

Equation matching

Now think of a combination like

MIT AND (GPL-2.0 OR BSD-3-Clause)

extracted by the tool.
In the recipe

MIT AND BSD-3-Clause

is set. For a human this is a valid choice (BSD-3-Clause was picked from the OR), but not for the machine, which would rightfully complain that the extracted license string is not the same as the one set in the recipe.

To catch these cases, one can take all the literals from the recipe setting and put them into the equation created by the scanner: use a "1" for a literal that is set and a "0" for one that is not.

With this in mind

MIT AND (GPL-2.0 OR BSD-3-Clause)

will turn into

1 AND (0 OR 1)

If we now replace "AND" with "&" and "OR" with "|", we can use the Python builtin eval* function to get the result of the logical equation; here the result is 1, so the recipe setting satisfies the extracted expression.

This result is much less prone to false positives.
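The substitution described above could be sketched as follows. This is a minimal illustration, not the code from meta-sca: the tokenizer regex and the function name satisfies are my own assumptions, and the string handed to eval is built only from "0", "1", "&", "|" and parentheses.

```python
import re

# A token is a parenthesis or a run of characters allowed in license ids.
TOKEN = re.compile(r"\(|\)|[A-Za-z0-9.+-]+")
OPERATORS = {"AND": "&", "OR": "|"}


def satisfies(scanned, recipe):
    """True if the literals set in `recipe` make `scanned` evaluate to 1."""
    recipe_ids = {
        tok for tok in TOKEN.findall(recipe)
        if tok not in OPERATORS and tok not in ("(", ")")
    }
    parts = []
    for tok in TOKEN.findall(scanned):
        if tok in OPERATORS:
            parts.append(OPERATORS[tok])
        elif tok in ("(", ")"):
            parts.append(tok)
        else:
            # "1" for a literal set in the recipe, "0" otherwise
            parts.append("1" if tok in recipe_ids else "0")
    equation = " ".join(parts)  # e.g. "1 & ( 0 | 1 )"
    # Only 0, 1, &, | and parentheses end up here; builtins are disabled.
    return bool(eval(equation, {"__builtins__": {}}, {}))


print(satisfies("MIT AND (GPL-2.0 OR BSD-3-Clause)", "MIT AND BSD-3-Clause"))
```

Python's "&" binds tighter than "|", which matches AND taking precedence over OR. Note that the check is one-directional: literals that appear only in the recipe are silently ignored, and "WITH" exceptions are not handled at all.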

Further improvements

The SPDX specification offers things like

GPL-2.0+ - so any GPL version starting from 2.0
GPL-2.0-or-later - which is pretty much the same as the above

Now think of an extracted license string like this

GPL-3.0 AND (MIT AND GPL-2.0+)

Wouldn't it be enough to set

GPL-3.0 AND MIT

in the recipe?
I think that is the case, but IANAL.

So we need to make the tool aware that "GPL-2.0+" can have multiple meanings.
If we "explode" each of these into all their possibilities, "OR"ed together, like

GPL-3.0 AND (MIT AND (GPL-2.0 OR GPL-3.0))

then it would be possible for the algorithm to see that the chosen setting is the correct one.
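A rough sketch of this "explode" step, assuming a hand-maintained table of known versions per license family. The VERSIONS table is illustrative and incomplete, and explode is a hypothetical name.

```python
import re

# Illustrative, incomplete table of released versions per license family.
VERSIONS = {
    "GPL": ["1.0", "2.0", "3.0"],
    "LGPL": ["2.0", "2.1", "3.0"],
}

# Matches e.g. "GPL-2.0+" or "GPL-2.0-or-later".
OR_LATER = re.compile(r"\b([A-Za-z]+)-(\d+\.\d+)(?:\+|-or-later)")


def explode(expression):
    """Replace every "or later" identifier by an OR of all known versions."""
    def expand(match):
        family, minimum = match.group(1), match.group(2)
        versions = VERSIONS.get(family)
        if versions is None or minimum not in versions:
            return match.group(0)  # unknown family or version: leave as-is
        later = versions[versions.index(minimum):]
        return "(" + " OR ".join(f"{family}-{v}" for v in later) + ")"

    return OR_LATER.sub(expand, expression)


print(explode("GPL-3.0 AND (MIT AND GPL-2.0+)"))
```

The exploded string can then go through the same 0/1 substitution as before, so a recipe that sets GPL-3.0 and MIT would satisfy it.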

Conclusion

With the help of the tools lc, license-expression and some glue logic it is possible to check the correct licensing of any source code base in YOCTO.

I know you already expected it

You can play around with the resulting code, using my meta-sca from GitHub on your very own YOCTO based project. Please see here and here for more details.

Wait, if you want to do it even more professionally

The approach I've shown in this blog post is suitable for occasional scanning of a code base to catch licensing issues.
If you want to do this type of scanning more frequently: the standard poky layer ships with builtin support for your own fossology server, which gives you way more performance than my approach, as it uses caching and hashing to avoid scanning an entire code base over and over again.

Be sure to check out this brilliant slide set if you'd like to know more about fossology and YOCTO.

Any thoughts?

If you liked this post or have any additions, comments, and so on, be sure to use the comments below or over at LinkedIn.

* Before you start screaming - I'm pretty aware of the fact that eval is a security nightmare and shouldn't be used, so suggestions for improvements are welcome.
