The 'a' in compliance stands for
- abstract things that most developers try to avoid
- artificially created rules (as it's the nature of laws to be sometimes illogical ;-))
- an absolute must-have if you're working with open-source software.
For me it additionally stands for
- automation potential
Introducing the issue
Just think of the following situation:
A random person you have never worked with before (and don't know personally) tries to contribute a bitbake recipe for a new component.
As we all know, each recipe has a LICENSE entry, which holds an SPDX-compatible representation of the license (or multiple licenses) applied to the source code that the recipe provides build information for.
So how can you be sure that this setting is correct?
Well, you can scan the source code all by yourself and try to figure out under what license it is provided.
For a single source repository, I guess that's fairly easy, but time-consuming.
But what about a recipe that pulls in something like 1000 files from different repositories (yeah, NPM, I'm looking at you!)?
Do you want to check all of that manually? I guess not!
That's when it's time to use some automation.
For the following example we will scan the code base referenced by the recipe and compare the outcome of the step to the string set by LICENSE.
First step: scan the codebase
There are a few tools out there, which are able to get any kind of licensing information from source files, e.g.
While the first one is written in Python, it comes with a miles-long dependency tree (and a bunch of strange bugs, according to their own bug tracker). The last one is a dedicated server (with databases and all that), which doesn't suit my personal approach of doing everything with a batteries-included system (so no external dependencies, everything included in a layer).
That leaves us with lc, which is written in Go (so it brings concurrency by design) and doesn't come with a very long list of dependencies.
The output of the tool (run as CSV output) looks like this
filename | directory | license | confidence | rootlicenses |
simple-hello-world.c | . | Zlib | 90.62% | |
LICENSE | test1 | GPL-2.0+ AND MIT | 100.00% | MIT,GPL-2.0+ |
LICENSE.gpl2 | test1 | MIT AND GPL-2.0+ | 99.48% | MIT,GPL-2.0+ |
test-1.sh | test1 | (MIT OR GPL-2.0+) | 100.00% | MIT,GPL-2.0+ |
LICENSE | test2 | GPL-3.0-only | 99.68% | GPL-3.0-only |
test-2.sh | test2 | GPL-3.0-only | 100.00% | GPL-3.0-only |
Here we can see that parts of the component are licensed under Zlib, while others are licensed under a version of GPL and even others are MIT licensed.
Putting the pieces together
From the above table one can see the following
- Files like LICENSE do hold the licensing information, but seem to be misinterpreted by the tool, because in this case
- LICENSE seems to be the MIT license text
- LICENSE.gpl2 seems to be the GPL-2.0 license text
- That usually means that you can choose between MIT or GPL-2.0, and not, as the "license" column suggests, that you need to apply both licenses
So for further processing it might be useful to ignore those files.
That leaves us (in this example) with
Zlib AND GPL-3.0-only AND (MIT OR GPL-2.0+)
which is an SPDX-compatible license string.
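A minimal sketch of this step in Python, assuming the scanner output is available as pipe-separated text with the column names from the table above. The function name `combine_licenses` and the rule "skip every file whose name starts with LICENSE" are my own illustration, not part of lc. Entries that contain operators are assumed to be parenthesised already, as in the table.

```python
import csv
import io

# The table from above, re-typed as pipe-separated text for this sketch.
SCAN_OUTPUT = """\
filename|directory|license|confidence|rootlicenses
simple-hello-world.c|.|Zlib|90.62%|
LICENSE|test1|GPL-2.0+ AND MIT|100.00%|MIT,GPL-2.0+
LICENSE.gpl2|test1|MIT AND GPL-2.0+|99.48%|MIT,GPL-2.0+
test-1.sh|test1|(MIT OR GPL-2.0+)|100.00%|MIT,GPL-2.0+
LICENSE|test2|GPL-3.0-only|99.68%|GPL-3.0-only
test-2.sh|test2|GPL-3.0-only|100.00%|GPL-3.0-only
"""


def combine_licenses(scan_text, ignore_prefixes=("LICENSE",)):
    """AND-combine the distinct 'license' entries, skipping license text files."""
    terms = []
    for row in csv.DictReader(io.StringIO(scan_text), delimiter="|"):
        if row["filename"].startswith(ignore_prefixes):
            continue  # license text files tend to be misinterpreted by the scanner
        term = row["license"].strip()
        if term and term not in terms:
            terms.append(term)  # keep first-seen order, drop duplicates
    return " AND ".join(terms)


combined = combine_licenses(SCAN_OUTPUT)
print(combined)  # Zlib AND (MIT OR GPL-2.0+) AND GPL-3.0-only
```

The terms come out in file order, so the result lists the same three terms as the string above, just in a different order.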
Making it more flexible
As you might have guessed, the order of the different terms is pretty much interchangeable, so a plain string comparison will give you plenty of false positives.
So we need some code to enable us to compare these settings independent of their order.
After some googling I found license-expression, a pretty small Python library that splits an SPDX license string into its components.
So all we have to do is
- take all entries found in the "license" column of the above table and AND-combine them
- compare it to the setting in LICENSE
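To illustrate what an order-independent comparison could look like, here is a small sketch. Note that it does not use license-expression: to keep the example free of dependencies it swaps in a tiny hand-written parser that only understands AND, OR and parentheses (no WITH exceptions). The names `parse` and `same_expression` are hypothetical.

```python
import re

TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse(expression):
    """Parse an AND/OR license expression into an order-independent form.

    The result is either a license name (str) or a tuple
    (operator, frozenset_of_operands), so operand order no longer matters.
    """
    tokens = TOKEN.findall(expression)
    pos = 0

    def parse_operator(operator, parse_operand):
        nonlocal pos
        operands = [parse_operand()]
        while pos < len(tokens) and tokens[pos].upper() == operator:
            pos += 1
            operands.append(parse_operand())
        flat = set()
        for operand in operands:
            # Merge nested terms of the same operator: A AND (B AND C) -> {A, B, C}
            if isinstance(operand, tuple) and operand[0] == operator:
                flat |= operand[1]
            else:
                flat.add(operand)
        return flat.pop() if len(flat) == 1 else (operator, frozenset(flat))

    def parse_or():
        return parse_operator("OR", parse_and)

    def parse_and():
        return parse_operator("AND", parse_atom)

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            inner = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return inner
        return token

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result


def same_expression(left, right):
    """True if both expressions differ at most in the order of their terms."""
    return parse(left) == parse(right)


print(same_expression("Zlib AND GPL-3.0-only AND (MIT OR GPL-2.0+)",
                      "(GPL-2.0+ OR MIT) AND Zlib AND GPL-3.0-only"))  # True
```

This only removes the ordering problem. Two expressions that are logically compatible but structurally different still compare as unequal.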
Sounds pretty easy, right? Well, it is for the simple cases, but just read on.
Equation matching
Now think of a combination like
MIT AND (GPL-2.0 OR BSD-3-Clause)
extracted by the tool.
In the recipe
MIT AND BSD-3-Clause
is set - for a human it is the same, but not for the machine, so it would rightfully complain that the extracted license string is not the same as the one set in the recipe.
To catch these cases, one can take all the literals from the recipe setting and substitute them into the equation created by the scanner: use a "1" for a license that is set and a "0" for one that is not.
With this in mind
MIT AND (GPL-2.0 OR BSD-3-Clause)
will turn into
1 AND (0 OR 1)
If we now replace "AND" by "&" and "OR" by "|", we can use the python builtin eval* function to get the result of the logical equation.
This result is much less prone to false positives.
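The same substitution can also be done without eval. The following is a sketch of that idea, not the code from meta-sca: a small recursive-descent evaluator that treats every license named in the recipe as 1 (True) and every other license as 0 (False). The function names are hypothetical, and only AND, OR and parentheses are supported.

```python
import re

TOKEN = re.compile(r"\(|\)|[^\s()]+")
OPERATORS = {"(", ")", "AND", "OR"}


def literals(expression):
    """Return the set of license names used in an AND/OR expression."""
    return {token for token in TOKEN.findall(expression) if token not in OPERATORS}


def evaluate(scanned, recipe):
    """Evaluate `scanned` with every license from `recipe` set to True (1)
    and every other license set to False (0)."""
    chosen = literals(recipe)
    tokens = TOKEN.findall(scanned)
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            right = parse_and()  # always consume the operand before combining
            result = result or right
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "AND":
            pos += 1
            right = parse_atom()
            result = result and right
        return result

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            inner = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return inner
        if token in OPERATORS:
            raise ValueError("unexpected token: " + token)
        return token in chosen  # "1" if set in the recipe, "0" otherwise

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result


print(evaluate("MIT AND (GPL-2.0 OR BSD-3-Clause)", "MIT AND BSD-3-Clause"))  # True
print(evaluate("MIT AND (GPL-2.0 OR BSD-3-Clause)", "BSD-3-Clause"))          # False
```

Because nothing is ever handed to the Python interpreter, a malformed or malicious license string can at worst raise a ValueError.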
Further improvements
The SPDX specification offers things like
- GPL-2.0+ - so any GPL version starting from 2.0
- GPL-2.0-or-later - which is pretty much the same as the above
Now think of an extracted license string like this
GPL-3.0 AND (MIT AND GPL-2.0+)
Wouldn't it be enough to set
GPL-3.0 AND MIT
in the recipe?
I think that is the case, but IANAL.
So we need to make the tool aware that "GPL-2.0+" can have multiple meanings.
If we "explode" each of these into all their possibilities "OR"ed together, like
GPL-3.0 AND (MIT AND (GPL-2.0 OR GPL-3.0))
then it would be possible for the algorithm to see that the chosen setting is the correct one.
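A rough sketch of such an "explode" step, under the assumption that the tool knows which versions exist for each license family. The `KNOWN_VERSIONS` table below is a hypothetical, deliberately tiny stand-in for such a list, and the function names are my own.

```python
import re

# Hypothetical version table for this sketch; a real tool would need a
# complete list for every license family it wants to handle.
KNOWN_VERSIONS = {"GPL": ["1.0", "2.0", "3.0"]}

# Matches terms such as "GPL-2.0+" or "GPL-2.0-or-later".
OR_LATER = re.compile(
    r"^(?P<family>[A-Za-z]+)-(?P<version>\d+(?:\.\d+)*)(?:\+|-or-later)$"
)
TERM = re.compile(r"[^\s()]+")


def version_key(version):
    """Turn "2.0" into (2, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def explode_term(term):
    """Replace an "or later" term by all known versions OR-ed together."""
    match = OR_LATER.match(term)
    if match is None or match["family"] not in KNOWN_VERSIONS:
        return term  # plain license, operator, or unknown family: keep as-is
    family = match["family"]
    minimum = version_key(match["version"])
    later = [v for v in KNOWN_VERSIONS[family] if version_key(v) >= minimum]
    if not later:
        return term
    return "(" + " OR ".join(f"{family}-{v}" for v in later) + ")"


def explode_expression(expression):
    """Apply explode_term to every term of a license expression."""
    return TERM.sub(lambda m: explode_term(m.group(0)), expression)


print(explode_expression("GPL-3.0 AND (MIT AND GPL-2.0+)"))
# GPL-3.0 AND (MIT AND (GPL-2.0 OR GPL-3.0))
```

The exploded string can then be evaluated against the recipe setting in the same way as any other extracted expression.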
Conclusion
With the help of the tools lc, license-expression and some glue logic it is possible to check the correct licensing of any source code base in YOCTO.
I know you already expected it
You can play around with the resulting code, using my meta-sca from GitHub on your very own YOCTO based project. Please see here and here for more details.
Wait, if you want to do it even more professionally
The approach I've shown in this blog post is suitable for occasional scanning of a code base to catch licensing issues.
If you want to do this type of scanning more frequently: the standard poky layer ships with built-in support for your own fossology server, which gives you far better performance than my approach, as it uses caching and hashing to avoid scanning an entire code base over and over again.
Be sure to check out this brilliant slide set if you'd like to know more about fossology and YOCTO.
Any thoughts?
If you liked this post or have any additions, comments, and so on, be sure to use the comments below or over at LinkedIn.
* Before you start screaming - I'm well aware of the fact that eval is a security nightmare and shouldn't be used, so suggestions for improvements are welcome.