
Size really matters

Let that title settle... and now let's get to the more serious issue behind it :-).

The issue

When you use bitbake layers in a cloud-based setup, you usually clone them on the fly, which means a full clone of each repository - and that can be highly expensive (just look at the size of the linux kernel git repo, for instance).
As cloud-based setups mostly don't offer a good way to cache those resources, unless you invent something yourself or pay for it, every bit counts.
Not only in terms of time, but also in terms of the resulting cost.

The meta-sca layer I maintain has grown quite a lot over time, so the repository became very, very large.
That is partly because I made the mistake in the past of putting large blobs (in this case tarballs) into the repository.
I learned that lesson, but I cannot undo it - as we all know, every published git revision should stay untouched for all eternity. This is mainly due to the linked-list nature of git: if I change one commit at the bottom of the history, I alter every commit that relies on it.
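
You can see that chaining yourself: a commit object stores the hash of its parent commit, so rewriting any ancestor necessarily changes every hash that follows (a sketch; the hashes here are placeholders):

    $ git cat-file -p HEAD
    tree <hash of the tracked directory content>
    parent <hash of the previous commit>
    author ...

Change the parent and the commit's own hash changes - and with it the hash of every later commit.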

So, bottom line - once published, a revision will stay as it is forever.

Size matters

Which brings us to the question: how can we reduce the size of a git clone for the setup mentioned above, without altering the history of the git repo?

There is basically just one well-known option: shallow clones.
A shallow clone fetches the repository at a given revision without most of its history.
Basically those are like the tarball or zip downloads you know from GitHub or GitLab - plain copies of the repository at the point in time of the given revision.
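
With plain git, such a shallow clone of a single release could look like this (a minimal sketch; the tag name is just a placeholder):

    # fetch only the commit behind one release tag, without history
    git clone --depth 1 --branch <release-tag> \
        https://github.com/priv-kweihmann/meta-sca.git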

This concept is pretty well established and quite heavily used in various setups.

...or does technique?

BUT there is an alternative way - for instance, for all the tools and workflows that don't support shallow clones.

Luckily I use tags (aka releases or versions) in my repository.

After various people asked me how to reduce the size of meta-sca, I came up with the following approach:

  • check out each release of meta-sca
  • create a diff between consecutive releases and apply it as a single commit to a different repository

This strips all the "noise" between releases and builds up a fairly small repo one can use for the cloud-based setup mentioned above.

And now enter the nerdy stuff...

And as I like to automate stuff, I wrote a script that does exactly that.
The script loops over all branches and tags and extracts the diffs between releases.
In addition it removes everything that isn't needed in a "minified" environment, such as extensive documentation, CI scripts, testing scripts and so on.
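
The actual script is a bit longer, but a minimal sketch of the core idea could look like the following - the paths, the pruned directories, and the use of sort -V for tag ordering are all assumptions for illustration:

    #!/bin/sh
    # Sketch: squash each release of SRC into one commit in DST.
    # Assumes DST is an already initialized, empty git repository.
    SRC=meta-sca
    DST=meta-sca-minified

    for tag in $(git -C "$SRC" tag --list | sort -V); do
        # start from a clean tree so files removed upstream don't linger
        find "$DST" -mindepth 1 -maxdepth 1 ! -name .git -exec rm -rf {} +

        # materialize the tagged tree into the minified repo
        git -C "$SRC" archive "$tag" | tar -xf - -C "$DST"

        # drop what isn't needed in a "minified" environment
        rm -rf "$DST/docs" "$DST/.github" "$DST/tests"

        # one squashed commit per release, re-tagged with the same name
        git -C "$DST" add -A
        git -C "$DST" commit -m "release $tag"
        git -C "$DST" tag "$tag"
    done

As each release becomes exactly one commit, any blob that never appears in a release tree simply never enters the new repository.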

The result is astonishing.


As I encourage everyone to use release versions only anyway, it's pretty clear what to choose in your CI/CD cloud setup to save some money and time: the minified repo, pinned to a release tag.
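
For instance, a CI job could fetch it with a shallow clone on top (again a sketch; the tag name is a placeholder):

    # tiny repo + shallow clone = minimal download per CI run
    git clone --depth 1 --branch <release-tag> \
        https://github.com/priv-kweihmann/meta-sca-minified.git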

You can find the resulting repo at https://github.com/priv-kweihmann/meta-sca-minified.
And let me know how it works for you!
