Too many CPUs to build
Lately, we’ve been facing strange build errors on one of our build servers: building a Yocto-based firmware sometimes failed. While the error message was clear - bitbake’s git fetcher was unable to pull sources from a remote git server - it was less clear what caused the problem. The error message from bitbake indicated a connection reset. Cloning the affected repository manually worked perfectly fine, and building the failed target by running bitbake manually also worked.
When the issue first occurred, we thought it was a network glitch and didn’t pay much attention. But the issue kept recurring. With more failures, we were able to characterize the problem:
- Incremental rebuilds were the most likely to fail
- Full builds with an empty sstate-cache never failed
- Full builds with a populated sstate-cache failed often
- Manual rebuild of a single failed target always worked
- Only recipes with AUTOREV were affected
- Only recipes that fetch from a specific git server using ssh were affected
In the meantime, we suspected the connection to the git server - but none of our tests indicated a problem. Reaching the git server in question involves another ssh jump host and a VPN connection to a customer. So there were quite a few components which were not under our control.
Finally, setting the log level of the ssh jump host to verbose gave us the crucial clue, as it logged:
drop connection #10 from hidden:48324 on hidden:22 past MaxStartups
So, we’ve been overrunning the jump host by reaching more than 10 concurrent unauthenticated connections. This was the reason behind the strange build errors!
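To make the limit more tangible, here is a minimal sketch of how sshd decides whether to drop a new connection when MaxStartups is given in its start:rate:full form, as described in sshd_config(5). The defaults 10:30:100 used below are OpenSSH's documented defaults and only an assumption about our jump host; the function name and the integer rounding are illustrative and not taken from the jump host's actual setup.

```python
def drop_probability(unauthenticated, start=10, rate=30, full=100):
    """Return the chance (in percent) that sshd refuses a new connection.

    unauthenticated: number of connections that have not yet authenticated.
    start, rate, full: the three fields of MaxStartups (assumed defaults).
    """
    if unauthenticated < start:
        return 0          # below the threshold, nothing is dropped
    if unauthenticated >= full or rate == 100:
        return 100        # at or past the hard limit, everything is dropped
    # Between start and full the probability grows linearly from rate to 100.
    return rate + (100 - rate) * (unauthenticated - start) // (full - start)


if __name__ == "__main__":
    for pending in (4, 10, 32, 64, 100):
        print(pending, "pending ->", drop_probability(pending), "% dropped")
```

With these assumed defaults, a build that opens ten or more connections at the same time starts losing some of them at random, which matches the sporadic nature of our failures.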
Due to Yocto’s parallelism, it ran git ls-remote in parallel and established at least one connection per CPU.
The affected build server has way more than 10 CPUs.
This explains why only our most powerful server was affected, and why only rebuilds or builds with a populated sstate-cache sometimes failed.
All other builds ran enough other tasks in between to never reach the connection limit.
The moral of the story?
Always make sure that you allow more parallel connections than you have CPUs.
In our case, adjusting sshd’s MaxSessions and MaxStartups settings fixed the problem.
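For illustration, such a change on the jump host might look like the fragment below. The numbers are assumptions chosen for a build server with up to 64 CPUs, not the values we actually deployed. MaxStartups limits concurrent unauthenticated connections, while MaxSessions limits the sessions allowed per network connection.

```
# /etc/ssh/sshd_config on the jump host (illustrative values)
# start:rate:full - start dropping 30% of new connections at 64
# unauthenticated connections, refuse all of them at 128
MaxStartups 64:30:128
# sessions permitted per network connection
MaxSessions 64
```

sshd has to reload its configuration before the new limits take effect.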