Some time ago we fixed issues with multi-thread building in SConscripts. Since then, we have encountered strange, flaky compilation errors occurring randomly in about 5% of our builds. We couldn’t leave it that way – our Windows CI is used by many developers working on multiple parts of the system and we must ensure that it is 100% stable.

We all know those pesky flaky bugs that won’t let us sleep peacefully. A menace to developers, they are happening once a week and are impossible to catch and debug. Our team deals with them too, and since we were able to find the source of the problems and fix it, we want to share our new-found knowledge with you.

Tungsten Fabric (TF) (previously known as OpenContrail) is a network virtualization software developed as an open source project. It consists of two main components: the TF Controller and the TF vRouter, the latter of which runs on a compute node and provides virtual network connection. The main responsibility of the TF Controller is to manage, control and collect statistics from the compute nodes running TF vRouter instances.

Our team works on porting a vRouter component for Microsoft Windows to enable it to be used as a TF compute node.

Tungsten Fabric consists of multiple modules built using SCons – a cross-platform software construction tool which makes it possible to automate a build process using Python scripts – the SConscript files.

The bug encounter

Some time ago we fixed issues with multi-thread building in SConscripts. This allowed us to speed up the compilation process in our continuous integration system and on developers’ machines. Since then, we have encountered strange, flaky compilation errors that have proven very problematic. They occurred randomly in about 5% of our builds. Let’s say we compile 1000 files for every build – that gives us one failure every 20,000 files. We couldn’t leave it in that state for a long time – our Windows CI is used by many developers working on multiple parts of the system and we aim to make it 100% stable. We couldn’t just disable a multi-threaded building, because it would massively slow down the build process (two hours in case of a single threaded vs 20 minutes in case of a multi-threaded build). The only way to resolve the issue was to find and fix the problem as soon as possible.

We started our investigation by looking at build scripts’ fragments that were involved in processing files we have found in our CI error logs. After investigating multiple cases, we noticed that the errors occur for files autogenerated by Python.

In case of log1, the file css_DT_bootstrap_css.cpp is created by an SConscript and then compiled. As we see, there are some missing identifiers, so we know that the file has been generated incorrectly. Let’s look at the simplified code that’s responsible for that:

In this case, the line with xxd (which generates two variables – an array and it’s length) failed silently, since there is no exit code checking. It definitely should be added, but in case of our investigation it would change nothing – we’d just see an error a bit earlier. We had no idea why this line failed – maybe it failed while accessing the input or the output file. Let’s move on and look at another compilation log:

In this case, multicast_html.cpp is generated roughly the same way as css_DT_bootstrap_css.cpp. That’s why it’s rather surprising that we get a “No such file or directory” error. A quick check reveals that the file exists, but could it have been created after the compiler tried to use it? That would mean a bug in SCons because the compiler shouldn’t be started before the file is done being generated. Since we weren’t able to find the bug just by looking at build logs and code, and we weren’t sure where the bug actually was – in our code, in CPython, in some third-party library or in the OS itself – we decided to add filesystem and process activity monitoring.

Debugging with ProcMon

The most commonly used tool for file system and process activity monitoring is ProcMon from Microsoft SysInternals. We should be able to see which processes use which files and exactly when that happens. One thing you should know is that ProcMon logs are very big, so we can’t start ProcMon once and keep building Tungsten Fabric until it fails, because processing that many logs would take hours. But nor can we manually start both ProcMon and building because the bug is too rare. For those reasons we decided to write a short script to automate this process:

After starting this script and waiting for a few hours we got our logs for one failed build (still almost 9GB). We excluded unimportant stuff, and the result can be seen below:

Image1: Filtered ProcMon logs showing operations on a single file.

  1. The first write to the file came from the 2nd line of the SCons code snippet
  2. Then it was opened by cmd because we used shell redirection (>>)
  3. The second write is done by xxd – the 4th line
  4. The third write came from lines 7 and 8 (as one write because of flushing)
  5. After that it’s probably scanned by SCons
  6. Then it was passed to cl.exe that tried to compile it but failed
  7. Cmd.exe (different than the one from point 2) closed the file

The first thing we noticed is that “No such file or directory” shown by cl.exe actually means a file sharing violation has occurred (but more on that later). The next thing is that Python opened more file handles than it closed. The highlighted line shows a CreateFile (which is used to open files as well as create them) that is not followed by a CloseFile. It looks like that handle was closed by cmd.exe way later but that’s actually not possible. The same process must both open and close a handle!Even if you spawn a new process and it inherits that handle, both copies of the handle have to be closed separately.

Sharing modes explained

Let’s explain how sharing a file between processes actually works. According to MSDN, ShareMode (parameter from CreateFile) set to 0 means no one else can have a handle to that file opened. If some process has already opened this file and another process tries to open it with ShareMode set to 0, then it will fail due to a “sharing violation”. ShareMode set to “Read” means other processes can open this file only for reading and the same goes for “Write”.

Back to the screenshot: in the highlighted line, Python opened the file for writing and later cl.exe tried to open the file with ShareMode set to “Read”. Since the second call failed with a “sharing violation”, it looks like the first handle was not closed – just as ProcMon shows. But as we previously stated, handles can be copied between processes but not moved (as you can with files on a disk).

Debugging with Frida

We were almost out of ideas. We needed something more powerful, something that would allow us to do more flexible and scriptable system calls monitoring. We discussed this problem with our friend redford, who suggested some possible next steps and showed us an awesome tool, Frida, a free, scriptable, dynamic instrumentation toolkit that allows one to inject scripts into processes for hooking, monitoring and tracing function calls. Frida actually injects a V8 JavaScript interpreter into the process being monitored. As you can see below, we used Python as the main API but the script that’s run inside the process has to be written in JS.

We created a simple script monitoring CreateFile and CloseHandle calls, injected it into python.exe and discovered that Python in fact calls CloseHandle, and it succeeds. After a short testing session with two simple applications that open, duplicate and close handles, we discovered that ProcMon shows only the last CloseHandle call for each handle (all duplicated/inherited handles refer to a single kernel object, and ProcMon reports the destruction of this object). This fact in combination with the closing of our handle by cmd.exe means that Python’s child processes sometimes inherit its file handles.

Reproducing the bug

The next thing we did was to analyze CreateFile and CreateProcess API calls. As image2 shows, the time needed to open the multicast_html.cpp file (13:12:21:8898394) and the time needed to create the cmd.exe process (13:12:21.9087108) – which inherited the handle – were nearly identical. After collecting multiple logs of failed builds we realized that this was true in every case: when the handle was leaked,  there was a CreateProcess call shortly after the handle was created. It looked like a race condition in CPython’s implementation of creating processes or an issue in Windows API.

Image2: ProcMon logs showing CreateFile and CreateProcess API calls.

We debugged SCons and realized it uses os.spawnve for running compiler processes. os.spawnve is implemented via _spawnve from Windows API. To check if there’s something wrong with it, we wrote a simple piece of code which infinitely opens and closes a file in one thread and uses os.spawnve to run cmd.exe in the second thread:

If you monitor handles with ProcMon while running this script, you’ll see cmd.exe inheriting many handles for “sample.txt”. This should not be happening. At this point we were finally sure that we were not crazy and that there’s definitely something wrong outside of our code.

To check if it’s a problem in the Windows API, we rewrote this code using a (subprocess actually uses CreateProcess underneath) but the problem persisted. That meant the bug is probably in CPython.

Resolving the problem

The logical next step was to run that code on the newest Python 3 and, as it turned out, the bug disappeared (for both spawnve and the subprocess). We realized the best solution would be to rewrite our build system to Python 3 and see if that helps. Fortunately it did, and we were able to run 200 builds without a single failure.

The issue was solved and we could have just forgotten about it but decided to actually find the commit that fixed it. After a lot of rummaging through the Internet and a binary search through the commits, we finally found it. This is the exact line that fixed this bug (for Python 3.4 and newer) and here you can find the PEP (Python Enhancement Proposal) that describes it. As the authors have said, “we are aware of the code breakage this is likely to cause”. So backporting this fix to Python 2 is rather impossible.

Image3: Building old Python versions requires Visual C++ 2008 which isn’t supported on Windows 7 and newer platforms so we decided to use Windows XP (fun!)

Our build system in CI was fixed and now it’s 100% stable. Now, even though Python 2 is still maintained, this bug is a one more reason to migrate to Python 3 as soon as possible.