Kernel panic during build #2960
I'll add again here that I've seen this same panic on three different machines, two AUFS and one device-mapper. Was running 3.11 on one of the AUFS machines, and 3.10.17 on the other two (but curiously enough didn't see it on 3.10.7 - perhaps that was just good luck, because it doesn't seem to be necessarily consistent). I'm configuring one of my machines for kdump right now, so hopefully I'll get a kdump next time.
I see this regularly too on container exit:
@alexlarsson Do you have any insights on this issue?
@crosbymichael There is a potential fix linked to in the redhat bug above, I have not had time to try it though.
I tried to reproduce this, but it's kinda hard. I wrote a script that launches lots of containers, but on a freshly booted machine it totally fails to trigger this. However, on my devel workstation, which had been running for a few days, it triggered almost instantly when I ran the script. So, it seems like it's a combination of a namespace exit and something else.
I'm having a similar issue, although not totally sure if it's the same crash since I'm not sure how to capture the oops. Nothing shows up in the logs after reboot, and kdump looks pretty scary to try to set up. I'm on gentoo, using device-mapper, and running the 3.12.7-gentoo kernel which already has the fix in the linked redhat bug. I regularly get the crash when running a 'docker build' command.
https://bugzilla.redhat.com/show_bug.cgi?id=1015989#c18 mentions a possible fix that is in 3.13-rc1, so it would be interesting to know if anyone sees this on 3.13.
Unfortunately it doesn't seem to have helped much. I'm on 3.13.1-gentoo which definitely has that fix, and just experienced a crash.
@mschulkind do you have a capture of the panic you got?
I don't. I'm not totally sure how to capture it when the panic happens while X is running. I can try to reproduce without X running though.
I'm hitting this on 3.13.7, on Arch Linux, btrfs + native. I've hit this 3 times just today, a true flow killer :( It's a bit unpredictable though. I haven't been able to correlate it with high load, or low memory, or with any particular build step in my app. Are there any particular bits of information from the trace that I should be looking to gather the next time this happens? Should I look into how to enable kdump?
I'm not really a kernel guy so I don't know how to debug this further. I know one thing though: if we could figure out a way to reproduce it more easily, I could get the right people to look at it. However, I've had a hard time reproducing this. It seems to be triggered on container exit, so I created some scripts that just spawned lots of containers. However, the script could run for thousands of containers on my newly booted laptop without any sign of the crash. Then I ran it on my desktop, which had been running for at least a week doing a lot of random stuff. It crashed in less than 100 container exits. Then I rebooted and ran it again, but was unable to get any crash... So, it is triggered by container exits in combination with something else, and we need to figure out what. It seems so unpredictable though... The only thing I've seen is that it seems more likely to happen if I'm running something playing audio in the background.
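The brute-force stress loop described above is easy to sketch. Everything container-specific here is an assumption (in real use the command would be something like `docker run --rm busybox true`), so the container command is passed in as a parameter and the loop structure itself can run anywhere:

```shell
# spawn_exits N CMD... — start N copies of CMD in parallel, then wait
# for all of them to exit. Each real container exit tears down a
# network namespace (and its conntrack state), the suspected trigger.
spawn_exits() {
    count="$1"; shift
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &             # one short-lived "container"
        i=$((i + 1))
    done
    wait                   # reports suggest the panic follows the exits
}

# Real-world use (assumes a docker daemon and the busybox image):
#   spawn_exits 1000 docker run --rm busybox true
```

Per the comments above, the loop alone isn't enough on a fresh boot; the host apparently needs to have been up and busy for a while before the exits trigger the panic.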
I asked in the docker meeting today for people who have seen this, and a bunch of people had never seen it and some had. One thing that seemed to be consistent with not seeing the panic is running the kernel in a VM. So, maybe this only triggers on bare metal.
For those on Arch that are hitting this and may want a workaround: I'm now using the linux-lts kernel from the official repo (3.10.34-1-lts), and haven't bumped into this issue yet. Interesting point about the virtualization... It might have to do with some code path that is not taken via the virtualization drivers, or with the virtualization overhead not allowing the same levels of concurrency. This weekend I will try to come up with some way to reproduce this consistently.
Maybe virtualisation itself is not the common factor, but people may be shutting down their virtual machines more often? VMs can be quite hungry on resources and if I don't need them, I'll shut them down. As @alexlarsson mentioned, the problem occurred quite soon on a computer that has been running for a longer period.
I saw some commits in 3.14 that I was hoping would resolve this. Unfortunately, after upgrading a machine to 3.14 to test that hypothesis, it seems like that's not the case. Still seeing the same race condition.
@rohansingh Have you been able to reproduce this consistently? I'm not seeing this behavior anymore (with the same 3.13.7 from before), even while actively trying to trigger it.
@Riccieri Fairly consistently. To clarify, I haven't been actively trying to repro. Instead I'm just monitoring a machine that other users are using to do builds, largely during business hours. Previously it was running 3.13.0, and is now at 3.14.0. I see the machine reboot due to this issue every few hours, and so far three times today since upgrading to 3.14.
To build on @rohansingh's comment, this happens on both VMs and real hardware for us, and quite consistently on both types of machines.
This is a hard panic to debug because it happens in the nf_conntrack destroy path. Can anybody recreate with kdump enabled and provide a kdump? That would probably improve our chances of fixing it.
I will look into enabling kdump on my dev machine (which is where I was getting the panic before), so that if I'm lucky (!!!) enough to stumble on the crash again, I'll be able to report back with more info.
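For anyone else setting this up: the rough kdump recipe on a Debian/Ubuntu-style host looks like the sketch below. Package names, the crashkernel reservation size, and dump paths vary by distro and RAM, so treat every value here as an assumption to check against your distro's documentation.

```
# install crash-dump tooling (Ubuntu metapackage; on other distros
# install kexec-tools + makedumpfile directly)
sudo apt-get install linux-crashdump

# reserve memory for the capture kernel by adding to the kernel
# command line (e.g. in GRUB_CMDLINE_LINUX_DEFAULT):
#   crashkernel=256M
sudo update-grub && sudo reboot

# after reboot, verify the capture kernel is loaded:
cat /sys/kernel/kexec_crash_loaded   # 1 means kdump is armed

# the next panic should then leave a dump under /var/crash/
```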
OK, so we also get kernel panics using anything later than docker 0.7.2. Just tested 0.9.1 and still get the panics. 0.7.2 is fine. Is anyone on the docker team able to verify this please?
@joelmoss Can you elaborate please? Does the panic occur on build or container exit or elsewhere? Also what platform and kernel release are you running? Thanks!
The problem is that we have been unable to pin down when exactly it happens, so I couldn't say what action causes the panic. We are on Ubuntu 13.10 (GNU/Linux 3.11.0-15-generic x86_64)
@joelmoss Please update the system and the kernel. It's likely that you might be running into a kernel bug. Ubuntu has updated packages for the 3.11.0 kernel and you can install them by installing updates.
Thanks @joelmoss - Can you capture any output with kdump?
We've seen this on the 3.13 and 3.14 kernels provided by Ubuntu as well, so if @joelmoss is hitting the same issue, upgrading is unlikely to help. @joelmoss Even if you don't have a kdump output, do you have the stacktrace from the system log so that we can verify that it's the same issue?
@rohansingh I haven't said that you're not encountering the issue on 3.13 and 3.14. I've only said that upgrading to the latest 3.11 kernel packages and keeping the system up to date is a good idea. I'm using the latest 3.11 on some systems and I'm not running into this particular problem, that's why I've recommended it.
By the way, here is a text version of a similar stacktrace to complement the screenshot above:
@rohansingh Could you provide more details about the host where you can reproduce this? I couldn't reproduce this so far, so I think getting the system into the right state to make it panic during a build is related to some sequence of events which isn't very common. If you have a sequence of steps you follow to get it to crash, could you let us know how to reproduce this, please?
@unclejack In terms of hardware and network setup, this is a paravirtual machine on EC2. The general procedure we have for reproducing this is to kick off a build process that starts 16 parallel containers to run various integration tests. The issue occurs intermittently, around two minutes after the containers are stopped. Unfortunately the situation isn't that great in terms of reproducibility, in that it's tied up with a bunch of internal code and build tools. Right now I'm trying to simplify that down to a simple script for reproducing the issue, which I hope to finish and be able to provide in the next couple days.
I've been seeing this error too, outside of docker, with plain ol lxc. So far it seems to be a combination of SMP, lxc and using nat over a bridge(?). I think I have an idea of what's going on, but due to local hardware issues, I'm unable to get a kernel dump. Does someone have a recent one around that I can take a look at?
@rohansingh Did you make progress with building something to be used to reproduce this problem?
@unclejack Negative. Currently unable to reproduce outside of a specific set of EC2 instances, and not for lack of effort.
I'm now able to consistently reproduce this issue on physical hardware and produce a kernel crash dump by running part of a build process for a non-public project. The next step is to isolate what exactly we're doing in that project that causes this and produce a shareable crash dump that doesn't contain proprietary data.
This happened right when I did a "docker version":

$ docker version
Client version: 0.11.1
Client API version: 1.11
Go version (client): go1.2
Git commit (client): fb99f99
Server version: 0.11.1
Server API version: 1.11
Git commit (server): fb99f99
Go version (server): go1.2
Last stable version: 0.11.1

$ docker info
Containers: 3
Images: 29
Storage Driver: devicemapper
Pool Name: docker-8:19-19268241-pool
Data file: /var/lib/docker/devicemapper/devicemapper/data
Metadata file: /var/lib/docker/devicemapper/devicemapper/metadata
Data Space Used: 3165.4 Mb
Data Space Total: 102400.0 Mb
Metadata Space Used: 3.0 Mb
Metadata Space Total: 2048.0 Mb
Execution Driver: native-0.2
Kernel Version: 3.12.13-gentoo

$ uname -a
Linux minas-morgul 3.12.13-gentoo #2 SMP Mon May 12 10:07:16 MDT 2014 x86_64 AMD Phenom(tm) II X6 1090T Processor AuthenticAMD GNU/Linux
@rohansingh any progress on your efforts of isolating the root cause?
I am still getting this crash. The host is a Xen VM as far as I know, and this did NOT happen during a build... Any ideas how to fix this? It's happening with Ubuntu 14; I would like to know which patches we need to push upstream for a fix.

Update: I think this might be the upstream kernel bug: https://bugzilla.kernel.org/show_bug.cgi?id=65191

Other trackers:
There is no fix yet apparently :( |
I had this too, solved by going to 3.10.34.
@gdm85 Some progress, but nothing quite useful yet. Note that I'm no longer working on this issue personally, but have a teammate who is. Here are our findings so far:
Apologies for not having anything more concrete, but we're still working on it.
Have you tried with 3.14.x? I used to have this almost once a day, and now it hasn't happened to me in months (with no change in workflow). Of course that doesn't mean the bug is fixed, but it might at least have become less likely to trigger on later kernels. EDIT: @rohansingh how long it usually takes for you to hit a failure with |
@rohansingh thanks for your feedback; it is indeed a blocker for any production usage. The only workaround I can think of is to somehow serialize the killing of containers, to reduce the overlap of multiple conntrack cleanups... but this would be just a hack, and not even guaranteed to completely address the issue.
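That serialization hack can be sketched as a small shell helper. `stop_one` is a hypothetical caller-supplied function (in real use it would wrap `docker stop`), and the delay is a guess at how long one netns/conntrack teardown needs:

```shell
# stop_serially DELAY ID... — stop containers one at a time, sleeping
# DELAY seconds between stops so the cleanups don't overlap.
stop_serially() {
    delay="$1"; shift
    for id in "$@"; do
        stop_one "$id"     # caller-supplied, e.g. wrapping `docker stop`
        sleep "$delay"     # let one netns teardown finish first
    done
}

# Real-world use (assumption: a docker daemon is running):
#   stop_one() { docker stop "$1"; }
#   stop_serially 5 $(docker ps -q)
```

As the comment says, this only narrows the race window; it doesn't remove the underlying bug.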
Though I'm not a docker user, we were also seeing the same with libvirt+lxc. Workaround:
There is now a (tentative) patch upstream. If somebody is already compiling their kernel, maybe they can give this a spin?
I can confirm the patch posted on the upstream bug prevents the crash with a pure-lxc test case.
@rsampaio nice to hear that! I patched the kernel for Ubuntu 14.04 LTS and I am going to publish Dockerfiles and .debs shortly.
For people interested in testing the first and second of the two patches available upstream: patched Ubuntu .deb packages are in release v0.1.0 and release v0.2.0. You can build the same packages I did by using this script to debootstrap Trusty and then my Dockerfile for a kernel builder image. UPDATE: I have now built both patched kernels and I am testing the second with intense container start/kill.
Been running the patched kernel for 12 days now; I confirm the issue has gone. Now if upstream would merge that patch, this bug could be closed and the pressure would be on individual distro maintainers instead.
As mentioned on the redhat tracker (https://bugzilla.redhat.com/show_bug.cgi?id=1015989#c30), the fix for this is finally in the upstream kernel source! (https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=945b2b2d259d1a4364a2799e80e8ff32f8c6ee6f) 💃
Will be in the next round of stable kernel releases, so you can mark this one closed.
@gregkh Thanks!
I had at least two kernel panics with stable docker versions (0.6.7, 0.7). I am running 3.12.0 with genpatches and aufs-patches.