
Kernel panic during build #2960

Closed
Tranquility opened this issue Nov 29, 2013 · 53 comments

@Tranquility
Contributor

I have had at least two kernel panics with stable Docker versions (0.6.7, 0.7). I am running kernel 3.12.0 with genpatches and AUFS patches.

[photo of the kernel panic screen: img_20131128_104318]

@tianon
Member

tianon commented Nov 30, 2013

I'll add again here that I've seen this same panic on three different machines, two AUFS and one device-mapper. Was running 3.11 on one of the AUFS machines, and 3.10.17 on the other two (but curiously enough didn't see it on 3.10.7 - perhaps that was just good luck, because it doesn't seem to be necessarily consistent). I'm configuring one of my machines for kdump right now, so hopefully I'll get a kdump next time.

@alexlarsson
Contributor

I see this regularly too on container exit:
https://bugzilla.redhat.com/show_bug.cgi?id=1015989

@crosbymichael
Contributor

@alexlarsson Do you have any insights on this issue?

@alexlarsson
Contributor

@crosbymichael There is a potential fix linked in the Red Hat bug above; I have not had time to try it yet, though.

@alexlarsson
Contributor

I tried to reproduce this, but it's kind of hard. I wrote a script that launches lots of containers, but on a freshly booted machine it totally fails to trigger this. However, on my development workstation, which had been running for a few days, it triggered almost instantly when I ran the script. So it seems to be a combination of a namespace exit and something else.
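
For reference, a minimal sketch of the kind of container-churn loop described above (the image name, iteration count, and use of --rm are my own assumptions, not the actual script):

#!/bin/sh
# Hedged sketch: exercise the container-exit / network-namespace cleanup path
# by starting and stopping many short-lived containers. Assumes a running
# docker daemon and a small image such as busybox.
i=0
while [ "$i" -lt 1000 ]; do
    docker run --rm busybox true || exit 1
    i=$((i + 1))
done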

@mschulkind

I'm having a similar issue, although I'm not totally sure if it's the same crash, since I don't know how to capture the oops. Nothing shows up in the logs after reboot, and kdump looks pretty scary to try to set up.

I'm on Gentoo, using device-mapper, and running the 3.12.7-gentoo kernel, which already has the fix from the linked Red Hat bug. I regularly get the crash when running a 'docker build' command.

@alexlarsson
Contributor

https://bugzilla.redhat.com/show_bug.cgi?id=1015989#c18 mentions a possible fix that is in 3.13-rc1, so it would be interesting to know if anyone sees this on 3.13.

@mschulkind

Unfortunately that doesn't seem to have helped much. I'm on 3.13.1-gentoo, which definitely has that fix, and just experienced a crash.

@pnasrat
Contributor

pnasrat commented Feb 28, 2014

@mschulkind do you have a capture of the panic you got?

@mschulkind

I don't. I'm not totally sure how to capture it when the panic happens while X is running. I can try to reproduce without X running, though.

@renato-zannon
Contributor

I'm hitting this on 3.13.7, on Arch Linux, with btrfs + the native driver. I've hit it three times today alone, a true flow killer :(

It's a bit unpredictable though. I haven't been able to correlate with high load, or low memory, or with any particular build step in my app.

Are there any particular bits of information from the trace that I should be looking to gather the next time this happens? Should I look into how to enable kdump? Is this the correct place to ask for help about this? :)

@alexlarsson
Contributor

I'm not really a kernel guy, so I don't know how to debug this further. I do know one thing though: if we could figure out a way to reproduce it more easily, I could get the right people to look at it.

However, I've had a hard time reproducing this. It seems to be triggered on container exit, so I created some scripts that just spawned lots of containers. However, the script could run for thousands of containers on my newly booted laptop without any sign of the crash. Then I ran it on my desktop, which had been running for at least a week doing a lot of random stuff. It crashed in less than 100 container exits. Then I rebooted and ran it again, but was unable to get any crash...

So, it is triggered by container exits in combination with something else, and we need to figure out what. It seems so unpredictable though... The only thing I've seen is that it seems like it's more likely to happen if I'm running something playing audio in the background.

@alexlarsson
Contributor

I asked in the Docker meeting today who had seen this; a bunch of people had never seen it, and some had. One thing that seemed to be consistent with not seeing the panic was running the kernel in a VM. So maybe this only triggers on bare metal.

@renato-zannon
Contributor

For those on Arch who are hitting this and want a workaround: I'm now using the linux-lts kernel from the official repo (3.10.34-1-lts) and haven't bumped into this issue yet.

Interesting point about the virtualization... It might have to do with some code path that is not taken via the virtualization drivers, or it might be that the virtualization overhead doesn't allow the same level of concurrency.

This weekend I will try to come up with some way to reproduce this consistently.
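
For anyone wanting to try the same workaround, a rough sketch of switching to the LTS kernel on Arch (assumes GRUB; adjust for your bootloader):

# Hedged sketch: install the Arch linux-lts kernel and regenerate the GRUB config.
sudo pacman -S linux-lts linux-lts-headers
sudo grub-mkconfig -o /boot/grub/grub.cfg
# reboot and choose the LTS entry from the boot menu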

@thaJeztah
Member

Maybe virtualisation itself is not the common factor, but people may be shutting down their virtual machines more often? VMs can be quite hungry on resources, and if I don't need them, I'll shut them down. As @alexlarsson mentioned, the problem occurred quite quickly on a computer that had been running for a longer period.

@rohansingh

I saw some commits in 3.14 that I was hoping would resolve this. Unfortunately, after upgrading a machine to 3.14 to test that hypothesis, it seems like that's not the case. Still seeing the same race condition.

@renato-zannon
Contributor

@rohansingh Have you been able to reproduce this consistently? I'm not seeing this behavior anymore (with the same 3.13.7 from before), even while actively trying to trigger it.

@rohansingh

@Riccieri Fairly consistently. To clarify, I haven't been actively trying to repro. Instead I'm just monitoring a machine that other users are using to do builds, largely during business hours. Previously it was running 3.13.0, and is now at 3.14.0.

I see the machine reboot due to this issue every few hours, and so far three times today since upgrading to 3.14.

@eandre

eandre commented Apr 3, 2014

To build on @rohansingh's comment, this happens on both VMs and real hardware for us, and quite consistently on both types of machines.

@jpoimboe
Contributor

jpoimboe commented Apr 3, 2014

This is a hard panic to debug because it happens in the nf_conntrack destroy path. Can anybody recreate with kdump enabled and provide a kdump? That would probably improve our chances of fixing it.
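
For anyone willing to set that up, a rough sketch of enabling kdump on an Ubuntu/Debian-style host (the package names and crashkernel size are common defaults, not specific to this bug; Fedora/RHEL use kexec-tools and systemctl enable kdump instead):

# Hedged sketch: enable kdump so the next panic leaves a crash dump behind.
sudo apt-get install linux-crashdump              # pulls in kdump-tools on Ubuntu
# add a crash kernel reservation, e.g. crashkernel=384M-:128M, to GRUB_CMDLINE_LINUX_DEFAULT
sudo update-grub && sudo reboot
cat /sys/kernel/kexec_crash_loaded                # should print 1 after the reboot
# dumps are written under /var/crash after the next panic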

@rohansingh

@jpoimboe Unfortunately the machine on which we see this happening pretty often is a virtual machine in EC2 running under PV-GRUB, so a kdump is not possible. I'll work with @eandre on seeing if we can reproduce this on a machine where we can kdump.

@renato-zannon
Contributor

I will look into enabling kdump on my dev machine (which is where I was getting the panic before), so that if I'm lucky (!!!) enough to stumble on the crash again, I'll be able to report back with more info.

@joelmoss

joelmoss commented Apr 8, 2014

OK, so we also get kernel panics using anything later than Docker 0.7.2. I just tested 0.9.1 and still get the panics; 0.7.2 is fine.

Is anyone on the Docker team able to verify this, please?

@jamtur01
Contributor

jamtur01 commented Apr 8, 2014

@joelmoss Can you elaborate please? Does the panic occur on build or container exit or elsewhere? Also what platform and kernel release are you running? Thanks!

@joelmoss

joelmoss commented Apr 8, 2014

The problem is that we have been unable to pin down when exactly it happens, so I couldn't say what action causes the panic.

We are on Ubuntu 13.10 (GNU/Linux 3.11.0-15-generic x86_64)

@unclejack
Contributor

@joelmoss Please update the system and the kernel; it's likely that you're running into a kernel bug. Ubuntu has updated packages for the 3.11.0 kernel, and you can get them by installing system updates.

@jamtur01
Contributor

jamtur01 commented Apr 8, 2014

Thanks @joelmoss - Can you capture any output with kdump?

@rohansingh

@unclejack

Ubuntu has updated packages for the 3.11.0 kernel and you can install them by installing updates.

We've seen this on the 3.13 and 3.14 kernels provided by Ubuntu as well, so if @joelmoss is hitting the same issue, upgrading is unlikely to help.

@joelmoss Even if you don't have a kdump output, do you have the stack trace from the system log, so that we can verify that it's the same nf_conntrack issue?
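
In case it helps, a hedged sketch of where such a trace may still be recoverable after a reboot (paths vary by distro, and a hard panic often never makes it to disk, which is why kdump or netconsole is more reliable):

# Hedged sketch: look for the oops text from before the reboot; paths vary by distro.
grep -i -B5 -A40 'nf_nat_cleanup_conntrack' /var/log/kern.log* /var/log/syslog* 2>/dev/null
journalctl -k -b -1 | grep -i -B5 -A40 'nf_nat_cleanup_conntrack'   # previous boot, if the journal is persistent
# if nothing was flushed to disk before the panic, netconsole or a serial console is the next best option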

@unclejack
Contributor

@rohansingh I haven't said that you're not encountering the issue on 3.13 and 3.14. I've only said that upgrading to the latest 3.11 kernel packages and keeping the system up to date is a good idea. I'm using the latest 3.11 on some systems and I'm not running into this particular problem; that's why I recommended it.

@rohansingh

By the way, here is a text version of a similar stacktrace to complement the screenshot above:

[16314069.877834] BUG: unable to handle kernel paging request at ffffc900029fdb58
[16314069.877857] IP: [<ffffffffa0289200>] nf_nat_cleanup_conntrack+0x40/0x70 [nf_nat]
[16314069.877870] PGD 1b6426067 PUD 1b6427067 PMD 1019e5067 PTE 0
[16314069.877879] Oops: 0002 [#1] SMP 
[16314069.877886] Modules linked in: nf_conntrack_netlink nfnetlink veth xt_addrtype xt_conntrack xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_tcpudp iptable_filter ip_tables x_tables nfsd auth_rpcgss nfs_acl aufs nfs lockd sunrpc bridge fscache 8021q garp stp mrp llc intel_rapl crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack xen_kbdfront xen_fbfront syscopyarea sysfillrect sysimgblt fb_sys_fops raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq [last unloaded: ipmi_devintf]
[16314069.877968] CPU: 2 PID: 97 Comm: kworker/u16:1 Not tainted 3.13.0-18-generic #38-Ubuntu
[16314069.877982] Workqueue: netns cleanup_net
[16314069.877987] task: ffff8801affd17f0 ti: ffff8801affc4000 task.ti: ffff8801affc4000
[16314069.877994] RIP: e030:[<ffffffffa0289200>]  [<ffffffffa0289200>] nf_nat_cleanup_conntrack+0x40/0x70 [nf_nat]
[16314069.878005] RSP: e02b:ffff8801affc5cb8  EFLAGS: 00010246
[16314069.878010] RAX: 0000000000000000 RBX: ffff880004a5ce08 RCX: ffff8800b5a8e988
[16314069.878016] RDX: ffffc900029fdb58 RSI: 0000000037d437d2 RDI: ffffffffa028c4c0
[16314069.878022] RBP: ffff8801affc5cc0 R08: 0000000000000200 R09: 0000000000000000
[16314069.878029] R10: ffffea0005150940 R11: ffffffff812247fd R12: ffff880004a5cd80
[16314069.878035] R13: ffff8800d24e6750 R14: ffff8800d24e6758 R15: ffff8800b5a8e000
[16314069.878046] FS:  00007f8a03029700(0000) GS:ffff8801bec80000(0000) knlGS:0000000000000000
[16314069.878052] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[16314069.878057] CR2: ffffc900029fdb58 CR3: 00000000d702a000 CR4: 0000000000002660
[16314069.878064] Stack:
[16314069.878068]  0000000000000001 ffff8801affc5ce8 ffffffffa00995a4 ffff8800d24e6750
[16314069.878078]  ffff8800b5a8e000 ffffffffa007e2c0 ffff8801affc5d08 ffffffffa00912d5
[16314069.878088]  ffff8800d24e6750 ffff8800b5a8e000 ffff8801affc5d28 ffffffffa00927b4
[16314069.878097] Call Trace:
[16314069.878112]  [<ffffffffa00995a4>] __nf_ct_ext_destroy+0x44/0x60 [nf_conntrack]
[16314069.878125]  [<ffffffffa00912d5>] nf_conntrack_free+0x25/0x60 [nf_conntrack]
[16314069.878136]  [<ffffffffa00927b4>] destroy_conntrack+0xb4/0x110 [nf_conntrack]
[16314069.878149]  [<ffffffffa0096260>] ? nf_conntrack_helper_fini+0x30/0x30 [nf_conntrack]
[16314069.878159]  [<ffffffff81645767>] nf_conntrack_destroy+0x17/0x20
[16314069.878170]  [<ffffffffa009223b>] nf_ct_iterate_cleanup+0x12b/0x150 [nf_conntrack]
[16314069.878183]  [<ffffffffa009653d>] nf_ct_l3proto_pernet_unregister+0x1d/0x20 [nf_conntrack]
[16314069.878194]  [<ffffffffa007c309>] ipv4_net_exit+0x19/0x50 [nf_conntrack_ipv4]
[16314069.878202]  [<ffffffff8160e549>] ops_exit_list.isra.1+0x39/0x60
[16314069.878210]  [<ffffffff8160edd0>] cleanup_net+0x110/0x250
[16314069.878221]  [<ffffffff810824a2>] process_one_work+0x182/0x450
[16314069.878228]  [<ffffffff81083241>] worker_thread+0x121/0x410
[16314069.878235]  [<ffffffff81083120>] ? rescuer_thread+0x3e0/0x3e0
[16314069.878243]  [<ffffffff81089ed2>] kthread+0xd2/0xf0
[16314069.878249]  [<ffffffff81089e00>] ? kthread_create_on_node+0x190/0x190
[16314069.878258]  [<ffffffff817219bc>] ret_from_fork+0x7c/0xb0
[16314069.878264]  [<ffffffff81089e00>] ? kthread_create_on_node+0x190/0x190
[16314069.878269] Code: 53 0f b6 58 11 84 db 74 45 48 01 c3 74 40 48 83 7b 10 00 74 39 48 c7 c7 c0 c4 28 a0 e8 3a fe 48 e1 48 8b 03 48 8b 53 08 48 85 c0 <48> 89 02 74 04 48 89 50 08 48 b8 00 02 20 00 00 00 ad de 48 c7 
[16314069.878332] RIP  [<ffffffffa0289200>] nf_nat_cleanup_conntrack+0x40/0x70 [nf_nat]
[16314069.878341]  RSP <ffff8801affc5cb8>
[16314069.878345] CR2: ffffc900029fdb58
[16314069.878353] ---[ end trace 98cfb73f60c69903 ]---

@unclejack
Contributor

@rohansingh Could you provide more details about the host where you can reproduce this?
Knowing the particular network setup (VPN, Open vSwitch, bridges, any network hardware offloading engines, etc.) and some approximate steps you've taken to reproduce it would help.

I haven't been able to reproduce this so far, so I think getting the system into the right state to make it panic during a build depends on some sequence of events which isn't very common.

If you have a sequence of steps you follow to get it to crash, could you let us know how to reproduce this, please?

@rohansingh

@unclejack In terms of hardware and network setup, this is a paravirtual machine on EC2.

The general procedure we have for reproducing this is to kick off a build process that starts 16 parallel containers to run various integration tests. The issue occurs intermittently, around two minutes after the containers are stopped.

Unfortunately the situation isn't great in terms of reproducibility, in that it's tied up with a bunch of internal code and build tools. Right now I'm trying to simplify that down to a simple script for reproducing the issue, which I hope to finish and be able to provide in the next couple of days.

@konobi

konobi commented Apr 23, 2014

I've been seeing this error too, outside of Docker, with plain old LXC.

So far it seems to be a combination of SMP, LXC, and using NAT over a bridge(?).

I think I have an idea of what's going on, but due to local hardware issues, I'm unable to get a kernel dump. Does someone have a recent one around that I can take a look at?

@unclejack
Contributor

@rohansingh Did you make any progress on building something that can be used to reproduce this problem?

@rohansingh

@unclejack Negative. Currently unable to reproduce outside of a specific set of EC2 instances, and not for lack of effort.

@rohansingh

I'm now able to consistently reproduce this issue on physical hardware and produce a kernel crash dump by running part of a build process for a non-public project. The next step is to isolate exactly what we're doing in that project that causes this, and to produce a shareable crash dump that doesn't contain proprietary data.

@yosifkit
Contributor

This happened right when I did a docker kill on a container that was created during docker build (apt-get install specifically).

$ docker version
Client version: 0.11.1
Client API version: 1.11
Go version (client): go1.2
Git commit (client): fb99f99
Server version: 0.11.1
Server API version: 1.11
Git commit (server): fb99f99
Go version (server): go1.2
Last stable version: 0.11.1
$ docker info
Containers: 3
Images: 29
Storage Driver: devicemapper
 Pool Name: docker-8:19-19268241-pool
 Data file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata file: /var/lib/docker/devicemapper/devicemapper/metadata
 Data Space Used: 3165.4 Mb
 Data Space Total: 102400.0 Mb
 Metadata Space Used: 3.0 Mb
 Metadata Space Total: 2048.0 Mb
Execution Driver: native-0.2
Kernel Version: 3.12.13-gentoo
$ uname -a
Linux minas-morgul 3.12.13-gentoo #2 SMP Mon May 12 10:07:16 MDT 2014 x86_64 AMD Phenom(tm) II X6 1090T Processor AuthenticAMD GNU/Linux

@gdm85
Contributor

gdm85 commented May 26, 2014

@rohansingh any progress on your efforts of isolating root cause?

@gdm85
Contributor

gdm85 commented Jun 3, 2014

I am still getting this crash:

[screenshot of the kernel crash trace]

The host is a Xen VM as far as I know, and this did NOT happen during a build...

Any ideas on how to fix this? It's happening with Ubuntu 14; I would like to know which patches we need to push upstream for a fix.

Update: I think this might be the upstream kernel bug: https://bugzilla.kernel.org/show_bug.cgi?id=65191

Other trackers:

There is no fix yet apparently :(

@wwadge

wwadge commented Jun 3, 2014

I had this too, solved by going to 3.10.34.

@rohansingh

@gdm85 Some progress, but nothing quite useful yet. Note that I'm no longer working on this issue personally, but have a teammate who is. Here are our findings so far:

  • As you and @yosifkit have discovered, this doesn't actually occur during builds. Rather, it occurs sometime after containers are stopped or killed and conntrack cleanup is occurring.
  • Newer kernels (contrary to reports by others) don't seem to solve this. We've been consistently reproducing with 3.13.0.
  • We have now been able to reproduce this a few times using docker-stress rather than any internal build processes. This puts us a lot closer to having crash dumps and other detailed information that we can share with the community.

Apologies for not having anything more concrete, but we're still working on it.

@renato-zannon
Contributor

Newer kernels (contrary to reports by others) don't seem to solve this. We've been consistently reproducing with 3.13.0.

Have you tried with 3.14.x? I used to have this almost once a day, and now it hasn't happened to me in months (with no change in workflow). Of course that doesn't mean the bug is fixed, but it might at least have become less likely to trigger on later kernels.

EDIT: @rohansingh how long does it usually take for you to hit a failure with docker-stress? I could try it out on my machine to see if I can reproduce in a reasonable amount of time.

@gdm85
Contributor

gdm85 commented Jun 3, 2014

@rohansingh thanks for your feedback; this is indeed a blocker for any production usage.

The only workaround I can think of is to somehow serialize the killing of containers, to reduce the overlap of multiple conntrack cleanups... but this would just be a hack, and not even guaranteed to completely address the issue.
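
In case anyone wants to experiment with that idea, a rough sketch of what serialized teardown could look like (a mitigation attempt only, not a verified fix; the pause length is an arbitrary guess):

# Hedged sketch: stop running containers one at a time instead of all at once,
# spacing out the per-container netns/conntrack cleanup.
for id in $(docker ps -q); do
    docker stop "$id"
    sleep 5   # arbitrary pause between cleanups
done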

@konobi

konobi commented Jun 3, 2014

Though I'm not a Docker user, we were also seeing the same thing with libvirt + LXC.

Workaround:
We had a bridge interface (virbr0) that we weren't even using. Once we removed the extraneous bridge, we haven't seen this issue again. It seems that even having that bridge around for NAT purposes causes everything to get connection tracked, regardless of whether there's actually any NAT going on.
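
A hedged sketch of checking for and removing that kind of unused default bridge, assuming it comes from libvirt's default network (only do this if nothing depends on it):

# Hedged sketch: remove an unused libvirt default bridge (virbr0).
brctl show                                   # list bridges; or: ip link show type bridge
sudo virsh net-destroy default               # tears down virbr0 and its NAT/iptables rules
sudo virsh net-autostart default --disable   # keep it from returning on the next boot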

@gdm85
Contributor

gdm85 commented Jun 6, 2014

There is now a (tentative) patch upstream.

If somebody is already compiling their own kernel, maybe they can give it a spin?
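
For anyone already building their own kernel, a rough sketch of applying the patch and producing installable packages on a Debian/Ubuntu-style system (the directory and patch filename are placeholders):

# Hedged sketch: apply the proposed patch to an existing kernel tree and build .debs.
cd linux-3.14.x                              # placeholder: your kernel source tree
patch -p1 < nf-nat-conntrack-fix.patch       # placeholder name for the upstream patch
make olddefconfig
make -j"$(nproc)" deb-pkg                    # builds linux-image/linux-headers .deb packages
sudo dpkg -i ../linux-image-*.deb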

@rsampaio
Contributor

I can confirm that the patch posted on the upstream bug prevents the crash with a pure-LXC test case.

@gdm85
Contributor

gdm85 commented Jun 10, 2014

@rsampaio nice to hear that! I patched the kernel for Ubuntu 14.04 LTS and I am going to publish the Dockerfiles and .debs shortly.

@gdm85
Contributor

gdm85 commented Jun 10, 2014

For people interested in testing the first and second of the two patches available upstream: patched Ubuntu .deb packages are in release v0.1.0 and release v0.2.0.

You can build the same packages I did by using this script to debootstrap Trusty and then my Dockerfile for a kernel-builder image.

UPDATE: I have now built both patched kernels and am testing the second one under intense container start/kill cycles.

@unclejack
Contributor

As @f0 commented on #6439:

this seems like the fix for the problem https://bugzilla.kernel.org/show_bug.cgi?id=65191

@gdm85
Contributor

gdm85 commented Jun 24, 2014

I've been running the patched kernel for 12 days now, and I can confirm the issue is gone.

Now if upstream would merge that patch, this bug could be closed and the pressure would be on individual distro maintainers instead.

@tianon
Member

tianon commented Jun 27, 2014

As mentioned on the Red Hat tracker (https://bugzilla.redhat.com/show_bug.cgi?id=1015989#c30), the fix for this is finally in the upstream kernel source! (https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=945b2b2d259d1a4364a2799e80e8ff32f8c6ee6f) 💃
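
For anyone who wants to check when that commit reaches their kernel, a hedged sketch using a kernel git checkout (the clone is large; any up-to-date stable tree works):

# Hedged sketch: list the kernel tags that already contain the fix commit.
git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
cd linux-stable
git tag --contains 945b2b2d259d1a4364a2799e80e8ff32f8c6ee6f        # all tags that include the fix
git describe --contains 945b2b2d259d1a4364a2799e80e8ff32f8c6ee6f   # name the commit relative to the nearest later tag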

@gregkh

gregkh commented Jul 7, 2014

It will be in the next round of stable kernel releases, so you can mark this one closed.

@unclejack
Contributor

@gregkh Thanks!
