
Kernel panic during build #2960

Closed
Tranquility opened this issue Nov 29, 2013 · 53 comments

@Tranquility
Contributor

I have had at least two kernel panics with stable Docker versions (0.6.7, 0.7). I am running kernel 3.12.0 with genpatches and AUFS patches.

[photo of the kernel panic screen: img_20131128_104318]

@tianon
Member

tianon commented Nov 30, 2013

I'll add again here that I've seen this same panic on three different machines, two AUFS and one device-mapper. Was running 3.11 on one of the AUFS machines, and 3.10.17 on the other two (but curiously enough didn't see it on 3.10.7 - perhaps that was just good luck, because it doesn't seem to be necessarily consistent). I'm configuring one of my machines for kdump right now, so hopefully I'll get a kdump next time.

@alexlarsson
Contributor

I see this regularly too on container exit:
https://bugzilla.redhat.com/show_bug.cgi?id=1015989

@crosbymichael
Contributor

@alexlarsson Do you have any insights on this issue?

@alexlarsson
Contributor

@crosbymichael There is a potential fix linked in the Red Hat bug above; I have not had time to try it yet, though.

@alexlarsson
Contributor

I tried to reproduce this, but it's kind of hard. I wrote a script that launches lots of containers, but on a freshly booted machine it totally fails to trigger this. However, on my development workstation, which had been running for a few days, it triggered almost instantly when I ran the script. So it seems to be a combination of a namespace exit and something else.
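
For reference, a minimal sketch of the kind of container-churn loop described above (the image name, iteration count, and use of --rm are my own assumptions, not the actual script):

#!/bin/sh
# Hedged sketch: exercise the container-exit / network-namespace cleanup path
# by starting and stopping many short-lived containers. Assumes a running
# docker daemon and a small image such as busybox.
i=0
while [ "$i" -lt 1000 ]; do
    docker run --rm busybox true || exit 1
    i=$((i + 1))
done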

@mschulkind

I'm having a similar issue, although I'm not totally sure if it's the same crash, since I don't know how to capture the oops. Nothing shows up in the logs after reboot, and kdump looks pretty scary to try to set up.

I'm on Gentoo, using device-mapper, and running the 3.12.7-gentoo kernel, which already has the fix from the linked Red Hat bug. I regularly get the crash when running a 'docker build' command.

@alexlarsson
Contributor

https://bugzilla.redhat.com/show_bug.cgi?id=1015989#c18 mentions a possible fix that is in 3.13-rc1, so it would be interesting to know if anyone sees this on 3.13.

@mschulkind

Unfortunately that doesn't seem to have helped much. I'm on 3.13.1-gentoo, which definitely has that fix, and just experienced a crash.

@pnasrat
Contributor

pnasrat commented Feb 28, 2014

@mschulkind do you have a capture of the panic you got?

@mschulkind

I don't. I'm not totally sure how to capture it when the panic happens while X is running. I can try to reproduce without X running, though.

@renato-zannon
Contributor

I'm hitting this on 3.13.7, on Arch Linux, with btrfs + the native driver. I've hit it three times today alone, a true flow killer :(

It's a bit unpredictable though. I haven't been able to correlate with high load, or low memory, or with any particular build step in my app.

Are there any particular bits of information from the trace that I should be looking to gather the next time this happens? Should I look into how to enable kdump? Is this the correct place to ask for help about this? :)

@alexlarsson
Contributor

I'm not really a kernel guy, so I don't know how to debug this further. I do know one thing though: if we could figure out a way to reproduce it more easily, I could get the right people to look at it.

However, I've had a hard time reproducing this. It seems to be triggered on container exit, so I created some scripts that just spawned lots of containers. However, the script could run for thousands of containers on my newly booted laptop without any sign of the crash. Then I ran it on my desktop, which had been running for at least a week doing a lot of random stuff. It crashed in less than 100 container exits. Then I rebooted and ran it again, but was unable to get any crash...

So, it is triggered by container exits in combination with something else, and we need to figure out what. It seems so unpredictable though... The only thing I've seen is that it seems like it's more likely to happen if I'm running something playing audio in the background.

@alexlarsson
Contributor

I asked in the Docker meeting today who had seen this; a bunch of people had never seen it, and some had. One thing that seemed to be consistent with not seeing the panic was running the kernel in a VM. So maybe this only triggers on bare metal.

@renato-zannon
Contributor

For those on Arch who are hitting this and want a workaround: I'm now using the linux-lts kernel from the official repo (3.10.34-1-lts) and haven't bumped into this issue yet.

Interesting point about the virtualization... It might have to do with some code path that is not taken via the virtualization drivers, or it might be that the virtualization overhead doesn't allow the same level of concurrency.

This weekend I will try to come up with some way to reproduce this consistently.
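
For anyone wanting to try the same workaround, a rough sketch of switching to the LTS kernel on Arch (assumes GRUB; adjust for your bootloader):

# Hedged sketch: install the Arch linux-lts kernel and regenerate the GRUB config.
sudo pacman -S linux-lts linux-lts-headers
sudo grub-mkconfig -o /boot/grub/grub.cfg
# reboot and choose the LTS entry from the boot menu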

@thaJeztah
Member

Maybe virtualisation itself is not the common factor, but people may be shutting down their virtual machines more often? VMs can be quite hungry on resources, and if I don't need them, I'll shut them down. As @alexlarsson mentioned, the problem occurred quite quickly on a computer that had been running for a longer period.

@rohansingh

I saw some commits in 3.14 that I was hoping would resolve this. Unfortunately, after upgrading a machine to 3.14 to test that hypothesis, it seems like that's not the case. Still seeing the same race condition.

@renato-zannon
Contributor

@rohansingh Have you been able to reproduce this consistently? I'm not seeing this behavior anymore (with the same 3.13.7 from before), even while actively trying to trigger it.

@rohansingh

@Riccieri Fairly consistently. To clarify, I haven't been actively trying to repro. Instead I'm just monitoring a machine that other users are using to do builds, largely during business hours. Previously it was running 3.13.0, and is now at 3.14.0.

I see the machine reboot due to this issue every few hours, and so far three times today since upgrading to 3.14.

@eandre

eandre commented Apr 3, 2014

To build on @rohansingh's comment, this happens on both VMs and real hardware for us, and quite consistently on both types of machines.

@jpoimboe
Contributor

jpoimboe commented Apr 3, 2014

This is a hard panic to debug because it happens in the nf_conntrack destroy path. Can anybody recreate with kdump enabled and provide a kdump? That would probably improve our chances of fixing it.
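
For anyone willing to set that up, a rough sketch of enabling kdump on an Ubuntu/Debian-style host (the package names and crashkernel size are common defaults, not specific to this bug; Fedora/RHEL use kexec-tools and systemctl enable kdump instead):

# Hedged sketch: enable kdump so the next panic leaves a crash dump behind.
sudo apt-get install linux-crashdump              # pulls in kdump-tools on Ubuntu
# add a crash kernel reservation, e.g. crashkernel=384M-:128M, to GRUB_CMDLINE_LINUX_DEFAULT
sudo update-grub && sudo reboot
cat /sys/kernel/kexec_crash_loaded                # should print 1 after the reboot
# dumps are written under /var/crash after the next panic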

@rohansingh

@jpoimboe Unfortunately the machine on which we see this happening pretty often is a virtual machine in EC2 running under PV-GRUB, so a kdump is not possible. I'll work with @eandre on seeing if we can reproduce this on a machine where we can kdump.

@renato-zannon
Contributor

I will look into enabling kdump on my dev machine (which is where I was getting the panic before), so that if I'm lucky (!!!) enough to stumble on the crash again, I'll be able to report back with more info.

@joelmoss

joelmoss commented Apr 8, 2014

OK, so we also get kernel panics using anything later than Docker 0.7.2. I just tested 0.9.1 and still get the panics; 0.7.2 is fine.

Is anyone on the Docker team able to verify this, please?

@jamtur01
Contributor

jamtur01 commented Apr 8, 2014

@joelmoss Can you elaborate please? Does the panic occur on build or container exit or elsewhere? Also what platform and kernel release are you running? Thanks!

@joelmoss

joelmoss commented Apr 8, 2014

The problem is that we have been unable to pin down when exactly it happens, so I couldn't say what action causes the panic.

We are on Ubuntu 13.10 (GNU/Linux 3.11.0-15-generic x86_64)

@unclejack
Contributor

@joelmoss Please update the system and the kernel; it's likely that you're running into a kernel bug. Ubuntu has updated packages for the 3.11.0 kernel, and you can get them by installing system updates.

@jamtur01
Contributor

jamtur01 commented Apr 8, 2014

Thanks @joelmoss - Can you capture any output with kdump?

@rohansingh

@unclejack

Ubuntu has updated packages for the 3.11.0 kernel and you can install them by installing updates.

We've seen this on the 3.13 and 3.14 kernels provided by Ubuntu as well, so if @joelmoss is hitting the same issue, upgrading is unlikely to help.

@joelmoss Even if you don't have a kdump output, do you have the stack trace from the system log, so that we can verify that it's the same nf_conntrack issue?
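
In case it helps, a hedged sketch of where such a trace may still be recoverable after a reboot (paths vary by distro, and a hard panic often never makes it to disk, which is why kdump or netconsole is more reliable):

# Hedged sketch: look for the oops text from before the reboot; paths vary by distro.
grep -i -B5 -A40 'nf_nat_cleanup_conntrack' /var/log/kern.log* /var/log/syslog* 2>/dev/null
journalctl -k -b -1 | grep -i -B5 -A40 'nf_nat_cleanup_conntrack'   # previous boot, if the journal is persistent
# if nothing was flushed to disk before the panic, netconsole or a serial console is the next best option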

@unclejack
Contributor

@rohansingh I haven't said that you're not encountering the issue on 3.13 and 3.14. I've only said that upgrading to the latest 3.11 kernel packages and keeping the system up to date is a good idea. I'm using the latest 3.11 on some systems and I'm not running into this particular problem; that's why I recommended it.

@rohansingh

By the way, here is a text version of a similar stacktrace to complement the screenshot above:

[16314069.877834] BUG: unable to handle kernel paging request at ffffc900029fdb58
[16314069.877857] IP: [<ffffffffa0289200>] nf_nat_cleanup_conntrack+0x40/0x70 [nf_nat]
[16314069.877870] PGD 1b6426067 PUD 1b6427067 PMD 1019e5067 PTE 0
[16314069.877879] Oops: 0002 [#1] SMP 
[16314069.877886] Modules linked in: nf_conntrack_netlink nfnetlink veth xt_addrtype xt_conntrack xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_tcpudp iptable_filter ip_tables x_tables nfsd auth_rpcgss nfs_acl aufs nfs lockd sunrpc bridge fscache 8021q garp stp mrp llc intel_rapl crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack xen_kbdfront xen_fbfront syscopyarea sysfillrect sysimgblt fb_sys_fops raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq [last unloaded: ipmi_devintf]
[16314069.877968] CPU: 2 PID: 97 Comm: kworker/u16:1 Not tainted 3.13.0-18-generic #38-Ubuntu
[16314069.877982] Workqueue: netns cleanup_net
[16314069.877987] task: ffff8801affd17f0 ti: ffff8801affc4000 task.ti: ffff8801affc4000
[16314069.877994] RIP: e030:[<ffffffffa0289200>]  [<ffffffffa0289200>] nf_nat_cleanup_conntrack+0x40/0x70 [nf_nat]
[16314069.878005] RSP: e02b:ffff8801affc5cb8  EFLAGS: 00010246
[16314069.878010] RAX: 0000000000000000 RBX: ffff880004a5ce08 RCX: ffff8800b5a8e988
[16314069.878016] RDX: ffffc900029fdb58 RSI: 0000000037d437d2 RDI: ffffffffa028c4c0
[16314069.878022] RBP: ffff8801affc5cc0 R08: 0000000000000200 R09: 0000000000000000
[16314069.878029] R10: ffffea0005150940 R11: ffffffff812247fd R12: ffff880004a5cd80
[16314069.878035] R13: ffff8800d24e6750 R14: ffff8800d24e6758 R15: ffff8800b5a8e000
[16314069.878046] FS:  00007f8a03029700(0000) GS:ffff8801bec80000(0000) knlGS:0000000000000000
[16314069.878052] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[16314069.878057] CR2: ffffc900029fdb58 CR3: 00000000d702a000 CR4: 0000000000002660
[16314069.878064] Stack:
[16314069.878068]  0000000000000001 ffff8801affc5ce8 ffffffffa00995a4 ffff8800d24e6750
[16314069.878078]  ffff8800b5a8e000 ffffffffa007e2c0 ffff8801affc5d08 ffffffffa00912d5
[16314069.878088]  ffff8800d24e6750 ffff8800b5a8e000 ffff8801affc5d28 ffffffffa00927b4
[16314069.878097] Call Trace:
[16314069.878112]  [<ffffffffa00995a4>] __nf_ct_ext_destroy+0x44/0x60 [nf_conntrack]
[16314069.878125]  [<ffffffffa00912d5>] nf_conntrack_free+0x25/0x60 [nf_conntrack]
[16314069.878136]  [<ffffffffa00927b4>] destroy_conntrack+0xb4/0x110 [nf_conntrack]
[16314069.878149]  [<ffffffffa0096260>] ? nf_conntrack_helper_fini+0x30/0x30 [nf_conntrack]
[16314069.878159]  [<ffffffff81645767>] nf_conntrack_destroy+0x17/0x20
[16314069.878170]  [<ffffffffa009223b>] nf_ct_iterate_cleanup+0x12b/0x150 [nf_conntrack]
[16314069.878183]  [<ffffffffa009653d>] nf_ct_l3proto_pernet_unregister+0x1d/0x20 [nf_conntrack]
[16314069.878194]  [<ffffffffa007c309>] ipv4_net_exit+0x19/0x50 [nf_conntrack_ipv4]
[16314069.878202]  [<ffffffff8160e549>] ops_exit_list.isra.1+0x39/0x60
[16314069.878210]  [<ffffffff8160edd0>] cleanup_net+0x110/0x250
[16314069.878221]  [<ffffffff810824a2>] process_one_work+0x182/0x450
[16314069.878228]  [<ffffffff81083241>] worker_thread+0x121/0x410
[16314069.878235]  [<ffffffff81083120>] ? rescuer_thread+0x3e0/0x3e0
[16314069.878243]  [<ffffffff81089ed2>] kthread+0xd2/0xf0
[16314069.878249]  [<ffffffff81089e00>] ? kthread_create_on_node+0x190/0x190
[16314069.878258]  [<ffffffff817219bc>] ret_from_fork+0x7c/0xb0
[16314069.878264]  [<ffffffff81089e00>] ? kthread_create_on_node+0x190/0x190
[16314069.878269] Code: 53 0f b6 58 11 84 db 74 45 48 01 c3 74 40 48 83 7b 10 00 74 39 48 c7 c7 c0 c4 28 a0 e8 3a fe 48 e1 48 8b 03 48 8b 53 08 48 85 c0 <48> 89 02 74 04 48 89 50 08 48 b8 00 02 20 00 00 00 ad de 48 c7 
[16314069.878332] RIP  [<ffffffffa0289200>] nf_nat_cleanup_conntrack+0x40/0x70 [nf_nat]
[16314069.878341]  RSP <ffff8801affc5cb8>
[16314069.878345] CR2: ffffc900029fdb58
[16314069.878353] ---[ end trace 98cfb73f60c69903 ]---

@unclejack
Contributor

@rohansingh Could you provide more details about the host where you can reproduce this?
Knowing the particular network setup (VPN, Open vSwitch, bridges, any network hardware offloading engines, etc.) and some approximate steps you've taken to reproduce it would help.

I haven't been able to reproduce this so far, so I think getting the system into the right state to make it panic during a build depends on some sequence of events which isn't very common.

If you have a sequence of steps you follow to get it to crash, could you let us know how to reproduce this, please?

@rohansingh

@unclejack In terms of hardware and network setup, this is a paravirtual machine on EC2.

The general procedure we have for reproducing this is to kick off a build process that starts 16 parallel containers to run various integration tests. The issue occurs intermittently, around two minutes after the containers are stopped.

Unfortunately the situation isn't great in terms of reproducibility, in that it's tied up with a bunch of internal code and build tools. Right now I'm trying to simplify that down to a simple script for reproducing the issue, which I hope to finish and be able to provide in the next couple of days.

@konobi

konobi commented Apr 23, 2014

I've been seeing this error too, outside of Docker, with plain old LXC.

So far it seems to be a combination of SMP, LXC, and using NAT over a bridge(?).

I think I have an idea of what's going on, but due to local hardware issues, I'm unable to get a kernel dump. Does someone have a recent one around that I can take a look at?

@unclejack
Contributor

@rohansingh Did you make any progress on building something that can be used to reproduce this problem?

@rohansingh

@unclejack Negative. Currently unable to reproduce outside of a specific set of EC2 instances, and not for lack of effort.

@rohansingh

I'm now able to consistently reproduce this issue on physical hardware and produce a kernel crash dump by running part of a build process for a non-public project. The next step is to isolate exactly what we're doing in that project that causes this, and to produce a shareable crash dump that doesn't contain proprietary data.

@yosifkit
Contributor

This happened right when I did a docker kill on a container that was created during docker build (apt-get install specifically).

$ docker version
Client version: 0.11.1
Client API version: 1.11
Go version (client): go1.2
Git commit (client): fb99f99
Server version: 0.11.1
Server API version: 1.11
Git commit (server): fb99f99
Go version (server): go1.2
Last stable version: 0.11.1
$ docker info
Containers: 3
Images: 29
Storage Driver: devicemapper
 Pool Name: docker-8:19-19268241-pool
 Data file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata file: /var/lib/docker/devicemapper/devicemapper/metadata
 Data Space Used: 3165.4 Mb
 Data Space Total: 102400.0 Mb
 Metadata Space Used: 3.0 Mb
 Metadata Space Total: 2048.0 Mb
Execution Driver: native-0.2
Kernel Version: 3.12.13-gentoo
$ uname -a
Linux minas-morgul 3.12.13-gentoo #2 SMP Mon May 12 10:07:16 MDT 2014 x86_64 AMD Phenom(tm) II X6 1090T Processor AuthenticAMD GNU/Linux

@gdm85
Contributor

gdm85 commented May 26, 2014

@rohansingh any progress on your efforts of isolating root cause?

@gdm85
Contributor

gdm85 commented Jun 3, 2014

I am still getting this crash:

[screenshot of the kernel crash trace]

The host is a Xen VM as far as I know, and this did NOT happen during a build...

Any ideas on how to fix this? It's happening with Ubuntu 14; I would like to know which patches we need to push upstream for a fix.

Update: I think this might be the upstream kernel bug: https://bugzilla.kernel.org/show_bug.cgi?id=65191

Other trackers:

There is no fix yet apparently :(

@wwadge

wwadge commented Jun 3, 2014

I had this too, solved by going to 3.10.34.

@rohansingh

@gdm85 Some progress, but nothing quite useful yet. Note that I'm no longer working on this issue personally, but have a teammate who is. Here are our findings so far:

  • As you and @yosifkit have discovered, this doesn't actually occur during builds. Rather, it occurs sometime after containers are stopped or killed and conntrack cleanup is occurring.
  • Newer kernels (contrary to reports by others) don't seem to solve this. We've been consistently reproducing with 3.13.0.
  • We have now been able to reproduce this a few times using docker-stress rather than any internal build processes. This puts us a lot closer to having crash dumps and other detailed information that we can share with the community.

Apologies for not having anything more concrete, but we're still working on it.

@renato-zannon
Contributor

Newer kernels (contrary to reports by others) don't seem to solve this. We've been consistently reproducing with 3.13.0.

Have you tried with 3.14.x? I used to have this almost once a day, and now it hasn't happened to me in months (with no change in workflow). Of course that doesn't mean the bug is fixed, but it might at least have become less likely to trigger on later kernels.

EDIT: @rohansingh how long does it usually take for you to hit a failure with docker-stress? I could try it out on my machine to see if I can reproduce in a reasonable amount of time.

@gdm85
Contributor

gdm85 commented Jun 3, 2014

@rohansingh thanks for your feedback; this is indeed a blocker for any production usage.

The only workaround I can think of is to somehow serialize the killing of containers, to reduce the overlap of multiple conntrack cleanups... but this would just be a hack, and not even guaranteed to completely address the issue.
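
In case anyone wants to experiment with that idea, a rough sketch of what serialized teardown could look like (a mitigation attempt only, not a verified fix; the pause length is an arbitrary guess):

# Hedged sketch: stop running containers one at a time instead of all at once,
# spacing out the per-container netns/conntrack cleanup.
for id in $(docker ps -q); do
    docker stop "$id"
    sleep 5   # arbitrary pause between cleanups
done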

@konobi

konobi commented Jun 3, 2014

Though I'm not a Docker user, we were also seeing the same thing with libvirt + LXC.

Workaround:
We had a bridge interface (virbr0) that we weren't even using. Once we removed the extraneous bridge, we haven't seen this issue again. It seems that even having that bridge around for NAT purposes causes everything to get connection tracked, regardless of whether there's actually any NAT going on.
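
A hedged sketch of checking for and removing that kind of unused default bridge, assuming it comes from libvirt's default network (only do this if nothing depends on it):

# Hedged sketch: remove an unused libvirt default bridge (virbr0).
brctl show                                   # list bridges; or: ip link show type bridge
sudo virsh net-destroy default               # tears down virbr0 and its NAT/iptables rules
sudo virsh net-autostart default --disable   # keep it from returning on the next boot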

@gdm85
Contributor

gdm85 commented Jun 6, 2014

There is now a (tentative) patch upstream.

If somebody is already compiling their own kernel, maybe they can give it a spin?
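
For anyone already building their own kernel, a rough sketch of applying the patch and producing installable packages on a Debian/Ubuntu-style system (the directory and patch filename are placeholders):

# Hedged sketch: apply the proposed patch to an existing kernel tree and build .debs.
cd linux-3.14.x                              # placeholder: your kernel source tree
patch -p1 < nf-nat-conntrack-fix.patch       # placeholder name for the upstream patch
make olddefconfig
make -j"$(nproc)" deb-pkg                    # builds linux-image/linux-headers .deb packages
sudo dpkg -i ../linux-image-*.deb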

@rsampaio
Contributor

I can confirm that the patch posted on the upstream bug prevents the crash with a pure-LXC test case.

@gdm85
Contributor

gdm85 commented Jun 10, 2014

@rsampaio nice to hear that! I patched the kernel for Ubuntu 14.04 LTS and I am going to publish the Dockerfiles and .debs shortly.

@gdm85
Contributor

gdm85 commented Jun 10, 2014

For people interested in testing the first and second of the two patches available upstream: patched Ubuntu .deb packages are in release v0.1.0 and release v0.2.0.

You can build the same packages I did by using this script to debootstrap Trusty and then my Dockerfile for a kernel-builder image.

UPDATE: I have now built both patched kernels and am testing the second one under intense container start/kill cycles.

@unclejack
Contributor

As @f0 commented on #6439:

this seems like the fix for the problem https://bugzilla.kernel.org/show_bug.cgi?id=65191

@gdm85
Contributor

gdm85 commented Jun 24, 2014

I've been running the patched kernel for 12 days now, and I can confirm the issue is gone.

Now if upstream would merge that patch, this bug could be closed and the pressure would be on individual distro maintainers instead.

@tianon
Member

tianon commented Jun 27, 2014

As mentioned on the Red Hat tracker (https://bugzilla.redhat.com/show_bug.cgi?id=1015989#c30), the fix for this is finally in the upstream kernel source! (https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=945b2b2d259d1a4364a2799e80e8ff32f8c6ee6f) 💃
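
For anyone who wants to check when that commit reaches their kernel, a hedged sketch using a kernel git checkout (the clone is large; any up-to-date stable tree works):

# Hedged sketch: list the kernel tags that already contain the fix commit.
git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
cd linux-stable
git tag --contains 945b2b2d259d1a4364a2799e80e8ff32f8c6ee6f        # all tags that include the fix
git describe --contains 945b2b2d259d1a4364a2799e80e8ff32f8c6ee6f   # name the commit relative to the nearest later tag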

@gregkh

gregkh commented Jul 7, 2014

It will be in the next round of stable kernel releases, so you can mark this one closed.

@unclejack
Contributor

@gregkh Thanks!
