[Suspend2-devel] oops during compile after resume
Theodore Tso
tytso at mit.edu
Sun Jul 23 14:05:33 UTC 2006
On Sat, Jul 22, 2006 at 10:59:14PM +0200, Johannes Berg wrote:
> Don't count on that too much. I had a weird problem recently:
> EXT3-fs error (device sda1): ext3_add_entry: bad entry in directory #283460: rec_len % 4 != 0 - offset=0, inode=1969317987, rec_len=25646, name_len=101
> Aborting journal on device sda1.
> ext3_abort called.
> EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
> Remounting filesystem read-only
> EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
> EXT3-fs error (device sda1) in start_transaction: Journal has aborted
>
> e2fsck did *NOT* repair this, I had it 4 hours later exactly the same.
> The second time around, e2fsck did detect and repair it though. I know,
> very strange, but I actually have the whole session saved in a log file
> so I know :)
What this indicates is that you have a problem with your memory or
your I/O controller, or there is some kernel or proprietary module bug
(such as a wild pointer bug in a shoddily-written proprietary module)
which is corrupting your memory.
The kernel is reporting the EXT3-fs error when it notes that a
directory entry has corrupted information in it. At that point it
puts the filesystem into read/only mode to prevent any further damage
to the filesystem. But note there are multiple ways that the
corrupted directory information could have been delivered to
ext3_add_entry():
1) The directory could have been corrupted on disk --- if it
was, then e2fsck would have found it, particularly
this kind of error which e2fsck *definitely* checks for.
2) The directory could have been corrupted on the I/O path
between the disk and memory. By making the filesystem
be read/only this prevents the corrupted data from
being written back to disk, but on the next reboot,
the in-memory cached copy of the data is gone, and
hopefully the next time the data is read (by e2fsck)
it will not be corrupted.
3) The directory could have been corrupted in memory (in the
buffer cache). By making the filesystem
be read/only this prevents the corrupted data from
being written back to disk, but on the next reboot,
the in-memory cached copy of the data is gone, and
hopefully the next time the data is read (by e2fsck)
the buggy proprietary kernel module has not yet been
loaded, or the kernel will not have executed the buggy
path, or the buggy path has corrupted some other piece
of memory, and so e2fsck runs clean.
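To make the kernel message concrete, here is a rough sketch of the kind of
sanity check that fires in ext3_add_entry(). It is a simplified Python
model, not the kernel's code. It assumes the usual on-disk layout of a
directory entry header (32-bit inode, 16-bit rec_len, 8-bit name_len,
8-bit file type, little-endian). The names check_dir_entry and
min_rec_len, the 4096-byte block size, the file type byte of 0, and the
inode count are all made up for illustration; only inode, rec_len, and
name_len come from the log above.

```python
import struct

# Assumed on-disk header of a directory entry (little-endian):
# inode (u32), rec_len (u16), name_len (u8), file_type (u8).
DIRENT_HEADER = struct.Struct("<IHBB")


def min_rec_len(name_len):
    """Smallest valid rec_len for a name of this length, rounded up to 4."""
    return (DIRENT_HEADER.size + name_len + 3) & ~3


def check_dir_entry(block, offset, inodes_count):
    """Return an error string for a bad entry, or None if it looks sane.

    Hypothetical helper modeled on the messages the kernel prints;
    inodes_count stands in for the filesystem's total inode count.
    """
    inode, rec_len, name_len, _ftype = DIRENT_HEADER.unpack_from(block, offset)
    if rec_len < min_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < min_rec_len(name_len):
        return "rec_len is too small for name_len"
    if offset + rec_len > len(block):
        return "directory entry across blocks"
    if inode > inodes_count:
        return "inode out of bounds"
    return None


# The values from the log above, placed at offset 0 of an otherwise
# zeroed 4096-byte block (block size and file_type=0 are assumptions).
bad_block = DIRENT_HEADER.pack(1969317987, 25646, 101, 0).ljust(4096, b"\0")
result = check_dir_entry(bad_block, 0, inodes_count=1_000_000)
# result == "rec_len % 4 != 0"
```

The check only looks at whatever bytes are in memory at that moment, which
is why it cannot tell which of the three cases above produced them.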
So this could be caused by suspend/resume corrupting the buffer cache,
but it could have been caused by a million other things as well,
including proprietary kernel modules. In the past, some proprietary
kernel modules, particularly the video drivers, have been notorious for
having random memory corruption bugs that would show up as bugs in
other kernel subsystems. That is why YES, a video driver can cause
what looks like an ext3 bug or a hardware bug, and why many
kernel developers will refuse to help you if you have proprietary binary
modules loaded. It's nothing personal, it's just that some folks have
wasted a lot of time chasing phantom bugs that were caused by buggy
proprietary video drivers in particular.
- Ted