ext4 - corrupt file - backup copy is fine

CX
Grafter
Posts: 750
Thanks: 4
Registered: ‎16-09-2010

ext4 - corrupt file - backup copy is fine

I recently audited a selection of files on my home server (running Ubuntu with Kernel 3.2), comparing hashes to those on my backup server. One of the files I compared had a different hash, and upon further investigation it has indeed become corrupted towards the end of the file.
The copy on the backup server is correct, so the "live" copy has been corrupted somehow. Yet, the file is only accessed from a read-only share. The only time it was written to was when it was initially copied onto the disk. That is the point that I would have expected the corruption to creep in, but it can't be that, because the backup was created whilst it was read-only (and the backup would have been equally corrupted!). The modified times of the file and its backup are identical (so it hasn't been changed by normal means since the backup). I can read the file multiple times, and every time it returns the same, wrong, hash. The data on the disk is obviously wrong, but SMART doesn't indicate any pending/reallocated sectors on the disk, or even any read errors. I don't see how the disk can then misread a sector consistently wrongly! ext4 is mounted using defaults (so barriers will be enabled).
Oh, and this is just a JBOD arrangement, no RAID, so there should be absolutely no reason for that file, or the sectors it is stored on, to have been written to.
How on earth can this happen? Is it just one of those random bitrot occurrences that the disk's ECC can't detect? Or is it more likely that I've run into a software issue?
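For anyone wanting to repeat this kind of audit, here is a minimal sketch of the hash comparison described above. The post doesn't say which hash or tool was used, so SHA-256 and the function names `file_sha256` and `same_content` are assumptions for illustration only.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(live_path, backup_path):
    """True if the live file and its backup hash to the same value (hypothetical helper)."""
    return file_sha256(live_path) == file_sha256(backup_path)
```

Reading in chunks keeps memory use flat even for a 700MB file like the one in this thread.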
3 REPLIES 3
CX
Grafter
Posts: 750
Thanks: 4
Registered: ‎16-09-2010

Re: ext4 - corrupt file - backup copy is fine

I've done a bit more digging, comparing the binary data of the two files.
The corrupted copy has a 16KB chunk that is almost entirely zeros. Towards the end of this zeroed chunk is the string "EFI PART" followed by some more garbage characters, which leads me to believe that this is the backup copy of the GPT, written smack bang in the middle of my data.
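A quick way to check for a stray GPT header inside a file is to search for the 8-byte signature "EFI PART". The sketch below is only an illustration; the helper name `find_signature` is made up, and it works on any object with a `find` method (a `bytes` value, or an `mmap` of a large file).

```python
GPT_SIGNATURE = b"EFI PART"  # 8-byte signature at the start of a GPT header


def find_signature(data, signature=GPT_SIGNATURE):
    """Return every byte offset at which `signature` occurs in `data`."""
    offsets = []
    pos = data.find(signature)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(signature, pos + 1)
    return offsets
```

For a large file, `data` could be `mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)` so the whole file is not loaded into memory at once.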
CX
Grafter
Posts: 750
Thanks: 4
Registered: ‎16-09-2010

Re: ext4 - corrupt file - backup copy is fine

It looks like my problem was caused by the 2.2TB limit.
In case anyone is interested, here is how I came to that conclusion.
First job was to find the location of the damage within the file. For this, I used vbindiff. Comparing both files showed the corruption to be in the range 27C6 DA00 to 27C7 1A5B (in hex). This corresponds to 667343360 to 667359835 (in bytes). Since the file system uses 4096-byte blocks, the corruption therefore lies between the 162925th and 162930th blocks (of the file).
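The offset-to-block arithmetic above can be sketched as follows. The byte offsets are the decimal values quoted in the post; the function name `byte_range_to_blocks` is hypothetical, and block indices are counted from zero.

```python
BLOCK_SIZE = 4096  # file system block size stated in the post


def byte_range_to_blocks(first_byte, last_byte, block_size=BLOCK_SIZE):
    """Map an inclusive byte range to the inclusive range of 0-based block indices."""
    return first_byte // block_size, last_byte // block_size


first_bad, last_bad = 667343360, 667359835  # damaged range within the file
print(byte_range_to_blocks(first_bad, last_bad))  # (162925, 162929)
print(last_bad - first_bad + 1)                   # 16476 bytes, roughly 16KB
```

Counting from zero, the damage touches blocks 162925 through 162929, which is the "between the 162925th and 162930th" range given above.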
Next job was to find the inode number of the file:
ls -li badfile.mkv

which gave the following output:
41943087 -rwxrw-rw- 1 user user 702288980 2012-03-02 03:27 badfile.mkv

This gave the inode to be 41943087.
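The same inode number can also be read programmatically. This is just a small sketch using the standard `os.stat` call; the wrapper name `inode_of` is made up for the example.

```python
import os


def inode_of(path):
    """Return the inode number of `path`, as shown in the first column of `ls -li`."""
    return os.stat(path).st_ino
```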
Then I used
debugfs -R "stat <41943087>" /dev/sde1
which gave the following output:
Inode: 41943087   Type: regular    Mode:  0766   Flags: 0x80000
Generation: 2811018407    Version: 0x00000000:00000001
User:  1000   Group:  1000   Size: 702288980
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 1371672
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x51101a61:b9b1b374 -- Mon Feb  4 20:30:25 2013
atime: 0x4fafb570:00000000 -- Sun May 13 14:21:52 2012
mtime: 0x4f503e0f:00000000 -- Fri Mar  2 03:27:11 2012
crtime: 0x4fafb564:02362118 -- Sun May 13 14:21:40 2012
Size of extra inode fields: 28
EXTENTS:
(0-32767): 195442688-195475455, (32768-65535): 195508224-195540991, (65536-98303): 195598336-195631103, (98304-131071): 195631104-195663871, (131072-163839): 195663872-195696639, (163840-171457): 195696640-195704257
(END)

The filesystem is ext4, hence the use of extents. We know that the corrupt portion of the file covers blocks 162925 to 162930, which means it lies in the second-to-last extent used by this file (i.e. 195663872-195696639). With 4096-byte blocks, that extent spans byte offsets 801439219712 to 801573433344 within the partition, so the damage is somewhere in that range.
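The extent lookup can be done mechanically. The sketch below parses the EXTENTS line from the debugfs output above and maps a file block to its on-disk block; the helper names are hypothetical, and the regular expression only handles the "(a-b): c-d" range form shown in that output.

```python
import re

# Matches "(logical_start-logical_end): physical_start-physical_end"
EXTENT_RE = re.compile(r"\((\d+)-(\d+)\):\s*(\d+)-(\d+)")

EXTENTS = (
    "(0-32767): 195442688-195475455, (32768-65535): 195508224-195540991, "
    "(65536-98303): 195598336-195631103, (98304-131071): 195631104-195663871, "
    "(131072-163839): 195663872-195696639, (163840-171457): 195696640-195704257"
)


def parse_extents(text):
    """Return a list of (log_start, log_end, phys_start, phys_end) tuples."""
    return [tuple(int(g) for g in m.groups()) for m in EXTENT_RE.finditer(text)]


def logical_to_physical(extents, logical_block):
    """Translate a 0-based file block number into a file system block number."""
    for log_start, log_end, phys_start, _phys_end in extents:
        if log_start <= logical_block <= log_end:
            return phys_start + (logical_block - log_start)
    raise ValueError(f"block {logical_block} is not covered by any extent")


extents = parse_extents(EXTENTS)
first_bad_block = logical_to_physical(extents, 162925)
print(first_bad_block)         # 195695725
print(first_bad_block * 4096)  # 801569689600, byte offset within the partition
```

This narrows the damage from the whole extent down to a specific block: file block 162925 sits 31853 blocks into the extent that starts at 195663872.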
The total size of the partition is 3000590351872 bytes (it's a 3TB disk). The start of that extent is therefore 2199151132160 bytes from the END of the partition. This is simply too close to the 32-bit LBA limit (2^32 sectors * 512 bytes = 2199023255552 bytes) to be a coincidence.
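Here is the same arithmetic as a small sketch, using only the figures from this post. It also repeats the calculation for the first damaged byte itself rather than the start of the extent; that offset is derived here as block 195663872 + (162925 - 131072), plus the 2560-byte offset of the damage within that block. The names are made up for the example.

```python
BLOCK_SIZE = 4096
PARTITION_BYTES = 3000590351872   # partition size quoted in the post
LBA32_LIMIT = 2**32 * 512         # 2199023255552 bytes, the "2.2TB limit"

extent_start = 195663872 * BLOCK_SIZE
# First damaged byte: physical block of file block 162925, plus offset inside the block.
first_bad_byte = (195663872 + (162925 - 131072)) * BLOCK_SIZE + 667343360 % BLOCK_SIZE


def distance_from_end(offset, total=PARTITION_BYTES):
    """Bytes between `offset` and the end of the partition."""
    return total - offset


print(distance_from_end(extent_start))                  # 2199151132160
print(distance_from_end(extent_start) - LBA32_LIMIT)    # 127876608 (about 122 MiB)
print(distance_from_end(first_bad_byte) - LBA32_LIMIT)  # -2595840 (about 2.5 MiB)
```

Measured from the first damaged byte, the distance from the end of the partition is within a few MiB of the 2^32-sector figure. The remaining difference would have to be explained by details not given in the post, such as where the partition starts on the disk and the exact disk size.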
A Seagate knowledge base article references a particular phenomenon.
Seagate wrote: "Like other kinds of software, the Intel RST drivers are updated to keep pace with new technology. The Intel drivers found in retail releases of Windows 7 have a 2.2TB limitation. Rather than cut off the capacity at 2.2TB, the limitation expresses itself as the remainder above 2.2TB. In other words, the driver causes the Windows operating [system] to see a 3TB drive as 746.52 GiB (or 800GB)."
This is exactly where my file was located, and I suspect that this is what happened. I'm using Linux though.
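The "remainder above 2.2TB" behaviour amounts to keeping only the low 32 bits of the sector count. A minimal sketch follows; the total sector count 5860533168 is an assumption (a common figure for 3TB drives, not something stated in this thread), and `wrapped_capacity_bytes` is a made-up name.

```python
SECTOR_BYTES = 512
LBA32_SECTORS = 2**32

# ASSUMPTION: typical sector count of a 3TB drive; not taken from the post.
ASSUMED_3TB_SECTORS = 5860533168


def wrapped_capacity_bytes(total_sectors, sector_bytes=SECTOR_BYTES):
    """Capacity seen when only the low 32 bits of the sector count survive."""
    return (total_sectors % LBA32_SECTORS) * sector_bytes


apparent = wrapped_capacity_bytes(ASSUMED_3TB_SECTORS)
print(apparent)                     # 801569726464 bytes, i.e. about 800GB
print(round(apparent / 2**30, 2))   # 746.52 (GiB)
```

Under that assumption, the result matches the 746.52 GiB figure in the Seagate quote.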
My disks are partitioned using GPT. GPT stores the main GPT header at the start of the disk and a backup GPT header at the end. I suspect that what happened is that I accidentally connected the 3TB disk to an old 2.2TB-limited controller (there is an old SII3132 card in my server), booted up and it was detected as 800GB; hence the backup GPT was "missing". Something (e.g. fsck) then "helpfully" put the backup GPT back where it thought it should be. Once the disk was moved back onto a 3TB-aware controller, the full capacity was once again exposed, so there is now the main GPT header, the backup GPT header and a phantom GPT header (showing as corruption in a file).
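To illustrate where such a phantom backup GPT would land, here is a rough sketch. It assumes the usual GPT layout (128 partition entries of 128 bytes each, stored immediately before the backup header, which occupies the last visible sector) and 512-byte sectors. The visible sector count in the example is hypothetical: it is what a drive with 5860533168 sectors would appear to have after a 32-bit wrap, and neither number comes from the post.

```python
def backup_gpt_span(visible_sectors, sector_bytes=512, entry_array_bytes=128 * 128):
    """Return (start_byte, end_byte_exclusive) of the backup entry array plus header.

    Assumes the backup header sits in the last visible sector and the
    partition entry array occupies the sectors immediately before it.
    """
    header_lba = visible_sectors - 1
    array_sectors = entry_array_bytes // sector_bytes
    array_start_lba = header_lba - array_sectors
    return array_start_lba * sector_bytes, (header_lba + 1) * sector_bytes


# HYPOTHETICAL: 5860533168 sectors seen through a 32-bit LBA wrap.
visible = 5860533168 % 2**32
start, end = backup_gpt_span(visible)
print(start, end)    # 801569709568 801569726464
print(end - start)   # 16896 bytes: 16 KiB of entries plus one header sector
```

This layout fits the earlier observation of a 16KB mostly-zero chunk with "EFI PART" near its end: an entry array with only a few partitions defined is mostly zeros, and the header follows it. Incidentally, the damaged range is 16476 bytes long, which equals 16384 + 92, i.e. a 16 KiB entry array followed by a 92-byte GPT header under the same assumptions.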
Lesson to be learned - make sure you connect your disks to a compatible controller! I recently replaced the SII3132 card with an ASM1061 for other reasons, so the chances of this happening again should be slim.
HairyMcbiker
All Star
Posts: 6,792
Thanks: 266
Fixes: 21
Registered: ‎16-02-2009

Re: ext4 - corrupt file - backup copy is fine

Never had a 3TB disk, have enough issues with 2TB ones and USB caddies. (I have 3, 2 of which will not work with 2TB disks)
Glad you found the issue, strange one though.