Recently, I stopped being able to read mail sent to my CClub email address on my local machine. This was problematic, since I receive more than half of my email on this address, and rely on my local email client to handle it efficiently. All programs involved were extraordinarily unhelpful about what was wrong, though I have now solved the issue. Here's how.

My Local Email Setup

My email client is notmuch (specifically, the emacs interface). It is a tag-based client that reads local maildirs. To deliver these maildirs, I used offlineimap to transfer the email over secured (kerberized) IMAP. To send outgoing mails, notmuch/emacs hands off to msmtp, again secured with the GSSAPI. The offlineimap && notmuch new sequence was run as a cronjob (with errors not being discarded).

Remote Email Setup

CClub has three servers configured round-robin for SMTP, one of which doubles as the IMAP server. Because it was the standard at the time the machine setups were designed, they all run qmail for MTA, and the IMAP server runs Dovecot, configured to serve IMAP. Delivered mail is stored in the user's home directories (at ~/Maildir), which are located in AFS (recalling that the initial 'C' in "CClub" stands for for "CMU"), unless the user chooses to forward their email to another address (which I have purposefully not done, though almost all other users do).

Failure Mode

I noticed that something was wrong when people started mentioning emails to lists that I was on that, somehow, I had not seen. I tested sending mail from my CClub account to another account, and that worked without trouble; offlineimap had not been throwing any errors, either. At some point in disbelief that I hadn't gotten any new emails in days, I ran notmuch by hand, expecting that it had ignored files. In the process, I ran offlineimap by hand as well, and noticed that it was taking a long time (upward of 40 minutes) to sync my email. And after those 40 minutes, why, I still had no new mail. I ran it again as verbosely as possible, and nothing peculiar caught my eye; every so often, it would append "Hang in there!" to a string, and that was about it. Eventually it would exit. With no new messages.

At this point, I picked up the lead pipe labeled "root" and started swinging. I waded into the jungle of the IMAP server, wishing all the while I had a machete instead of a pipe. The first casualty was kern.log, where I discovered many entries that looked like

Jul 16 04:37:30 hostname kernel: afs: Tokens for user of AFS id 0 for cell our.cell.addr have expired

Or rather, I would have, had I gone to kern.log first, and not used dmesg. dmesg doesn't show the timestamps, you see. Once it was decided that those were irrelevant, I went after dovecot.log. Except that there was more than one. This isn't out of the ordinary, but... maybe it's better if I show what I saw.

imap:/var/log# ls -lh dovecot*
-rw------- 1 root root  18M 2013-07-16 05:01 dovecot.log
-rw------- 1 root root  53M 2013-07-07 00:55 dovecot.log.1
-rw------- 1 root root 8.5M 2013-05-17 21:06 dovecot.log.2.gz
-rw------- 1 root root 9.3M 2011-09-14 09:32 dovecot.log.3.gz

I opened up dovecot.log first, and it was full of

dovecot: 2013-07-07 03:26:02 Error: IMAP(myusername): rename(/afs/club/usr/myusername/Maildir/new/number.othernumber.imaphost, /afs/club/usr/myusername/Maildir/cur/number.othernumber.imaphost:2,) failed: File too large

tail -f while I connected with offlineimap showed first that the kern.log entries were unrelated, and second that the spew of errors into dovecot.log happened every time I attempted to download my messages.

I looked at a random one of these messages (7k in size), and verified that dovecot was set to allow files this large (it was).

What Went Wrong

This is the command that finally explained everything.

root@imap:/afs/club/usr/myusername/Maildir/cur$ ls | wc -l
31706

So why is that a problem? Well, looking at a random assortment of entries in that directory, they're all about 31-35 characters in filename length. The problem, as spelled out on this post to the OpenAFS mailing list, is that there are too many files in this directory.

If you're reading this and are worried about how many files your mail directory is taking up on disk, don't be. I looked. If you're using a modern file system that's not AFS (including ext3, ext4, btrfs, and a number of others), realistically the files per directory maximum is not something that is likely to be hit. Of course, my words should be taken with caution here, for the AFS developers likely said the same thing about AFS.

Resolution

The key to keeping this from happening again is to stop storing mail on the imap server. This means deleting messages as they are downloaded. Unfortunately, offlineimap does not support this. Fortunately, getmail does. And it's even easier to configure. (Side note: when I ran getmail for the first time against the CClub IMAP server, it immediately told me that something was wrong with the server, which offlineimap had never done. So maybe it wasn't so unfortunate that I had to switch after all.)

Unfortunately, getmail would only download a few messages before the server offered it messages that it couldn't deliver (because it had to "rename" them from new to cur), and would spew errors to dovecot.log before exiting again. I'm sure there's a more elegant solution than what this, but I just ran getmail in a loop all night until it had downloaded all of the messages. Five minutes after I woke up, it finally finished. (Remember, I had ~32,000 messages in just the cur directory.) Then I cleared the directories on the imap server.

I won't go too deeply into the notmuch side of the fix; I've written the steps needed on the notmuch website already (section "Automatically retagging the database").

And if nothing else, I now have can relate the one case where the system designer should have used mbox instead of maildir.