Code performance, and the quest for it, does funny things to judgement. I have been playing with a simple, lightweight Java driver for 10gen's Mongo database.
It started out as a project just to prove I understood the wire protocol (I was trying to write the spec for something we wanted to do...). The protocol is fairly simple, so the driver comes out simple too. I did add a few things, like a "dynamic" version to make it easy to use from dynamic(-ish) languages on the JVM (e.g. Clojure, Scala, etc.).
Tonight I decided to futz with it a little, and switched the I/O from Socket to NIO's SocketChannel, and then decided to see how much speed I could get if I switched to "direct" allocation of the ByteBuffers. (Basically, the storage for the buffer comes from outside of the JVM heap, so that I/O can happen much faster since the data doesn't have to be copied out of an object on the JVM heap.)
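For anyone who hasn't used direct buffers, the difference at the call site is tiny. What follows is a minimal sketch, not the driver's actual code: the class name, the helper methods, and the "length prefix plus payload" framing are all made up for illustration. The only real API calls are ByteBuffer.allocate, ByteBuffer.allocateDirect, and the usual put/flip operations.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class DirectBufferSketch {

    // Heap buffer: backed by a byte[] that lives on the JVM heap.
    static ByteBuffer heapBuffer(int capacity) {
        return ByteBuffer.allocate(capacity).order(ByteOrder.LITTLE_ENDIAN);
    }

    // Direct buffer: storage is allocated outside the JVM heap, so a
    // channel can hand it to the OS without an intermediate copy.
    static ByteBuffer directBuffer(int capacity) {
        return ByteBuffer.allocateDirect(capacity).order(ByteOrder.LITTLE_ENDIAN);
    }

    // Hypothetical framing for illustration only: a 4-byte length
    // (counting itself) followed by the payload bytes.
    static ByteBuffer frame(ByteBuffer buf, byte[] payload) {
        buf.clear();
        buf.putInt(4 + payload.length);
        buf.put(payload);
        buf.flip(); // switch from writing to reading, ready for channel.write(buf)
        return buf;
    }

    public static void main(String[] args) {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        ByteBuffer heap = frame(heapBuffer(64), payload);
        ByteBuffer direct = frame(directBuffer(64), payload);
        System.out.println(heap.isDirect() + " " + heap.remaining());     // false 9
        System.out.println(direct.isDirect() + " " + direct.remaining()); // true 9
    }
}
```

Either buffer can be passed to SocketChannel.write unchanged. Direct buffers are more expensive to allocate than heap ones, so the usual advice is to allocate one and reuse it rather than create a new one per message.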
Performance more than tripled. Instead of being able to write 8k objects per second to the database, I could write 28k objects per second to the database. Fantastic!
I thought, now what can I do - I know where I can get some further optimizations by avoiding non-direct buffer allocations, and just....
The folly struck me. I'm guessing that a middle-of-the-road server is going to let me do 250-500 flushes/second to disk..... maybe more, but how much more?
Other than bragging rights, for what little that's worth, since there wasn't much effort required to do this, I now have to deal with the added complexity of the software going forward. I may roll the change back :)

FYI - since you've already switched to NIO, you might want to try out the zero copy optimization as well (see http://www.ibm.com/developerworks/linux/library/j-zerocopy/index.html) and see if that improves things any. Similar to the "direct buffer" optimization you already employed, zero copy is a pretty low-impact code change that has the potential for significant performance improvement.
HTH,
DR