I recently wrote a disk driver for my x86-64 OS. I also wrote a block caching mechanism with delayed writeback to disk.
Block Caching
Reading and writing blocks happens at a layer below the filesystem, so there is no notion of available or used blocks. This layer only reads, writes and caches blocks.
Reading a block
When a read request is executed, the requested block is first searched for in the cache. If the block is already cached, that data is returned. If the block does not exist in the cache, a new cache entry is created and marked as "pending read". The new cache entry is associated with the device and block number being requested. The request then blocks the current thread until the entry gets its "pending read" status cleared, which is done by the disk IRQ handler. Creating a new cache entry is done atomically, so only one entry at a time can be created. That mechanism prevents two identical read requests occurring at the same time from issuing two disk reads.
When a block is read from disk, it is kept in the cache. Every time it is accessed, a timestamp is updated to keep track of the latest access.
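To make the flow concrete, here is a minimal sketch of what the read path could look like. The helper names find_cache_entry(), allocate_cache_entry(), disk_busy(), block_yield() and get_timestamp(), as well as the BLOCK_SIZE constant, are assumptions for illustration and are not taken from the actual driver.
int cache_read(unsigned char device, unsigned long block, char *buffer)
{
    struct block_cache_entry *entry;

    // Entry creation is assumed to happen atomically so that two
    // simultaneous reads of the same block create only one entry.
    entry = find_cache_entry(device, block);
    if (entry == 0)
    {
        entry = allocate_cache_entry(device, block);
        entry->flags |= CACHE_READ_PENDING;

        // If no operation is in flight, kick the scheduler ourselves;
        // otherwise the next disk IRQ will do it.
        if (!disk_busy(device)) schedule_io();
    }

    // Block the current thread until the IRQ handler clears the flag.
    while (entry->flags & CACHE_READ_PENDING) block_yield();

    entry->lastAccess = get_timestamp();
    memcpy(buffer, entry->data, BLOCK_SIZE);
    return 0;
}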
Scheduling
A function called schedule_io() is called at the following times:
- At the end of a disk IRQ.
- After a cache entry is marked "pending read" and the disk driver is not busy (i.e., there is no pending operation whose completion IRQ would trigger the scheduler).
The schedule_io() function iterates through the list of cache entries, finds an entry that is "pending read", and requests the disk driver to read that sector from disk. Several different algorithms can be used to choose which "pending read" entry to service next. A common one is the "elevator" algorithm, where the scheduler executes the read operation for the sector closest to the last one read. This limits seeking on disk. An elevator that needs to go to floors 5, 2, 8, 4 will not jump around all those floors. If the elevator is currently at floor 3, it will do: 2, 4, 5, 8.
That is not the algorithm I chose to implement though. To keep it simple (and thus very inefficient), my scheduler just picks the first "pending read" entry it sees in the block cache list. When there are no more read requests, the scheduler proceeds with write requests, so read requests always have higher priority. This is good for speed, but bad for the reliability of data persistence.
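As a rough sketch, assuming the cache is a fixed-size array named cache of CACHE_SIZE entries and that disk_read_sector()/disk_write_sector() are the driver's request functions (hypothetical names), this simple scheduler could look like:
void schedule_io()
{
    unsigned int i;

    // Reads first: service the first "pending read" entry found.
    for (i = 0; i < CACHE_SIZE; i++)
    {
        if (cache[i].flags & CACHE_READ_PENDING)
        {
            disk_read_sector(cache[i].device, cache[i].block, cache[i].data);
            return;
        }
    }

    // No reads pending: fall back to the first "pending write" entry.
    for (i = 0; i < CACHE_SIZE; i++)
    {
        if (cache[i].flags & CACHE_WRITE_PENDING)
        {
            disk_write_sector(cache[i].device, cache[i].block, cache[i].data);
            return;
        }
    }
}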
Updating a block
When data needs to be written to an existing block, the block may already be loaded in memory. This means that it was either read earlier for some other reason, or it was read so that a small portion of it could be updated. Either way, it is already in the cache and needs to be written back to disk. In that case, the "pending write" flag is set on the entry and, when the scheduler picks it up, it sends a write request to the disk driver.
The following scenarios could occur:
- Trying to read while a write is pending
It doesn't matter. The block will be read directly (from memory). This can happen after writing into a block and reading it right away: you want the updated version.
- Trying to write a block that does not exist yet in the cache
This means that the block was never read and we just want to overwrite whatever is in it. A cache entry is created for the block and the data is copied into it. The "write pending" flag is then set.
- Trying to update while a write is pending
This call needs to block until the block has finished being written back to disk, because we want to avoid updating it in the middle of a write.
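Here is a sketch of how the write-side scenarios could be handled, reusing the hypothetical helpers from the read sketch above. The read-while-write-pending case needs no special code, since the read path simply returns the in-memory copy.
int cache_write(unsigned char device, unsigned long block, char *buffer)
{
    struct block_cache_entry *entry = find_cache_entry(device, block);

    if (entry == 0)
    {
        // The block was never read: create an entry and overwrite it entirely.
        entry = allocate_cache_entry(device, block);
    }
    else
    {
        // Updating while a write is pending: wait for the current
        // writeback to finish so the data is never modified mid-write.
        while (entry->flags & CACHE_WRITE_PENDING) block_yield();
    }

    memcpy(entry->data, buffer, BLOCK_SIZE);
    entry->lastAccess = get_timestamp();
    entry->flags |= CACHE_WRITE_PENDING;
    if (!disk_busy(device)) schedule_io();
    return 0;
}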
Block cache list
To keep things simple (and, again, very inefficient), I chose to implement the block cache list as a fixed-size array. A better approach would be to store the entries in a tree and let it grow as long as there is available memory.
Each cache entry is as follows:
#define CACHE_WRITE_PENDING 1
#define CACHE_READ_PENDING 2
#define CACHE_BLOCK_VALID 4
#define CACHE_IN_USE 8
#define CACHE_FILL_PENDING 16
struct block_cache_entry
{
    unsigned long block;
    char *data;
    unsigned char device;
    volatile unsigned char flags;
    unsigned long lastAccess;
} __attribute__((packed));
Each entry has a field for the sector number on disk and the device number to which the sector belongs. lastAccess is used by the cache purging algorithm. The flags field is a combination of the following bits:
- CACHE_WRITE_PENDING: The block contains valid data that has not been flushed to disk yet, but should be.
- CACHE_READ_PENDING: The block does not contain data yet and is waiting for a disk read operation to fill it.
- CACHE_BLOCK_VALID: The entry is valid. If 0, the entry is invalid and free to use for caching. If 1, it contains valid data that belongs to a sector on disk.
- CACHE_IN_USE: The entry is in use by the cache subsystem and should not be purged.
- CACHE_FILL_PENDING: The entry was created for a write operation but does not contain data yet, so it cannot be read nor flushed to disk, but it should not be purged either.
Clearing cached blocks
When there is no space left in the block cache list (in my case because the fixed-size array is full; for the tree version, when the tree cannot grow anymore), cached blocks must be cleared. The block cache finds the blocks with the oldest access time that are not pending a read or a write, and frees them. Obviously, this is a very simple algorithm that does not take into account the frequency of access, or the probability of access given the location on disk. But it works.
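A sketch of that purging logic, again assuming the fixed-size array layout and hypothetical names, could look like this:
struct block_cache_entry *purge_one_entry()
{
    unsigned int i;
    struct block_cache_entry *oldest = 0;

    for (i = 0; i < CACHE_SIZE; i++)
    {
        // Skip free entries and anything that is pending, in use or unfilled.
        if (!(cache[i].flags & CACHE_BLOCK_VALID)) continue;
        if (cache[i].flags & (CACHE_READ_PENDING | CACHE_WRITE_PENDING |
                              CACHE_IN_USE | CACHE_FILL_PENDING)) continue;

        // Keep track of the least recently accessed candidate.
        if (oldest == 0 || cache[i].lastAccess < oldest->lastAccess)
            oldest = &cache[i];
    }

    // Mark the victim invalid so its slot can be reused.
    if (oldest != 0) oldest->flags &= ~CACHE_BLOCK_VALID;
    return oldest;
}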
ATA driver
Just for reference, here is a sample of the disk driver. The full source code can be downloaded at the end of this article, but I will show a portion of it here anyway.