In an earlier post I said that the x86_64 architecture did not have a way to tag TLB entries. Apparently, it is possible. I don't know how I missed it. But there are caveats. I wrote another post about memory paging: Memory Paging
What the TLB does
The TLB is a cache for page translation descriptors. It stores information about virtual-to-physical memory mappings. When the CPU wants to access memory, it looks for a translation in the TLB. If a translation is not cached in the TLB, this is considered a "TLB miss". The CPU then has to fetch the page descriptor from RAM by walking the tables pointed to by register cr3. Once the descriptor is loaded in the TLB, it stays there until the TLB gets full. When the TLB is full, the CPU purges entries to replace them with newer ones. Access to the TLB is a lot faster than access to RAM. So the TLB really is just a page descriptor cache.
Benefits of TLB tagging
On the x86 architecture, when loading the cr3 register with a new PML4 address, the whole TLB gets invalidated automatically. This is because entries in the TLB might not describe pages correctly according to the newly loaded tables. Entries in there are considered stale. Address vX might point to pX, but before the PML4 change, it was pointing to pY. You don't want the CPU to use those mappings.
Normally, you would change the mapping on a process change, since the page mapping is different for each process. But the TLB is quite large; it could easily hold entries for two processes. So a full TLB flush would be expensive for no reason. That is why Process Context Identifiers (PCIDs) were introduced.
First, you need to enable the PCID feature on the CPU in the cr4 register. With PCID enabled, a load to cr3 will no longer invalidate the TLB. So this is dangerous if you do not maintain the PCID for each process from that point on. Now, every time you load cr3, you must change the current PCID. Each entry added to the TLB will be tagged against the current PCID. This "tag" is invisible to us; it is only used internally in the TLB (the whole TLB is invisible to us anyway). So now, if two processes access address vX with two different physical mappings, those two entries can reside in the TLB without conflicting with each other. When doing a lookup in the TLB, the CPU will ignore any entries tagged with a PCID different from the current one. So if virtual address vA exists in the TLB tagged with PCID 1, and process 2 tries to use address vA, the CPU will generate a TLB miss. Exactly what we want.
But then what about pages such as the one that contains the interrupt handler code? Every thread will want such a page mapped, and the mapping would be identical. Using PCIDs would result in several TLB entries describing the same mapping but for different PCIDs. Those entries pollute the TLB (and waste precious cache space) since they all map to the same place. Well, this is exactly why there is a "g" bit, the Global flag, in each page descriptor. The CPU will ignore the PCID for pages that are global. It would be considered an incorrect usage of the paging system if a page is global but has a different physical address in different threads, so the global flag is to be used carefully. I use it for kernel code and MMIO.
Advantages at a certain cost
So PCIDs are a way to avoid flushing the TLB on each cr3 load, which would become VERY expensive as the CPU would generate TLB misses after every context switch. Now there are no more TLB flushes, but the tagging still guarantees the integrity of page mappings across threads. Can you see how important that feature is and how big the benefits are? This will considerably increase performance in your OS.
But there is one thing you must consider. When destroying a process and then, possibly, recycling the process ID for a new process, you must make sure that there are no stale entries of that last process in the TLB. The INVPCID instruction is just for that: it allows you to invalidate all TLB entries associated with a particular PCID. But if you are running on a multi-CPU system, things get complicated. The INVPCID instruction will only execute on the CPU executing it (obviously). But what if other CPUs have stale entries in their TLBs? You then need to do a "TLB shootdown". In my OS, a TLB shootdown is done by sending an Inter-Processor Interrupt (IPI) to all CPUs. The IPI tells them to invalidate their TLB entries for the PCID that was shared as part of the IPI. As you can guess, this can be very costly. Sending an IPI is very expensive, as all CPUs will acquire a lock and disable their interrupts, plus all the needed processing. But it would only happen every time a new process gets created.
How to use it
First, enable the PCID feature. This is simply done by setting bit 17 in CR4. Then, every time you load CR3, the lower 12 bits are used as the PCID. So you need to guarantee a unique 12-bit ID for every process. But that can be a problem: 12 bits only allows 4096 PCIDs. If you plan on supporting more than 4096 processes simultaneously, then you will need to come up with some dynamic PCID scheme.
Unfortunately, my CPU does not support INVPCID. It does support PCID though. It makes no sense, in my head, that the CPU would support PCID but not INVPCID. So I had to work around it. I found the solution by starting a thread at forum.osdev.org. With PCID enabled, loading cr3 with bit 63 clear makes the CPU invalidate all TLB entries associated with the PCID being loaded, while loading it with bit 63 set skips that invalidation. So the trick is to briefly load cr3 with the victim PCID and bit 63 clear, then switch back with bit 63 set so the current PCID's entries survive. I came up with the following solution
////////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////////
// emulate_invpcid(rdi=PCID)
// will invalidate all TLB entries associated with the PCID specified in RDI.
// This emulation briefly loads a temporary cr3 value carrying the victim
// PCID with bit 63 clear (which forces the invalidation), then restores
// the original cr3 with bit 63 set so the current PCID is not flushed.
////////////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////////////
emulate_invpcid:
push %rax
and $0xFFF,%rdi        // keep only the 12-bit PCID
or $PML4TABLE,%rdi     // combine with a temporary PML4 address;
                       // bit 63 stays clear to force the invalidation
mov %cr3,%rax          // save the current cr3
// We need to mask interrupts because if we got preempted
// while the temporary page table address is loaded,
// things would go bad.
pushfq
cli
mov %rdi,%cr3          // invalidates the victim PCID's entries
btsq $63,%rax          // don't flush the current PCID on the way back
mov %rax,%cr3          // restore the original address space
popfq
pop %rax
ret