AVX/SSE AND CONTEXT SWITCHING

2014-03-18

This article describes the way I designed AVX/SSE support in my homebrew OS.

AVX registers

In long mode, there are 16 XMM registers, each 128 bits wide. With AVX, these registers are extended to 256 bits and named YMM. The YMM registers are not new registers, only extensions: YMM0 is to XMM0 what AX is to AL, meaning that XMM0 is the lower 128 bits of the YMM0 register.

The XCR0 register selects which processor states the XSAVE and XRSTOR instructions are allowed to save and restore. Bits in XCR0 are set with the XSETBV instruction. Each bit represents a feature set:

0b001: FPU feature set. Will save/restore content of FPU registers
0b010: XMM feature set. Will save/restore all XMM registers (128bit)
0b100: YMM feature set. Will save/restore upper half of YMM registers

Since YMM registers are 256 bits wide and the XMM registers alias their lower 128 bits, both bit 1 and bit 2 must be enabled in order to save the entire content of the YMM registers.

Enabling AVX support

Enable monitoring media instruction to generate #NM when CR0.TS is set: CR0.MP (bit 1) = 1
Disable coprocessor emulation: CR0.EM (bit 2) = 0
Enable fxsave instruction: CR4.OSFXSR (bit 9) = 1
Enable #XF instead of #UD when a SIMD exception occurs: CR4.OSXMMEXCPT (bit 10) = 1
Enable XSETBV: CR4.OSXSAVE (bit 18) = 1
Enable FPU, SSE, and AVX processor states: XCR0 = 0b111


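mov     %cr0,%rax
or      $0b10,%rax                  # set CR0.MP (bit 1)
and     $0xFFFFFFFFFFFFFFFB,%rax    # clear CR0.EM (bit 2)
mov     %rax,%cr0

mov     %cr4,%rax
or      $0x40600,%rax               # OSFXSR (bit 9), OSXMMEXCPT (bit 10), OSXSAVE (bit 18)
mov     %rax,%cr4

mov     $0,%edx                     # EDX:EAX = value to load
mov     $0b111,%eax                 # FPU + XMM + YMM
mov     $0,%ecx                     # ECX = 0 selects XCR0
xsetbv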
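
The constants used when programming CR0, CR4 and XCR0 follow directly from the bit positions listed above. Here is a minimal C sketch that derives them. The macro and function names are my own for illustration and do not come from any header.

```c
#include <stdint.h>

/* Bit positions as listed in the text. All names are illustrative. */
#define CR0_MP          (1ULL << 1)   /* monitor coprocessor */
#define CR0_EM          (1ULL << 2)   /* coprocessor emulation, must be 0 */
#define CR4_OSFXSR      (1ULL << 9)
#define CR4_OSXMMEXCPT  (1ULL << 10)
#define CR4_OSXSAVE     (1ULL << 18)
#define XCR0_X87        (1ULL << 0)
#define XCR0_SSE        (1ULL << 1)
#define XCR0_AVX        (1ULL << 2)

/* New CR0 value: set MP, clear EM, leave every other bit untouched. */
static uint64_t cr0_for_avx(uint64_t cr0)
{
    return (cr0 | CR0_MP) & ~CR0_EM;
}

/* New CR4 value: set OSFXSR, OSXMMEXCPT and OSXSAVE. */
static uint64_t cr4_for_avx(uint64_t cr4)
{
    return cr4 | CR4_OSFXSR | CR4_OSXMMEXCPT | CR4_OSXSAVE;
}

/* XCR0 value that enables the FPU, XMM and YMM feature sets. */
static uint64_t xcr0_for_avx(void)
{
    return XCR0_X87 | XCR0_SSE | XCR0_AVX;
}
```

The three CR4 bits OR together to 0x40600, and the mask that clears only CR0.EM is the complement of 0b100.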

Context Switching

On a context switch, it is important to save the state of all 16 YMM registers if we want to avoid data corruption between threads. Saving/restoring sixteen 256-bit registers can add a lot of overhead to a context switch (we could even wonder if implementing a fast_memcpy() is worth it because of that overhead). Saving/restoring is done with the XSAVE and XRSTOR instructions. Each instruction takes a memory operand that specifies the save area where registers will be dumped/restored. These instructions also look at the content of EDX:EAX to know which processor states to save: EDX:EAX is bitwise ANDed with XCR0 to determine which processor states to save/restore. In my case, I want to use EDX:EAX = 0b110 to save XMM and YMM, but not the FPU. Remember, if we set 0b100, only the upper half of each YMMx register is saved/restored. To get the lower half, we need to set bit 1 to enable XMM state saving.
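
The masking rule can be written down in a few lines of C. This is only a model of the rule described above, not kernel code, and the names are hypothetical.

```c
#include <stdint.h>

#define XSTATE_X87 (1ULL << 0)   /* FPU registers */
#define XSTATE_SSE (1ULL << 1)   /* XMM registers: low 128 bits of each YMM */
#define XSTATE_AVX (1ULL << 2)   /* upper 128 bits of each YMM */

/* Feature sets actually processed by XSAVE/XRSTOR: EDX:EAX ANDed with XCR0. */
static uint64_t xsave_effective_mask(uint64_t edx_eax, uint64_t xcr0)
{
    return edx_eax & xcr0;
}

/* Nonzero when the mask covers both halves of the YMM registers. */
static int covers_full_ymm(uint64_t mask)
{
    const uint64_t both = XSTATE_SSE | XSTATE_AVX;
    return (mask & both) == both;
}
```

With XCR0 = 0b111, a request of 0b110 covers the full YMM registers, while a request of 0b100 does not.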

Optimizing context switching - lazy switching

Since media instructions are not used extensively by all threads, it is possible that one thread does not use any media instructions during a time slice (or even during its whole lifetime). In such a case, saving/restoring the whole AVX state would add a lot of overhead to the context switch for absolutely nothing.

There is a workaround for this. In my OS, every time there is a task switch, I explicitly set the TS bit in register CR0. Every time a media instruction is executed while the CR0.TS bit is set, a #NM exception (Device Not Available) is raised. My OS then handles that exception to save/restore the AVX context. So if a task does not use media instructions during a time slice, no #NM is triggered and there is no AVX context switch. The logic is simple.

Assume that there is a global kernel variable called LastTaskThatRestoredAVX.
On task switch, set CR0.TS=1
media instruction is executed, so #NM is generated
on #NM:
- clear CR0.TS
- if LastTaskThatRestoredAVX==current task, return from exception (still the same context!)
- XSAVE into LastTaskThatRestoredAVX's save area
- XRSTOR from current task's save area
- LastTaskThatRestoredAVX = current task
Next media instruction to be executed will not trigger #NM, because we cleared CR0.TS
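
The steps above can be simulated in plain C to check the logic before writing the real handler. In this sketch, an int stands in for the XSAVE area and for the live YMM registers, and every name is hypothetical. The NULL check for the very first #NM is my addition, since at boot no task has restored AVX state yet.

```c
#include <stddef.h>

/* Hypothetical task structure: only what the sketch needs. */
struct task {
    int avx_state;                 /* stands in for the task's XSAVE area */
};

static struct task *last_task_that_restored_avx = NULL;
static struct task *current_task = NULL;
static int cr0_ts = 0;             /* models CR0.TS */
static int cpu_avx_state = 0;      /* models the live YMM registers */
static unsigned avx_switches = 0;  /* counts real AVX context switches */

/* Called by the scheduler on every task switch. */
static void on_task_switch(struct task *next)
{
    current_task = next;
    cr0_ts = 1;
}

/* Body of the #NM handler. */
static void on_device_not_available(void)
{
    cr0_ts = 0;
    if (last_task_that_restored_avx == current_task)
        return;                                   /* still the same context */
    if (last_task_that_restored_avx != NULL)
        last_task_that_restored_avx->avx_state = cpu_avx_state;  /* XSAVE */
    cpu_avx_state = current_task->avx_state;                     /* XRSTOR */
    last_task_that_restored_avx = current_task;
    avx_switches++;
}

/* Models the current task executing a media instruction that writes `value`. */
static void media_instruction(int value)
{
    if (cr0_ts)
        on_device_not_available();
    cpu_avx_state = value;
}
```

A task that never executes a media instruction never reaches on_device_not_available(), so it costs nothing beyond setting the TS flag.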

Save area

The memory layout of the saved registers will look like this (notice how highest 128bits of YMM registers are saved separately)
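
As a rough sketch, the standard (non-compacted) XSAVE area can be described with a C struct. The field names are mine. The offset of the upper YMM halves is formally reported by CPUID leaf 0x0D, sub-leaf 2; the 576 used here is the value I expect for the standard format, so treat it as an assumption and check it on your CPU.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the standard XSAVE area for FPU + XMM + YMM state. */
struct xsave_area {
    _Alignas(64) uint8_t x87_control[32]; /* FPU control/status words, MXCSR */
    uint8_t  st_mm[8][16];                /* ST0-ST7 / MM0-MM7 */
    uint8_t  xmm[16][16];                 /* XMM0-XMM15: low halves of YMM0-YMM15 */
    uint8_t  reserved[96];
    uint64_t xstate_bv;                   /* header: which feature sets are present */
    uint64_t xcomp_bv;
    uint8_t  header_reserved[48];
    uint8_t  ymm_hi[16][16];              /* upper halves of YMM0-YMM15 */
};

_Static_assert(offsetof(struct xsave_area, xmm) == 160, "XMM0 at byte 160");
_Static_assert(offsetof(struct xsave_area, xstate_bv) == 512, "header at byte 512");
_Static_assert(offsetof(struct xsave_area, ymm_hi) == 576, "YMM upper halves at byte 576");

/* Rebuild the 32 bytes of register YMMn from its two separately saved halves. */
static void read_ymm(const struct xsave_area *area, unsigned n, uint8_t out[32])
{
    memcpy(out, area->xmm[n], 16);
    memcpy(out + 16, area->ymm_hi[n], 16);
}
```

The save area must be 64-byte aligned, which is why the first member carries _Alignas(64).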

HOW TO ANSWER A QUESTION THE SMART WAY.

2014-01-06

Intro

The way I see it, the internet has made it easier for everyone to get answers and solutions for different problems. That's the beauty of the internet: information is easily accessed. If you think about web forums, they allow people to talk to each other. They allow you to ask a question and get an answer. Asking a question on a forum is easier than posting a question in a magazine or trying to find something in an encyclopedia. If you look at the section "Before you ask" in Eric Steven Raymond's "How To Ask Questions The Smart Way", he lists 7 steps that you should do before asking a question. Attempting those 7 steps defeats the whole point of making information easily accessible. So what if a person asks a question on a forum without having performed those 7 steps? Does it make it harder for you to answer the question? If you don't want to answer the question, then just don't answer. In my opinion, if the question was asked before and the answer was already provided, there is no harm in providing the answer a second time. The more the information is duplicated, the easier it gets to find that information. If you understand how the Google search engine works, you will know that this is true.

replying "Google it"

When a person asks a question and someone else replies "let me Google that for you" or just gives a link to a Google search, that person should just not reply at all. How many times have I Googled something, clicked the first result and landed on a forum where the OP asked the exact same question that I am asking myself, and the only answer is "Google it"? Well, I did Google it actually, and I am landing on a page that says to Google it. Was it really hard to provide the right answer or to just ignore the OP?

replying "why would you wanna do it like that" or "you shouldn't do that"

I see that too often. The OP asks something like "I wanna print a document that I just scanned.... blah blah... how do I do it?" and someone replies "why would you do that? just use the original document". Never mind why he wants to do it that way. Do you know the answer or not? If you don't, then don't reply. The other day I was searching for "how to create SSH keys on behalf of another user". I landed on a forum where the OP asked that same question and there was one reply: "You should not do that because the private key is private blah blah blah.". The person who replied may find it stupid to do such a thing, but I had very specific constraints that pushed me into doing that. Maybe I have a script running as root that creates keys for users. Maybe I have other reasons too. So if that person just found it odd to do such a thing and did not know the answer, maybe that person should have ignored the question.

Questions not to ask

In Eric Steven Raymond's "How To Ask Questions The Smart Way", you can find this:


Q: Where can I find program or resource X?
A: The same place I'd find it, fool at the other end of a web search. Ghod, doesn't everybody know how to use Google yet?

Let me get this straight: because you used to walk 4 miles in 4 feet of snow to go to school, I shouldn't take the bus? You just said that you found it at the other end of a web search, so do us a favor and share the information so we don't have to do a big search like that. And by giving us the link and duplicating that answer, the link will end up ranking high in Google.

Conclusion

"How To Ask Questions The Smart Way" seems to have been written by a smart person who is really tech savvy but has neither the skills nor the patience to share his knowledge. That person should not become a teacher.

My philosophy is: make the information easy to find. Why would I look up a word in a dictionary when the guy sitting across from me knows the definition and could tell me right now? The days of the teachers saying "You'll learn more if you work at finding it" are over. Make the information accessible. Duplicate the information and spend less time looking for answers. That's the whole point of the "information super highway". At least that's how my employer thinks. My boss will be very mad if I spend 8 hours searching for a solution on Google because a co-worker, who knows the answer, replies "Google it".

REALTEK 8139 NETWORK CARD DRIVER

2013-12-03

While building my homebrew OS, I got to the point where I needed a netcard driver. I run my OS in QEMU, which provides a RealTek 8139 netcard. The specs for that card are very easy to find.

Before I continue, you should know that when the datasheet specifies a register that is 2 bytes long (like ISR), it is important to read it as a 16-bit word even if all you need is the first 8 bits. I was reading ISR with "inb" and couldn't make my software work even though all I needed was the first byte. Changing "inb" to "inw" worked. The datasheet indicates that some registers need to be read or written as words or dwords even if it looks like they could be accessed as bytes.

Initializing

Enable the card: OUTPORTB(0,iobase+0x52);
Reset the card: write the "reset" bit (0x10) to register 0x37, then wait until that bit gets cleared:
unsigned char v=0x10; OUTPORTB(v,iobase+0x37); while ((v&0x10)!=0) INPORTB(v,iobase+0x37);
Enable TX and RX interrupts: OUTPORTW(0b101, iobase+0x3C); There are other interrupts in register 0x3C that can be interesting, but I just need TOK and ROK for now.
Enable 100 Mbps full duplex: OUTPORTB(0b00100001, iobase+0x63);
Set the Receive Configuration Register (RCR):
OUTPORTL(0x8F, iobase+0x44);
Looking at the datasheet, you can see what those bits mean. Basically, what we did is:
- set promiscuous mode
- accept frames for our MAC address
- accept frames for our multicast addresses
- accept broadcast frames
- do not accept runts and erroneous frames
- set the RX buffer size to 8k
- disable wrapping (this is done by setting the WRAP bit, bit 7). This means that if a frame is received and we are near the end of the RX buffer, the card will continue copying data past the end of the buffer. We are basically allowing buffer overflow here, so for this reason we need to give extra space to our buffer. I chose to use a 10k buffer just to be sure.
Set the RX buffer address. The details of this buffer will be explained in the next section. For now, let's just reserve the buffer (10k in my case) and tell the card about it: OUTPORTL(buf_addr, iobase+0x30);
Warning: The addresses for the TX and RX buffers must be physical addresses, not virtual addresses.
Set the Transmit Configuration Register (TCR): The default values after reset are fine, so I'm not touching that register.
Set the TX descriptors. For now, I won't go into the details of those buffers; this will be explained in the next section. All you need to know right now is that you need four 2k buffers and that you must tell the card about them:
OUTPORTL(buf_addr_desc0, iobase+0x20);
OUTPORTL(buf_addr_desc1, iobase+0x24);
OUTPORTL(buf_addr_desc2, iobase+0x28);
OUTPORTL(buf_addr_desc3, iobase+0x2C);
Enable TX and RX: OUTPORTB(0b00001100,iobase+0x37);
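
To make the 0x8F written to RCR less magic, here is a small sketch that builds it from named bits. The bit names are the ones I read in the RTL8139 datasheet; double-check them against your copy. The function name is hypothetical.

```c
#include <stdint.h>

/* RCR bits (names as I understand them from the datasheet). */
#define RCR_AAP       (1u << 0)   /* accept all packets (promiscuous mode) */
#define RCR_APM       (1u << 1)   /* accept frames matching our MAC address */
#define RCR_AM        (1u << 2)   /* accept multicast frames */
#define RCR_AB        (1u << 3)   /* accept broadcast frames */
#define RCR_AR        (1u << 4)   /* accept runts (left clear) */
#define RCR_AER       (1u << 5)   /* accept erroneous frames (left clear) */
#define RCR_WRAP      (1u << 7)   /* 1 = keep writing past the end of the buffer */
#define RCR_RBLEN_8K  (0u << 11)  /* RX buffer length field: 8k */

/* Value for the Receive Configuration Register used in this article. */
static uint32_t rtl8139_rcr_value(void)
{
    return RCR_AAP | RCR_APM | RCR_AM | RCR_AB | RCR_WRAP | RCR_RBLEN_8K;
}
```

Building the value this way makes it easy to see that runts and erroneous frames are rejected simply because their bits are never set.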

This is my init code. Note that there is some PCI stuff in there that I don't describe; I am assuming that you have a PCI driver written at this point.


void initrtl8139()
{
    unsigned int templ;
    unsigned short tempw;
    unsigned long i;
    unsigned long tempq;

    deviceAddress = pci_getDevice(0x10EC,0x8139); // vendor, device. Realtek 8139
    if (deviceAddress == 0xFFFFFFFF)
    {
        pf("No network card found\r\n");
        return;
    }

    for (i=0;i<6;i++)
    {
        unsigned int m = pci_getBar(deviceAddress,i);
        if (m==0) continue;
        if (m&1)
        {
            iobase = m & 0xFFFC;
        }
        else
        {
            memoryAddress = m & 0xFFFFFFF0;
        }
    }

    irq = pci_getIRQ(deviceAddress);
    registerIRQ(&handler,irq);
    pci_enableBusMastering(deviceAddress);

    // Activate card
    OUTPORTB(0,iobase+0x52);

    // reset
    unsigned char v=0x10;
    OUTPORTB(v,iobase+0x37);
    while ((v&0x10)!=0)
    {
        INPORTB(v,iobase+0x37);
    }

    INPORTL(templ,iobase+4);
    tempq = templ;
    tempq = tempq <<32;
    INPORTL(templ,iobase);
    tempq |= templ;
    macAddress = tempq;
}


void rtl8139_start()
{
    // Enable TX and RX:
    OUTPORTB(0b00001100,iobase+0x37);

    // Set the Receive Configuration Register (RCR)
    OUTPORTL(0x8F, iobase+0x44);

    // set receive buffer address
    // We need to use physical addresses for the RX and TX buffers. In our case, we are fine since
    // we are using identity mapping with virtual memory.
    OUTPORTL((unsigned char*)&rxbuf[0], iobase+0x30);  // this is a 10k buffer

    // set TX descriptors
    OUTPORTL((unsigned char*)&txbuf[0][0], iobase+0x20); // 2k aligned buffers
    OUTPORTL((unsigned char*)&txbuf[1][0], iobase+0x24);
    OUTPORTL((unsigned char*)&txbuf[2][0], iobase+0x28);
    OUTPORTL((unsigned char*)&txbuf[3][0], iobase+0x2C);

    // enable Full duplex 100 Mbps
    OUTPORTB(0b00100001, iobase+0x63);

    //enable TX and RX interrupts:
    OUTPORTW(0b101, iobase+0x3C);
}

Receiving

Since we have enabled the ROK and TOK interrupts, we will receive an interrupt when a new frame arrives. So from my interrupt handler, I check the ISR register to know if I got a TOK or ROK. If ROK, then proceed with getting the frame. First, some definitions:

CAPR: This register holds the address within the RX buffer where the driver should read the next frame. This register must be incremented by the driver when a frame is read. The netcard will check that register to determine if a buffer overrun is occurring.
packet header: This is a 4-byte field that is found at the beginning of the frame. The first word is a bitfield indicating if the frame is OK, if it was received as part of multicast, etc. More information can be found in section 5.1 of the datasheet. The following 2 bytes indicate the size of the frame.

This is what I do:

1) Trigger on interrupt: Since interrupts have been enabled, an IRQ will have been raised, so this will be done from the handler. We need to check ROK in the ISR register.
2) Get the position of the frame within the RX buffer by reading CAPR.
3) Get the size of the data: it is the 2nd 16-bit word of the frame (CAPR+2).
4) Copy the frame: its address starts at rx_buffer_base+CAPR.
5) Update CAPR: CAPR=((rxBufIndex+size+4+3)&0xFFFC)-0x10. We are adding 4 to take into account the header size, and the +3 followed by &0xFFFC aligns the result on a 4-byte boundary. I have no idea why we need to subtract 0x10 from there. Note that you should keep track of rxBufIndex separately, i.e. do not update it from CAPR every time.
6) Check the BUFE (buffer empty) bit in CMD. If it is clear, there are more frames in the buffer, so go back to step 2.
7) Write 1 to ROK in the ISR register.
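
The arithmetic of steps 3 and 5 can be isolated in small helpers, which makes it easy to test outside the driver. This is a sketch with hypothetical function names. It assumes the header layout described above (status word, then a little-endian size word).

```c
#include <stdint.h>

#define RX_HEADER_SIZE 4u

/* Size of the frame whose header starts at rx_buf_index (bytes 2 and 3). */
static uint16_t rx_frame_size(const uint8_t *rx_buffer, uint16_t rx_buf_index)
{
    const uint8_t *header = rx_buffer + rx_buf_index;
    return (uint16_t)(header[2] | (header[3] << 8));
}

/* Index of the next frame: skip header and data, round up to 4 bytes. */
static uint16_t rx_next_index(uint16_t rx_buf_index, uint16_t size)
{
    return (uint16_t)((rx_buf_index + size + RX_HEADER_SIZE + 3u) & 0xFFFCu);
}

/* Value to write to CAPR for a given next index (the 0x10 offset of step 5). */
static uint16_t rx_capr_value(uint16_t next_index)
{
    return (uint16_t)(next_index - 0x10u);
}
```

For example, a 64-byte frame at index 0 puts the next frame at index 68, and the value written to CAPR is 52.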

The receiving function:


unsigned long rtl8139_receive(unsigned char** buffer)
{
    if (readIndex != writeIndex)
    {
        unsigned short size;
        unsigned char* p = rxBuffers[readIndex];
        readIndex = (readIndex+1) & 0x0F; // increment read index and wrap around 16
        // PacketHeader.ROK not set: drop the frame. The index was already advanced,
        // so a bad frame cannot block the queue.
        if (!(p[0]&1)) return 0;
        size = p[2] | (p[3]<<8);
        *buffer = &p[4]; // skip header
        return size;
    }
    else
    {
        return 0;
    }
}

I also wrote a 64-bit memcpy in a separate ASM file:


// rdi = source, rsi = destination, rdx = size
memcpy64:
    push    %rcx
    xchg    %rdi,%rsi
    mov     %rdx,%rcx
    shr     $3,%rcx
    rep     movsq
    mov     %rdx,%rcx
    and     $0x07,%rcx
    rep     movsb
    pop     %rcx
    ret
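
For readers who do not read assembly, the same copy strategy can be sketched in portable C: copy as many 8-byte words as possible, then the remaining bytes one by one. The name memcpy64_c is hypothetical. The argument order (source, destination, size) follows the comment on the assembly routine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy `size` bytes from src to dst, 8 bytes at a time, then the tail. */
static void memcpy64_c(const void *src, void *dst, size_t size)
{
    const unsigned char *s = src;
    unsigned char *d = dst;
    size_t qwords = size >> 3;     /* number of full 8-byte words (rep movsq) */
    size_t tail = size & 0x07;     /* leftover bytes (rep movsb) */

    while (qwords--) {
        uint64_t q;
        memcpy(&q, s, sizeof q);   /* 8-byte load that is safe on unaligned data */
        memcpy(d, &q, sizeof q);
        s += 8;
        d += 8;
    }
    while (tail--)
        *d++ = *s++;
}
```

Like the assembly version, this does not handle overlapping buffers.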

The interrupt handler:


unsigned short isr;
INPORTW(isr,iobase+0x3E);
OUTPORTW(0xFFFF,iobase + 0x3E);
unsigned int status;
unsigned char  cmd=0;
unsigned short size;
unsigned short i;
if (isr&1)                  // ROK
{
    // It is very important to check this first because it's possible to get an interrupt
    // and still have cmd.BUFE set to 1 (buffer empty). That caused me lots of problems like
    // reading bad status, causing buffer overflows
    INPORTB(cmd,iobase+0x37);

    while (!(cmd&1))   // loop while CMD.BUFE == 0, i.e. the RX buffer is not empty
    {
        // if the last frame overflowed past the end of the buffer, the next one
        // will start at rxBufferIndex%RX_BUFFER_SIZE instead of zero
        if (rxBufferIndex>=RX_BUFFER_SIZE) rxBufferIndex = (rxBufferIndex%RX_BUFFER_SIZE);

        status =*(unsigned int*)(rxbuf+rxBufferIndex);
        size = status>>16;

        memcpy64((char*)&rxbuf[rxBufferIndex],(char*)&rxBuffers[writeIndex][0],size);

        rxBufferIndex = ((rxBufferIndex+size+4+3)&0xFFFC);
        OUTPORTW(rxBufferIndex-16,iobase+0x38);
        writeIndex = (writeIndex+1)&0x0F;
        if (writeIndex==readIndex)
        {
            // Buffer overrun
        }
        INPORTB(cmd,iobase+0x37);
    }
}

Sending

I found that sending was easier than receiving. The first thing that needs to be done is to set up the buffer pointers in TSAD0-TSAD3. I'm not sure if these buffers require any special alignment, but I've aligned mine on 2k boundaries.

Sending a frame

There are 4 TX buffers available. You should keep track of which one is free by incrementing an index every time you send a frame. This way, you will know which buffer to use next time. You need to copy your frame into the buffer pointed to by TSAD[CurrentSendIndex], then write the size of the frame into TSD[CurrentSendIndex] with bit 13 cleared. Bit 13 is the OWN bit; clearing it tells the card that this buffer is ready to be transmitted. Then you increment CurrentSendIndex to be ready for next time. At the next send, if TSD[CurrentSendIndex].bit13 is still cleared, it means that the buffer still belongs to the card and the frame has not been transmitted yet. This would indicate a buffer overrun: your software is sending faster than the card can handle.


unsigned long rtl8139_send(unsigned char* buf, unsigned short size)
{
    if (size>1792) return 0;
    unsigned short tsd = 0x10 + (currentTXDescriptor*4);
    unsigned int tsdValue;
    INPORTL(tsdValue,iobase+tsd);

    // Parentheses are required: == binds tighter than &
    if ((tsdValue & 0x2000) == 0)
    {
        // OWN is clear: the whole queue is pending packet sending
        return 0;
    }
    else
    {
        memcpy64((char*)&buf[0],(char*)&txbuf[currentTXDescriptor][0],size);
        tsdValue = size; // writing the size with OWN cleared starts the transmission
        OUTPORTL(tsdValue,iobase+tsd);
        currentTXDescriptor = (currentTXDescriptor+1)&0b11; // wrap around 4
        return size;
    }
}

Handling TX interrupt

Handling the interrupt is mostly done to detect send errors. I don't use it much. I won't go into details here, as the code explains pretty much everything.


unsigned short isr;
INPORTW(isr,iobase+0x3E);
OUTPORTW(0xFFFF,iobase + 0x3E);
if (isr&0b100)              //TOK
{
    unsigned long tsdCount = 0;
    unsigned int tsdValue;
    while (tsdCount <4)
    {
        unsigned short tsd = 0x10 + (transmittedDescriptor*4);
        transmittedDescriptor = (transmittedDescriptor+1)&0b11;
        INPORTL(tsdValue,iobase+tsd);
        // Test the register value (tsdValue), not the register offset (tsd)
        if (tsdValue&0x2000) // OWN is set, so it means that the data was transmitted to FIFO
        {
            if ((tsdValue&0x8000)==0)
            {
                //TOK is false, so the packet transmission was bad. Ignore that for now. We will drop it.
            }
        }
        else
        {
            // this frame is pending transmission, we will get another interrupt.
            break;
        }
        OUTPORTL(0x2000,iobase+tsd); // set length to zero to clear the other flags but leave OWN to 1
        tsdCount++;
    }
}

Documentation

These are good resources if you need more information on the rtl8139:

Get the full source code

REST INTERFACE ENGINE

2013-10-28

This is a REST engine API that I use for some of my projects. It is very simple to use and has no dependencies. One of its nicest features is that it documents the REST interface that you build with the engine. Note that this is only a REST engine and does not include a web server. You still need to listen on a socket for incoming requests, feed them to the engine and respond with the engine's output.

Defining your API and documenting it

Let's say you have an application that has a ShoppingCart object and you want to expose some of its functionality through a REST interface. Defining the API is as easy as this:


ShoppingCart *p = new ShoppingCart();
RESTEngine engine;
	
RESTCallBack *pc1 = new RESTCallBack(p,&ShoppingCart::addToCart,"This lets you add an item to a shopping cart");
pc1->addParam("id","Shopping cart ID");
pc1->addParam("sku","Item SKU");
pc1->addParam("qty","Quantity of that item to add");

RESTCallBack *pc2 = new RESTCallBack(p,&ShoppingCart::removeFromCart,"This lets you remove an item from a shopping cart");
pc2->addParam("id","Shopping cart ID");
pc2->addParam("sku","Item SKU");

engine.addCallBack("/shoppingcart/item","POST",pc1);
engine.addCallBack("/shoppingcart/item","DELETE",pc2);

Note how each resource URI and its parameters are documented at creation time.

Invoking and processing query

To invoke a query, you only need to get the URI (after parsing it from an HTTP request or whatever other way) and feed it to the engine. Of course, your API might want to return some data, so this is done by passing an empty JSON document object to the engine, and the callbacks will populate it with the response. (The JSON interface is part of the project as well. I told you, there are no external dependencies in this project :) )


Dumais::JSON::JSON j1,j2,j3;
engine.invoke(j1,"/shoppingcart/item?id=1&sku=242&qty=4","POST",bodyData);
engine.invoke(j2,"/shoppingcart/item?id=1&sku=244&qty=1","POST",bodyData);
engine.invoke(j3,"/shoppingcart/item?id=1&sku=244","DELETE",bodyData);

The engine will parse the parameters and route the requests to the proper callbacks. Callbacks are defined like this:


void ShoppingCart::addToCart(Dumais::JSON::JSON& j,RESTParameters* p, const std::string& data)
{
	std::string id = p->getParam("id");
	std::string sku = p->getParam("sku");
	std::string qty = p->getParam("qty");
	std::string test = p->getParam("test"); // this would return "" since param "test" was not defined as a valid param earlier.
	
	j.addValue("Item successfully added to cart","message");
}

Generate documentation

When creating the callbacks and the parameters, we defined a description for each of them. This means that the engine is aware of the documentation of the created interface. This allows you to generate the documentation using RESTEngine::documentInterface(). This method will populate a JSON object with the documentation of your API. Generating the documentation for our example here would give us:


{
	"api" : [
		{
			"description" : "This lets you add an item to a shopping cart",
			"path" : "/shoppingcart/item",
			"method" : "POST",
			"params" : [
				{
					"name" : "id",
					"description" : "Shopping cart ID"
				},
				{
					"name" : "sku",
					"description" : "Item SKU"
				},
				{
					"name" : "qty",
					"description" : "Quantity of that item to add"
				}
			]
		},
		{
			"description" : "This lets you remove an item from a shopping cart",
			"path" : "/shoppingcart/item",
			"method" : "DELETE",
			"params" : [
				{
					"name" : "id",
					"description" : "Shopping cart ID"
				},
				{
					"name" : "sku",
					"description" : "Item SKU"
				}
			]
		}
	]
}

With the documentation generated as a JSON document, it is easy to make a JavaScript application that gets the list of API calls and lets you experiment with it for prototyping. I wrote an application that gets the list of API calls and, for each call, shows the parameters that are defined and lets you enter a value in a text field. Then you can invoke the API call.

Thanks to William Tambellini for notifying me about a typo in this page

Source code download

Project can be found on github

javascript application to prototype