FOPS in context

Another late night bit of hacking last night to add FPU support to the context switch code (which incidentally is now committed).

It works by disabling the FPU when a new task is scheduled which doesn't match the owner of the FPU. FPU instructions then cause an illegal instruction exception, and that is used to switch the context if it was an FPU instruction.

I think I spent more time working out how to check for floating point instructions in the illegal instruction exception handler than anything else. Partly just finding out what the instruction bits were, but also just how to implement it without a big mess.

Basically, afaict the following bit patterns are those used by all of the FPU related instructions.

  1111001x        -        -        - advanced simd data processing
  xxxx1110        - xxxx101x xxx0xxxx vfp data processing
  xxxx110x        - xxxx101x xxxxxxxx ext reg l/s
  11110100 xxx0xxxx        -        - asimd struct l/s
  xxxx1110        - xxxx101x xxx1xxxx vfp<>arm transfer
  xxxx1100 010xxxxx xxxx101x        -

Well, the problem is that ARM can only compare to an immediate 8 bit value (albeit anywhere within the 32 bits). So it becomes a bit of a mess of mucking about with those, or loading constants from memory. In the end I opted for the constant load - I figure that a constant load from memory is not really any different from a constant load from the instruction stream, once in the L1 cache, so assuming you're going to save a lot of instructions it will probably be faster. Well it's smaller anyway ...

As a comparison I wondered what gcc would come up with, e.g. with:

void modecheck(unsigned int *a) {
        unsigned int v = *a;

        if ((v & 0xfe000000) == 0xf2000000
            || (v & 0x0f000e00) == 0x0e000a00
            || (v & 0x0e000e00) == 0x0c000a00
            || (v & 0xff100000) == 0xf4000000
            || (v & 0x0fe00e00) == 0x0c400a00)
                printf("do stuff\n");
        else
                printf("do other stuff\n");
}

And it's pretty ugly:

        ldr     r0, [r0, #0]
        and     r3, r0, #-33554432
        cmp     r3, #-234881024
        beq     .L2
        bic     r3, r0, #-268435456
        bic     r3, r3, #16711680
        bic     r3, r3, #61696
        mov     r2, #234881024
        bic     r3, r3, #255
        add     r2, r2, #2560
        cmp     r3, r2
        beq     .L2
        bic     r3, r0, #-251658240
        bic     r3, r3, #16711680
        bic     r3, r3, #61696
        mov     r2, #201326592
        bic     r3, r3, #255
        add     r2, r2, #2560
        cmp     r3, r2
        beq     .L2
        bic     r3, r0, #14680064
        mov     r3, r3, lsr #20
        mov     r3, r3, asl #20
        cmp     r3, #-201326592
        beq     .L2
        bic     r3, r0, #-268435456
        bic     r3, r3, #2080768
        bic     r3, r3, #12736
        mov     r2, #205520896
        bic     r3, r3, #63
        add     r2, r2, #2560
        cmp     r3, r2
        beq     .L2
        ... else
        b       done
.L2:    .. if
done:

So basically it's converting all the constants to 8 bit operations. AFAICT the `branch predictor' basically just has a cache of recently taken branches, and the `default prediction' is that branches are not taken. If that's the case, the above code has a 13 cycle penalty in the best case ... plus all the extra instructions to execute in the worst. The first 2 cases are the most likely with fpu instructions though.

Not that this is in any way time-critical code (at most it will execute once per context switch), it just seems a bit of a mess.

I came up with something somewhat simpler, although to be honest I have no idea which would run faster.

        .balign 64
vfp_checks:     @ mask    ,value
        .word   0x0f000e00,0x0e000a00   @ xxxx1110        - xxxx101x -
        .word   0x0e000e00,0x0c000a00   @ xxxx110x        - xxxx101x -
        .word   0xff100000,0xf4000000   @ 11110100 xxx0xxxx        - -
        .word   0x0fe00e00,0x0c400a00   @ xxxx1100 010xxxxx xxxx101x -

        ...

 ldr r2,[r3,#-8]  @ load offending instruction
 ldr r2,[r2]

 and r0,r2,#0xfe000000 @ check 1111001x xxxxxxxx xxxxxxxx xxxxxxxx
 cmp r0,#0xf2000000
 ldrned r0,r1,vfp_checks
 andne r0,r0,r2
 cmpne r0,r1
 ldrned r0,r1,vfp_checks+8
 andne r0,r0,r2
 cmpne r0,r1
 ldrned r0,r1,vfp_checks+16
 andne r0,r0,r2
 cmpne r0,r1
 ldrned r0,r1,vfp_checks+24
 andne r0,r0,r2
 cmpne r0,r1
 bne 1f

        ... if

1:      ... else

Given the constants are on a cache boundary, presumably their loads should all be 1 cycle (for all 64 bits) after the cache line is loaded, so they shouldn't be any worse than in-line constants that take more than 2 or more instruction to create (let alone the 7 required for both constants in the second C case).

Interestingly although not surprisingly the -Os code is more similar to the hand-rolled assembly, although it includes explicit branches rather than conditional execution. And does single-word loads instead of double.

Apart from that the context switching itself is pretty simple - just store the fpu registers above the others. irq_new_task is where the fpu is turned on or off depending on the current fpu owner, so it doesn't do anything if only one task is actively using the fpu. The existing context switch code required no changes - it is all handled 'out of band'.

Given that I now have some tasks to manage I also added a general illegal instruction handler for really `illegal instructions'. All it does is suspend the current task and remove it from the run queue as well, and the other tasks keep running just fine. For my uKernel it will then have to signal a process server to clean up/or whatever.

Code is in context.S and irq-context-fp.c.

OCaml

Well yesterday I should have been out working on the retaining wall but in the morning I got stuck thinking about optimising implementations of mathematical expressions (too much coffee late the night before methinks).The primary reason is that I think I found a pretty extreme case of wasteful calculation in an algorithm I'm working on. As far as I can tell from manual inspection, in one place it calculates a `square' 4D intermediate result - actually of the 8M array entries (64^4), most of them, about ~7M, remain unset at (0+0i). But when it uses this array for further processing it only uses 2047 of the results! Considering the intermediate array is 256MB in size(!!!), this is an obvious target for optimisation - it spends far more time clearing the memory than calculating the results it uses!

(I had the mistaken impression that matlab did smart things with such expressions, but I guess it doesn't - it's really just a scripting language front-end to a bunch of C functions.)

I have a feeling this sort of thing is happening throughout the algorithm although this is probably at the extreme end, but analysing it manually seems error prone and time consuming.

Surely there's a better way ... functional languages?

I'm not sure if i'll be able to take advantage of any of them, at least in the short-term. I was reading about how FFTW uses OCaml to implement it's optimiser - this is the sort of thing I might have to look at, although I never could wrap my head around functional programming languages. But anwyay - it's something i'll keep investigating although my focus to start with will just be code translation. Perhaps some sort of symbolic expression analyser might help me reduce the equations - once I work out what they are at least.

Another brick in the wall

I did get a couple of hours in on the retaining wall, laying the foundation layer of bricks which is the tricky bit - but then just as I was starting to make good progress it started to rain. I got nearly half-way on the main wall at least. I assessed that the stair-well i'd created was too steep and started too high (a slack string led me to believe the first-run of bricks would be lower), so i'll have to dig that out more too, but that wont take too long. I need to get some ag-pipe before I can back-fill it though. Hmm, should really get out there and do a bit more for a couple of hours - at least do the first run - hmm, but alas I think the day has gotten away from me again, and I have some other things to do.

It's a long-weekend anyway, so maybe tomorrow is a better day to get dirty, and the clouds are looking a bit dark at the moment.

Strange week.

Phew, another week down. That was a pretty long one. New job, all sort of things to learn about.

I went with CentOS for my workstation ... hmm, but i'm not really all that pleased so far. It's mostly working but there's a few weird bits. xterm's keep dying occasionally while i'm using them - they lock up hard and I can't even close the window. I have 'focus follows mouse' on, but it isn't always working properly - if the machine is busy when I move the mouse it seems to do weird shit. I also had a locked pointer grab - gee, I haven't had those outside of using gdb for years and years. And today I was switching virtual desktops and it just went mental - keep cycling the windows as fast it could once I let go of the keyboard (it still accepted keystrokes as the windows whizzed by) - had to kill X. The whole machine feels pretty slow too - it is an older machine being re-used, but it feels slower than it should. Might be the encrypted home mount too ... and ext3 must share some blame too. And finally the lack of packages - I expected it to be more limited than other more bleeding edge systems i've used lately, but the lack of packages is really stark; some what I thought really basic 3rd party packages are simply missing. I guess i'll have to re-install it with something a bit more production ready; CentOS 5.4 just isn't, at least for my needs. Sigh, not really what I wanted to have to do - stability is pretty much the whole reason I was looking at CentOS anyway.

As part of work i've been reading about convolutions and fft's and other fun stuff. Not all new to me, but I've never had to use it in anger before so it's pretty interesting delving into it. Unfortunately my maths was never really up to scratch in this area ... on the other hand I just have to use it, not derive it. Octave works pretty well for playing with ideas, and gnuplot does some nice plots too. Maybe if I come up with some interesting stuff as i'm learning I'll post some shots, but i'm a bit too worn out this week.

Last Saturday and today I spent a couple of hours shifting road-base and filling up the trench and tamping it down. Pretty hard work but not that bad in short spurts, and the trench is just about where I need it - 12 barrow loads so far. I was going to get some sand for the final layer, but I might try with the roadbase since I have it, and would have to move it all before I can order any sand. It's a lot harder making it level though, so I may regret that idea. I'll see - should really do some more on it this weekend, but there's other things I should probably do too. It might rain anyway (hmm, rain, i've forgotten what that feels like).

Context switching

After working on it in bits and pieces over the last few days I managed to get context switching working (late) last night.

Again apart from one little error it might've happened a lot sooner - I read the APSR description about the 'big endian bit' and I think my mind just assumed 'not x86 ==> big endian' and I decided I needed to turn it on. Oops. Very odd, apparently scheduling tasks just fine but they don't seem to execute at all, nor cause any crashes. Oh this is the first time i've had user-mode code executing as well.

My initial idea was to only save the minimal state possible on interrupt entry, and only bother with a full save/restore in the event of an actual context switch. But then thinking about the microkernel design I want to implement later on, this seems a bit pointless since generally the `low' part of the interrupt handler will always run straight away so it will probably lead to a context switch. This simplifies the context switcher a little bit and adds a bit more flexibility internally too. For system calls i'm not so sure yet - quite a few system calls should lead to a context switch too but a lot wont, so i'm still thinking about whether it will work the same way. And I have to decide if system-calls can be interrupted - if they remain very tiny then they wont need to be.

The context switch code basically ends up the same as the one out of the manual, but hooked into the normal IRQ handler. The IRQ stack pointer is used as the point to ThisTask.tcb.regs[0] all the time, so nothing needs to be loaded at interrupt handler time. I have a trivial ASM function which lets the system code set the next task to run:

        @ r0 = pointer to r0 in tcb
        .global irq_new_task
irq_new_task:
        mrs     r1,cpsr
        cps     #MODE_IRQ
        mov     sp,r0
        msr     cpsr,r1
        bx      lr

The function below then becomes the new IRQ entry point. The lines highlighted in bold indicate the changes from the previous interrupt handler - there aren't too many, and one is just a cleanup (the mov lr,pc). The main difference is that all of the context state is stored on the `irq stack' rather than the supervisor stack. But it isn't really a stack, it's just a pointer to the TCB. This does have one consequence - it is impossible to implement re-entrant interrupts (at least without further code). If more state is required it may make sense to shift all interrupts to use fast interrupts, since it could then use the other FIQ banked registers to store per-processor state - although it could equally just be a pointer from the TCB to a per-processor or per-process struct.

        .global irq_entry
irq_entry:
        sub     lr,#4
        stm     sp,{ r0-r14 }^          @ save all user regs
        srsdb   #MODE_IRQ               @ save spsr and return pc

        cps     #MODE_SUPERVISOR
        push    { r12, lr }             @ save supervisor lr and r12 to supervisor stack

        ldr     r5,=INTCPS_BASE         @ find active interrupt from INTCPS
        ldr     r0,[r5,#0x40]
        
        ldr     r2,=irq_vectors         @ execute vectored handler
        and     r0,r0,#0x7f
        mov     lr,pc
        ldr     pc, [r2, r0, lsl #2]

        mov     r1,#1                   @ tell INTCPS we've handled int
        str     r1,[r5,#INTCPS_CONTROL]
        dsb

        pop     { r12, lr }             @ last of state on supervisor stack

        cps     #MODE_IRQ

        ldm     sp,{r0-r14}^
        rfedb   sp                      @ back to new or old task

Now the IRQ sp is just used as a pointer to the current TCB - so the code doesn't perform any write-back to the sp when writing or restoring values. The user registers are stored/restored above it, and the pc and spsr registers are stored below it. The push { r12, lr } doesn't actually need to store r12 since we already saved it, but this is used to keep the stack aligned at this point - it will need to change to ensure the alignment specifically so r12 wont need to be saved once that is done.

struct tcb {
        uint32_t pc;
        uint32_t spsr;
        // <- sp_irq points here always
        uint32_t regs[15];
};

The C structure that maps to this is shown above, indicating where the sp_irq actually points. I have a simple linked-list of tasks to hold this state, and whatever other state the kernel might need.

struct task {
        struct Node Node;
        int id;

        struct tcb tcb;
};

Within an interrupt routine, if I wish to schedule a new task all I need to do is call the aforementioned irq_new_task function with the value from inside the tcb, and that task will run once the interrupt is finished.

Although not particularly practical in the real world, a round-robin scheduler of all tasks in the run-queue is as simple as:

        Remove(&thistask->Node);
        AddTail(&tasks, &thistask->Node);

        thistask = (struct task *)tasks.Head;
        irq_new_task(&thistask->tcb.regs[0]);

There are (at least?) two other pieces of context that also need changing in a complete system, the MMU tables and the floating point registers.

Floating point registers don't need saving/restoring here because the kernel will never use floating point itself. And instead of wasting the time saving/restoring full FPU state at every interrupt, the code will just find out when a floating point instruction is used and swap the state then. The irq_new_task call can probably just disable the floating point unit, and when an undefined instruction interrupt occurs on the floating point co-processor the state can be changed then if it needs to be. And/or some combination there-of. e.g. irq_new_task could check which task `owns' the FPU and enable/disable it appropriately.The MMU is a little trickier, it will need the page tables changed at every context switch. Since I will map the system memory globally across all tasks, I will probably also be able to put all of that logic into irq_new_task as well. Not sure how i'll deal with system calls that cause a context switch yet. I am looking at a process+task model too, so each task will be associated with a parent process which will encompass the memory map, so for example, a mutli-tasked process wont need an MMU switch if it is only switching between tasks.

And finally the last piece of the puzzle is boot-strapping the task system. There is only 1 CPU and it can only execute one bit of code at a time, so basically the initial booting process just 'falls through' to the 'current task' and automagically just starts executing as part of it's context. There is actually very little that needs to be set up - simply changing to user mode, and then initialising it's stack pointer, and that's it. The code then jumps to what is the entry point of the current task (as defined by the context switching mechanism above).

        // <- in supervisor state here, with boot-strap stack, etc
        asm volatile("cps #0x10");
        // <- user mode, undefined stack pointer
        asm volatile("ldr sp,=0x88000000 - 32768");
        // <- now sp is set to tasks's stack
        task(0);

The code above is just a demo, but the final thing will just jump to the idle task, so it can still be hard-coded in a similar way. Or it can be even simpler - the idle task does not need any stack, e.g. the following would suffice.

idle_task:
        wfi
        b idle_task

Code not committed yet.

Hmm, not sure what to look at next, maybe MMU context switching, or perhaps something simpler like system calls. Really exhausted tonight - spent the whole day staring at code not written by a programmer, and I need a good meal too.

Maiden Century

This is my first web diary that's made it to 100 posts, that being this post. Yay.

Sick and tired of being a permanent beta-tester.

Sigh. I was really angry about this, but now I'm just disappointed.I'm just sick and tired of being a permanent beta-tester, or even alpha-tester, for `linux distributions' and much of the software they use. That was never terribly cool, but it was entirely acceptable up until about the turn of the century. But today there's no excuse anymore, and really only one way to describe it, and that's bullshit. It just doesn't seem to be getting better either, and if anything seems to be taking a turn for the worse in the last few years (Notwork Manager, Puss Audio, KDE 4, EXT4, anything from f.d.o, and so on).In the usual bullshit goal to add new features, stability and usability are being cast aside in the permanent quest for shiny. It certainly isn't confined to the linux world - just look at vista or apple - but they don't affect me personally so i couldn't care less.

The latest case to affect me being Grub 2 (apparently it's been in use for a while, but today is my first experience with it). I installed Ubuntu 9.10 on a fresh system, and it just doesn't boot. No problem ... go to rescue mode and fix it. Ahhh, what the feck is this mess? They've turned the boot-loader into a friggan `platform'. I don't want to have to learn a whole new over-engineered `script' format, together with each distribution's overly extensible frame-work designed to make it `usable' - for something that doesn't work anyway, and wont add anything most people need. It's just a bloody boot loader after-all. I know grub had some issues, but it just doesn't need a huge run-time extensible module system and sophisticated scripting platform to copy a disk image to ram and change the cpu's program counter to point to it. And particularly since they still seem to be short on developers - the last thing they need is to make it more complex.

It's a fundamental problem which starts at the project developers and filters it's way through to the distribution makers - but ultimately it is the distribution makers who are making the wrong choice. For starters, if basic things like sound, network, or booting don't work, they rightly bear the brunt of the anger. They are responsible for not just compiling a group of disparate projects and ensuring they work together, but that the application choices actually work. Distributions too often seem to confuse `stable' with `unmaintained' as well - the noisiest busiest projects get the attention, even if an alternative already does the same thing but isn't being actively developed any more since it doesn't need to be. And their choices can have pretty negative effects on projects; including a project 6 months before it is actually ready for production use can leave a sour taste in many users and tar it's image for years.

Project developers do need to take their share of the blame too. Sure nobody wants to maintain old software forever, but if they decide to drop support for and old product, they had better make sure the replacement works pretty well before pushing it for inclusion into distributions (or accept that nobody will use it till it does). Too often unwanted software is forced onto all users as a way to ensure it gets the testing required to make it a complete product, or worse, to ensure a competing project doesn't gain a footing (this is particularly problematic inside Linux where politics has started to blatantly undermine merit). As much as I dislike it, I realise this is part of Fedora's policy, so it can be excused to some extent, but most distributions actually promise a usable system from the start.

I guess i'll try Centos again - I tried it earlier but I couldn't get dual-screen working, and the network went all strange when I went to download decent drivers (the former not entirely their fault - bloody nvidia) ... hmm, maybe I shouldn't bother. And if that doesn't work easily I guess it'll have to be Fedora 10, or perhaps an earlier Ubuntu. At least they just worked as far as I need them (i.e. I don't need sound).

I suppose this strategy of avoiding the unstable shiny shit will work for a couple of years ... and hopefully by then the distributions will have got their fracken shit together, but somehow I just don't think that will happen.So you might be wondering - just what are you doing about it then, you whiney prick?

Well for starters, there just isn't anything I can do about Linux or it's distributions - they are too big with their own culture and momentum - and way too much politics. At the most I might be able to write a small application or work on a larger one with other people - but that wouldn't have any impact. Even when I was working on Evolution I didn't have full control of the project, which sometimes stopped me fixing some of the larger problems. And too often I find myself in the minority and don't have the skills to make my case effectively (for example it is pretty hard explaining something when you think it's so obvious it shouldn't need any).

I have poked around AROS a little bit, and Haiku - I dislike a lot of the way GNU/Linux works and I think the only practical solution to that is just a completely different OS. But so far I just haven't been able to get into them for whatever reason, and sometimes you just have to work to your own strengths - and perhaps working on such projects just isn't for me. I think Haiku probably has the most promise at this point, and I think AROS is a little too conservative in it's goals to ever be really useful for most people. I think that despite always screaming for them, projects like these secretly don't really want any new developers anyway - the developers are quite happy to make their little wins on a system completely of their own devising, without having to worry about the politics and simply the hassles of dealing with anyone new. And I think that's an entirely reasonable approach to take too if you're in it for the hobby - it's a hell of a lot more fun that way for starters, and why else would you be doing it?

PS The pile of dirt hasn't moved, but I did have a pretty good nap - although that'll probably just mean a late night!

Yawn

Ahah, well starting work again Monday. Sounds like it could be more interesting than I thought. But I guess time will tell. It might be keeping me pretty busy for the first few months too, but hopefully I still have time for the distractions offered by the beagleboard.

Rode to a meeting and rather than taking the train, which was probably a bit unwise. Apart from not preparing properly and meeting in a cycling jersey ... the weather was unexpectedly hot yesterday (the forecast the day before was about the same but it was no-where near as hot on the road). Every time I put a bit of oomph into it I started to feel a bit weird from over-heating, so had to take it pretty easy most of the way (only about 24km each way). Apart from not having ridden much lately and so not having a decent summer acclimatisation, I think it mostly just comes down to badly needing a haircut and getting a super-hot head from the bouffy hair. Well, having a dreadful hangover and 5 hours of sleep didn't help i'm sure.

The day before I got out and had a look for old computer stuff in pawnbroker shops. Didn't find much, although there were a couple of shops tucked away which had a fun range of old tools and other junk which might merit another visit one day. I did pick up an old keyboard that looked hardly used. I took all the keycaps off and gave them a good wash, and it looks almost new now (apart from putting the - and = on the wrong way around). Oh apart from that **it's just a keyboard** - none of those bullshit `multimedia' keys or even worse - windows keys! It's even got a steel base. It's the little things sometimes ...

On the way home yesterday I ordered some roadbase so I can get stuck into the retaining wall at last (last day of holidays - the one i partly took to do such things - typical).

The dirt.

It got delivered at 8am - and I was all keen to go get a haircut so I can move it without passing out, but somehow the day is slipping away and i'm stuck doing little things around the house again. Almost feels like I still have the hangover from yesterday morning - but I didn't go out last night at all. I think I just need a really good sleep (and maybe some better meals), but seem stuck going to bed too late and rising too early.

The hole.

The trench isn't quite right, and I probably need to dig out the stairs before starting to fill it, but it's almost ready. I should be out there now, but it's a bit warm to be working in the full sun in the middle of the day (when you have a choice anyway). Have to drag myself out a little later to get stuck into it.

Smooth as a baby's bum

Well I didn't end up getting out, but I had a break for a while. I also started thinking about what I want to do now I have interrupts working and decided I couldn't be bothered doing too much with the vblank demo. But since there was nothing on TV for a bit I tidied it up and committed it as is.

It's moving smoothly, honest.

Yes, not much of a screenshot, but I thought the page needed some colour and a break from all the text of late. It just smoothly scrolls from one screen to the other and back again, drawing one box on the lower screen every time it's fully hidden. Demo in irq-scroll.c, with the interrupt code from yesterday in exceptions.S.

So what to do next. Well there's still the idea of a little game or demo or so, but i'm lacking a bit of inspiration, and besides, a bit of support to actually run the code in, for things like sound. So i'm starting to think about working on a little real-time micro-kernel.

Many months ago I had been writing one for x86 but got a bit pissed off with all the PC crappyness, and found something more interesting to do instead. So i'll probably start with that, although i'm not sure how much i'll end up using (not that I can remember the state I left it in anyway). For example, currently it does all the VM and process management inside the kernel (it made it simpler at the time), but I want to move that to a process instead so the kernel can work without any pre-emption or locks. Basically the goal is to make the kernel as simple as possible (nano-kernel?) without the simplification getting in the way. What I had been working towards was something like a mix of Minix 3 and AmigaOS; taking features like memory protection and processes from Minix, in addition to the asynchronous non-copying message passing, shared libraries, tasks, and the device mechanism from AmigaOS, with a bit of a reworking to make it all fit. Ahh well, maybe a tad on the optimistic side so I wouldn't hold my breath, on the other hand apart from device drivers there's not all that much to the core of such a design.

Not sure if i'll put it in PuppyBits or another project, but for now I still need to work out some basic routines like context switching and so on so i'll definitely put that stuff in PuppyBits at the least.

Interrupt progress

Had another poke at interrupts late last night. And finally got a simple interrupt handler working. As usual a couple of simple mistakes initially thwarted my efforts.

To start with I was trying to use the FRAMEDONE interrupt from the display controller. But it seems I needed to use VSYNC for HDMI out, and the EVSYNC_ODD/_EVEN for S-Video.
I was using the rfe instruction, but neglected the ! on the register, so it wasn't fixing the stack pointer properly on exit. Somehow the application code managed to run ok for a few seconds with a constantly changing stack!
I forgot to fix lr before saving it on the stack using the srs instruction (subtract 4). Again, rfe was thus skipping an a application instruction every interrupt, and again somehow the code didn't crash immediately either.

Other than using the ARMv6 instructions above, the code is basically straight out of the OMAP TRM § 10.5.3 MPU INTC Preemptive Processing Sequence, the step numbers below relate to that section. I haven't implemented the priority stuff yet (I was hoping it wasn't necessary for such a simple bit of code since I don't need priorities, but it seems it is for other reasons), so it doesn't actually implement re-entrant interrupts, but I might try to get that working before committing it.

        .set    MODE_SUPERVISOR, 0x13

ex_irq:
        // 1. save critical registers
        sub     lr,lr,#4
        srsdb   #MODE_SUPERVISOR!
        cps     #MODE_SUPERVISOR
        push    { r0-r3, r12, lr }

        ldr     r3,=INTCPS_BASE

        // 2,3 save and set priority threshold (not done)

        // 4. find interrupt source
        ldr     r0,[r3,#0x40]

        // 5. allow new interrupts
        mov     r1,#1
        str     r1,[r3,#INTCPS_CONTROL]

        // 6. data sync barrier for reg writes before enable irq
        dsb                              // not sure what options it should use
        
        // 7. enable irq

        // 8. jump to handler
        ldr     r2,=irq_vectors
        and     r0,r0,#0x7f
        ldr     lr,=ex_irq_done
        ldr     pc, [r2, r0, lsl #2]
        
ex_irq_done:
        // 1. disable irq

        // 2. restore threshold level (not done)

        // 3. restore critical registers
        pop     { r0-r3, r12, lr }
        rfeia   sp!

        .data
        .balign 4
        .global irq_vectors
irq_vectors:
        .word   exception_irq, exception_irq, ...
        .word   ... total of 96 vectors

The srs instruction and cps instructions are used to run everything on the supervisor stack/in supervisor mode. On entry the code is executing in irq mode, so it first saves lr_irq and spsr_irq onto the supervisor stack (after fixing the return address in lr!), and then switches to supervisor mode. Without the srs instruction (pre ARMv6) things are pretty messy since you either have to muck about with the irq stack first (and last), or have to switch between modes a few times to get everything sorted (see the links at the end of this post).I also implement a simple vectored interrupt table to simplify the C side of things, although I think I can just use a simple mov lr,pc before jumping to the vector rather than a literal load.

The ARM ARM actually recommends using the system mode for re-entrant interrupts (you can't use the interrupt mode itself because lr could be clobbered), so why am I using the supervisor stack? Partially historically because at first I couldn't work out how to save the state without clobbering some system stack registers (they're shared with user state). But I also have other plans where this scheme might work better, and if nothing else it stops broken code in user-mode crashing interrupts by breaking the stack pointer.

And finally one more thing I noticed whilst reading bits and pieces is that the AAPCS (EABI) specifies that the stack pointer should remain double-word (8-byte) aligned for entry points. I probably read it before but didn't take notice. This just normally means you always need to push an even number of registers onto the stack before calling other functions. Fortunately this just falls out with this code ... but with interrupt handlers which can be invoked at any time, we don't know what the alignment of the stack is so a specific check is needed too, according to the ARM Info Centre (damn, and I definitely know i've read that before just looking it up now - and it has some other important other bits too!).

Hmm, now i'm thinking about it ... i'm not sure I even need re-entrant interrupts at all. I'm thinking of working towards something along the lines of a microkernel architecture similar to AmigaOS or Minix 3, where device drivers are just high priority unprivileged tasks - the Cortex-A8 should be more than fast enough for this to work. All interrupt handlers will need to do is post events to these tasks, and the software will handle the priorities and whatnot. I suspect re-entrant interrupts are much more important in an embedded system where you just leave most of the work to the interrupt handlers, where DMA isn't available for everything, or the CPU speed is a limiting factor.

Specific Handler

The next step after the interrupt handler is the interrupt vector code itself. This is just a plain function call since the entry point has handled all the nitty gritty. But it still has to deal with the hardware - to identify which interrupt caused it to be invoked, and to clear it. Even with 96 interrupts in the interrupt controller, most of them map to multiple physical events.In the case of the video subsystem, there is a single interrupt DSS_IRQ (25) which can be triggered from 29 different events in either the DISPC module or the DSS module (actually I just noticed there are many more from the DSI module). § 15.3.2.2 Interrupt Requests has a pretty good overview. Fortunately there is a couple of bits in the DSS_IRQSTATUS which lets the code determine which are asserted to simplify processing. After that test is made, each bit needs to be checked in turn and processed accordingly. And finally the interrupt bits must be reset by writing a 1 to each bit in the DISPC_IRQSTATUS or DSI_IRQSTATUS register - otherwise it will go into an infinite loop re-invoking the interrupt as soon as it exits.

void dispc_handler(int id) {
        uint32_t dssirq = reg32r(DSS_BASE, DSS_IRQSTATUS);

        // see if we have any dispc interrupts
        if (dssirq & DSS_DISPC_IRQ) {
                uint32_t irqstatus = reg32r(DISPC_BASE, DISPC_IRQSTATUS);

                if (irqstatus & DISPC_VSYNC) {
                        ... do vsync code ...
                }

                // clear all interrupt status bits set
                reg32w(DISPC_BASE, DISPC_IRQSTATUS, irqstatus);
        }

        // check for dsi ints (to clear them)
        if (dssirq & DSS_DSI_IRQ) {
                // not expecting this, just clear everything
                reg32w(DSI_BASE, DSI_IRQSTATUS, ~0);
        }
}

This is basically the same process that all interrupt handlers need to go through. Identify the source, handle it, clear the assertion.

There are lots of 'gotchas' with interrupt handler writing at first, but the main thing is to not call any functions which share state with non-interrupt code. e.g. anything non-reentrant, or using hardware registers. Oh, and they should always run as fast as possible - all the `real work' your cpu could be doing is halted the entire time the interrupt is executing, and you could be processing thousands per second in a busy system.

The last piece of the puzzle is the interrupt enable masks. You don't just get all interrupts possible in the system all the time, you can mask (or enable) which ones you want to receive. This is all set-up before interrupts are enabled but after the hardware in question is setup. Here I clear all the status bits as well, just to make sure I don't get an unexpected surprise when I enable CPU interrupts later.

        // disable all but vsync
        reg32w(DISPC_BASE, DISPC_IRQENABLE, DISPC_VSYNC);
        reg32w(DISPC_BASE, DISPC_IRQSTATUS, ~0);
        // dss intterrupt can also receive DSI, so disable those too
        reg32w(DSI_BASE, DSI_IRQENABLE, 0);
        reg32w(DSI_BASE, DSI_IRQSTATUS, ~0);

I think I have some sort of set-up bug because I think that i'm sometimes getting interrupts when no event i'm testing is asserted. I will have to check the extra DSI interrupts I just noticed whilst writing this - they should all be masked off (should be reset condition anyway, but ...).

My little demo code right now just does a vsync'd smooth-scroll by changing the video dma base registers. The TRM states that the register is a `Shadow register, updated on VFP start period or EVSYNC.' There is another little trick though, it looks like all DISPC registers themselves are shadowed again, so you always have to set the GOLCD bit in DISPC_CONTROL whenever you make changes for them to make their way to the hardware. I guess I realised that anyway, but initially forgot.

        // update the graphic layer 0 address (video out) to scroll it
        reg32w(DISPC_BASE, DISPC_GFX_BA0, addr);
        reg32w(DISPC_BASE, DISPC_GFX_BA1, addr);
        reg32s(DISPC_BASE, DISPC_CONTROL, DISPC_GOLCD, ~0);

I might come up with a more impressive demo before committing though. Actually now I have interrupts working it opens up a lot of possibilities, such as a real sound driver, serial driver, and proper timing events (in a very odd twist, sometimes my delay loops seem to run twice as fast as other times ...).

Links

I came across a couple of links on the internet about bare-metal ARM coding, some of it doesn't apply/wont work on OMAP3, but the general ideas are the same.

Simplest bare metal program for ARM;
And a related article on Embedded.com Building Bare-Metal ARM Systems with GNU. or a much easier to read PDF version I just found.

Oh, I finally got out in the yard yesterday - if only for a couple of hours. More or less finished the trench for the main retaining wall foundation. Now I just need to get off my lazy bum and order some road-base and sand. Can't say I felt the fittest - easily out of breath, although I'm sure that has something to do with the sleep apnoea, my particularly poor sleep the night before (i let the cat stay in and he was wandering around all night), as well as my bum sitting. Glorious day today, and no meeting organised yet about work, so I should probably get out on a bike. Maybe I can scan a few pawn shops in the extremely unlikely event any have C64's lying around.

About Me

Tags