A couple of links, the wall of dread lurches forward, dried meat.

A couple of interesting posts I came across this evening:

Obliquity by John Kay, an old article from the Financial Times, via Naked Capitalism.
Showing examples of how direct goals such as 'increasing shareholder value' are the least successful way to achieve it. Food for thought in this crazy modern world where CEO bonuses and shareholder returns are the the sole quantification of success of a company or industry. Interesting in light of globalisation, out-sourcing, patenting of everything from ideas to seeds to animals, and so on and so forth that all put private profit before society.
Curtain of Tragedy Will Be Raised Soon Enough, But Perhaps Not Next in Japan, Jesse's Café Américain.
A pretty depressing look into the crystal ball for the next decade or so. I don't think Australia for one realises how fragile it's current `resource-driven' aka `lucky-country' or rather, `stupid-country' economy really is (as opposed to 'the clever country' era which gave the me the opportunity to go through Uni). A little god-bothery at the end, but don't let that put you off, Jesse always has lots of interesting things to say.
More than a four-letter word from Financial Armageddon, and Revolution, flashmobs, and brain chips. A grim vision of the future from The Guardian.
Both about a new report from the UK Department of Defence about the world-wide strategic threat environment in 30 years time. As The Guardian's sub-heading says, very grim reading. It's all about the population really; I can't imagine any palatable political solution, and any trend line with no limit has no scientific solution either. In for a bumpy ride even if these glimpses of the future are out by a few years. Although at my age I may just escape it through natural means.

All very depressing. I don't think it helped that I was sort-of watching a pretty grim tv series 'The Survivors' at the same time either. Although I don't think the writers there can do maths terribly well - a 90% reduction in population would still leave a sizeable number of people in Britain - over 5 million. Not one dozen, and even raw meat takes more than a few hours to go off on the nose in a disconnected fridge! Hmm all this reminds me for some reason about a book I read a few years ago called `The Genocides' that left me deeply disturbed for at least a week. It just a sadistic device of the author simply give the reader un unpleasant time and lasting memory. It is a grotesque tale with no redeeming features whatsoever. Avoid.

Wall of Dread

Well, on to more mundane and thankfully lighter topics. This afternoon I finally got off my fat bum and spent 4-5 hours working on the main retaining wall. Managed to finish the bottom layer and stacked another two on top since I was on a run - I still need some ag-pipe before I can back-fill each layer with gravel, so I should try to get that early this week so I can get most of the work just out of the way at last. I was going to walk it from the plumbing shop - it's only about 1km away. Although I may have stuffed up a bit - i'm not sure where the ag-pipe should drain to, and I didn't leave any drainage holes along the 12m of the run. It can still come out of the ends though, so maybe i'll just do that (unlikely i'll move all those damn blocks again, particularly the bottom row). I also re-cut the stair-case base to make the steps deeper, although I can't fill that in till I have that ag-pipe.

I measured the height along the length once I finished it rises a bit in the middle unfortunately - about 8mm, but at least in a relatively smooth curve. Yes yes, stupid I know - I should have regularly measured, but I did have another string-line at the brick level and used a couple of spirit levels, but all those little `close enough fuckit's add up. But it doesn't really matter because it only goes out `significantly' where the stair-case is, which is a natural break in the line. I guess i'll have to see how much it settles over time too. My $5 rubber mallet disintegrated before the last few base bricks were done, but I managed to tap down and level the last few with a bit of wood and a heavy ball-peen hammer - pretty well over it all by that stage, but I think I didn't drop off too much in finishing quality. Knowing my luck it will be a bit out compared to the boundary (and to-be-shed-wall), but that should only affect the aesthetics unless I made a really big mistake.

Dried Beef

The biltong turned out more or less ok. I've been sampling bits as I went along to see how it 'aged', and the first bits were a bit young (they were also thinner). The flavour didn't really settle till today, 6 days in - once the vinegar had dissipated. I made a slight variation for two sets of strips. One I rinsed in water and vinegar to remove the excess salt, as according to one recipe, and the other I left as is in an attempt to keep more flavour in-tact, according to another. The vinegar one ended up pretty bland, so i'm not sure that's the right idea. But the other one is a bit salty - ok with a beer in hand or after a long hot day in the sun, but not fantastic. Since I was caught up with some other things I also salted it for about 15 hours, so maybe next time I'll just try a shorter seasoning session. Although I imagine most of the salt would still be clinging to the outside once it is hung up. Need a way to get more chilli flavour on it too - maybe some sambal pedas as a final stage. Still, one bit I tried today I'd label a success, even with the saltiness. Almost has a touch of salami flavour to it, mixed with bbq steak, a bit more chilli and less salt and it could be a winner. 1kg of well-trimmed lean porterhouse made about 400g of biltong; so it's pretty cheap to make. Will have to investigate a better drying cabinet than a house-moving box with a shade-cloth cover though, at least if it becomes a regular task.

SSE, gcc, AMD

I'm a bit ahead of schedule on my work and I am awaiting some feedback so I thought i'd look at what SSE2 could do to speed up tiny bits of what it does. Might not need to use it, but it'd be good to have some sort of feel for what it can do - e.g on CELL it can make an order of magnitude difference. Some interesting and surprising results.As a quick aside, while searching for why using an SSE intrinsic kept complaining about assigning to the wrong data-type (turned out it was just a typo and gcc didn't complain about the undefined `function') I came across this interesting comparison of SSE optimisations on common compilers. Actually I was quite surprised that gcc was doing quite so much with constant subexpression elimination of vector quantities - I thought intrinsics were pretty much in-line assembly with register aliases. Anyway good to see it is well ahead of the pack in all departments.First, I tried implementing the following function:

X = abs(Z)*exp(Y*i)Z, X complex, Y real

The C code is all pretty straightforward:

                        Xrow[x] = cabsf(Zrow[x])
                                * cexpf(Yrow[x]*I);

The SSE code was a bit messier, but for such a simple function quite manageable. I found a nice library of SSE optimised transcendental functions needed to implement a vector cexpf(), and the Cephes maths library was an excellent reference for the internal workings of functions like cexp - e.g. so I could remove redundant work.

                        v4sf d0, d1;
                        v4sf i, r;
                        v4sf absz;

                        v4sf rr, ri;
                        v4sf ci, si;

                        d0 = Zrow[x];
                        d1 = Zrow[x+1];

                        i = _mm_shuffle_ps(d0, d1, 0x88);
                        r = _mm_shuffle_ps(d0, d1, 0xDD);
                        absz = _mm_sqrt_ps(r*r + i*i);

                        i = Yrow[x/2];
                        sincos_ps(i, &si, &ci);

                        rr = absh*ci;
                        ri = absh*si;

                        d0 = _mm_unpacklo_ps(ri, rr);
                        d1 = _mm_unpackhi_ps(ri, rr);

                        Xrow[x] = d0;
                        Xrow[x+1] = d0;

I'm sure it's not the prettiest code, and it may not even work (not sure my shuffles are correct), but assuming it is otherwise correct, it'll do. I also tried a double version for simple C, and the above sse version with 1 unrolled loop.

I have a number of machines to try it on, a pentium-m core-duo-wtf-its-called, athlon 64 x2, and a phenom II 945. I'm using gcc 4.4.3. Running the loop 10 000 times over a 64x64 array using constant loop counts. I'm using the aggressive optimisation flags suggested by AMD, including -march=amdfam10 and -funroll-all-loops, and also c99 features like restrict.

Anyway I'll keep this brief. I'm not so much interested in the various CPU's as the relative differences within them.

 Code             A64-X2    Phenom II Pentium-M

 C single           1.16         0.84      1.17
 SSE single         0.69         0.31      0.37
 SSE unrolled       0.66         0.31      0.37
 C double           1.36         0.90      1.25

Well it's certainly an improvement, but nothing like on Cell. The Athlon-64 X2 really suffers from a 64-bit internal data-path, so the SSE is not even 2x faster, even though it's calculating 4x values at once for most operations. The intel shows the biggest improvement using SSE at just over 3x, but the Phenom is about the same. gcc has much improved since I tried similar things on the SPU so perhaps the non-sse code is 'just that good', but I can't help but imagine there is more silicon dedicated to making non-sse code run faster too. Oh and it is interesting that double's are pretty much on-par with singles, at least in this case. I was hoping that moving to singles would make a bigger difference - too lazy at this point to try a double SSE3 version (i'd have to write sincos_pd for example).

I tried this first because I thought the a little complexity (yet not too time consuming to write) would help the compiler do more optimisations, and it is complex enough it can't be memory constrained. I couldn't get gcc to inline the sincos_ps, but I doubt that would've made much difference.

Then I tried something much simpler, just adding two 2d arrays. I didn't use sse for this, but just used the vector types. And just looped over every element of every row - I used a stride calculation rather than treating it as completely linear. Hmm, perhaps I should have used non-constant dimensions to see what the compiler did with more general purpose code.

a[x] += b[x];

gcc unrolled the whole row-loop so things are about as simple as they get. Actually gcc vectorised the version using simple C types as well (well if it couldn't vectorise this loop it'd be a worry), and then it included the two versions: one vectorised for when the addresses are aligned appropriately (which they are), and one using the FPU for otherwise. Looking at x86 assembly gives me a headache so I didn't look too closely though. Since this loop is so simple I had to run it 1 000 000 times to get a reasonable timing.

  Code             A64-X2    Phenom II Pentium-M

  C single           8.5         1.1        2.1
  C vector single    6.0         0.98       1.2

Ok, so it's tuned for the phenom, but wtf is going on with the athlon64? Memory speed constrained? I presume they are all constrained by the L2 cache, or I might even be getting cache thrashing or something (this is the sort of thing you could guarantee on cell ... huge pity IBM aren't progressing it for HPC). Don't ask me why the pentim-m version is so different - afaict from the source they're both using SSE2 code since the memory should be aligned properly. *shrug*.

So after all that i'm not sure i'm much wiser. The question being - is it worth the effort to manually vectorise the code or even use sse intrinsics on an x86 cpu, at least for the algorithms i'm using? Hmm ... probably ... even a 2x increase on a minor operation is better than nothing, but it'll only be an incremental improvement overall- but at least in every case it was an improvement. These are all using 32-bit operating systems too, I wonder if I should try a 64-bit one. Well i'll see how much time I have to play with it.

Should really try GPGPU and/or OpenCL simply to get a good 'feel' of what to expect.

Indecision, games, and the rest

Wow, tired. Then again I was up till 3am last night hacking on some MMU code, so that probably isn't surprising. And some damn person called me with a wrong number at 8:30am too. Been a long couple of days. Oh hmm, now i remember, I was up for 23 hours on Saturday too - after a very early awakening that led me to the last post, I spent all afternoon helping a mate put up a fence, then a few beers and pizzas and a late night later ended up crashing about 3am. No wonder it feels like I need a weekend, and it's only Tuesday.

Been very busy with work, and had a few wins today - I think i'm all out of endorphins now, so I can't get back into the MMU code as i'd planned to. As a side note, whilst generating data to test some algorithms I accidentally created the image below. What took me about it is that it is surprisingly uncomfortable to look at - it seems to jump around randomly, about 3 times a second (at least on a bright LCD monitor). I don't think it's just my tired or coffeed eyes ...

On the MMU code I was writing a more comprehensive bootstrap function. One that initialises the MMU to relocate the loaded image to somewhere appropriate, and in a way that it can remain in use for the running system. I have some code, and it's nice and compact but I'm still undecided on the final memory map. For example, the ARM CPU on the Beagle can have two page tables, one for high addresses, i.e. the system, and one for the rest. If I use that feature it forces the system table to use at least the top 2G of the virtual address space, which was different from what I had in mind. Not that it matters I guess since the machine only has 256MB of memory anyway ... so probably no big deal. I'd like to use that option since it simplifies the management of dynamic globally shared data the way I was thinking of doing it (i.e. shared library code/rodata). There's a few other details i'm still undecided on too, it would be nice to use section pages for instance (1MB pages that only require a level 1 PTE).

I spent the few nights earlier working on the higher level MMU code, managing translation tables, allocating physical pages and thinking about how to implement efficient `agile' (migratable) memory. The actual page table manipulation is pretty simple but you need a fair bit of infrastructure before you can use it.

So I also wrote a simple first-fit memory allocator that uses an AVL tree to coalesce free blocks. It's something i've researched quite a bit in the past and first-fit works surprisingly well for a general allocator particularly if there's a quick way to coalesce frees (the main drawback being not-so-great worst-case allocation performance). Even though I've written one before (for AROS), I'm using the AVL code derived from the parent-linked one from libavl, since I couldn't find the final version of the one I wrote and it's probably better anyway. I modified it though to remove the iterator and tree objects so it can be used without requiring any additional or dynamic memory. I also poked the balance factor into the low 3 bits of the parent pointer so I can jam in the free block size into 16 bytes - which determines the minimum allocation granularity. If the linear search for the first-fit allocation proves too inefficient, I can always add another tree and bump the minimum granularity to 32 bytes which is probably still ok. Although the overhead of managing two trees may not be.

And I played with a couple of different ideas for allocating physical pages. I tried a bitmap, which is pretty simple but tricky to allocate large ranges (>32 pages), has poor worst-case, and has no room to store any meta-data. So then I went to an indirectly linked list of words, mainly because I then have somewhere to store extra per-page data like a shared refcount, or other bits of information. But that's even messier to allocate any range more than 1 block in size; although I thought I didn't need to do that, I do at least need to allocate 16kb blocks for the first level translation tables (although I only need 8kb if I use the dual-translation table setup). So I may end up going back to the idea I had with my other os which was just to track the holes in the address space using a tree - much like with the gp memory allocator but without being able to use the holes themselves to store data. I suspect I chose to do it that way for a reason - it has a reasonably high worst-case memory requirement, but that is unlikely, and bounded. Since I will need the functionality for a global virtual address allocator for the 'agile' memory blocks, it probably makes sense to just use the same thing.

For something different i'm also trying to make some biltong. Thought about it for ages but never got motivated enough to try it. I was at the Central Market on Friday arvo and grabbed some dried meat and super-hot sausage and found it particularly yumscrious. But the extraordinary price helped me to decide that it was time to try to make something like that myself. I just started with 1kg of beef to see if it works or if I just end up making a batch of biological warfare agent. I should really have given the kitchen a good clean first. I thought i'd skip the receipe that starts with '25lbs of beef or game', at least to start with ;-)

Read an interesting article a few days ago about five creepy ways video games try to get you addicted. The point I found most interesting was the assertion that games are basically taking the place of an un-fulfilling work environment, and particularly since the points mentioned seem to match my hobby and at least the 'good bits' of the work I do:

Autonomy (that is, you have some say in what you do day to day);
Complexity (so it's not mind-numbing repetition);
Connection Between Effort and Reward (i.e. you actually see the awesome results of your hard work).

Maybe that's why I don't get terribly obsessed with games - in effect i'm already playing one. Particularly of late, this has been very obvious to me - I can't put a problem down until I get it to work, or at least finish some little mile-stone along the way. If I can do that I can go away and feel a bit pleased that that 'target' has been reached, even if I know they will never end. On the other hand, if I can't it can be very disappointing, and it can even act as a catalyst to something worse since I have no other distractions just now. The other point this ties into is quite interesting too:

What is a game?Well, we humans play games because there is a basic satisfaction in mastering a skill, even if it's a pointless one in terms of our overall life goals. It helps us develop our brains (especially as children) and to test ourselves without serious consequences if we fail. This is why our brains reward us with the sensation we call "fun" when we do it

Basically games are a way to learn stuff. I guess a bit like fizzy drinks or other food with flavour but low nutritional content, some of these modern games are simply abusing key survival traits honed through evolution over millions of years simply to make a buck. This is what the world has become, hollow food, hollow entertainment, and hollow learning, for a hollow existence?

Actually I'm not so sure any of the survival skills gained through evolution are much use to us any more. Most of them don't work when you have unbounded supplies of high-energy food-stuffs within easy reach, spend all of your time sitting on a chair behind a desk or in a machine screaming across a layer of processed oil, or simply have so many people that there's no room for them or any other animals simply to live in peace. Evolution only optimises based on the past, not on the future. Yes, we're all screwed - keep an eye on the fish.

Fork'n Evolution!

I just saw a post suggesting it is time for Ubuntu to fork evolution and thought it deserved a comment ... Although now i've just about finished this post I realise just how silly the whole notion is to start with, so it probably doesn't.

It is a pretty bizarre suggestion.

Even just on a technical level, the author may not realise just how much work is involved in maintaining a piece of code the size of Evolution. Or the amount of organisation and effort required to make the sort of major structural changes it probably needs before any real progress can be made.

And politically it is a naive and arrogant to suggest one GNU/Linux company can 'go it alone' like that on any major piece of software, particularly one developed by a competitor. It is also pretty rude - I haven't even touched Evolution for 5 years as a user or developer, but as a free software developer and advocate I think forking should really be the point of last resort. And even then it would only be after a demonstrable and intractable conflict had arisen - e.g. refusal to accept many reasonable patches of significant size. A few 10 line patches hardly turns one into a maintainer. And I wonder if anyone who might be involved has even done that.

Evolution was always a strange free software project, which i've written about plenty of times before. Whilst I was working on it it had only a few external contributions, and even less so from any of the GNU/Linux vendors (apart from Sun). Most had patches just for (re)branding, but the occasional bug fix they did have rarely made it to the code-base through normal processes i.e. they rarely if ever submitted them to us. Not that we made it terribly easy, anally retarded patch submission and review processes, the copyright assignment crap, and simply being busy most of the time.

But even despite that - it is a free software project, and open to contributions from anyone. There is simply no need to fork it, if anyone desires to improve it they can, and indeed are encouraged to. It is part of the `social contract' if you will, of any GPL project. You get the code for free, and if you want to make changes you can, and add them to the public pool for the good of all. And unlike many 'open source' projects - Evolution was always a real free software product and not just an alpha-grade product looking for free beta-testers and code-monkeys like many seem to be.

"They didn’t understand that there is just one killer feature (just as with integrated desktop social networking) that needs to be in there which is Exchange support."

Back on topic ... this one statement probably undermines the post itself more than any other. As Jeff said in comments on the post this was Ximian's business model pretty much from the start, and one of the reasons Novell bought them. It was pretty much the reason Evolution existed anyway, although more to replace the need for Microsoft Exchange than to work with it - at least in my mind. As a free software developer I never saw the need to inter-operate with proprietary crap like that - which is probably why I was never working on that part of the code.

Also the whole bit about `social networking' mentioned in the post - this was a major part of the internal thinking at Ximian right from the start - before the term even existed. Although like then, I still don't really see the need for it, particularly in a corporate environment. We all just used IRC and it worked just fine for us.

I was working on the mail code even longer than Jeff - I can't believe I started over 10 years ago. I haven't looked at the code since leaving Novell and the project, and I never worked on the `microsoft exchange component' anyway, but i'd suggest the code was simply never up to snuff given the way it originally designed, and even just because who wrote it.

Nice guy and all, but not great at coming up with a maintainable design let alone writing reliable C. Jeff and I's experience with the original IMAP code was a nightmare that started when he simply abandoned it in a fit of terrible management. One day he simply began refusing emails or dealing with patches saying he'd moved to another project (might have even been the microsoft exchange plugin). So we were left maintaining a really nasty bit of code with no deep knowledge of how it worked. It was all too lisp-like with little or no structure which made it very difficult to grok particularly given it's requirement for multi-threading. Lisp might be a great language, but it simply isn't how you write C.

And apart from the code itself, I believe the microsoft exchange component had a messy architecture layered on top of the already overly complex ones (yes, multiple) within evolution. Partly because all the different data-types were routed through the same connection, and also because it started as a proprietary extension and had silly things like a symbol obfuscater. To be honest once I found out how it worked I was surprised it worked at all ... it was not surprising in the least that it was slow and buggy and probably always would be.

Disbanding the original team and moving the project wholesale to India can't have helped either, at least at the time. Apart from the brain drain that involved, India has a tendency to add layers of middle-management too, so big decisions we could have gotten away with by the end (probably without asking ;-) became completely impossible due to risk averse middle-managers only worried about their next promotion-due-to-tenure. High staff turn-over due to a liquid job market was also an issue.

The entire component architecture of the system should have been replaced years ago (I wouldn't know - perhaps it has been) - almost all of the early design decisions were wrong, and particularly outside of the mail code none of it was ever reviewed because of the high turn-over of developers. Inside the mail component there was a huge scope for doing things in a much more re-usable way. CORBA got a bad wrap because we had an absolutely terrible design and often simply used it incorrectly. The original bonobo design was just a COM clone, so no surprises there - pluggable multi-process widget systems was a stupid use of CORBA. But the evolution-data-server wasn't, although the interfaces were not as good as they could have been.

I don't think there is any conspiracy - but Novell are a proprietary software company first and foremost and a project like evolution never really fit it's culture - probably the major reason I left in the end. Once HR starts impacting on engineering I think you have worries as a technology company, and even more as an `open sauce' one where the most valuable `eye-pee' you have is in your employee's heads.

Canonical make some odd decisions about Ubuntu too - more odd since they are purporting to be such a community focused organisation. But I don't think even they would be arrogant or silly enough to seriously consider such a strange notion as forking evolution. I'm sure they could surprise me though.

Syscalls and IPC

The next thing i've been looking at is system calls, and using them to implement IPC.

I wanted system calls to lead to a direct context switch efficiently, so that for example, in the simple case an IPC call to a server process acts synchronously. But on the other hand, there are other tasks for system calls which are really just isolated or system-level function calls, and they don't need the overhead.

So I came up with the idea of two levels - simple system calls which are invoked and return much any other function invocation, and heavier ones which might lead to a context switch. I just then use the system call number to chose what code to execute.

It's actually easier just showing the code I came up with than trying to explain it.

svc_entry:      
        cmp     r7,#SVC_SIMPLE_LAST
        push    { r12, lr }
        adrlo   r12,svc_vectors
        adrlo   lr,svc_exit             @ low vectors, call and return to svc_exit
        ldrlo   pc,[r12, r7, lsl #2]
  ...
svc_exit:
        ldmia   sp!,{r12, pc }^

Well that's the entirety of the code-path for 'simple' system calls. It doesn't strictly need to save r12, but it makes the context-switching case a bit simpler, and keeps the stack aligned properly too. It doesn't need to save the normal EABI work registers, because the target vector will save any it needs to.

At the svc_exit label I initially had `pop { r12, pc }^' thinking that the pop pseudo-op would work the same as ldmia as it does normally, but implement an exception return because of the trailing `^' ... but it doesn't ...

The "we might context switch" case is then much the same as every other exception handler, and re-uses some of the code too. It saves the state to the current TCB, and then invokes the vector. It also handles invalid system call numbers by suspending the task.

        srsdb   #MODE_IRQ
        cps     #MODE_IRQ
        stm     sp,{r0-r14}^
        cps     #MODE_SUPERVISOR

        cmp     r7,#SVC_LAST
        adr     r12,svc_vectors
        adr     lr,ex_context_exit
        ldrlo   pc,[r12, r7, lsl #2]
        b       task_error_handler      @ syscall # invalid - suspend task

One of the simplest useful IPC mechanism I can think of is simply a signal based one (the AmigaOS idea of signals, not the Unix one, although they can be made to work the same). If you're waiting on a signal and it arrives you wake up, otherwise it just gets registered asynchronously and if you wait later, it no-ops.

It only required 2 basic operations, and another 2 for a bit more flexibility.

Signal signals another task with a single signal bit, and Wait puts the current task to sleep until any one of a given set of signals arrives, or does nothing if it already has. Since either may trigger a context switch, these are implemented using the heavier-weight system call mechanism.The other two functions are AllocSignal and FreeSignal whose usage is pretty obvious. These are implemented using non-context switching system calls.The functions are very simple. First Signal, all it does is set the signal bit in the task control block, and checks if the task was waiting for it. If so it sets the return value the function expects, adds the task back to the run queue and reschedules the task, which may then execute depending on the scheduling algorithm.

void svc_Signal(int tid, int sigNum) {
        struct task *tcb = findTask(tid);

        if (!tcb)
                return;

        sigmask_t m = (1<<sigNum);
        sigmask_t sigs;

        tcb->sigReceived |= m;
        if (tcb->state == TS_WAIT) {
                sigs = tcb->sigReceived & tcb->sigWaiting;
                if (sigs) {
                        tcb->tcb.regs[0] = sigs;

                        ... move tcb to run queue ...
                        ... reschedule tasks ...
                }
        }
}

Wait is just as simple. In the simplest case, if any of the signals it is waiting on have already arrived, it just returns the intersection of the two sets. Otherwise, the task is put to sleep by placing it on the wait queue, storing the signals it is waiting on. The `trick' here is that when it wakes up, it will just act like it has returned from a context switch - and start executing immediately after the svc instruction. The Signal code above uses that to return the return-code in the delayed case via register r0 - the standard ABI register for function return values.

sigmask_t svc_Wait(sigmask_t sigMask) {
        struct task *tcb = thistask;
        sigmask_t sigs;

        sigs = tcb->sigReceived & sigMask;
        if (sigs) {
                tcb->sigReceived &= ~sigs;
        } else {
                tcb->sigWaiting = sigMask;
                tcb->state = TS_WAIT;

                ... move current task to wait queue ...
                ... reschedule tasks ...
        }

        return sigs;
}

Invoking system calls uses the svc instruction. This encodes the system-call number in the instruction, but you can also pass it in a register - and I follow the linux EABI which puts the system call number in r7.

With this mecanism it is trivial to implement a system-call in assembly language - tell the compiler what the args are, but the assembly just passes them to the system.

C code:

extern int Wait(int sigmask);

ASM:

Wait:   push { r7 }
        mov  r7,#3
        svc  #3
        pop  { r7 }
        bx   lr

But this wont let the compiler in-line the calls, which would be desirable in many cases. So one must resort to inline asm in the C source.

sigmask_t Wait(sigmask_t sigMask) {
        register int res asm ("r0");
        register int sigm asm("r0") = sigMask;

        asm volatile("mov r7,#3; svc #3"
                     : "=r" (res)
                     : "r" (sigm)
                     : "r7", "r1", "r2", "r3", "r12");

        return res;
}

(I'm not sure if that is 100% correct - but it seems to compile correctly.)

Code in ipc-signal.c and another version of context.S.

Other IPC

Well beyond simple signals there are a few other possibilities:

Mailboxes - These are usually a fixed-length queue which can be written to asynchronously if is it not full, but only with simple data type such as an integer or pointer. No dynamic memory is required inside the kernel so it is easy to implement. Shared memory and the like is required to send more complex information.
Shared memory - Fairly simple once you have a mechanism to share memory. No way to send or receive notifications about state changes though, so requires polling (bad for lots of reasons), or other IPC primitives such as signals. Requires some sort of serialisation mechanism too, such as mutexes or semaphores which are best avoided.
Sychronous message passing - This is used to implement a light-weight context switch in microkernels. Because the callee and caller rendezvous at the same time, there's no need to save state beyond the normal state saved in a function call, and simplifies scheduling. If asynchronous or multi-processing is required threads are needed though, and passing large messages can be expensive.
Asynchronous message passing - This uses a queue to keep track of the received messages and allows the target to receive them when ready. If the receiver is ready and waiting it can work much like the synchronous case, otherwise the sender can keep doing work. In a non-protected/non-virtual memory environment this easy and efficient to implement, but these both add complications. For example dynamic memory is needed in the kernel, and messages saved to a kernel-queue involve a double copy - although it can be implemented without copying too.

I'm fond of the async message passing model, particularly with non-copying of messages - I used it in Evolution and other threaded code i've written since. So i'd like to try and get that working efficiently in a protected/virtual memory environment if I can - I had it working before on x86. I don't know if it worked terribly efficiently, but I don't see why it shouldn't be too bad.

Mailboxes also look interesting - because of their simplicity mostly. They can convey more information than a single bit as with signals, but i'm not sure they could replace them (the CELL BE hardware implements mailboxes plus a bit-based system much the same as signals - and if they spent the hardware to do them both there's probably a good reason).

Once I get some mechanism to pass data between tasks I guess i'll look at something a bit more meaty, like a device driver - and go back to banging my head against that gigantic OMAP manual.

FOPS in context

Another late night bit of hacking last night to add FPU support to the context switch code (which incidentally is now committed).

It works by disabling the FPU when a new task is scheduled which doesn't match the owner of the FPU. FPU instructions then cause an illegal instruction exception, and that is used to switch the context if it was an FPU instruction.

I think I spent more time working out how to check for floating point instructions in the illegal instruction exception handler than anything else. Partly just finding out what the instruction bits were, but also just how to implement it without a big mess.

Basically, afaict the following bit patterns are those used by all of the FPU related instructions.

  1111001x        -        -        - advanced simd data processing
  xxxx1110        - xxxx101x xxx0xxxx vfp data processing
  xxxx110x        - xxxx101x xxxxxxxx ext reg l/s
  11110100 xxx0xxxx        -        - asimd struct l/s
  xxxx1110        - xxxx101x xxx1xxxx vfp<>arm transfer
  xxxx1100 010xxxxx xxxx101x        -

Well, the problem is that ARM can only compare to an immediate 8 bit value (albeit anywhere within the 32 bits). So it becomes a bit of a mess of mucking about with those, or loading constants from memory. In the end I opted for the constant load - I figure that a constant load from memory is not really any different from a constant load from the instruction stream, once in the L1 cache, so assuming you're going to save a lot of instructions it will probably be faster. Well it's smaller anyway ...

As a comparison I wondered what gcc would come up with, e.g. with:

void modecheck(unsigned int *a) {
        unsigned int v = *a;

        if ((v & 0xfe000000) == 0xf2000000
            || (v & 0x0f000e00) == 0x0e000a00
            || (v & 0x0e000e00) == 0x0c000a00
            || (v & 0xff100000) == 0xf4000000
            || (v & 0x0fe00e00) == 0x0c400a00)
                printf("do stuff\n");
        else
                printf("do other stuff\n");
}

And it's pretty ugly:

        ldr     r0, [r0, #0]
        and     r3, r0, #-33554432
        cmp     r3, #-234881024
        beq     .L2
        bic     r3, r0, #-268435456
        bic     r3, r3, #16711680
        bic     r3, r3, #61696
        mov     r2, #234881024
        bic     r3, r3, #255
        add     r2, r2, #2560
        cmp     r3, r2
        beq     .L2
        bic     r3, r0, #-251658240
        bic     r3, r3, #16711680
        bic     r3, r3, #61696
        mov     r2, #201326592
        bic     r3, r3, #255
        add     r2, r2, #2560
        cmp     r3, r2
        beq     .L2
        bic     r3, r0, #14680064
        mov     r3, r3, lsr #20
        mov     r3, r3, asl #20
        cmp     r3, #-201326592
        beq     .L2
        bic     r3, r0, #-268435456
        bic     r3, r3, #2080768
        bic     r3, r3, #12736
        mov     r2, #205520896
        bic     r3, r3, #63
        add     r2, r2, #2560
        cmp     r3, r2
        beq     .L2
        ... else
        b       done
.L2:    .. if
done:

So basically it's converting all the constants to 8 bit operations. AFAICT the `branch predictor' basically just has a cache of recently taken branches, and the `default prediction' is that branches are not taken. If that's the case, the above code has a 13 cycle penalty in the best case ... plus all the extra instructions to execute in the worst. The first 2 cases are the most likely with fpu instructions though.

Not that this is in any way time-critical code (at most it will execute once per context switch), it just seems a bit of a mess.

I came up with something somewhat simpler, although to be honest I have no idea which would run faster.

        .balign 64
vfp_checks:     @ mask    ,value
        .word   0x0f000e00,0x0e000a00   @ xxxx1110        - xxxx101x -
        .word   0x0e000e00,0x0c000a00   @ xxxx110x        - xxxx101x -
        .word   0xff100000,0xf4000000   @ 11110100 xxx0xxxx        - -
        .word   0x0fe00e00,0x0c400a00   @ xxxx1100 010xxxxx xxxx101x -

        ...

 ldr r2,[r3,#-8]  @ load offending instruction
 ldr r2,[r2]

 and r0,r2,#0xfe000000 @ check 1111001x xxxxxxxx xxxxxxxx xxxxxxxx
 cmp r0,#0xf2000000
 ldrned r0,r1,vfp_checks
 andne r0,r0,r2
 cmpne r0,r1
 ldrned r0,r1,vfp_checks+8
 andne r0,r0,r2
 cmpne r0,r1
 ldrned r0,r1,vfp_checks+16
 andne r0,r0,r2
 cmpne r0,r1
 ldrned r0,r1,vfp_checks+24
 andne r0,r0,r2
 cmpne r0,r1
 bne 1f

        ... if

1:      ... else

Given the constants are on a cache boundary, presumably their loads should all be 1 cycle (for all 64 bits) after the cache line is loaded, so they shouldn't be any worse than in-line constants that take more than 2 or more instruction to create (let alone the 7 required for both constants in the second C case).

Interestingly although not surprisingly the -Os code is more similar to the hand-rolled assembly, although it includes explicit branches rather than conditional execution. And does single-word loads instead of double.

Apart from that the context switching itself is pretty simple - just store the fpu registers above the others. irq_new_task is where the fpu is turned on or off depending on the current fpu owner, so it doesn't do anything if only one task is actively using the fpu. The existing context switch code required no changes - it is all handled 'out of band'.

Given that I now have some tasks to manage I also added a general illegal instruction handler for really `illegal instructions'. All it does is suspend the current task and remove it from the run queue as well, and the other tasks keep running just fine. For my uKernel it will then have to signal a process server to clean up/or whatever.

Code is in context.S and irq-context-fp.c.

OCaml

Well yesterday I should have been out working on the retaining wall but in the morning I got stuck thinking about optimising implementations of mathematical expressions (too much coffee late the night before methinks).The primary reason is that I think I found a pretty extreme case of wasteful calculation in an algorithm I'm working on. As far as I can tell from manual inspection, in one place it calculates a `square' 4D intermediate result - actually of the 8M array entries (64^4), most of them, about ~7M, remain unset at (0+0i). But when it uses this array for further processing it only uses 2047 of the results! Considering the intermediate array is 256MB in size(!!!), this is an obvious target for optimisation - it spends far more time clearing the memory than calculating the results it uses!

(I had the mistaken impression that matlab did smart things with such expressions, but I guess it doesn't - it's really just a scripting language front-end to a bunch of C functions.)

I have a feeling this sort of thing is happening throughout the algorithm although this is probably at the extreme end, but analysing it manually seems error prone and time consuming.

Surely there's a better way ... functional languages?

I'm not sure if i'll be able to take advantage of any of them, at least in the short-term. I was reading about how FFTW uses OCaml to implement it's optimiser - this is the sort of thing I might have to look at, although I never could wrap my head around functional programming languages. But anwyay - it's something i'll keep investigating although my focus to start with will just be code translation. Perhaps some sort of symbolic expression analyser might help me reduce the equations - once I work out what they are at least.

Another brick in the wall

I did get a couple of hours in on the retaining wall, laying the foundation layer of bricks which is the tricky bit - but then just as I was starting to make good progress it started to rain. I got nearly half-way on the main wall at least. I assessed that the stair-well i'd created was too steep and started too high (a slack string led me to believe the first-run of bricks would be lower), so i'll have to dig that out more too, but that wont take too long. I need to get some ag-pipe before I can back-fill it though. Hmm, should really get out there and do a bit more for a couple of hours - at least do the first run - hmm, but alas I think the day has gotten away from me again, and I have some other things to do.

It's a long-weekend anyway, so maybe tomorrow is a better day to get dirty, and the clouds are looking a bit dark at the moment.

Strange week.

Phew, another week down. That was a pretty long one. New job, all sort of things to learn about.

I went with CentOS for my workstation ... hmm, but i'm not really all that pleased so far. It's mostly working but there's a few weird bits. xterm's keep dying occasionally while i'm using them - they lock up hard and I can't even close the window. I have 'focus follows mouse' on, but it isn't always working properly - if the machine is busy when I move the mouse it seems to do weird shit. I also had a locked pointer grab - gee, I haven't had those outside of using gdb for years and years. And today I was switching virtual desktops and it just went mental - keep cycling the windows as fast it could once I let go of the keyboard (it still accepted keystrokes as the windows whizzed by) - had to kill X. The whole machine feels pretty slow too - it is an older machine being re-used, but it feels slower than it should. Might be the encrypted home mount too ... and ext3 must share some blame too. And finally the lack of packages - I expected it to be more limited than other more bleeding edge systems i've used lately, but the lack of packages is really stark; some what I thought really basic 3rd party packages are simply missing. I guess i'll have to re-install it with something a bit more production ready; CentOS 5.4 just isn't, at least for my needs. Sigh, not really what I wanted to have to do - stability is pretty much the whole reason I was looking at CentOS anyway.

As part of work i've been reading about convolutions and fft's and other fun stuff. Not all new to me, but I've never had to use it in anger before so it's pretty interesting delving into it. Unfortunately my maths was never really up to scratch in this area ... on the other hand I just have to use it, not derive it. Octave works pretty well for playing with ideas, and gnuplot does some nice plots too. Maybe if I come up with some interesting stuff as i'm learning I'll post some shots, but i'm a bit too worn out this week.

Last Saturday and today I spent a couple of hours shifting road-base and filling up the trench and tamping it down. Pretty hard work but not that bad in short spurts, and the trench is just about where I need it - 12 barrow loads so far. I was going to get some sand for the final layer, but I might try with the roadbase since I have it, and would have to move it all before I can order any sand. It's a lot harder making it level though, so I may regret that idea. I'll see - should really do some more on it this weekend, but there's other things I should probably do too. It might rain anyway (hmm, rain, i've forgotten what that feels like).

Context switching

After working on it in bits and pieces over the last few days I managed to get context switching working (late) last night.

Again apart from one little error it might've happened a lot sooner - I read the APSR description about the 'big endian bit' and I think my mind just assumed 'not x86 ==> big endian' and I decided I needed to turn it on. Oops. Very odd, apparently scheduling tasks just fine but they don't seem to execute at all, nor cause any crashes. Oh this is the first time i've had user-mode code executing as well.

My initial idea was to only save the minimal state possible on interrupt entry, and only bother with a full save/restore in the event of an actual context switch. But then thinking about the microkernel design I want to implement later on, this seems a bit pointless since generally the `low' part of the interrupt handler will always run straight away so it will probably lead to a context switch. This simplifies the context switcher a little bit and adds a bit more flexibility internally too. For system calls i'm not so sure yet - quite a few system calls should lead to a context switch too but a lot wont, so i'm still thinking about whether it will work the same way. And I have to decide if system-calls can be interrupted - if they remain very tiny then they wont need to be.

The context switch code basically ends up the same as the one out of the manual, but hooked into the normal IRQ handler. The IRQ stack pointer is used as the point to ThisTask.tcb.regs[0] all the time, so nothing needs to be loaded at interrupt handler time. I have a trivial ASM function which lets the system code set the next task to run:

        @ r0 = pointer to r0 in tcb
        .global irq_new_task
irq_new_task:
        mrs     r1,cpsr
        cps     #MODE_IRQ
        mov     sp,r0
        msr     cpsr,r1
        bx      lr

The function below then becomes the new IRQ entry point. The lines highlighted in bold indicate the changes from the previous interrupt handler - there aren't too many, and one is just a cleanup (the mov lr,pc). The main difference is that all of the context state is stored on the `irq stack' rather than the supervisor stack. But it isn't really a stack, it's just a pointer to the TCB. This does have one consequence - it is impossible to implement re-entrant interrupts (at least without further code). If more state is required it may make sense to shift all interrupts to use fast interrupts, since it could then use the other FIQ banked registers to store per-processor state - although it could equally just be a pointer from the TCB to a per-processor or per-process struct.

        .global irq_entry
irq_entry:
        sub     lr,#4
        stm     sp,{ r0-r14 }^          @ save all user regs
        srsdb   #MODE_IRQ               @ save spsr and return pc

        cps     #MODE_SUPERVISOR
        push    { r12, lr }             @ save supervisor lr and r12 to supervisor stack

        ldr     r5,=INTCPS_BASE         @ find active interrupt from INTCPS
        ldr     r0,[r5,#0x40]
        
        ldr     r2,=irq_vectors         @ execute vectored handler
        and     r0,r0,#0x7f
        mov     lr,pc
        ldr     pc, [r2, r0, lsl #2]

        mov     r1,#1                   @ tell INTCPS we've handled int
        str     r1,[r5,#INTCPS_CONTROL]
        dsb

        pop     { r12, lr }             @ last of state on supervisor stack

        cps     #MODE_IRQ

        ldm     sp,{r0-r14}^
        rfedb   sp                      @ back to new or old task

Now the IRQ sp is just used as a pointer to the current TCB - so the code doesn't perform any write-back to the sp when writing or restoring values. The user registers are stored/restored above it, and the pc and spsr registers are stored below it. The push { r12, lr } doesn't actually need to store r12 since we already saved it, but this is used to keep the stack aligned at this point - it will need to change to ensure the alignment specifically so r12 wont need to be saved once that is done.

struct tcb {
        uint32_t pc;
        uint32_t spsr;
        // <- sp_irq points here always
        uint32_t regs[15];
};

The C structure that maps to this is shown above, indicating where the sp_irq actually points. I have a simple linked-list of tasks to hold this state, and whatever other state the kernel might need.

struct task {
        struct Node Node;
        int id;

        struct tcb tcb;
};

Within an interrupt routine, if I wish to schedule a new task all I need to do is call the aforementioned irq_new_task function with the value from inside the tcb, and that task will run once the interrupt is finished.

Although not particularly practical in the real world, a round-robin scheduler of all tasks in the run-queue is as simple as:

        Remove(&thistask->Node);
        AddTail(&tasks, &thistask->Node);

        thistask = (struct task *)tasks.Head;
        irq_new_task(&thistask->tcb.regs[0]);

There are (at least?) two other pieces of context that also need changing in a complete system, the MMU tables and the floating point registers.

Floating point registers don't need saving/restoring here because the kernel will never use floating point itself. And instead of wasting the time saving/restoring full FPU state at every interrupt, the code will just find out when a floating point instruction is used and swap the state then. The irq_new_task call can probably just disable the floating point unit, and when an undefined instruction interrupt occurs on the floating point co-processor the state can be changed then if it needs to be. And/or some combination there-of. e.g. irq_new_task could check which task `owns' the FPU and enable/disable it appropriately.The MMU is a little trickier, it will need the page tables changed at every context switch. Since I will map the system memory globally across all tasks, I will probably also be able to put all of that logic into irq_new_task as well. Not sure how i'll deal with system calls that cause a context switch yet. I am looking at a process+task model too, so each task will be associated with a parent process which will encompass the memory map, so for example, a mutli-tasked process wont need an MMU switch if it is only switching between tasks.

And finally the last piece of the puzzle is boot-strapping the task system. There is only 1 CPU and it can only execute one bit of code at a time, so basically the initial booting process just 'falls through' to the 'current task' and automagically just starts executing as part of it's context. There is actually very little that needs to be set up - simply changing to user mode, and then initialising it's stack pointer, and that's it. The code then jumps to what is the entry point of the current task (as defined by the context switching mechanism above).

        // <- in supervisor state here, with boot-strap stack, etc
        asm volatile("cps #0x10");
        // <- user mode, undefined stack pointer
        asm volatile("ldr sp,=0x88000000 - 32768");
        // <- now sp is set to tasks's stack
        task(0);

The code above is just a demo, but the final thing will just jump to the idle task, so it can still be hard-coded in a similar way. Or it can be even simpler - the idle task does not need any stack, e.g. the following would suffice.

idle_task:
        wfi
        b idle_task

Code not committed yet.

Hmm, not sure what to look at next, maybe MMU context switching, or perhaps something simpler like system calls. Really exhausted tonight - spent the whole day staring at code not written by a programmer, and I need a good meal too.

Maiden Century

This is my first web diary that's made it to 100 posts, that being this post. Yay.

About Me

Tags