A bit more vj progress, thoughts on elf loader

I had a chance to fit in a bit more hacking on the parallella over the weekend, although not as much progress as i'd hoped.

I first created an ARM version of the viola-jones (VJ) code to compare; it's still getting different results to a previous implementation but at least it's consistent with the EPU (epiphany core) version. I tried to work out why it isn't the same but I still haven't fathomed it - it can wait.

But at least with a consistent implementation I can do some comparisions. It's only running on a single core in each case, and it's timing just the processing of every 20x20 window on a single-scale 512x512 image (about 230K probes).

   CPU                     Time
   Intel core Duo 2Ghz     0.3s
   Parallella ARM          1.5s
   Parallella EPU          2.8s
   Parallella ARM format 2 1.1s (approx)
   Parallella EPU format 2 2.3s
   Parallella EPU ASM      1.4s

   Parallella ARM C*       1.0s
   Parallella EPU C*       1.8s
   Parallella EPU ASM*     0.9s

Update 3: I finally got the assembly version working and added 3 more timings. These include C versions that have hand-unrolled inner loops aided by the new format. The ARM version doesn't report very reliable timings (paging on first-run i guess), but it's definitely improved. I've tried to optimise the scheduling on the assembly version although there are still enough stalls to add up - but removing those will require more than just reordering of instructions. These new times also includes some tweaks to the arrangement of the cascade chunks, although i can't recall if it made much difference. Still yet to double-buffer the image DMA.

Update *: Tried a bunch more things like fully double buffering, a more C-friendly format for the C versions, and some hacks which probably aren't safe/aren't really working. But as they still "appear" to work i'm including the timings here listed with the *s. But the C implementation is so tight it only just fits into the on-core memory to such an extent it might be difficult to add the scheduling code for multi-core.

It's also pretty much exhausted my interest at this point so i will probably move onto multi-core as well as a more complete and realistic example (although, tomorrow is another day so i might change my mind if inspiration hits).

(BTW this is not representative of the time it takes to detect a face on a 512x512 image on this hardware, as one doesn't normally need to look for 20x20 pixel faces at that resolution and it will typically be scaled down somewhat first, 1x1 pixel centres are usually not needed either).

To be honest I was hoping the EPU would be a bit closer to the ARM because of the similar clock speed, but I think in part it's suffering from a compiler which doesn't optimise quite as well. And the format I'm using benefits from the more capable instruction set on the ARM (bitfield stuff). However, there's still more that can be done so it isn't over yet - i'm not double buffering the image window loads yet for one.

First I started looking at writing an assembly version of the inner-most routine. But I ran into a problem on that - debugging is a royal pain in the arse. Obviously I had bugs, but the only way to "debug" it is to stare at the code looking for the error. And i'm not yet proficient enough at the dialect to be very good at this yet. The only indication of failure is that the code never gets to the mailbox return message. I don't know how to run gdb on the epu cores, and the computer i'm using as a console doesn't have an amd64 GNU/Linux installed (required for the pre-compiled sdk). A few documentation errors were a bit misleading too (post-increment is really post-increment despite the copy-paste job from the indexed LDR page). After paring things down enough and working out the ABI I at least had something which walked the cascade. As I started filling out the guts I thought I may as well tweak the format for better run-time efficiency too.

The EPU is most efficient when performing 64-bit data loads (aligned to 64-bit); but the format I had was all based on 32-bits for space-efficiency. With a little bit of re-arranging I managed to create a 64-bit format which requires less manipulation whilst only adding 4 bytes to each feature - this adds up but I think it will be worth it. It also lets me unroll the inner loop a bit more easily. One thing I noticed is that the 'weights' of the features in the cascade i'm using only fall into 3 categories; 2 combinations for 2-region features: -0.25 + 0.50, -0.25 + 0.75, and 3-region features are always a fixed -0.25 + 0.50 + 0.50. So I don't have to encode the weights individually for each region and can just either hard-code the weights (for the 3-region features), or store a single float value for each feature instead.

Anyway then my sister dropped around, which means a slight hangover this morning ... so seeing if all this tweaking makes any difference will have to wait.

Update: So i did some evening hacking, and finally got the code working on the epu - kept overwriting the stack and other fun stuff which is a pain to track down. And oh boy, make sure you avoid double and that any shorts in a structure are unsigned - othterwise it really hits performance.

So yeah the new cascade format does increase the performance ... but it just widened the gap! Although I gained a pretty decent 30-50% improvement on ARM execution time (actually its hard to tell, timing is very inconsistent on the arm code), it was pretty much insignificant on epiphany, under 2%. Blah.

The total DMA wait time of loading the cascades went up by 30% which suggests it's being limited by external memory bandwidth and not cpu processing. Some of this is good - indicating the cpu code is better and so there's more waiting around - but as the format is a bit fatter that can't help either and it's not like there's any way to fit any other processing in the dead time.

Will have to re-check the dma code to make sure it's double buffering properly, but i'm probably going to have to try another approach. a) Go for size of the data instead, i can get each feature down to 20/24 bytes (2 region features) + 24 bytes (3 region features) vs 32/40 - if i do the address arithmetic in the inner loop (lucky i can do the multplies with shifts), and/or b) see if fiddling with the sizes of the groups helps, and/or c) Distribute the extra bits of the cascade in chunks across multiple epiphany cores so the cascade remains completely on-chip, and/or d) Try a completely different approach such as a pipeline architecture where different cores process different stages.

Update 2: Actually on second thoughts, i don't think it's the dma, it's just not enough cycles to make a big difference (assuming my timing code is right). And even if i limit it to the local stages and don't dma in any cascade chunks it's still taking about 75% of the total time. So although the C compiler seems to be doing an okish job I'm probably better off spending my time in assembler now.

EPU Linking

Based on thoughts on a post on the forums I also had a quick look at how one might write an elf loader for the epu - in a way which makes it more useful than just memcpy. I think it should be possible to create an executable for the epu which allows each core to have custom code, and for each core to be able to resolve locations in other cores. And if the linking is done at run-time it can also re-arrange the code in arbitrary ways - such as to customise a specific work-group topology or even handle differences between 16 or 64-core chips. And I think I can do this without changing any of the tools.

I poked around ld and noticed the -q and -r options which retain the reloc hunks in the linked output. For a test I created a linker script which defines a number of sections each with an origin a megabyte apart - i.e. the spacing of epu cores. This represents the absolute location of each 'virtual' core. Code can be placed on any core using section directives (attributes), and using the -q option the linker outputs both the code and the reloc hunks and so should allow a loader to re-link each core's code to suit any topology ... well in theory.

Before I get anywhere with that I need some doco on the epiphany-specific reloc hunks from the assembler - looks quite straightforward - and finding enough time to look into it.

This might seem a bit weird in terms of GNU/Linux, but i'm familair (although no doubt rather rusty!) with non-pic but relocatable code from my Amiga hacking days. My code back then was assembled to a fixed address, and the OS loader would re-link it to potentially non-contiguous chunks of memory in a global system address space. If it worked then, it should work ok with the parallella too. My only concern is that the GNU linker outputs enough information - using the -r switch (for relocatable code) causes the final link to fail (probably libc/startup related), but the -q switch (which simply retains the reloc hunks) may not be sufficient. It also relies on the fact that any inter-core reference will have to be absolute and thus require a load and a reloc entry of a 32-bit address, but I think that will be the case.

Having a runtime elf loader/relocator allows for some other nifty shit too like thread local storage and so on. Now, if only the epiphany ABI were a bit more sane too! (wrt register conventions)

Well that wasn't very fun

I've been trying to work with the android open accessory protocol over the last few days and i finally got android to android communications working via usb. Most of the examples to be found are either on an embedded platform or via GNU/Linux, and i couldn't find anything about doing the host-side from android.

Bit of a mess though. They somehow managed to make a fairly simple (but still a bit weird) library - libusb - into something less than the sum of it's parts (discordant vs syngergistic?). And then you add to that the complex life-cycle of an android app and things get pretty nasty.

Some pretty weird shit to deal with:

If you don't set the app as the default handler for the device, then you get permission popups which mean you have to put the connect logic in the resume method otherwise nothing works.
But if it is defaulted, ... then you don't get permission popups, and therefore resume is not called at the right time.
If you tell a connected device to go into accessory mode, you get a device disconnected message, which sort of makes sense ...
... but you don't get a connected message for the accessory device.
And if you subsequently list the devices connected to the system, a device is still listed with the same vendor id + device id (this is wrong, the device should have changed at this point) ...
... but that "device" isn't the same as the "device" you started with.
And there seems to be no "reliable" way to determine if you have the accessory interface, or the MTP interface (on the machine i'm testing with). (well maybe testing using a no side-effect valid request which fails could be used here).
If the device rotates it destroys the application ... leaving everything in an unknown state.
Sometimes nothing works anyway, and the apps on both end need killing.
The documentation is a bit shit, and amounts to "the [outdated] source is the documentation". And that relies on a USB library which doesn't fudge the device id.

But with all that I did manage to send strings both ways so now the challenge is to sort through the cruft i've ended up with and turn it into something robust and hopefully simple.

In other android news i finally restarted on the ffts-related work and will scale down the initial goals to get something out the door.

DMA n stuff

So it looks like it's going to take a bit longer to get the object recognition code going on parallella.

First i was a bit dismayed that there is no way to signal the ARM with an interrupt - but subsequently satisfied that it is just work in progress and can be added to the FPGA glue logic later. Without this there is no mechanism for efficient epiphany to arm communications as the only option is polling a shared memory location (ugh, how x86).

Then I had to investigate just how to communicate using memory buffers. I had some trouble with the linker script (more on that later) so I hardcoded some addresses and managed to get it to work. I created a simple synchronous mailbox system that isn't too inefficient to poll from the ARM.

Once I had something going I tried a few variations to judge performance: simple memory accesses, and different types of DMA.

The test loop just squares the elements of one array into another, and uses a single e-core. The arrays are 512x512 floats.

seq How                         Total        DMA verified 
 1: Direct array access     216985626          0 ok
 3: Synchronous DMA          11076425          0 ok
 5: Simultanous DMA byte     90242983   88205438 ok
 7: Simultanous DMA short    46104601   44066927 ok
 9: Simultanous DMA word     24047251   22009451 ok
11: Simultanous DMA long     11179747    9131659 ok
13: Async Double DMA byte    88744259   88021739 ok
15: Async Double DMA short   44542240   43819736 ok
17: Async Double DMA word    22448029   21725573 ok
19: Async Double DMA long     9479016    8756500 ok

All numbers are in clock cycles. The DMA routines used one or two 8K buffers (but not aligned to banks). The DMA column is the "wasted cycles" waiting for asynchronous DMA to complete (where that number is available).

As expected the simple memory access is pretty slow - an order of magnitude slower than a simple synchronous DMA. The synchronous DMA is good for it's simplicity - just use e_dma_copy() to copy in/out each block.

Simultanous DMA uses two buffers and enqueue two separate DMA operations concurrently - one reading and the other writing. They both still need to wait for completion before moving on, and it appears they are bandwwidth limited.

Async double-buffered DMA uses the two buffers, but uses a chained DMA operation to write out the previous result and read in the next result - and the DMA operation runs asynchronous to the processing loop.

Some notes:

Avoid reading buffers using direct access!: Slow slow slow. Writing shouldn't be too bad as write transactions are fire and forget. I presume as i used core 0,0 this is actually the best-case scenario at that ...
Avoid anything smaller than 64-bit transfers.: Every DMA element transfer takes up a transaction slot, so it becomes very wasteful for any smaller size than the maximum.
Concurrent external DMA doesn't help: Presumably it's bandwidth limited.
Try to use async DMA.: Well, obviously.
The e-core performance far outstrips the external memory bandwidth: So yeah, next time i should use something a bit more complex than a square op. This is both good - yeah heaps of grunt - and not so good - memory scheduling is critical for maximising performance.
Multicore?: I have yet to experiment with bigger work-groups to see how it scales up (or doesn't).

Build Environment

So one reason I couldn't get the linker script to work properly (assigning blocks to sections) was due to my build setup. I was initially going to have a separate directory for epiphany code so that a makefile could just change CC, etc. But that just seemed too clumsy so I decided to use some implicit make rules which use new extensions to automate some of the work - .ec, .eo, .elf, .srec, etc. The only problem is the linker script takes the name of the extension into account, so all my section attributes were being ignored ...

I copied it and added the extensions to a couple of places and that fixed it - but i haven't gone back to adjust the code trying to take advantage of it.

Object recognition

So anyway i tried to fit this knowledge into the OR code, but i haven't yet got it working. The hack of hard-coding the address offsets doesn't work now since i'm getting the linker to drop some data into the same shared address space, and i'm not sure yet how i can resolve the linker-assigned addresses from the ARM side so i can properly map the memory blocks. So until I work this out there's no way to pass the job data to the e-cores.

I could just move all the data to the ARM side and have that initialise the tables, but then I have to manually 'link' the addresses in. So that is throwing away the facilities of the linker which are designed for this kind of thing. I could create a custom linker script which hard-codes the addresses in another way but that seems hacky and non-portable.

I might have to check the forums and see what others have come up with, and read that memory map a bit more closely.

Progress on object detection

I spent more hours than I really wanted to trying to crack how to fit a haarcascade onto the epiphany in some sort of efficient way, and I've just managed to get my first statistics. I think they look ok but I guess i will only know once I load it onto the device.

First the mechanism i'm using.

Data structures

Anyone who has ever used the haarcascades from opencv will notice something immediately - their size. This is mostly due to the XML format used, but even the relatively compact representation used in socles is bulky in terms of epiphany LDS.

struct stage {
    int stageid;
    int firstfeature;
    int featurecount;
    float threshold;
    int nextSuccess;
    int nextFail;
};

struct feature {
    int featureid;
    int firstregion;
    int regioncount;
    float threshold;
    float valueSuccess;
    float valueFail;
};

struct region {
    int regionid;
    int left;
    int top;
    int width;
    int height;
    int weight;
};

For the simple face detector i'm experimenting with there are 22 stages, 2135 features and 4630 regions, totalling ~150K in text form.

So the first problem was compressing the data - whilst still retaining ease/efficiency of use. After a few iterations I ended up with a format which needs only 8 bytes per region whilst still including a pre-calculated offset. I took advantage of the fact that each epiphany core will be processing fixed-and-limited-sized local window, so the offsets can be compile-time calculated as well as fit within less than 16 bits. I also did away with the addressing/indexing and took advantage of some characterstics of the data to do away with the region counts. And I assumed things like a linear cascade with no branching/etc.

I ended up with something like the following - also note that it is accessed as a stream with a single pointer, so i've shown it in assembly (although I used an unsigned int array in C).

cascade:
    .int    22                           ; stages

    ;; stage 0
    .short  2,3                          ; 3 features each with 2 regions
    ;; one feature with 2 regions
    .short  weightid << 12 | a, b, c, d  ; region 0, pre-calculated region offsets with weight
    .short  weightid << 12 | a, b, c, d  ; region 1, pre-calculated region offsets with weight
    .float  threshold, succ, fail        ; feature threshold / accumulants
    ;; ... plus 2 more for all 3 features

    .short  0,0                          ; no more features
    .float  threshold                    ; if it fails, finish

    ;; stage 1
    .short 2,15                          ; 15 features, each with 2 regions

This allows the whole cascade to fit into under 64K in a readonly memory in a mostly directly usable form - only some minor bit manipulation is required.

Given enough time I will probably try a bigger version that uses 32-bit values and 64-bit loads throughout, to see if the smaller code required for the inner loop outweights the higher memory requirements.

Overlays ...

Well I guess that's what they are really. This data structure also lets me break the 22-stage cascade into complete blocks of some given size, or arbitrary move some to permanent local memory.

Although 8k is an obvious size, as i want to double-buffer and also keep some in permanent local memory and still have room to move, I decided on the size of the largest single stage - about 6.5K, and moved 1.2K to permanent LDS. But this is just a first cut and a tunable.

The LDS requirements are thus modest:

    +-------------+
    |             |
    |local stages |
    |    1k2      |
    +-------------+
    |             |
    |  buffer 0   |
    |    6k5      |
    +-------------+
    |             |
    |  buffer 1   |
    |    6k5      |
    +-------------+

With these constraints I came up with 13 'groups' of stages, 3 stages in the local group, and the other 19 spread across the remaining 12 groups.

This allows for a very simple double-buffering logic, and hopefully leaves enough processing at each buffer to load the next one. All the other stages are grouped into under 6k5 blocks which are referenced by DMA descriptors built by the compiler.

This also provides a tunable for experimentation.

Windowing

So the other really big bandwith drain is that these 6000 features are tested at each origin location at each scale of image ...

If you have a 20x20 probe window and a 512x512 image, this equates to 242064 window locations, which for each you will need to process at least stage 0 - which is 6 regions at 4 memory accesses per region, which is 5 million 32-bit accesses or 23 megabytes. If you just access this directly into image data it doesn't cache well on a more sophisticated chip like an ARM (512 floats is only 4 lines at 16K cache), and obviously direct access is out of the question for epiphany.

So one is pretty much forced to copy the memory to the LDS, and fortunately we have enough room to copy a few window positions, thus reducing global memory bandwidth about 100 fold.

For simplicity my first cut will access 32x32 windows of data, and with a cascade window size of 20x20 this allows 12x12 = 144 sub-window probes per loaded block to be executed with no external memory accesses. 2D DMA can be used to load these blocks, and what's more at only 4K there's still just enough room to double buffer these loads to hide any memory access latency.

LDS for image data buffers:

    +-------------+
    |    SAT 0    |
    | 32x32xfloat |
    |     4k      |
    +-------------+
    |    SAT 1    |
    | 32x32xfloat |
    |     4k      |
    +-------------+

A variance value is also required for each window, but that is only 144 floats and is calculated on the ARM.

Algorithm

So the basic algorithm is:

    Load in 32x32 window of summed-area-table data (the image data)
    prepare load group 0 into buffer 0 if not resident
    for each location x=0..11, y=0..11
        process all local stages
        if passed add (x,y) to hits     
    end

    gid = 0
    while gid < group count AND hits not empty
        wait dma idle
        gid++;
        if gid < group count
            prepare load group gid into dma buffer gid & 1
              if not resident
        end
        for each location (x,y) in hits
           process all stages in current buffer
           if passed add (x,y) to newhits
        end
       hits = newhits
    end

    return hits

The window will be loaded double-buffered on the other dma channel.

Efficiency

So i have this working on my laptop stand-alone to the point that I could run some efficiency checks on bandwidth usage. I'm using the basic lenna image - which means it only finds false positives - so it gives a general lowerish bound one might expect for a typical image.

Anyway these are the results (I may have some boundary conditions out):

Total window load         : 1600
Total window bytes        : 6553600
Total cascade DMA required: 4551
Total cascade DMA bytes   : 24694984
Total cascade access bytes: 371975940
Total SAT access bytes    : 424684912

So first, the actual global memory requirements. Remember that the image is 512x512 pixels.

window load, window bytes

How many windows were loaded, and how many bytes of global memory this equates to. Each window is 32x32xfloats.

This can be possibly be reduced as there is nearly 1/3 overlap between locations at the expense of more complex addressing logic. Another alternative is to eschew the double-buffering and trade lower bandwidth for a small amount of idle time while it re-arranges the window data (even a 64x32 window is nearly 5 times more global-bandwidth efficient, but i can't fit 2 in for double buffering).

cacade DMA count, cascade DMA bytes

How many individual DMA requests were required, and their total bytes. One will immediately notice that the cascade is indeed the majority of the bandwidth requirement ...

If one is lucky, tuning the size of each of the cascade chunks loaded into the double buffers may make a measureable difference - one can only assume the bigger the better as DMA can be avoided entirely if the cascade aborts within 11 stages with this design.

Next we look at the "effective" memory requirements - how much data the algorithm is actually accessing. Even for this relatively small image without any true positives, it's quite significant.

It requires something approaching a gigabyte of memory bandwidth just for this single scale image!

Total SAT bytes: The total image (SAT) data accessed is about 420MB, but since this code "only" needs to load ~6.5MB it represents a 65x reduction in bandwidth requirements. I am curious now as to whether a simple technique like this would help with the cache hit ratio on a Cortex-A8 now ...
Total cascade bytes: The cascade data that must be 'executed' for this algorithm is nearly as bad - a good 370MB or so. Here the software cache isn't quite as effective - it still needs to load 24MB or so, but a bandwidth reduction of 15x isn't to be sneezed at either. Some of that will be the in-core stages. Because of the double buffering there's a possibility the latency of accessing this memory it can be completely hidden too - assuming the external bus and mesh fabric isn't saturated, and/or the cascade processing takes longer than the transfer.

LDS vs cache can be a bit of a pain to use but it can lead to very fast algorithms and no-latency memory access - with no cache thrashing or other hard-to-predict behaviour.

Of course, the above is assuming i didn't fuck-up somewhere, but i'll find that out once I try to get it working on-device. If all goes well I will then have a baseline from which to experiment with all the tunables and see what the real-world performance is like.

Object detection, maths, ...

Curiosity got the better of me and I poked abit around the code side of my parallella board today.

Working on from a forum post about it, i started thinking about how to fit the viola & jones 'haar cascade' object detector into the epiphany, and I started looking at an assembly version of the inner loop. If I arrange the data appropriately and force some assumptions i can get it down to around 15 instructions per feature test which is pretty decent. And infact it can even run in the lower 8 registers and so is very compact too (16-bit encodings). I'm actually pretty confident I can get some decent efficiency out of it, although i'm not sure how that will translate to performance yet.

I have a few ideas how i can handle the large size of cascades fairly efficiently - first by having the lowest and most frequently accessed levels stored in LDS, and then either relying on their rarity to handle the upper stages, or using some pre-fetch mechanism. 32K is bloody tight though.

Although most of the calculations are simple mul + add and some comparisions, there is also a square root required per window. I can probably move the calculation of that to the ARM side (although then I would pretty much have to move all the scaling there too), depending on how fast the epihany ploughs through the window tests anyway.

I'm still not quite sure how the data flows between host and cu, but the examples should contain enough information to work that out.

fdiv

So I also looked into - and got pretty much distracted by - floating point divide. Something that is required if e.g. calculating the square root. Since the ephiany has neither reciprocal estimate or divide, one must implement it oneself.

I think I managed to implement the Newton-Raphson division mechanism from Wikipedia in about 40 instructions. Unfortunately there are a lot of data-stalls due to the feedback nature of the algorithm (and me not particularly wanting to hand-schedule the bits where there is leeway), but it runs in about 75 clock cycles with no divide by zero or other checks going on. A C implementation of the identical algorithm takes about 123, and using / in C takes about 131 (with -ffast-math, i'm not sure how the ieee error checks differ). Anyway it's not something vj needs that much of, so C would probably suffice. Divide seems to drag in a bit of libc too, whereas a stand-alone is very compact.

Update: not sure if libm was in external memory either actually, i would have to go back and have a look.

Instruction set

So these two little investigations helped expose me to some of the nuances of the instruction set. Such as the lack of a unary NOT - one must use EOR(~0) instead (which takes 3 instructions - 2 to load ~0, and the eor itself). The lack of bit-field instructions is no surprise, although that fact still doesn't make the pain of emulating them any more fun.

The offset addressing modes are nice in that the index is scaled to the data-size, which makes the 3-bit version quite a bit more useful than it might otherwise be.

Not having r15 as the PC has an obviuos side-effect: no direct way to implement PC-relative addressing and position independent code ... although I presume movfs can be used to load the PC to a register, and that can be used instead.

Wow my desk is dusty

Just a couple of shots of my shitty jury-rigged 'case' for my parallella rev-0.

Base-board is screwed to a CD-ROM with some 4mm irrigation pipe 'stand-off's. It's then sitting inside the upper-half of a 3.5" FDD case.

The fan is just sitting on the hole from the FDD motor/flywheel (the picture is with it turned on). It's powered from the 5v supply in a rather unreliable manner ...

It does the job anyway and seems to keep the chips quite cool. It's a 12v fan running off 5v so it's essentially silent too.

The camera really shows up the (emabarrsing level of) dust, as it's usually too dark for it to be quite so visible. Particularly as it's under one monitor/next to a laptop so my eyes are normally blinded by that.

Parallella rev-0

So I got my first parallella board yesterday. I don't have much to report other than that I plugged it in and it worked. The circuit board is literally credit-card sized and really packed with components - very impressive. With 1G ram and the dual-core arm the base cpu seems quite nippy too - I haven't run any benchmarks vs the beagleboard-xm, but it certainly feels significantly faster from the very limited amount of use so far (e.g. emacs seems usable even via remote X). Ubuntu/debian is a big turn-off though.

Unfortunately the early-adopter rev-0 boards are a bit sub-optimal and don't have working USB (by far the biggest amongst a few faults; hdmi doesn't work yet apparently but that's only a firmware issue and I don't have the correct cable yet in any event). And as I found with the beagleboard - without USB these things are severely limited, so i'm pretty bummed about that (I expect a future i/o expansion board will at least remedy this). I kind of made a mistake in that I really wanted 2x64 core boards but was worried about having import duty hassles. And because I was on leave and staying away from my laptop i missed the email offer to switch for these early boards until too late. Update: I should've checked the message board first, it seems USB should work at some point.

Anyway as one might notice from the lack of posts lately i've kind of gone off pretty much everything technology related. It's partly the weather, but also a change of perspective due to age, health, and so on. I had a few weeks off over that time and didn't touch any code at all, and didn't even feel like thinking about it much either. I did notice the new OpenCL spec which has some interesting features, but i've mostly just been reading online news/forums (but not participating) and playing some PS3 games - and whilst the hysteria and ignorance in the fora has had it's entertainment value, the sameyness is wearing thin.

So right now i'm not really too enthused just yet about poking the epiphany chip (let alone the fpga) or trying to provide feedback on the sdk/etc. Pity it didn't come on-time, when I was more into that shit (as i generally am over summer).

Back at work it looks like i'm on some pretty dull stuff for the next little bit too, which is pretty sapping.

Ideas

Update: responding to the comment below I thought i'd update the post ...

At this point I don't have anything specific I want to do with the board but as I think about it there are a few possibilities of things I might be interested in poking at ...

Simple low-level signal processing algorithms such as convolution, wavelets, fft?. How does it perform compared to NEON? How easy is it to extract that performance?
Machine vision/learning.
Interfacing to Java. Obviouusly JNI but it would also be nice for something deeper/easier to use like Aparapi. Or running lambda expressions. But most of that is probably too big/too out of my area of expertise.
Learn about FPGA programming. Last time I was exposed to it was the early 90s.
FPGA + DSP stuff, e.g. 3d, or sound stuff. Probably unlikely to get very far with that again due to complexity.
Task scheduling/processing; ways to use/share the epiphany resources efficiently. e.g. 'shaders'.
Assembly language.
OpenCL?

Whatever I do will probably limited to small learning experiments rather than long-term applications (that's what work is for). I need to set aside the time and find the enthusiasm to get started too. Time isn't the something i'm short of!

At least the more I think and write about it the more the desire builds to get stuck into it ...

Into the cloud!

So yeah i've been a bit bored/insomniacal[sic] lately and reading the nets ... and one topical topic is the next set of game consoles from microsoft and sony.

I still can't believe how much microsoft ball(mer)sed-up their marketing message, but I guess when you live in such a bubble as they do it probably seemed like a good idea at the time. Sad that sony gets such cheers for merely keeping things the same and letting people SHARE the stuff they buy with their hard-earned. But when microsoft intentionally avoid the 'share' word it isn't so surprising. Incidentally the microsoft used game thing reeks somewhat of the anti-trust trouble that apple are currently in; although they seem to have made an effort to ensure it was regulator-safe by a bit of weaselling in the way they structured it. i.e. they facilitated the screwing of customers without mandating it.

But back to the main topic - the whole 'it's 4x faster due to the cloud' nonsense.

Ok, obviously they shat bricks over the fact that the PS4 is so much faster on paper. The raw FPU performance is 50% better, but I would suggest that the much higher memory bandwidth (~3x) and the sony hardware scheduler tweaks will make that more like 100% faster in practice, or even more ... but time will tell on that. With 1080P framebuffers, 32MB vanishes pretty fast - it's only just enough for a 4x8-bit RGBA framebuffers, or say a 16-bit depth buffer and 1xRGBA colour and 1xRGBA accumulation buffer. With HDR, deferred rendering, and high fp performance and so on, this will be severely limiting and its still only relatively a meagre 100GB/s anyway. microsoft made a big bet on the 32MB ESRAM thinking sony would somehow over-engineer the design; and they simply fucked up (the dma engines are of course handy, but even the beagleboard has a couple of those and it can't make up bandwidth). As another aside, I lost a great deal of respect for anand from anandtech when he came up with the ludicrous suggestion that the 32MB of ESRAM might actually be a hardware associative cache. For someone who claims to know a bit about technology and fills his articles with tech talk, he clearly has NFI about such a fundamental computer architecture component. It suggests PR departments are writing much of his articles.

It learns from its mistakes?

So anyway ... most of the talk about the cloud performance boosting is just crap. Physics or lighting will not be moved to the internet because internet performance just isn't there yet - and the added overheads of trying to code it just aren't worth it. Other things like global weather could be, but it's not like mmo's can't already do this kind of thing and unless you're building 'Cyclones! The game!' the level of calculation required will be minimal anyway. However ... there is one area where I think a centralised computing capability will be useful: machine learning.

Most machine learning algorithms require gads and gads of resources - days to weeks of computing time and tons of input data. However the result of this work is a fairly dense set of rules that can then be sent back to the games at any time. Getting good statistical information for machine learning algorithms is a challenge, and having every machine on the network allows them to do just that, and then feed them back.

So a potential scenario would be that each time you play through a single player game, the AI could learn from you and from all other players as you go; reacting in a way that tries to beat you, at the level of performance you're playing at. i.e. it plays just well enough that you can still win, but not so that it's too easy. Every time you play the game it could play differently, learning as you do. We could finally move beyond the fixed-waves of the early 80s that are now just called 'set pieces'.

It would be something nice to see, although a cheaper and easier method is just to use multi-player games to do the same thing and use real players instead. So we may not see it this iteration: but I think it's about time we did.

Maybe some indie developer can give it a go.

sony or microsoft could also use the same technique to improve the performance of their motion based input systems, at least up to some asymptotic performance limit of a given algorithm.

But ...

However, microsoft have no particular advantage here as it isn't the 'cloud services' that are important here, it's just that every machine has a network port. Sure having some of the infrastructure 'done for you' is a bit of a bonus, but it's not like internet middleware is a new thing. There are a bevvy of mature products to choose from, and microsoft is just one (mediocre) player from many. And a 3rd party could equally go to any other 3rd party for the resources needed (although I bet microsoft wont let them on their platform: which is another negative for that device).

Actually ...

Update: Actually I forgot to mention that I really think the whole 'always online' and kinect-required gig in microsoft's case is all about advertising: it can see who is in a room, age, gender, ethnicity, and if you have an account registered even more details on the viewer. It could track what people in the room are doing during game-playing as well as watching tv shows and advertising; probably even where they're looking, what they're talking about, and what those tv shows/advertisements are. It doesn't take much more to "anonymously" link your viewing habits with your credit card transactions.

A marketer's wet dream if ever there was one. A literal "fly on the wall" in every house that has "one".

And if you think this is hyperbole, you just haven't been paying attention. Google (and others) already do all this with everything you do on the internet or on your phone, why should your lounge room be any different? I was pretty creeped out when I started seeing adverts that seemed to be related to otherwise private communications.

Let's just see how long before people start seeing advertising popping up (perhaps over the TV shows they're watching, or within/over games?) that matches their viewing habits and lounge-room demographics or what they were doing last night on the coffee table. And even if they don't get there "this generation", it's clearly a long-term goal.

So whist one can do some neat stuff with the network, that's just a side-effect and teaser for the main game. Exactly like google and all it's "free" services. Despite paying for it this time, you're still going to be the product. Even the DRM stuff is a side-show.

Prices

Well as usual $AUS gets shafted by the 'overheads' of the local market. But you know what? Who cares. They're both cheaper than the previous models even in face-value-dollars let alone real ones, and we don't treat our less fortunate workers as total slaves in this country (at least, not yet).

They might still be a luxury item but in relative terms they've never been cheaper. My quarterly electricity bill has breached $500 already and it's only going up next year.

The initial price of the device is always only a part of the cost (and for-fucks-sake, it is NOT a fucking investment), and pretty small part with the price of games, power, the tv/couch, and internets on top.

I'm still not sure if i'll get a ps4: given the amount of games i've played over the last 12 months it would be pretty pointless. I've still got a bunch of unopened ps3 games - I think I just don't like playing games much, they're either too easy and boring, too much like work, or I hit a point I can't get past and I don't have the patience to beat a dumb computer and feel good about it (and i'm just generally not into 'competition', i'd rather lose than compete). A PS4 CPU+memory in a GNU/linux machine on the other hand could be pretty fun to play with.

About Me

Tags