About Me

Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as Zed
  to his mates & enemies!

notzed at gmail
fosstodon.org/@notzed

Wednesday, 27 March 2013, 01:51

Bummed out, or am i?

This week I've been experimenting with the performance of some NEON code. It is from an algorithm which was developed in OpenCL for desktop GPUs and then downscaled to fit on a beagleboard (only for development purposes). The overall algorithm is identical but the way some of the steps are implemented is different (and for some significant components much less computationally and bandwidth intensive).

The OpenCL code took many months to develop - although that included dead-ends, multiple steps of refinement, and other distractions including completely unrelated work. Even with that, I'd put the effort at around 4-10x that for the NEON code.

The NEON code took a few weeks. It obviously helped immensely that the algorithm was primarily known in advance although the downscaling alterations were not. On the other hand, my total experience with NEON is far less than OpenCL and certainly C or Java in terms of hours.

One reason the NEON code was much easier to write is that, because it is so cheap to invoke, one can just concentrate on the bottlenecks and leave the housekeeping to C. e.g. I can write a routine that processes as little as 16x16 pixels in assembly, and leave the addressing crap to C. There is also no marshalling or other api binding to worry about: the C is plain C, and the assembly is plain assembly, and even though JOCL is far far better than using the C api directly it's still quite a bit of work. As much fun as OpenCL is, it's even more fun hacking NEON because you can concentrate on the fun bits even more.
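As a made-up illustration of that split (the routine name and signature here are invented): the assembly only ever sees a single 16x16 tile and its strides, and all the addressing and edge housekeeping stays in plain C.

#include <stdint.h>

/* implemented in NEON assembly elsewhere; hypothetical name/signature */
extern void process_tile_16x16(const uint8_t *src, int sstride,
                               uint8_t *dst, int dstride);

void process_image(const uint8_t *src, uint8_t *dst,
                   int width, int height, int stride) {
    /* assumes width and height are multiples of 16 for clarity */
    for (int y = 0; y < height; y += 16)
        for (int x = 0; x < width; x += 16)
            process_tile_16x16(src + y * stride + x, stride,
                               dst + y * stride + x, stride);
}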

Although the OpenCL model is also based on simple kernels which should equally be simple and isolated - it isn't really quite like that in practice. All but the simplest of kernels end up turning into 64-way parallel subroutines utilising LDS, barriers, and so on. Without that you end up leaving scads of performance on the floor, so it really is necessary. Not to mention all the marshalling and boilerplate in the host-code to communicate with it. And because of the marshalling and invocation latencies pretty much everything is forced onto the GPU.

So what's the point i'm getting at?

Well after all that, the projected performance on the previously-latest-version of a popular handset is only about 5x slower than a HD7970 on a pretty beefy desktop!

Yes that 5x speedup is important enough that it is worth it and opens it up to more applications, but on a personal level i'm just totally bummed it isn't much more. It's a highly parallel and bandwidth intensive workload which should be well-suited to a GPU. Obviously opencl has the advantage that it isn't tied to a single bit of hardware. It's a pity SSE sucks so much, otherwise it would be interesting to see how a desktop cpu fared on its own.

I plan to "back port" the algorithms so it can be improved on the GPU, but I have a fairly educated feeling that another 2000% performance isn't very likely. I will also need to use some AMD proprietary extensions, so the portability will suffer too.

I'm sure I can improve it, but I just take it as a big personal slap in the face for all the effort that's gone into it so far!

Of course, the alternative view is that ARM+NEON is the bee's knees - with less effort i'm getting relatively great performance. But we all knew that so it isn't such a revelation ...

The main bottleneck on ARM cpus at the moment is the memory, and if you can utilise the cache effectively it really flies. I would really like to see how a beagleboard like machine with big/little A15/A7 quad core and much faster memory would fare, all these cheap android dongles are far too constrained by their form-factor.

Update: Well I might need to eat my words here. Today I started to look at GPU optimisations based on what i'd learnt from the ARM experience and trying to reduce the bottlenecks of the GPU code.

The key word of the day: batching.

One reason I wasn't previously batching the processing is because it didn't really fit the data-flow of an earlier application. But I have now achieved something like a 50x boost in one key algorithm by a combination of batching the work more aggressively and some other significant algorithmic changes.
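For what it's worth the idea amounts to something like the following - sketched against the plain C host API for brevity (the real code goes through JOCL), and with the kernel and argument names made up:

#include <CL/cl.h>

/* one launch covers a whole batch of jobs; the kernel uses get_global_id(2)
   to work out which image/job in the batch it is processing */
cl_int enqueue_batch(cl_command_queue q, cl_kernel do_tile,
                     cl_mem batched_in, cl_mem batched_out,
                     int width, int height, int batch) {
    size_t global[3] = { (size_t)width, (size_t)height, (size_t)batch };

    clSetKernelArg(do_tile, 0, sizeof(cl_mem), &batched_in);
    clSetKernelArg(do_tile, 1, sizeof(cl_mem), &batched_out);
    clSetKernelArg(do_tile, 2, sizeof(int), &width);
    clSetKernelArg(do_tile, 3, sizeof(int), &height);

    /* local work size left to the driver in this sketch */
    return clEnqueueNDRangeKernel(q, do_tile, 3, NULL, global, NULL,
                                  0, NULL, NULL);
}

The point being that the invocation and marshalling overhead is paid once per batch rather than once per job.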

This is more like it. No longer bummed out ...

Tagged beagle, hacking, opencl.
Monday, 25 March 2013, 23:07

Another pointless brouhaha at an IT conference

I just have 2 words to say: Google Glass.

Better get used to it, and more.

Tagged philosophy, rants.
Monday, 25 March 2013, 01:33

#$#@$ Android

One of those "I can feel the hair going grey and falling out" mornings. Although at least it wasn't one of those "I made every mistake I could have, every time I did anything" ones. My throat is hoarse from all the screaming of definitely un-pc obscenities.

I'm trying to get some jjmpeg stuff working on android - but I can't use the jjmpeg android branch due to licensing issues. I need to use a fully shared library version.

Compiling the code was easy enough but trying to use shared libraries on Android turned out to be a complete head-fuck.

First, for some fucked reason although shared libraries are placed into a per-architecture location, the bloody library path isn't set up to automatically use it. So you have to hard-code the architecture into the dlopen() calls.

Actually second, it's even worse than that, you need to use absolute paths for dlopen() otherwise it doesn't work. I just hard-coded it, a stack overflow answer stated that finding the path from Java was 'trivial', but then neglected to include which api trivially provided it. I didn't really want to change jjmpeg anyway.
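What I ended up with amounts to something like this - the package path here is only an example, not necessarily the real one:

#include <dlfcn.h>
#include <stdio.h>

void *open_avcodec(void) {
    /* relative names like "libavcodec.so" wouldn't resolve for me; an
       absolute, architecture-specific path did */
    void *lib = dlopen("/data/data/au.notzed.jjmpeg/lib/libavcodec.so",
                       RTLD_NOW | RTLD_GLOBAL);

    if (!lib)
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return lib;
}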

And thirdly, its flaccid linker doesn't support versioning of the shared libraries. To fix that I hacked up ffmpeg/library.mak to only create non-versioned sonames, and then hacked a build script to make it compile properly, as with that change it didn't seem to want to build the libraries properly. I did this by specifying each lib*/*.so manually as the build target.

And the fun's only just started, i've still got a pile of C and NEON code I've got to get to work ...

Update: Oh fucking joy, no complex numbers now.

Update 2: Well I managed to replace complex.h with a few lines of #defines - fortunately the complex numbers are in the compiler, just not in libm, and apart from the basic arithmetic I only needed creal/cimag and conjf.
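A reconstruction of the sort of thing that did the job (not the exact defines), relying on gcc's builtin complex support - __real__, __imag__, the ~ operator for conjugation, and the imaginary constant suffix:

#ifndef FAKE_COMPLEX_H
#define FAKE_COMPLEX_H

/* _Complex is a C99 keyword, so the type itself needs no header */
#define complex _Complex

#define creal(x) (__real__ (x))
#define cimag(x) (__imag__ (x))
#define crealf(x) (__real__ (x))
#define cimagf(x) (__imag__ (x))
#define conjf(x) (~(x))

#define I (1.0fi)

#endif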

And somehow, after re-arranging 1500 lines of C and 1500 lines of NEON assembly language, creating JNI wrappers, and hooking it into ffts and jjmpeg/head ... it worked almost first time. I'd set myself all week to get that far so i'm pretty chuffed, even if the headache from the days earlier tribulations hasn't yet receded.

I really have the beagleboard and a GNU operating system to thank for that - without being able to develop on a proper system first it would've been a nightmare.

Time for a couple of glasses of nice Barossa red methinks.

Tagged android, hacking, jjmpeg, rants.
Sunday, 24 March 2013, 08:11

clAmdFft

Just a quick note, I created a small JNI wrapper using JOCL to access the clAmdFft library from AMD, to see how it compared to the Apple FFT I ported to Java.

Bummer, the Apple FFT is still faster for my problem size - image processing. Kernel time on a 1024x1024 sp fft is ~0.5ms vs ~0.38ms for the Apple one (HD7970). And the apple one only requires 3 passes. Smaller sizes and/or batches seem to scale linearly too.

All that (minor) frustration with the clMiXeDAbbrCamlCase API for nought.

Update: So I actually tried writing a 64-element parallel FFT to see how I could do ... well, not good. Apart from wasting a lot of time wondering why "no matter what I did it took 15ms" - because I hadn't allocated a properly sized buffer (out of range access I guess) - when I finally got it going it was about 10x slower than the Apple one. Blah. Actually the apple one was about the same speed as (if not faster than) a simple memcpy (using float2).

I wasn't being very smart about it, I just broke a 64-element FFT into one-calculation steps and used shared memory to communicate the results. A big overhead was the index twiddling for which I just used lookup-tables, but even the shared memory communication was expensive.
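Structurally the kernel was something like the following sketch - the names are invented here, and the lookup tables (butterfly partner indices, twiddles and signs, one entry per work-item per stage) are assumed to be generated correctly elsewhere:

kernel void fft64_lds(global const float2 *in, global float2 *out,
                      constant uchar *pair_a, constant uchar *pair_b,
                      constant float2 *twid, constant uchar *neg) {
    /* run with a work-group size of 64; each work-item computes one
       output value per radix-2 stage, communicating through LDS */
    local float2 lds[64];
    int lid = get_local_id(0);

    lds[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (int stage = 0; stage < 6; stage++) {
        int i = stage * 64 + lid;
        float2 a = lds[pair_a[i]];
        float2 b = lds[pair_b[i]];
        float2 w = twid[i];
        float s = neg[i] ? -1.0f : 1.0f;
        /* complex multiply w*b, then add or subtract depending on which
           half of the butterfly this output sits in */
        float2 wb = (float2)(w.x * b.x - w.y * b.y, w.x * b.y + w.y * b.x);
        barrier(CLK_LOCAL_MEM_FENCE);
        lds[lid] = a + s * wb;
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    out[get_global_id(0)] = lds[lid];
}

The two barriers per stage plus the table lookups are exactly the overheads mentioned above.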

I may have gotten something wrong anyway, the profiler said the apple fft code had many many fewer global read/write operations, and I couldn't work out why. Although I was trying to do it all in LDS in a single kernel.

Since I didn't verify the results - I was just seeing how the memory access patterns would work - I have no real idea whether it worked or not anyway. But I may have another play another day, I wrote some code which outputs the expressions required between two arbitrary "layers" of the calculation so I can use larger kernels than the radix-2 ones I was testing with.

Tagged hacking, opencl.
Friday, 22 March 2013, 05:46

Nvidia and OpenCL

I don't particularly follow much what Nvidia are up to anymore - for the last 18 months that I was subscribed to the 'gpgpu' mailing list of theirs it never mentioned OpenCL once, so I de-subscribed a few weeks ago because it contained nothing of interest.

I've been poking around anandtech/tomshardware too much recently (work-stress induced apathy mostly) and read a few articles about Nvidia's latest developer conference.

Well, no mention of OpenCL anywhere, and now they're pushing their CUDA stuff on mobile as well. The lack of OpenCL 1.2 support, and the poor showing in compute performance of their recent hardware makes it pretty obvious that they're not interested in it, but this really nails the coffin shut.

In comments people always claim that the poor OpenCL performance is just unoptimised drivers. However this doesn't hold much water - because the architecture of their driver design is such that both CUDA and OpenCL are simply thin layers above the same infrastructure. It would be like saying "Oracle only focuses on Java, and that's why Scala isn't as fast" - but any improvements they make to the JVM benefit Scala too.

The only alternative is too crazy to imagine: that they're deliberately knobbling the OpenCL performance. Anyone investing heavily in their super-computer infrastructure would do well to take such nonsense into consideration.

Risky move?

With fairly broad support for the HSA foundation (it seems?), and a computing model which is more advanced than CUDA, can they go it alone? Well apart from their crappy controller/game thing, it doesn't seem like anyone is too interested in Tegra4, so let us all just hope that it is a no on that one.

I'm a little bummed that the priorities at work have kept me away from OpenCL for a while, I had a single day back at it after some leave but then the focus changed again. Although I might be doing some more NEON assembly soon so it's not all bad.

Tagged rants.
Wednesday, 20 March 2013, 14:35

The farmer quest

I haven't had much time to work on anything this week, but I did get the "farmer quest" from Zabin's game implemented (or at least, the script for it). I'm a bit blank on the imagination at the moment, so by just getting old functionality working the problem becomes one of filling out the implementation and script api.

This was pretty nasty in Dusk script, but is relatively straightforward in javascript. I experimented with just using the java apis a bit here too, rather than only adding script-friendly functions; it's a bit harder to use, but is more powerful.

The 'quest' is operated by walking onto a location in front of the farmer, this is just attached to the action for the location. Rather than 'bounce' the player back a square I just let them stay there - actions only trigger when you enter a location, not for standing on one. Prevents some ugly flicker in the client.

if (trigger.isPlayer()) {
    var state = trigger.getInteger("farmerquest");
    var list = trigger.getAllItems("bugcorpse");

    if (state == 0) {
        // first meeting
        trigger.chat("Farmer says: It doesn't look like my crop is going to support me this year.");
        trigger.chat("Farmer says: Those $%$# bugs ate half my crop.");
        trigger.chat("Farmer says: I have a job for you, if you could kill those bugs and bring me back their corpse, i'll reward you for each one.");
        trigger.chat("Farmer says: If you bring me 20 corpse's at once and I'll give you something special.");
        trigger.setInteger("farmerquest", 1);

Since the state can be tracked explicitly, it's easier to test at what position the script is, without resorting to returning from the script. Actually, as a side-effect of the way scripts are invoked, it's impossible to return anyway - every script must be fully structured, just like the original Pascal. i.e. there is no 'return' or 'exit' keyword allowed, since the scripts are run as top-level code.

    } else if (state < 3 && list.size() >= 20) {
        // Check for 20 at a go
        for (var i=0;i<20;i++) {
            trigger.removeItem(list.get(i));
        }
        trigger.chat("Farmer says: That's 20 corpses, here take my Kaizer Blade for helping me.");
        trigger.setInteger("farmerquest", 3);
        trigger.addItem(game.createItem("kaizer blade"));

This bit can also just access the Java List interfaces directly to do bulk operations, as well as the wider 'game' object which has general utility functions. I do however have to make sure I design these public interfaces properly as for example the scripts are executing on another thread (eventually in a sandbox), so getting the mix right will be an ongoing experiment.

    } else if (state < 3 && !list.isEmpty()) {
        trigger.removeItem(list.get(0));
        trigger.chat("Farmer says: Great, you killed another Bug, here's 30 gold for your reward.");
        trigger.addGold(30);
        trigger.setInteger("farmerquest", 2);

A simple reward is trivial to implement; it also tracks whether you've done it at least once.

    } else if (state == 1) {
        trigger.chat("Farmer says: Have you collected any corpses yet?");
    } else if (state == 2) {
        trigger.chat("Farmer says: What are you doing back again, haven't you got work to do?");
    } else {
        trigger.chat("Farmer says: Thank you for your work, I think my crops will be safe for another year.");
    }
}

And finally a bit more detail for the default case since we track the state explicitly. And this version doesn't let the player keep killing the newbie enemies for easy cash once they've got the grand prize.

Update: I decided to keep up with this approach and i'm working on porting Zabin's game to the new engine bit by bit. I've imported the map just as another world which you can enter via a door and started on the tutorial. This will let me work on it in little pieces and also expose any implementation issues in manageable chunks.

e.g. When I added all the mobs from the original I hit a bug with the entity index and decided to change it completely. Fortunately, one of the first things I "fixed" was to abstract the map and map-related operations, which means I have complete flexibility on how the internals are implemented.

Previously it was implemented as a 2d index storing a pointer to the first entity at a given tile location. These were then linked using a next field in the object and managed using some singly-linked-list logic. For big maps with sparse entities it takes a lot of memory just for the array - 2MB for the original dusk map. I changed it to use a hash table keyed on an x+y pair, with the entities at each location stored in a LinkedList. Lookups will be marginally slower but it's made up for by the reduced footprint and ease of use, and because I'd abstracted the map previously I can always change it.

Update: Another thing I noticed is that doorways were a pita to write. For each one you need at least an enter and a leave action, as well as an enter and a leave location. And each action requires a one-line script. So I added another function in the map file which allows jumps without requiring a script to be run.

   x.y.goto=alias-name

It still requires 4 settings, but it doesn't need the two single-line scripts. This also means that map location aliases must now be globally unique.

Tagged dusk.
Sunday, 17 March 2013, 04:49

Scripts and conditions

As i've managed to get most of the guts of the game working - battles, internal indices, and so on, i've started to fill out the implementation by adding commands, and hooking up the scriptable actions. Even though i've only spent a few hours here and there it's coming together easily as there isn't much work to hook up a new script point. It's mostly just deciding on a convention and then sticking to it.

And to test the idea I'm simply trying to port over some of the simple scripts and objects, such as absinthe ...

Revisiting the on-drop script for absinthe:

order trigger "get absinthe"
removeItem trigger absinthe
chat trigger "You pour the absinthe out and stare in disbelief as it eats a hole in the ground."
endscript

Took me a while to work out that the first two lines simply remove the object from the game: the script is only executed after the default 'drop' action is executed - i.e. to leave it at the current map location. I decided to change the behaviour slightly so that if supplied, a script must implement the default action as well. This makes for a simpler and more obvious script in this case. Functions like 'removeItem' work on the exact object rather than symbolic names, which should reduce coding errors.

owner.chat("You pour the absinthe out and stare in disbelief as it eats a hole in the ground.");
owner.removeItem(item);

Then I looked at the on-use script:

order trigger "emote wails: Oh, my head!"
addconditionwithduration trigger tired 500
if > trigger inte 10
  inc trigger intelligence -1
end
removeitem trigger absinthe
endscript

Fairly straightforward, except I hadn't filled out the condition stuff - the logic was there just no script hookage.

But I started with the JavaScript and worked backwards from there:

owner.emote("Wails: Oh, my head!");
owner.setCondition("tired", 500);
if (owner.getINT() > 10) {
  owner.addINT(-1);
}
owner.removeItem(item);

Previously, item definitions included fields which defined the scripts to execute. I am scrapping that and scripts are just found based on conventions from the item name. So for example, these scripts are stored in onScript/absinthe.drop and onScript/absinthe.use respectively. This means I don't have to litter the objects themselves with references to names of files, and I guess I could also use it to implement an interface-based object model, particularly if I allowed scripts to store arbitrary persistent variables on items.

The existing condition mechanism defines conditions via a two-step process. Conditions are referenced by name, and then details about the conditions such as their firing rate and which scripts to execute are loaded from a file. I decided to scrap that and just let conditions be created on the fly, and use a convention to discover the scripts to execute. There aren't enough details to warrant the hassle required to maintain another directory full of files. So in this case, the "setCondition" is all that is required to set the condition, and the scripts which define the condition behaviour are defined in onCondition/tired.start, onCondition/tired.end, and onCondition/tired.tick. i.e. in a way very analogous to the item scripts, and one which will be repeated for all other script types.

Looking at the actual functionality for these scripts, the start/end is simply a notification to the player.

tired.start:

owner.chat("You suddenly feel very tired.");

tired.end:

owner.chat("Your normal level of conciousness returns.");

And the guts is in the tick script:

number tiredint * rand 150
number tiredint + tiredint 50
#
# You need an int of at least 50 to reduce the effect of tired
# and an int of 200 or more to be immune
#
if > tiredint trigger inte
    order trigger "sleep"
end
endscript

And the converted script, showing that anything with maths and logic is going to be a lot easier to write:

/*
* You need an int of at least 50 to reduce the effect of tired
* and an int of 200 or more to be immune
*/

var test = Math.random() * 150 + 50;

if (owner.getINT() < test) {
   owner.sleep();
}

Incidentally i'm considering making 'sleep' itself a condition, but at this point I haven't.

Current scripts also use conditions as a sort of general-purpose persistent state-holding variable, and I am going to remove that functionality and force the use of variables instead. To help enforce this, conditions will be visible in the client. This will leave conditions to be used for what they are intended: timed and/or periodic scripted behaviour.

I also decided to change the map alias stuff a little bit. Locations are still given symbolic names in the same way, but the visible/action/moveable script names are supplied directly. This lets 'true'/'false' be optimised, and still lets scripts be shared amongst locations. I also made the per-tile-type scripts follow the same conventions so everything is consistent.

Whilst looking at some of the existing scripts like the banks and so on, I noticed they add global commands to the game, but then have to have location checks on them to make sure they only activate at the correct locations. So another fairly simple extension may be to have per-map commands. On further thought I thought it might be more useful to have per-location commands implemented using an 'on-command' script, so then for example an ATM or bank teller for a certain bank could be implemented inside a single script, and then attached to any number of locations throughout the map. Since it's quite easy to add this, such a feature could also be added to tile-types or objects or even mobs instead ... (e.g. if you're standing on/next to an object/mob).

Tagged dusk, hacking.
Friday, 15 March 2013, 00:10

The network is the computer?

Apparently Google in their benevolent wisdom have decided to shut down Reader. It's not something I use but evidently it has a loyal following.

They'll whine a bit, then move on: like we all do. Adaptability is about the only positive trait of the human condition.

However maybe enough will realise this push to centralised network computing, or so-called "fog-computing", isn't really such a great deal after all. If they are as "tech-savvy" as they seem to think they are, they should've seen the risk anyway. One gives up an awful lot for a bit of convenience - and I think our modern lives (for westerners certainly, and the 'middle-class' everywhere else) are convenient enough as it is that we can cope fairly well without a little bit more. The fact that the biggest internet company of all can decide on a whim to kill a product used by thousands - who have no recourse whatsoever - should be a wake-up call for everyone.

As an aside ... it's nice to see our benevolent rulers at google deciding that advert-blocking apps are suddenly to be excluded from android's software library using language which is obviously there for the "good reason" of banning cracking software. Ahh lawyers, you suck. I just turn off wifi to block ads, saves power and most of them are for american shit I couldn't buy even if i cared to.

This is all just part of an on-going land-grab of a public resource - another tragedy of the commons - whereby private organisations are surrounding public culture with their own fences and gateways. All that crap we put in our blogs or our status updates or our "twats" may be crap but it's still public culture. Except when it's actually controlled and owned by a centrally managed group which locks it away in their own corral. It's all about control: making sure you read the information in the way they want you to.

Simply so you can't avoid the advertising!

For fuck sake, how many adverts do we need to be exposed to? Just how much junk can we buy? With Australia's extra tv channels now there is no regulation on the number of adverts and it's basically impossible to watch anything but the ABC anymore without using a recording or a mute button (i've already worn out my tv's remote). If i'm alone I barely turn the tv on anymore - it's not like there's much to watch most of the time anyway that isn't either a repeat you've seen too many times or some "it's the same every show" show which one eventually tires of.

All of these companies providing or facilitating centralised network services will eventually start to see increasing risk from this type of behaviour. For one, people could get sick of the spying although that doesn't seem to be about to happen any time soon. It would be nice to think they'd get sick of buying junk: but that is unlikely too. Customers might start to realise that their 'lightweight browser-based app' is actually a shitty user-experience and to boot, more of a resource-hungry monster than a stand-alone desktop or pocket application. For coding on DuskZ I've been running netbeans 7.3 on my 5-year-old+ X61 thinkpad w/ 2G ram, and although it's a bloody heavy application doing some pretty complex shit, it still manages to use less resources than firefox running a few web pages (blogger's simple front-end, and the google search page seem to be the worst offenders, but unfortunately firefox has no 'top' function to find out which tab is hogging the resources). Personal mobile platforms might always be about 10 years behind the performance of desktop computers, but pretty soon a 10 year old desktop computer will have more than enough processing power for such tasks to be run locally instead of remotely.

Here's an idea: a network isn't a one-way street. We don't all have to passively "consume" "content" from a centrally managed location. Well we could do that years ago; radio and television provided just that service. Every computer (including phones) connected to the network is just as much a part of the network as giant data-centres in tax-free low-cost-power locales. Every computer is capable of creating culture to add to the shared wealth of the public domain. Before the microsoft era computers were used far more for creative endeavours.

Yes I see the irony of posting this on blogger, using gmail, and with thousands of lines of code published in google code. I've long been somewhat uncomfortable about blogger and its future and the closing of Reader only increases the discomfort. I've even started and worked on a blogging/documenting server several times (I called it 'wanki'), maybe when I get some more time I will revisit it again - i've since learnt better ways to do much of what I was trying to achieve with it. Now if only the fucking NBN would get its act together and get me that fat pipe ...

PS I borrowed Sun's old trademark just because I could ...

Tagged rants.
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!