About Me

Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as Zed
  to his mates & enemies!

notzed at gmail
fosstodon.org/@notzed

Wednesday, 12 June 2013, 04:32

Into the cloud!

So yeah i've been a bit bored/insomniacal[sic] lately and reading the nets ... and one topical topic is the next set of game consoles from microsoft and sony.

I still can't believe how much microsoft ball(mer)sed-up their marketing message, but I guess when you live in such a bubble as they do it probably seemed like a good idea at the time. Sad that sony gets such cheers for merely keeping things the same and letting people SHARE the stuff they buy with their hard-earned. But when microsoft intentionally avoid the 'share' word it isn't so surprising. Incidentally the microsoft used game thing reeks somewhat of the anti-trust trouble that apple are currently in; although they seem to have made an effort to ensure it was regulator-safe by a bit of weaselling in the way they structured it. i.e. they facilitated the screwing of customers without mandating it.

But back to the main topic - the whole 'it's 4x faster due to the cloud' nonsense.

Ok, obviously they shat bricks over the fact that the PS4 is so much faster on paper. The raw FPU performance is 50% better, but I would suggest that the much higher memory bandwidth (~3x) and the sony hardware scheduler tweaks will make that more like 100% faster in practice, or even more ... but time will tell on that. With 1080P framebuffers, 32MB vanishes pretty fast - it's only just enough for 4x 8-bit RGBA framebuffers, or say a 16-bit depth buffer and 1xRGBA colour and 1xRGBA accumulation buffer. With HDR, deferred rendering, and high fp performance and so on, this will be severely limiting and it's still only a relatively meagre 100GB/s anyway. microsoft made a big bet on the 32MB ESRAM thinking sony would somehow over-engineer the design; and they simply fucked up (the dma engines are of course handy, but even the beagleboard has a couple of those and it can't make up bandwidth). As another aside, I lost a great deal of respect for anand from anandtech when he came up with the ludicrous suggestion that the 32MB of ESRAM might actually be a hardware associative cache. For someone who claims to know a bit about technology and fills his articles with tech talk, he clearly has NFI about such a fundamental computer architecture component. It suggests PR departments are writing much of his articles.
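As a rough check on the framebuffer arithmetic, here is a small C sketch. It assumes '32MB' means 32 MiB and that an HDR target is 16-bit-per-channel RGBA (8 bytes per pixel); both are assumptions for illustration, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for one render target of the given size and pixel depth. */
uint64_t target_bytes(uint32_t width, uint32_t height, uint32_t bytes_per_pixel)
{
    assert(width > 0 && height > 0 && bytes_per_pixel > 0);
    return (uint64_t)width * height * bytes_per_pixel;
}

/* How many whole targets of that size fit in a memory pool. */
uint64_t targets_that_fit(uint64_t pool_bytes, uint64_t one_target_bytes)
{
    assert(one_target_bytes > 0);
    return pool_bytes / one_target_bytes;
}
```

A 1920x1080 target at 4 bytes per pixel is 8,294,400 bytes, so four of them only just squeeze into 33,554,432 bytes; at 8 bytes per pixel only two fit.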

It learns from its mistakes?

So anyway ... most of the talk about the cloud performance boosting is just crap. Physics or lighting will not be moved to the internet because internet performance just isn't there yet - and the added overheads of trying to code it just aren't worth it. Other things like global weather could be, but it's not like mmo's can't already do this kind of thing and unless you're building 'Cyclones! The game!' the level of calculation required will be minimal anyway. However ... there is one area where I think a centralised computing capability will be useful: machine learning.

Most machine learning algorithms require gads and gads of resources - days to weeks of computing time and tons of input data. However the result of this work is a fairly dense set of rules that can then be sent back to the games at any time. Getting good statistical information for machine learning algorithms is a challenge, and having every machine on the network lets developers gather just that, and then feed the resulting rules back.

So a potential scenario would be that each time you play through a single player game, the AI could learn from you and from all other players as you go; reacting in a way that tries to beat you, at the level of performance you're playing at. i.e. it plays just well enough that you can still win, but not so that it's too easy. Every time you play the game it could play differently, learning as you do. We could finally move beyond the fixed-waves of the early 80s that are now just called 'set pieces'.

It would be something nice to see, although a cheaper and easier method is just to use multi-player games to do the same thing and use real players instead. So we may not see it this iteration: but I think it's about time we did.

Maybe some indie developer can give it a go.

sony or microsoft could also use the same technique to improve the performance of their motion based input systems, at least up to some asymptotic performance limit of a given algorithm.

But ...

However, microsoft have no particular advantage here as it isn't the 'cloud services' that are important, it's just that every machine has a network port. Sure having some of the infrastructure 'done for you' is a bit of a bonus, but it's not like internet middleware is a new thing. There are a bevy of mature products to choose from, and microsoft is just one (mediocre) player among many. And a 3rd party could equally go to any other 3rd party for the resources needed (although I bet microsoft won't let them on their platform: which is another negative for that device).

Actually ...

Update: Actually I forgot to mention that I really think the whole 'always online' and kinect-required gig in microsoft's case is all about advertising: it can see who is in a room, age, gender, ethnicity, and if you have an account registered even more details on the viewer. It could track what people in the room are doing during game-playing as well as watching tv shows and advertising; probably even where they're looking, what they're talking about, and what those tv shows/advertisements are. It doesn't take much more to "anonymously" link your viewing habits with your credit card transactions.

A marketer's wet dream if ever there was one. A literal "fly on the wall" in every house that has "one".

And if you think this is hyperbole, you just haven't been paying attention. Google (and others) already do all this with everything you do on the internet or on your phone, why should your lounge room be any different? I was pretty creeped out when I started seeing adverts that seemed to be related to otherwise private communications.

Let's just see how long before people start seeing advertising popping up (perhaps over the TV shows they're watching, or within/over games?) that matches their viewing habits and lounge-room demographics or what they were doing last night on the coffee table. And even if they don't get there "this generation", it's clearly a long-term goal.

So whilst one can do some neat stuff with the network, that's just a side-effect and teaser for the main game. Exactly like google and all its "free" services. Despite paying for it this time, you're still going to be the product. Even the DRM stuff is a side-show.

Prices

Well as usual $AUS gets shafted by the 'overheads' of the local market. But you know what? Who cares. They're both cheaper than the previous models even in face-value-dollars let alone real ones, and we don't treat our less fortunate workers as total slaves in this country (at least, not yet).

They might still be a luxury item but in relative terms they've never been cheaper. My quarterly electricity bill has breached $500 already and it's only going up next year.

The initial price of the device is always only a part of the cost (and for-fucks-sake, it is NOT a fucking investment), and a pretty small part with the price of games, power, the tv/couch, and internets on top.

I'm still not sure if i'll get a ps4: given the amount of games i've played over the last 12 months it would be pretty pointless. I've still got a bunch of unopened ps3 games - I think I just don't like playing games much, they're either too easy and boring, too much like work, or I hit a point I can't get past and I don't have the patience to beat a dumb computer and feel good about it (and i'm just generally not into 'competition', i'd rather lose than compete). A PS4 CPU+memory in a GNU/linux machine on the other hand could be pretty fun to play with.

Tagged games, rants.
Monday, 10 June 2013, 03:32

Clamping, scaling, format conversion

Got to spend a few hours poking at the photo-effects app i'm doing in conjunction with 'ffts'. I ended up having to use some NEON for performance.

One interesting solution along the way was code that took 2x2-channel float sequences (i.e. 2xcomplex number arrays) and re-wound them back to 4-channel bytes, including scaling and clamping.

I utilised the fixed-point variant of the VCVT instruction which performs the scaling to 8 bits with clamping below 0. For the high bits I used the saturating VQMOVN variant of move with narrow.

I haven't run it through the cycle counter (or looked the details up) so it could probably do with some jiggling or widening to 32 bytes/iteration but the current main loop is below.

        vld1.32         { d0[], d1[] }, [sp]

        vld1.32         { d16-d19 },[r0]!
        vld1.32         { d20-d23 },[r1]!     
1:
        vmul.f32        q12,q8,q0               @ scale
        vmul.f32        q13,q9,q0
        vmul.f32        q14,q10,q0
        vmul.f32        q15,q11,q0

        vld1.32         { d16-d19 },[r0]!       @ pre-load next iteration
        vld1.32         { d20-d23 },[r1]!

        vcvt.u32.f32    q12,q12,#8              @ to int + clamp lower in one step
        vcvt.u32.f32    q13,q13,#8
        vcvt.u32.f32    q14,q14,#8
        vcvt.u32.f32    q15,q15,#8

        vqmovn.u32      d24,q12                 @ to short, clamp upper
        vqmovn.u32      d25,q13
        vqmovn.u32      d26,q14
        vqmovn.u32      d27,q15

        vqmovn.u16      d24,q12                 @ to byte, clamp upper
        vqmovn.u16      d25,q13

        vst2.16         { d24,d25 },[r3]!

        subs    r12,#1
        bhi     1b

The loading of all elements of q0 from the stack was the first time I've done this:

        vld1.32         { d0[], d1[] }, [sp]

Last time I needed this I think I did a load to a single-precision register or an ARM register then moved it across, and I thought that was unnecessarily clumsy. It isn't terribly obvious from the manual how the various versions of VLD1 differentiate themselves unless you look closely at the register lists. d0[],d1[] loads a single 32-bit value to every lane of the two registers, or all lanes of q0.
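For comparison, the same load-to-all-lanes idea can be sketched in C with NEON intrinsics. This is only an illustration: scale_by_splat is a made-up name, the buffer length is assumed to be a multiple of 4, and whether a compiler emits the same VLD1 form for vld1q_dup_f32 is not checked here. A plain C fallback is included so it also builds without NEON.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Multiply n floats by one scalar that is loaded once into all four lanes. */
void scale_by_splat(float *dst, const float *src, size_t n, const float *scale_ptr)
{
    assert(n % 4 == 0);                                /* assumption: whole vectors only */
#if defined(__ARM_NEON)
    float32x4_t scale = vld1q_dup_f32(scale_ptr);      /* one value to every lane */
    for (size_t i = 0; i < n; i += 4)
        vst1q_f32(dst + i, vmulq_f32(vld1q_f32(src + i), scale));
#else
    /* portable model of the same thing: splat, then lane-wise multiply */
    const float scale[4] = { *scale_ptr, *scale_ptr, *scale_ptr, *scale_ptr };
    for (size_t i = 0; i < n; i += 4)
        for (size_t lane = 0; lane < 4; lane++)
            dst[i + lane] = src[i + lane] * scale[lane];
#endif
}
```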

The VST2 line:

        vst2.16         { d24,d25 },[r3]!

Performs a neat trick of shuffling the 8-bit values back into the correct order - although it relies on the machine operating in little-endian mode.
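A portable C model of that store may make the shuffle easier to see. It is a sketch, not the hardware definition: each D register is modelled as a 64-bit integer with byte 0 in the least significant bits (that is the little-endian assumption), and the names vst2_16_model and pack_d are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack eight bytes into a D-register image, byte 0 in the low bits. */
uint64_t pack_d(const uint8_t bytes[8])
{
    uint64_t d = 0;
    for (int i = 0; i < 8; i++)
        d |= (uint64_t)bytes[i] << (8 * i);
    return d;
}

/* Model of vst2.16 {dA,dB},[dst]: view each register as four 16-bit lanes
   and write the lanes alternately, low byte of each lane first. */
void vst2_16_model(uint8_t dst[16], uint64_t d_a, uint64_t d_b)
{
    assert(dst != NULL);
    for (int lane = 0; lane < 4; lane++) {
        uint16_t a = (uint16_t)(d_a >> (16 * lane));   /* one AB byte pair */
        uint16_t b = (uint16_t)(d_b >> (16 * lane));   /* one CD byte pair */
        dst[4 * lane + 0] = (uint8_t)(a & 0xff);
        dst[4 * lane + 1] = (uint8_t)(a >> 8);
        dst[4 * lane + 2] = (uint8_t)(b & 0xff);
        dst[4 * lane + 3] = (uint8_t)(b >> 8);
    }
}
```

With the AB bytes in one register and the CD bytes in the other, each pair of 16-bit lanes comes out as one ABCD pixel.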

The data flow is something like this:

 input bytes:        ABCD ABCD ABCD
 float AB channel:   AAAA BBBB AAAA BBBB
 float CD channel:   CCCC DDDD CCCC DDDD   
 output bytes:       ABCD ABCD ABCD

As the process of performing a forward then inverse FFT ends up scaling the result by the number of elements (i.e. *(width*height)) the output stage requires scaling by 1/(width*height) anyway. But this routine requires further scaling by (1/255) so that the fixed-point 8-bit conversion works and is performed 'for free' using the same multiplies.
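The scaling and the conversion chain can be modelled one lane at a time in plain C. This is a sketch under assumptions: the function names are invented, and the model takes the fixed-point VCVT to multiply by 2^8, round toward zero and saturate to the unsigned range (my reading of the ARM manual; worth checking against it).

```c
#include <assert.h>
#include <stdint.h>

/* Combined gain: undo the FFT round trip and map 0..255 onto 0..1. */
float output_scale(int width, int height)
{
    assert(width > 0 && height > 0);
    return 1.0f / ((float)width * (float)height * 255.0f);
}

/* Model of vcvt.u32.f32 qN,qN,#8: float to unsigned fixed point, 8 fraction bits. */
uint32_t vcvt_u32_f32_fbits8(float x)
{
    float scaled = x * 256.0f;
    if (!(scaled > 0.0f))                  /* negatives (and NaN) clamp to 0 */
        return 0;
    if (scaled >= 4294967296.0f)           /* saturate at the top of u32 */
        return UINT32_MAX;
    return (uint32_t)scaled;               /* truncates toward zero */
}

/* Models of the two saturating narrows. */
uint16_t vqmovn_u32(uint32_t v) { return v > 0xffffu ? 0xffffu : (uint16_t)v; }
uint8_t  vqmovn_u16(uint16_t v) { return v > 0xffu ? 0xffu : (uint8_t)v; }

/* One lane of the loop: scale, convert with lower clamp, narrow with upper clamp. */
uint8_t fft_value_to_byte(float v, float scale)
{
    return vqmovn_u16(vqmovn_u32(vcvt_u32_f32_fbits8(v * scale)));
}
```

A full-scale value lands on 1.0 after scaling, becomes 256 in fixed point, and the final narrow saturates it to 255.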

This is the kind of stuff that is much faster in NEON than C, and compilers are a long way from doing it automatically.

The loop in C would be something like:

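#include <complex.h>
#include <stdint.h>

static float clampf(float v, float l, float u) {
   return v < l ? l : (v < u ? v : u);
}

/* a, b: the two complex planes after the inverse FFT; d: 4 bytes per element.
   (pack_row is just an illustrative name.) */
void pack_row(const float complex *a, const float complex *b,
              uint8_t *d, int width, int height) {
    /* undo the width*height gain of the forward + inverse FFT */
    float scale = 1.0f / ((float)width * height);
    for (int i=0;i<width;i++) {
       float complex A = a[i] * scale;
       float complex B = b[i] * scale;

       /* clamp each component to the 8-bit range before narrowing */
       float are = clampf(crealf(A), 0, 255);
       float aim = clampf(cimagf(A), 0, 255);
       float bre = clampf(crealf(B), 0, 255);
       float bim = clampf(cimagf(B), 0, 255);

       d[i*4+0] = (uint8_t)are;
       d[i*4+1] = (uint8_t)aim;
       d[i*4+2] = (uint8_t)bre;
       d[i*4+3] = (uint8_t)bim;
    }
}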

And it's interesting to me that the NEON isn't much bulkier than the C - despite performing 4x the amount of work per loop.

I set up a github account today - which was a bit of a pain as it doesn't work properly with my main browser machine - but I haven't put anything there yet. I want to bed down the basic data flow and user-interaction first.

Tagged android, beagle, code, hacking, picfx.
Friday, 24 May 2013, 02:20

on google

So google have decided to disable downloads on google code.

So I have decided to stop using it.

... although as yet I have no concrete plans or timeline for when this decision will take effect.

Whilst they claim it's about abuse, one can only assume that is just a "likely-sounding excuse" for what in reality is just another straight-up lie from the PR department of a supra-national conglomerate, and it's really just a way to cut costs and promote their 'drive' service (a useless microsoft/apple only service as far as i'm concerned).

Nobody seems to have reported that they have also gimped their POP interface to gmail a couple of days ago. No more UID support. This makes POP a lot less reliable/useful as a mail store (although in honesty it was never designed for that purpose). I proceeded to delete all the mail in gmail to help them free up some disk space.

I guess over-all the writing is on the wall. We all know that at some point 'google account' will mean 'google+', and blogger may be retired at any time.

So it seems my on-going-but-totally-lax search for alternatives to 'everything google for convenience' just got another big kick up the rump-side.

As my projects are all pretty small and low-volume I might look at a local solution because every network based solution faces the same problem. I have a couple of beagleboards doing nothing although getting a running and secure-enough system might be more pain than it's worth.

It's a bit of a pain to have to deal with.

Tagged beagle, dusk, hacking, imagez, java, javafx, jjmpeg, mediaz, pdfz, puppybits, rants, readerz, socles, videoz.
Tuesday, 21 May 2013, 09:03

on build systems

So i'm kind of baffled by gradle.

"power and flexibility of ant" with [enforced] "conventions of maven".

Sounds like it cherry picked the two worst parts of both outside of using XML!

Actually it looks ok enough for simple projects, but then again pretty much every tool is because solving simple problems is always ... simple. However I think the decision to go with implementing it in a scripting language is just going to lead to some pretty nasty long-term maintenance problems.

The only valid argument for something like ant is that the configuration files are machine readable (even if they aren't human readable!), which can lead to tooling support (ok, ant isn't very machine readable anyway, i'm just stating that it could be valid if they did it right). So it's kind of strange that gradle eschews that for something which is about as parseable as a batch file.

Of course it's the flash new kid on the block so it will go through a rapid adoption phase, but like every other tool before it, cracks will then start to appear.

I'm also a little baffled by the claim that somehow groovy is just java and so it's easier for java developers. Doesn't look anything like java to me. At all. Actually even if it were true, i think that would be a problem not a benefit. Java is just not the right language to use for the problems that build systems solve.

At least it's better than ant, but that's a pretty low bar. At best ant isn't much better than a 'build-all.sh' file, and demonstrably worse in many ways.

automake

I've put a few hours into getting somewhere on the java automake stuff. However I seem to have got stuck in an extended discussion on how a zip file works. The java build process is so simple I don't think anyone who is only familiar with C can grasp it.

I guess the main impression I get is that there isn't a particularly strong desire for simplicity vs 'the way we do it', which is a bit frustrating. If I end up with something I wouldn't want to use myself there doesn't seem much point. And given that in the intersection of the sets of 'i write java' and 'i want to use makefiles' and 'using automake isn't utterly and completely out of the question' i'm probably one of about a dozen unique and beautiful snowflakes, there isn't much hope if i'm not interested myself. Actually i may not use it anyway.

So although earlier I was more optimistic now i'm not sure where it's headed. I have some fragments which do part of the job but given the difficulty i've had in explaining this simple external stuff i'm not sure I'm mentally up to trying to create and then explain any code inside automake.in. I'm not really that thrilled with the idea of trying to provide a complete patch anyway.

Most (big) projects seem to want every potential contributor to kow-tow to the whims of some god-like maintainer as if you're the one who should feel privileged that they should deign to even entertain the idea of you doing free work for them. I'm ashamed this is exactly how we did things in Evolution and now regret it. There's quite a difference between a casual contribution and a long-term maintainer. I have no idea if automake is like that, but my patience threshold is pretty low these days so it wouldn't have to be for me to suddenly not give a shit (i get paid to put up with crap, it's not something I need to volunteer for).

Tagged rants.
Tuesday, 14 May 2013, 02:46

So I finally wrote a game ...

Ok, so it's just a bash version of hangman for the olimex weekend coding challenge, but it's still a complete game, including opening screen/instructions, a computer brain and even a closing animation on the credits.

Welcome to hang-man bash.

       /--|
       |  o /
       | /|
       | / \
      ---

I was going to do a java version with graphics, or even an android one for the hell of it, but the inner loop of the bash solution was too elegant to not just use that. Yay for grep and sed, and shuf is pretty neat too.

Although I don't have any olimex hardware I check up on the blog once in a while to see if anything interesting is happening, and the coding challenge is quite a nice little idea. I might suggest something similar for the parallella project.

In other news I thought I would look at trying to improve the Java support in automake as nobody else seems to want to. The main issue is just coming up with a tidy set of conventions and deciding what features it has. I'm hoping to come up with something tidy and useful, but with so many possible solutions it might take a while for a good one to coalesce. An on-going journey.

I've also started doing a bit of paid-for work on libffts post the JNI stuff I contributed. Won't replace my day job but the opportunity arose. An android app is one goal, but more on that later.

Tagged code, hacking.
Wednesday, 01 May 2013, 11:26

Reading comprehension, hUMA, NUMA, HSA, FSA, WTF?

I really need to find something better to do in my spare time than read ars "tech" nica and the like, but whilst doing a pass over the confusing front-page I came across an article about AMD's hUMA press. At least the front page isn't as bad as anandtech - i'm not sure what 'pipeline stories' are supposed to be, and to be honest i'm not sure why I bother reading a site which is full of computer case and psu reviews (ffs) and otherwise rather personally biased coverage of pretty random topics.

Anyway back to the arsetechnica piece. Pretty lazy article all round but I guess it summarised some of the points.

The real laff is with the comments.

Quite a few people seem to be getting "hUMA" confused with "NUMA". Hint: The N is for "NOT". Detail: Non-Uniform-Memory-Access is exactly the opposite to Uniform-Memory-Access which is the UMA part of the hUMA acronym.

NUMA is a way to add a lot of memory to a system with a lot of processors and not be bottlenecked by concurrent access issues (this is very much a good thing, it scales very well). UMA just makes the memory fast enough that the concurrent access shouldn't matter and then puts everything on the same memory ... (but it can't scale as well).

The rest of the comments just show that nobody knows what the 'h' means either. Probably understandable, it's a bloody horrid acronym and the article goes no way to explaining what's going on beyond the one set of slides in that press pack - however the information is readily available on AMD's site.

i.e. the h is for heterogeneous, as in HSA ... which is the other side of the coin. Another mouth-full at Heterogeneous System Architecture (off the top of my head, could be off a bit - i'm not a journalist).

In a nut-shell, AMD and the other HSA co-conspirators are working on turning their custom processors, DSPs, FPGAs, and GPUs into first-class CPU-compatible co-processors. They will all need to share the same virtual (and protected) address space that the CPU does. They will need to support a coherent cache (at some level, L2 at least). Obviously (like duh) this will require operating system support although apart from the CPU I would suspect it can just be hidden in the driver. Personally I hope the coherency isn't too fine-grained otherwise it will be a bottleneck on its own.

And the other big part (from the last information I read on it at least) is that HSA uses a common assembly language/binary format/bytecode which can be re-targeted to different platforms cheaply, at run-time. So if the hardware provides the resources required, it will just run from a single compile. Although I suspect for performance it will have to target 'classes' of hardware, since to get good GPU performance you really need to write things very differently. I presume this will be capability based, on things like LDS memory.

Obviously AMD have to do this so that developers are able to target legacy Intel/PC hardware for free as well since neither Intel nor Nvidia are part of HSA - nor are they likely to be if they have any choice in the matter since it's such a big benefit to AMD's technology.

I think the commenters are also missing the point on just how much GPUs and CPUs have already converged. CPUs keep getting a wider MMX, as well as 'hyper-threading' and so on. And GPUs now have scalar units running the show, pre-emptive threading (in addition to the super-hyper threading they already have) and other processor features. The new GPUs will be capable of directly executing other languages like Java or Python or whatever - how those would handle vectorisation is another issue.

Anyway ... man, I hope they can pull it off. Right now working with a GPU it's like trying to solve every transport problem with a freight-train. Sure you can get a lot of work done but it's not the best suited tool to every transport job - sometimes you can just walk. Like everything in the peecee wintel world getting to this point has been the product of throwing enough hardware and power at a problem until the architectural inefficiencies are inconsequential. This isn't good system design unless you're trying to sell the big hardware parts that drive it (i.e. you're intel).

The technology is great. The challenges are great. The wintel inertia which must be overcome is great too. The challenge of making the hardware easy enough to programme that all developers can take advantage of it ... is nigh on insurmountable.

With lambdas and the parallel collections Java could be a perfect fit. Well that language will be. With the JVM being so friggan complex, hopefully the implementation won't be a decade getting there as it was with cpus.

Tagged java, opencl, rants.
Thursday, 25 April 2013, 02:15

0k503

This is the 503rd post to this blog.

I was going to do a 'status update' on everything at the 500 mark, but I just can't be bothered right now and it's not that interesting anyway.

Which pretty much sums up everything else at the moment too.

After some intense activity at work and home i'm a little drained so taking it easier for a bit. Going from sprint to winter always seems to trigger a bit of don't-care too.

Tagged biographical.
Sunday, 21 April 2013, 23:40

In perspective

In the USA alone, around 12 pedestrians are killed every day and a further 200 or so injured (information readily found from official sources).

Every day.

And yet those deaths and injuries don't receive wall-to-wall scaremongering news coverage and demands for more oppressive law enforcement.

Tagged politics, rants.
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!