About Me
Michael Zucchi
B.E. (Comp. Sys. Eng.)
also known as Zed
to his mates & enemies!
< notzed at gmail >
< fosstodon.org/@notzed >
puppeteer
Took a break from the break from hacking and started looking through the old PS+ games I hadn't gotten around to downloading. Also cleared out some of the lesser ones; things i thought might be worth looking at further but now there's just not enough time for really good games so i know i'll never look at them again. Could always re-download anyway.
Anyway one of those games I downloaded was Puppeteer.
Very impressed with the game. By setting it on a small stage they managed to craft a quite exquisite piece of software. A very solid frame rate with NO TEARING, with very good use of per-object motion blur and decent AA. Very high quality textures and models with detailed and charming animation. Short loading times. Absolutely incredible presentation that makes it look like a real puppet stage setting up and tearing down between stages. I only found it too dark and had to turn up the brightness on my tv.
I kept thinking it must've cost a packet to make ... and in this day and age when a piece of shit flash or phone game passes as 'good enough', and even triple-a games are often a lot of technically incompetent snot, it's disappointing that the game didn't get much higher sales.
It's obviously a children's game but the narrator and players don't mind throwing in some humour for the adults and even though I wasn't thoroughly enthralled myself (just can't seem to care about anything much) I laughed out loud more than once. And of course the puppet-stage setting is perfect for breaking the 4th wall at any time, which it does often. The two player mode seems to revolve around controlling a faery (in one player mode you control both) so it's also a perfect play-with-your-kid game.
Definitely a case of restaurant food vs chucky-d's. I guess it's a question of whether there is a commercial place for this quality in the 6-12 age bracket? I guess not going by its sales. It deserves a PS4 port and technically (and for that matter aesthetically) it's already better than most PS4 games i've seen even without any enhancements whatsoever.
Anyway, definitely worth getting if you have kids and a ps3 or even if you just appreciate well-made software. It shows what games could be like if people didn't just put up with jank.
Others
I haven't even looked at the other PS+ stuff yet and I got a few disks last week, and bought some other PSN goodies: resogun+driveclub expansions, astebreed and jamestown. (+ maybe more on sale, i can't remember).
I'm pretty shit at resogun but it's just too good; the added modes aren't as varied as the base game but they're going to take a lot of playing to get good enough to know that for sure. As for driveclub I barely know most of the cars in the base game and although i've nearly finished it I don't really care much about the pre-set races for the most part (they can be good filler though) - and this is all the season pass adds (and mostly super-cars at that) - but well I'm a fan I guess.
jamestown is very 90s arcade stuff and like with those i'm kinda shit at it. Each level is fairly short but fun enough but I need quite a few more DEX points before I can weave my way through all the bullets on screen without just dying. It's no SWIV and i'll probably never finish it but it's there I guess (need another controller and visitors I would say).
astebreed was pretty much on a whim and I went in totally blind. It's the only thing I regret now, at least so far. It's fast and well presented but the gameplay mechanics are just not for me. There are 3 main weapons. A magical machine-gun thing; which you pretty much just constantly hold down the button for. A sword thing which causes damage and destroys yellow and purple (iirc) projectiles; so you pretty much just trigger it constantly, occasionally holding it down for a super-swish. And a ranged/targeted rocket-spray weapon which uses a mechanic I've seen in other Japanese games that I quite dislike. You hold down a button to target (fucking R3 at that) and then release to fire. i.e. you're constantly just holding that and releasing that as well. So the game pretty much devolves into maneuvering between the red projectiles or beams overlaid with a mass of visual noise whilst you're incessantly pumping the other weapons. There's a bit of timing/rhythm to it but to be honest it pretty much just sucks. ~$25 I could've spent on something else but I guess it's no big deal and maybe i'll find something i like if i play it more later (he says knowing full well he's spent $80 and never gotten around to even opening the box, and he has tons of other things he needs to go through).
No Man's Sky
ign and a few other places have had a few bits and bobs about this through July. Nothing really new but a few things are being fleshed out slowly, presumably as part of the PR build-up to launch.
I'm still blown away by the graphics here but for different reasons I might be blown away by a game like The Order. It's unclear if it's running on a PS4 but it feels quite alive for something running at that framerate (and thank fuck: no tearing). Much like it felt the first time I visited a city in ratchet and clank; except here it is not merely background decoration. Yet all some sad cunts can do is complain about a little bit of latency in the terrain generation around the periphery; jesus get a fucking life. Although it has it, I would myself be fine even if it had no anti-aliasing whatsoever; I find that aesthetic quite pleasing on low-texture models just as I did back on the Amiga.
At first I was a little shocked at how quickly the wanted level escalated into the player's death - but then I thought it's actually a good idea. I like the idea that the universe itself is trying to stop you being a fuckwit and just indiscriminately killing every animal that moves and mining every bit of land you see. This is very good.
It also means that it's decidedly not a tourist or walking simulator; it's actually a game with high risks for player actions or even just mistakes. All other things look gamey too; clockwork astro-physics, simple flying and shooting. Nice, although there can be satisfaction in becoming a master airliner pilot; most people just want to have the fun bits.
Crafting looks, well, like crafting in any other game. Seems the point is to buy or find blueprints for the upgrades, find or buy the raw materials, hit a button to combine. Buying new ships/suits/guns will provide a new thing to look at as well as a different number of upgrade/storage slots.
I like the cleanliness of the UI which bears some similarity to Destiny. But it also shares Destiny's shitty finger-cursor-thing. Why, when the dpad works so well for this kind of menu? Could redeem itself if it runs off the touch-pad.
Confirmation of rotating planets is nice; one hopes that extends to the whole solar system. I mean Damocles did that on an Amiga (w/ day/night) so it should be the bare minimum expected to be honest. Sure it will be a clockwork universe but that's good enough.
I'm rather pleased that it is a single-player game. The whole point of 2^64 planets is that everyone gets their own game to play - despite it all being in the same universe. I'm not sure how one will navigate given its size but it should also be interesting to see the galactic map fill out by other players - maybe they blink or light up as they are discovered or maybe you can only find them with effort or locally. And for those that want a 'social' experience; there's the whole fucking internet there to enable that, let alone just turning to those in the same room. I imagine there will be a lot of streams/video recordings and screenshots of this one; i'm sure it won't hurt that it has such an appealing facade and it may be the only way anyone else ever sees what you saw.
I'm actually pretty surprised how many people seem disappointed this game doesn't have a pre-defined story or NPCs. Surprised just doesn't do my feelings justice: baffled, confused, somewhat disgusted to be frank. Are people lacking even the smallest amount of mental maturity that they cannot partake in some activity without explicit directions? Jesus how the fuck do they know when to go do a shit? Minecraft demonstrates that at least some children still have some curiosity bones left so at least it's not everyone stuck with this severe mental handicap. Anyway; there's simply no physical or economical way to create a game this big and put any sort of meaningful pre-generated assets in it. I think there will be some lore related things but hopefully not too many, or any tutorial things. With only a dozen people making it, any extraneous fluff seems unlikely.
Or stuff like base building? 2^64 is a number so big it's clearly impossible to comprehend for most and they just relate it to something they know already without realising it has no worth.
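For a sense of scale, here is a back-of-envelope sketch in Java. The one-planet-per-second visiting rate and the class and method names are made up for illustration; nothing here comes from the game itself.

```java
import java.math.BigInteger;

class PlanetCount {
    // 2^64, the planet count quoted above.
    static final BigInteger PLANETS = BigInteger.ONE.shiftLeft(64);

    // Whole years needed to visit every planet at a fixed rate,
    // using a plain 365-day year (an arbitrary simplification).
    static BigInteger yearsToVisitAll(long planetsPerSecond) {
        BigInteger secondsPerYear = BigInteger.valueOf(365L * 24 * 60 * 60);
        return PLANETS
                .divide(BigInteger.valueOf(planetsPerSecond))
                .divide(secondsPerYear);
    }

    public static void main(String[] args) {
        System.out.println("planets: " + PLANETS);
        System.out.println("years at 1 planet/second: " + yearsToVisitAll(1));
    }
}
```

At one planet a second, non-stop, that comes out somewhere north of 500 billion years, which is why hand-placed content or per-planet anything just doesn't enter into it.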
Another good thing is that Hello Games' director Sean Murray seems very set on the game he wants to make and isn't interested in any outside noise. Once you start listening to whiners you can easily break your vision and end up with a broken game or just make silly mistakes. I think the driveclub director gives a little too much weight to internet forums for instance. From the IGN video view count it looks like he's onto a winner anyway, even if many people seem to let their imagination escape reality a bit too far.
It still bugs me that people pronounce this as 'nomansky', have they got 'diagon alley' disease from mr potter or something? (i always thought that was such an awfully cheesy and dumb bit of the movies).
Anyway i'm obviously rather interested in this game; excited even.
From what I know (and because of what I don't know) it seems like the game I never knew I wanted, but I actually always wanted, from the first day I worked out how to play a bare-disk pirate copy of Mercenary: Escape from Targ.
the future is micro?
Although i haven't been terribly active on it i've still been regularly mulling over a few ideas about the future of the stuff i did on google code, and this blog.
My plan some time ago was to setup a personal server locally - it wouldn't handle much traffic but I never got terribly much - and this is still the plan. The devil is of course in the details. If it turns out to be inadequate I can always change to something else later but given the site history I find this unlikely.
This choice is also intentionally something of a political one. Centralised control of information is becoming a significant societal problem and the cheap availability of high speed internet, computing power, and storage provides a means to tackle it head on via decentralisation.
Micro-Server
So after a few small experiments and mostly in-head iterations i've settled on implementing a stand-alone micro-server with an embedded db. I was going to play with JAX-RS for it but the setup required turned me off it. I think the tech is great and the setup is necessary but I just don't need it here. I have the knowledge and skills to do almost everything myself but at least initially i'm going to use the JavaSE bundled http server with berkeley db je as the transactional indexing and storage layer.
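As a minimal sketch of the kind of thing the bundled server allows (not the actual project code): the `com.sun.net.httpserver` classes ship with the JDK, while `MicroServer`, the `NODES` map standing in for the db, and the port number are placeholders invented for this example.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class MicroServer {
    // Hypothetical stand-in for the embedded db: request path -> page text.
    static final Map<String, String> NODES = new ConcurrentHashMap<>();

    // Bind to localhost only; port 0 asks the OS for any free port.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);
        server.createContext("/", exchange -> {
            String page = NODES.get(exchange.getRequestURI().getPath());
            int status = (page == null) ? 404 : 200;
            byte[] out = (page == null ? "not found\n" : page).getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/plain; charset=utf-8");
            exchange.sendResponseHeaders(status, out.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(out);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        NODES.put("/index", "hello from the micro-server\n");
        HttpServer server = start(8080);
        System.out.println("listening on " + server.getAddress());
    }
}
```

No container, no deployment descriptors, no annotations to scan: one class and a map. A real version would swap the map for JE lookups and add content types, caching headers and so on.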
After many iterations I have designed an almost trivial schema of 3 small core tables which sits atop JE which allows me to implement a complex revision history including branches and renames. Think more of a `fixed' cvs rather than subversion; copies aren't the basis of everything and therefore aren't `cheap', but branching and especially tagging is (revisions are global like svn). Earlier prototypes supported both cheap copies and branching but i felt they led to unworkable cognitive complexity and I realised that since I think the subversion approach just isn't a good solution at all I should not even try to support it. The work I did on DEZ-1 was for this history database and revisions are stored using reverse deltas. Although this is not the aim or purpose it should be possible to import a full cvs or subversion revision tree and retrieve it correctly and accurately; actually I will likely implement some of this functionality as a basis of testing as this is the easiest way to obtain effectively unlimited test data.
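To illustrate the reverse-delta idea only: the newest revision is kept in full and each older one is stored as a delta against its successor, so reading the head is free and reading history walks backwards. The delta here is a toy prefix/suffix trim, not DEZ-1, and the class and method names are hypothetical; branches and renames are left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

class ReverseDeltaStore {
    // Toy delta: keep `prefix` leading and `suffix` trailing chars of the
    // newer text and put `middle` between them to rebuild the older text.
    static final class Delta {
        final int prefix;
        final int suffix;
        final String middle;

        Delta(int prefix, int suffix, String middle) {
            this.prefix = prefix;
            this.suffix = suffix;
            this.middle = middle;
        }

        String apply(String newer) {
            return newer.substring(0, prefix) + middle
                    + newer.substring(newer.length() - suffix);
        }
    }

    private String head;                                 // newest revision, stored whole
    private final List<Delta> back = new ArrayList<>();  // back.get(i): revision i+1 -> revision i

    // Store a new revision and return its number (0, 1, 2, ...).
    int commit(String text) {
        if (head != null) {
            back.add(diff(head, text));
        }
        head = text;
        return back.size();
    }

    // Build the delta that turns `newer` back into `older`.
    static Delta diff(String older, String newer) {
        int max = Math.min(older.length(), newer.length());
        int p = 0;
        while (p < max && older.charAt(p) == newer.charAt(p)) {
            p++;
        }
        int s = 0;
        while (s < max - p
                && older.charAt(older.length() - 1 - s) == newer.charAt(newer.length() - 1 - s)) {
            s++;
        }
        return new Delta(p, s, older.substring(p, older.length() - s));
    }

    // Rebuild any revision by applying deltas backwards from the head.
    String get(int rev) {
        if (head == null || rev < 0 || rev > back.size()) {
            throw new IllegalArgumentException("no such revision: " + rev);
        }
        String text = head;
        for (int i = back.size() - 1; i >= rev; i--) {
            text = back.get(i).apply(text);
        }
        return text;
    }

    public static void main(String[] args) {
        ReverseDeltaStore store = new ReverseDeltaStore();
        store.commit("hello world");
        store.commit("hello brave world");
        System.out.println(store.get(0));
        System.out.println(store.get(1));
    }
}
```

The trade-off is the usual one: the common case (latest revision) costs nothing extra, and the cost of digging into history grows with how far back you go.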
Atop this will sit a wiki-like system where nodes are referenced by symbolic name and/or branch/revision. Having a branch-able revision tree may allow for some interesting things to be done: or it may just collapse in an unscalable heap. Binary data will be indexed by the db but storage may be external and/or non-delta where appropriate.
From very long ago I was keen on using texinfo as the wiki syntax; i'm still aiming for this although it will mean a good deal of work converting the blog and posts over even if automated. The syntax can be a bit verbose and unforgiving though so i'll have to see how it works in practice. There are some other reasons i'm going this route although it is unclear if they will prove useful or not yet; some potential examples include pdf export, response optimisation, and literate programming. It's likely i'll end up with pluggable syntax anyway.
The frontend will mostly be html+css and perhaps small amounts of javascript; but it's not going to be anything too fancy initially because I want to focus on the backend systems. Authoring is likely to be through external command line and/or desktop tools because I find the browser UX of even the most sophisticated applications completely shithouse and the effort i can afford them would render any I made even more pathetic.
The project itself will also be a personal project: it will be Free Software (AGPL3) and maybe someone else will find it interesting but providing a reference product for others isn't a goal.
Living prototype
This project actually started years ago as everything from a C based bdb prototype to a JavaEE learning exercise. In the distant past I have ummed and ahhed over whether it should be absolute bare-bones C or full-blown JavaEE. I think it may well never get much beyond these experiments but unless I start it definitely will not. So I thought it's about time to put a stake in the ground and get moving beyond experimentation.
So my latest current plan is to begin with implementing my internode software pages. A read-only version covers the basic response construction, namespace and paths, and file and image serving mechanisms. Then moving on to authoring touches on revision and branch management. Adding a news system will allow this blog to be moved across. Comments would make sense at this stage but aren't trivial if moderated, as I would desire. This is most of the meat and would also allow some version of the google code stuff to make it across. Then I could think about what next ...
The idea would be to go live as soon as I get anything working and just continue working on it 'live'; availability not guaranteed. A system in constant pre-alpha, beta, production.
I'm pretty sure i've got the base of the revision systems working quite well. Object names (& other metadata) and object data history are tracked separately which allows for renames and version specific meta-data. It's actually so simple i'm not quite sure it will support everything I need but every use-case i've tried to test so far has been solvable once I determined the correct query. I've still to get a few basic things like delete implemented but these are quite simple and the hardest part now is just deciding on application level decisions like namespaces and path conventions. Other application level functionality like merging is something for later consideration and doesn't need implementing at the db layer. I still need to learn some JE details too.
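A hedged sketch of how tracking names and data separately could work. The real 3-table schema isn't spelled out here, so the two in-memory maps, the `RevisionIndex` class and its method names are all assumptions, and branches are omitted. Every change bumps a single global revision counter, and a read at revision N picks the newest name and newest data at or before N.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class RevisionIndex {
    // object id -> (revision -> name valid from that revision onwards)
    private final Map<Integer, TreeMap<Integer, String>> names = new HashMap<>();
    // object id -> (revision -> content valid from that revision onwards)
    private final Map<Integer, TreeMap<Integer, String>> data = new HashMap<>();
    private int revision = 0; // global, svn-style

    // Name (or rename) an object; returns the new global revision.
    int putName(int id, String name) {
        names.computeIfAbsent(id, k -> new TreeMap<>()).put(++revision, name);
        return revision;
    }

    // Store new content for an object; returns the new global revision.
    int putData(int id, String content) {
        data.computeIfAbsent(id, k -> new TreeMap<>()).put(++revision, content);
        return revision;
    }

    // Content of whatever object carried `name` at revision `rev`, or null.
    String read(String name, int rev) {
        for (Map.Entry<Integer, TreeMap<Integer, String>> e : names.entrySet()) {
            Map.Entry<Integer, String> n = e.getValue().floorEntry(rev);
            if (n != null && n.getValue().equals(name)) {
                TreeMap<Integer, String> history = data.get(e.getKey());
                Map.Entry<Integer, String> d = (history == null) ? null : history.floorEntry(rev);
                return (d == null) ? null : d.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        RevisionIndex idx = new RevisionIndex();
        idx.putName(1, "/notes");     // revision 1
        idx.putData(1, "first");      // revision 2
        idx.putName(1, "/journal");   // revision 3: a rename, data untouched
        System.out.println(idx.read("/notes", 2));
        System.out.println(idx.read("/journal", 3));
    }
}
```

Because a rename only touches the name history, the content history never needs copying, and an old name still resolves when you ask for an old revision. The linear scan over names is just to keep the sketch short; a db would use a secondary index for that lookup.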
Initially the architecture will be somewhat naive but once I see how things start to fall out I want to move to a more advanced split-tier architecture based on messaging middleware. This is a long term plan though. I will aim for scalability and performance but am not aiming for "mega"-scalability as that is simply out of scope. Things like searching (lucene) and comments can be tacked on later. Being a personal server authentication/authorisation and other identity related security systems aren't an initial focus.
I've done the texinfo parsing a few times and my current effort is still some way from completion but i will probably just start with the basics and grow it organically as I need more features and only worry about completeness or exporting later on. I will start with processing everything live but resort to static snapshots if it proves too burdensome for the server. Actually the revision tree provides the perfect mechanism for efficiently implementing incremental snapshots so it will probably just fall out of testing stuff anyway.
The why of the what
I was prompted to think about this again by the only request about jjmpeg source i've had and i'm also in the middle of a 2-week break. I've spent a couple of those break days poking around but so far it hasn't really gotten its teeth into me so it will continue to be a slow burn (and i really do just want a short break).
Apart from setting up the hardware and making some `simple' decisions i'm quite close to having something running.
manufactured culture
So I ended up watching a lot of game related E3 broadcasts and related stuff again this year and I kinda wrote off a couple of days work by staying up till too late (it's winding down for this financial year so it's no problem) and here's a few random thoughts about it.
At a few points I was thinking "why am i doing this, i don't even care", but ahh well, you know how it is.
This was going to be a comment on some aspects of US culture but although it's mentioned in passing it's mostly just about games n shit.
For example I have almost no interest in Bethesda games almost entirely due to the negative experience of Oblivion which I found unplayable and dull, yet I watched their conference live (4am or something?). I thought Doom looked pretty good and very Doomish although I never played more than a little bit of the original demo (about then is when i stopped playing games until getting a PS2), Fallout looked a bit clunky, but it's unlikely i'll play either of them. I thought the whole falloutshelter "we made a free-to-play mobile game that doesn't suck!" thing was pretty obnoxious as it seems it's just as exploitative as any other game in getting people to spend money without realising they are, for nothing.
EA's was pretty cringeworthy. Apart from "Yarnie" Unravel which looked like a cool physics puzzle game with photorealistic setups. Need for speed looked pretty but the game itself was utterly atrocious. The best part is that I learnt of a new musical style "trap" which i can only guess means techno-rap, but whatever it's called it sounds like shit. But as is alluded to in the title of this post, apparently people like these sort of directed & manufactured experiences. There's no way I could play a game with such constant patronising encouragement, phrases such as "Doing great!" "You're nearly there!" - like the player is only 2 years old. I muted and ignored all the other sports stuff and I can't remember the rest.
Don't care about anything M$ or Nintendo so didn't bother. Apparently M$' big thing was backward compatibility which seems to be implemented using code translation (i.e. compilation of opcodes); which would have been a nice have on launch but even then would have been an investment they would never recoup (which duh, is why they didn't bother). But doing it now so late after launch just seems irresponsible to their shareholders. The WiiU was a failure from inception and that seems to be dying a silent death although the faithful haven't noticed yet (I was an Amiga guy so I know all about that).
The Dark Souls 3 trailer looks suitably bleak and terrifying although I haven't gone back to Demon's Souls since dying on some bridge for the 3rd time last time I gave it a go, so it's probably not something i'll ever have the time to play.
I saw most of the Square E[e]nix one but i can't remember much. Apparently there was some odd flare-up on the internets over the use of 'apartheid' in the press conference which was born of a combination of Americentrism, ignorance, and sheer stupidity (i.e. "hey look internet i'm so lazy i can't even be bothered to use a dictionary and just make up the meaning of words to suit my political agenda!"). But the word was used correctly in an appropriate context for the game, although the trademark of 'Mechanical Apartheid' (iirc) seems a little odd in the real world.
Sony.
Sony was sort of interesting in that it was the most impressive to fans but perhaps not the most interesting to me - they had so many bombastic announcements some of the smaller titles got missed. It was a real shame No Man's Sky release date announcement was apparently postponed which seemed to deflate Sean on stage visibly.
Seeing The Last Guardian at the start actually made tears start to swell in my eyes although they didn't break the surface. I'm not sure why really as although I do quite like ICO I only finished it once and couldn't even get past the first puzzle on the 'HD' version (pretty much the reason I got PS+ a couple of years ago). But the animation has such a believability to it i'm not surprised they couldn't get it running on the PS3 and the fact that it looks like a real living and breathing animal was quite mesmerising. On the shitty stream the graphics didn't look much better than I remembered from the PS3 version apart from some better textures but later I saw some better shots and the detail increase is significant.
I really couldn't give a shit about FF7 but maybe i'll play it if it gets modernised nicely (the PS1 game hasn't aged well, I bought it on PS3 ages ago but didn't get very far, but they probably can't modernise it without pissing some anal retard off) and I have no idea what Shenmue even is but apparently they have a following. I really think the whole kickstarter approach should have been far more transparent, but that's something for another day.
Star Wars (battlefront?) looks pretty authentically-good but i'm not really into Star Wars much myself and particularly not multiplayer games. I can see it selling absolutely gangbusters this xmas though.
Horizon - Zero Dawn however. Wow. And the more we find out the better it sounds. It looks fantastic, the lead character looks and moves amazingly, combat looks fluid and seamless, the world is interesting. And no fucking loading screens. The first time i saw it i wasn't a fan of the US accent of the actress (clearly chosen on purpose), but I suppose it's unavoidable in this day and age for something which is obviously going for massive mainstream appeal and the voiceover was just for the trailer anyway. Hmnm, wonder if we'll be able to play it with swedish or russian voice and english subtitles or something more fitting.
Guerrilla always seem to get a lot of undeserved fanboy hate but they are really a top-tier studio with amazing technology and art. It will be interesting to see where a fuck-ton of money and really talented and technically proficient team can take the RPG genre and how the competitors respond given that their usual jank-fests and silly mechanics (like, oh dear, when did 'romancing' become popular?) are going to look a bit pedestrian.
The main character - Aloy - is a pretty bold move for such a major title simply for being a woman (sadly). They are definitely leaving some money on the table due to this choice (sadly). She seems an excellent design though; visually distinct (attractive but not some bland k-pop model), lithe and athletic without being pornographic, strong and determined yet inexperienced and slightly nervous. Very noble 'PR' poses. A lot of effort was put into that reveal trailer to convey all this information and hopefully others noticed little details like her swallowing gulp just before leaping into the fray. It was a tiny and seemingly inconsequential detail which conveyed so much humanity in an entirely artificial character.
Dreams was just a bit too disturbing to me. Looked more like ``Nightmares'' and i thought even the polar bears were a bit creepy. The technology looks pretty groundbreaking in lots of ways though so it'll be something i'll keep an eye on and apparently a lot more is to be presented about the "platform" in Paris later this year.
Oh, I almost forgot Uncharted. Awesome, thrilling, and fun ride as ever; and great technical achievement to boot. I still haven't opened uncharted 3 ... although i really really intend to one day. Oh shit, I almost forgot again. So one of the big things about the uncharted demo was apart from actually being played live it had this really awesome car chase sequence that just seemed to go on forever - the amount of work for just a couple of minutes play just seemed astounding. People gushed (or disbelieved it wasn't a cut-scene). Not to downplay it in any way but it looked to me much like really excellent maze design. It appears as if you're making random heat of the moment decisions but in reality you're crossing a couple of connected areas which wrap around and always put you in the place you're supposed to go. I could definitely see how this was inspired by ICO although of course mazes go back to the very first computer games ever created and indeed maze games predate computers by some centuries (millennia?).
Morpheus finally has some real games but without a unit who can say what they're like. Battlezone looked nice though (Oh man would The Sentinel work well with it). I should really get hold of some sort of vr headset to play with; it was the sort of thing I dreamt of when I was playing on the C64.
Smaller things
So with all these big things most small things got shunted out of the way. And unfortunately all the 'small independent media' I saw has fallen into the same trap as the 'msm' in that they only want to cover the biggest stories and pretty much ignored everything. So the only way to find out about most of them was the Sony show-floor live broadcasts and their PR pitch spots.
Some are out this year but like some of the big ones most are next year.
But there are some kinda strange games being made.
Thumper was some sort of rez-ish rhythm game with a beetle (!!) racing down a hot-wheels track and some very simple controls where you have to time presses and it creates the music beats. It started so slow on the first level that it looked boring as shit but the host on the show was mesmerised by the graphics so much he kept forgetting to talk so there must be something to it. I think the fact that the end of level 'marker' (i didn't catch the context) was called "crack-head" really says it all; it's pretty much a game for having drug-induced trips to. The guy selling it seems to have had a bit too many mushrooms but he was trying pretty hard to sell it (sorry i'm being overly harsh here, it must be seriously nerve racking as hell trying to promote your baby). But it could have a bit of a cult following because technically it looks pretty competent.
Hellblade is being made by Ninja Theory as a 'low-budget triple-a title' but unfortunately although it looks like they're doing a fair job on the art and systems the subject matter pretty much guarantees it is going to be an abject failure commercially. Mental illness and entertainment don't really go hand in hand to me. It just seems such an odd decision to use serious mental illness as the basis for something so commercial as making games, this is the sort of art that needs patronage. If you know someone with serious mental illness you may not want to be reminded of it (one of my late cousins for example), and if you don't you may not want to face it. I don't know where the time and place is for such a thing but it just doesn't seem to be here to me, but I guess so long as it doesn't send them out of business they may have other goals in mind for their work which they will reach and be satisfied with; I hope so.
Divinity Original Sin (enhanced?) looks pretty neat. I think I saw something about it previously and my interest was piqued until i saw some of the menus and it just seemed way more complicated than I really wanted to bother with (i.e. too much a modern microsoft windows game). But I saw a more detailed playing and it looked pretty nice on console and reminds me of much older games so I might have to have a look when it comes out. It still looks pretty complicated and i'm not sure how well it will play solo but i'm inclined to try it out. I can't remember if i ever ended up playing Baldur's Gate 2 on PS2 - I have the disk - I should check to see if i could be bothered with that sort of game again.
Relativity was a weird little puzzle game that I just chanced upon at random on the Sony stream. A sort of gravity puzzle game with a really nifty wrap-around thing going on. The aesthetic is really nice with simple clean 'vector graphex' visuals which scream 'VR' although I think in VR you might end up falling off your chair. Anyway it looks very cool and the guy making it seems pretty switched on.
There were a few others on the streams i've seen so far but those were the ones that piqued my interest the most and I wouldn't mind seeing a bit more of The Tomorrow Children which has been expanded substantially from the alpha into a real game; although it's not that far out and I might just buy it instead.
Update: One I did forget: I saw Mad Max somewhere, maybe Sony or Gametrailers or both. I don't think it's my type of game but it did look interesting for that type of game. Very mad max 2 look to it (from what i remember, it's not a tie in to the current one) although the colours in the demo bit could use a bit of red (probably? not in australia?).
Gametrailers
Apart from the Sony stuff I watched an embarrassing amount of the gametrailers broadcasts. I tried a few other streams and the presenters or advertising was insufferable so they were pretty much the least-offensive left over. I don't really bother with reviews these days (if only because i rarely bother with games) but they also seem about the closest to objective in their reviews or at least the closest to my tastes. I didn't bother with giantbomb because they're just a bunch of arseholes, nor Geoff Keighley not because he isn't fairly likeable and a decent presenter, he's just too obviously licking arseholes and the advertising is a bit too irritating. I saw a couple from the "kinda funny" guys but they are just "kinda not funny at all". One is too campy and obnoxious and the other is just a fuckwit, and every time an ad came on the recording reset to the start(?). Youtube streaming is also pretty shit; it doesn't work at all on one of my machines and usually requires a login on the other (gotta get that identity tracking locked in and matched to your credit card), so i mostly used twitch which was fine for the most part.
Viewing the GT broadcast was kind of like being outside of a room of people having inane conversations and passively listening through a window. But anyway I did learn some strange things about american culture which probably explains some other strange things i've seen about the inability of people to grasp No Man's Sky. This is where I saw some interesting stuff about Divinity and some of the other big games.
So firstly, grown men actually have significant emotional attachment and genuine love for Disney characters and cheap plastic dolls (aka 'action figures'). I've seen quite a bit of interest in games like kingdom hearts which I never understood for a game I thought was targeted at 7 year old kids. This probably also helps explain mario's popularity despite being so old and stale (and lots of tv shows for the same reason).
Secondly, for people who apparently follow these things closely they seemed remarkably clueless and misinformed about details or too easily miss something from a trailer (apparently their trade). Something can be right before their eyes and they just don't see it. They had a big argument about No Man's Sky and the GT VO had a heated argument repeatedly saying 'well what do you do', or 'they haven't shown this before' when some of what he was complaining about was actually visible in the first Sony trailer, it just wasn't explicitly explained in intricate detail (things like mining, combat, economy, i.e. gameplay mechanics).
No Man's Sky
I'm not sure why it matters exactly but I was hoping to find out a bit more about No Man's Sky - and slowly bits and pieces are trickling out.
In earlier press we learnt that Sean spent some time in Australia in his childhood, but not only the typical Australian beachy experience but the full-blown outback which few of us locals have really experienced. This gives one a very different outlook on both scale and our position in the universe; the broadness of the horizon, personal isolation, and the spectacle of the milky way is something you can't experience in England or any city. The milky way is 'milkier' in the southern hemisphere to start with. I find cities pretty distasteful to start with but i also found Boston somewhat claustrophobic when outdoors and I also found the 'personal space' distance a little too close for comfort in the US.
Like we did as kids he also got pirated games on his Commodore 64 with no instruction manuals and basically the first response to starting a game was just to type every key in turn and write down what they did. He surely must have played Mercenary which was one of the first ever 3D 'open world' adventure games, which is all the buzz these days; and hopefully one that has a revival one day.
I'm really rambling here but what I'm getting at is that I can understand why he doesn't want to talk much about the mechanics of his game. It's not because they don't exist, it's because he wants people to learn for themselves because that learning experience is not only more enjoyable for him, it is also much more fundamental to the purpose of play. The journey and learning process is always more interesting than the end.
On forums and broadcasts including gametrailers a constant refrain has been 'but what do you do!' 'tell me why i should play this game!' This is absurd. First, make up your own fucking mind; you're an adult. Secondly, what would you think of anyone who refused to watch a soap opera like Game of Thrones for the sole reason that they didn't know how it ended? You'd think they were a stupid idiot, and rightly so. (fwiw i have zero interest in shitty soap-operas like that, but if people do more power to them).
The other odd refrain that constantly comes up specifically about this game is 'but why would i want to [go to the centre of the galaxy]' (the main stated end-goal of the game). Umm, because that's the fucking game? I mean why do you want to get to the end of any game level? It likewise has no bearing on the real world and likewise has no true point - OUTSIDE OF THE MECHANICS OF THE GAME.
It is also clear that plenty of people have absolutely no idea of the scale of the game and that lots of mechanics used in many games just can't work at that scale. "I want to build a home base", "I want to kill everything" (ugh, bloody americans), "I want to team up with my pals". The last one is odd because sharing the "experience" is something anyone can and will be doing anyway; just external to the game itself via this new-fangled thing called the internet.
There was an interesting interview with Sean where he said one of the things that commonly happens within the first hour of the game is that people just keep getting lost. They wander off into the wilds and then they realise they don't know where their ship is. With no constant GPS and minimap reminder people have forgotten how to use landmarks for navigation, particularly while in a game which adds even more design to stop you getting lost. And even in so-called 'open world' games the actual game worlds are either so small or they are designed in such a way as to forcibly orient you in a way that prevents you getting lost. Living in cities this is a skill that isn't terribly necessary but one that could easily cost you your life in the outback or any other wild place.
Another interesting point he made was in regards to NPCs and quests in games. No Man's Sky has no fixed quests or job boards (it would simply be impossible to do this) but instead most things you do can earn you currency in a more organic way. i.e. there is no one-time quest from some old man to collect 3 wayward hens but if you discover 3 fish on a planet you still effectively earn the same result. That's not the interesting bit; the interesting bit is that the former is actually a horribly illusion-breaking mechanic that shouts 'this is a scripted game' and breaks the illusion of any sort of role playing; despite this being the primary job method by which all role playing games have operated for decades. Because of the scale of the game a job board would be impossible, but by needing to remove it a game system can be put in place instead which does a better job of maintaining the illusion of the game world.
Anyway, it's terribly sad to see the state of some vocal (minority?) people who shit on games like this for having no reason to play it (it's the game, fucknut), or driveclub for being too linear. The point of driveclub isn't to earn all the trophies (which seems to be the main reason some people play games), it's to have FUN driving an exotic car around a cool location. And why the FUCK would you want to drive in a city? What a horrible horrible waste of a car! But i digress.
Who knows, I may only play it for an hour anyway, but I really want to play No Man's Sky when I can.
Oh so another thing as an engineer I take away from these game shows is just the sheer scale of the software undertaking of some of these games. Wow. Makes me feel like a bit of a shit coder, or perhaps just that i could be doing something more interesting than I am as that was the sort of thing I played with more than 25 years ago before working on 'desktop apps' (sigh).
... 75.
First time the scales hit 75kg this morning. Probably be up by the end of the day but i'll take it as written. I thought the weight loss was slowing down there as i seemed to be stuck at about 78 for a while but my 80kg post was only 5 weeks ago so it's still shedding at a fair clip. Can't remember exactly when I was last around this weight - probably about 2000 before I went to Boston and eating american food and their foul soft-drinks. It's further than I expected and i'm not sure how much farther I will or should take it, I have a light frame so a few more off wouldn't be excessive even if i might be approaching 'skinny' in the light of modern averages.
I'm barely eating anything yet still functioning about as well (or not) as I was before. Some days i get away with only a couple of handfuls of nuts, some watery/herby/spicy soup I make in a mug, and a single piece of toast; often bread and butter has been the basis of "the main meal of the day". Haven't had the need for afternoon naps for months despite not sleeping better or worse than before and I certainly feel less hungry all of the time than I did when I was eating too much - it felt like something wasn't quite right at the time but it didn't seem possible to correct it. But now even if i'm feeling "really hungry" it takes little to satiate that hunger and I can't finish full sized meals.
The gout seems under control for now but then i strained my other foot from favouring it for so long (@#$@$#) but that will clear up eventually. It's been too cold to do much outside so i'm not missing much.
Not much hacking the last few weeks. I still can't get SVM to work with OpenCL despite it working in some sdk samples. I'm doing some javafx tools for work and getting fairly proficient at that although I still get stuck with frustrating layout issues from time to time. Now I understand its usage i'm quite liking Tasks for a lot of things although running separate threads and processing loops is still the best solution for anything stateful; threads are quite the joy in Java and they're a great strength over most other languages.
So with not much else to do I've been putting regular time into DRIVECLUB (it's a really stunning game and I would rate it over GT for me personally), did a few afternoons of Final Fantasy 13 (i forgot how polished a proper big budget game can be, solid frame-rate, no tearing, and compared to the ps2 games it loads quickly and rarely too; pity the story and characters are awful and it's a bit grindy). I usually try some of the PS+ games but nothing has really grabbed me so far and although a few seem quite solid they are just not to my taste. But I haven't even run most of them. Barely watching tv. I watch some footy but for fucks sake I would kill for a separate audio stream without the fuckwit commentators messing it up (or ads). Abernathy and now mcquire is back too - i had hoped he'd gone when channel 9 was. So I often end up muting it for them or the next fucking supermarket or hardware "store" advert and then forgetting it's on till it's over. I actually looked into voice recognition on GNU so i could make a device that muted on command! I'm still thinking about it; maybe use whistling or clapping instead. Still reading regularly but mostly in bed, and i've moved on to some more enjoyable, better-written, and less abysmal stories for the moment.
hotspot code generation, optimisation and deoptimisation
As promised here is an article about the hotspot code generation using the disassembler plugin mentioned in the last post. I was nearly going to not do it but i'd already done some playing with it.
Unfortunately I had to use AMD64 instructions here; i think the ISA is pretty shithouse so I haven't bothered to learn it very well so i'm doing some guessing below. I even downloaded the APMs from AMD (i find the intel docs quite poor) to look some stuff up.
For the C code i'm using gcc 4.8.2 with -mtune=native -std=gnu99 and -Ox as indicated in the text.
The actual test calculates 1000x dot products of 2^20 elements each. For java i'm using System.nanoTime() and printing the best result across all runs. For C i couldn't be bothered with the gettimeofday() stuff so i'm just using the time command - over 1000 iterations the difference should be negligible and there are some interesting results regardless.
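For reference, the Java side of that measurement could be sketched roughly like this. It is a minimal sketch and not the harness actually used for the numbers in this post: the class name DotTiming, the random fill and the dead-code guard are all assumptions for illustration; only the 2^20 element size, the 1000 runs, System.nanoTime() and taking the best run come from the text.

```java
import java.util.Random;

// Hypothetical timing harness; names and data are illustrative only.
class DotTiming {
    static float dot(float[] a, float[] b, int len) {
        float v = 0;
        for (int i = 0; i < len; i++)
            v += a[i] * b[i];
        return v;
    }

    // Returns the best (smallest) time in nanoseconds over 'runs' calls.
    static long bestNanos(float[] a, float[] b, int runs) {
        long best = Long.MAX_VALUE;
        float sink = 0;
        for (int r = 0; r < runs; r++) {
            long t0 = System.nanoTime();
            sink += dot(a, b, a.length);
            long t1 = System.nanoTime();
            best = Math.min(best, t1 - t0);
        }
        // Touch the result so the work can't be thrown away as dead code.
        if (Float.isNaN(sink))
            System.out.println("unexpected NaN");
        return best;
    }

    public static void main(String[] args) {
        int n = 1 << 20; // 2^20 elements, as in the text
        float[] a = new float[n], b = new float[n];
        Random rnd = new Random(1);
        for (int i = 0; i < n; i++) {
            a[i] = rnd.nextFloat();
            b[i] = rnd.nextFloat();
        }
        long best = bestNanos(a, b, 1000); // 1000 dot products
        System.out.printf("best: %.3f ms%n", best / 1e6);
    }
}
```

Taking the best run rather than the average keeps the early interpreted and first-pass iterations out of the reported number.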
Simple loop
This is the starting function; obvious what it does.
public float dot(float[] a, float[] b, int len) {
    float v = 0;
    for (int i=0;i<len;i++)
        v += a[i] * b[i];
    return v;
}
A C version is identical apart from using pointers rather than arrays and some extra fluffy conventions.
float dot(const float *a, const float *b, int len) {
    float v = 0;
    for (int i=0;i<len;i++)
        v += a[i] * b[i];
    return v;
}
First pass
After some iterations hotspot will recognise this function could benefit from optimisation and this is what jdk8 spits out at the first compilation pass.
This is using gcc (AT&T) syntax so instruction operands are srca,[srcb,],dst rather than the more conventional dst,srca[,srcb].
.1: movslq %esi,%rdi
jae .exception0
vmovss 0x10(%rdx,%rdi,4),%xmm1
movslq %esi,%rdi
jae .exception1
vmovss 0x10(%rcx,%rdi,4),%xmm2
vmulss %xmm2,%xmm1,%xmm1
vaddss %xmm0,%xmm1,%xmm1
inc %esi
mov $0x7ffdffc00ce8,%rdi
mov 0xe0(%rdi),%ebx
add $0x8,%ebx
mov %ebx,0xe0(%rdi)
mov $0x7ffdffc00488,%rdi
and $0xfff8,%ebx
cmp $0x0,%ebx
je .2
.3: test %eax,0x15e4076a(%rip)
mov $0x7ffdffc00ce8,%rdi
incl 0x128(%rdi)
vmovaps %xmm1,%xmm0
cmp %r8d,%esi
mov $0x7ffdffc00ce8,%rdi
mov $0x108,%rbx
jge .4
mov $0x118,%rbx
.4: mov (%rdi,%rbx,1),%rax
lea 0x1(%rax),%rax
mov %rax,(%rdi,%rbx,1)
jl .1
;; clean up and exit
.2: mov %rdi,0x8(%rsp)
movq $0x1d,(%rsp)
callq some_function
jmpq .3
Of these 11 are for the loop itself, the rest seem to be for profiling the loop.
As far as it goes it looks fairly decent - pretty much gcc -O2 level of optimisation with array bounds checking performed at each array read.
Of course the profiling adds a lot of overhead here.
The following is the output for the inner loop of gcc -O2.
10: f3 0f 10 0c 87 movss (%rdi,%rax,4),%xmm1
15: f3 0f 59 0c 86 mulss (%rsi,%rax,4),%xmm1
1a: 48 ff c0 inc %rax
1d: 39 c2 cmp %eax,%edx
1f: f3 0f 58 c1 addss %xmm1,%xmm0
23: 7f eb jg 10
The only real difference apart from having no bounds checking is that it multiplies directly from memory rather than through a register. The latter is how every other mainstream cpu does it so that may have some bearing on it.
I can't easily do comparison timing of the loops (and it isn't very meaningful) but obviously the java will be slower here, and probably on-par with -O0 output from gcc.
Final pass
After it has gained some profiling information the result will be optimised - in this case it recompiles it twice more. The inner loop of the final pass is below:
.1: vmovss 0x10(%rbx,%r14,4),%xmm0
vmulss 0x10(%rcx,%r14,4),%xmm0,%xmm1
vaddss %xmm3,%xmm1,%xmm0
movslq %r14d,%r10
vmovss 0x2c(%rbx,%r10,4),%xmm2
vmulss 0x2c(%rcx,%r10,4),%xmm2,%xmm8
vmovss 0x14(%rbx,%r10,4),%xmm1
vmulss 0x14(%rcx,%r10,4),%xmm1,%xmm2
vmovss 0x18(%rcx,%r10,4),%xmm1
vmulss 0x18(%rbx,%r10,4),%xmm1,%xmm3
vmovss 0x28(%rbx,%r10,4),%xmm1
vmulss 0x28(%rcx,%r10,4),%xmm1,%xmm4
vmovss 0x1c(%rcx,%r10,4),%xmm1
vmulss 0x1c(%rbx,%r10,4),%xmm1,%xmm5
vmovss 0x20(%rbx,%r10,4),%xmm1
vmulss 0x20(%rcx,%r10,4),%xmm1,%xmm6
vmovss 0x24(%rbx,%r10,4),%xmm1
vmulss 0x24(%rcx,%r10,4),%xmm1,%xmm7
vaddss %xmm2,%xmm0,%xmm0
vaddss %xmm0,%xmm3,%xmm1
vaddss %xmm1,%xmm5,%xmm1
vaddss %xmm1,%xmm6,%xmm0
vaddss %xmm0,%xmm7,%xmm1
vaddss %xmm1,%xmm4,%xmm0
vaddss %xmm0,%xmm8,%xmm3
add $0x8,%r14d
cmp %r8d,%r14d
jl .1
cmp %ebp,%r14d
jge .done
xchg %ax,%ax
.2: cmp %edi,%r14d
jae .stuff0
vmovss 0x10(%rcx,%r14,4),%xmm1
cmp %r9d,%r14d
jae .stuff1
vmulss 0x10(%rbx,%r14,4),%xmm1,%xmm1
vaddss %xmm1,%xmm3,%xmm3
inc %r14d
cmp %ebp,%r14d
jl .2
.done:
So this has removed all the array bounds checking from inside the loop (it's elsewhere - too bulky/not important here). It's also unrolled the loop 8x and is using modern 3-operand instructions to stagger most of the operations for better throughput on typical RISC cpus (I have no knowledge of the AMD scheduling rules). And finally it tacks on a simple 1-element loop to finish off anything left over.
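Written back out as Java, the shape of that final-pass loop is roughly the following. This is a hand-written reading of the disassembly, not compiler output: the class and method names are made up, the loop limit is a simplification, and the hoisted bounds checks are only implied by the split into two loops.

```java
import java.util.Random;

// Hypothetical source-level view of the compiled loop; names are illustrative.
class UnrolledShape {
    // Reference: the simple loop from the start of the article.
    static float dot(float[] a, float[] b, int len) {
        float v = 0;
        for (int i = 0; i < len; i++)
            v += a[i] * b[i];
        return v;
    }

    static float dotUnrolled8(float[] a, float[] b, int len) {
        float v = 0;
        int i = 0;
        // Main loop: 8 independent multiplies, then the adds are chained
        // through the single accumulator in the original element order.
        for (int e = len & ~7; i < e; i += 8) {
            float p0 = a[i] * b[i];
            float p1 = a[i + 1] * b[i + 1];
            float p2 = a[i + 2] * b[i + 2];
            float p3 = a[i + 3] * b[i + 3];
            float p4 = a[i + 4] * b[i + 4];
            float p5 = a[i + 5] * b[i + 5];
            float p6 = a[i + 6] * b[i + 6];
            float p7 = a[i + 7] * b[i + 7];
            v = v + p0;
            v = v + p1;
            v = v + p2;
            v = v + p3;
            v = v + p4;
            v = v + p5;
            v = v + p6;
            v = v + p7;
        }
        // Tail loop: one element at a time for whatever is left over.
        for (; i < len; i++)
            v += a[i] * b[i];
        return v;
    }

    public static void main(String[] args) {
        int n = 1003; // deliberately not a multiple of 8
        float[] a = new float[n], b = new float[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) {
            a[i] = rnd.nextFloat();
            b[i] = rnd.nextFloat();
        }
        System.out.println(dot(a, b, n) + " " + dotUnrolled8(a, b, n));
    }
}
```

Since the additions stay in the same order as the simple loop, both methods should return the same bits. That is only a reading aid though; the hotspot listing above is what actually runs.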
Comparing this to the output of gcc -O3 ...
30: f3 0f 10 09 movss (%rcx),%xmm1
34: 41 83 c0 10 add $0x10,%r8d
38: 0f 18 49 50 prefetcht0 0x50(%rcx)
3c: 0f 18 48 50 prefetcht0 0x50(%rax)
40: 48 83 c1 40 add $0x40,%rcx
44: 48 83 c0 40 add $0x40,%rax
48: f3 0f 59 48 c0 mulss -0x40(%rax),%xmm1
4d: f3 0f 58 c1 addss %xmm1,%xmm0
51: f3 0f 10 49 c4 movss -0x3c(%rcx),%xmm1
56: f3 0f 59 48 c4 mulss -0x3c(%rax),%xmm1
5b: f3 0f 58 c1 addss %xmm1,%xmm0
5f: f3 0f 10 49 c8 movss -0x38(%rcx),%xmm1
64: f3 0f 59 48 c8 mulss -0x38(%rax),%xmm1
69: f3 0f 58 c1 addss %xmm1,%xmm0
6d: f3 0f 10 49 cc movss -0x34(%rcx),%xmm1
72: f3 0f 59 48 cc mulss -0x34(%rax),%xmm1
77: f3 0f 58 c1 addss %xmm1,%xmm0
7b: f3 0f 10 49 d0 movss -0x30(%rcx),%xmm1
80: f3 0f 59 48 d0 mulss -0x30(%rax),%xmm1
85: f3 0f 58 c1 addss %xmm1,%xmm0
89: f3 0f 10 49 d4 movss -0x2c(%rcx),%xmm1
8e: f3 0f 59 48 d4 mulss -0x2c(%rax),%xmm1
93: f3 0f 58 c1 addss %xmm1,%xmm0
97: f3 0f 10 49 d8 movss -0x28(%rcx),%xmm1
9c: f3 0f 59 48 d8 mulss -0x28(%rax),%xmm1
a1: f3 0f 58 c1 addss %xmm1,%xmm0
a5: f3 0f 10 49 dc movss -0x24(%rcx),%xmm1
aa: f3 0f 59 48 dc mulss -0x24(%rax),%xmm1
af: f3 0f 58 c1 addss %xmm1,%xmm0
b3: f3 0f 10 49 e0 movss -0x20(%rcx),%xmm1
b8: f3 0f 59 48 e0 mulss -0x20(%rax),%xmm1
bd: f3 0f 58 c1 addss %xmm1,%xmm0
c1: f3 0f 10 49 e4 movss -0x1c(%rcx),%xmm1
c6: f3 0f 59 48 e4 mulss -0x1c(%rax),%xmm1
cb: f3 0f 58 c1 addss %xmm1,%xmm0
cf: f3 0f 10 49 e8 movss -0x18(%rcx),%xmm1
d4: f3 0f 59 48 e8 mulss -0x18(%rax),%xmm1
d9: f3 0f 58 c1 addss %xmm1,%xmm0
dd: f3 0f 10 49 ec movss -0x14(%rcx),%xmm1
e2: f3 0f 59 48 ec mulss -0x14(%rax),%xmm1
e7: f3 0f 58 c1 addss %xmm1,%xmm0
eb: f3 0f 10 49 f0 movss -0x10(%rcx),%xmm1
f0: f3 0f 59 48 f0 mulss -0x10(%rax),%xmm1
f5: f3 0f 58 c1 addss %xmm1,%xmm0
f9: f3 0f 10 49 f4 movss -0xc(%rcx),%xmm1
fe: f3 0f 59 48 f4 mulss -0xc(%rax),%xmm1
103: f3 0f 58 c1 addss %xmm1,%xmm0
107: f3 0f 10 49 f8 movss -0x8(%rcx),%xmm1
10c: f3 0f 59 48 f8 mulss -0x8(%rax),%xmm1
111: f3 0f 58 c1 addss %xmm1,%xmm0
115: f3 0f 10 49 fc movss -0x4(%rcx),%xmm1
11a: f3 0f 59 48 fc mulss -0x4(%rax),%xmm1
11f: 45 39 c8 cmp %r9d,%r8d
122: f3 0f 58 c1 addss %xmm1,%xmm0
126: 0f 85 04 ff ff ff jne 30
12c: 49 63 c0 movslq %r8d,%rax
12f: 48 c1 e0 02 shl $0x2,%rax
133: 48 01 c7 add %rax,%rdi
136: 48 01 c6 add %rax,%rsi
139: 31 c0 xor %eax,%eax
13b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
140: f3 0f 10 0c 87 movss (%rdi,%rax,4),%xmm1
145: f3 0f 59 0c 86 mulss (%rsi,%rax,4),%xmm1
14a: 48 ff c0 inc %rax
14d: 41 8d 0c 00 lea (%r8,%rax,1),%ecx
151: 39 ca cmp %ecx,%edx
153: f3 0f 58 c1 addss %xmm1,%xmm0
157: 7f e7 jg 140
The main differences here are that it unrolls the loop 16x here. It only uses the 2-operand instructions - it uses fewer registers. It has also transformed the array indexing into pre-increment pointer arithmetic (in batches).
Well this definitely isn't a RISC cpu as that scheduling looks pants as everything keeps writing to the same registers. But x86 being so dominant has allowed a lot of money to be spent optimising the chip to run shitty code faster to make up for the compiler.
Benchmarks
Here are some timing results. All values are in ms for equivalent of 1 loop (or seconds for 1000 loops).
what         time (ms)
gcc -O0      4.86
gcc -O2      1.44
gcc -O3      1.44
java         1.60
time java    1.7
The last is using the 'time' command on the whole java loop. i.e. this includes the jvm startup, profiling, and compilation. This isn't too shabby.
Either way these times are pretty good vs effort - maybe one or the other is more tuned to the cpu I have vs intel stuff but it's really neither here nor there.
Unrolled loop
Actually what prompted the idea for this article was some results I had from unrolling loops 4x in Java. I subsequently found that unrolling 2x did just as good a job in this case so i'll do that here just for simplicity. The assembly is almost identical anyway as it just gets unrolled an additional 2x rather than 4x by the compiler.
public float dot(float[] a, float[] b, int len) {
    float v0 = 0, v1=0;
    int i = 0;
    for (int e = len & ~1;i<e;i+=2) {
        v0 += a[i] * b[i];
        v1 += a[i+1] * b[i+1];
    }
    for (;i<len;i++)
        v0 += a[i] * b[i];
    return v0+v1;
}
Final pass
And here's just the inner loop of the final pass:
.1: vmovss 0x10(%rcx,%r8,4),%xmm0
vmulss 0x10(%rdx,%r8,4),%xmm0,%xmm1
vaddss %xmm3,%xmm1,%xmm0
movslq %r8d,%r11
vmovss 0x2c(%rcx,%r11,4),%xmm2
vmulss 0x2c(%rdx,%r11,4),%xmm2,%xmm9
vmovss 0x24(%rcx,%r11,4),%xmm1
vmulss 0x24(%rdx,%r11,4),%xmm1,%xmm8
vmovss 0x1c(%rcx,%r11,4),%xmm2
vmulss 0x1c(%rdx,%r11,4),%xmm2,%xmm1
vmovss 0x18(%rcx,%r11,4),%xmm3
vmulss 0x18(%rdx,%r11,4),%xmm3,%xmm2
vmovss 0x14(%rcx,%r11,4),%xmm4
vmulss 0x14(%rdx,%r11,4),%xmm4,%xmm3
vmovss 0x20(%rcx,%r11,4),%xmm5
vmulss 0x20(%rdx,%r11,4),%xmm5,%xmm4
vmovss 0x28(%rcx,%r11,4),%xmm6
vmulss 0x28(%rdx,%r11,4),%xmm6,%xmm5
vaddss %xmm3,%xmm0,%xmm3
vaddss %xmm2,%xmm3,%xmm0
vaddss %xmm1,%xmm0,%xmm1
vaddss %xmm1,%xmm4,%xmm0
vaddss %xmm8,%xmm0,%xmm1
vaddss %xmm1,%xmm5,%xmm0
vaddss %xmm0,%xmm9,%xmm3
add $0x8,%r8d
cmp %r10d,%r8d
jl .1
So now it's unrolled the loop an additional 4x so that it looks the same at first glance. But now the moves have been spread across many registers rather than mostly going through xmm1. It runs quite a bit faster.
This is getting too long so i won't include it but the same simple modification applied to the C version also makes a difference - quite a big one. The generated code is almost identical apart from every second xmm0 being replaced with xmm1 - i.e. interleaved as written.
Benchmarks
And here's some benchmarks of this 'version'.
what         time (ms)
gcc -O0      2.76
gcc -O2      0.833
gcc -O3      0.735
java         1.00
time java    1.2
Conclusions
Well hotspot is pretty good, but could be a little bit better. And it seems mostly just to fall down on some seemingly simple areas like instruction scheduling (simple compared to the rest of the work it's done).
Although I don't have enough knowledge of the architecture here to state that the original scheduling isn't very optimal the benchmark results probably speak loud enough in that absence. It is clearly not optimal as the same machine code which interleaves the output register runs 2x faster in the C case. I don't really feel like translating this to assembly so i can see if some simple re-arrangement would make a difference.
But what is odd is that neither compiler is doing this on its own. One could argue (quite convincingly) that due to floating point peculiarities (addition is only weakly associative) both loops are not actually calculating the same result. In the case of hotspot however this argument is weak because the optimised version is already spreading the addition across multiple registers.
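The floating point peculiarity is easy to see in isolation. The following is a small stand-alone sketch (the class name and the values are picked purely for illustration and have nothing to do with the benchmark data) showing that regrouping a float sum can change the answer, which is why a compiler can't silently split one accumulator into two.

```java
// Illustrative only: float addition gives different results when regrouped.
class FloatAssoc {
    static float leftFirst(float a, float b, float c) {
        return (a + b) + c;
    }

    static float rightFirst(float a, float b, float c) {
        return a + (b + c);
    }

    public static void main(String[] args) {
        float a = 1e8f, b = -1e8f, c = 1f;
        // The big values cancel first, so the 1 survives.
        System.out.println(leftFirst(a, b, c));  // prints 1.0
        // The 1 is swallowed by -1e8 (float spacing there is 8), then cancels.
        System.out.println(rightFirst(a, b, c)); // prints 0.0
    }
}
```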
Lambdas & de-optimisation
This is getting long and the next part could probably go into another article but i've spent enough of my weekend on this so i'll get it out of the way now with a quick summary.
For simplicity I created the following simple 3-parameter map/reduce operation.
public interface FloatTrinaryFunction {
    public float applyAsFloat(float a, float b, float c);
}
public float reduce(float[] a, float[] b, int len, FloatTrinaryFunction func) {
    float v = 0;
    for (int i=0;i<len;i++)
        v = func.applyAsFloat(v, a[i], b[i]);
    return v;
}
And invoke it thus:
reduce(a, b, a.length, (float v, float x, float y)->v + x*y);
Opt and de-opt
In short, if you use up to two lambdas it results in equivalent code to the direct dot product equation - nice. But once you go to three or more it de-optimises the loop and reverts to a function call. It also spends more time in the compiler.
The following is what the deoptimised loop looks like:
.1: mov (%rsp),%rdx
mov %rdx,(%rsp)
cmp %r10d,%ebp
jae .exception0
mov 0x8(%rsp),%r10
vmovss 0x10(%rdx,%rbp,4),%xmm1
cmp %r10d,%ebp
jae .exception1
mov 0x8(%rsp),%r10
vmovss 0x10(%r10,%rbp,4),%xmm2
mov 0x18(%rsp),%rsi
xchg %ax,%ax
mov $0xffffffffffffffff,%rax
callq applyAsFloat
inc %ebp
cmp 0x10(%rsp),%ebp
jl .1
So it retains the array bounds checks inside the loop (bummer) and invokes the interface as a function call (expected), but it removes any profiling instrumentation that was present in the first pass (expected also) and generates the smallest code.
This hits around the 4.5ms mark.
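To make the 'three or more lambdas' case concrete, here is a self-contained sketch of the kind of program that sends three different lambdas through the same reduce() loop. The post doesn't say which lambdas were used in the test, so the sum-of-absolute-differences and max bodies are invented for illustration, as is the class name; whether a given JVM actually de-optimises here depends on its version and settings.

```java
// Hypothetical driver: three distinct lambdas through one call site.
class LambdaSites {
    interface FloatTrinaryFunction {
        float applyAsFloat(float a, float b, float c);
    }

    static float reduce(float[] a, float[] b, int len, FloatTrinaryFunction func) {
        float v = 0;
        for (int i = 0; i < len; i++)
            v = func.applyAsFloat(v, a[i], b[i]); // the call site in question
        return v;
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4};
        float[] b = {4, 3, 2, 1};
        // Dot product, as in the post.
        float dot = reduce(a, b, a.length, (v, x, y) -> v + x * y);
        // Two extra, made-up reductions so the loop sees three lambda classes.
        float sad = reduce(a, b, a.length, (v, x, y) -> v + Math.abs(x - y));
        float max = reduce(a, b, a.length, (v, x, y) -> Math.max(v, x * y));
        System.out.println(dot + " " + sad + " " + max); // prints 20.0 8.0 6.0
    }
}
```

Each lambda compiles to its own class implementing the interface, so the single applyAsFloat() call inside reduce() ends up seeing three receiver types.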
Conclusions 2
It's important to note that this is just a run-time decision made by the current version of hotspot - this could be changed or could be tweaked in the future. And as I showed in some previous posts it can be worked-around even with the current hotspot using some bytecode foo.
Given the prevalence of lambdas in java8 i suspect it is something that will gain some tuning attention in future revisions. It's not something one would change lightly so it will probably be based on profiling data and usage.
hotspot code generation
So I was curious as to whether you can get the code out of hotspot and I found you can - a hotspot plugin is included in the jdk source but not distributed probably due to license restrictions (GPL2 + GPL3).
After a short search I came across this nice post which pointed me in the right direction. My system is a bit out of date so his approach didn't work but it wasn't much effort to drop in binutils from a GNU mirror. I had to remove -Werror in a couple of makefiles for it to compile (why this is ever used in released software i don't know, too many things change in system libs for it to be portable).
I've only had a quick look but it's quite interesting. Pity about the horrid x86 ISA though.
It will do several iterations of a compilation before settling down - gradually changing (improving?) the code at each step.
Eventually it does all the things you'd expect: registerising locals, unrolling loops, using registers as pointers with fixed array offsets where possible. It will also move array bounds checks to outside of inner loops so that the result looks pretty much like compiled C and sometimes better as it can potentially inline anything and not just macros or stuff in includes.
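A tiny driver along these lines is enough to watch that happen on one small method. The command line in the comment is an assumption on my part since the post doesn't give the one used: -XX:+UnlockDiagnosticVMOptions and -XX:+PrintAssembly are standard hotspot flags, but the assembly output only appears once the disassembler plugin is somewhere the JVM can load it. The class name and loop counts are arbitrary.

```java
// Hypothetical driver for looking at generated code; run for example with
//   java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly HotLoop
// (flags assumed, not taken from the post; needs the disassembler plugin).
class HotLoop {
    static double dot(double[] a, double[] b, int end) {
        double v = 0;
        for (int i = 0; i < end; i++)
            v += a[i] * b[i];
        return v;
    }

    public static void main(String[] args) {
        int n = 1024;
        double[] a = new double[n], b = new double[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = 1;
        }
        double sum = 0;
        // Call the method often enough that hotspot compiles it,
        // and then recompiles it once it has profiling data.
        for (int r = 0; r < 20000; r++)
            sum += dot(a, b, n);
        System.out.println(sum);
    }
}
```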
In one test case it appeared to unroll a simple loop almost identically to the optimisation of a manually unrolled loop; but it ran quite a bit slower. Not sure on that one, might be register dependency stalls or perhaps I was looking at the wrong code-path as it was a fairly large function. I will have to try on simpler loops and mathematically they weren't strictly identical either.
Unfortunately it won't employ SIMD even when it's pretty close; I guess that's alignment related as much as anything due to intel's strict alignment requirements. I did notice recently that bytebuffers seem to be 16-byte aligned now though.
dot product
To start with something simple this is the loop i'll look closer at, it's the one I was looking at above.
double v=0;
for (int i=0; i<end;i++)
    v += a[i] * b[i];
And the manually unrolled version. This is not identical due to the peculiarities of floating point despite being mathematically the same.
double v, v0=0, v1=0, v2=0, v3=0;
int i=0;
for (int bend=end & ~3; i<bend;i+=4) {
    v0 += a[i] * b[i];
    v1 += a[i+1] * b[i+1];
    v2 += a[i+2] * b[i+2];
    v3 += a[i+3] * b[i+3];
}
for (; i<end;i++)
    v0 += a[i] * b[i];
v = (v0+v1+v2+v3);
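Wrapped up as methods, the two fragments can be run side by side to see how close the results are. This is just a sketch: the class name, the array length and the random data are assumptions, and the only point is that the two answers agree to within rounding rather than necessarily bit for bit.

```java
import java.util.Random;

// Hypothetical wrapper around the two loops above; names are illustrative.
class DotCompare {
    static double dotSimple(double[] a, double[] b, int end) {
        double v = 0;
        for (int i = 0; i < end; i++)
            v += a[i] * b[i];
        return v;
    }

    static double dotUnrolled4(double[] a, double[] b, int end) {
        double v0 = 0, v1 = 0, v2 = 0, v3 = 0;
        int i = 0;
        // Four independent accumulators, so the additions are regrouped.
        for (int bend = end & ~3; i < bend; i += 4) {
            v0 += a[i] * b[i];
            v1 += a[i + 1] * b[i + 1];
            v2 += a[i + 2] * b[i + 2];
            v3 += a[i + 3] * b[i + 3];
        }
        // Left-over elements when end is not a multiple of 4.
        for (; i < end; i++)
            v0 += a[i] * b[i];
        return v0 + v1 + v2 + v3;
    }

    public static void main(String[] args) {
        int n = 1003; // not a multiple of 4, so the tail loop runs too
        double[] a = new double[n], b = new double[n];
        Random rnd = new Random(7);
        for (int i = 0; i < n; i++) {
            a[i] = rnd.nextDouble();
            b[i] = rnd.nextDouble();
        }
        double s = dotSimple(a, b, n);
        double u = dotUnrolled4(a, b, n);
        System.out.println(s + " " + u + " diff " + (s - u));
    }
}
```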
I will look at the compiler output in another post. If I get keen i might see if i can build an ARM version for the nicer ISA.
sgemm, OpenCL
Yesterday I couldn't do much else so I played with some OpenCL code again. Just as my left foot was nearing recovery from gout ... I think I strained my right foot from too much walking or other activity and i'm immobile again. Argh.
With OpenCL I still haven't managed to work out why I can't use SVM - I have a C test, a C test based on extracting all the relevant code from the BufferBandwidth sample (from amd sdk), and a C++ test based on the BufferBandwidth sample; they all crash as soon as I try to invoke a kernel against an SVM buffer, although BufferBandwidth runs fine. I even tried compiling linux 3.19.8 - I had to modify the catalyst driver a little bit to get it to build but I had it working for a bit, but suspend was broken and then I made one too many changes to the linux config and i couldn't get it to boot again and eventually just gave up. The linux config system is pretty shit and any changes force a full rebuild so i was getting sick of that. When I did have it running it made no difference to the SVM stuff anyway.
OTOH I did find that using CL_MEM_USE_HOST_PTR works in much the same way anyway (in terms of java usefulness) - even without mapping or unmapping the values are being updated via the GPU device, so with any luck map/unmap are just noops. I didn't really look too much further though.
What I looked at instead was implementing a basic matrix matrix multiply, i.e. lapack's sgemm. Not really for any need but just curiosity; how much effort is required vs the payoff.
My test case was a C=AxB where each is (1024,1024) with row-major order storage. CPU is a AMD A10-7850K (kaveri).
java naive               20.5
java copy col B          1.5
java copy col B mt       0.50
opencl cpu naive         6.5
opencl cpu float4x4      0.49
opencl gpu simple        0.26
opencl gpu float4x4      0.045
java ojalgo (double)     0.48
java la4j (double)       1.7
java copy stream         0.40
java copy stream x4      0.30
- java naive
- This just implements the classic algorithm literally - i.e. for all rows of A, dot product of row by all columns of B, etc. The problem here is that each dot product scans a column of B in a potentially worst-case way in terms of cacheable memory access - this size hits that.
- java copy col B
- This just inverts the two outer loops so that it runs for all columns of B and then dot products that with all rows of A. It copies the current column of B in the outermost loop and so it only has to run once for every 2^20 accesses (in this case). Which is obviously worth it.
- java copy col B mt
- This replaces the outer loop with an IntStream.range(0, n).parallel().forEach(). It's not optimal memory-use wise but that makes little difference in this canned example (see the last couple of results). This is a trivial change and also easily worth it.
- opencl * naive
- This is a trivial opencl implementation that runs transposed with each work item calculating a single output value. The work size is 64,1,1 in each case. This shows that it isn't worth using OpenCL without a bit more effort on the algorithm.
- opencl * float4x4
- This is the most complex implementation whereby each work-item calculates a 4x4 output cell (calculates 4 rows at once). The number of columns and rows must be multiple of 4. It's basically just an unrolled loop using vector types; but the code is still quite straightforward. At least in this case - since the problem is embarrassingly parallel - the effort required is modest for the gains possible.
- java copy stream
- This replaces the outer loop of the copy col B loop with a custom non-gc-polluting spliterator over the columns of B. i.e. it allocates one row for each thread which is re-used for each call. This is moderately more work to set-up but the spliterator is re-usable. It's also possibly slightly misleading due to the way hotspot optimises callbacks.
- java copy stream x4
- Well what the hell - this unrolls the inner loop by 4x so only works on matrices with A_n a multiple of 4.
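Since there is no code in this post, here is a rough sketch of what the first three Java variants in the list might look like. It is reconstructed from the descriptions only and is not the code that produced the timings: the class and method names are invented, and allocating a fresh column per iteration in the parallel version is just one simple way to get the 'not optimal memory-use wise' behaviour described.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical sketch of C = A x B for n x n row-major float matrices.
class MatMul {
    // "java naive": walks a column of B directly, which is cache unfriendly.
    static void naive(float[] a, float[] b, float[] c, int n) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double v = 0; // double accumulator
                for (int k = 0; k < n; k++)
                    v += a[i * n + k] * b[k * n + j];
                c[i * n + j] = (float) v;
            }
    }

    // Copy column j of B once, then dot it against every row of A.
    static void mulColumn(float[] a, float[] b, float[] c, int n, int j, float[] col) {
        for (int k = 0; k < n; k++)
            col[k] = b[k * n + j];
        for (int i = 0; i < n; i++) {
            double v = 0;
            for (int k = 0; k < n; k++)
                v += a[i * n + k] * col[k];
            c[i * n + j] = (float) v;
        }
    }

    // "java copy col B": columns of B in the outer loop, one re-used buffer.
    static void copyColB(float[] a, float[] b, float[] c, int n) {
        float[] col = new float[n];
        for (int j = 0; j < n; j++)
            mulColumn(a, b, c, n, j, col);
    }

    // "java copy col B mt": same thing with the outer loop as a parallel stream.
    // Each column gets its own temporary, which creates some garbage.
    static void copyColBParallel(float[] a, float[] b, float[] c, int n) {
        IntStream.range(0, n).parallel()
                 .forEach(j -> mulColumn(a, b, c, n, j, new float[n]));
    }

    public static void main(String[] args) {
        int n = 256; // small enough to run quickly; the post used 1024
        float[] a = new float[n * n], b = new float[n * n];
        for (int i = 0; i < n * n; i++) {
            a[i] = (i % 7) - 3;
            b[i] = (i % 5) + 1;
        }
        float[] c0 = new float[n * n], c1 = new float[n * n], c2 = new float[n * n];
        naive(a, b, c0, n);
        copyColB(a, b, c1, n);
        copyColBParallel(a, b, c2, n);
        System.out.println(Arrays.equals(c0, c1) && Arrays.equals(c0, c2));
    }
}
```

Each output element is summed in the same order in all three variants, so the results can be compared exactly; only the memory access pattern and the threading differ.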
The opencl code is sub-optimal for the CPU case - something closer to the java implementation would make sense - I will try again at another time perhaps. I'm not that familiar with the compiler behavior or best processing model to use for the CPU driver but using vectors obviously helps. But if it's barely faster than Java there won't be much point. OTOH a 10x speedup using the kaveri GPU is a bit more interesting.
Sorry no code today - if i keep poking i might drop it into a zcl-samples thing later on. I'm sure there's plenty of (better) code out there in accelerated lapack libraries anyway.
Update: So before i posted this i came across the java matrix benchmark and the pretty simple 'java copy col b' is pretty close to ojAlgo running on this box. I only ran it on this same test and not the benchmark so i don't know if it's 'fast' on this machine although i imagine most of its perf advantages in multiply in the benchmarks is from multi-threading. I also had some time to blow so I tried the row stream and row stream unrolled versions just out of curiosity.
ojalgo only seems to do double arrays, which doesn't make much difference with this problem size apart from double the storage space. I'm using a double accumulator for the dot product fwiw.
Update: Bored. Looked at la4j. It's maven only but a quick makefile fixed that abhorrence. Anyway it's just a tiny bit slower in this case and surprisingly only single-threaded (for something which is `current', this really is surprising). It's using the same algorithm as "java copy col B" but it uses 2D java arrays for its dense matrix and creates a garbage copy of every column during operation rather than re-using the array. It looks like a pretty nice little library apart from a few odd looking decisions like these, especially the custom serialisation mechanism, and lack of threading. But there's a lot more to a matrix library than a multiply.
FWIW I didn't bother to include it in the above but on the weekend I also tried a direct ByteBuffer as storage. It takes a bit longer than the array backend for hotspot to optimise but it's quite close to the array version once it has. Or actually a bit faster in the mt case, for some strange reason.
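For anyone wondering what a direct ByteBuffer as storage means in practice, a minimal sketch follows. It assumes a FloatBuffer view over a native-order direct buffer, which is one common way to do it; the post doesn't say whether a view or ByteBuffer.getFloat() was used, and the class and method names here are made up.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical example of off-heap float storage; names are illustrative.
class BufferDot {
    // Allocate n floats outside the java heap, in the machine's byte order.
    static FloatBuffer allocate(int n) {
        return ByteBuffer.allocateDirect(n * Float.BYTES)
                         .order(ByteOrder.nativeOrder())
                         .asFloatBuffer();
    }

    // Same dot product as the array version, using absolute get(index).
    static double dot(FloatBuffer a, FloatBuffer b, int len) {
        double v = 0;
        for (int i = 0; i < len; i++)
            v += a.get(i) * b.get(i);
        return v;
    }

    public static void main(String[] args) {
        int n = 4;
        FloatBuffer a = allocate(n), b = allocate(n);
        for (int i = 0; i < n; i++) {
            a.put(i, i + 1); // 1, 2, 3, 4
            b.put(i, n - i); // 4, 3, 2, 1
        }
        System.out.println(dot(a, b, n)); // prints 20.0
    }
}
```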
faster faster loops
Given i haven't touched opencl for a while I thought i'd stop faffing about with threads and streams and see what this APU can do.
But I found a silly bug in zcl which rendered it broken so just mucked around playing with that and got nowhere with my original aim ...
I call a bunch of JNIEnv *A() functions, these take an array of jvalue's which should presumably be more efficient than walking a varargs list (if insignificantly so). But in an effort to clean up the way I was using it I broke it and hadn't gotten around to actually running anything until now. I will drop another zcl at some point but considering nobody's downloaded it i'm in no rush.
I also worked out some issue with the GPU driver, and possibly slackware. The mandelbrot demo works fine with javafx but other non-gui code just wasn't finding the GPU device. I couldn't work out what was going on but a strange error indicated it was probably some xfce session setup thing. I found an acceptable workaround in just setting export COMPUTE=localhost:0.0.
And then i spent the rest of the evening trying to work out why SVM wouldn't work on the GPU. It "works fine" on the CPU driver, but on the GPU, although other operations are fine, any kernel invocation leads to an insta-crash. After re-checking every code path i came to the conclusion that it's not the way i'm calling it, it just doesn't want to work.
And just now I tried a stripped down C implementation and it just crashes when I invoke a (do-nothing) kernel with an SVM argument :-/ Blast. I checked the BufferBandwidth sample and, once I figured out netbeans was sucking too much for it to run and closed it, it worked fine. After a pretty long look i can't see why mine isn't working so i must've done something really silly and small.
One part of svm - the common address pointers - isn't as useful from Java as from C anyway, but the ability to share buffers without explicit map/unmap calls in the fine grained case should be, particularly on this APU.
netbeans
Netbeans is really starting to struggle for some reason. I was doing a big cleanup of a prototype which the boss gave to the customer (sigh) and moving lots of code around and suddenly it decided I had no main class and wouldn't even compile. After cleaning caches and other junk it was 'just' a non-obvious parser error. But I still had to resort to emacs+makefile to go through the compile errors one by one until I could get it to run. And the line that got it working in netbeans again was just a reference to a deleted import - i'd moved it to a common library.
But then it just started scanning dozens of files (dozens of times each) every time i switch windows, pausing for 250-1000ms each time. Cleaning the cache made no difference and it's already on an SSD. And it's running out of memory constantly - which messes up the incremental compilation something fierce. The last thing I tried was the same thing I did at home: disabling a dozen or so plugins i'll never use. But it didn't make much long-term difference at home so i'm not confident it will at work either.
But opening the zcl projects back up at home has pretty much busted it here - it's constantly running out of memory and taking a second or so to save any file and often not compiling it. I mean, it's actually become unfit for purpose.
Maybe lambda parsing is throwing it for a loop; but why? Any parameter/type matching should be somewhat limited in scope, unlike C++.
Copyright (C) 2019 Michael Zucchi, All Rights Reserved.
Powered by gcc & me!