Thursday, 21 June 2012, 12:12

arm, tegra3, neon vfp, ffmpeg and crashes

So I just did a release of jjmpeg including the android player ... and then a few hours later finally discovered the cause of crashes I've been having ...

Either FFmpeg, the android sdk compiler or the Tegra 3 processor (or the way I'm using any of them) has some sort of issue which causes bus errors fairly regularly but never repeatably - possibly because of mis-aligned accesses. Unfortunately when I compile without optimisations the build fails, which makes it a bit hard to debug ... I got gdb to run (once only though; subsequent runs fail) and got a half-decent backtrace, but optimisations obscured important details.

Anyway I noticed that 0.11.1 has a bunch of ARM work, so I upgraded the FFmpeg build and mucked about with the build options for an hour, trying to suss out the right ones and to see how various ones worked.

Short story is that building for armeabi-v7a causes the crash to appear (with any sort of float setting - vfp, neon or soft), and dropping back to armeabi fixes everything.

Unless I can get better debugging results I think I'll just stick with armeabi for the foreseeable future. I can't find anything recent about these types of problems, so perhaps it's just my configuration, but I really just don't know enough about the ARM specifics at this point to tell either way.

Well, that's enough for today.

Update: I spent another day or so on this and finally nutted it out. It was due to alignment problems - but it was odd that it happened so rarely.

As best I could work out, ARMv6+ allows non-aligned memory accesses, but the standard ARM system control module can be programmed to cause faults instead. And just to complicate matters, the ARM linux kernel has the ability to catch the faults and implement the mis-aligned access manually, and this behaviour can be configured at run-time via /proc. It seems the kernel on my tablet is configured to cause faults, and not having administrative access I am unable to change it ...
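For reference, the knob in question is the standard ARM Linux /proc/cpu/alignment interface; a minimal Java sketch of just checking the current policy (writing a new mode back needs root, which is exactly the problem on a locked-down tablet - the file layout described in the comment is from memory and may vary by kernel):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Dump the kernel's alignment-fault policy.  On most ARM Linux kernels
    // /proc/cpu/alignment lists per-type fault counters plus a line like
    // "User faults: 2 (fixup)" showing the current mode; writing a number
    // back (as root) selects ignore/warn/fixup/signal behaviour.
    public class AlignmentCheck {
        public static void main(String[] args) throws IOException {
            Path p = Paths.get("/proc/cpu/alignment");
            if (!Files.isReadable(p)) {
                System.out.println("no /proc/cpu/alignment here");
                return;
            }
            for (String line : Files.readAllLines(p))
                System.out.println(line);
        }
    }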

So the problem is that FFmpeg's configure script assumes mis-aligned memory accesses are safe if you're using armv6 or higher. Anyway, I filed a bug, although so far indications are that the bug triager doesn't understand what I filed (see update 2) - I'm not fussed as I have a work-around that doesn't require any patch.

I had wasted a lot of time thinking it was neon- or optimisation-related, whereas it was just ARMv5-versus-everything-else behaviour. When I finally did get it to compile without optimisation turned on, the backtrace I got was still identical, so worrying about getting a good backtrace was pointless. I had wrongly assumed that a modern cpu would handle mis-aligned accesses ok - not working at the assembly language level for a while gets you rusty ...

I suppose the main upshot of posting on the libav-user list ... which mostly just resulted in me wasting a full day of fucking about ... is that I realised my configure invocation was still broken (more problems that came from copying some random configure script off the net) and so I managed to clean it up further.

Bit over it all now.

Update: So the actual fix was to run this sed command over config.h after configure is executed:

sed -i -e 's/ HAVE_FAST_UNALIGNED 1/ HAVE_FAST_UNALIGNED 0/' $buildir/config.h

Update 2: Good-o, they've just added a configure option to override it.

Update 3: Can anyone tell me why this post is getting so many hits over the last month or so (June '14)? It's showing up in the access stats but there's no info as to why.

Tagged android, hacking, java, jjmpeg.
Thursday, 21 June 2012, 00:29

"Fuck you Nvidia" (and other news of the day)

So apparently Linus blew his top a bit and gave the bird to Nvidia with a pretty clear verbal message to match. Well if nothing else that'll be a keeper of a picture that will bounce around the internet for years to come ...

Of course, he has only got himself to blame here - if he hadn't allowed binary blobs to link into the kernel in the first place (choosing to discard rights that copyright gives him) then he wouldn't be in this situation, would he? After all, it was his decision alone - he could have gone either way and the rest would follow.

So much for pragmatism ...

So I guess we'll see where it goes in a few years when UEFI tivo-ises every hardware platform you buy, and you can no longer compile your own kernel or write your own operating system on your own computer, even if it is running a 'free' operating system.

Of course industry consortia such as Linaro are right behind UEFI - anyone who sells appliances would love to lock them down, giving them forced obsolescence - when in reality hardware is approaching the point where software is taking over many of the functions, and is capable of much more than its original firmware allows. I find it pretty offensive that the guy in the linked video regards anyone who doesn't like UEFI as a pirate ...

Microsoft laptop and/or tablet

Well, things must be in dire straits in microsoft's windows-rt land. One can only guess that the OEMs simply aren't embracing the platform with enough zeal - there seems to be no other sane reason they would want to create their own tablet (and/or laptop, or whatever it is).

Unless it's just pure greed - which of course isn't something that can be discounted entirely. At least in part they probably think they can recreate the xbox success story - which given how much it cost, clearly wasn't anywhere near as successful as the internets would have you believe.

I bet the few OEMs even looking at microsoft windows-rt are going to be given some pause by this announcement.

Nokia

The only `OEM' to embrace microsoft will probably be nokia.

But they're totally fucked and who knows if they'll even see out the calendar year. They only ever made good phones, and now they don't even do that - who is going to buy a PC from them, even if the form-factor is a tablet?

But what has happened to nokia is a rant for another person - I've had a couple of old nokia phones over the years and I thought they were fine, but I don't have any connection to them other than a shared sense of disappointment in what has become of a great company in such an astoundingly short period of time.

Transformer prime

Finally got a firmware upgrade for the transformer prime last week. TBH I can't tell any difference - if anything the browser hangs more with 'application not responding' (or whatever it says) than it did before. Not that I've been using it a great deal - it's a pretty clumsy way to do anything.

I hurt my foot again (well, this time it was my other foot - my guess is that my overly sedentary work-at-home lifestyle of the last few years is catching up with me and I have to at least start taking regular walks to repair my feet - even when I venture out it's mostly cycling) so I was pretty much immobile for a few days. So I dragged out the tablet and used it for some web reading and even tried a few games.

Even the touchy-feely games (I downloaded some 'bridge builder' and 'physics challenge' games), which seem well suited to the tablet, are a bit of a pain to control with a fat, imprecise finger that obscures what you're doing. Trying to use it in bed is annoying as you need two hands to hold it - the auto-rotate stuff is a pain in the arse too, so that gets turned off anyway. As a web browser it is just ok - portable - but again you need to prop it up to use it, or bend over it, and the fat-finger-mouse can make using any web page frustrating. Not to mention all the annoying adverts I haven't seen for years (well, at least the flash stops when you're not looking at it - something I never understood about firefox after it had tabs).

About the only good thing about it is that it still has a battery that works - so I can use it without a tether - unlike my laptops, whose batteries are all dead now and too expensive to be worth replacing. But apart from its dead battery keeping it tethered to my desk, my X61 thinkpad is a much easier to use and much more useful device: I use it for email and forums, and most of my browsing.

Once the tablet battery dies it'll be pretty shit as the connector is in an inconvenient place.

I'm still working on the android jjmpeg stuff though - mostly just for my own entertainment. I have the code back ported to amd64 now, but I haven't seen any crashes - valgrind gives a bunch of hits but it's always hard to tell if that's just the JVM doing funky shit or real problems (none of the stack traces show anything useful, even with a debug build).

Tagged rants.
Wednesday, 13 June 2012, 07:14

Viola & Jones Revisited

So the last post got me thinking about just how I implemented the viola-jones haar cascade in socles.

The code runs in a loop with no communication from the CPU to the GPU, and only runs about 10-15 iterations anyway (depending on the settings), so the loop overheads are fairly small. But it still requires a 'scale features' step, which is useful on a CPU to avoid excessive calculation but isn't so important on a GPU.

So I tried a slightly different approach - performing the scaling inside the detector kernel, which allows each kernel invocation to work on different scales, i.e. to do all scales in one step.

My first attempt at this wasn't much faster - but that's because I was invoking the kernel for too many probes. So then I tried changing the way it works: each work-group still works together to solve a single feature test stage, but instead of calculating its location and scale from the 2D work coordinates, I create a 4-element descriptor with some of the information required and it just uses that. This gives me a bit more flexibility in the work assignment, e.g. I can utilise persistent work-groups and tune the work size to fit the hardware more directly. It requires less temporary memory since the features are scaled in-situ.
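To make that concrete, here is a minimal Java sketch of how such per-probe descriptors might be generated on the host - one (x, y, scale, pad) entry per window across all scales, packed flat so the kernel can index them by a job id. The names and the exact 4-element layout are illustrative only, not the socles code:

    import java.util.ArrayList;
    import java.util.List;

    // Build a flat array of 4-element probe descriptors covering every scale
    // and window position, so a persistent kernel can simply loop over job ids.
    public class ProbeDescriptors {
        public static int[] build(int imageWidth, int imageHeight,
                                  int windowSize, float scaleStep, int maxScales) {
            List<int[]> probes = new ArrayList<>();
            float scale = 1.0f;
            for (int s = 0; s < maxScales; s++, scale *= scaleStep) {
                int win = Math.round(windowSize * scale);
                if (win > imageWidth || win > imageHeight)
                    break;
                int step = Math.max(1, Math.round(scale));  // probe stride grows with scale
                for (int y = 0; y + win <= imageHeight; y += step)
                    for (int x = 0; x + win <= imageWidth; x += step)
                        probes.add(new int[] { x, y, s, 0 });
            }
            int[] flat = new int[probes.size() * 4];
            for (int i = 0; i < probes.size(); i++)
                System.arraycopy(probes.get(i), 0, flat, i * 4, 4);
            return flat;
        }
    }

With the probes described explicitly like this, the launch size no longer has to match the image: a fixed number of persistent work-groups can each keep taking the next job id until the list is exhausted.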

This change was definitely worth it: for a given test on the webcamfx code I got the face detection down to around 13ms total time vs 19ms - about 4ms of that is fixed overhead. A stand-alone test on Lenna registers about 8ms vs 19ms, so over 100% improvement.

Comparisons with other hardware are difficult - mostly because it depends a great deal on the subject matter and the settings, and I haven't kept track of those - but I was pretty disappointed with the AMD performance up until now and I think this gets it on par with the nvidia hardware at least. Although really the 7970 should do measurably better ...

My guess is that the performance gain is mostly because, with the greater amount of work done, it can more efficiently fit the total problem onto the hardware. There is usually a small 'modulus' where a given problem won't fill all hardware units, leaving some idle, and in this newer version that only happens once rather than 10-15 times. Actually I did some more timing (and updated the numbers above), and 100% seems too much for this. Maybe? Oh, I also changed the parallel sum mechanism - but I changed it in both implementations and it made no difference anyway. I changed the region description to a float array too, although that only affects the scaling function in the first instance.

If I run this on a CPU the performance is very poor - around 1.5s for this test case. If I go back to a test CPU version I wrote it's a more reasonable 240ms, so I'm still getting a good 30x speedup over an Intel i7 980X. Given I was getting 90ms before with the cpu driver and the nvidia test case, I'm not really sure what's going on there.

I haven't checked the code in yet as it's a bit hacked up.

Update: I checked some stuff in, although I left both implementations intact for now.

Update 2: So I did some further analysis of the cascades I have: it turns out the way I'm splitting the work is very wasteful of GPU resources. I'm using at least 64 work items per stage - one work item per feature. But the earlier stages have only a small number of features to test - and the vast majority of probes don't go past the first few stages; e.g. the first stage of the default cascade only has 9 tests. I tried a few variations to address this but the overheads of multiple kernel calls and the global communication required outweighed any better utilisation.

Update 3: So curiosity kept me poking. First I realised that using fixed scheduling for persistent kernels might not be ideal, so I use an atomic to dole out work in a first-come-first-served way. Made a tiny difference.

Then I thought I would see if using fewer work-items per feature stage would help. In this case I use 4x16 or 2x32 thread groups to work on 4 or 2 tests concurrently - with all the necessary (messy) logic to ensure all barriers are hit by all threads, etc. This was measurable - the lenna test case I have is now down to around 7ms (unfortunately the algorithm fails when using sprofile for some unknown reason, so this is now time measured with System.nanoTime()).

One big thing left to try is to see if localising the wide work queue would help, i.e. rather than calling multiple kernels for each stage and having each work-item busy working on a sub-set of problems, do it within the kernel. E.g. if the stage counts are 9, 12, ...: do stage 1 with 7 concurrent jobs; if any pass, add them to a local queue. Then do stage 2 with 64/12 = 5 concurrent jobs; if any pass, add them to a local queue; and so on. Once you get to a stage longer than 32 items, just use 64 threads for all the rest. This way I get good utilisation with small stages as well as with large stages. I'm not sure whether this will be worth all the hassle and the extra addressing mathematics required (it's already using a lot of registers); but as I'm really curious to know if it would help I might attempt it.

Given that I now use a work queue, another possibility that opens up is to re-arrange the jobs to see if any locality of reference can be exploited. Given the huge memory load this might help - although the image cache is so small it might not.

Update 4: Curiosity got the better of me; it's been crappy cold weather and I hurt my foot (I don't know how), so I had another look at the complex version this morning ...

Cut a long story short, too many overheads, and although it isn't slow it isn't faster than just using 16 or 32 threads per feature test. Too many dynamic calculations which cannot be optimised out, and so on. It's around 9.5ms on my test case.

Structurally it's quite interesting however ....

  1. Find out how many concurrent tests can be executed for stage 0, dequeue that many jobs from the work queue and copy them to a local work queue.
  2. If we exceeded the job length, stop.
  3. Work out how many jobs can be done for the current stage.
  4. Process one batch of jobs.
  5. Parallel sum the stage sum.
  6. If a job advances to the next stage, copy it to a next-stage queue.
  7. Go back to 4 until the 'in' queue is finished.
  8. If any jobs are in the next-stage queue, copy them over and advance to the next stage.
  9. Go back to 3 if we had any work remaining.
  10. If we ran the full stage count, copy any jobs remaining in the queue to the result list.
  11. Go back to 1.

So each stage is fully processed in lock-step, and then advanced. The DEFAULT cascade starts with 9 feature tests, so it never has more than 7 items in the queue (7 feature tests of 9 elements = 63 work items). As the stages get deeper the number of work-items assigned to solve the problem widens, up to a limit of 64 - i.e. the work item topology dynamically alters as it runs through the stages in an attempt to keep most work-items busy.
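The arithmetic behind that adaptation is simple enough; a small Java sketch of just the topology calculation (the 64-item group size and the 9-test first stage come from the discussion above, the remaining stage lengths are made up for illustration):

    // How many probes ("jobs") a 64 work-item group can test concurrently at a
    // given stage, and how many work items that keeps busy.
    public class StageTopology {
        static final int GROUP_SIZE = 64;

        static int concurrentJobs(int featuresInStage) {
            // past half the group size, just give each job the whole group
            if (featuresInStage > GROUP_SIZE / 2)
                return 1;
            return GROUP_SIZE / featuresInStage;
        }

        public static void main(String[] args) {
            int[] stageLengths = { 9, 12, 16, 23, 27, 33, 44 };  // illustrative shape
            for (int len : stageLengths) {
                int jobs = concurrentJobs(len);
                System.out.printf("stage of %2d tests: %d concurrent jobs, %d work items busy%n",
                        len, jobs, jobs * Math.min(len, GROUP_SIZE));
            }
        }
    }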

There's a lot of messy logic used to make sure every thread in the workgroup executes every barrier, and there are lots of barriers to make sure everything works properly (I'm using locals a lot to communicate global info like the stage and topology information). So the code runs on a CPU (i.e. I got the barriers correct), although very inefficiently.

As is often the case with GPUs, the simpler version works better even if on paper it is less efficient at filling the ALU slots. Although I haven't confirmed this mathematically: apart from stage 0, the more complex method will also have uneven slot fill - it's one of those discrete maths/Knuth style problems I simply give up on.

Tagged hacking, opencl, socles.
Wednesday, 13 June 2012, 05:42

AMD Fusion Summit, HSA, etc.

Been looking forward to watching the AMD Fusion summit this year after watching a bunch of very interesting videos last year. I knew they were coming up but this month has gone faster than I thought ...

So far I've just watched the 'programmer' keynote from Phil - it's a pity about the emphasis on C++, which is such a shit language - but what are you gonna do, eh? His talk on the viola-jones haar cascade algorithm was interesting: how HSA could be used to split up algorithms and move the problem to where it is most efficiently solved (not sure how it compares to the face detector in socles, as I solved the problem of idle work-items in a different way). But yeah, looking forward to that capability in the future; during my last visit to OpenCL in the last month or so I kept thinking that being able to run stuff on the CPU where it made sense would ... make sense.

I slightly disagree that the problem with GPU parallel programming is just that it is too hard to write - all good software is hard to write - I think it has more to do with the availability of the platform. E.g. the PS3 is hard to write for too, but there seems to be plenty of software for it now because everyone's writing to the same platform. If I were a commercial developer writing software, right now it would only be a niche (photoshop is a niche). This is ok - because niche customers are probably already using capable hardware or don't mind buying it - but mass market adoption requires mass market availability of stable, quality, compatible platforms. This is still some way off.

The videos are on the summit broadcast site which requires a freely available login.

Update: Blah, ahh well, mostly a bit dull & sparse this year, or maybe they just weren't all put up on the net. The HSA stuff is the most interesting again from a software perspective.

Update 2: Apparently more content will be added over time, I guess last year I didn't spot it for a few months so had a lot more to look at off the bat.

On further reflection, the HSA foundation and the HSAIL stuff is pretty big news. People don't seem to understand why it's so important though. It's really about the H in HSA - heterogeneous. Being able to support many CPUs with the same code and even the same compiler. Being able to target the code at run-time to execute on the most efficient hardware available in the current system. And being able to do that in a practical way that isn't tied to some vendor-specific secret sauce using broken proprietary compilers. At the bottom of it, it's just another attempt at 'write once, run everywhere' technology, but this time for computationally intensive processing rather than desktop user applications. I guess time will tell how it goes without nvidia and intel though. And likewise whether this finally allows free software to take part.

The other part of it is coming up with a set of re-usable libraries so that the performance is opened up to non-gun-hackers (or in their terms, non-'ninja' programmers), although TBH I don't see how that is any different to any other modern hierarchical programming environment full of frameworks and tool-kits. This can already be done with OpenCL anyway, but I suppose there is still messy crap to deal with from the idiot-programmer's perspective, e.g. separate memory spaces, device-host copy overheads and so on. HSA, with code transparently intermingling with plain old host code, means the same could be done without those overheads, making it more attractive.

I still think the biggest hurdle for application developers is platform support. Any extra work has to be justifiable if it is only going to benefit a part of your customer base.

Update: I never got around to seeing the actual talks at the time, but I just found that Stream Computing have a nice index of all the OpenCL-specific talks. I'm not a regular reader of their blog, but every now and then I do a search in which it turns up and I catch up on it ...

Tagged opencl, socles.
Monday, 11 June 2012, 04:09

Random stuff

So I spent the last couple of days playing with a few random things.

Beat detection
Interest in this goes way back, probably to when a few lads and I did some graphics at a rave in the early 90s ... anyway, I thought I'd have another look. I started here with this rather badly formatted word-'processed' document, but ended up trying to implement a wavelet algorithm based on this paper.

I played with it a bit but didn't really get good results (as far as I could tell, and my testing code wasn't great). Next time I revisit it I will probably look at the simpler spectrum-based algorithms from the first article, but with a variation on the cycle detection.
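For reference, the simplest of those approaches boils down to comparing the energy of the current block of samples against the average energy over the last second or so; a minimal Java sketch (block size, history length and the 1.3 threshold are typical illustrative values, not taken from either article):

    // Crude energy-based beat detector: flag a beat when the current block's
    // energy is well above the recent average.  No variance term, no band
    // splitting - just the bare idea.
    public class BeatDetect {
        static final int HISTORY = 43;             // ~1 second of 1024-sample blocks at 44.1kHz
        private final double[] history = new double[HISTORY];
        private int pos;

        public boolean onBlock(float[] samples) {
            double e = 0;
            for (float s : samples)
                e += s * s;                         // instant energy of this block

            double avg = 0;
            for (double h : history)
                avg += h;
            avg /= HISTORY;

            history[pos] = e;
            pos = (pos + 1) % HISTORY;

            return avg > 0 && e > 1.3 * avg;        // crude fixed threshold
        }
    }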

DLNA
Oh, that DLNA crap again. Mostly because there doesn't seem to be a video player that lets one access a DLNA server from Linux in a simple way. I started with cling, and after waiting about 20 minutes for maven to compile it I decided that there really is a build tool worse than ant (surprising as that is), and then proceeded to split the project up into a pattern that netbeans can work with. I eventually got it to run and started work on a jjmpeg media renderer, got the android browser working and so on. But really, wtf - it's a huge fucking pile of code just to retrieve a URL to a file on a HTTP server ...

Then this morning I read this post from the developer of libdlna (which seems to be abandoned now), and decided it was just a world of pain I wanted nothing to do with.

IF I ever bother with this again I will probably just write my own protocol but today I can't be stuffed.

jjmpeg build
And finally, today I poked at the jjmpeg build, looking at using the current android branch as the main code-base. The native stuff isn't that difficult (just making a decision and sticking with it is the main problem), but ant and netbeans make it either impractical or impossible to support cross-platform development in the same project.

So I will probably need to create 3 side-by-side projects, although for various reasons this isn't terribly ideal either.

  • jjmpeg-core would contain the main binding classes, native generators and so on. Probably a copy of the ffmpeg sources.
  • jjmpeg-java would contain the java-specific i/o and display classes. I guess this would also build the native code.
  • jjmpeg-android would contain the android-specific i/o and display classes and android native build, and probably be configured as an android library project so it can be re-used.

I didn't actually get that far; I was working on re-arranging the native build and platform-specific stuff in order to fit it in with the android build and clean it up. But I didn't really come up with great solutions, and in the end I think I solved nothing, so I might have to try again from scratch another time.

Just not switched on today, so might go find somewhere warm to read ...

Tagged android, hacking, jjmpeg.
Thursday, 07 June 2012, 06:34

Slow queues and big maths

After a couple of days poking at my android video player - mostly just to test jjmpeg really - and getting it to mostly work, I thought I'd have a look at some performance tuning. It even does network streaming now.

Fired up the profiler and had a poke ...

Found out it was spending 10% of its time in the starSlash implementation I was using - I hadn't upgraded the multi-thread player to use the libavutil rescale function and it was still using the java BigInteger class. Easy fix.
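For context, the libavutil routine (av_rescale and friends) just computes a*b/c for timestamp/time-base conversion without losing precision; a rough Java equivalent that only falls back to BigInteger when the product would overflow a long might look like this (an illustrative sketch, not the jjmpeg binding, and it truncates where av_rescale rounds to nearest):

    import java.math.BigInteger;

    // Compute a * b / c without overflowing, only paying for BigInteger when
    // the 64-bit fast path can't be used (ignores Long.MIN_VALUE edge cases).
    public final class Rescale {
        public static long rescale(long a, long b, long c) {
            if (a == 0 || b == 0)
                return 0;
            if (Math.abs(a) <= Long.MAX_VALUE / Math.abs(b))
                return a * b / c;                          // product fits in a long
            return BigInteger.valueOf(a)
                    .multiply(BigInteger.valueOf(b))
                    .divide(BigInteger.valueOf(c))
                    .longValueExact();                     // slow path
        }
    }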

And then I checked the queue stuff it's using for sending asynchronous data around. I changed from LinkedBlockingQueue to a lighter-weight and simpler blocking queue I had tested with the simpler player code. Nearly another 10% of cpu time there as well.
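Something along these lines - a bare-bones bounded queue with one lock and a circular array, rather than LinkedBlockingQueue's per-node allocation and two-lock design. This is just a sketch of the idea, not the player's actual queue:

    // Minimal fixed-capacity blocking queue for a simple producer/consumer
    // pipeline: no node allocation per element, one monitor for both ends.
    public class SimpleBlockingQueue<T> {
        private final Object[] items;
        private int head, tail, count;

        public SimpleBlockingQueue(int capacity) {
            items = new Object[capacity];
        }

        public synchronized void put(T item) throws InterruptedException {
            while (count == items.length)
                wait();                              // queue full
            items[tail] = item;
            tail = (tail + 1) % items.length;
            count++;
            notifyAll();
        }

        @SuppressWarnings("unchecked")
        public synchronized T take() throws InterruptedException {
            while (count == 0)
                wait();                              // queue empty
            T item = (T) items[head];
            items[head] = null;
            head = (head + 1) % items.length;
            count--;
            notifyAll();
            return item;
        }
    }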

So with those small changes my test video went from about 30% cpu time (if I'm reading the cpu usage debug option properly) to well under 10% most of the time. It's only a fairly low-bitrate 656x400@25fps source.

The next biggest thing is the texture loading, which can be a bit slow (actually, depending on the video size, it's easily the single biggest overhead). I kinda mucked around with that a lot trying to debug 'the crash', so that could probably do with a revisit. Right now I'm updating it on the GL thread - but synchronously with the video decoding thread, so any delay is costly (I wasn't sure if decode_video could reference an AVFrame at the next call). From memory, doing it this way was quicker than using a shared gl context too. Actually, whilst writing this I had a closer look and a bit of a poke - the texture loading is still really slow but I removed more overhead with a few tweaks.
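The upload step itself is essentially just a glTexSubImage2D of the decoded frame on the GL thread; a rough sketch, assuming the decode side has already converted the frame to RGB565 in a direct ByteBuffer (the texture allocation, sizes and pixel format here are assumptions, not the player's actual code):

    import java.nio.ByteBuffer;
    import android.opengl.GLES20;

    // Upload one decoded frame into an already-allocated RGB565 texture.
    // Must run on the GL thread.
    public class FrameUploader {
        private final int textureId;
        private final int width, height;

        public FrameUploader(int textureId, int width, int height) {
            this.textureId = textureId;
            this.width = width;
            this.height = height;
        }

        public void upload(ByteBuffer pixels) {
            GLES20.glBindTexture(GLES20.GL_TEXTURE_2D, textureId);
            GLES20.glTexSubImage2D(GLES20.GL_TEXTURE_2D, 0, 0, 0,
                    width, height,
                    GLES20.GL_RGB, GLES20.GL_UNSIGNED_SHORT_5_6_5, pixels);
        }
    }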

The main problem with the player itself is that it still just vanishes sometimes with a SIGBUS. Android displays nothing about it apart from the fact that the signal was '7', so I guess it's in some native thread. I thought I had gdb running a few weeks ago but I can't seem to get it to run now - although back then it didn't help much anyway. It also doesn't even try to cope well with not being able to keep up with decoding, and ugly stuff happens.

Ahh well, if it was done I'd have nothing to do ...

Tagged android, hacking, java, jjmpeg.
Wednesday, 06 June 2012, 08:52

Idle minds ...

So it turns out I have a bit of a break between contracts again - i'm always happy to have extra time off, so there's nothing to complain about there!

I sat down on the weekend and yesterday to play with some socles code, but so far it's been really slow going. I just don't feel like getting too much into it and it's easier just to put it down if I hit a problem; I guess I really do need a bit of a break. I also have tons of crap to do in the back yard, shed, and even around the house as well; but I've been a bit lazy on that front for the last year or so, so I doubt much will happen there.

But yeah, I guess eventually over this break I will get the opencl ransac stuff sorted out in socles, and probably then re-visit jjmpeg to at least check in the code I've already done on the android stuff.

I tried the 12.6 beta catalyst driver yesterday - and thankfully it seems a lot better than 12.4, so far, touch wood, etc. At least it doesn't keep throwing up OUT_OF_HOST_MEM errors after half a dozen code runs, AND the xinerama twin-screen desktop is back up to decent performance. So after finally getting a GCN GPU I would like to have a play with that and see what I can get out of it. I should probably try to come up with a specific application I want to implement and work toward it as well, rather than just poking random algorithms into socles. The thing is, computers mostly just do what I need them to do (run emacs and a terminal in overlapping windows?), so I'm not particularly driven at this point.

I'm keeping an eye on the ARM stuff; the rhombus tech guys, the open pandora (who knows if I'll ever get the one I ordered - at least an email confirming the order and address once a year would be nice), but with a bunch of beagleboards sitting idle already there doesn't seem much point in getting another dev board to poke at. Just not enough hours to look at everything that is interesting ...

On the RANSAC code, I pretty much have it done - it's just the messy testing to go. In this version I tried to do most of the work in the one kernel - I will see whether that added complexity makes it slower, or the lower memory demands help it overall. I also tried to parallelise absolutely everything, from coordinate normalisation/result denormalisation to matrix setup. So far I'm getting a strange result in that just the SVD is somewhat slower than the SVD I had before, although for all intents and purposes they are the same design. Once I have it going I will try double arithmetic to see if that generates better results.

Tagged android, biographical, jjmpeg, opencl, socles.
Sunday, 03 June 2012, 04:35

Green Tomato Sauce

So for some strange reason I have an abundance of tomatoes at the moment - being June, this is way way out of season. Considering I had tomatoes from October to February as well, it's been a strange year.

The wet is making them split a bit, and together with a lot of bug problems it means they're best picked green before they get eaten or go rotten.

I started getting more than I could consume so I made some sauce (not a fan of green tomato pickles). It's pretty much a plain tomato sauce recipe, but with green or pink tomatoes instead of red. A bit more sugar to compensate (and to compensate for not measuring the salt properly). And since I made it - a shit-load of chillies. The habanero plants are suffering in the cold too and getting a bit of a mould problem, so I just grabbed all the chillies I had left on the plants as well. Chucked in a bit of sweet potato I grabbed out of the ground as well since I had it - probably should've used more as I have more of that than I use too.

So, plenty of chillies. I didn't count but it's at least 2-3 cups worth. Plus a handful of all the other green chillies I found in the garden (cayenne and serrano).

It looks much like some fermented green chilli sauce I made (tabasco style), but this has much more of a bite to it. Otherwise it tastes much like home-made tomato sauce, the red kind. While licking the spoon I had enough chilli to give me the hiccups - which means it's pretty hot.

Nearly 3 litres of that should keep me warm over winter and beyond ... I don't even use tomato sauce that often, but this stuff is great with a burnt snag or a boiled sav in a bit of bread. Used sparingly.

Was busy cooking most of the day yesterday; I also made a banana cake (a couple of bananas getting past it that needed using) and another 2l of lime cordial. I used the same recipe as last time but put twice as much lime juice in it - came out much better. More limey and less 'cane sugar'.

PS I used half a bottle of ezy-sauce in the ~2.5kg of the green fruit, so acid shouldn't be a problem.

Tagged cooking, horticulture.