Fucked up Fridays

What is it about Fridays lately ... Well, the latest little thing to ruin my day has been the inability of Firefox 7 to function correctly with the primary selection.

~~It seems to want to ignore middlemouse.contentLoadURL for some reason~~. Given that it's a relatively new setting and fully documented I presume it's just a bug, but what a pain.

It's not something I use constantly but discovering it doesn't work is pretty annoying.

Update: So now it decides it's going to work. Well what can I say ... except maybe that I need to get AFK more often.

I'm totally sick of the upgrade treadmill and feel somewhat annoyed by being forced to install a newer version of Fedora just to get my graphics card working. I had everything working just nicely and was familiar enough with any of the warts left to not notice them. And now I have to go through all that crap again. The thought that firefox will become 'versionless' horrifies me, as does the love-fest that is HTML5+JavaScript where I will no longer be able to ignore CO2 belching crap like I can now by just disabling flash.

socles demos

I finally got off my fat arse - or is that sat on it further enlargening[sic] it - and tied up some of the test driver code I have for socles into a set of demos.

I also implemented the colour mode for the DCT denoising algorithm. Over-all it's a little slow still - i.e. not fast enough for real-time video. One of these days i'll get around to the complex wavelet version, that should be a lot faster and can also sharpen. I haven't been able to suss out DCT sharpening and so far my attempts add too many artefacts to be useful (i.e. pixel-level chess pattern).

The demos so far are:

AdaptiveBlur: An interactive window that shows an experimental algorithm I came up with some time ago for de-noising. It uses a sobel filter to detect edges, then uses that to progressively blend between a blurred and non-blurred image. Works ok sometimes.
ConvolveNonSeparable: Simple non-separable convolution that blurs an image.
ConvolveSeparable: Separable convolution to do the same thing (~~and demonstrates the code is broken atm~~ - demo was broken, fixed)
DCT8x8Mono, DCT8x8Colour: Interactive DCT based denoise demo for mono/colour images.
WebcamFX: Another old interactive demo I wrote which uses Video4Linux to access a webcam and apply a bunch of effects including KLT motion detection and viola-jones face detect. It also shows the first half of a low-overhead video display path: the GPU does the colour conversion from raw frames. Well as low as possible with v4l4j anyway.
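
The AdaptiveBlur idea from the list above can be sketched on the CPU in a few lines. This is only an illustration of the description (sobel edge strength driving a blend between a blurred and an unblurred image), not the socles code: the class name, the 3x3 box blur, the clamped borders and the edgeScale parameter are all assumptions made for the example.

```java
// Sketch only: names are hypothetical and this is not the socles implementation.
final class AdaptiveBlurSketch {

    static int clamp(int v, int n) {
        return Math.max(0, Math.min(n - 1, v));
    }

    /** Reads a pixel, clamping the coordinates to the image borders. */
    static float at(float[] src, int w, int h, int x, int y) {
        return src[clamp(y, h) * w + clamp(x, w)];
    }

    /** 3x3 box blur (stand-in for whatever blur the real demo uses). */
    static float[] boxBlur(float[] src, int w, int h) {
        float[] dst = new float[w * h];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                float sum = 0;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        sum += at(src, w, h, x + dx, y + dy);
                dst[y * w + x] = sum / 9;
            }
        }
        return dst;
    }

    /** Sobel gradient magnitude. */
    static float[] sobel(float[] src, int w, int h) {
        float[] dst = new float[w * h];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                float gx = at(src, w, h, x + 1, y - 1) + 2 * at(src, w, h, x + 1, y) + at(src, w, h, x + 1, y + 1)
                         - at(src, w, h, x - 1, y - 1) - 2 * at(src, w, h, x - 1, y) - at(src, w, h, x - 1, y + 1);
                float gy = at(src, w, h, x - 1, y + 1) + 2 * at(src, w, h, x, y + 1) + at(src, w, h, x + 1, y + 1)
                         - at(src, w, h, x - 1, y - 1) - 2 * at(src, w, h, x, y - 1) - at(src, w, h, x + 1, y - 1);
                dst[y * w + x] = (float) Math.sqrt(gx * gx + gy * gy);
            }
        }
        return dst;
    }

    /**
     * Blends per pixel: flat areas take the blurred value, strong edges keep
     * the original. edgeScale is the edge strength treated as "fully an edge".
     */
    static float[] adaptiveBlur(float[] src, int w, int h, float edgeScale) {
        float[] blur = boxBlur(src, w, h);
        float[] edge = sobel(src, w, h);
        float[] dst = new float[w * h];
        for (int i = 0; i < dst.length; i++) {
            float a = Math.min(1f, edge[i] / edgeScale); // 0 = flat, 1 = edge
            dst[i] = a * src[i] + (1 - a) * blur[i];
        }
        return dst;
    }
}
```

With a small edgeScale a clean step edge survives untouched while a plain blur would smear it; a larger edgeScale lets more of the blur through near weak edges, which is presumably where the "works ok sometimes" comes from.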

They're in the soclesdemo sub-module in socles' cvs.

Hmm, another week nearly down. I've been reading lots of papers and trying to suss out some fiddly crap for work, so this stuff has been a nice distraction. That's finally going somewhere so might keep me busy for a bit.

GC, finalisers

So I was doing some memory profiling the other day (using netbeans' excellent profiler - boy I could've used this 10 years ago) to try to track down some resource leakages and I noticed that xuggle was really exercising the system heavily.

So it seems I might look at moving to use jjmpeg in my client's application fairly soon. There are some other reasons as well: i.e. not being able to run in a 64-bit JVM on microsoft windows is starting to become a problem, and the bundled ffmpeg is just a bit out of date.

Since I haven't implemented memory handling completely in jjmpeg I went about looking how to do it 'properly'. I was just going to try to use finalisers, but then I came across this article on ~~java finalisers~~ java finalisers which said it probably wasn't a good idea.

I was going to have a short look this morning but suddenly it was 4 hours later and although I had something which works i'm not sure yet that I like it. It seems the cleanest way to implement the suggestions of using weak references, and mixing the auto-generated and hand-crafted code I want, so I will probably end up running with it. The public api didn't need to change.

Previously, the binding worked with an object class hierarchy something like this

 AVNative [
   ByteBuffer p (points to allocated/mapped native memory)
 ]
   +- AVFormatContextAbstract [
    Generated field accessors and native methods
    Most methods are object methods
   ]
    +- AVFormatContext [
      Public factory methods/constructors
      Hand-coded specific methods
      Hand-coded helper native methods
      Hand-coded finalise/dispose methods
    ]

The new structure:

WeakReference<AVObject>
+- AVNative [
   ByteBuffer p pointing to native memory
   internal dispose() method
   weak reference queue/cleanup as from article above
   Weak reference is AVObject
 ]
 +- AVFormatContextNativeAbstract [
   Generated field accessors and native methods
   All methods and field accessors are static
   ]
   +- AVFormatContextNative [
     Hand-coded helper native methods
     Implements native resource dispose
   ]

Together with

AVObject [
  AVNative n (the pointer to the native wrapper object)
  public dispose method
  ]
  +- AVFormatContextAbstract [
      Generated public access methods which use AVFormatContextNative(Abstract) methods.
    ]
    +- AVFormatContext [
      Public factory methods/constructors
      Hand-coded specific methods
      ]

So yeah - a bit more complicated, and it requires 2 objects for each instance (and often 3 including the C side instance it's wrapping), as well as the overhead of the weakreference instance data and the list entry for tracking the references. The extra layer of indirection also adds another method invocation/stack frame to every method call.

On the other hand, it lets the client code use dispose() when it wants to, or if it forgets then dispose will automatically be called eventually. And makes it obvious in the code where dispose needs to sit.
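
A minimal sketch of that arrangement, assuming the usual weak reference plus reference queue pattern. NativeRef and Wrapper are hypothetical stand-ins for AVNative and AVObject, the Runnable stands in for freeing the native memory, and none of this is the actual jjmpeg code.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for AVNative: weakly references its public wrapper.
final class NativeRef extends WeakReference<Wrapper> {
    // Collected wrappers have their NativeRef enqueued here by the GC.
    static final ReferenceQueue<Wrapper> QUEUE = new ReferenceQueue<>();
    // Keeps the NativeRef objects themselves reachable until disposed.
    static final Set<NativeRef> LIVE = Collections.synchronizedSet(new HashSet<>());

    private final Runnable release; // must NOT reference the wrapper
    private boolean disposed;

    NativeRef(Wrapper owner, Runnable release) {
        super(owner, QUEUE);
        this.release = release;
        LIVE.add(this);
    }

    /** Frees the native resource at most once. */
    synchronized void dispose() {
        if (!disposed) {
            disposed = true;
            LIVE.remove(this);
            release.run();
        }
    }

    /** Disposes natives whose wrapper was collected; call it now and then. */
    static int cleanup() {
        int n = 0;
        Reference<? extends Wrapper> r;
        while ((r = QUEUE.poll()) != null) {
            ((NativeRef) r).dispose();
            n++;
        }
        return n;
    }
}

// Hypothetical stand-in for AVObject: the object client code holds.
class Wrapper {
    // Counts releases so the example has something observable.
    static final AtomicInteger RELEASED = new AtomicInteger();

    final NativeRef n = new NativeRef(this, RELEASED::incrementAndGet);

    /** Explicit dispose for client code that remembers to call it. */
    public void dispose() {
        n.dispose();
    }
}
```

Client code can call dispose() itself; if it forgets, the reference turns up on the queue some time after the wrapper is collected and cleanup() releases it. When that happens is up to the garbage collector, so nothing here should rely on prompt release.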

As usual it's a question of trade-offs. If the article is correct then presumably these trade-offs are worth it.

In this case the whole point of using jjmpeg is to avoid numerous allocations every frame anyway: I can allocate working and output buffers once and just use them directly. In this case the actual number of objects is quite small and doesn't happen very often, so I suspect that either mechanism would work about as well as the other.

Well this distraction has blown my morning away; I'd better leave it for now so I can clock up some work hours after lunch.

Update: I figured i'd gone too far down this route to do anything other than keep it. I've checked this in now as well as a bunch of other stuff described on the project page.

Update 2: Oracle keeps breaking links, but i've updated the pointer. I'm looking at this again (September 2012) because of some issues in jjmpeg.

OpenCL DCT Denoise

I've just checked in an OpenCL implementation of the DCT de-noising algorithm I mentioned previously. I've only done the mono version so far.

It's not terribly fast - 10ms wall-clock for a 512x512 mono image, and given that it requires 64 DCT's per 8x8 block and needs to accumulate the results, it probably never will be.

The kernel source.

Update: Colour version implemented now.

It's beaten me. For now.

I should've stayed outside in the sun today gardening - but curiosity got the better of me. I hope the (absolutely stunning) weather continues tomorrow, otherwise i've blown it on nothing ...

I tried working on the AMD performance of the Viola & Jones detector in socles: I tried a whole bunch of stuff, from copying the image tiles pre-scaled (as summed area table) to local memory, to completely re-arranging the data structures so they are workgroup aligned, to even trying the cpu single-thread-per-location version.

I got some minor improvement, the most coming from copying the tile to local store and removing some of the calculations (since it doesn't need to scale the rects): but that only took a simple test case from about 25ms to 20ms. Barely noticeable in my webcam test harness.

I think the problem is with the fact it has to read so much data for each single test. It requires 3-4 uint4's just to describe the test, and 8-12 uint texture lookups for the summed area table lookups. The cascade I have has ~6 400 regions to test grouped in ~3 000 features, and although most aren't tested it's just a lot of data. It's too much for constant memory for example.
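
For reference, a small CPU-side sketch of why each region costs four table reads: with a summed area table the sum over any rectangle comes from its four corners, so a feature made of two or three regions would account for the 8-12 lookups. The class and method names are made up for the example and the layout (one extra row and column of zeros) is an assumption, not necessarily what socles uses.

```java
// Sketch only: hypothetical names, not the socles data layout.
final class SummedAreaSketch {

    /**
     * Builds a table of (w+1) x (h+1) entries where entry (x, y) holds the
     * sum of all pixels in the rectangle [0, x) x [0, y).
     */
    static long[] build(int[] src, int w, int h) {
        int sw = w + 1;
        long[] sat = new long[sw * (h + 1)];
        for (int y = 0; y < h; y++) {
            long row = 0; // running sum of the current row
            for (int x = 0; x < w; x++) {
                row += src[y * w + x];
                sat[(y + 1) * sw + (x + 1)] = sat[y * sw + (x + 1)] + row;
            }
        }
        return sat;
    }

    /** Sum of the rw x rh rectangle at (x, y): exactly four table reads. */
    static long rectSum(long[] sat, int w, int x, int y, int rw, int rh) {
        int sw = w + 1;
        return sat[(y + rh) * sw + (x + rw)]
             - sat[y * sw + (x + rw)]
             - sat[(y + rh) * sw + x]
             + sat[y * sw + x];
    }
}
```

The reads for one region are scattered across two rows of the table, which fits the observation that the detector is dominated by memory traffic rather than arithmetic.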

With a fix to use the atomic counters AMD hardware provides at least it's now in the same order of magnitude as the nvidia hardware, but still 2-4x slower.

Maybe ... if the stages were broken up into smaller parts it could work more efficiently, but it does seem a pretty long shot to me as the problem remains with the sheer amount of stuff that needs to be loaded for each test.

Time probably better spent on something else.

Ho hum.

Have a new AMD card - HD 6950 - for my workstation, need the catalyst driver for the OpenCL stuff. I use XFCE so the gnome3 incompatibilities are of no interest to me.

Couldn't get the driver built for FC13 (all sorts of bugs/problems with the rpm and I really just couldn't be fagged with it all late at night), so `upgraded' to FC15 ...

It kind of works, but is really slow in really weird ways - when changing virtual desktops one window refreshes at 'cpu speed'. glxgears @ 6000fps which is really way too slow: I'm getting 10KFPS on my rather older 5770 card on my other older/slower machine. Although fgl_glxgears is twice as fast on this new card. Using the AMD CPU backend for OpenCL causes more interference with graphics update than using the GPU backend(!) The other machine is using catalyst 10.12 on fedora 14, new one is 11.9 on fedora 15 ...

I've blacklisted the kernel radeon module and whatnot. I'm using xinerama - i tried without it and it was even slower.

I think there's just something wrong with the whole system as everything feels rather sluggish - or is that just the price of 'progress'? I'm trying a yum update (all 1G's worth) and if that doesn't work I might have to try something more drastic. Obviously the upgrade was a risky choice, but one would hope having the right kernel and X driver would be enough for the video driver ...

Only 1000 packages to go now ...

Later ...

Well it's still really slow. I tried an older driver release (on windows - hard to find them for fedora) but it wouldn't support the card. On windows the part of my application I timed runs about 2x faster by wall-clock than on linux: which is pretty significant since much of the time is just waiting around for the video frame to arrive, so the speed-up is presumably more than that. Needless to say the desktop is smoother too.

I also tried the viola-jones detector from socles. Ouch, this really really struggles - about 100x slower than running on nvidia hardware. I tried a few things that didn't make any noticeable difference apart from removing the single rarely-used atomic_inc which made it jump to about 30x faster - but even with that huge increase it was still well behind the GTX 480.

I think probably I will have to try some other possible ideas to deal with this:

Scale the images so that each sliding scan reads adjacent locations (i.e. coalesced reads), and go back to 1-thread-per-test/cascade.
Pre-calculate the scaled weights/regions on the cpu so they can be stored in constant memory.
Cache the region/weight information in LS.
Unpack the region/weight info into a flat structure so it is read sequentially rather than walking a tree stored in an array.
? separate the sum calculations from the weight calculations. By doing less work there might be more locality of reference/chance for any cache to function. This is just another way to try the first point I guess.
Use atomic counters if available since global atomics are obviously a huge no-no on cayman.

I had also better check it on my HD 5770 which runs the fc14 desktop very snappy and runs OpenCL ok to verify it isn't just all down to a shoddy driver (Hmm, now I think about it, I haven't tried OpenCL on it since 'upgrading' to fc14 from a hacked up ancient gnewsense).

glxgears does start to slow down on the 5770 vs the 6950 as you make the window bigger - so the hardware itself is somewhat faster. The problems must be in the overhead of the os/drivers. No question that ATI aren't doing a great job here but on the other hand, the xorg, fdo, and linux guys seem to change their minds about driver/graphics architecture every 6 months too ...

I was looking forward to playing with some new hardware, but apart from the sluggish GUI and having to `upgrade' the system, most of the application I work on no longer functions as critical routines are returning broken results. Not fun. Some of these are going to turn out to be bugs but i've already found problems with the compiler (e.g. commenting out all of the #pragma unroll directives fixed a bunch of stuff).

Well as the boss said, these things are so cheap it probably isn't worth my time (or his money!) for me trying to fix these issues ...

Later Still ...

Well I seem to have most of the code working again. Apart from the #pragma unroll error, they seem to be my own fault.

First, a bunch of queue synchronisation problems: data being over-written before it was fully processed for example. NVidia's libraries are more aggressive about starting work without an explicit clFlush(). And apart from that I just made some mistakes along the way which weren't exposed until now.

And one odd one which took a while to track down: passing the same image as both a read_only image, and a write_only one. I knew this was suss when I did it, but 'it worked' so i left it there: I had it in the back of my mind that this was the sort of thing I should check, but I couldn't remember where I'd done it.

I still have newly added stability issues - the dreaded and meaningless 'error 134': but in the past these have usually been bugs too. Although not always.

So perhaps the drivers aren't so bad after-all; although they are still too slow from linux.

I guess I should've stuck to one of my rules of thumb of late: if you think you're getting the wrong result from the compiler, you just haven't checked your code closely enough yet.

DCT denoising

Ok now the weekend's over, time to calm down and stop ranting ... ;-) Bummer about Australia losing though, apart from some real shockers right from the kick-off they did calm down and start playing fairly well. When they did have a good run - and they had a few - they were let down badly by not enough support at the breakdown. Still, NZ deserved winners ... And channel 9's race-caller sucked the whole way through.

I just found this very well put together site about using the discrete cosine transform (DCT) to do threshold de-noising in a manner similar to the wavelet threshold denoising and sharpening I mentioned before.

DCT Denoising

Very slick, complete with well formatted mathematics that puts most microsoft-word based papers to shame, GPL3 source and on-line demo!

I downloaded the code and modified it not to add the noise and tried it myself on Lenna:

The results are effectively the same as with the complex DTCWT version for moderate settings - visually even the artefacts it introduces are the same.

In the form provided however it is somewhat more computationally intensive - its sliding window is offset by single pixels, and the way the C++ is written isn't the most efficient. I wonder how well it would work with a hanning window and 4 pixel offsets. I wonder if it can also sharpen - from a quick search it looks like it can.

Very interesting, and it also works with colour images in smarter ways than just processing each channel separately.

When I get the time I'll look at coding this up for ImageZ and socles,

~~although I just noticed blogger mucked up something else - looking at images - so the threshold of having to do something about that is ever approaching~~

(I found the option to disable 'lightbox' mode).

Update: Just another advert for Java. It looked simple enough so I coded up a version in Java using an 8x8 DCT and it runs single-threaded over 3x faster than the C++ version, including the JVM startup or over 4x once it's going. Rather than generate all 255 025(!) patches, transform, threshold, inverse, and merge, it fully processes a single patch each time: requiring that much less DCT memory (i.e. rather a lot - over 62MB less). So that's 0.9s vs 3.9s for this 512x512 mono image. Although I can't fathom why my version needs 1/2 the threshold to give a similar result ...

Update: See follow-on post where i mention implementing it in OpenCL for socles.

Update: I've now added it to ImageZ. DCT8Denoise is the main entry point. I changed it to work with separate colour planes rather than planes stored in a single array, just to make it easier to invoke from ImageZ. It's only single-threaded atm.
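
The patch-at-a-time approach from the first update can be sketched roughly as follows. This is not the DCT8Denoise code from ImageZ: the class name, the orthonormal DCT scaling, hard-thresholding every coefficient (DC included) and the plain averaging of overlapping patches are all assumptions for the example, and it favours readability over speed.

```java
// Sketch only: hypothetical names, and not the ImageZ/socles implementation.
final class DctDenoiseSketch {
    static final int N = 8;
    // Orthonormal DCT-II basis: C[k][n] = a(k) * cos(pi * (2n + 1) * k / 2N).
    static final float[][] C = new float[N][N];
    static {
        for (int k = 0; k < N; k++)
            for (int n = 0; n < N; n++)
                C[k][n] = (float) (Math.sqrt((k == 0 ? 1.0 : 2.0) / N)
                        * Math.cos(Math.PI * (2 * n + 1) * k / (2.0 * N)));
    }

    /** r = A * B, with A and/or B optionally transposed. */
    static float[][] mul(float[][] a, float[][] b, boolean ta, boolean tb) {
        float[][] r = new float[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                float s = 0;
                for (int k = 0; k < N; k++)
                    s += (ta ? a[k][i] : a[i][k]) * (tb ? b[j][k] : b[k][j]);
                r[i][j] = s;
            }
        return r;
    }

    static float[][] forward(float[][] p) { // C * P * C^T
        return mul(mul(C, p, false, false), C, false, true);
    }

    static float[][] inverse(float[][] t) { // C^T * T * C
        return mul(mul(C, t, true, false), C, false, false);
    }

    /** Number of 8x8 patches when the window slides one pixel at a time. */
    static int patchCount(int w, int h) {
        return (w - N + 1) * (h - N + 1);
    }

    /** Requires w >= 8 and h >= 8. Only one patch is held at any time. */
    static float[] denoise(float[] src, int w, int h, float threshold) {
        float[] acc = new float[w * h];   // sum of patch estimates per pixel
        float[] hits = new float[w * h];  // number of patches covering a pixel
        float[][] patch = new float[N][N];
        for (int y = 0; y + N <= h; y++) {
            for (int x = 0; x + N <= w; x++) {
                for (int j = 0; j < N; j++)
                    for (int i = 0; i < N; i++)
                        patch[j][i] = src[(y + j) * w + x + i];
                float[][] t = forward(patch);
                for (int j = 0; j < N; j++)
                    for (int i = 0; i < N; i++)
                        if (Math.abs(t[j][i]) < threshold)
                            t[j][i] = 0; // hard threshold (DC included: an assumption)
                float[][] q = inverse(t);
                for (int j = 0; j < N; j++)
                    for (int i = 0; i < N; i++) {
                        acc[(y + j) * w + x + i] += q[j][i];
                        hits[(y + j) * w + x + i] += 1;
                    }
            }
        }
        for (int i = 0; i < acc.length; i++)
            acc[i] /= hits[i]; // average the overlapping estimates
        return acc;
    }
}
```

patchCount(512, 512) gives the 255 025 patches mentioned above; assuming 64 four-byte floats per patch, holding them all at once would take 65 286 400 bytes, which matches the "over 62MB" figure. Interior pixels are covered by 64 overlapping patches, which is also why the OpenCL version needs 64 DCTs per 8x8 block.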
