About Me

Michael Zucchi

 B.E. (Comp. Sys. Eng.)

  also known as Zed
  to his mates & enemies!

notzed at gmail >
fosstodon.org/@notzed >

Tags

android (44)
beagle (63)
biographical (104)
blogz (9)
business (1)
code (77)
compilerz (1)
cooking (31)
dez (7)
dusk (31)
esp32 (4)
extensionz (1)
ffts (3)
forth (3)
free software (4)
games (32)
gloat (2)
globalisation (1)
gnu (4)
graphics (16)
gsoc (4)
hacking (459)
haiku (2)
horticulture (10)
house (23)
hsa (6)
humour (7)
imagez (28)
java (231)
java ee (3)
javafx (49)
jjmpeg (81)
junk (3)
kobo (15)
libeze (7)
linux (5)
mediaz (27)
ml (15)
nativez (10)
opencl (120)
os (17)
panamaz (5)
parallella (97)
pdfz (8)
philosophy (26)
picfx (2)
players (1)
playerz (2)
politics (7)
ps3 (12)
puppybits (17)
rants (137)
readerz (8)
rez (1)
socles (36)
termz (3)
videoz (6)
vulkan (3)
wanki (3)
workshop (3)
zcl (4)
zedzone (26)
Thursday, 06 September 2012, 09:00

Java and legacy operating systems, graphics drivers, &c.

So I had to jump into the same operating system the client uses to verify some code changes still worked - really just a formality as I have jjmpeg working fine there and I was only changing some Java code.

Pretty weird results though. First I had some unpleasant flashing when the code was painting a bitmap - I thought it might've been from the latest jre update together with my habit of poking BufferedImage array's directly, but I checked and this code doesn't do any of that. All it does is replace the bitmap pointer and request and invalidate the component. Annoying but no show stopper.

But then the whole system crashed, ... graphics driver? Who knows. Probably. The AMD drivers are known to be ... problematic.

I rebooted, updated the AMD driver and rebooted again and in the short period of verification which followed I didn't see it happen again.

Except then I noticed something else - part of the application runs slower. Based on user input I do a bunch of deconvolutions on frequency-domain data, convert it back to spatial and merge the results to an image. It runs in another thread and on GNU/Linux it pretty much keeps up with as fast as the sliders move. Under the other system I noticed it was a bit laggy - nothing great but noticeable. The result of all this goes to a view which one can zoom using the scroll-wheel. Doesn't make any difference to the performance on GNU/Linux - it's just scaled during the paint() call, but it drops to a crawl on that other system if one zooms in a few times ... like down to 2-3 fps or something shocking.

As I was there I also tried the slideshow thing I wrote for JavaFX too. Apart from not traversing directories as I thought I was (bug), it was definitely slower there as well. Noticeable lags just as it started to transition in the next image. Possibly from GC, or from the blur effect.

I previously noticed other weirdness when I was playing with Glassfish based application a few months ago (in hindsight I should've just used the embedded webserver in SE, but i digress). I had few troubles getting it going on my system - just installed into ${HOME} and pointed netbeans at it - and just let it manage it. But all sorts of issues on the other system. In the end I had to use the command line tools to get anything to work (it was being used for development and running it, so having netbeans function would have been beneficial), and a few other bits of system pokery jiggery I shouldn't have needed.

Maybe it's just a Java thing, but shit just doesn't seem to work too well on that legacy system ... and that's before you count the unusual disk activity and long pauses opening applications.

Tagged java, rants.
Wednesday, 05 September 2012, 09:34

JavaFX Media, or not, and a Video Cube.

I got a bit stuck at work today (yesterday) so I took the arvo off. It was the warmest day post-winter (25 degrees, but windy) which was a good excuse, but I also had a lazy look at the JavaFX media stuff.

Actually as per the updates on the previous post I also discovered how to enable GL acceleration on AMD hardware, and it gave the appropriate speed-up - given the previous iteration was using Java2D, it shows just how decent Java2D is (Java2D is rather nice).

But as to the media player - nice api - pretty weird results. Despite using ffmpeg, it only supports a limited number of formats and one of the transcoded tv shows I put through it played back oddly. Sound was ok, cpu usage was very low, but it skipped most frames, and not in a particularly regular way either - making it uncomfortable to watch. Who knows, maybe it's a non-standard encoding or just a bug with the current implementation.

So I guess I will port over jjmpeg at some point - it's not like it should be much work. ... although sound looks as painful as ever ...

Video Cube

Ok I had another play with JavaFX this evening. It might not be as technically impressive as 9 Fingers, but this one is actually decoding the 6 videos concurrently (2xPAL MPEG TS, 4xlesser MP4), and runs smooth with only about 1.5 cpu cores busy (I7 6x core blah blah).

It looks a lot better in motion (actually I locked this orientation to get a screen capture - otherwise it updated too fast and mucked it up) ... seems to run 'full frame rate', with no tearing I can see. I'm not buffering any frames ahead of time or doing anything complex - but it seems to be keeping up with all the video decodes as well. I get the occasional jump if the system is busy - mostly netbeans displaying the debug logs.

Rather than using MediaPlayer I'm using jjmpeg to do the video decode as it supports more formats and plays back every frame properly on GNU/Linux. I just use a simple JJMediaReader loop, it runs libswscale to create PIX_FMT_BGRA at the original video resolution, and I set the pixels into the WritableImage using PixelFormat.getIntArgbPreInstance(). I'm using JavaFX to do the image scaling and all the rest. I do not decode (nor obviously; play) the sound.

Hah. It's Java.

Tagged hacking, javafx, jjmpeg.
Sunday, 02 September 2012, 03:37

JavaFX Slideshow

Just to play with JavaFX and to put it through it's paces I wrote a simple slide-show thing.

For a change of pace, i'll just paste the whole source code here.

Update: I noticed stack overflow was linking here, ... and that the code was broken because it was broken and also the final release of Java8 stopped ignoring null pointers in animations so it was broken again. I've fixed the code. Note that it still has the source location hard-coded to my home directory so that needs editing (ScannerLoader).

package au.notzed.slidez;

// this code is public domain

// This is not meant to be wonderful exemplary code, it was just
// my first experiment with JavaFX animations.

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.FileVisitor;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.logging.Level;
import java.util.logging.Logger;
import javafx.animation.FadeTransition;
import javafx.animation.Interpolator;
import javafx.animation.ParallelTransition;
import javafx.animation.PauseTransition;
import javafx.animation.ScaleTransition;
import javafx.animation.SequentialTransition;
import javafx.application.Application;
import javafx.collections.ObservableList;
import javafx.event.ActionEvent;
import javafx.event.EventHandler;
import javafx.scene.Node;
import javafx.scene.Scene;
import javafx.scene.image.Image;
import javafx.scene.image.ImageView;
import javafx.scene.layout.StackPane;
import javafx.stage.Stage;
import javafx.util.Duration;

/**
 * Simple slide show with transition effects.
 */
public class SlideZ extends Application {

    StackPane root;
    ImageView current;
    ImageView next;
    int width = 720;
    int height = 580;

    @Override
    public void start(Stage primaryStage) {
        root = new StackPane();

        root.setStyle("-fx-background-color: #000000;");

        Scene scene = new Scene(root, width, height);

        primaryStage.setTitle("Photos");
        primaryStage.setScene(scene);
        primaryStage.show();

        // Start worker thread, and kick off first fade in.
        loader = new ScannerLoader();
        loader.start();
        Image image = getNextImage();

        if (image != null)
            startImage(image);
    }
    ScannerLoader loader;

    public void startImage(Image image) {
        ObservableList<Node> c = root.getChildren();

        if (current != null)
            c.remove(current);

        current = next;
        next = null;

        // Create fade-in for new image.
        next = new ImageView(image);

        next.setFitHeight(height);
        next.setFitHeight(width);
        next.setPreserveRatio(true);
        next.setOpacity(0);

        c.add(next);

        FadeTransition fadein = new FadeTransition(Duration.seconds(1), next);

        fadein.setFromValue(0);
        fadein.setToValue(1);

        PauseTransition delay = new PauseTransition(Duration.seconds(1));
        SequentialTransition st;
        if (current != null) {
            ScaleTransition dropout;

            dropout = new ScaleTransition(Duration.seconds(1), current);
            dropout.setInterpolator(Interpolator.EASE_OUT);
            dropout.setFromX(1);
            dropout.setFromY(1);
            dropout.setToX(0.75);
            dropout.setToY(0.75);
            st = new SequentialTransition(
                new ParallelTransition(fadein, dropout), delay);
        } else {
            st = new SequentialTransition(
                fadein, delay);
        }

        st.setOnFinished(new EventHandler<ActionEvent>() {
            @Override
            public void handle(ActionEvent t) {
                Image image = getNextImage();

                if (image != null)
                    startImage(image);
            }
        });

        st.playFromStart();
    }

    @Override
    public void stop() throws Exception {
        loader.interrupt();
        loader.join();
        super.stop();
    }

    public static void main(String[] args) {
        launch(args);
    }
    BlockingQueue<Image> images = new ArrayBlockingQueue(5);

    Image getNextImage() {
        if (loader.complete) {
            return images.poll();
        }
        try {
            return images.take();
        } catch (InterruptedException ex) {
            Logger.getLogger(SlideZ.class.getName()).log(Level.SEVERE, null, ex);
        }
        return null;
    }

    /**
     * Scans directories and loads images one at a time.
     */
    class ScannerLoader extends Thread implements FileVisitor<Path> {

        // Directory to start scanning for pics
        String root = "/home/notzed/Pictures";
        boolean complete;

        @Override
        public void run() {
            System.out.println("scanning");
            try {
                Files.walkFileTree(Paths.get(root), this);
                System.out.println("complete");
            } catch (IOException ex) {
                Logger.getLogger(SlideZ.class.getName())
                    .log(Level.SEVERE, null, ex);
            } finally {
                complete = true;
            }
        }

        @Override
        public FileVisitResult preVisitDirectory(Path t, BasicFileAttributes bfa)
            throws IOException {
            return FileVisitResult.CONTINUE;
        }

        @Override
        public FileVisitResult visitFile(Path t, BasicFileAttributes bfa)
            throws IOException {
            try {
                Image image = new Image(t.toUri().toString(),
                    width, height, true, true, false);

                if (!image.isError()) {
                    images.put(image);
                }
            } catch (InterruptedException ex) {
                Logger.getLogger(SlideZ.class.getName())
                    .log(Level.SEVERE, null, ex);
                return FileVisitResult.TERMINATE;
            }
            return FileVisitResult.CONTINUE;
        }

        @Override
        public FileVisitResult visitFileFailed(Path t, IOException ioe)
            throws IOException {
            return FileVisitResult.CONTINUE;
        }

        @Override
        public FileVisitResult postVisitDirectory(Path t, IOException ioe)
            throws IOException {
            return FileVisitResult.CONTINUE;
        }
    }
}

The only 'tricky bit', if you could call it that, is the use of the Java 7 FileVistitor and a thread to load the files incrementally and asynchronously. Rather than try to make assumptions about the filename I just attempt to load every file I find and let the Image object tell me if it was valid or not.

But the JavaFX bit is pretty simple and straightforward, and i'm looking forward to playing with it further. It's given me a few ideas to try when I have some time and the weather isn't so nice as it is this weekend.

I'm not sure if the rendering pipeline is completely `GPU accelerated' yet on AMD hardware - the docs only mention that '3D' is only on NVidia so far.

If it isn't then the performance is ok enough - the CPU on this box is certainly capable of better mind you.

If it is, then it needs a bit more work. It can keep up ok with the simple fade and scale transitions i'm using at 580p, but adding a blur drops it right down, and trying to run it 1920x1200 results in a pretty slow frame-rate and lots of tearing.

Every JavaFX application also crashes with a hot-spot backtrace when they finish.

But the "main" problem with learning more JavaFX at this point for me is that it's going to make it more painful to maintain any Swing code I have, and it will make Android feel even more funky than it does already.

Update: So i've confirmed the GPU pipeline is not being used on my system. Bummer I guess, but at least the performance i'm getting is ok then.

If one sets -Dprism.verbose=true as a VM argument, it will print out which pipeline it uses.

Update 2: I found another option -Dprism.forceGPU=true which enables the the GPU pipeline on the AMD proprietary drivers I'm using. Oh, that is much better. Added a gaussian blur-in and ran it full-screen and it keeps up fine (and so it should!). There's a jira bug to enable it, so I presume it isn't too far off in the main release.

Update 3: I've done another one with a more sophisticated set of animated-tile transitions as JavaFC Slidershow 2.

Tagged code, hacking, javafx.
Friday, 31 August 2012, 07:51

JavaFX + webcam, Android n shit.

For Friday Follies I thought i'd have a look at setting up a test harness for mucking about with object detection and so on on a webcam outside of OpenCL - and for a change of pace I thought i'd try JavaFX at last - now it's included in release Java runtime for all platforms it's time I had a look.

Well, i'm not sure i'm a fan of the squashed round-cornered buttons - but it does seem nice so far. No more need to write an 'OverlayView' to draw boxes on the screen. After fighting with some arse-pain from Android for the last couple of weeks it's nice to go back to Java proper again too.

One of the big oversights in the earlier incarnations - being able to load in-memory pixmaps into the scene - has now been fixed, and the solution is simple and elegant to boot.

For example, I hooked up video 4 linux for java to display webcam output with only 1 line of code (outside the rest of the v4l4j stuff).

  writableImage.setPixels(0, 0, width, height,
      PixelFormat.getByteRgbInstance(), videoFrame.getBytes(),
      0, width);

And all I need to do is add the writableImage to the scene. (occasionally when i start the application the window doesn't refresh properly, so i'm probably missing something - but it's quite fine for testing). It would be nice to have some YUV formats, but that should be possible with the api design.

Of course, I could do something similar with Swing, but that involved poking around some internals and making some assumptions - not to mention breaking any possible acceleration - and I still had to invoke a refresh manually.

There's a new set of concurrency stuff to deal with, but at least they seem fairly straightforward.

Let Swing ... swing?

I actually quite liked Swing. It did have a lot of baggage from the awt days but it was a nice enough toolkit to write to, and almost every component was re-usable and/or sub-classable which saved a heap of development effort (now that i'm claiming it was perfect either). The first time I used it in anger i'd just come from a few years fighting with that microsoft shit (jesus, i can't even remember what it was called now, it had fx in the name/predated silverlight) and coming to a toolkit with decent documentation, fewer performance issues, fewer bugs, the ability to subclass rather than forcing you to compose, and the friggan source-code was a big relief. (Not to mention: no visual studio).

But from what little I know about it so far, JavaFX sounds pretty good and I don't think i'll be missing Swing much either. At least now they've made it a proper Java toolkit, and fixed some shortcomings. The main benefit I see is that it eschews all the nasty X and native toolkit shit and just goes straight to the close-to-hardware apis rather than having to try to hack them in after the fact (although I found Swing ok to use from a developer standpoint, I would not have enjoyed working on it's guts too much).

I'll definitely have to have more of a play with it, but so far it looks nice.

Android ...

Or rather, the shit lifecycle of an Android window ...

Until recently I just hadn't had time to investigate and implement orientation-aware applications - so all my prototypes so far were just working with a fixed orientation. Seems like an ok toolkit when used this way - sure it has some quirks but you get decent results for little effort.

But then I looked into supporting screen rotation ... oh hang on, that window object on which you were forced to write all your application code, suddenly goes away just because the user turned the phone?

Of course, the engineers at Google have come up with other solutions to work around this, but they all just add weird amounts of complication to an already complicated situation.

AsyncTask

So when I previously thought AsyncTask was a neat solution to a simple problem, I now realise it just makes things more complicated because of the lifecycle issue. Now there's some AsyncLoader mess which can be used to replace what you might've used an AsyncTask for - but with about 5x as much boiler-plate and lots of 'I don't really get what's going on here' crap to deal with as well. Most of that 'not knowing' goes away with time, but the learning curve is unnecessarily steep.

AsyncTask is simple because it's just a bit more around Runnable, saving you having to runOnGuiThread explicitly. AsyncLoader on the other hand is a whole 'lets pretend static resources are bad unless it's part of the toolkit' framework designed to separate your application from it's source code. i.e. it adds hidden magic. Hidden magic isn't abstraction, it's dumb.

You've been fragged!

Then take fragments. Go on ... please? So as if tying application logic to user windows wasn't enough, now one does the same for individual components of windows. A large part of why they exist just seems to be to be able to support orientation changes whilst preserving live objects - but it seems a pretty round-about way of doing it (they are also purportedly for supporting layout changes, but you can already do that ...).

But one of the biggests problem is the documentation. For the few times there is any, it is way out of date - when you find an official article explaining how to do something you soon find out that way was deprecated when you go to the class doco ... Obviously google are penny pinching on this one, you'd think with their resources they could do a better job to keep everything in sync - even Sun did a much better job and Oracle are probably better again.

To be honest, I think main issue I have personally with Fragments is the name itself - I kind of get what they're for, building activities from parts - but obviously all the meaningful names were taken (hell, even 'widget' means more). But the name 'fragment' just has no useful meaning in this context - fragments are what you have left over after you've blown shit up . Fragments are what a plate becomes after dropping one on a cement floor. Or what was left of the longneck I left too long in the freezer the other night (well, fragments and beery ice).

It is not a word of which things are constructed, it's just the mess left after some catastrophic mechanical failure.

(and the fact there are UI+logic, and logic-only 'fragments' surely does not help either)

The GoogleWay(tm)

Unfortunately, this all just forces you to code exactly the way google wants you to - which is fine when the problem suits, but it's a bit of a pain to learn just what that `way' actually is in the first place ...

In the end I just had to put everything in a database, and pass around id's to keep track of the state. Although it works there are still some complications. For example if you have a fragment which maintains its state during an orientation change, you then have to presumably code your activity to only load the state if it doesn't already have one, or fuck around with some other crap in the fragment to deal with it. I think my code just ends up reloading everything anyway, but at least it's on another thread and the user doesn't notice. If you set your object id's properly in your list adapter (now there's another 'lets mix different types of code' mess) then it mostly just works.

Once you've finally done all that ... then at least it does work out more or less, and trying to implement the 'proper' lifecycle stuff (whatever that is) becomes a whole lot easier too.

Tagged android, java, javafx, rants.
Wednesday, 29 August 2012, 00:37

Internode Streaming Radio App

So last night I threw together a simple application for accessing my ISP's streaming radio proxies from android. Now I have a box plugged into my stereo, suddenly I find a use for such a programme.

It can be found over on a page on my Internode account. I don't have time to up the source, but if i work on other additions (which personally I don't need), i will probably create a project somewhere for it.

You'd really think playing a remote playlist was a pretty basic function - but it took less time to write this (I just made some very small changes to the RandomMusicPlayer from the samples) than to try to find one in the 'play' store that bloody worked. It's not that they were just clumsy, ugly, really slow (for no apparent reason) - I didn't get a single beep out of them either.

I actually first mucked about with JJPlayer from jjmpeg and did get it working playing music from a selected .m3u file in a browser, but I decided that was a bit too hacky and clumsy. And besides if MediaPlayer worked with the internode streams, I would just use that. They did so that's what I went with. The RandomMusicPlayer sample I started with pretty much does all the hard work - it runs as a service, works in the background (even with the Mele 'off'!!!), and so on.

I put a link to it on my internode home page - which has been sitting idle for a few years.

Note: These streams are only available to Internode customers, a local ISP. There's no point looking at this if you're not one.

Also, I am not affiliated with Internode in any way.

But I am off to listen to some retro electronica ...

Update: I updated the app to be a bit more properer, and wrote another post about it.

Update: Moved the home page.

Tagged android, jjmpeg.
Saturday, 25 August 2012, 08:37

Object detector in action

Well I really wanted to see if the object detector I came up with actually works in practice, and whether all that NEONifying was worth it. Up until now i've just been looking at heat-maps from running the detector on a single still image.

So I hacked up an android demo (I did start on the beagleboard but decided it was too much work even for a cold wintry day), and after fixing some bugs and summation mathematics errors - must fix the other post, managed to get some nice results.

A bit under 10fps. And for comparison I hooked up some code to switch between my code or the android supplied one.

Well under 2fps.

I was just holding the "phone" up to a picture on my workstation monitor.

Some information on what's going on:

A good 6x performance improvement is ok by me using these conservative search parameters. For example if I simply change the scale step to 1.2x, it ends up 10x faster (with still usable results). Or if I increase the minimum search size to 3x17 instead of 2x17, execution time halves. Doing both results in over 25ps. And there's still an unused core sitting idle ...

As listed above - this represents the worst-case performance. Unlike viola & jones whose execution time is dynamic based on the average depth of cascade executed.

But other more dynamic algorithms could be used to trim the execution time further - possibly significantly - such as a coarser search step, or a smaller subset/smaller scale classifier used as a 'cascaded' pruning stage. Or they could be used to improve the results - e.g. more accurate centring.

Update: Out of curiosity I tried an 8x8 detector - this executes much faster, not only because it's doing under 25% of the work at the same scale, but more to the point because it runs at smaller image sizes for the same detected object scale. I created a pretty dumb classifier using scaled versions of the CBCL face set. It isn't quite reliable enough for a detector on it's own, but it does run a lot faster.

Tagged android, hacking.
Friday, 24 August 2012, 00:54

The second golden age of personal computing.

With interest and some excitement (despite it being Nokia Nikon and having some negative experiences with them), I just read that Nokia have announce an Android powered camera. Of course, it's all ostensibly about feacebook and twatter, but one can easily see a world of more practical use to an appliance on which you can install customised software. Let's just hope that a DSLR isn't too far behind - although camera vendors may fear having the shift-ful proprietary software they took years to develop improved upon and obsoleted by someone else.

It is rather ironic that whilst desktop computer operating systems are locking down, dumb-stupifying and generally turning general purpose machines into appliances, appliance makers are opening up, complicating, and turning appliances into general purpose computers.

And at the heart of it, it's all thanks to Free Software. 'Onya RMS.

And it's no doubt also thanks to the purest form of capitalist economics. It is simply cheaper to use a free operating system than it is to write your own or purchase a per-appliance non-free one.

First Golden Age

I consider the first Golden Age of personal computers to the time before Microsoft Windows and their illegally attained `wintel' monopoly.

So these developments are clearly also a result of breaking the back of the wintel monopoly and the USA based global technology plutocracy.

Who would have thought that a bit of competition would benefit we the little people?

Second Golden Age

So I feel we're entering another 'golden age' of computing: where anybody with the skills can improve upon or replace the usually shitty software that comes with any device - as every device is now powered by substantial software components this equates to a lot of computers. With any luck I will never have to purchase a locked-down throw-away appliance ever again ...

However, I fear as with all golden ages, this one will also not last and eventually they will try to pull a M$ on us ...

Ultimately however economics will force their hand, and any such tactics will prove too costly to maintain.

But what cost to society will such experiments first extract?

Update: Thanks Sankar, yes Nikon, too much Tomi on my brain that morning. And of course, Samsung have followed up with announcing their Android camera ... yum.

Tagged android, free software, gnu, politics.
Saturday, 18 August 2012, 04:46

Maths to the rescue

I had a bit of a think about the expression I am evaluating and I noticed I could use logarithms to perform the calculation instead of multiplies. And actually, as I don't use the value itself other than as a threshold, the summation result can be used directly anyway - it just wont give as sharp peaks.

First, changed the inner loop to perform a summation of two possible values, in 16 bits. This removed the need for 32-bit masks, and the floating point multiplies (not that they were holding anything up). This took 2 classifiers down from 26 instructions to 18, and a test-case from 100 to 70ms. It also freed 3 quad registers but I haven't yet investigated whether this will let me do more (actually it will let me take the loop down to 16 instructions as I have 2 immediate loads in there).

However, this morning I had a closer look at the summation I now have.

  r = sum m { testbit ? p : n }                        (1)

The bit testing code can't be further improved, but perhaps the select and sum can?

First, subtract n from the expression and move it outside of the select, and then separate the summation.

  r = sum m { (testbit ? (p - n) : 0) - n }
    = sum m { testbit ? (p - n) : 0 } - sum m { n }
    = sum m { testbit ? (p - n) : 0 } - mn             (2)

Which leaves with a slightly simpler internal summation and a constant. This expression can be implemented without a select instruction but since we have select in NEON it doesn't gain any instructions on it's own.

But this is where a neat trick of binary arithmetic comes in to play. Unlike C logic which uses 0 and 1 for false and true, the result of a bit test in SIMD code (and opencl c on vectors) is 0 for false, and ~0 for true. But in two's compliment signed binary, ~0 is also equal to -1.

So ... if the expression can be converted into a sum of -1's, then all one needs to do is directly add up the result of the bit tests, and avoid the select altogether. It is rather trivial to convert (2) into this form (if I have my summation identities on target anyway).

  r = sum m { (n-p) (testbit ? -1 : 0) } - mn
    = (n-p) sum m { testbit ? -1 : 0 } - mn
    = A sum m { testbit ? -1 : 0 } + B                 (3)

Where A and B are constants.

Actually there is another benefit here, as the summand is either 0 or -1 the range of the working sum is far more limited. Because of the size of the template I couldn't get away with 8 bit arithmetic without the danger of overflow - if I was using normal 8 bit two's complement arithmetic. But NEON to the rescue again, it has saturating instructions which will not overflow. It may still lead to a slightly incorrect answer - but i'm not interested in the absolute result but the relative one, and i think it should suffice.

I have yet to code this up, but if it works this will get the two-classifier case down to only 12 instructions - for 8 pixels. Less than one instruction per classifier per template, or about one when including the data loading as well.

And as this all now runs in double registers instead of quad registers, and I don't need n or p in the inner loop, it freeds up a whole swag of registers. These can be used to unroll more of the loop or share more of the data, for example by calculating more than 8 results per pass, more than one classifier at once, and so on. This might squeeze out a tad more juice out of it, but i'm not expecting much more here.

Update: So I coded it, it works. I'm calculating 32 locations at once, and there's still a good few registers free. Test case down from 70ms to 50ms.

Even better, I noticed the 'limited dual issue' capability of the NEON processor could be exploited in a couple of cases. Interleave the vtbl with some shifts: down to 43ms. Interleave the vqadd's with some vext's: down to under 41ms. Noice.

For comparison, an integer version in C (gcc, -O2) takes about 700ms.

Time for beer.

Tagged beagle, hacking.
Newer Posts | Older Posts
Copyright (C) 2019 Michael Zucchi, All Rights Reserved. Powered by gcc & me!