I don't have the V4, but I have the V3 and I've tinkered with its firmware, so I think I can answer your questions:
a) a series of still-images (sorta like frames but spaced out by more time or the same amount of time)
No, the data returned with the standard firmware is a collection of coordinates for up to 8 tracked objects, each with its matched colour range. So for example, if you have light red to dark red (in RGB values) set as colour range 1, it will do pixel-based matching on those colours and, where pixels match, try to find the smallest enclosing rectangle around them. It can track up to 8 objects at any time and holds up to 8 colour ranges.
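To make the matching idea above a bit more concrete, here is a rough sketch in Python. It is not the camera's firmware, and the function names (in_range, bounding_box) are made up for illustration. It assumes the image is a list of rows of (R, G, B) tuples and, unlike the real camera, it lumps every matching pixel into a single object instead of separating up to 8 of them.

```python
# Hypothetical sketch of colour-range matching plus a bounding box.
# This is NOT the NXTCam firmware; names and data layout are assumptions.

def in_range(pixel, lo, hi):
    """True if every RGB channel of pixel lies between lo and hi (inclusive)."""
    return all(l <= p <= h for p, l, h in zip(pixel, lo, hi))

def bounding_box(image, lo, hi):
    """Return (left, top, right, bottom) around all matching pixels, or None.

    image is a list of rows, each row a list of (R, G, B) tuples.
    All matches are treated as one object, which is a simplification.
    """
    box = None
    for y, row in enumerate(image):        # one scan line at a time
        for x, pixel in enumerate(row):
            if not in_range(pixel, lo, hi):
                continue
            if box is None:
                box = [x, y, x, y]         # first match starts the box
            else:
                box[0] = min(box[0], x)    # grow left
                box[2] = max(box[2], x)    # grow right
                box[3] = y                 # rows are visited top to bottom
    return tuple(box) if box else None
```

With "colour range 1" set to something like lo = (150, 0, 0) and hi = (255, 60, 60), a frame with a few red pixels gives back the corners of the box around them, which is roughly the kind of data you get per tracked object.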
b) a video processed in real-time (as it is recorded)
It is live, at about 30fps.
c) a video processed after the fact (like a video file that gets processed a few seconds later)?
No, it doesn't store anything. The data is pulled from the camera chip into the ATMega, which processes the image line by line.
d) Is there sound? if there is, how is that processed?
No sound.
There is a way to make the camera return actual pixel data, but that's a slow process because you are limited by how quickly the NXT can pull the data over the I2C connection.
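To get a feel for why raw pixel data is slow, here is a back-of-the-envelope estimate in Python. The frame size, bytes per pixel and bus speed below are placeholder numbers I picked for illustration, not measured specs of the camera or the NXT, and the function name is made up.

```python
# Rough transfer-time estimate; all numbers passed in are assumptions.

def seconds_per_frame(width, height, bytes_per_pixel, bus_bits_per_second):
    """Time to move one raw frame over a serial bus, ignoring protocol overhead."""
    payload_bits = width * height * bytes_per_pixel * 8
    return payload_bits / bus_bits_per_second

# Placeholder values: a 100x100 frame, 1 byte per pixel, 9600 bit/s bus.
estimate = seconds_per_frame(100, 100, 1, 9600)
print(f"about {estimate:.1f} s per frame")
```

Under those assumptions a single frame takes several seconds, and real I2C traffic adds addressing and acknowledge overhead on top of that. Compare that with the 30fps the camera manages when it only has to report a handful of coordinates.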
So the main question is, what do you intend to do with it?
- Xander