klf-cv

motion detection via h.264 compressed-domain analysis

completed · started 2019 · rebuilt december 2025

problem

The Konrad Lorenz Research Station (University of Vienna) has multiple terabytes of video data — continuous wildlife footage from fixed cameras monitoring ravens. 1TB translates to roughly 85 days of non-stop video.

Researchers manually watch and annotate behavioral events. Most footage contains nothing, punctuated only by brief moments of activity. Missing short events is not an option, which makes the work both time-consuming and mentally demanding.

sample footage from the station


goal

Automatically detect activity segments.


first attempt (2019): gpu frame differencing

Stack: Docker, OpenCV, NumPy, CuPy (CUDA), PyNvVideoCodec (NVDEC hardware decoding).

Decode every frame, compute pixel-by-pixel differences between consecutive frames. Areas with large differences indicate motion.
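
A minimal sketch of that loop, using plain OpenCV on the CPU rather than the NVDEC/CuPy path; the threshold here is an arbitrary placeholder, not a tuned value:

```python
import cv2

def motion_scores(path, threshold=25):
    """Per-frame motion score: fraction of pixels whose absolute difference
    to the previous frame exceeds a (placeholder) threshold."""
    cap = cv2.VideoCapture(path)
    scores = []
    prev = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            diff = cv2.absdiff(gray, prev)                   # |frame_t - frame_{t-1}| per pixel
            scores.append(float((diff > threshold).mean()))  # fraction of "moving" pixels
        prev = gray
    cap.release()
    return scores
```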

frame differencing output — bright areas show motion

This works, but requires significant compute. I pursued several optimizations.

optimization 1: downsampling

Reduced resolution from 1440×1080 to 360×270, converted RGB to grayscale.

original 1440×1080 RGB → resized 360×270 RGB → grayscale 360×270

stage                       per frame   full video (41,250 frames)
1440×1080 RGB               4.67 MB     192 GB
360×270 RGB                 292 KB      12 GB
360×270 gray (float16)      194 KB      8 GB
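
The downsampling step itself is small; a sketch of what it amounts to (the actual pipeline does this on the GPU after hardware decode):

```python
import cv2
import numpy as np

def preprocess(frame_bgr):
    """Full-resolution BGR frame -> 360x270 grayscale float16 (~24x less data per frame)."""
    small = cv2.resize(frame_bgr, (360, 270), interpolation=cv2.INTER_AREA)
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    return gray.astype(np.float16)
```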

optimization 2: frame batching

Split the 27:30 video into 4 equal segments (~7 min each). Sample one frame from each segment's current position, tile all 4 into a single GPU buffer. This allows processing 4 frames simultaneously — one from each video segment in parallel.

Video split into 4 segments (processed in parallel):

Segment 1 | Segment 2 | Segment 3 | Segment 4 (boundaries at 0:00, 6:52, 13:45, 20:37, 27:30)
f_seg1(t), f_seg2(t), f_seg3(t), f_seg4(t) → GPU(f₁, f₂, f₃, f₄)
tile 4 frames into a single buffer → process as one batch → 4× throughput

GPU buffer → Parallel difference computation:

[diagram: the 4 segment frames at t and the 4 at t+1 sit in one GPU buffer; the 4 differences are computed as a single batched operation]
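
A sketch of the batching idea with CuPy; the shapes and the surrounding frame-reading code are illustrative assumptions, not the actual implementation:

```python
import cupy as cp

def batched_diff(frames_t, frames_t1):
    """frames_t / frames_t1: 4 grayscale frames (one per segment), e.g. shape (270, 360) each."""
    batch_t  = cp.stack([cp.asarray(f, dtype=cp.float32) for f in frames_t])   # (4, 270, 360) on the GPU
    batch_t1 = cp.stack([cp.asarray(f, dtype=cp.float32) for f in frames_t1])
    diff = cp.abs(batch_t1 - batch_t)            # one kernel over all 4 segments
    return diff.mean(axis=(1, 2)).get()          # one motion score per segment
```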

optimization 3: frame skipping

Process at 2 fps instead of 25 fps. Tradeoff: events shorter than about 500 ms can fall between sampled frames and get missed. I suspect those are hard to catch manually too.

[diagram: processed vs. skipped frames, 25 fps → 2 fps (12× fewer frames)]
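
The sampling itself is just an index stride (a sketch; 25/2 is not an integer, so the stride rounds to 12, i.e. ~2.1 fps in practice):

```python
src_fps, target_fps = 25, 2
total_frames = 41_250                            # 27:30 of video at 25 fps
stride = round(src_fps / target_fps)             # 12
frame_indices = range(0, total_frames, stride)   # only these frames get decoded and diffed
```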

subject detection pipeline

After identifying motion regions, detect actual subjects (birds) to filter false positives from wind, shadows, and lighting changes.

Step 1: Sparse sampling — pick one peak frame per activity region

[motion curve, time axis 0:00 / 0:18 / 0:50 / 1:20]

Motion curve shows activity over time. Green vertical lines mark selected frames at peak activity in each region. Only 2 frames selected from 80 seconds of video.
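
Picking one peak frame per activity region can be done with scipy's peak finder; in this sketch the height and distance thresholds are placeholders, not the tuned values:

```python
import numpy as np
from scipy.signal import find_peaks

def select_peak_frames(motion_curve, fps=2, min_gap_s=10, min_activity=0.05):
    """Return one representative frame index per activity region (placeholder thresholds)."""
    curve = np.asarray(motion_curve)
    peaks, _ = find_peaks(curve, height=min_activity, distance=int(min_gap_s * fps))
    return peaks  # indices into the 2 fps frame sequence
```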

Step 2 & 3: Segment selected frames, classify segments

3 frames → Meta SAM (segment) → ~12 segments → ResNet-50 (classify) → 2 birds

Meta SAM segments each frame into regions. ResNet-50 (pretrained on ImageNet) classifies each segment. Only segments classified as "bird" with high confidence are kept.
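
A sketch of steps 2 and 3. The checkpoint path, the confidence threshold, and the BIRD_LABELS set are placeholders; ImageNet-1k has bird species rather than a single "bird" class, so some mapping from species labels to "bird" is assumed:

```python
import numpy as np
import torch
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
from torchvision.models import resnet50, ResNet50_Weights

BIRD_LABELS = {"magpie", "jay", "kite"}   # hypothetical subset of ImageNet labels treated as "bird"

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")   # placeholder checkpoint path
mask_gen = SamAutomaticMaskGenerator(sam)

weights = ResNet50_Weights.IMAGENET1K_V2
classifier = resnet50(weights=weights).eval()
preprocess = weights.transforms()
categories = weights.meta["categories"]

def birds_in_frame(frame_rgb: np.ndarray, min_conf: float = 0.6):
    """Segment a frame with SAM, classify each segment with ResNet-50, keep confident bird hits."""
    hits = []
    for mask in mask_gen.generate(frame_rgb):            # one dict per segment
        x, y, w, h = (int(v) for v in mask["bbox"])      # bbox is [x, y, w, h]
        crop = Image.fromarray(frame_rgb[y:y + h, x:x + w])
        with torch.no_grad():
            probs = classifier(preprocess(crop).unsqueeze(0)).softmax(dim=1)[0]
        conf, idx = probs.max(dim=0)
        if categories[idx] in BIRD_LABELS and conf >= min_conf:
            hits.append((mask["bbox"], categories[idx], float(conf)))
    return hits
```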


new approach (2025): h.264 already computed this

Stack: Docker, FastAPI, PostgreSQL, React, ffmpeg, mvextractor, PyAV.

While debugging the pipeline above, I kept hitting out-of-memory errors: uncompressed frames exhausted RAM and swap, and the 256 MB video file expanded to gigabytes in memory. I knew H.264 used motion vectors but hadn't considered extracting them directly.

H.264 doesn't store frames as RGB matrices. It uses two main concepts to reduce storage (motion vectors and the discrete cosine transform), and the first of those is what solves the problem here.

Decoding starts at an I-frame, then each P-frame applies its motion vectors and residuals to reconstruct the next.

I  P  P  P  P  P   I  P  P  P  P  P
I-frame (~50 KB): full image. P-frame (~10 KB): motion relative to the previous frame.

motion vectors (temporal compression)

Instead of storing pixels, P-frames store how 16×16 blocks moved from the previous frame. A vector like (3, -2) means "copy this block from 3 pixels right, 2 pixels up." The encoder searches for the best matching block and stores only the displacement.

This is why motion detection data is already embedded in the file — we just need to extract it.
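
Extraction is correspondingly small. A sketch using mvextractor's VideoCap; the column layout of the returned motion-vector array is from my reading of the library's documentation, and the per-frame score is a placeholder:

```python
import numpy as np
from mvextractor.videocap import VideoCap

def motion_per_frame(path):
    """Mean motion-vector magnitude per frame, read from the H.264 stream without full decode."""
    cap = VideoCap()
    if not cap.open(path):
        raise RuntimeError(f"could not open {path}")
    scores = []
    while True:
        ok, frame, motion_vectors, frame_type, timestamp = cap.read()
        if not ok:
            break
        if len(motion_vectors) == 0:                   # I-frames carry no vectors
            scores.append(0.0)
            continue
        mv = motion_vectors.astype(np.float32)
        dx = mv[:, 7] / mv[:, 9]                       # motion_x / motion_scale
        dy = mv[:, 8] / mv[:, 9]                       # motion_y / motion_scale
        scores.append(float(np.hypot(dx, dy).mean()))  # average displacement over all macroblocks
    cap.release()
    return scores
```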

discrete cosine transform (spatial compression)

I-frames and P-frame residuals are compressed using the Discrete Cosine Transform. Each 8×8 block is represented as a weighted sum of 64 basis patterns. High-frequency coefficients are typically near zero; quantization rounds them to zero, and the resulting runs of zeros compress to almost nothing (run-length encoding).

[figure: the 64 fixed DCT basis patterns (same for all blocks)]

coefficients (weights), 64 values:

128  45  23  12   8   3   1   0
 42  28  15   7   4   2   1   0
 18  11   6   3   1   0   0   0
  9   5   2   1   0   0   0   0
  4   2   1   0   0   0   0   0
  2   1   0   0   0   0   0   0
  1   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0

after quantization, ~18 non-zero values are stored:

128  45  23  12   8   3   0   0
 42  28  15   7   4   0   0   0
 18  11   6   0   0   0   0   0
  9   5   0   0   0   0   0   0
  4   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0

[animation: IDCT reconstruction of an 8×8 block of the letter "A" from its DCT coefficients. Left: result so far; center: weighted pattern being added; right: basis pattern × coefficient.]
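
The compression idea behind the figure is easy to reproduce numerically with scipy; the smooth test block and the triangular keep-mask below are crude stand-ins for real image data and real quantization:

```python
import numpy as np
from scipy.fft import dctn, idctn

block = np.add.outer(np.arange(8.0), np.arange(8.0)) * 16   # a smooth 8x8 ramp, values 0..224

coeffs = dctn(block, norm="ortho")                          # 64 DCT coefficients
rows, cols = np.indices(coeffs.shape)
keep = rows + cols <= 4                                     # keep only the low-frequency corner (15 values)
approx = idctn(np.where(keep, coeffs, 0.0), norm="ortho")   # reconstruct from the kept coefficients

print("max reconstruction error:", np.abs(block - approx).max())
```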

what's actually stored in the file

I extracted the motion vectors and residuals from the compressed stream and rendered them as videos for visualization:

motion vectors

residuals

advantages

The motion information is read directly from the compressed stream: no full decode, no gigabytes of uncompressed frames in memory, and no GPU required.

result

The target machine has a 4-core CPU and no GPU. Frame differencing would run slower than realtime — processing 1TB would take months.

                                  frame differencing            MV extraction
dev machine (16-core, GPU)        17× realtime (5 days/TB)      168× realtime (12 hours/TB)
target machine (4-core, no GPU)   0.7× realtime (4 months/TB)   52× realtime (1.5 days/TB)
[demo placeholder]

context

Built in cooperation with the University of Vienna. The Konrad Lorenz Research Station studies corvid behavior — specifically ravens and their social dynamics.

Behavioral science is outside my background. The signal processing problem was interesting, and it felt good to build something that might support actual research.