“What the Robot Saw” is in its initial implementation. The research aspects are still evolving, especially with regard to how WTRS uses neural nets cinematically. Within the current implementation, I hope to be able to refine the neural net-informed clip editing, grouping, and sound design as time and funding for upgraded server hardware permit, so look for more about this in the future. But here’s a brief description of how the system currently works:
The back end is written in Python. It uses YouTube’s API to filter clips based on low view and subscriber counts. It then uses a few different classical and neural net-based computer vision algorithms, developed using OpenCV and TensorFlow, to make clip selection (filtering out unsuitable clips) and editing decisions. (The automated editing itself is done with FFmpeg.)
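The low-view filtering step can be sketched as follows. This is an illustrative sketch, not the production code: the dicts mirror the shape of the `statistics` objects returned by the YouTube Data API’s `videos.list` and `channels.list` endpoints (requested with `part='statistics'`), and the threshold values are hypothetical.

```python
def is_low_profile(video_stats, channel_stats, max_views=100, max_subs=50):
    """Return True if a video looks obscure enough to consider.

    `video_stats` and `channel_stats` mirror the `statistics` objects the
    YouTube Data API returns (videos.list / channels.list, part='statistics').
    Counts arrive as strings, and hidden counts may be absent entirely.
    """
    views = int(video_stats.get("viewCount", 0))
    subs = int(channel_stats.get("subscriberCount", 0))
    return views <= max_views and subs <= max_subs


# Example API-shaped inputs (hypothetical values):
video = {"viewCount": "12", "likeCount": "0"}
channel = {"subscriberCount": "3", "hiddenSubscriberCount": False}
```

A popular video, say one with a `viewCount` of `"5000"`, would be rejected by the same check.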
An image classification algorithm based on a neural network pre-trained on ImageNet is currently used to organize the clips into sequences, creating a stream-of-consciousness evolution that moves through clusters of related content — as the robot “saw” it, and as the human director (Amy Alexander) scripted the evolution. Some of these clusters become scenes in the film and are introduced by intertitles: the Robot’s attempt to poetically interpret the topic of the scene — a topic dictated by the peculiar priorities of its AI (ImageNet) training. Haar Cascade-based face and eye detection algorithms are used to generate the looping clips, focused on people’s eyes and mouths, that appear periodically (usually in clusters) throughout the stream.
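The clustering idea — ordering clips so related content plays back-to-back — can be illustrated with a small sketch. The classifier itself is stubbed out here: in a real pipeline each clip’s label might come from a pretrained ImageNet model (e.g. via `tf.keras.applications` and its `decode_predictions` helper), and the clip names and labels below are hypothetical.

```python
def sequence_by_label(clip_labels):
    """Group clips by their top classification label, preserving first-seen
    label order, so related clips form loose 'scenes' in sequence.

    `clip_labels` is a list of (clip_path, label) pairs, where the label is
    assumed to come from an ImageNet-pretrained classifier.
    """
    clusters = {}
    for clip, label in clip_labels:
        clusters.setdefault(label, []).append(clip)
    # dicts preserve insertion order, so scenes follow first appearance.
    return list(clusters.items())


# Hypothetical (clip, top-1 label) pairs:
labeled = [
    ("clip_a.mp4", "tabby"),
    ("clip_b.mp4", "snorkel"),
    ("clip_c.mp4", "tabby"),
]
```

Calling `sequence_by_label(labeled)` yields a “tabby” scene containing two clips followed by a one-clip “snorkel” scene.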
Neural net-informed sound design currently uses inaSpeechSegmenter to analyze videos for music and speech content, which the Robot then processes accordingly. (Amy Alexander has also developed a publicly available minor revision of inaSpeechSegmenter, inaSpeechSegmenterAJA, which allows for faster processing in production environments by providing options to analyze partial files and skip speaker gender detection.)
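inaSpeechSegmenter returns a segmentation as a list of `(label, start, end)` tuples. A downstream step of the kind described above — measuring how much of a clip is music versus speech before deciding how to process it — can be sketched like this; the segmentation tuples below are hypothetical, and the commented-out lines show the library’s actual entry point rather than code run here.

```python
def label_durations(segmentation):
    """Sum the seconds assigned to each label in an inaSpeechSegmenter-style
    segmentation: a list of (label, start_sec, end_sec) tuples."""
    totals = {}
    for label, start, end in segmentation:
        totals[label] = totals.get(label, 0.0) + (end - start)
    return totals


# Real usage would be roughly (not run in this sketch):
#   from inaSpeechSegmenter import Segmenter
#   segmentation = Segmenter()("some_video.mp4")
segmentation = [("music", 0.0, 4.5), ("speech", 4.5, 9.0), ("music", 9.0, 10.0)]
```

Here the totals would show 5.5 seconds of music against 4.5 seconds of speech, which a sound-design pass could use to pick a treatment for the clip.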
The front end (graphics) and sound mix are generated using Max/MSP/Jitter. The graphics “engine” is based on visual performance software developed by Amy Alexander and Curt Miller for PIGS (Percussive Image Gestural System) performances. Thus, hand-drawn and percussion-triggered gestures, performed and “recorded” (like a player piano roll) for the film, are at the root of the animations. Layered on top of these basic animated gestures are numerous algorithms dealing with animation timing, image processing, sound, and so on, as well as additional (non-animated) video layers. The animation algorithms respond to various aspects of the sound and, in some cases, the visual content.