How

“What the Robot Saw” is in its initial implementation. The research aspects are still evolving, especially with regard to how WTRS uses neural nets cinematically. For those of you with a geeky curiosity and a somewhat lengthy attention span, here’s a brief (cough) description of how the system currently works:

The back end is written in Python. It uses YouTube’s API to search for clips that focus on personal narratives. Unfortunately, but not surprisingly, the API makes it easy for developers to find popular videos but difficult to find videos with few views or subscribers. So various “inverse keyword” strategies are used to generate lists of likely unpopular clips; the lists are then filtered down to clips with low view and subscriber counts. The Robot then uses a few different traditional and neural net-based computer vision algorithms, developed with OpenCV and TensorFlow, to make clip selection decisions (filtering out unsuitable clips) and editing decisions. (The automated editing itself is done with FFmpeg.)
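(For the geekier still, here is a minimal sketch of the low-view filtering step, written against the public YouTube Data API via google-api-python-client. The query string, API key, and view-count threshold are placeholder assumptions; the Robot’s actual “inverse keyword” strategies and cutoffs are not reproduced here.)

    from googleapiclient.discovery import build

    # Placeholder values; the Robot's real keyword strategies and cutoffs differ.
    API_KEY = "YOUR_YOUTUBE_DATA_API_KEY"
    MAX_VIEWS = 50

    youtube = build("youtube", "v3", developerKey=API_KEY)

    def find_obscure_videos(query, max_views=MAX_VIEWS):
        """Search YouTube, then keep only the clips almost nobody has watched."""
        search = youtube.search().list(
            q=query, part="id", type="video", maxResults=50
        ).execute()
        ids = [item["id"]["videoId"] for item in search.get("items", [])]
        if not ids:
            return []
        # View counts aren't included in search results; they require a second
        # call to the videos() endpoint with part="statistics".
        stats = youtube.videos().list(
            part="statistics", id=",".join(ids)
        ).execute()
        return [
            v["id"] for v in stats.get("items", [])
            if int(v["statistics"].get("viewCount", 0)) <= max_views
        ]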

An image classification algorithm based on a neural network pre-trained on ImageNet is currently used to organize the clips into sequences, creating an evolution that moves through clusters of related content — as the Robot “saw” it in its stream-of-(not) consciousness. Some of these clusters become scenes in the film and are introduced by intertitles: the Robot’s attempt to poetically interpret the topic of the scene — a topic dictated by the peculiar priorities of its AI (ImageNet) training. Haar cascade-based computer vision face and eye detection algorithms are used to generate the looping clips of people’s eyes that appear periodically throughout the stream. Images from clips that feature individual speakers (aka “talking heads”) are evaluated using Amazon Rekognition. The Robot uses the features available in Rekognition’s face analysis software, with the result that it captions interviewees in its documentary with titles like “Confused-Looking Female, age 22-34”. Since features like emotion, age, and gender are common in currently popular face analysis frameworks, the captions offer a glimpse into common ways robots, marketers, and others perceive us. (In cases where Rekognition’s predictions return a lukewarm probability value, the Robot adds a qualifier — e.g., “Ostensibly Confused-Looking Female.”)
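(The captioning step, roughly: a single Rekognition detect_faces call returns emotions, gender, and an age range, which can then be assembled into a title. The caption template, confidence threshold, and function name below are illustrative assumptions rather than the Robot’s actual code.)

    import boto3

    rekognition = boto3.client("rekognition")

    def caption_face(image_bytes, confidence_floor=85.0):
        """Assemble a caption like 'Confused-Looking Female, age 22-34'.
        The threshold and wording are assumptions for illustration."""
        resp = rekognition.detect_faces(
            Image={"Bytes": image_bytes}, Attributes=["ALL"]
        )
        if not resp["FaceDetails"]:
            return None
        face = resp["FaceDetails"][0]
        # Rekognition returns several candidate emotions with confidences;
        # keep the most confident one.
        emotion = max(face["Emotions"], key=lambda e: e["Confidence"])
        gender = face["Gender"]["Value"]
        age = face["AgeRange"]
        # Hedge the caption when the prediction is only lukewarm.
        qualifier = "" if emotion["Confidence"] >= confidence_floor else "Ostensibly "
        return (f"{qualifier}{emotion['Type'].title()}-Looking {gender}, "
                f"age {age['Low']}-{age['High']}")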

Neural net-informed sound design currently uses inaSpeechSegmenter to analyze videos for music and speech content, which the Robot then processes accordingly. (Amy Alexander has also developed a publicly available minor revision of inaSpeechSegmenter, inaSpeechSegmenterAJA, which allows for faster processing in production environments by providing options to analyze partial files and skip speaker gender analysis.)
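(For reference, basic usage of the stock inaSpeechSegmenter looks roughly like the sketch below; the filename is a placeholder, and the fork’s added options aren’t reproduced here.)

    from inaSpeechSegmenter import Segmenter

    # The stock segmenter labels each region of the soundtrack; the AJA fork
    # adds options for partial-file analysis and skipping gender detection.
    seg = Segmenter()

    # Returns a list of (label, start_sec, stop_sec) tuples, with labels such
    # as 'music', 'noise', 'noEnergy', and speech labels ('male'/'female').
    segments = seg("some_clip.mp4")

    music_time = sum(stop - start for label, start, stop in segments
                     if label == "music")
    speech_time = sum(stop - start for label, start, stop in segments
                      if label in ("male", "female", "speech"))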

The front end (graphics) and sound mix are generated using Max/MSP/Jitter. The graphics “engine” is based on visual performance software developed by Amy Alexander and Curt Miller for PIGS (Percussive Image Gestural System) performances. Thus hand-drawn and percussion-triggered gestures, performed and “recorded” (like a player piano roll) for the film, are at the root of the animations. On top of these basic animated gestures are layered numerous algorithms dealing with animation timing, image processing, sound, and so on, as well as additional (non-animated) video layers. The animation algorithms respond to various aspects of the sound and, in some cases, the visual content.
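(The Max/MSP/Jitter patch itself doesn’t reduce to a short listing, but the player-piano-roll idea, recorded gesture events replayed on schedule and modulated by the live sound analysis, can be sketched in Python. Everything below is an illustrative toy, not the actual engine.)

    from dataclasses import dataclass

    @dataclass
    class GestureEvent:
        time: float       # seconds from the start of the recorded "roll"
        stroke_id: int    # which hand-drawn / percussion-triggered stroke to replay
        intensity: float  # baseline weight of the stroke

    # A "roll" of gestures captured once, player-piano style.
    roll = [GestureEvent(0.0, 1, 0.4), GestureEvent(0.8, 2, 0.9), GestureEvent(1.5, 1, 0.6)]

    def trigger_stroke(stroke_id, weight):
        # Stand-in for the rendering layer that actually draws the stroke.
        print(f"stroke {stroke_id} drawn with weight {weight:.2f}")

    def play_roll(roll, now, audio_level, window=0.02):
        """Replay any recorded gesture whose time falls within the current frame,
        scaled by the current audio level -- a toy stand-in for the layered
        animation-timing and sound-responsive algorithms."""
        for event in roll:
            if abs(event.time - now) < window:
                trigger_stroke(event.stroke_id, event.intensity * (0.5 + audio_level))

    # e.g. called once per video frame with the current playback time and level:
    play_roll(roll, now=0.8, audio_level=0.3)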