How

“What the Robot Saw” originally launched in early 2020. After a five-year run, it was revised and relaunched in a somewhat more contemplative form in early 2025. Here are some of the techier details.

The back end is written in Python. It uses YouTube’s API to search for clips that focus on personal narratives. Unfortunately, but not surprisingly, the API makes it easy for developers to find popular videos but difficult to find videos with few views or subscribers. So various “inverse keyword” strategies are used to generate lists of likely unpopular clips; the lists are then filtered down to those with low view and subscriber counts. The Robot then uses several traditional and neural-net-based computer vision algorithms, developed with OpenCV and TensorFlow, to make clip selection (filtering out unsuitable clips) and editing decisions. (The automated editing is done with FFmpeg.)
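For the search-then-filter step, a minimal sketch of the pattern might look like the following (assuming the official google-api-python-client; the API key, query, and view-count threshold are placeholders, and the Robot’s actual “inverse keyword” strategies and subscriber-count filtering are not shown):

```python
# Sketch of the low-popularity search idea: search first, then keep only
# videos whose statistics show very few views.
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"   # placeholder
MAX_VIEWS = 50             # hypothetical threshold for "unpopular"

youtube = build("youtube", "v3", developerKey=API_KEY)

def candidate_video_ids(query):
    """Search for recent videos matching an 'inverse keyword' query."""
    response = youtube.search().list(
        q=query, part="id", type="video", order="date", maxResults=50
    ).execute()
    return [item["id"]["videoId"] for item in response.get("items", [])]

def low_view_videos(video_ids):
    """Keep only videos whose view count falls under the threshold."""
    response = youtube.videos().list(
        part="statistics", id=",".join(video_ids)
    ).execute()
    return [
        item["id"]
        for item in response.get("items", [])
        if int(item["statistics"].get("viewCount", 0)) <= MAX_VIEWS
    ]

ids = candidate_video_ids("my first vlog")   # hypothetical query
print(low_view_videos(ids))
```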
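And a hypothetical automated cut, assuming the ffmpeg binary is on the PATH; the cut points here are placeholders standing in for values the computer vision analysis would supply:

```python
# Sketch of an automated edit: trim a segment out of a downloaded clip.
import subprocess

def trim_clip(src, dst, start_sec, duration_sec):
    """Cut a segment out of a clip with FFmpeg, stream-copying the codecs."""
    subprocess.run([
        "ffmpeg", "-y", "-ss", str(start_sec), "-i", src,
        "-t", str(duration_sec), "-c", "copy", dst,
    ], check=True)

trim_clip("clip_0001.mp4", "clip_0001_cut.mp4", 12.0, 8.5)  # placeholder cut points
```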

An image classification algorithm based on a neural network pre-trained on ImageNet is currently used to organize the clips into sequences, creating an evolution that moves through clusters of related content, as the Robot “saw” it in its stream-of-(not) consciousness. Some of these clusters become scenes in the film and are introduced by intertitles: the Robot’s attempt to poetically interpret the topic of the scene, a topic dictated by the peculiar priorities of its AI (ImageNet) training. Both MediaPipe- and Haar-cascade-based face and eye detection algorithms are used to generate the looping clips of people’s eyes which appear periodically throughout the stream. Images from clips that feature individual speakers (aka “talking heads”) are evaluated using Amazon Rekognition. The Robot uses the features available in Rekognition’s face analysis, captioning interviewees in its documentary with titles like “Confused-Looking Female, age 22-34”. Since features like emotion, age, and gender are common in currently popular face analysis frameworks, the captions offer a glimpse into the common ways robots, marketers, and others perceive us. (In cases where Rekognition’s predictions carry only a lukewarm probability value, the Robot adds a qualifier, e.g., “Ostensibly Confused-Looking Female.”)
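As an illustration of letting ImageNet’s categories drive the clustering (not the Robot’s actual sequencing code), one representative frame per clip could be classified and clips grouped by their top label, assuming TensorFlow/Keras and OpenCV:

```python
# Minimal sketch: group clips into "scenes" by the ImageNet label of one frame.
from collections import defaultdict
import cv2
import tensorflow as tf

model = tf.keras.applications.MobileNetV2(weights="imagenet")

def top_label(frame_bgr):
    """Return the top ImageNet label for a single BGR video frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    img = tf.image.resize(rgb, (224, 224))
    img = tf.keras.applications.mobilenet_v2.preprocess_input(img[tf.newaxis, ...])
    preds = model.predict(img, verbose=0)
    _, label, _ = tf.keras.applications.mobilenet_v2.decode_predictions(preds, top=1)[0][0]
    return label

def cluster_by_label(frames_by_clip):
    """Group clip ids by the ImageNet label of their representative frame."""
    scenes = defaultdict(list)
    for clip_id, frame in frames_by_clip.items():
        scenes[top_label(frame)].append(clip_id)
    return scenes
```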
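The Haar-cascade branch of the eye detection can be sketched with the cascade files that ship with OpenCV; the MediaPipe branch and the assembly of the looping eye clips are omitted:

```python
# Sketch: find eye rectangles inside detected faces using OpenCV's bundled cascades.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

def eye_regions(frame):
    """Return (x, y, w, h) eye rectangles found inside detected faces."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    eyes = []
    for (fx, fy, fw, fh) in face_cascade.detectMultiScale(gray, 1.3, 5):
        roi = gray[fy:fy + fh, fx:fx + fw]
        for (ex, ey, ew, eh) in eye_cascade.detectMultiScale(roi):
            eyes.append((fx + ex, fy + ey, ew, eh))
    return eyes
```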
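A sketch of the captioning idea, assuming boto3 and a frame already encoded as JPEG bytes; the caption template and the “lukewarm” confidence threshold are illustrative, not the Robot’s actual wording rules:

```python
# Sketch: build a Robot-style caption from Rekognition's face analysis attributes.
import boto3

rekognition = boto3.client("rekognition")
QUALIFIER_BELOW = 85.0   # hypothetical "lukewarm" confidence threshold

def caption_face(jpeg_bytes):
    """Return a caption like 'Confused-Looking Female, age 22-34', or None."""
    response = rekognition.detect_faces(
        Image={"Bytes": jpeg_bytes}, Attributes=["ALL"])
    if not response["FaceDetails"]:
        return None
    face = response["FaceDetails"][0]
    emotion = max(face["Emotions"], key=lambda e: e["Confidence"])
    gender = face["Gender"]
    age = face["AgeRange"]
    caption = "{}-Looking {}, age {}-{}".format(
        emotion["Type"].capitalize(), gender["Value"], age["Low"], age["High"])
    if min(emotion["Confidence"], gender["Confidence"]) < QUALIFIER_BELOW:
        caption = "Ostensibly " + caption
    return caption
```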

Neural-net-informed sound design currently uses inaSpeechSegmenter to analyze each video’s audio for music and speech content, which the Robot then processes accordingly.
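A minimal sketch of that segmentation step, assuming the inaSpeechSegmenter package; what the Robot then does with each kind of segment is not shown:

```python
# Sketch: split a clip's audio timeline into music and speech spans.
# The segmenter yields (label, start_sec, end_sec) tuples with labels such as
# 'music', 'male'/'female' speech, 'noise', and 'noEnergy'.
from inaSpeechSegmenter import Segmenter

segmenter = Segmenter()

def music_and_speech_spans(media_path):
    segments = segmenter(media_path)
    music = [(start, end) for label, start, end in segments if label == "music"]
    speech = [(start, end) for label, start, end in segments
              if label in ("male", "female", "speech")]
    return music, speech
```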

The front end (graphics) and sound mix are generated using Max/MSP/Jitter. The graphics “engine” is based on visual performance software developed by Amy Alexander and Curt Miller for PIGS (Percussive Image Gestural System) performances. So hand-drawn and percussion-triggered gestures, performed and “recorded” (like a player piano roll) for the film, are at the root of the animations and some of the timing. On top of these basic animated gestures are layered numerous algorithms dealing with other aspects of animation timing, as well as image processing, sound, etc. There are also additional, non-animated video layers that fade in and out (the eyes and the sporadically appearing static background videos). The animation algorithms respond to various aspects of the sound and, in some cases, the visual content. These decisions are based largely on the per-clip image and sound analyses described above. In other words, Max uses the results of all the Python-based computer vision and machine learning analysis to determine the sound mix and visual composition.
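One plausible hand-off from the Python analysis to Max, though not necessarily the one used here, is a per-clip JSON dictionary that Max can read (for example with its dict object); the field names below are illustrative only:

```python
# Sketch: write per-clip analysis results to JSON for the Max front end to load.
import json

def export_for_max(clip_analyses, path="robot_analysis.json"):
    with open(path, "w") as f:
        json.dump(clip_analyses, f, indent=2)

export_for_max({
    "clip_0001": {                               # hypothetical schema
        "scene_label": "lab coat",               # ImageNet-derived cluster label
        "caption": "Confused-Looking Female, age 22-34",
        "music_spans": [[0.0, 4.2]],
        "speech_spans": [[4.2, 12.8]],
    },
})
```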