or How Do We Represent Video as an Input to Our Various Algorithms?
What is video anyway? To oversimplify, video is a series of images shown one after another. There are many display and storage formats, but in essence they are all just a sequence of images. How fast the images are presented is measured in frames per second (fps). For example, television is usually shown at 30 fps. Each frame is one image or picture, so television shows us 30 images* per second, one after another. You can think of video as time-series data.
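To make the frames-per-second idea concrete, here is a minimal sketch of the arithmetic. The function names `frame_count` and `frame_timestamps` are made up for this illustration, and the 30 fps default is just the television figure mentioned above.

```python
def frame_count(duration_seconds, fps=30):
    """Number of whole frames shown in a clip of the given length."""
    return int(duration_seconds * fps)


def frame_timestamps(n_frames, fps=30):
    """Time (in seconds) at which each frame starts, counting from zero."""
    return [i / fps for i in range(n_frames)]


# A 2-second clip at 30 fps is a time series with 60 data points.
n = frame_count(2)
times = frame_timestamps(n)
print(n)         # number of frames in the clip
print(times[1])  # the second frame starts 1/30 of a second in
```

Seen this way, the frame index plays the role of the time axis, which is why video fits the time-series picture.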
Let's turn our attention to each frame (image) in the video. How many numbers do we need to represent an image? It depends on the size of the image. For illustration purposes, let's assume that our image is 160 pixels high and 120 pixels wide**. That means we have 19,200 pixels (160 x 120) to represent the image. If we have a color image/video, each pixel carries color information: which colors are present at that particular pixel of the image. Depending on how color was encoded, we could have 3 or 4 numbers per pixel to represent color (for example, red, green, and blue). If we have a gray scale image, each pixel only has intensity information: how dark or bright that particular pixel is. Thus, a gray scale image needs 19,200 numbers to represent it. If we treat video as time-series data, each data point is one frame, and each data point has 19,200 numbers associated with it. That is exactly how the FSL recognition system we implemented treats video data. We can do this because all the images (frames) in a video have the same size.
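The idea of one frame becoming 19,200 numbers can be sketched in a few lines. This is only an illustration and not the code of the FSL recognition system: the helper names (`make_blank_frame`, `flatten_frame`, `video_to_series`) are hypothetical, the frames are made-up uniform gray images, and the 0 to 255 intensity range is an assumption.

```python
HEIGHT, WIDTH = 160, 120  # frame size used in the text


def make_blank_frame(value=0):
    """Build a made-up gray scale frame: HEIGHT rows of WIDTH intensities."""
    return [[value] * WIDTH for _ in range(HEIGHT)]


def flatten_frame(frame):
    """Turn a grid of pixel intensities into one flat list, row by row."""
    return [pixel for row in frame for pixel in row]


def video_to_series(frames):
    """Convert a list of frames into a time series of equal-length vectors."""
    series = [flatten_frame(frame) for frame in frames]
    # This only works because every frame in a video has the same size.
    if len({len(vector) for vector in series}) > 1:
        raise ValueError("all frames must have the same size")
    return series


# Three synthetic frames: black, mid-gray, white (assumed 0-255 range).
video = [make_blank_frame(v) for v in (0, 128, 255)]
series = video_to_series(video)
print(len(series), len(series[0]))  # 3 19200
```

Each element of `series` is one data point in the time series, and every data point has the same length, so the frames can be compared position by position.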
* Broadcast TV is interlaced, so it's actually showing half-images (fields) rather than full frames. To keep the discussion simple, I'm ignoring that.
** It doesn't matter what storage format was used in the original video; at some point, it will have to be displayed on the screen, which has pixels.