In the US, first-run broadcasts of prime-time shows are watermarked with the broadcaster's logo in the lower-right corner. Handily enough, this logo reliably disappears during commercial breaks. This suggests a way to detect commercials: automatically identify such logos in the video stream, and treat the frames without a logo as commercial breaks.

Edge Detection

[logo]
[logo]

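We start with the Canny edge detection algorithm. In simplified form (the full algorithm also thins edges and applies two thresholds with hysteresis), it works as follows: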

  • Convert image to grayscale.
  • Smooth the image to get rid of noise (e.g., slightly blur everything).
  • Compute edge gradient values for each pixel. Intuitively, this is simply the contrast between a pixel and its neighbors (the difference in grey values). Edges will contrast with nearby pixels (high edge gradient value); non-edges (such as uniformly-colored regions) will not show much contrast (low edge gradient value).
  • Pick some threshold edge gradient value; declare all pixels with edge gradient values exceeding this threshold to be an "edge".
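
The four steps above can be sketched in plain Python. This is a minimal illustration of the simplified description only, not a full Canny implementation: it omits non-maximum suppression and hysteresis. The function names, the 3x3 mean filter, the grayscale weights, and the central-difference gradient are all assumptions chosen for brevity.

```python
def to_gray(rgb):
    """Convert rows of (r, g, b) tuples to rows of grey values (0-255)."""
    # Rec. 601 luma weights: an assumed choice, any grayscale conversion works.
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb]


def box_blur(gray):
    """Smooth with a 3x3 mean filter, clamping coordinates at the border."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += gray[yy][xx]
            out[y][x] = total / 9.0
    return out


def gradient_magnitude(gray):
    """Contrast between each pixel's neighbors (central differences)."""
    h, w = len(gray), len(gray[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = gray[y][min(x + 1, w - 1)] - gray[y][max(x - 1, 0)]
            gy = gray[min(y + 1, h - 1)][x] - gray[max(y - 1, 0)][x]
            mag[y][x] = (gx * gx + gy * gy) ** 0.5
    return mag


def edge_map(rgb, threshold):
    """Mark every pixel whose gradient value exceeds the threshold."""
    mag = gradient_magnitude(box_blur(to_gray(rgb)))
    return [[m > threshold for m in row] for row in mag]
```

For a frame whose left half is black and whose right half is white, `edge_map(frame, 100)` marks only the two columns straddling the boundary.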

In the sample images to the right, a 97th-percentile threshold was chosen to select for only the most intense edges (this is a much higher value than is conventionally used, because we are only interested in "strong" edges). Note how the edges between the actors and the snow are more intense than the edges between the logo and its background.
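
A percentile threshold like the one described here can be sketched as follows. The helper names and the nearest-rank percentile definition are assumptions; the input is taken to be a grid of per-pixel edge gradient values computed elsewhere.

```python
import math


def percentile_threshold(magnitudes, percentile=97.0):
    """Return the gradient value at the given percentile of all pixels."""
    flat = sorted(v for row in magnitudes for v in row)
    # Nearest-rank method: an assumed choice, other definitions exist.
    rank = max(1, math.ceil(percentile * len(flat) / 100.0))
    return flat[rank - 1]


def strongest_edges(magnitudes, percentile=97.0):
    """Keep only the pixels whose gradient exceeds the percentile cutoff."""
    cutoff = percentile_threshold(magnitudes, percentile)
    return [[v > cutoff for v in row] for row in magnitudes]
```

Because the cutoff is derived from the image itself, roughly the same fraction of pixels is kept in every frame, regardless of how strong or weak the edges are in absolute terms.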

[logo]
[logo]

When processing a whole image for edges, the Canny algorithm can "fail" in the context of television logo detection, even when a logo is present. With a percentile threshold, every pixel's edge intensity is effectively ranked against all the others, and only the strongest are kept. Many intense scene edges can therefore take precedence over weaker logo edges (especially those of faint, partially transparent logos), leaving the logo invisible to the edge detector.

In the example images to the left, the text has used up all of the image's "edge quota", leaving none above the requisite threshold for the logo's edges (see bottom frame). Simply relaxing the threshold edge gradient value won't necessarily help; the result is just as likely to include more text edges as it is to include more logo edges.

[logo]

The obvious optimization (and correction) strategy is to restrict edge-detection processing to the areas of the screen believed to contain the logo (in this case, the lower right corner); fewer pixels need be processed for edge detection, and hopefully fewer scene edges will interfere with detection of logo edges. In the image to the right, the grayscale image has been overlaid with the results of edge-detection on only the indicated corner. Only some of the scene edges are marked, leaving the logo edges to be marked more intensely than before.
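
Restricting the processing to one corner amounts to cropping the frame before edge detection. A minimal sketch follows; the function names and the one-quarter `fraction` default are hypothetical, chosen only for illustration.

```python
def lower_right_region(width, height, fraction=0.25):
    """Return (left, top, right, bottom) for the lower-right corner.

    `fraction` is the share of each dimension covered by the corner;
    the 0.25 default is an assumed value, not one taken from the text.
    """
    left = width - int(width * fraction)
    top = height - int(height * fraction)
    return left, top, width, height


def crop(image, box):
    """Cut a (left, top, right, bottom) box out of a row-major image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]
```

Any threshold computed on the cropped region competes only against the edges inside that region, which is why the logo edges stand out more strongly there.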

The sensitivity of the Canny edge detection algorithm to "false" (non-logo) edges implies a problem of correctly pruning the search space in a general way. Different broadcast networks utilize logos of different sizes and locations. Furthermore, even the high-level location can differ. Logos in US broadcasts tend to appear in the lower right (and sometimes lower left), but logos in UK broadcasts tend to appear in the upper corners.

Logo Area Identification

[logo]
[logo]
[logo]

Correctly identifying the search area is essentially the same as identifying the logo. One strategy for doing so exploits television logos' tendency to always appear in the same location throughout a broadcast. If the logo remains stationary, its edges will also remain stationary (see right).

Real-world problems begin to intrude here. One common phenomenon is pillarboxing or letterboxing, where black or grey bars appear on the screen to present some content in an aspect ratio that differs from that of the screen. For example, widescreen content is often letterboxed onto a "standard" screen; conversely, "standard" content is often pillarboxed when shown on a "wide" screen. These black or grey bars will produce edges for the Canny edge detector, which will in turn hinder efforts to identify logos based on their edges.

Prior to Canny edge detection, any such letterboxing or pillarboxing borders must first be identified. This is accomplished in much the same way that a human might infer the presence of borders: the image is scanned from top to bottom, bottom to top, left to right, and right to left, looking for the first "non-matching" (non-black, non-grey, etc.) pixel and using that location as a border for that edge.
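
The border scan might look like the sketch below. It is a row-and-column variant of the idea: it walks inward from each side and stops at the first row or column containing a pixel that does not match the bar color. The function name, the `tolerance` value, and the use of the top-left pixel as the reference bar color are all assumptions.

```python
def find_content_box(gray, tolerance=16):
    """Return (left, top, right, bottom) of the area inside uniform bars.

    `gray` is a row-major grid of grey values. Pixels within `tolerance`
    of the bar color count as "matching" the bar.
    """
    h, w = len(gray), len(gray[0])
    bar = gray[0][0]  # assumption: the top-left pixel shows the bar color

    def row_is_bar(y):
        return all(abs(v - bar) <= tolerance for v in gray[y])

    def col_is_bar(x):
        return all(abs(gray[y][x] - bar) <= tolerance for y in range(h))

    top = 0
    while top < h and row_is_bar(top):
        top += 1
    if top == h:
        # Entirely uniform frame: report no borders rather than an empty box.
        return 0, 0, w, h
    bottom = h
    while bottom > top and row_is_bar(bottom - 1):
        bottom -= 1
    left = 0
    while left < w and col_is_bar(left):
        left += 1
    right = w
    while right > left and col_is_bar(right - 1):
        right -= 1
    return left, top, right, bottom
```

Edge detection would then be run only on the pixels inside the returned box, so the bar boundaries never produce edges.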

We can construct a set of counters where each counter corresponds to a single pixel of the frame; each counter is initialized to zero. Immediately after running the Canny edge-detection algorithm on an image, each "edge" pixel is awarded one point. After some number of images, the pixels corresponding to statically-positioned logo edges should have consistently higher scores than other non-logo pixels. The result of this computation can itself be represented as an "edge-count" image (with scores scaled to greyscale values between black and white) where bright pixels represent high scores and dim pixels represent low scores.
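
The counters can be sketched as a grid of integers updated once per processed frame. The function names are hypothetical, and each `edges` argument is assumed to be a grid of booleans produced by whatever edge detector is in use.

```python
def new_counters(width, height):
    """One counter per pixel, all initialized to zero."""
    return [[0] * width for _ in range(height)]


def accumulate_edges(counts, edges):
    """Award one point to every pixel marked as an edge in this frame."""
    for y, row in enumerate(edges):
        for x, is_edge in enumerate(row):
            if is_edge:
                counts[y][x] += 1
    return counts


def to_greyscale(counts):
    """Scale scores to 0-255 so the highest score becomes white."""
    peak = max(max(row) for row in counts)
    if peak == 0:
        return [[0] * len(row) for row in counts]
    return [[round(255 * c / peak) for c in row] for row in counts]
```

A pixel that is an edge in every sampled frame ends up white in the edge-count image, while a pixel that was an edge only occasionally stays dim.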

The top image illustrates the sensitivity of the Canny edge detection algorithm to pillarboxing; note the vertical line through the logo. The second image shows the full edge-count image with borders accounted for, highlighting the logo itself and its location in the whole frame; note the lack of pillarboxing edges.

Finally, the edge-count image is thresholded to eliminate noise, and a minimal bounding box computed. This bounding box represents the search area for finding the channel logo in each frame. Given such exact size and position information, one can see how this kind of logo identification can be made robust against the text-heavy images illustrated earlier; the text will likely have been excluded from the search area, so its edges will not be considered.
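
Thresholding the edge counts and computing the minimal bounding box can be sketched in a few lines. The function name and the `min_count` parameter are hypothetical; how to pick `min_count` is left open here.

```python
def bounding_box(counts, min_count):
    """Return the smallest (left, top, right, bottom) box that contains
    every pixel scoring at least `min_count`, or None if there is none.
    `right` and `bottom` are exclusive.
    """
    xs, ys = [], []
    for y, row in enumerate(counts):
        for x, score in enumerate(row):
            if score >= min_count:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs) + 1, max(ys) + 1
```

Low-scoring pixels (noise from scene edges that happened to recur a few times) fall below `min_count` and do not enlarge the box.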

Logo Template Matching

The logo-detection strategy for commercial detection is implemented in two phases.

Logo Identification

The first phase is a partial scan that covers only some initial portion of the stream. It is meant simply to collect enough data to identify the logo and generate a good template for the next phase; it does not need to sample every consecutive frame. In fact, because of variations in broadcast schedule (it is hard to tell exactly where in the stream the broadcast begins), it is better to skip frames during this sampling phase (e.g., taking one frame every 10 seconds or so), to be sure to collect a decent weighting of logo edges and to avoid the bad luck of picking a sequence of logo-less commercial frames.
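
Choosing which frames to sample in this first phase could look like the sketch below. The function name, the `max_samples` cap, and the exact 10-second default are illustrative assumptions.

```python
def sample_frame_indices(total_frames, fps, interval_seconds=10.0,
                         max_samples=None):
    """Return the indices of the frames to examine during logo
    identification, one frame per `interval_seconds` of video.
    """
    step = max(1, round(fps * interval_seconds))
    indices = list(range(0, total_frames, step))
    if max_samples is not None:
        indices = indices[:max_samples]
    return indices
```

Spacing the samples out means that a single commercial break can contribute at most a handful of logo-less frames to the edge counts.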

This first phase is meant to yield the logo and its location. In the first phase, the selected frames are processed for border identification, edge detection of the in-border area, and then edge counting. At the very end, the region of interest (logo position and size) is determined.

Logo Matching

The second phase is a complete scan: every frame is processed for edge detection of the region of interest and closeness of the region's edges to the template edges. If a sufficient number of pixels match, the frame can be declared as containing a logo (i.e., a non-commercial). This second phase yields a sequence of values representing the "commercial-ness" of each frame. Heuristics are then applied to filter out the "noise". For example, extremely short "commercial breaks" can be ignored, presumed to be failures of the Canny edge-detection or template-matching steps. Any breaks longer than some threshold amount of time can be assumed to be "real".
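
A minimal sketch of the matching and filtering steps follows. It assumes that "closeness" means counting the pixels that are edges in both the frame's region of interest and the template, and that a break is "real" once it lasts at least `min_break_seconds`; the function names and the 60-second default are hypothetical.

```python
def count_matches(edges, template):
    """Count pixels that are edges in both the frame region and template."""
    return sum(1
               for edge_row, tmpl_row in zip(edges, template)
               for e, t in zip(edge_row, tmpl_row)
               if e and t)


def find_breaks(logo_flags, fps, min_break_seconds=60.0):
    """Return (start, end) frame ranges of logo-less runs that last at
    least `min_break_seconds`; shorter runs are ignored as noise.
    `end` is exclusive.
    """
    min_frames = int(fps * min_break_seconds)
    breaks, start = [], None
    # A trailing True acts as a sentinel that closes a run at end of stream.
    for i, has_logo in enumerate(list(logo_flags) + [True]):
        if not has_logo and start is None:
            start = i
        elif has_logo and start is not None:
            if i - start >= min_frames:
                breaks.append((start, i))
            start = None
    return breaks
```

Here `logo_flags` would hold one boolean per frame, for example `count_matches(region_edges, template) >= min_matches` for some chosen `min_matches`.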

There is a problem in deciding on a "sufficient" number of matching pixels. If the number of edge pixels matching the template is plotted against its frequency of occurrence, a bimodal distribution is often (but not always) observed. Some example templates for some sample recordings, and their frequency histograms, follow. Unless otherwise noted, all signals are from over-the-air broadcast HDTV (a clean digital signal), with the content transcoded down to resolutions between 640x480 and 856x480. Results for uncompressed HDTV are expected to follow similar trends, differing only in the absolute numbers of pixels.

First is a sample graph illustrating the number of matching template edge pixels over time. It so happens that the frames having a close match (the peaks) correspond to content, while the frames with a low match (the valleys) correspond to commercial breaks:

[graph]

The remaining problem is then to accurately determine the threshold value separating the matching frames from the non-matching frames, across a wide variety of content. This is approached by plotting histograms of "match counts" against their frequency of occurrence.

First are some very simple example templates, which have very simple edge-matching histograms. Note the high spike at the far right of the graphs, and the other at the far left, representing the fact that each frame either obviously matches or doesn't match the template:

[logo] [histogram] [a]
[logo] [histogram] [b]
[logo] [histogram] [c]

Next are some more complex templates. There is still a spike to the right of the graph, but there is a smoother decrease for non-matches, reflecting an increased amount of ambiguity. The fraction of frames "definitely" matching the template is much lower than in graphs [a]-[c]. Finally, a second, lower mode starts to appear, representing frames that have a decent number of coincidentally-appearing edges in the same place as the candidate logo:

[logo] [histogram] [d]
[logo] [histogram] [e]

Following are some even more complex matches. The drop-off from right to left is even shallower, the absolute maximum value in these graphs is even lower than those of the preceding graphs [a]-[e], and the second mode is even more pronounced:

[logo] [histogram] [f]
[logo] [histogram] [g]

The histogram for [g] is interesting because while all the other broadcast stations have principally bi-modal distributions, the station represented by sample [g] consistently broadcasts content with a tri-modal distribution of frames matching the logo.

Next are some analog recordings. The histogram is still bi-modal, but the distribution is more uniform (note that the Y-axes have a much smaller range of values than in the above graphs), and the ambiguity is even more pronounced (as measured by the smaller maximum fraction of frames matching the template, and by the very pronounced second mode):

[logo] [histogram] [h]
[logo] [histogram] [i]

Given the evidence of the tri-modal distribution (sample [g], above), it follows that the value at the end of the first mode is the best threshold value to use for separating matching frames from non-matching frames.
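
One possible reading of "the end of the first mode" can be sketched as follows. The sketch assumes that the first mode is the peak at the low-match end of the histogram, and that its end is the valley reached before the counts start rising again. Both of these are assumptions, as is the function name. Real histograms are noisy and would likely need smoothing or coarser bins first.

```python
def threshold_after_first_mode(histogram):
    """Return the bin index where the first (lowest-match) mode ends.

    `histogram[i]` is the number of frames whose match count falls in
    bin i. Frames in bins above the returned index would be treated as
    matching the template.
    """
    n = len(histogram)
    i = 0
    # Climb to the top of the first mode.
    while i + 1 < n and histogram[i + 1] >= histogram[i]:
        i += 1
    # Descend until the counts start rising again (the end of that mode).
    while i + 1 < n and histogram[i + 1] <= histogram[i]:
        i += 1
    return i
```

With a tri-modal histogram, this choice places the middle mode on the "matching" side of the threshold along with the highest mode.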

Successes

Logo detection works very well on its own in some cases (first-run episodes on major US broadcast networks):
  • NBC "Law & Order" series
  • CBS "CSI: Crime Scene Investigation" series
  • FOX "24"

Failures

The logo-detection strategy described for detecting broadcast commercials is expected to fail under any of the following conditions:

  • No broadcast logo present ("24" reruns).
  • Broadcast logo changes position during the video stream. This could be addressed by searching the entire frame for the logo template, rather than only the fixed area, but doing so would greatly increase processing time. It could also introduce more false positives. For example, television stations often cross-promote their other television shows during commercial breaks, and their broadcast logo will appear in a different area of the screen than during the main broadcast.
  • Broadcast logo varies during video stream. For example, many broadcasts return from a commercial break with one version of the logo with some kind of 30-second "ticker-tape" message, usually with some smaller advertisement, after which the "normal" logo is displayed. These logos are obviously similar to the human eye, but differ enough to confuse the logo detector (well, my logo detector).
  • Pathological threshold values. A very faint or extremely transparent logo will not have easily discernible edges. In the first logo-identification phase, the Canny algorithm would be likely to pass over the logo's edges in favor of edges in the main content.
  • Broadcast logo does not perfectly correspond to commercials and non-commercials:
    • Some television stations show the logo during commercials (syndicated content).
    • Some stations remove the logo during portions of the broadcast (WB cartoons, NBC shows).
    • Some shows display or remove the logo a few seconds before or after the commercial break starts or ends ("Bones" on FOX).

Identifying and handling the above failure conditions is a separate problem and outside the scope of simple logo detection.

Last updated: Wed Mar 11 21:57:16 PDT 2009