Poster, presented at IWAR'99, San Francisco, October 20-21, 1999

A real-time panorama-based technique for annotation overlay on video frames

Masakatsu Kourogi, Takeshi Kurata, Katuhiko Sakaue, Yoichi Muraoka


In this paper, we propose a panorama-based annotation method that uses a panoramic image as the source of information about the positions of annotations on video frames. The proposed method estimates the affine parameters of the image registration between an input frame and the panoramic image, then maps the positions of the annotations from the panoramic image to the input frame and displays the input frame overlaid with those annotations. To allow the camera position to move, a set of panoramic images is prepared; we select the panoramic image that gives the smallest mean squared error of the image registration, and the selected panoramic image is switched appropriately as the camera moves around. Experimental results show that, on low-cost PCs, this method finds the image registration parameters and displays input frames overlaid with the annotations in near real-time.

  1. Overview of proposed method
    In advance, a set of panoramic images with annotative information is prepared.
    First, the proposed method searches for the panoramic image that includes the input frame and finds the affine transformation parameters for image registration between the frame and that panoramic image. To estimate the parameters, we use a fast and robust method [1] that is gradient-based and uses an M-estimator for robust estimation. This method requires the initial estimate to be relatively close to the true parameters, so we apply it as follows.
    1. We give multiple initial estimates so that the affine parameters to be estimated are sufficiently close to at least one of them. By taking the parameters that give the smallest MSE (mean squared error), we estimate the affine parameters for image registration between the frame and each panoramic image.
    2. We select the combination of the panoramic image and the affine parameters that gives the smallest MSE.
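    The two selection steps above can be sketched as an exhaustive search over (panoramic image, initial estimate) pairs. This is a minimal sketch: `estimate_affine` is a hypothetical stand-in for the gradient-based M-estimator registration of reference [1], assumed to return the refined parameters and the registration MSE.

```python
import numpy as np

def select_registration(frame, panoramas, initial_estimates, estimate_affine):
    """Try every (panorama, initial estimate) pair and keep the combination
    whose refined registration has the smallest MSE.

    `estimate_affine(frame, pano, init)` is an assumed interface standing in
    for the gradient-based M-estimator method of [1]; it returns
    (refined_params, mse)."""
    best = None  # (panorama index, affine params, mse)
    for p_idx, pano in enumerate(panoramas):
        for init in initial_estimates:
            params, mse = estimate_affine(frame, pano, init)
            if best is None or mse < best[2]:
                best = (p_idx, params, mse)
    return best
```

    Once `select_registration` has locked in a panorama, later frames would call the estimator once per frame, seeded with the previous result.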
    Once the panoramic image and the affine parameters are selected, subsequent estimation finds the affine parameters between the next input frame and the panoramic image, using the previous affine parameters as the initial estimate.
    Second, the method maps the positions of the annotations from the panoramic image to the input frame and displays the input frame overlaid with those annotations.
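    The mapping step amounts to applying the estimated affine transform to each annotation's panorama coordinates. A minimal sketch, assuming a six-parameter layout (a11, a12, a21, a22, tx, ty) that the poster does not spell out:

```python
def map_annotation(params, xy):
    """Map one annotation position from panorama coordinates to input-frame
    coordinates with a 2D affine transform.  The 6-tuple parameter layout
    (a11, a12, a21, a22, tx, ty) is an assumption, not taken from the poster."""
    a11, a12, a21, a22, tx, ty = params
    x, y = xy
    return (a11 * x + a12 * y + tx, a21 * x + a22 * y + ty)
```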
    Our approach using panoramic images has two advantages over other approaches.
    1. A panoramic image provides strong and robust clues to the location and orientation of video frames, since the clues do not depend on local features such as textures or corners.
    2. Compared with sensor-based approaches, it is easy for a content maker to place annotations on a panoramic image, since the objects to be annotated are visible in the image.

  2. Experimental Results

    The proposed method is implemented in software on a PC cluster consisting of four PCs (two dual Pentium II-450MHz and two dual Pentium III-500MHz machines, OS: Linux 2.2.9 with SMP support) connected by a 100 Mbps Ethernet LAN. We use a headset-type head-mounted display (HMD) and a small color CCD camera attached to the HMD, as shown at left. Video frames created with annotations are shown in Figure 2. The set of panoramic images used is shown in Figure 3.

    Figure 2. Created video frames with annotations.

    Figure 3. A set of panoramic images.

    The method could robustly establish image registration between video frames and the panoramic images even when the frames contained objects not present in the panoramic images, or moving objects. The processing time required to search for the panoramic image that includes the input frame and to perform image registration is 30 x N x P [msec], where N is the number of initial estimates and P is the number of panoramic images. In this experiment, N = 20 and P = 4. The throughput and delay of the processing are 100-120 [msec] per frame (8-10 [frames/s]) and 600 [msec], respectively.
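    Plugging the experimental values into the poster's cost model shows the up-front cost of the full search versus steady-state tracking:

```python
# The poster's cost model: the initial search tries every
# (initial estimate, panorama) pair at roughly 30 ms each.
N = 20                # number of initial estimates
P = 4                 # number of panoramic images
search_ms = 30 * N * P
print(search_ms)      # 2400 ms for the full initial search

# Once a panorama and its parameters are selected, only one estimate per
# frame is needed, consistent with the reported 100-120 ms throughput.
```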

  3. References
    1. M. Kourogi, T. Kurata, J. Hoshino and Y. Muraoka,``Real-time image mosaicing from a video sequence'', in Proc. of ICIP'99, 1999. (To appear)
    2. S. Feiner, B. MacIntyre and T. Hollerer, ``Wearing It Out: First Steps Toward Mobile Augmented Reality Systems'', in Mixed Reality -- Merging Real and Virtual Worlds, pp. 363-377, Ohmusha-Springer Verlag, 1999.

This material is posted here with permission of the IEEE. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes, or for creating new collective works for resale or redistribution, must be obtained from the IEEE. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.