Technical Report No. VSR-0X.YY

 

Annual Technical Report

For the Project

Digital Human Modeling and Virtual Reality for FCS

 

 

By

Ajay Gopinath, M.S. Candidate

Electrical and Computer Engineering

Clarkson University, Potsdam, NY 13699-5720

 

Dr. James J. Carroll, Associate Professor

Electrical and Computer Engineering

Clarkson University, Potsdam, NY 13699-5720

 

 

 

 

CONTRACT/PR NO. DAAE07-03-D-L003/0001

 

 

 

 

 

 

 

Abstract

 

 

This document describes the work performed within the Digital Human Design thrust of the Virtual Soldier Research (VSR) Program. The objective is to develop a system that obtains real-time tracking data of a subject from multiple synchronized infrared video streams and uses this data to drive gesture-based interaction with an avatar.

An inexpensive consumer off-the-shelf (COTS) infrared system is used for automatic acquisition of a human body model and for motion tracking. Acquired video frames are segmented through background subtraction. 3D voxel representations of the human body shape in each frame are computed in real time from foreground silhouettes. The infrared cameras are all calibrated, which means that their parameters (rotation and translation matrices, distortion model, focal length, image plane center, etc.) are all precomputed.

Voxels are created in the tracking area, and the projection of each voxel onto a given camera plane is precomputed and stored in a lookup table. At run time, each voxel is classified as either inside or outside by performing intersection tests with the corresponding foreground silhouettes. The voxels classified as outside are discarded, leaving behind a carved body model of the subject present in the tracking region.

Model acquisition starts with a simple body part localization procedure based on template fitting, which uses prior knowledge of average body part shapes and dimensions. The human body model consists of ellipsoids and cylinders and is described using a twist-based framework. Based on the inverse kinematic equations for joint angles, the centroid position of each ellipsoid is calculated using data from measurement points. A robust voxel labeling procedure is used to obtain quality measurements.

 

 

 

Chapter 1

 

1.1 Introduction

 

Human body modeling, or posture estimation, is the problem of constructing an accurate body model representation of a subject from video data and tracking the body, i.e., calculating the 3D coordinates of the body model as the subject moves in the tracking region. A reliable motion capture system would be valuable in many applications. In one class of applications, the extracted body model parameters are used directly to interact with a virtual world or to drive an animated avatar in computer graphics character animation. Such uses range from soldier training to physiotherapy. Another class of applications uses the extracted parameters to classify and recognize people, gestures or motions, as in surveillance systems, intelligent environments, content-based indexing of sports video footage, and advanced user interfaces for sign language translation, gesture-driven control, and gait or posture recognition [1-7]. Motion parameters can also be used for motion analysis in applications such as personalized sports training, choreography, or clinical studies of orthopedic patients.

In the past decade, considerable effort has gone into developing methods for reconstructing 3D models of human actions or tracking human/hand movements from multiple cameras [8-14]. These methods can be divided into two categories. The first category focuses on detailed model reconstruction or segmentation of humans in 3D. Although image acquisition is real time, the reconstruction or segmentation in this category is done offline. The second category involves real-time 3D human motion tracking without explicit or detailed 3D human body reconstruction. Almost all the methods in this category assume a generic 3D human model, and the tracking/model fitting is done by comparing the camera projections of the generic model with the silhouettes, edges or optical flow of the camera images. Though these methods work well in general, very few have tried to reconstruct humans in real time directly in the 3D domain. In view of this, we have built a system that can achieve real-time 3D reconstruction of the human body and its movements. We use 3D voxel reconstructions of the human body shape at each frame as input to model acquisition and tracking. This approach leads to simple and robust algorithms that take advantage of the unique qualities of voxel data. The price is an additional preprocessing step in which the 3D voxel reconstructions are computed from the image data.

Infrared (IR) cameras are used in our system because tracking is performed on a subject inside a Virtual Reality (VR) cube. Inside the cube, background images are constantly changing depending on the immersive environment used. Hence a stable background model cannot be created for silhouette extraction. With IR cameras, the images on the VR cube are not seen since the only objects visible to the cameras are the ones that reflect IR light from the IR sources placed in the cube (Appendix B shows the lab setup). This results in a stable background environment on which robust silhouette (foreground) extraction can be performed.

We first calibrate the IR cameras and store the parameters obtained from calibration. Chapter 2 describes the camera calibration techniques used, and Appendix A gives a detailed description of the calibration procedure and its results. The internal parameters of a camera do not change unless the focal length or the lens filter is changed. The external parameters change if the camera’s position or orientation is changed. Once the cameras are calibrated, the images grabbed from them are segmented into background and foreground pixels. Chapter 3 describes the silhouette extraction techniques used, along with the thresholding algorithm for image segmentation. Using the silhouettes and the corresponding external parameters of each camera, voxel carving is performed. Chapter 4 describes the methods used to perform voxel carving. 3D templates are then fitted onto the voxel body model; Chapter 5 describes the procedures used to perform template matching. Once the templates are fitted to the voxel model, a twist-based human body model is initialized and tracked based on 23 measurement points, as described in Chapter 6. Appendix B describes the laboratory setup and lists the hardware and software used. Appendix C has the anthropometric data used to initialize template fitting.

 

1.2 Related Work                                              

 

            Currently available commercial systems for motion capture require the subject to wear special markers, body suits or gloves. In the past few years, the problem of marker-less, unconstrained posture estimation using only cameras has received much attention from computer vision researchers.

Many existing posture estimation systems require manual initialization of the model and then perform tracking. Very few approaches exist where the model is acquired automatically. Algorithms have been developed that take input from one or multiple cameras [8, 11]. A system for tracking a 3D articulated model of a hand, based on a layered template representation of self-occlusion, has been developed by Rehg and Kanade [15].

Kakadiaris and Metaxas [11] have developed a system for 3D human body model acquisition and tracking using three cameras placed in a mutually orthogonal configuration. The person under observation is requested to perform a set of movements according to a protocol that incrementally reveals the structure of the human body. Once the model has been acquired, tracking is performed using a physics-based framework.

Cheung and Kanade [8] use an algorithm called the Sparse Pixel Occupancy Test (SPOT) for performing silhouette-voxel intersection tests. Instead of testing all the pixels included in a voxel projection on the image plane, a uniformly distributed set of pixels within the bounding box of the voxel projection is chosen, and if at least a given fraction of these pixels are silhouette pixels, the voxel is classified as belonging to the foreground object. One big advantage of SPOT is that it is faster, since not all pixels are tested. The key question in using SPOT is how many pixels to test: the total number of tests should be small while maintaining a low probability of misclassifying voxels. To answer this question, they consider two error measures. The first is False Acceptance (FA), in which an outside voxel is falsely classified as an inside voxel. The second is False Rejection (FR), in which an inside voxel is falsely classified as an outside voxel. Graphs of FA and FR against the number of pixels tested are plotted, and based on these graphs an optimum value for the number of pixels to be tested is chosen.
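The sparse test described above can be illustrated with a short Python sketch. It assumes NumPy, a binary silhouette image, and uniform random sampling inside the bounding box; the function name spot_test and the default values for n_samples and min_fraction are hypothetical and are not taken from [8].

```python
import numpy as np

def spot_test(silhouette, box, n_samples=16, min_fraction=0.5, rng=None):
    """Sparse occupancy test for one voxel in one camera view.

    silhouette: 2D array of 0/1 values (1 = foreground pixel).
    box: (x_min, y_min, x_max, y_max), inclusive pixel bounds of the
         bounding box of the voxel projection.
    Returns True if at least min_fraction of the sampled pixels are foreground.
    """
    rng = np.random.default_rng() if rng is None else rng
    x_min, y_min, x_max, y_max = box
    # Sample pixel positions uniformly inside the bounding box
    # (the upper bound of rng.integers is exclusive, hence the +1).
    xs = rng.integers(x_min, x_max + 1, size=n_samples)
    ys = rng.integers(y_min, y_max + 1, size=n_samples)
    hits = int(silhouette[ys, xs].sum())
    return hits >= min_fraction * n_samples
```

Raising n_samples lowers the chance of false acceptance and false rejection at the cost of more pixel lookups, which is the trade-off the FA and FR graphs are used to settle.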

Cheung and Kanade use a step for internal voxel removal where after the voxel based 3D reconstruction is performed, each voxel is tested and only the surface voxels are extracted removing all internal voxels. This is useful in many applications where we may not be interested in the whole volume of the reconstructed data. Much CPU time can be saved by displaying only the surface voxels.
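A minimal Python sketch of internal voxel removal follows. It assumes NumPy, a boolean occupancy grid, and that a voxel counts as internal when all six face neighbours are occupied; this neighbourhood criterion and the function name surface_voxels are assumptions for illustration, since the exact test used in [8] is not given here.

```python
import numpy as np

def surface_voxels(occupied):
    """Keep only occupied voxels that have at least one empty face neighbour."""
    occ = occupied.astype(bool)
    # Pad with empty voxels so that voxels on the grid border count as surface.
    padded = np.pad(occ, 1, constant_values=False)
    interior = (
        padded[:-2, 1:-1, 1:-1] & padded[2:, 1:-1, 1:-1]
        & padded[1:-1, :-2, 1:-1] & padded[1:-1, 2:, 1:-1]
        & padded[1:-1, 1:-1, :-2] & padded[1:-1, 1:-1, 2:]
    )
    return occ & ~interior
```

For a solid 3 x 3 x 3 block, only the centre voxel is removed, so 26 of the 27 voxels remain to be displayed.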

In the algorithms that work with the data in the image planes, the 3D body model is repeatedly projected onto the image planes to be compared against the extracted image features. A simplified camera model is often used to enable efficient model projection [16]. Another problem in working with the image plane data is that different body parts appear in different sizes and may be occluded depending on the relative position of the body to the camera and on the body pose.

When using 3D voxel reconstructions as input to the motion capture system, very detailed camera models can be used, since the computations that use the camera model can be done off-line and stored in lookup tables. The voxel data is in the same 3D space as the body model; therefore, there is no need for repeated projections of the model to the image space. Also, since the voxel reconstruction is of the same dimensions as the subject, the sizes of different body parts are stable and do not depend on the subject’s position and pose.

Szeliski et al. [17] use a different approach to create 3D models. In their scenario, an object rotates on a turntable in front of a static camera and an octree model is generated from multiple frames. Chien and Aggarwal [18] constructed an octree from three orthographic projections: an octree of the conic volume formed by the silhouette and the center of projection is computed for each viewpoint, and the octrees from all of the viewpoints are then intersected. Szeliski builds a coarse 3D octree description first and then refines the 3D model with a sophisticated shape-from-motion algorithm that relies on detailed analysis of optic flow. The 3D projection operations, distance transform computation and bounding box intersection tests are easily parallelizable and can thus take advantage of massively parallel architectures.

Seitz and Dyer [19] have proposed an algorithm which computes a set of occupied voxels that is consistent with a large number of observed images. This algorithm makes a single pass through voxel space, first computing the visibility of each voxel and then its color. Their algorithm is based on the fact that each camera must agree on the color of an opaque voxel, but only when that voxel is visible from that camera.

De Bonet et al. [20] introduce a concept of “Poxels”: Probabilistic Voxel Reconstruction. Their probabilistic voxel approach addresses the assumption made in most direct 3D reconstruction methods that a pixel is either completely transparent or completely opaque. A 3D space is reconstructed by optimizing over the probability that each voxel is visible in each of the projections. An iterative algorithm is used to find the optimal probability distribution which jointly explains all the observed projections. A multi-step procedure which alternates between estimation of the colors and estimation of the opacities is used. Since this approach formulates the problem as one of optimization over the probability distribution of the visibility of each region of space, uncertainty due to lack of data, or perhaps contradictory data can be captured as well. It also tries to solve the transparent voxel coloring problem.

Pavlovic et al. [38] use a Dynamic Bayesian Network to model a complex dynamic system with multiple linear models that are indexed by a switching variable. The network is used as a multi-hypothesis predictor that initializes multiple template searches in the image space. The templates for body parts are initialized manually.

Bregler et al. and Yamamoto et al. [21, 22] use systems of linear constraint equations, created by combining articulated-body models with dense optical flow. Yamamoto maintains the constraints between limbs by sequentially estimating the motion of each parent limb, adjusting the hypothesized position of the child limb, and then estimating the further motion of the child limb. In contrast, Bregler takes full advantage of the information provided by child limbs to further constrain the estimated motions of the parents. Both Yamamoto and Bregler use a first-order Taylor series approximation to the camera/body rotation matrix to reduce the number of parameters used to represent this matrix. Furthermore, both use an articulated model to generate the depth values needed to linearize the mapping from 3D body motions to observed 2D camera-plane motions.

 

 

 

 

 

Chapter 2

 

2.1 Introduction to Camera Calibration

 

Camera calibration in the context of three-dimensional machine vision is the process of determining the internal camera geometric and optical characteristics (internal parameters) and/or the 3-D position and orientation of the camera frame relative to a certain world coordinate system (external parameters), for inferring 3D information concerning the location of the object, target, or feature from computer image coordinates. There are two kinds of 3D information to be inferred. They are different mainly because of the difference in applications.

The first is 3D information concerning the location of the object, target or feature. For simplicity, if the object is a point feature, camera calibration provides a way of determining a ray in 3D space that the object point must lie on, given the computer image coordinates. With two views from different cameras, the position of the object point can be determined by intersecting the two rays. Both extrinsic and intrinsic parameters need to be calibrated. In our application, the camera calibration needs to be done only once.

The second kind is 3D information concerning the position and orientation of a camera relative to the target world coordinate system.

Tsai’s seminal camera calibration paper [23] is widely cited for camera calibration. Tsai’s theoretical model of the imaging process includes lens distortion and uses perspective rather than parallel projection. The calibration does not require a high-dimensional nonlinear search. Even when the external parameters need repeated calibration, this approach leaves enough room for a high-speed implementation.

Approaches that make use of a full scale nonlinear optimization can adapt to accurate yet complex imaging models, but they require a good initial guess to start the nonlinear search. Faig’s technique [24] is a good representative for these techniques. It uses a very elaborate model for imaging using at least 17 unknowns for each image and is computer-intensive.  Sobel [25] described a system for calibrating a camera using nonlinear equation solving. Eighteen parameters need to be optimized and the approach is similar to Faig’s method. Gennery [26] described a method that finds camera parameters iteratively by minimizing the error of epipolar constraints without using 3D coordinates of calibration points.

Techniques involving computing perspective transformation matrix use linear equation solving. They have an advantage since nonlinear optimization is not needed, but the disadvantage is that lens distortion cannot be considered. Sutherland [27] formulated very explicitly the procedure for computing the perspective transformation matrix given 3D world coordinates and 2D image coordinates of a number of points.

Svoboda [28] has presented a multi-camera self-calibration method that just needs a laser pointer which makes a set of virtual 3D points by waving it through the working volume. Its projection structures are computed via Euclidean stratification and by imposing geometric constraints. This linear estimation requires a post processing computation of radial distortion.

For our internal camera model, we use the calibration procedure of Heikkila [29]. This procedure is most beneficial in camera-based 3D measurements where high geometrical accuracy is needed. It uses an explicit calibration method for mapping 3D coordinates to image coordinates and an implicit approach for image correction.

          The main initialization phase has been partially drawn from Zhang [30]. The distortion coefficients are not estimated at the initialization phase. The technique requires the camera to observe a planar pattern shown at different orientations. The planar pattern can be freely moved. It uses radial distortion for lens distortion modeling. The procedure for extracting grid points from a checkerboard planar pattern is derived from the methods described in [31].

 

2.2 External Parameters:

 

External parameters are needed to transform the object coordinates to a camera-centered coordinate frame. The pinhole camera model is used throughout the calibration procedure; this model is based on the principle of collinearity, where each point in the object space is projected by a straight line through the projection center into the image plane.

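In order to express an arbitrary object point P at world location $(X, Y, Z)$ in image coordinates, we first need to transform it to camera coordinates $(X_c, Y_c, Z_c)$ [29]. This transformation consists of a rotation R and a translation T, and it can be performed using the following matrix equation:

$$ \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = R \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + T \tag{2.1} $$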

 

2.3 Internal Parameters:

 

The list of internal camera parameters is:

 

1.      Focal length, stored in the 2x1 vector fc, where fc(1) and fc(2) are the focal distance expressed in units of horizontal and vertical pixels.

2.      Principal point coordinates indicating the center of the camera image plane stored in the 2x1 vector cc.

3.      The skew coefficient defining the angle between the x and y pixel axes, stored in the scalar alpha_c.

4.      The image distortion coefficients (radial and tangential) stored in the 5x1 vector kc.

 

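Here, as is the norm in computer vision literature, the origin of the image coordinate system is in the upper left corner of the image array. From the pinhole model, the projection of the point P, with camera coordinates $(X_c, Y_c, Z_c)$, onto the image plane is represented as

$$ x_n = \begin{bmatrix} X_c / Z_c \\ Y_c / Z_c \end{bmatrix} = \begin{bmatrix} x \\ y \end{bmatrix} \tag{2.2} $$

where $x_n$ is the normalized (pinhole) image projection.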

 

The pinhole model is only an approximation of the real camera projection. It is not valid when high accuracy is required, and therefore a more comprehensive camera model must be used. The pinhole model is a basis that is extended with some corrections for the distorted image coordinates. The most commonly used correction is for the radial lens distortion that causes the actual image to be displaced radially in the image plane [32]. In addition, the centers of curvature of the lens surfaces are not always strictly collinear, which introduces another common distortion type, decentering distortion, which has both a radial and a tangential component. A proper camera model for accurate calibration can be derived by combining the pinhole model with the correction for the radial and tangential distortion components [31].

 

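After including lens distortion, and with $r^2 = x^2 + y^2$, the new normalized point coordinate $x_d$ is defined as follows:

$$ x_d = \begin{bmatrix} x_d(1) \\ x_d(2) \end{bmatrix} = \left(1 + \mathrm{kc}(1)\, r^2 + \mathrm{kc}(2)\, r^4 + \mathrm{kc}(5)\, r^6\right) x_n + dx, \qquad dx = \begin{bmatrix} 2\,\mathrm{kc}(3)\, x y + \mathrm{kc}(4)\,(r^2 + 2x^2) \\ \mathrm{kc}(3)\,(r^2 + 2y^2) + 2\,\mathrm{kc}(4)\, x y \end{bmatrix} \tag{2.3} $$

where $x_n = [x, y]'$ is the normalized pinhole projection and $dx$ is the tangential distortion vector.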

 

The 5-vector kc contains both the tangential and radial distortion coefficients. This distortion model was introduced by Brown in 1966 and is called the “Plumb Bob” model (radial polynomial + “thin prism”) [33].

 

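Once distortion is applied, the final pixel coordinates x_pixel = $[x_p, y_p]'$ of the projection of P on the image plane are:

$$ x_p = \mathrm{fc}(1)\,\bigl(x_d(1) + \mathrm{alpha\_c} \cdot x_d(2)\bigr) + \mathrm{cc}(1), \qquad y_p = \mathrm{fc}(2)\, x_d(2) + \mathrm{cc}(2) \tag{2.4} $$

where $x_d(1)$ and $x_d(2)$ are the components of the normalized (distorted) coordinate vector.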

 

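Therefore, the pixel coordinate vector x_pixel = $[x_p, y_p]'$ and the normalized (distorted) coordinate vector $x_d$ are related to each other through the linear equation:

$$ \begin{bmatrix} x_p \\ y_p \\ 1 \end{bmatrix} = \mathrm{KK} \begin{bmatrix} x_d(1) \\ x_d(2) \\ 1 \end{bmatrix} \tag{2.5} $$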

 

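Here KK is known as the camera matrix and is defined as follows:

$$ \mathrm{KK} = \begin{bmatrix} \mathrm{fc}(1) & \mathrm{alpha\_c} \cdot \mathrm{fc}(1) & \mathrm{cc}(1) \\ 0 & \mathrm{fc}(2) & \mathrm{cc}(2) \\ 0 & 0 & 1 \end{bmatrix} \tag{2.6} $$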
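The chain from a world point to pixel coordinates, Equations (2.1) through (2.6), can be traced in a few lines. The following is a minimal Python sketch, assuming NumPy and the coefficient ordering of the commonly used “Plumb Bob” convention (kc holds two radial terms, two tangential terms, and a sixth-order radial term, in that order). The function name project_point is illustrative and is not part of the calibration software used in this project.

```python
import numpy as np

def project_point(P, R, T, fc, cc, alpha_c, kc):
    """Project a 3D world point P to pixel coordinates (x_p, y_p)."""
    Xc = R @ P + T                          # world -> camera frame, (2.1)
    x, y = Xc[0] / Xc[2], Xc[1] / Xc[2]     # normalized pinhole projection, (2.2)
    r2 = x * x + y * y
    # Radial factor and tangential offset of the assumed distortion model, (2.3).
    radial = 1 + kc[0] * r2 + kc[1] * r2**2 + kc[4] * r2**3
    dx = 2 * kc[2] * x * y + kc[3] * (r2 + 2 * x * x)
    dy = kc[2] * (r2 + 2 * y * y) + 2 * kc[3] * x * y
    xd, yd = radial * x + dx, radial * y + dy
    # Camera matrix KK maps distorted normalized coordinates to pixels, (2.5)-(2.6).
    KK = np.array([[fc[0], alpha_c * fc[0], cc[0]],
                   [0.0,   fc[1],           cc[1]],
                   [0.0,   0.0,             1.0]])
    u, v, _ = KK @ np.array([xd, yd, 1.0])
    return np.array([u, v])
```

With zero distortion, an identity rotation and zero translation, a point on the optical axis lands on the principal point cc, which is a quick sanity check for a set of calibration parameters.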

 

2.4 Calibration Steps

 

·                    A frame containing N object points whose world coordinates are known is grabbed.

·                    The pixel coordinates x_pixel of these points are detected.

·                    Distortion is initially assumed to be zero (kc = 0).

·                    For each point i, with its 3D object coordinates and the corresponding image coordinates, a linear equation in five unknowns is set up. With N (the number of object points) much larger than five, an overdetermined system of linear equations is established and solved for the five unknowns. The ratio of the two focal length components in fc is estimated from the aspect ratio.

·                    Equation (2.4) is solved with kc as the unknown, using a standard steepest descent optimization scheme and the approximations of fc, T and R computed previously.

·                    Using the kc value computed in the previous step, fc, T and R are recalculated.

 

 

Chapter 3

 

3.1 Introduction to Silhouette Extraction

 

Extracting moving objects from a video sequence is a fundamental and crucial problem in many vision systems, including video surveillance, traffic monitoring, and human detection and tracking for video teleconferencing or human-machine interfaces. The usual approach for discriminating moving objects from the background scene is background subtraction. The idea of background subtraction is to subtract the current image from a reference image, which is acquired from a static background over a period of time. The subtraction leaves only non-stationary or new objects, which include the objects’ entire silhouette region. The technique has been used for several years in many vision systems as a preprocessing step for object detection and tracking [34].

One difficult problem in generating silhouettes by background subtraction is removing shadows quickly and effectively. Most algorithms use color information to distinguish between shadow and non-shadow pixels: the intensity of a shadow pixel is lower, while its chromaticity remains the same as that of the background. Shadow pixels are detected using this information. In our application, however, the images obtained are grayscale, so this technique for shadow detection cannot be used. Instead, the algorithm described by Horprasert et al. [34] was modified and implemented for grayscale images. An appropriate threshold value is chosen and images are segmented into background and foreground.

We create a background model by averaging 20 background frames taken at a regular interval. This is done to cancel effects of ambient light flickering and noise like salt and pepper noise that may creep into the images. Once the subject moves into the tracking region, image subtraction is performed by subtracting the background image model from the current image. By choosing an appropriate threshold value using Otsu’s thresholding algorithm [35], the subtracted image is segmented into background and foreground. The resulting image will be a binary image with the foreground part having ones and the background having zeros. Good quality silhouettes have been generated using these methods.

 

3.2 Steps Involved in Silhouette Extraction

 

1)                  Background modeling constructs a reference image representing the background.

The average of a set of n background images $b_i$ is taken as the reference background image b:

$$ b(x, y) = \frac{1}{n} \sum_{i=1}^{n} b_i(x, y) \tag{3.1} $$

In our experiments, n = 20.

 

2)                  The subtraction operation subtracts the reference background image b from each image I in the stream of input images containing the foreground object, giving the difference image d:

$$ d(x, y) = I(x, y) - b(x, y) \tag{3.2} $$

 

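3)                  Threshold selection chooses an appropriate threshold value to obtain a desired detection rate, and the image is then segmented into background and foreground based on the calculated threshold. Otsu’s threshold selection method [35] is used.

$$ c = \mathrm{hist}(\text{image}) \tag{3.3} $$

$$ \omega = \mathrm{cumsum}\left( c \,/\, \textstyle\sum_i c_i \right) \tag{3.4} $$

$$ \mu = \mathrm{cumsum}\left( \left( c \,/\, \textstyle\sum_i c_i \right) \circ [1, 2, 3, \ldots, 256] \right) \tag{3.5} $$

$$ \mu_t = \mu(256) \tag{3.6} $$

$$ \sigma_b^2 = \frac{(\mu_t\, \omega - \mu)^2}{\omega\,(1 - \omega)} \tag{3.7} $$

$$ \mathrm{index} = \mathrm{mean}\bigl(\mathrm{indices}(\max(\sigma_b^2))\bigr) \tag{3.8} $$

$$ \text{threshold level} = \mathrm{normalized}(\mathrm{index}) \tag{3.9} $$

c is a 256-element vector holding the histogram values of the image.

$\omega$ is a vector holding the cumulative sum of the normalized values of c.

$\mu$ is a vector holding the cumulative sum of the element-wise product of the normalized vector c and the vector [1, 2, 3, ..., 256], which represents the 0-255 gray levels.

$\mu_t$ is the last element of the vector $\mu$, i.e., the sum of all the element-wise products.

$\sigma_b^2$ is the between-class variance, evaluated element-wise for each candidate gray level.

index is the mean of the indices at which $\sigma_b^2$ attains its maximum value.

threshold level is the normalized threshold value in the range of 0 to 1.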
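The threshold selection step can be sketched in Python as follows. This is a minimal illustration assuming NumPy and an 8-bit grayscale input; the between-class variance expression is the standard form of Otsu's criterion, and the function name otsu_level is hypothetical rather than the routine used in this project.

```python
import numpy as np

def otsu_level(image):
    """Return a normalized Otsu threshold in [0, 1] for a uint8 image."""
    c = np.bincount(image.ravel(), minlength=256).astype(float)  # histogram
    p = c / c.sum()                          # normalized histogram
    omega = np.cumsum(p)                     # cumulative class probability
    mu = np.cumsum(p * np.arange(1, 257))    # cumulative first moment
    mu_t = mu[-1]                            # total mean (last element of mu)
    denom = omega * (1.0 - omega)
    sigma_b2 = np.zeros(256)
    valid = denom > 0                        # skip levels with an empty class
    sigma_b2[valid] = (mu_t * omega[valid] - mu[valid]) ** 2 / denom[valid]
    # Mean of the 1-based indices where the variance is maximal.
    index = np.mean(np.flatnonzero(sigma_b2 == sigma_b2.max())) + 1
    return (index - 1) / 255.0               # normalize to the 0..1 range
```

For an image whose pixels take only two gray values, the returned level falls between those two values, so comparing the normalized image against it separates the two groups.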

 

Note, the above algorithm is too computationally time consuming for frame-by-frame use in real-time, i.e., it will reduce the overall bandwidth of the image acquisition process. After running several trials, an acceptable constant threshold value of 0.0039 was obtained for the streamed images. This value depends on the lighting system and the position of the cameras. Since these parameters are relatively static, it is fair to estimate that the threshold value will remain ~0.0039 for all streamed images, hence this threshold value can be determined before runtime. After the subtraction operation, the image is converted to a binary image based on the threshold value. This binary image now contains the silhouette of the foreground object, as shown below in Figure 3.1.  

Figure 3.1: Raw IR image data (top) and silhouette extraction (bottom) for three body views.
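The three steps of Section 3.2, with the fixed run-time threshold, can be sketched in Python as follows. This is a minimal illustration assuming NumPy and 8-bit grayscale frames; the use of the absolute difference and the function names background_model and extract_silhouette are assumptions, not the project's implementation.

```python
import numpy as np

def background_model(frames):
    """Average a list of background frames into one reference image."""
    return np.mean(np.stack(frames).astype(float), axis=0)

def extract_silhouette(frame, background, level=0.0039):
    """Return a binary image: 1 = foreground, 0 = background.

    level is the normalized threshold in [0, 1]; 0.0039 is the constant
    value reported above for the streamed images.
    """
    # Assumption: the magnitude of the difference is thresholded.
    diff = np.abs(frame.astype(float) - background) / 255.0
    return (diff > level).astype(np.uint8)
```

Because 0.0039 is roughly 1/255, this threshold marks any pixel that differs from the background model by about one gray level or more.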

 

Chapter 4



Reconstructing a 3D shape using 2D silhouettes from multiple images is also called voxel carving, volume intersection or shape from silhouettes. It has been shown that in the general case it is impossible to accurately reconstruct an arbitrary volume using a finite number of cameras. The closest approximation to the object that can be obtained by volume intersection is its visual hull. The visual hull of an object is defined as the maximal object that gives the original object’s silhouette from any viewpoint. Only objects that coincide with their visual hulls are exactly reconstructable. Fortunately, due to the convex, smooth shape of the human body, good quality reconstructions can be constructed with as few as five cameras for the majority of body postures [1].

A probabilistic approach to voxel carving was proposed in [36]. In this framework, each voxel is assigned a probability which is computed by comparing the likelihoods of the voxel belonging and not belonging to the object. This approach guarantees that no holes are carved out in the model.

            We have adopted a more straightforward approach that checks for each voxel in the volume of interest if it is consistent with all the 2D silhouettes. In this system, the person is known to always be inside a predetermined volume of interest. A projection of each voxel in that volume onto each of the image planes is pre-computed and stored in a lookup table. Then, at run time, the process of checking whether the voxel is consistent with a 2D silhouette is very fast since the use of the lookup table eliminates most of the necessary computations.

Mikic et al. [1] describe an approach in which they pre-compute a lookup table that maps points from undistorted sensor-plane coordinates in Tsai’s camera model to the image pixels. At run time, an extra computation mapping world coordinates to undistorted sensor-plane coordinates is then performed. We use an approach in which the pre-computed lookup table contains the mapping of each voxel’s world coordinates to its sensor-plane coordinates. Based on the image coordinates of the silhouette, the corresponding voxels are classified as inside or outside.

 

4.2 3D Reconstruction Theories

 

By the principle of perspective projection, an object has to lie within the cone (bounding volume) formed by its silhouette from the corresponding camera view-point. Hence given multiple silhouette images of an object from different viewpoints, its 3-D shape can be reconstructed by intersecting the corresponding bounding volumes by the silhouette constraints as shown in Figure 4.1.

 



 

Figure 4.1: Shape from silhouettes: (a) bounding volume constraints, (b) intersection of bounding volumes [8].

 

In practice, however, the reconstruction is not implemented as shown in Figure 4.1(b), because intersecting rays in 3D is numerically unstable. Hence a voxel-based implementation is used. The whole volume of interest is divided into voxels. Each voxel v is tested for membership in the foreground object by projecting it onto all K silhouette images.

 

 

 

 

4.3 Procedure Used for Voxel Reconstruction

 

The volume of interest is divided into 40 x 40 x 40 equal-sized voxels, each a cube with 50 mm sides, as shown below in Figure 4.2.

Figure 4.2: The volume of interest divided into voxels.
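The voxel grid can be generated as in the following Python sketch, which assumes NumPy, an axis-aligned volume of interest, and a world origin at one corner of the volume; the function name voxel_grid_corners and the choice of origin are illustrative assumptions.

```python
import numpy as np

def voxel_grid_corners(origin=(0.0, 0.0, 0.0), n=40, size=50.0):
    """Return an (n**3, 8, 3) array with the 8 vertices of every voxel, in mm."""
    # Integer index (i, j, k) of each voxel in the grid.
    idx = np.stack(
        np.meshgrid(np.arange(n), np.arange(n), np.arange(n), indexing="ij"),
        axis=-1,
    ).reshape(-1, 3)
    # Offsets of the 8 cube vertices relative to the voxel's lowest corner.
    offsets = np.array([[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)])
    return np.asarray(origin) + size * (idx[:, None, :] + offsets[None, :, :])
```

With the values given above, the grid holds 64,000 voxels and spans 2 m along each axis.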

 

The projection of the voxels onto a particular image plane is pre-calculated and stored in a lookup table. Projecting a voxel onto an image plane yields 8 points, one per vertex. A bounding box containing all 8 points is created, and the silhouette intersection tests are performed on these boxes, as shown below in Figure 4.3.

Figure 4.3: Projection of voxels on the camera image plane [8].

 

 

The algorithm given below was used to find the projection of each voxel on a camera image plane and the bounding box associated with that voxel:

 

Internal Parameters: fc is the focal length of the camera, cc is the image plane center, alpha_c is the skew coefficient, and kc is a 1x5 matrix containing the radial and tangential distortion coefficients.



External Parameters: T is the translation vector and R is the rotation matrix.

Chapter 2 has an explanation of the camera parameters and how they were calculated.

 

                                             (4.1)    

 

                                                                                                     (4.2)

 

                                                                                                (4.3)

                                                                     (4.4)    

 

 

                                                                (4.5)

       

                                          (4.6)

 

                                                                   (4.7)

 

                                                          (4.8)

 

The above algorithm was applied to each of the 8 vertices of every voxel. The resulting pixel coordinates $x_p$ and $y_p$ are the x and y coordinates of the projection of a voxel’s vertex onto the image plane.

 

From the 8 points obtained from the above algorithm, the maximum and minimum values of the x and y coordinates were calculated. From these, a bounding box, i.e., an axis-aligned rectangle that contains all 8 projected points, was constructed.
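The bounding box construction can be sketched in Python as follows, assuming NumPy and that the 8 projected vertices are already available as pixel coordinates. Rounding outward to whole pixels and the function name bounding_box are assumptions made for illustration.

```python
import numpy as np

def bounding_box(pixels):
    """Axis-aligned box around the projected vertices of one voxel.

    pixels: (8, 2) array of (x, y) pixel coordinates.
    Returns (x_min, y_min, x_max, y_max) as integers.
    """
    # Round outward so the box fully covers the projected vertices.
    x_min, y_min = np.floor(pixels.min(axis=0)).astype(int)
    x_max, y_max = np.ceil(pixels.max(axis=0)).astype(int)
    return int(x_min), int(y_min), int(x_max), int(y_max)
```

Since the cameras and the voxel grid are fixed, these boxes only need to be computed once per camera and can then be stored in the lookup table.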

 

Figure 4.4 below shows the bounding box projection of each voxel onto the camera image plane calculated using the above algorithm along with the silhouette obtained from that particular camera.


Figure 4.4: Projection of the bounding boxes of each voxel on the image plane of a camera and the corresponding silhouette obtained from that camera.

 

 

 

Each voxel is classified as either inside or outside the silhouette according to the algorithm given below. The outside voxels are discarded, which carves the voxel model [1], as shown below for three cameras in Figures 4.5 through 4.7.

 

The silhouette bounding-box intersection pseudo-code used for classifying voxels is defined as follows: Let (x_min, y_min), (x_max, y_min), (x_min, y_max) and (x_max, y_max) be the four corner points of a bounding box.
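One way such a classification could be written is sketched below in Python. It assumes NumPy, a precomputed lookup table of bounding boxes for every camera and voxel, and the criterion that a voxel is inside only if its bounding box overlaps at least one foreground pixel in every camera. That criterion and the function name carve are assumptions for illustration and may differ from the test actually used.

```python
import numpy as np

def carve(boxes, silhouettes):
    """Classify voxels as inside (True) or outside (False).

    boxes: integer array of shape (K, V, 4) holding
           (x_min, y_min, x_max, y_max) for K cameras and V voxels.
    silhouettes: list of K binary images (1 = foreground).
    """
    n_cameras, n_voxels, _ = boxes.shape
    inside = np.ones(n_voxels, dtype=bool)
    for k in range(n_cameras):
        sil = silhouettes[k]
        height, width = sil.shape
        # Only voxels that survived the previous cameras need testing.
        for v in np.flatnonzero(inside):
            x0, y0, x1, y1 = boxes[k, v]
            # Clip the box to the image; an empty box means no overlap.
            x0, x1 = max(x0, 0), min(x1, width - 1)
            y0, y1 = max(y0, 0), min(y1, height - 1)
            if x0 > x1 or y0 > y1 or not sil[y0:y1 + 1, x0:x1 + 1].any():
                inside[v] = False
    return inside
```

Processing the cameras one after another mirrors Figures 4.6 and 4.7, where each additional silhouette removes more voxels from the model.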

 

    

 

 


(a) Camera one.

(b) Camera two.

 

(c) Camera three.


 

Figure 4.5: Image silhouettes from three cameras placed at different angles.

 


Figure 4.6: The volume of interest carved by performing silhouette intersection tests using the silhouettes in Figure 4.5(a) and 4.5(b).

Figure 4.7: The voxel model further refined by performing the silhouette intersection test using the silhouette in Figure 4.5(c).



Chapter 5



5.1 Introduction to Ellipsoid Template Fitting

 

            After obtaining the 3D voxel data from the procedures described previously, ellipsoid models are used to fit the 3D voxel data in real time.

Mikic et al. [1] use a sequential template growing procedure to locate body parts. A spherical crust template whose inner and outer diameters correspond to the smallest and largest expected head dimensions is created. The spherical crust is then moved in 3D space until the number of voxels inside the crust is maximized. We found this technique of searching for the location of the head unstable and not feasible for our application in real time.

Kanade and Cheung [8] use 6 ellipsoid shells (the head, body, two arms and two legs) to fit onto the 3D voxel model. To make their fitting process fast for a real-time system, they use a two step Expectation-Maximization (EM) like approach. A proximity criterion is used to segment the voxel data. Based on this, voxels are assigned to shells.

We fit the templates by locating the head first from the 3D voxel data. This initialization would, however, fail if the subject has his arms raised above his head during initialization. Once the head is fitted with a template, the torso and the limbs are located from the voxel data and are fitted with cylinders and ellipsoids.

 

5.2 Procedure for Ellipsoid Template Fitting

 

The head is fitted with a sphere whose radius is determined from the anthropometric data corresponding to the subject (for example, a short male’s head size is 140 x 180 mm). The head is located by measuring the height of the 3D voxel model. This part of the ellipsoid initialization would fail if the subject has a posture in which the head is below another body part (for example, if the subject has his hands raised above his head). Refer to Appendix C for the anthropometric data used in template fitting.
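A height-based head search of this kind could look like the following Python sketch. It assumes NumPy, occupied voxel centres given in millimetres with the z axis pointing up, and a head centre placed one radius below the top of the model; the function name locate_head and these geometric choices are hypothetical, since the exact procedure is not spelled out above.

```python
import numpy as np

def locate_head(voxel_centers, head_radius):
    """Estimate the centre of the head sphere from occupied voxel centres.

    voxel_centers: (N, 3) array of (x, y, z) positions in mm, z pointing up.
    head_radius: sphere radius in mm, taken from anthropometric data.
    """
    top = voxel_centers[:, 2].max()          # height of the voxel model
    # Voxels within one head diameter of the top are assumed to be the head.
    near_top = voxel_centers[voxel_centers[:, 2] >= top - 2 * head_radius]
    centre = near_top.mean(axis=0)
    centre[2] = top - head_radius            # sphere touches the top of the model
    return centre
```

As noted above, any such search breaks down when another body part, such as a raised hand, is the highest point of the voxel model.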