
Technical Report No. VSR-0X.YY
Annual Technical Report
For the Project
Digital Human Modeling and Virtual Reality for FCS
By
Ajay Gopinath, M.S. Candidate
Electrical and Computer Engineering
Clarkson University, Potsdam, NY 13699-5720
Dr. James J. Carroll, Associate Professor
Electrical and Computer Engineering
Clarkson University, Potsdam, NY 13699-5720
CONTRACT/PR NO. DAAE07-03-D-L003/0001
Abstract
This document describes the work performed within the Digital Human Design thrust of the Virtual Soldier Research (VSR) Program. The objective is to develop a system to obtain real-time tracking data of a subject from multiple synchronized Infrared video streams and use this data to drive a gesture-based interaction with an avatar.
An inexpensive consumer off-the-shelf (COTS) infrared system is used for automatic acquisition of a human body model and for motion tracking. Acquired video frames are segmented through background subtraction. 3D voxel representations of the human body shape in each frame are computed in real time from foreground silhouettes. The infrared cameras are all calibrated, which means that their parameters, such as the rotation and translation matrices, the distortion model, the focal length and the image plane center, are all precomputed.
Voxels are created in the tracking area, and the projection of each voxel on a given camera plane is precomputed and stored in a lookup table. At run time each voxel is classified as either inside or outside by performing intersection tests with the corresponding foreground silhouettes. The voxels classified as outside are discarded, leaving behind a carved body model of the subject present in the tracking region.
Model acquisition starts with a simple body part localization procedure based on template fitting, which uses prior knowledge of average body part shapes and dimensions. The human body model consists of ellipsoids and cylinders and is described using a twist based framework. Based on the inverse kinematic equations for joint angles, the centroid position of each ellipsoid is calculated using data from measurement points. A robust voxel labeling procedure is used to obtain quality measurements.
Chapter 1
1.1 Introduction
Human body modeling, or posture estimation, is the problem of constructing an accurate body model representation of a subject from video data and tracking the body, i.e. calculating the 3D coordinates of the body model as the subject moves in the tracking region. A reliable motion capture system would be valuable in many applications. One class of applications uses the extracted body model parameters directly, to interact with a virtual world or to drive an animated avatar in computer graphics character animation. This can be used for applications ranging from soldier training to physiotherapy. Another class of applications uses extracted parameters to classify and recognize people, gestures or motions, such as surveillance systems, intelligent environments, content-based indexing of sports video footage, or advanced user interfaces for sign language translation, gesture-driven control, and gait or posture recognition [1-7]. Motion parameters can also be used for motion analysis in applications such as personalized sports training, choreography, or clinical studies of orthopedic patients.
In the past decade, efforts have been put in developing methods for reconstructing 3D models of human actions or tracking human/hand movements from multiple cameras [8-14]. These methods can be divided into two categories. The first category focuses on detailed model reconstruction or segmentation of humans in 3D. Despite the fact that image acquisition is real time, the reconstruction or segmentation in this category is done offline. The second category involves real-time 3D human motion tracking without explicit or detailed 3D human body reconstruction. Almost all the methods in this category assume a generic 3D human model and the tracking/model fitting is done by comparing the camera projections of the generic model with the silhouettes, edges or the optical flow of the camera images. Though these methods work well in general, very few have tried to reconstruct humans in real-time directly in the 3D domain. In view of this, we have built a system that can achieve real-time 3D reconstruction of human body and movements. We use 3D voxel reconstructions of the human body shape at each frame as input to the model acquisition and tracking. This approach leads to simple and robust algorithms that take advantage of the unique qualities of voxel data. The price is an additional preprocessing step where the 3D voxel reconstructions are computed from the image data.
Infrared (IR) cameras are used in our system because tracking is performed on a subject inside a Virtual Reality (VR) cube. Inside the cube, background images are constantly changing depending on the immersive environment used. Hence a stable background model cannot be created for silhouette extraction. With IR cameras, the images on the VR cube are not seen since the only objects visible to the cameras are the ones that reflect IR light from the IR sources placed in the cube (Appendix B shows the lab setup). This results in a stable background environment on which robust silhouette (foreground) extraction can be performed.
We first calibrate the IR cameras and store the parameters obtained from calibration. Chapter 2 describes the camera calibration techniques used. The internal parameters of the camera do not change unless the focal length or the lens filter is changed. The external parameters change if the camera’s position or orientation is changed. Once the cameras are calibrated, the images grabbed from these cameras are segmented into background and foreground pixels. Chapter 3 describes the silhouette extraction techniques used, along with the thresholding algorithm for image segmentation. Using the silhouettes and the corresponding external parameters of each camera, voxel carving is performed. Chapter 4 describes the methods used to perform voxel carving. 3D templates are then fitted onto the voxel body model; Chapter 5 describes the procedures used to perform template matching. Once the templates are fit on the voxel model, a twist based human body model is initialized and this body model is tracked based on 23 measurement points. This is described in Chapter 6. Appendix A has the procedure for calibrating cameras and the calibration results obtained. Appendix B describes the laboratory setup and lists the hardware and software used. Appendix C has the anthropometric data used to initialize template fitting.
1.2 Related Work
Currently available commercial systems for motion capture require the subject to wear special markers, body suits or gloves. In the past few years, the problem of marker-less, unconstrained posture estimation using only cameras has received much attention from computer vision researchers.
Many existing posture estimation systems require manual initialization of the model and then perform tracking. Very few approaches exist where the model is acquired automatically. Algorithms have been developed that take input from one or multiple cameras [8, 11]. A system for tracking a 3D articulated model of a hand, based on a layered template representation of self-occlusion, has been developed by Rehg and Kanade [15].
Kakadiaris and Metaxas [11] have developed a system for 3D human body model acquisition and tracking using three cameras placed in a mutually orthogonal configuration. The person under observation is requested to perform a set of movements according to a protocol that incrementally reveals the structure of the human body. Once the model has been acquired, the tracking is performed using a physics-based framework.
Cheung and Kanade [8] use an algorithm called Sparse Pixel Occupancy Test (SPOT) for performing silhouette voxel intersection tests. Instead of testing all the pixels included in a voxel projection on the image plane, a uniformly distributed set of pixels within the bounding box of the voxel projection is chosen, and if at least a given fraction of these chosen pixels are silhouette pixels, the voxel is classified as belonging to the foreground object. One big advantage of SPOT is speed, since not all pixels are tested. The key question in using SPOT is how many pixels to test: the number of tests should be small while maintaining a low probability of misclassifying voxels. To answer this question, they measure two error rates. The first is False Acceptance (FA), in which an outside voxel is falsely classified as an inside voxel. The second is False Rejection (FR), in which an inside voxel is falsely classified as an outside voxel. Graphs of FA and FR against the number of pixels tested are plotted, and based on these graphs an optimum value for the number of pixels to be tested is chosen.
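The sparse test described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the implementation of [8]: the function name `spot_is_inside`, the regular sampling grid, and the default values of `num_samples` and `min_fraction` are assumptions made for the example.

```python
import numpy as np

def spot_is_inside(silhouette, box, num_samples=16, min_fraction=0.5):
    """Sparse occupancy test on one voxel projection (illustrative sketch).

    silhouette: 2D array, nonzero where the pixel is foreground.
    box: (x_min, y_min, x_max, y_max) bounding box of the voxel projection.
    Only a regular grid of about num_samples pixels inside the box is tested.
    """
    x_min, y_min, x_max, y_max = box
    side = int(np.ceil(np.sqrt(num_samples)))       # samples per axis
    h, w = silhouette.shape
    # Evenly spaced sample coordinates, clipped to the image bounds.
    xs = np.clip(np.linspace(x_min, x_max, side).round().astype(int), 0, w - 1)
    ys = np.clip(np.linspace(y_min, y_max, side).round().astype(int), 0, h - 1)
    samples = silhouette[np.ix_(ys, xs)] > 0        # side x side grid of tests
    return bool(samples.mean() >= min_fraction)
```

Raising `num_samples` lowers the chance of false acceptance or false rejection at the cost of more pixel reads, which is the trade-off the FA and FR graphs are used to settle.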
Cheung and Kanade use a step for internal voxel removal where after the voxel based 3D reconstruction is performed, each voxel is tested and only the surface voxels are extracted removing all internal voxels. This is useful in many applications where we may not be interested in the whole volume of the reconstructed data. Much CPU time can be saved by displaying only the surface voxels.
In the algorithms that work with the data in the image planes, the 3D body model is repeatedly projected onto the image planes to be compared against the extracted image features. A simplified camera model is often used to enable efficient model projection [16]. Another problem in working with the image plane data is that different body parts appear in different sizes and may be occluded depending on the relative position of the body to the camera and on the body pose.
When using 3D voxel reconstructions as input to the motion capture system, very detailed camera models can be used, since the computations that use the camera model can be done off-line and stored in lookup tables. The voxel data is in the same 3D space as the body model; therefore, there is no need for repeated projections of the model to the image space. Also, since the voxel reconstruction is of the same dimensions as the subject, the sizes of different body parts are stable and do not depend on the subject’s position and pose.
Szeliski et al. [17] use a different approach to create 3D models. In their scenario, an object rotates on a turntable in front of a static camera and an octree model is generated from multiple frames. Chien and Aggarwal [18] constructed an octree from three orthographic projections. An octree of the conic volume formed by the silhouette and the center of projection is computed for each viewpoint, and the octrees from all of the viewpoints are then intersected. Szeliski builds a coarse 3D octree description first, and then a sophisticated shape-from-motion algorithm that relies on the detailed analysis of optic flow is used to refine the 3D model. Their 3D projection operations, computation of the distance transform and bounding box intersection tests are easily parallelizable and can thus take advantage of modern massively parallel architectures.
Seitz and Dyer [19] have proposed an algorithm which computes a set of occupied voxels that is consistent with a large number of observed images. This algorithm makes a single pass through voxel space, first computing the visibility of each voxel and then its color. Their algorithm is based on the fact that each camera must agree on the color of an opaque voxel, but only when that voxel is visible from that camera.
De Bonet et al. [20] introduce a concept of “Poxels”: Probabilistic Voxel Reconstruction. Their probabilistic voxel approach addresses the assumption made in most direct 3D reconstruction methods that a pixel is either completely transparent or completely opaque. A 3D space is reconstructed by optimizing over the probability that each voxel is visible in each of the projections. An iterative algorithm is used to find the optimal probability distribution which jointly explains all the observed projections. A multi-step procedure which alternates between estimation of the colors and estimation of the opacities is used. Since this approach formulates the problem as one of optimization over the probability distribution of the visibility of each region of space, uncertainty due to lack of data, or perhaps contradictory data can be captured as well. It also tries to solve the transparent voxel coloring problem.
Pavlovic et al. [38] use a Dynamic Bayesian Network to model a complex dynamic system with multiple linear models that are indexed by a switching variable. The network is used as a multi-hypothesis predictor that initializes multiple template searches in the image space. The templates for body parts are initialized manually.
Bregler et al. and Yamamoto et al. [21, 22] use systems with linear constraint equations, created by combining articulated-body models with dense optical flows. Yamamoto maintains the constraints between limbs by sequentially estimating the motion of each parent limb, adjusting the hypothesized position of the child limb, then estimating the further motion of the child limb. In contrast, Bregler takes full advantage of the information provided by child limbs to further constrain the estimated motions of the parents. Both Yamamoto and Bregler use a first-order Taylor series approximation to the camera/body rotation matrix to reduce the number of parameters used to represent this matrix. Furthermore, both use an articulated model to generate depth values that are needed to linearize the mapping from 3D body motions to observed 2D camera-plane motions.
Chapter 2
2.1 Introduction to Camera Calibration
Camera calibration in the context of three-dimensional machine vision is the process of determining the internal camera geometric and optical characteristics (internal parameters) and/or the 3-D position and orientation of the camera frame relative to a certain world coordinate system (external parameters), for inferring 3D information concerning the location of the object, target, or feature from computer image coordinates. There are two kinds of 3D information to be inferred. They are different mainly because of the difference in applications.
The first is 3D information concerning the location of the object, target or feature. For simplicity, if the object is a point feature, camera calibration provides a way of determining a ray in 3D space that the object point must lie on, given the computer image coordinates. With two views from different cameras, the position of the object point can be determined by intersecting the two rays. Both extrinsic and intrinsic parameters need to be calibrated. In our application the camera calibration needs to be done only once.
The second kind is 3D information concerning the position and orientation of a camera relative to the target world coordinate system.
Tsai’s seminal camera calibration paper [23] is widely cited for camera calibration. Tsai’s theoretical model of the imaging process includes lens distortion and uses perspective rather than parallel projection. The calibration does not involve a high-dimensional nonlinear search. External parameters need repeated calibration, and this calibration approach allows enough potential for high-speed implementation.
Approaches that make use of a full scale nonlinear optimization can adapt to accurate yet complex imaging models, but they require a good initial guess to start the nonlinear search. Faig’s technique [24] is a good representative for these techniques. It uses a very elaborate model for imaging using at least 17 unknowns for each image and is computer-intensive. Sobel [25] described a system for calibrating a camera using nonlinear equation solving. Eighteen parameters need to be optimized and the approach is similar to Faig’s method. Gennery [26] described a method that finds camera parameters iteratively by minimizing the error of epipolar constraints without using 3D coordinates of calibration points.
Techniques involving computing perspective transformation matrix use linear equation solving. They have an advantage since nonlinear optimization is not needed, but the disadvantage is that lens distortion cannot be considered. Sutherland [27] formulated very explicitly the procedure for computing the perspective transformation matrix given 3D world coordinates and 2D image coordinates of a number of points.
Svoboda [28] has presented a multi-camera self-calibration method that just needs a laser pointer which makes a set of virtual 3D points by waving it through the working volume. Its projection structures are computed via Euclidean stratification and by imposing geometric constraints. This linear estimation requires a post processing computation of radial distortion.
For our internal camera model, we use the calibration procedure of Heikkila [29]. This procedure is most beneficial in camera based 3D measurements where high geometrical accuracy is needed. It uses an explicit calibration method for mapping 3D coordinates to image coordinates and an implicit approach for image correction.
The main initialization phase has been partially drawn from Zhang [30]. The distortion coefficients are not estimated at the initialization phase. The technique requires the camera to observe a planar pattern shown at different orientations. The planar pattern can be freely moved. It uses radial distortion for lens distortion modeling. The procedure for extracting grid points from a checkerboard planar pattern is derived from the methods described in [31].
2.2 External Parameters:
External parameters are needed to transform the object coordinates to a camera centered coordinate frame. The pinhole camera model is used throughout the calibration procedure; this model is based on the principle of collinearity, where each point in the object space is projected by a straight line through the projection center into the image plane.
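In order to express an arbitrary object point P at location $(X, Y, Z)$ in image coordinates, we need to transform it to camera coordinates $(x_c, y_c, z_c)$ [29]. This transformation consists of a translation and a rotation and it can be performed by using the following matrix equation, in which $R$ is the rotation matrix and $T$ is the translation vector.
$$\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = R \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + T$$ (2.1)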
2.3 Internal Parameters:
The list of internal camera parameters is:
1. Focal length, stored in the 2x1 vector fc, where fc(1) and fc(2) are the focal distance expressed in units of horizontal and vertical pixels.
2. Principal point coordinates indicating the center of the camera image plane, stored in the 2x1 vector cc.
3. The skew coefficient defining the angle between the x and y pixel axes, stored in the scalar alpha_c.
4. The image distortion coefficients (radial and tangential), stored in the 5x1 vector kc.
Here, as is the norm in computer vision literature, the origin of the image coordinate system is in the upper left corner of the image array. From the pinhole model, the projection of the point P to the image plane is represented as
$$x_n = \begin{bmatrix} x_c / z_c \\ y_c / z_c \end{bmatrix} = \begin{bmatrix} x \\ y \end{bmatrix}$$ (2.2)
where $x_n$ is the normalized (pinhole) image projection.
The pinhole model is only an approximation of the real camera projection. It is not valid when high accuracy is required, and therefore a more comprehensive camera model must be used. The pinhole model is a basis that is extended with some corrections for the distorted image coordinates. The most commonly used correction is for the radial lens distortion that causes the actual image point to be displaced radially in the image plane [32]. In addition, the centers of curvature of lens surfaces are not always strictly collinear, which introduces another common distortion type, decentering distortion, which has both a radial and a tangential component. A proper camera model for accurate calibration can be derived by combining the pinhole model with the correction for the radial and tangential distortion components [31].
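Let $x_n = [x \;\; y]'$ denote the normalized pinhole projection and let $r^2 = x^2 + y^2$. After including lens distortion, the new normalized point coordinate $x_d$ is defined as follows:
$$x_d = \begin{bmatrix} x_d(1) \\ x_d(2) \end{bmatrix} = \left(1 + kc(1)\,r^2 + kc(2)\,r^4 + kc(5)\,r^6\right) x_n + dx, \qquad dx = \begin{bmatrix} 2\,kc(3)\,x\,y + kc(4)\,(r^2 + 2x^2) \\ kc(3)\,(r^2 + 2y^2) + 2\,kc(4)\,x\,y \end{bmatrix}$$ (2.3)
The 5-vector kc contains both the tangential and radial distortion coefficients. This distortion model was introduced by Brown in 1966 and is called the “Plumb Bob” model (radial polynomial + “thin prism”) [33].
Once distortion is applied, the final pixel coordinates x_pixel = $[x_p \;\; y_p]'$ of the projection of P on the image plane are:
$$x_p = fc(1)\,\left(x_d(1) + \alpha_c\, x_d(2)\right) + cc(1), \qquad y_p = fc(2)\, x_d(2) + cc(2)$$ (2.4)
Therefore, the pixel coordinate vector x_pixel and the normalized (distorted) coordinate vector $x_d$ are related to each other through the linear equation:
$$\begin{bmatrix} x_p \\ y_p \\ 1 \end{bmatrix} = KK \begin{bmatrix} x_d(1) \\ x_d(2) \\ 1 \end{bmatrix}$$ (2.5)
where KK is known as the camera matrix and is defined as follows:
$$KK = \begin{bmatrix} fc(1) & \alpha_c\, fc(1) & cc(1) \\ 0 & fc(2) & cc(2) \\ 0 & 0 & 1 \end{bmatrix}$$ (2.6)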
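As an illustration of the projection chain in equations (2.1) to (2.6), the following Python sketch maps world points to pixel coordinates. It assumes the widely used Brown ("Plumb Bob") parameterization, with kc ordered as two radial terms, two tangential terms and a third radial term; the function name `project_points` and the array layout are choices made for this example and are not taken from the calibration code used in the project.

```python
import numpy as np

def project_points(X, R, T, fc, cc, kc, alpha_c=0.0):
    """Project Nx3 world points to Nx2 pixel coordinates (illustrative sketch)."""
    X = np.asarray(X, dtype=float).reshape(-1, 3)
    R = np.asarray(R, dtype=float)
    T = np.asarray(T, dtype=float).reshape(1, 3)

    Xc = X @ R.T + T                       # (2.1) world frame -> camera frame
    x = Xc[:, 0] / Xc[:, 2]                # (2.2) normalized pinhole projection
    y = Xc[:, 1] / Xc[:, 2]

    r2 = x**2 + y**2                       # (2.3) radial + tangential distortion
    radial = 1.0 + kc[0] * r2 + kc[1] * r2**2 + kc[4] * r2**3
    dx = 2.0 * kc[2] * x * y + kc[3] * (r2 + 2.0 * x**2)
    dy = kc[2] * (r2 + 2.0 * y**2) + 2.0 * kc[3] * x * y
    xd = radial * x + dx
    yd = radial * y + dy

    xp = fc[0] * (xd + alpha_c * yd) + cc[0]   # (2.4) pixel coordinates
    yp = fc[1] * yd + cc[1]
    return np.stack([xp, yp], axis=1)
```

With all distortion coefficients set to zero the function reduces to the plain pinhole model, which is a convenient sanity check when comparing against calibration output.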
2.4 Calibration Steps
· A frame containing N object points whose world coordinates are known is grabbed.
· The pixel coordinates x_pixel of these points are detected.
· Initially, distortion is assumed to be zero (kc = 0).
· For each point i, the 3D object coordinate and the corresponding image coordinate are used to set up linear equations in five unknowns. With N (the number of object points) much larger than five, an overdetermined system of linear equations is established and solved for the five unknowns. The ratio of … is estimated from the aspect ratio.
· Equation (2.4) is solved with kc as the unknown term using a standard optimization scheme (steepest descent) and the approximations for fc, T and R computed previously.
· Using the kc value computed in the above step, fc, T and R are recalculated.
Chapter 3
3.1 Introduction to Silhouette Extraction
Extracting moving objects from a video sequence is a fundamental and crucial problem for many vision systems, including video surveillance, traffic monitoring, and human detection and tracking for video teleconferencing or human-machine interfaces. The usual approach for discriminating moving objects from the background scene is background subtraction. The idea of background subtraction is to subtract a reference image, acquired from a static background during a period of time, from the current image. The subtraction leaves only non-stationary or new objects, which include the objects’ entire silhouette region. The technique has been used for several years in many vision systems as a preprocessing step for object detection and tracking [34].
One difficult problem in generating silhouettes by background subtraction is removing shadows quickly and effectively. Most algorithms use color information to distinguish between shadow and non-shadow pixels: the intensity of a shadow pixel is lower than that of the background, while its chromaticity remains the same as the background’s. Using this information, shadow pixels are detected. In our application, however, the images obtained are grayscale images and this technique for shadow detection cannot be used. Instead, the algorithm described in Horprasert et al. [34] was modified and implemented for grayscale images. An appropriate threshold value is chosen and images are segmented into background and foreground.
We create a background model by averaging 20 background frames taken at a regular interval. This is done to cancel effects of ambient light flickering and noise like salt and pepper noise that may creep into the images. Once the subject moves into the tracking region, image subtraction is performed by subtracting the background image model from the current image. By choosing an appropriate threshold value using Otsu’s thresholding algorithm [35], the subtracted image is segmented into background and foreground. The resulting image will be a binary image with the foreground part having ones and the background having zeros. Good quality silhouettes have been generated using these methods.
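The averaging, subtraction and thresholding described above can be sketched as follows. This is a minimal illustration assuming 8-bit grayscale frames held as NumPy arrays; the function names are hypothetical, and the use of the absolute difference is an assumption, since the text only states that the background model is subtracted from the current image.

```python
import numpy as np

def build_background(frames):
    """Average a list of background frames into one reference image."""
    return np.mean(np.stack(frames).astype(np.float64), axis=0)

def extract_silhouette(image, background, threshold):
    """Return a binary image: 1 for foreground, 0 for background.

    threshold is normalized to [0, 1], so 8-bit differences are divided by 255.
    """
    diff = np.abs(image.astype(np.float64) - background) / 255.0
    return (diff > threshold).astype(np.uint8)
```

Averaging in floating point before thresholding avoids the wrap-around that would occur if two unsigned 8-bit images were subtracted directly.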
3.2 Steps Involved in Silhouette Extraction
1) Background modeling constructs a reference image representing the background. The average of a set of n background images $b_i$ is taken as the reference background image b. In our experiments, n = 20.
$$b = \frac{1}{n} \sum_{i=1}^{n} b_i$$ (3.1)
2) The subtraction operation involves subtracting the reference background image b from each image $I$ in the stream of input images containing the foreground object, giving the difference image $d$.
$$d = I - b$$ (3.2)
3) Threshold selection selects the appropriate threshold value to obtain a desired detection rate, and the image is segmented into background and foreground based on the calculated threshold. Otsu’s threshold selection method [35] is used.
$$p(i) = \frac{c(i)}{\sum_{j=1}^{256} c(j)}$$ (3.3)
$$\omega(k) = \sum_{i=1}^{k} p(i)$$ (3.4)
$$\mu(k) = \sum_{i=1}^{k} i \, p(i)$$ (3.5)
$$\mu_t = \mu(256)$$ (3.6)
$$\sigma_b^2(k) = \frac{\left(\mu_t \, \omega(k) - \mu(k)\right)^2}{\omega(k)\,\left(1 - \omega(k)\right)}$$ (3.7)
index = mean(indices(max($\sigma_b^2$))) (3.8)
threshold level = normalized(index) (3.9)
c is a 256-element vector having the histogram values of the image, and p holds the normalized values of c.
$\omega$ is a vector containing the cumulative sum of the normalized values of c.
$\mu$ is a vector containing the cumulative sum of the product of the normalized values of c and the vector [1, 2, 3, 4, …, 256], which represents the 0-255 gray levels.
$\mu_t$ contains the last element of the vector $\mu$. This element is the sum of all the elements of that product vector.
index has the mean of the indices having the maximum value of $\sigma_b^2$.
threshold level is the normalized threshold value in the range of 0 to 1.
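The threshold selection steps above can be sketched in Python with NumPy. This is a hedged illustration of Otsu's method on an 8-bit grayscale image, following the histogram, cumulative-sum and between-class-variance quantities described in the text; the function name `otsu_level` and the handling of empty classes (variance set to zero) are assumptions of this sketch.

```python
import numpy as np

def otsu_level(image):
    """Return Otsu's threshold for a uint8 image, normalized to [0, 1]."""
    c = np.bincount(image.ravel(), minlength=256).astype(np.float64)  # histogram
    p = c / c.sum()                              # normalized histogram
    omega = np.cumsum(p)                         # cumulative class probability
    mu = np.cumsum(p * np.arange(1, 257))        # cumulative first moment
    mu_t = mu[-1]                                # total mean (last element of mu)

    denom = omega * (1.0 - omega)
    sigma_b2 = np.zeros(256)
    valid = denom > 0                            # skip thresholds with an empty class
    sigma_b2[valid] = (mu_t * omega[valid] - mu[valid]) ** 2 / denom[valid]

    # Mean of the (0-based) gray levels at which the variance is maximal.
    index = np.mean(np.flatnonzero(sigma_b2 == sigma_b2.max()))
    return index / 255.0
```

For an image with two well separated gray levels the returned level falls between them, so comparing normalized pixel values against it splits the two classes.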
Note that the above algorithm is too time consuming for frame-by-frame use in real time, i.e., it would reduce the overall bandwidth of the image acquisition process. After running several trials, an acceptable constant threshold value of 0.0039 was obtained for the streamed images. This value depends on the lighting system and the position of the cameras. Since these parameters are relatively static, it is fair to estimate that the threshold value will remain ~0.0039 for all streamed images; hence this threshold value can be determined before runtime. After the subtraction operation, the image is converted to a binary image based on the threshold value. This binary image now contains the silhouette of the foreground object, as shown below in Figure 3.1.






Figure 3.1: Raw IR image data (top) and silhouette extraction (bottom) for three body views.
Chapter 4
Reconstructing a 3D shape using 2D silhouettes from multiple images is also called voxel carving, volume intersection or shape from silhouettes. It has been shown that in the general case it is impossible to accurately reconstruct an arbitrary volume using a finite number of cameras. The closest approximation to the object that can be obtained by volume intersection is its visual hull. The visual hull of an object is defined as the maximal object that gives the original object’s silhouette from any viewpoint. Only objects that coincide with their visual hulls are exactly reconstructable. Fortunately, due to the convex, smooth shape of the human body, good quality reconstructions can be constructed with as few as five cameras for the majority of body postures [1].
A probabilistic approach to voxel carving was proposed in [36]. In this framework, each voxel is assigned a probability which is computed by comparing the likelihoods for the voxel belonging and not belonging to the object. This approach guarantees that no holes are carved out in the model.
We have adopted a more straightforward approach that checks for each voxel in the volume of interest if it is consistent with all the 2D silhouettes. In this system, the person is known to always be inside a predetermined volume of interest. A projection of each voxel in that volume onto each of the image planes is pre-computed and stored in a lookup table. Then, at run time, the process of checking whether the voxel is consistent with a 2D silhouette is very fast since the use of the lookup table eliminates most of the necessary computations.
Mikic et al. [1] describe an approach in which they pre-compute a lookup table that maps points from undistorted sensor-plane coordinates in Tsai’s camera model to the image pixels. Then, at runtime, an extra computation of mapping from world coordinates to undistorted sensor-plane coordinates is performed. We use an approach where the pre-computed lookup table contains the mapping of each voxel’s world coordinate to its sensor-plane coordinates. Based on the image coordinates of the silhouette, the corresponding voxels are classified as inside or outside.
4.2 3D Reconstruction Theories
By the principle of perspective projection, an object has to lie within the cone (bounding volume) formed by its silhouette from the corresponding camera viewpoint. Hence, given multiple silhouette images of an object from different viewpoints, its 3D shape can be reconstructed by intersecting the corresponding bounding volumes defined by the silhouette constraints, as shown in Figure 4.1.

Figure 4.1: Shape from silhouettes: (a) bounding volume constraints, (b) intersection of bounding volumes [8].
In practice, however, it is not implemented as shown in Figure 4.1(b) because intersecting rays in 3D is numerically unstable. Hence a voxel based implementation is used. The whole volume of interest is divided into voxels. Each voxel v is tested for membership in the foreground object by projecting it onto all K silhouette images.
4.3 Procedure Used for Voxel Reconstruction
The volume of interest is divided into 40 x 40 x 40 equal-sized voxels, each a cube of side 50 mm, as shown below in Figure 4.2.

Figure 4.2: The volume of interest divided into voxels.
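The voxel grid described above can be generated directly. The sketch below builds the 8 corner vertices of every voxel in an n x n x n grid of cubes with a given side length; the function name `voxel_grid_vertices`, the choice of `origin` as the minimum corner of the volume, and the vertex ordering are assumptions of this example.

```python
import numpy as np

def voxel_grid_vertices(origin, n=40, size=50.0):
    """Return an (n**3, 8, 3) array with the corner vertices of each voxel in mm."""
    # Integer (i, j, k) index of every voxel in the grid.
    idx = np.stack(
        np.meshgrid(np.arange(n), np.arange(n), np.arange(n), indexing="ij"),
        axis=-1,
    ).reshape(-1, 3)
    # Offsets of the 8 corners of a unit cube.
    corners = np.array([[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)])
    return np.asarray(origin, dtype=float) + size * (idx[:, None, :] + corners[None, :, :])
```

With the default values this gives 64,000 voxels covering a cube of side 2000 mm, and the resulting vertices are what would be projected onto each camera image plane.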
The projection of voxels on a particular image plane is pre-calculated and stored in a look-up table. The projection of a particular voxel on a plane results in 8 points. A bounding box which contains all these 8 points is created, and silhouette intersection tests are performed on these boxes, as shown below in Figure 4.3.

Figure 4.3: Projection of voxels on the camera image plane [8].
The algorithm given below was used to find the projection of each voxel on a camera image plane and the bounding box associated with that voxel:
Internal Parameters: fc is the focal length of the camera, cc is the image plane center, alpha_c is the skew coefficient, and kc is a 1x5 matrix containing the radial and tangential distortion coefficients.
External Parameters: T is the translation matrix and R is the rotational matrix.
Chapter 2 has an explanation of camera parameters and how they were calculated.
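For a voxel vertex at world coordinates $(X, Y, Z)$:
$$\begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = R \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} + T$$ (4.1)
$$x = x_c / z_c, \qquad y = y_c / z_c$$ (4.2)
$$r^2 = x^2 + y^2$$ (4.3)
$$dx_1 = 2\,kc(3)\,x\,y + kc(4)\,(r^2 + 2x^2)$$ (4.4)
$$dx_2 = kc(3)\,(r^2 + 2y^2) + 2\,kc(4)\,x\,y$$ (4.5)
$$x_d(1) = \left(1 + kc(1)\,r^2 + kc(2)\,r^4 + kc(5)\,r^6\right) x + dx_1, \qquad x_d(2) = \left(1 + kc(1)\,r^2 + kc(2)\,r^4 + kc(5)\,r^6\right) y + dx_2$$ (4.6)
$$x_p = fc(1)\,\left(x_d(1) + \alpha_c\, x_d(2)\right) + cc(1)$$ (4.7)
$$y_p = fc(2)\, x_d(2) + cc(2)$$ (4.8)
The above algorithm was implemented for each of the 8 vertices of all the voxels. $x_p$ and $y_p$ contain the x and y coordinates of the projection of a voxel’s vertex onto the image plane.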
From the 8 points obtained from the above algorithm, the maximum and minimum values of the x and y coordinates were calculated. From these, a bounding box, which is a rectangle that contains all 8 projected points, was constructed.
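The bounding-box construction above amounts to a per-voxel minimum and maximum over the 8 projected vertices. A minimal sketch follows, assuming the projections are held in an array of shape (V, 8, 2); rounding outward to whole pixels is an assumption of this example, as is the function name.

```python
import numpy as np

def bounding_boxes(projected):
    """Return a (V, 4) integer array of (x_min, y_min, x_max, y_max) per voxel.

    projected: (V, 8, 2) array with the pixel coordinates of each voxel's vertices.
    """
    projected = np.asarray(projected, dtype=float)
    mins = np.floor(projected.min(axis=1)).astype(int)   # round down to cover the box
    maxs = np.ceil(projected.max(axis=1)).astype(int)    # round up to cover the box
    return np.concatenate([mins, maxs], axis=1)
```

Because the boxes depend only on the calibration and the voxel grid, they can be computed once and stored in the look-up table.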
Figure 4.4 below shows the bounding box projection of each voxel onto the camera image plane, calculated using the above algorithm, along with the silhouette obtained from that particular camera.

Figure 4.4: Projection of the bounding boxes of each voxel on the image plane of a camera and the corresponding silhouette obtained from that camera.
Each voxel is classified as either inside or outside the silhouette according to the algorithm given below. The outside voxels are discarded and voxel carving is performed [1], as shown below for three cameras in Figures 4.5 through 4.7.
The silhouette bounding-box intersection pseudo-code used for classifying voxels is defined as follows: Let (x_min, y_min), (x_max, y_min), (x_min, y_max) and (x_max, y_max) be the four points of a bounding-box.
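One way such a bounding-box intersection test could look is sketched below. The classification rule used here is an assumption made for illustration, not a transcription of the original pseudo-code: a voxel is kept for a camera if its bounding box contains at least `min_pixels` silhouette pixels, and it is inside the model only if every camera keeps it. A summed-area table is used so that each box is counted in constant time; all function names are hypothetical.

```python
import numpy as np

def integral_image(silhouette):
    """Summed-area table with an extra zero row and column at the top-left."""
    s = np.zeros((silhouette.shape[0] + 1, silhouette.shape[1] + 1), dtype=np.int64)
    s[1:, 1:] = np.cumsum(np.cumsum(silhouette.astype(np.int64), axis=0), axis=1)
    return s

def boxes_hit(silhouette, boxes, min_pixels=1):
    """True for each (x_min, y_min, x_max, y_max) box holding enough foreground pixels."""
    h, w = silhouette.shape
    s = integral_image(silhouette)
    # Clip boxes to the image; x_max and y_max are inclusive, hence the +1.
    x0 = np.clip(boxes[:, 0], 0, w)
    y0 = np.clip(boxes[:, 1], 0, h)
    x1 = np.clip(boxes[:, 2] + 1, 0, w)
    y1 = np.clip(boxes[:, 3] + 1, 0, h)
    count = s[y1, x1] - s[y0, x1] - s[y1, x0] + s[y0, x0]
    return count >= min_pixels

def carve(silhouettes, boxes_per_camera, min_pixels=1):
    """Return a boolean mask of voxels consistent with every camera's silhouette."""
    inside = np.ones(len(boxes_per_camera[0]), dtype=bool)
    for sil, boxes in zip(silhouettes, boxes_per_camera):
        inside &= boxes_hit(sil, boxes, min_pixels)
    return inside
```

Raising `min_pixels` makes the carving stricter, which removes more voxels along the silhouette boundary at the risk of thinning narrow body parts.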

(a) Camera one.

(b) Camera two.

(c) Camera three.
Figure 4.5: Image silhouettes from three cameras placed at different angles.

Figure 4.6: The volume of interest carved by performing silhouette intersection tests using the silhouettes in Figures 4.5(a) and 4.5(b).

Figure 4.7: The voxel model further refined by performing the silhouette intersection test using the silhouette in Figure 4.5(c).
Chapter 5
5.1 Introduction to Ellipsoid Template Fitting
After obtaining the 3D voxel data from the procedures described previously, ellipsoid models are used to fit the 3D voxel data in real time.
Mikic et al. [1] use a sequential template growing procedure to locate body parts. A spherical crust template whose inner and outer diameters correspond to the smallest and largest expected head dimensions is created. The spherical crust is then moved in the 3D space until the number of voxels inside the crust is maximized. We found this technique of searching for the location of the head unstable and not feasible for our application in real time.
Kanade and Cheung [8] use 6 ellipsoid shells (the head, body, two arms and two legs) to fit onto the 3D voxel model. To make their fitting process fast for a real-time system, they use a two-step Expectation-Maximization (EM) like approach. A proximity criterion is used to segment the voxel data. Based on this, voxels are assigned to shells.
We fit the templates by locating the head first from the 3D voxel data. This initialization would, however, fail if the subject has his arms raised above his head during initialization. Once the head is fitted with a template, the torso and the limbs are located from the voxel data and are fitted with cylinders and ellipsoids.
5.2 Procedure for Ellipsoid Template Fitting
The head is fitted with a sphere whose radius is determined from the anthropometric data corresponding to the subject (example: a short male’s head size is 140x180mm). The head is located by measuring the height of the 3D voxel model. This part of the ellipsoid initialization would fail if the subject has a posture in which the head is below another body part (example: if the subject has his hands raised above his head). Refer to Appendix C for the anthropometric data used in template fitting.
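A rough sketch of this height-based head localization is given below. It is an illustration under stated assumptions rather than the procedure used in the project: the z axis is taken as vertical, the voxel centres are given in mm, the head height of 180 mm is borrowed from the example above, and the sphere is centred horizontally on the centroid of the voxels in the topmost slab. The function name `locate_head` is hypothetical.

```python
import numpy as np

def locate_head(voxel_centres, head_height=180.0):
    """Return (centre, radius) of a head sphere placed at the top of the voxel model.

    voxel_centres: (N, 3) array of occupied voxel centres, z pointing up, in mm.
    """
    v = np.asarray(voxel_centres, dtype=float)
    top = v[:, 2].max()                           # height of the voxel model
    slab = v[v[:, 2] >= top - head_height]        # voxels within one head height of the top
    centre = np.array([slab[:, 0].mean(), slab[:, 1].mean(), top - head_height / 2.0])
    return centre, head_height / 2.0
```

As noted in the text, this kind of initialization places the sphere on whichever body part is highest, so it gives a wrong result when the hands are raised above the head.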