Vision Processing for Realtime 3D Data Acquisition Based on Coded Structured Light 
S.Y. Chen, Member, IEEE, Y.F. Li, Senior Member, IEEE, and Jianwei Zhang, Member, IEEE 
Abstract—Structured light vision systems have been successfully used for accurate measurement of 3D surfaces in computer vision. However, their applications have so far been mainly limited to scanning stationary objects, since tens of images have to be captured to recover one 3D scene. This paper presents an idea for real-time acquisition of 3D surface data by a specially coded vision system. To achieve 3D measurement of a dynamic scene, data acquisition must be performed with only a single image. A principle of uniquely color-encoded pattern projection is proposed to design a color matrix that improves the reconstruction efficiency. The matrix is produced from a special code sequence and a number of state transitions. A color projector is controlled by a computer to generate the desired color patterns in the scene. The unique indexing of the light codes is crucial here: each light grid must be uniquely identifiable from its local neighborhood, so that 3D reconstruction can be performed with only local analysis of a single image. A scheme is presented to describe such a vision processing method for fast 3D data acquisition. Practical experimental results are provided to analyze the efficiency of the proposed methods.
Index Terms—real-time measurement, 3D data acquisition, unique code, color-encoded, structured light, vision sensor, computer vision, robotics, perception
EDICS Categories: HDWDSGN, STESTIP, HDWEMBD, STESTCD, SDPSCAN, OTHOPTI
Computer vision has become a very important means of obtaining three-dimensional (3D) models of objects. A number of 3D sensing methods have been explored by researchers over the past 30 years [1]–[7]. Structured light has progressed from single light-spot projection to complex coded patterns, and consequently 3D scanning has sped up from several hours per image to dozens of images per second [4], [8], [9].
The first stage of feasible structured light systems came in the early 1980s, when binary coding or Gray coding methods were employed. Figure 1 illustrates a typical set of light patterns by Inokuchi et al. [10]. This kind of pattern can achieve high accuracy in the measurements [11]–[16]. This is because the pattern resolution increases exponentially over the coarse-to-fine light projections and the stripe gap tends to zero, while the stripe locations remain easily distinguishable since a small set of primitives is used; the position of a pixel can therefore be encoded precisely. The method also has the advantage of easy implementation and is thus still the most widely used in structured light systems. Its main drawback is that it cannot be applied to moving surfaces, since multiple patterns must be projected. To obtain better resolution, a technique combining Gray code and phase shifting is often used [11]. Its drawback is that an even larger number of projection patterns (e.g. more than 20 images) is required.
With the aim of projecting only one light pattern before capturing a scene image, color stripes were invented to replace multiple black/white projections. This idea led to the development of "one-shot" 3D image acquisition, which can potentially be applied to measuring moving objects. Many such systems have been attempted in practice [7], [9], [17]–[23], in which a phase-shifting method can also be employed [22]. Among them, De Bruijn sequences are the most widely used technique [20], [21]. Although these promise real-time applications, the limitations of this approach are still considerable. One is the trade-off between reliability and accuracy: since adjacent color stripes must have enough spectral difference, one has to use a limited number of color stripes or apply them periodically, which produces either stripe ambiguity or coarse resolution. Another limitation is the flexibility of the system setup. Since it is a one-dimensional (1D) spatial coding method, the baseline between the camera and the projector should be nearly orthogonal to the light planes. This is suitable for a fixed system, but not for applications where dynamic reconfiguration and recalibration with multiple degrees of freedom are required.
Two-dimensional (2D) spatial coding has the advantages of windowed image processing and flexible system configuration. The works by Griffin et al. [24] and Salvi et al. [2] contribute to this technique. In such a coded structured light system, the patterns are specially designed so that codewords are assigned to a set of pixels. As every coded pixel has its own codeword, there is a direct mapping from the codewords to the corresponding coordinates of the pixel in the pattern. To this end, a mathematical study is carried out in [24] (Fig. 2) to determine the largest size allowed for a coded matrix of a dot pattern. It is based on several assumptions. First, a dot position is coded with information emitted by itself and by its four neighbors. Second, no two different dot positions may have the same code. Third, the information is determined using a fixed basis, which determines the symbols used to code the matrix. Fourth, the biggest matrix is desired, i.e. the matrix which gives the best resolution. The codewords are simply numbers, which are mapped into the pattern by using gray levels, color [2], or geometrical representations [24]. However, these special geometrical shapes or color lines have to be placed separately for them to be detectable in an image (Fig. 3); otherwise, the uncertainty of a real scene would make detection very difficult due to noise, distortion, and discontinuity. In fact, adjacent shapes or lines should differ and be placed in direct contact with each other, as formulated later in this paper.
In order to gain flexibility during the acquisition process, adaptive techniques can be used. Researchers have investigated active stereo systems that can adapt the color of the projected pattern to tackle the problem of light reflections generated by the scanned objects [1], [3], [25]. An interesting work is that of Koninckx et al. [26]. They propose a real-time scanner that can adapt itself to the scene, aiming to generate better patterns online by taking the properties of the scene and setup into account. The code lines are generated according to epipolar geometry (Fig. 4). A weighted combination of different coding cues yields a robust way to solve the correspondence problem. The system, however, is somewhat complex, as it requires predicting, labeling, and tracking scene features. It also assumes temporal continuity between subsequent frames. Regarding codification, a single code line poses too much of a risk of going undetected in large parts of the image; more vertical code lines generate a higher code density, but the decoding becomes worse conditioned, so a tracking algorithm has to be involved.
In this paper, we propose a new idea for designing a grid solid pattern for 3D reconstruction with fast matching strategies. Based on this idea, the system combines the advantages of real-time, low-cost, reliable, and accurate 3D data acquisition. The steps of vision processing, including color codification, pattern rendering, word seeding and flood searching, mesh amendment, and 3D computation, are investigated in the paper. Efficiency analysis and experimental implementation are also reported in the following sections.
The structured light system in this work consists of a CCD camera and a digital projector (Fig. 5). It is similar to a traditional stereo vision system, but with the second camera replaced by the light source, which projects a known pattern of light onto the scene. A single camera captures the illuminated scene. The required 3D information can be obtained by analyzing the deformation of the imaged pattern with respect to the projected one. Here the correspondences between the projected pattern and the imaged one can be solved directly by codifying the projected pattern, so that each projected light point carries some information. When the point is imaged on the image plane, this information can be used to determine its coordinates on the projected pattern.
Unlike the stripe light vision system, where the coordinates on the projector can be determined by analyzing the bit-plane stack obtained from multiple images, the coordinates in the color projection vision system have to be determined from a single image. In an effort to avoid the drawbacks of traditional coding techniques, we attempt to improve the projected light pattern, since a practical system has to consider more application requirements.
A method is developed for designing grid patterns that meet the practical requirement of unique indexing, to deal with uncertain occlusions and discontinuities in the scene. Let P be a set of color primitives, P = {1, 2, …, p} (where the numbers 1, 2, …, p represent different colors, e.g. 1 = white, 2 = red, 3 = green, 4 = blue, etc.). These color primitives are assigned to an m×n matrix M to form the encoded pattern to be projected onto the scene. We define a word from M by the color value at location (i, j) in M together with the color values of its 4-adjacent neighbors. If x_{ij} is the assigned color at row i and column j in M, then the word defining this location, w_{ij}, is the sequence {x_{ij}, x_{i,j−1}, x_{i−1,j}, x_{i,j+1}, x_{i+1,j}}, where i ∈ {1, 2, …, m} and j ∈ {1, 2, …, n}, i.e. w_{ij} is the substring
w_{ij} = {x_{ij}, x_{i,j−1}, x_{i−1,j}, x_{i,j+1}, x_{i+1,j}}. (1)
If a lookup table is maintained for all of the word values in M, then each word defines a location in M. An m×n matrix M thus has (m−1)×(n−1) words, which make up a set W. We need to assign the color primitives of P to the matrix M so that no two words in the matrix are identical, i.e.
Condition 1: w_{ij} ≠ w_{kl}, for all (i, j) ≠ (k, l). (2)
Furthermore, every element has a color different from its adjacent neighbors in the word, i.e.
Condition 2: x_{ij} ≠ x_{i,j−1}, x_{ij} ≠ x_{i−1,j}, x_{ij} ≠ x_{i,j+1}, x_{ij} ≠ x_{i+1,j}. (3)
In this way, each defined location is uniquely indexed, and correspondence therefore poses no problem. That is, if the pattern is projected onto a scene and the word value of an imaged point (u, v) is determined (by determining the colors of that imaged point and its 4-adjacent neighbors), then the corresponding position (i, j) in M of this imaged point is uniquely defined. Of course, in addition to having each word of M be unique, we also wish to optimize the color code assignments so that the matrix M is as large as possible.
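As an illustration of this indexing, the word lookup can be sketched in a few lines (our own minimal sketch, not the authors' implementation; the pattern matrix is assumed to be available as a 2D list of color indices):

```python
def build_word_table(M):
    """Map each 5-element word (center plus its 4-adjacent neighbors, in
    the order of Eq. (1): center, left, above, right, below) to its
    location (i, j) in the pattern matrix M.  Full words exist only at
    interior positions, where all four neighbors are defined."""
    rows, cols = len(M), len(M[0])
    table = {}
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            word = (M[i][j], M[i][j - 1], M[i - 1][j], M[i][j + 1], M[i + 1][j])
            assert word not in table, "Condition 1 violated: duplicate word"
            table[word] = (i, j)
    return table

# Tiny 3x3 example: the single interior word maps back to position (1, 1).
M = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
table = build_word_table(M)
assert table[(5, 4, 2, 6, 8)] == (1, 1)
```

At decoding time the same tuple is formed from the five identified image colors and looked up in this table, giving the pattern coordinates in constant time.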
A further problem should be considered in the assignment. Because there are only three primary colors, the color pattern has to be divided into several distinguishable color codes. To reduce the complexity of identifying the color code of a grid point among its neighbors, every two color codes should have enough distance. This requires a trade-off between the number of color codes and the average code distance. White should be used most, for segmentation of neighboring grid points, so that the pattern produces maximum image irradiance values.
According to the perspective transformation principle, the image coordinates and the assigned code words of a spatial point correspond to its world coordinates. We can thus establish a mapping between an image point in the image coordinate system and the spatial point in the world coordinate system: X, Y, and Z are the coordinates of a world point, corresponding to the image coordinates (u, v) and (x, y). Together with the system calibration parameters, the 3D information of the surface points can then be easily computed. Effectively, this guarantees that the measurement system has a limited computational cost, since it only needs to analyze a small part of the scene and identify the coordinates by local image processing. The acquisition efficiency is therefore greatly improved.
First, with a given color set P, we try to construct the longest possible horizontal code sequence,
S_{h} = [c_{1} c_{2} … c_{m}], (4)
where m is the sequence length. For any adjacent color pair, it satisfies
c_{i} ≠ c_{i+1}, i = 1, …, m−1, (5)
and any triplet of adjacent colors, T_{3i} = [c_{i} c_{i+1} c_{i+2}], is unique in the sequence,
T_{3i} ≠ T_{3j}, for all i ≠ j. (6)
The maximal length of the horizontal sequence S_{h} is:
Length(S_{h}) = p(p−1)(p−1) + 2. (7)
This is obvious, since for p colors the maximal number of independent triplets is p(p−1)(p−1). Supposing all of them can be linked together, the chain length is p(p−1)(p−1) + 2. Practical experience also shows that this chain length is always attainable.
Since an analytical solution for deriving such a horizontal sequence is complex, it can instead be generated by a random-search algorithm. In this work, we tested all color numbers below 32, and every color set generated a chain of maximum length within seconds or minutes. The execution time for 7 colors is 0.2 seconds to generate a chain of 254 digits; for 15 colors, 25 seconds for a 2942-digit chain; for 32 colors, about 5 minutes for a 30754-digit chain (no practical vision system actually requires so many codes). The search time increases exponentially with the number of colors. The sequence is, however, generated offline only once and does not affect the real-time performance, so little attention was paid to improving the search algorithm. Furthermore, a 1024-digit horizontal chain is enough for common grid-color coding, since the projector resolution is limited and we usually need a grid size of at least 5×5 pixels.
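A random-search generator of this kind can be sketched as follows (a simplified illustration under our own design choice of a randomized depth-first search with backtracking; it is not the authors' code):

```python
import random

def horizontal_sequence(p, seed=0):
    """Randomized search for a sequence over colors 1..p in which
    adjacent colors differ (Eq. (5)) and every triplet of adjacent
    colors is unique (Eq. (6)).  Target length is p(p-1)^2 + 2,
    the maximum given by Eq. (7)."""
    target = p * (p - 1) ** 2 + 2
    rng = random.Random(seed)

    def extend(seq, used_triplets):
        if len(seq) == target:
            return seq
        candidates = [c for c in range(1, p + 1) if c != seq[-1]]
        rng.shuffle(candidates)          # random order gives the "random search"
        for c in candidates:
            t = (seq[-2], seq[-1], c)
            if t not in used_triplets:   # keep every triplet unique
                used_triplets.add(t)
                result = extend(seq + [c], used_triplets)
                if result:
                    return result
                used_triplets.discard(t)  # backtrack on a dead end
        return None

    for c1 in range(1, p + 1):           # try every valid starting pair
        for c2 in range(1, p + 1):
            if c1 != c2:
                result = extend([c1, c2], set())
                if result:
                    return result
    return None

S_h = horizontal_sequence(4)
assert len(S_h) == 4 * 3 * 3 + 2   # 38 digits for p = 4
```

For the small color counts used in practice the backtracking search terminates quickly, and since the sequence is computed offline once, its cost is irrelevant to the online speed.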
Second, with a given color set P, we construct a color state transition sequence (STS), which will be used to derive the color sequence of each new row from the previous one,
S_{s} = [t_{1} t_{2} … t_{n}], (8)
where n is the sequence length. An adjacent pair in S_{s} need not satisfy condition (5), but must be unique in the sequence,
[t_{i} t_{i+1}] ≠ [t_{j} t_{j+1}], for all i ≠ j. (9)
With unique adjacent state pairs, one can obtain an STS of the maximal length Length(S_{s}) = s^{2} + 1 in the following way:
S_{s} = [1 1 2 1 3 1 4 … 1 s, 2 2 3 2 4 … 2 s, …, (s−1) s, s 1]. (10)
The sequence S_{s} in (10) satisfies (9): all its adjacent pairs are unique. In fact, the first part of the sequence, [1 1 2 1 3 1 4 … 1 s], contains all pairs involving the symbol 1 except [s 1]; it has 2(s−1)+1 digits, and each pair is unique. The second part contains all remaining unique pairs involving the symbol 2, adding 2(s−2)+1 extra digits. Continuing in this way and appending a final 1 at the end of the sequence, every pair in the sequence is unique.
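The construction in (10) can be written out programmatically; the sketch below (our own rendering of the rule) builds the STS block by block and appends the final wrap-around symbol:

```python
def state_transition_sequence(s):
    """Build the STS of Eq. (10): every ordered pair of adjacent states
    from {1..s} appears exactly once, giving length s^2 + 1."""
    seq = []
    for k in range(1, s + 1):
        seq.append(k)                  # opens block k (pairs with previous tail)
        for j in range(k + 1, s + 1):
            seq.extend([k, j])         # covers the remaining pairs involving k
    seq.append(1)                      # final wrap-around pair (s, 1)
    return seq

# For s = 3 this reproduces the pattern of Eq. (10):
S_s = state_transition_sequence(3)
assert S_s == [1, 1, 2, 1, 3, 2, 2, 3, 3, 1]
```

Checking the adjacent pairs of this s = 3 sequence confirms that all s^2 = 9 ordered pairs occur exactly once.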
For p colors, the number of available state changes is one less, i.e. s = p−1. Finally, the matrix for color projection can be generated from the longest STS and a maximal horizontal sequence (S_{s} and S_{h}). This produces a matrix of size (p−1)^{2}+2 by p(p−1)^{2}+2, which is the maximum possible size for which each codeword is unique in the matrix. The first row of the matrix is defined by S_{h}. The second row is created by adding the first element of S_{s} to each element of S_{h} modulo p, where the modulo operation is taken over the set {1, 2, …, p} and does not include the 0 element, unlike the traditional modulo operation. A third row is then created by adding the second element of S_{s} to each element of the row above it, modulo p, and so on. In this way, for a 4-color set the construction yields an 11×38 matrix. If the construction is written M = S_{h} × S_{s}, it can be proved that, according to definition (1), each word in the matrix M is uniquely located.
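The modular row construction reads, in sketch form (again our own illustration; `add_mod` implements the modulo over {1, …, p} without a 0 element):

```python
def build_matrix(S_h, S_s, p):
    """Generate the coded matrix: the first row is S_h; each subsequent
    row adds the next element of S_s to the row above, with the sum
    wrapped back onto the symbol set {1, ..., p} (p + 1 wraps to 1,
    never to 0)."""
    def add_mod(c, t):
        return (c + t - 1) % p + 1     # modulo over {1..p}, no 0 element
    rows = [list(S_h)]
    for t in S_s:
        rows.append([add_mod(c, t) for c in rows[-1]])
    return rows

# Toy example with p = 4: adding state 1 to color 4 wraps back to 1.
M = build_matrix([1, 2, 3, 4], [1, 2], p=4)
assert M == [[1, 2, 3, 4], [2, 3, 4, 1], [4, 1, 2, 3]]
```

With a full-length S_h and S_s the result has Length(S_s) + 1 = (p−1)^2 + 2 rows and p(p−1)^2 + 2 columns, matching the sizes quoted in the text.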
The above method generates a special code matrix satisfying conditions (2) and (3). The generation scheme is a finite automaton: after the first row is defined, each following row is generated by a transition from the row above. However, this scheme has the drawback that the matrix has a "long band" shape, which is sometimes not what we want. For example, a 4-color set generates an 11×38 matrix, a 5-color set an 18×82 matrix, a 6-color set a 27×152 matrix, etc. A practical digital projector usually has a 4:3 or 16:9 width-to-height image, so we wish to generate a matrix of such a shape, or a square. One solution is to generate a very large matrix and cut out a part of it to fit the practical projector, but this wastes many color codes. Since it is still difficult to generate such matrices mathematically by a closed formulation, this paper instead solves the problem by computer simulation: a program finds a maximum square matrix using a random-search algorithm. Examples of the generated grid patterns are given in the next subsection.
When a color set contains four different colors, a matrix can be formulated by the generation scheme M = S_{h} × S_{s}, which yields an 11×38 coded pattern. In practice, however, we usually need a square pattern to output to a digital projector, and then a matrix of only 11×11 could be utilized. Using the random-search algorithm, the maximum square matrix found for a 4-color set is of size 18×18.
To increase the matrix size so that the digital projector projects a light pattern of better resolution, we have to increase the number of colors. In our laboratory, a set of 7 colors is often used, which can generate a matrix for a 38×212 rectangle or an 82×82 square. Figure 6 illustrates the light pattern with a grid size of 20×20 pixels.
When such a coded matrix is projected by a digital color projector, each word in the pattern can be identified in an image together with its unique coordinates.
For 3D reconstruction, an important step is to find a unique word (the initial seed) in an unknown area of the acquired image. This can be implemented in the following way. First, randomly generate a position in the image or in the window of interest (WOI). The color at this position should not be black, which is judged simply by a logical function
NotBlack(I_{s}) = [ count{x ∈ I_{s} | gray(x) > T_{b}} > count(I_{s})/2 ], (11)
where r, g, and b are the three color components of a sampled point, T_{b} is a black threshold, I_{s} is a small neighborhood of the point concerned, gray(x) is the function converting color to gray value, and count(x) is the number of pixels in an area.
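A plausible realization of test (11) is sketched below; the luma weights, the threshold value, and the majority criterion are our assumptions, since the paper leaves them unspecified:

```python
def not_black(pixels, t_black=30):
    """Reject a candidate seed position whose neighborhood I_s is mostly
    black.  `pixels` is a list of (r, g, b) tuples sampled from the small
    area around the point; t_black is an assumed gray-level threshold."""
    def gray(rgb):
        r, g, b = rgb
        return 0.299 * r + 0.587 * g + 0.114 * b   # standard luma weights
    bright = sum(1 for px in pixels if gray(px) > t_black)
    return bright > len(pixels) // 2               # majority of the area is lit

# Lit neighborhood passes; near-black neighborhood is rejected.
assert not_black([(200, 0, 0), (0, 200, 0), (0, 0, 0)])
assert not not_black([(0, 0, 0), (5, 5, 5), (10, 10, 10)])
```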
Then, the square grid point at that position is found. A color similarity measurement (12) is used to search for a quadrangle within which the colors change only slightly compared with those outside.
D(c_{1}, c_{2}) = [r_{e}(r_{1} − r_{2})^{2} + g_{e}(g_{1} − g_{2})^{2} + b_{e}(b_{1} − b_{2})^{2}]^{1/2}, (12)
where c_{1} = (r_{1}, g_{1}, b_{1}) and c_{2} = (r_{2}, g_{2}, b_{2}) are two color values, and r_{e}, g_{e}, and b_{e} are weighting coefficients obtained from color calibration. The grid point is set to the centroid of this quadrangle.
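A sketch of the weighted color-distance test follows; the coefficient and threshold values are placeholders, since in practice they come from the color calibration:

```python
def color_distance(c1, c2, r_e=1.0, g_e=1.0, b_e=1.0):
    """Weighted distance between two RGB colors per Eq. (12).  In a speed-
    critical loop the square root can be dropped by comparing the squared
    distance against a squared threshold."""
    dr, dg, db = c1[0] - c2[0], c1[1] - c2[1], c1[2] - c2[2]
    return (r_e * dr * dr + g_e * dg * dg + b_e * db * db) ** 0.5

def same_grid(c1, c2, threshold=40.0):
    """Two pixels belong to the same grid square if their colors are
    close under the calibrated metric (assumed threshold)."""
    return color_distance(c1, c2) < threshold

# Two reddish pixels group together; red and green do not.
assert same_grid((250, 10, 10), (240, 20, 5))
assert not same_grid((250, 10, 10), (10, 250, 10))
```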
Based on this grid point, we try to locate its four adjacent neighbors. Simply setting the offset to the estimated grid size, the left, right, upper, and lower points are initialized and the four square areas are determined. If this grid point is found not to have a regular shape, or if any of the four neighbors fails to be located, another initial position is generated. Finally, the coordinates of the seed word are determined from the five grid points by matching their color codes in the pattern matrix.
With the known grid size and initial seed word, it is easy to find all adjacent words by a flood search algorithm [27], [28]. It first searches several grid points around the seed word, and then searches more grid points near the known area. Each point to be added to the known partial net has to satisfy three conditions: color, size, and regularity.
The color measured in the image often deviates from the ideal because of distortion in the vision system and scene reflection. Besides the color calibration strategies to be discussed later, we can handle this with a color likelihood function: the image pixel is compared with all seven ideal colors in the coding set, and if the expected code color is among the three largest likelihood values, the grid point is accepted into the net.
Since it is a "one-pass" method, i.e. the pixels are computed only once in a small local area, the image processing can be performed very fast, promising real-time applications. The speed will be evaluated in the next section on performance analysis and also in the experiment section.
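The flood search can be outlined as a breadth-first expansion from the seed word. In this sketch, `find_grid_point` is a hypothetical helper that locates and validates (by color, size, and regularity) a grid centroid near a predicted image position, returning None on failure:

```python
from collections import deque

def flood_search(seed_pos, seed_code, grid_size, find_grid_point):
    """Breadth-first expansion from the seed word: each accepted grid
    point predicts its four neighbors one grid step away, and a neighbor
    is added only if find_grid_point() validates it."""
    mesh = {seed_code: seed_pos}           # pattern coords -> image coords
    queue = deque([(seed_code, seed_pos)])
    offsets = [(0, -1), (-1, 0), (0, 1), (1, 0)]   # left, above, right, below
    while queue:
        (ci, cj), (u, v) = queue.popleft()
        for di, dj in offsets:
            code = (ci + di, cj + dj)
            if code in mesh:               # each grid point handled once
                continue
            found = find_grid_point((u + dj * grid_size, v + di * grid_size))
            if found is not None:
                mesh[code] = found
                queue.append((code, found))
    return mesh

# Toy run: a validator that accepts any predicted point inside a 3x3 patch.
def inside_patch(pos):
    u, v = pos
    return pos if (0 <= u <= 40 and 0 <= v <= 40) else None

mesh = flood_search((0, 0), (0, 0), 20, inside_patch)
assert len(mesh) == 9    # the full 3x3 grid of accepted points
```

Because each pixel neighborhood is visited a bounded number of times, this expansion preserves the one-pass character discussed above.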
The mesh amendment and grid interpolation procedures are developed in this paper to optimize the 3D results. The projection of the coded pattern should result in a regular mesh. However, due to the complexity of the scene and uncertainty in image processing, the constructed grid matrix may have some faults (namely holes and leaves). To correct them, this research develops a Mesh Amendment Procedure. In some cases, it can decide directly whether an "insertion" or a "deletion" is necessary to amend the net (as illustrated in Figs. 7 and 8). In a few other cases, the operation has to be determined from the actual image content with a likelihood measurement (Fig. 9).
After all possible code words have been identified in the image, it is easy to compute the 3D world coordinates of these points, since the coordinates on both the image (x_c, y_c) and the projector (x_p, y_p) are known. This yields a rough 3D map of the scene. To improve the resolution, we may run an interpolation algorithm on this map. Depending on the application requirements, the interpolation may act only on the segment between two adjacent grid points, or inside the square area formed by four regular grid points.
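The segment interpolation can be sketched as a simple linear scheme (our own illustration; the actual interpolation density depends on the application):

```python
def interpolate_segment(p1, p2, k):
    """Linearly interpolate k evenly spaced points on the segment
    between two reconstructed 3D grid points p1 and p2, each given
    as an (x, y, z) tuple."""
    return [tuple(a + (b - a) * t / (k + 1) for a, b in zip(p1, p2))
            for t in range(1, k + 1)]

# Two intermediate points on the diagonal from the origin to (3, 3, 3).
pts = interpolate_segment((0.0, 0.0, 0.0), (3.0, 3.0, 3.0), 2)
assert pts == [(1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
```

Area interpolation over a square of four grid points can be composed from the same routine applied along both grid directions (bilinear interpolation).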
This section provides a theoretical analysis of the time complexity of obtaining a 3D image, so that better strategies can be selected for real-time implementation. There are many image operations in the above procedure for 3D surface construction. To simplify the analysis without much loss of precision, the elementary arithmetic operators of the machine processor are assumed to comprise only operations equivalent to an add or a multiply. Equivalent add operations include addition, subtraction, comparison, assignment, and logical operations. Equivalent multiply operations include multiplication, division, and square operations. Assume that an equivalent add costs a time complexity of C_{add} and an equivalent multiply costs C_{mul}. The main steps that affect the online speed of 3D image acquisition are analyzed below.
Assume that a grid point in the image has a rough size of u×v. To locate and identify the area of such a grid point, we need the following steps:
To locate a new grid point: since the rough position is generated by the flood search algorithm or from a seed, this step simply copies the coordinates in the image. The cost is equivalent to four additions:
t_{p1} = 4C_{add}.
To determine if the candidate position is admissible in the sense of color pattern satisfaction, the color likelihood is measured. The cost is equivalent to
t_{p2} = 2C_{mul} + 3(C_{mul} + 4C_{add}).
Suppose that every pixel inside the square area of a grid point needs to be evaluated. The computational complexity is dominated by (12):
t_{p3} = 2uv [4C_{mul} + 5C_{add}]
where the square root function was eliminated since practically it is not necessary to perform it.
In fact, the optimization can be carried out to evaluate only some pixels near the four boundaries for a practical system. That will greatly reduce the computation complexity.
The complexity for determining the centroid of a square area can be estimated as:
t_{p4} = 2uv [2C_{mul} + 2C_{add}] + 2C_{mul}
For reasons of speed, this centroid estimation can also be performed at a much lower cost, i.e. determined simply from the four boundaries. Then it is nearly
t^{o}_{p4} = 6C_{add}
In total, the computation cost without optimization is about
C_{p} = t_{p1} + t_{p2} + t_{p3} + t_{p4}
= 12uvC_{mul} + 14uvC_{add} + 7C_{mul} + 12C_{add}. (13)
It is obvious that the computation complexity is dominated by the first part, i.e. O(C_{p}) = O(12uvC_{mul}). Therefore, the time cost for identification of a grid point in the image can be estimated as
T_{P} = 12uvT_{mul}. (14)
where T_{mul} is the average time for an equivalent multiply operation of the machine processor.
For an image of size m?/span>n, the time cost of flood search to identify all possible grid points is
C_{S} = (mn/uv)(C_{p} + C_{F}), (15)
where C_{F} is the time complexity for constructing the searching mesh and it can be estimated as
C_{F} = 2uvC_{add}. (16)
Here we assumed that each point is visited two times during the flood search. In fact, many algorithms have been proposed that visit each point only once [27], [28]. This paper does not adopt those optimal strategies, but only roughly estimates the feasibility of real-time application, leaving further optimization to the engineering implementation.
Combining (13) with (16), we get
C_{S} = 12mnC_{mul} + 16mnC_{add} + (mn/uv)(7C_{mul} + 12C_{add}). (17)
The time cost for flood search over the whole image can be estimated as
T_{5} = 12mnT_{mul}. (18)
Based on the above analysis, we can further estimate the time cost for mesh amendment to be
T_{6} = 2c_{match}(mn/uv)T_{add} = 98mnT_{add}/uv, (19)
where c_{match} is a coefficient reflecting the matching complexity of pattern comparison between the constructed image mesh and the amendment pattern list (here c_{match} = 49 for the cases in Figs. 7–9), and T_{add} is the average time for an equivalent add operation of the processor.
The time cost for the computation of 3D coordinates, T_{7}, mainly involves solving a linear system formulated similarly to stereo vision:
X = (A^{T}A)^{1}A^{T}B (20)
where X = [x, y, z] contains the world coordinates of a specific point in the scene, and the matrices A (of size 4×3) and B (of size 4×1) are constructed directly from the calibration matrices and the coordinates (x_c, y_c) and (x_p, y_p). Then T_{7} can be estimated as
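Eq. (20) is the standard linear least-squares solution of an overdetermined system; with NumPy it can be sketched as follows (the entries of A and B below are dummy values for illustration only, since the real ones depend on the calibration matrices):

```python
import numpy as np

def triangulate(A, B):
    """Solve X = (A^T A)^{-1} A^T B, the least-squares solution of the
    overdetermined 4x3 system built from the camera and projector rays."""
    X, *_ = np.linalg.lstsq(A, B, rcond=None)   # numerically preferable to
    return X                                    # forming (A^T A)^{-1} explicitly

# Dummy well-conditioned system whose exact solution is (1, 2, 3).
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
B = np.array([1.0, 2.0, 3.0, 6.0])
X = triangulate(A, B)
assert np.allclose(X, [1.0, 2.0, 3.0])
```

Using `lstsq` instead of the explicit normal-equation inverse gives the same X for well-conditioned systems while being more stable when A is nearly rank-deficient.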
T_{7} = 16(mn/uv)T_{mul}. (21)
For computing the 3D mesh, we can choose either to work from the original image data with formula (20), or simply to interpolate the 3D coordinates of the grid points known from step T_{6}. The former gives better accuracy but lower efficiency:
T_{8a} = (u + v)T_{7}, (22)

T_{8b} = T_{8a}/8. (23)
That is, the latter runs about eight times as fast as direct computation from the original image data.
Now consider the construction of the 3D surface over the whole image. As with the 3D mesh, it can be processed in two ways:
T_{9a} = uvT_{7}, (24)

T_{9b} = T_{9a}/8. (25)
The reconstruction can likewise be about eight times faster for a complete 3D surface when only interpolating from the 3D grid points.
Table I summarizes the relative time costs of these main functions for acquisition of a 3D image from the vision system. Practical verification is carried out in experimental studies and will be presented in the next section.
TABLE I. Relative Time Complexity of Some Main Processing Functions for Image Analysis and 3D Reconstruction

      Function                                                 Theoretical time
T4    Locating a seed word                                     5T_{P} = 60uvT_{mul}
T5    Flood search for all grids                               12mnT_{mul}
T6    Mesh amendment                                           98mnT_{add}/uv
T7    3D computation of mesh grid points                       16mnT_{mul}/uv
T8    3D wire shape                                            T_{8a} = (u + v)T_{7};  T_{8b} = T_{8a}/8
T9    Interpolation or computation for the full 3D surface     T_{9a} = uvT_{7};  T_{9b} = T_{9a}/8
To implement the idea in a practical vision system and analyze its performance, many other factors and conditions have to be considered. In fact, this method has to be integrated with other techniques and algorithms to automate the modeling process, such as system calibration, image processing, 3D representation, and visualization. Thanks to the considerable foundational work on computer vision in our earlier projects [29], the experimental system was convenient to set up for this purpose.
The vision system in our laboratory includes a structured light sensor composed of a projector and a camera. The projector is a palm-sized digital projector with a 2000:1 contrast ratio and a 1.2–11.8 m working distance. It is connected to a computer and controlled to generate the color-encoded patterns for 3D reconstruction. The CCD camera (PULNIX TMC-9700) has a 1-inch sensor and a 16 mm lens. A 32-bit PCI frame grabber for machine vision (PC2-Vision, by Coreco Imaging Co. Ltd.) is used to capture live images at 640×480 resolution. The main computer is a common PC with a 2.1 GHz CPU and 512 MB RAM, used for image processing and 3D computation.
In the experiments, a 44×212 encoded pattern generated from a 7-color set (Fig. 6) is used to illuminate the scene. The grid size is 25×25 pixels. Figure 10 illustrates an image captured by the camera, in which there are about 30 × 37 = 1110 grid points. A seed word is identified at random in the image, and grid points are then detected by a flood search algorithm. Repeating this until no large area can yield more points, the whole net is merged from the results. The amended mesh, after detecting isolated holes and abnormal leaves, is also illustrated in Fig. 10. Finally, the 3D mesh is reconstructed by 3D computation; a typical example is illustrated in Fig. 11.
To observe the execution performance and evaluate the efficiency of each function in the vision system, we used a performance analyzer (a program development tool) to check the time spent in the important procedures. For a common PC nowadays with a 2 GHz CPU, a typical floating-point multiply takes several to tens of nanoseconds (ns); with an integrated floating-point coprocessor, it can be done in below 1 ns. If we somewhat conservatively assume that a typical multiplication takes 10 ns and a typical addition 1 ns, the theoretical times and the practical times measured in our lab are as listed in Table II. It can be seen that 3D reconstruction at low or mid-level resolution (computing the 3D coordinates only at grid points or along grid edges) takes 70 to 100 ms (T10 and T11). That speed is adequate for most applications. Although about 38 ms is theoretically achievable, our current program contains some breakpoints for debugging that reduce the efficiency.
TABLE II. Analysis of Execution Performance

      Function                            Theoretical time                                             Actual time   Remark
T0    Pattern coding                      seconds to minutes                                           N/A           Done offline
T1    Light projection                    17 ms (first frame), 0 ms (new frames)                       0             Pattern is never changed
T2    Image acquisition                   Dependent on imaging device hardware                         25 ms         For image freezing
T3    Memory allocation, movement         Dependent on the computer                                    0.3 ms        Faster with high-speed cache
T4    Locating a seed word                60uvT_{mul} ≈ 135 us                                         0.1 ms
T5    Flood search for all grid points    12mnT_{mul} ≈ 36.864 ms                                      44 ms
T6    Mesh amendment                      98mnT_{add}/uv ≈ 133.8 us                                    0.1 ms
T7    3D computation of grid points       16mnT_{mul}/uv ≈ 218.45 us                                   3 ms
T8    3D wire shape                       T_{8a} = (u + v)T_{7} ≈ 6.55 ms; T_{8b} = T_{8a}/8 ≈ 819.19 us   32 ms     Measured for T_{8a}
T9    Full 3D surface                     T_{9a} = uvT_{7} ≈ 49.15 ms; T_{9b} = T_{9a}/8 ≈ 6.14 ms     85 ms         Measured for T_{9a}
T10   3D imaging with low resolution      37.35 ms                                                     72.5 ms       T1+T2+T3+T4+T5+T6+T7
T11   3D imaging with mid resolution      43.90 ms / 38.17 ms                                          104.5 ms      T8+T10
T12   3D imaging with high resolution     86.50 ms / 43.49 ms                                          157.5 ms      T9+T10
1 ns = 10^{−3} us = 10^{−6} ms = 10^{−9} second.
Theoretical time is calculated with u = v = 15, m = 640, n = 480, and T_{mul} = 10 ns.
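The theoretical entries in Table II follow directly from the formulas; a quick arithmetic check under the stated parameters (u = v = 15, m = 640, n = 480, T_mul = 10 ns, and the assumed T_add = 1 ns) confirms them:

```python
u = v = 15
m, n = 640, 480
T_mul = 10e-9        # 10 ns per equivalent multiply
T_add = 1e-9         # 1 ns per equivalent add (assumed in the text)

T4 = 60 * u * v * T_mul            # locating a seed word, Eq. (14) x 5 words
T5 = 12 * m * n * T_mul            # flood search over the whole image, Eq. (18)
T6 = 98 * m * n * T_add / (u * v)  # mesh amendment, Eq. (19)
T7 = 16 * m * n * T_mul / (u * v)  # 3D computation of grid points, Eq. (21)

assert abs(T4 - 135e-6) < 1e-9         # 135 us
assert abs(T5 - 36.864e-3) < 1e-9      # 36.864 ms
assert abs(T6 - 133.8e-6) < 0.1e-6     # ~133.8 us
assert abs(T7 - 218.45e-6) < 0.1e-6    # ~218.45 us
```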
Furthermore, by applying the speed optimizations proposed in Section IV, we estimate that the processing efficiency can be further improved by 30% to 50%, and more opportunities to optimize for real-time use remain. The imaging time in Table II can also be hidden, since image freezing and image processing can run in parallel; the speed is then limited only by the larger of the two. With good engineering in both software and hardware for the system implementation, it should not be difficult to achieve a speed of about 30 frames per second (fps) for the 3D imaging system.
The proposed method is based on a specially coded pattern projection which allows the 3D vision processing to be performed locally. The coding scheme supports realtime processing in three ways: (a) only the region of interest in the image need be reconstructed; (b) image processing is performed locally within a one-word area; (c) the 3D surface can be reconstructed at high, middle, or low resolution according to the practical situation (e.g. a lower resolution when hardware or processor speed is limited).
The efficiency is analyzed by both theoretical computation complexity and experimental observation. Although practical experiments coincide well with the analytical estimations, the computing time depends on many factors, e.g. segmentation complexity, surface color appearance, block size, image distortion, chip speed, caches/buffers, memory access speed, operating system, thread load, etc.
For comparison with other good results reported, Koninckx et al. [26] reported frame rates varying between 10 and 25 fps depending on scene complexity; Zhang and Huang [22] imply a scanning speed between 26 and 240 fps depending on practical hardware and software conditions; and Tsalakanidou et al. [23] deliver 17 to 25 fps. Our method has no obvious advantage over theirs in processing speed, but the proposed scheme offers reliability (every word is independent), hardware reconfigurability (the relative pose between the camera and projector is not restricted and is even dynamically adjustable), and flexibility (selectable resolution and color sets).
Although the 3D computation is briefly discussed above, no metrological evaluation has been given in this paper yet. In fact, the accuracy and precision of the proposed system are similar to those of other typical structured light systems, since the triangulation is based on the same geometry. The best accuracy of systems using stripe light projection (without phase shifting) should be achievable in this system, but all factors that affect accuracy in common structured light systems will also introduce errors here. The experiments in this paper were carried out with a typical system setup. The accuracy of dimensional measurement depends on many factors (e.g. baseline length, image resolution, projector resolution, object distance, calibration accuracy, surface orientation), but is affected little by the vision processing method itself. For detailed estimates of measurement precision, previous work on structured light can be consulted [8] [15] [20] [31]-[34]. For example, Sansoni et al. [15] analyzed the systematic errors as functions of the baseline length d, object distance L, and imaging angle a.
From these contributions, we know that the metrological sensitivity depends mainly on the system structure. In this paper, the baseline is about 200 mm, the object distance about 350 mm, the projection angle about 20 degrees, and the image resolution about 640x480. Since we did not perform a very careful calibration before the system was used for 3D data acquisition, the observed relative error is about 2% on some specific line segments in the scene. The accuracy can be improved greatly with careful engineering, e.g. to about 0.01% as in [23].
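As a rough illustration only, the standard first-order triangulation error model (a textbook approximation, not derived in this paper) shows how baseline length and object distance trade off; the effective focal length in pixels below is an assumed value:

```python
# First-order triangulation model: depth error grows with the square
# of object distance Z and shrinks with a longer baseline b,
#   dZ ~= Z^2 / (f * b) * d_disparity,
# where f is the effective focal length in pixels.
def depth_error(Z_mm: float, baseline_mm: float, focal_px: float,
                disparity_err_px: float = 0.5) -> float:
    return (Z_mm ** 2) / (focal_px * baseline_mm) * disparity_err_px

# Numbers resembling the setup described above: Z ~ 350 mm,
# b ~ 200 mm, and an assumed effective focal length of ~800 px.
err = depth_error(350.0, 200.0, 800.0)  # depth error in mm
```

Doubling the baseline in this model halves the depth error, which is why baseline length dominates the metrological sensitivity.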
System calibration errors directly introduce measurement errors [8], so careful calibration should be performed for an engineering system. Legarda-Saenz et al. [31] and Vargas et al. [34] give detailed descriptions of accurate procedures for calibrating a structured light system. A methodology for error estimation can also be found in [33].
If the spatial resolution does not meet the requirement of measurement precision, a subpixel strategy may be employed for improvement [7][20]. When the image resolution increases, however, more processing time is required accordingly.
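One common sub-pixel refinement, shown here as an assumption rather than the exact strategy of [7][20], fits a parabola through three neighboring intensity samples and takes its vertex as the refined peak position:

```python
# Parabolic (three-point) sub-pixel peak interpolation: given the
# intensity at the integer-pixel maximum and its two neighbors,
# return the fractional offset of the true peak in (-0.5, 0.5).
def subpixel_peak(y_left: float, y_center: float, y_right: float) -> float:
    denom = y_left - 2.0 * y_center + y_right
    if denom == 0.0:
        return 0.0  # flat neighborhood: keep the integer position
    return 0.5 * (y_left - y_right) / denom

# Example: samples 10, 20, 16 -> peak slightly right of center.
offset = subpixel_peak(10.0, 20.0, 16.0)
```

The refined stripe position is then the integer pixel index plus this offset, improving localization without increasing image resolution.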
In the above derivation, the system requires the sensor to have a uniform spectral response; that is, the intensities of the red, green, and blue signals must be comparable. To achieve this, a calibration procedure should be performed. Wust et al. [18] use a flat plane of the same color as the surface to be measured to sample standard colors for rectifying these response curves.
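A minimal sketch of such a channel-balancing step, assuming a simple per-channel gain model in the spirit of the flat-plane calibration of Wust et al. [18] (the reference values below are hypothetical):

```python
# Equalize the camera's R, G, B responses with per-channel gains
# measured from a uniformly lit reference plane.
def channel_gains(ref_rgb):
    # Scale each channel so the reference surface appears neutral gray.
    target = sum(ref_rgb) / 3.0
    return tuple(target / c for c in ref_rgb)

def correct(pixel_rgb, gains):
    # Apply the gains, clamping to the 8-bit range.
    return tuple(min(255.0, p * g) for p, g in zip(pixel_rgb, gains))

gains = channel_gains((200.0, 180.0, 160.0))     # measured reference
balanced = correct((200.0, 180.0, 160.0), gains)  # reference -> gray
```

After this correction, equal projected color primitives yield comparable signal levels across the three channels, as the derivation assumes.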
A color space transformation from RGB to intensity, hue, and saturation is sometimes helpful for improving image segmentation and word identification. However, it introduces extra computational load and should be applied only when processing time is not critical.
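For illustration, the conversion itself is available in Python's standard colorsys module; hue is more stable than raw RGB under brightness changes, at the cost of extra arithmetic per pixel:

```python
import colorsys

# Convert an 8-bit RGB pixel to (hue in degrees, saturation, value).
def rgb_to_hsv255(r: int, g: int, b: int):
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s, v

hue, sat, val = rgb_to_hsv255(255, 0, 0)  # pure red -> hue 0 degrees
```

Classifying projected color primitives by hue alone then tolerates surface shading that would perturb the raw RGB values.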
For practical cost considerations, the hardware can also be implemented along the lines of the patent [9]: replacing the digital projector with a common light source and a transparent (glass or plastic) pattern plate greatly cuts the hardware cost (even below that of a common stereo vision setup). This, however, reduces the flexibility of the 3D imaging system, since the light pattern cannot then be changed dynamically according to scene conditions.
Parallel computation can also be considered if we need a highspeed 3D imaging system. Since the proposed method is based on local processing, this can be implemented without much difficulty.
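A minimal sketch of this idea, with a hypothetical count_points function standing in for the per-block decode-and-reconstruct step (since each word is identified from its local neighborhood only, blocks can be processed independently):

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder for the local per-block work: in the real system this
# would decode the words in one block of the grid and compute their
# 3D coordinates; here it just counts the detected grid points.
def count_points(block):
    return sum(1 for cell in block if cell)

# A toy 3x3 occupancy grid split into row blocks.
grid = [[True, False, True],
        [True, True, False],
        [False, True, True]]

# Each block is handed to a worker; results are combined at the end.
with ThreadPoolExecutor(max_workers=3) as pool:
    total = sum(pool.map(count_points, grid))
```

Because no block depends on another's result, the speedup scales with the number of workers until memory bandwidth dominates.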
In general, this paper does not adopt the best strategy for every issue; it focuses on establishing the feasibility of realtime application and leaves further optimization to engineering implementation.
Realtime, low-cost, reliable, and accurate 3D data acquisition is a long-standing goal of the vision community. While available technology still cannot deliver all these features together, this paper makes significant progress toward that goal. An idea was presented and implemented for generating a specially color-coded light pattern, which combines the advantages of fast 3D vision processing from a single image with the reliability and accuracy of structured light systems. With a given set of color primitives, the generated patterns are guaranteed to form a large matrix of the desired shape under the restriction that each word in the pattern matrix is unique. With such a light pattern, correspondence is solved within a single image, so the method can be used in dynamic environments for realtime applications. Furthermore, the method places no limit on the smoothness of object surfaces, since it analyzes only a small part of the scene and identifies coordinates by local image processing, which greatly improves 3D reconstruction efficiency. Theoretical analysis and experimental results show that acquiring a 3D surface at mid-level resolution takes about 100 ms, which is adequate for many practical applications. Software and hardware optimizations may further improve the speed to about 30 frames per second (fps), and a parallel processing scheme would increase the efficiency several times more.
References
[1] M. Ribo and M. Brandner, "State of the art on vision-based structured light systems for 3D measurements", IEEE Int. Workshop on Robotic Sensors: Robotic and Sensor Environments, Ottawa, Canada, pp. 2-7, Sep. 2005.
[2] J. Salvi, J. Pagès, J. Batlle, "Pattern Codification Strategies in Structured Light Systems", Pattern Recognition, 37(4), pp. 827-849, April 2004.
[3] D. Desjardins, P. Payeur, "Dense Stereo Range Sensing with Marching Pseudo-Random Patterns", 4th Canadian Conf. on Computer and Robot Vision, pp. 216-226, May 2007.
[4] F. Blais, "Review of 20 Years of Range Sensor Development", J. of Electronic Imaging, 13(1), pp. 231-240, 2004.
[5] S. Osawa, et al., "3D Shape Measurement by Self-referenced Pattern Projection Method", Measurement, Vol. 26, pp. 157-166, 1999.
[6] C.S. Chen, Y.P. Hung, C.C. Chiang, J.L. Wu, "Range Data Acquisition Using Color Structured Lighting and Stereo Vision", Image and Vision Computing, Vol. 15, pp. 445-456, 1997.
[7] L. Zhang, B. Curless, S. M. Seitz, "Rapid Shape Acquisition Using Color Structured Light and Multi-pass Dynamic Programming", IEEE Proc. on 3D Data Processing Visualization and Transmission, Padova, Italy, pp. 24-36, June 2002.
[8] Y. F. Li and S. Y. Chen, "Automatic Recalibration of an Active Structured Light Vision System", IEEE Trans. on Robotics and Automation, Vol. 19, No. 2, pp. 259-268, April 2003.
[9] T. Lu, J. Zhang, "Three dimensional imaging system", United States Patent 6252623, June 26, 2001.
[10] S. Inokuchi, K. Sato, and F. Matsuda, "Range-imaging system for 3D object recognition", 7th Int. Conf. Pattern Recognition, Montreal, Canada, pp. 806-808, 1984.
[11] J. Gühring, "Dense 3D surface acquisition by structured light using off-the-shelf components", Videometrics and Optical Methods for 3D Shape Measurement, Vol. 4309, pp. 200-231, 2001.
[12] W. Krattenthaler, K. J. Mayer, et al., "3D-surface measurement with coded light approach", 4th Int. Workshop for Digital Image Processing and Computer Graphics, Oldenburg, Germany, Vol. 12, pp. 103-114, 1993.
[13] S. Rusinkiewicz, O. Hall-Holt, M. Levoy, "Real-Time 3D Model Acquisition", ACM Trans. on Graphics, Vol. 21, No. 3, pp. 438-446, 2002.
[14] P. Vuylsteke and A. Oosterlinck, "Range Image Acquisition with a Single Binary-Encoded Light Pattern", IEEE Trans. on Pattern Analysis and Machine Intelligence, 12(2), pp. 148-164, 1990.
[15] G. Sansoni, M. Carocci, R. Rodella, "Three-Dimensional Vision Based on a Combination of Gray-Code and Phase-Shift Light Projection: Analysis and Compensation of the Systematic Errors", Applied Optics, Vol. 38, Issue 31, pp. 6565-6573, 1999.
[16] M. Young, E. Beeson, et al., "Viewpoint-Coded Structured Light", IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR), Minneapolis, Minnesota, June 2007.
[17] E. Schubert, H. Rath, and J. Klicker, "Fast 3D Object Recognition Using a Combination of Color-Coded Phase-Shift Principle and Colour-Coded Triangulation", SPIE, Vol. 2247, pp. 202-213, 1994.
[18] C. Wust and D.W. Capson, "Surface Profile Measurement Using Color Fringe Projection", Machine Vision and Applications, Vol. 4, pp. 193-203, 1991.
[19] K.L. Boyer and A.C. Kak, "Color-encoded structured light for rapid active ranging", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 9, pp. 14-28, 1987.
[20] H. Li, R. Straub, and H. Prautzsch, "Structured Light Based Reconstruction Under Local Spatial Coherence Assumption", 3rd IEEE Int. Symp. on 3D Data Processing, Visualization and Transmission, Washington, DC, USA, pp. 575-582, June 2006.
[21] J. Pages, J. Salvi, et al., "Optimised De Bruijn patterns for one-shot shape acquisition", Image and Vision Computing, 23(8), pp. 707-720, 2005.
[22] S. Zhang, P. Huang, "High-Resolution, Real-time 3D Shape Acquisition", 2004 Conf. on Computer Vision and Pattern Recognition Workshop, pp. 28, June 2004.
[23] F. Tsalakanidou, et al., "Real-time acquisition of depth and color images using structured light and its application to 3D face recognition", Real-Time Imaging, Vol. 11, No. 5-6, pp. 358-369, Dec. 2005.
[24] P. M. Griffin, L. S. Narasimhan, S. R. Yee, "Generation of uniquely encoded light patterns for range data acquisition", Pattern Recognition, 25(6), pp. 609-616, 1992.
[25] R. Furukawa, H. Kawasaki, "Dense 3D Reconstruction with an Uncalibrated Stereo System using Coded Structured Light", IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp. 107, June 2005.
[26] T. P. Koninckx, L. V. Gool, "Real-Time Range Acquisition by Adaptive Structured Light", IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 28, No. 3, pp. 432-445, March 2006.
[27] A. Glassner, "Fill 'er up! [Graphics filling algorithms]", IEEE Computer Graphics and Applications, Vol. 21, No. 1, pp. 78-85, Jan. 2001.
[28] A. Treuenfels, "An Efficient Flood Visit Algorithm", C/C++ Users Journal, Vol. 12, No. 8, Aug. 1994. http://www.cuj.com
[29] S. Y. Chen and Y. F. Li, "Vision Sensor Planning for 3D Model Acquisition", IEEE Trans. on Systems, Man and Cybernetics, Part B, Vol. 35, No. 5, pp. 894-904, Oct. 2005.
[30] C. Sinlapeecheewa and K. Takamasu, "3D profile measurement using color multi-line stripe pattern with one shot scanning", Integrated Computer-Aided Engineering, Vol. 12, pp. 333-341, 2005.
[31] R. Legarda-Saenz, T. Bothe, and W. P. Jüptner, "Accurate Procedure for the Calibration of a Structured Light System", Optical Engineering, Vol. 43, pp. 464-471, 2004.
[32] Z.J. Geng, "Rainbow Three-Dimensional Camera: New Concept of High-Speed Three-dimensional Vision Systems", Optical Engineering, Vol. 35, pp. 376-383, 1996.
[33] J.-A. Beraldin, M. Rioux, et al., "Traceable 3D Imaging Metrology", Videometrics IX, SPIE Electronic Imaging 2007, Vol. 6491, pp. B.1-B.11, 2007.
[34] J. Vargas, J. A. Quiroga, and M. J. Terron-Lopez, "Flexible calibration procedure for fringe projection profilometry", Optical Engineering, Vol. 46, pp. 023601, Feb. 2007.
S. Y. Chen (M?1) received the Ph.D. degree in computer vision from the Department of Manufacturing Engineering and Engineering Management, City University of Hong Kong, Hong Kong, in 2003.
He joined Zhejiang University of Technology, China, in Feb. 2004, where he is currently an Associate Professor in the Department of Information Engineering. Since July 2004, he has also been a guest researcher in the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. From Aug. 2006 to Aug. 2007, he held a fellowship from the Alexander von Humboldt Foundation of Germany and worked in the Dept of Informatics, University of Hamburg, Germany. His research interests include computer vision, robotics, 3D object modeling, and medical image analysis.
Dr. Chen is a committee member of the IEE/IET Shanghai Branch. He has published over 50 papers in important international journals and conferences. He received the Research Award of the Fok Ying-Tung Education Foundation from the Ministry of Education of China in 2006, won the 2003 IEEE Region 10 Student Paper Competition, and was a finalist for the 2004 Hong Kong Young Scientist Award.
Y. F. Li (M?1–SM?1) received the Ph.D. degree in robotics from the Department of Engineering Science, University of Oxford, Oxford, U.K., in 1993.
From 1993 to 1995, he was a Postdoctoral Research Associate in the Department of Computer Science, University of Wales, Aberystwyth, U.K. He joined City University of Hong Kong, Hong Kong, in 1995 where he is currently an Associate Professor in the Department of Manufacturing Engineering and Engineering Management. His research interests include robot vision, sensing, and sensorbased control for robotics. In these areas, he has published over 100 papers in international journals and conferences. He is an associate editor of IEEE Transactions on Automation Science and Engineering (TASE).
Jianwei Zhang (M?2) received the Ph.D. degree in informatics from University of Karlsruhe, Germany, in 1994.
From 1994 to 2002, he was an Associate Professor and Full Professor in the Technical Informatics Group, University of Bielefeld, Germany. He joined University of Hamburg in 2002 where he is currently a Full Professor (C4) and Director at the Institute of Technical Multimodal Systems in the Department of Informatics. He has authored more than 100 journal and conference papers and received the IEEE ROMAN Award in 2002. His research interests include robot learning, cognitive interface, intelligent signal processing, and neurofuzzy systems.
Manuscript received November 1, 2006, revised July 1, 2007. This work was supported by the NSFC [60405009], the Research Grants Council of Hong Kong [CityU 1206/04E], and the Alexander von Humboldt Foundation of Germany. The paper was presented in part at the IEEE International Conference on Robotics and Automation, Roma, Italy, April 2007.
S. Y. Chen is with the College of Information Engineering, Zhejiang University of Technology, 310014 Hangzhou, China. He was a guest researcher (Humboldt Fellow) in the Dept of Informatics, University of Hamburg, Germany in 20062007. (email: sy@ieee.org).
Y. F. Li is with the Dept of Manufacturing Engineering and Engineering Management, City University of Hong Kong, Hong Kong (email: meyfli@cityu.edu.hk).
Jianwei Zhang is with the Dept of Informatics, University of Hamburg, Germany (email: zhang@informatik.uni-hamburg.de).