Physiological Reviews

Higher Order Visual Processing in Macaque Extrastriate Cortex

Guy A. Orban


The extrastriate cortex of primates encompasses a substantial portion of the cerebral cortex and is devoted to the higher order processing of visual signals and their dispatch to other parts of the brain. A first step towards understanding the function of this cortical tissue is a description of the selectivities of the various neuronal populations for higher order aspects of the image. The selectivities present in the various extrastriate areas support many diverse representations of the scene in front of the subject. The list of known selectivities includes those for pattern direction and speed gradients in the middle temporal (MT)/V5 area; for heading in the dorsal part of the medial superior temporal area; for orientation of nonluminance contours in V2 and V4; for curved boundary fragments in V4 and shape parts in infero-temporal cortex (IT); and for curvature and orientation in depth from disparity in IT and the caudal intraparietal area (CIP). The most common putative mechanism for generating such emergent selectivity is a pattern of excitatory and inhibitory linear inputs from the afferent area, combined with nonlinear mechanisms in both the afferent and the receiving area.


The visual system of primates consists of three main parts: a projection from the retina to the primary visual cortex or striate cortex, retinal projections to subcortical visual centers, and higher order or associative visual cortical areas beyond the primary visual cortex. This latter cortical expanse, the so-called extrastriate cortex, provides the outputs to the other cerebral systems and is the portion of the cortex where most of the analysis of the visual signals is performed. Thus any understanding of the behavioral role of vision is contingent on unraveling the function of extrastriate cortex. This cortical region is greatly expanded in primates compared with other mammals and has a characteristic architecture in primates. All primates share not only V1 and the neighboring V2 and V3 areas, but also the middle temporal (MT)/V5 area and likely V3A (121). The neuronal operations performed by this associative cortex are the topic of this review. In particular, we shall focus on the novel selectivities, for complex aspects of the retinal image, that arise in these cortical regions, beyond the selectivity for simple features that is characteristic of V1 neurons. Thus we do not include in this review any further elaboration of these simple selectivities, such as their invariance for position or size. For example, the relative sensitivity to achromatic and isoluminant gratings is size invariant in V2 but not V1 (290). We also restrict the review to studies using subjects that were passive with respect to the visual stimulus, either alert or anesthetized. Such studies reveal the basic visual processing capabilities of extrastriate neurons, capabilities that can be modulated in many ways, e.g., by the task or by attention. These modulatory influences are often summarized as top-down signals; in that terminology, we restrict the review to bottom-up processing.
Attention and task-related signals are only two examples of what are generally referred to as extraretinal signals. These extraretinal signals also include signals originating from other senses, inputs related to the motor system, such as proprioceptive or vestibular signals, or even signals originating in the motor structures (corollary discharges). In this review we restrict ourselves to the processing of the retinal signals in extrastriate cortex in either the forward (from lower order to higher order areas) or backward (from higher order to lower order areas) direction.

The extrastriate cortex includes a large number of cortical regions, some of them well known, many of them far less explored. Thus the review is also limited by the data that are available. The aspect of visual processing that has received the most attention is motion processing, probably because the parameters of motion are relatively simple, motion can be readily manipulated in display systems, and the initial stage of motion processing beyond V1 was discovered early on. Indeed, the visual area V5 (53) or middle temporal (MT) area in the caudal STS (5) was one of the first extrastriate areas to be discovered. Interestingly, recent functional imaging studies in which monkey and human extrastriate cortical regions are directly compared (201, 202, 325) have revealed a more extensive processing of motion signals in human compared with monkey cortex. Specialization for motion processing has been reported even in human striate cortex (232). The importance of motion processing may be related to the greater mobility of humans and even more so to the extensive use of their hands to manipulate objects, particularly tools, a proposal that has recently received direct support (296). We will review two novel selectivities that emerge in MT/V5: pattern direction selectivity and selectivity for speed gradients. A third aspect of motion processing, the extraction of optic flow components, arises in MSTd, one of the areas receiving input from MT/V5. Much less progress has been made regarding the processing of static object attributes. Although selectivity for object shape has long been attributed to infero-temporal (IT) cortex (86), progress has been slowed for several reasons. Many different types of stimuli can be used for its exploration, and it is not clear that the man-made or abstract stimuli used in most IT studies are particularly useful for revealing the underlying function of this part of cortex.
Furthermore, it has only recently become clear that shape is also processed in parietal regions (48, 274). Recently, two other advances have been made with respect to extrastriate function. On one hand, a number of studies suggest that early extrastriate cortex, i.e., cortical regions in the immediate vicinity of striate cortex, notably V2 and V3, plays a role in segmentation and the definition of figure-ground relationships. On the other hand, some advances have been made in unraveling the processing of stereoscopic information beyond the simple specification of depth.


A. Orientation and Spatial Frequency Selectivity

Hubel and Wiesel discovered the prototypical selectivity of striate neurons: orientation selectivity, first in cats (104, 105) and later in monkeys (106). They also described the receptive field organization of orientation selective neurons: simple, complex, and hypercomplex neurons. Today these hypercomplex neurons are more generally referred to as end-stopped neurons (205), since they can be of either the simple or complex variety (269) and to avoid the wiring implications of the term hypercomplex. Hubel and Wiesel (108, 109) also described a functional organization for orientation selectivity: the columnar organization of preferred orientations. This organization has been confirmed by several techniques including optical imaging (19, 194), deoxyglucose labeling (110, 326), and calcium imaging (195). Since the time of the original Hubel and Wiesel studies, orientation selectivity has been investigated quantitatively with a variety of stimuli, including bars and gratings, in both anesthetized and alert animals (36, 50, 90, 155, 246, 247, 270, 331). Hubel and Wiesel (104) proposed that the orientation selectivity of simple cells arose from the excitatory convergence of a set of geniculate afferents, the receptive fields (RFs) of which were aligned. This is also the archetype of one sort of mechanism with which to construct cortical selectivity: a specific pattern of excitatory inputs to the neuron. Tellingly, the exact circuit generating this selectivity is still under discussion, which illustrates the limits of our present technology. Evidence has been provided for the contribution of geniculate inputs (67, 123, 245), but evidence that intracortical inhibition plays a role has also been obtained (281, 284). In particular, recent studies of the dynamics of orientation selectivity have revealed several inhibitory mechanisms, both tuned and untuned (247, 275, 285, 344, 345).
Together with orientation, spatial frequency defines the power spectrum of images. Spatial frequency selectivity in V1 neurons has been described quantitatively (29, 50, 64, 69, 271). This selectivity is determined by factors similar to those implicated in orientation selectivity, mainly the pattern of lateral geniculate nucleus (LGN) afferents and intracortical inhibitory inputs, whether tuned or not (344).

B. Direction and Speed Selectivity

Hubel and Wiesel (104, 106) first discovered direction selectivity in V1 neurons. Since then, direction selectivity has been studied quantitatively with moving edges, bars, and gratings in anesthetized and alert preparations (2, 50, 69, 206, 270, 287). There is now agreement that direction selectivity in V1 is concentrated in the laminae that project to MT/V5: layers 4B and 6 (90, 92, 161, 206). Recordings from large numbers of V1 neurons reveal a bimodal distribution of direction selectivity indices, suggesting that direction-selective neurons are a separate population of V1 neurons (90, 270).
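Bimodality is usually assessed on a direction index computed from the responses to motion in the preferred and opposite directions. The sketch below shows one common convention (exact definitions vary between studies; the firing rates used are made up for illustration):

```python
def direction_index(resp_pref, resp_null, spontaneous=0.0):
    """Direction index DI = 1 - (null - spont) / (pref - spont).
    DI near 0: no direction preference; DI near 1: no response in the
    opposite (null) direction; DI > 1: the null response is suppressed
    below the spontaneous rate."""
    return 1.0 - (resp_null - spontaneous) / (resp_pref - spontaneous)

# Hypothetical firing rates (spikes/s) for preferred vs. opposite motion:
print(direction_index(40.0, 35.0))   # weakly direction selective
print(direction_index(40.0, 4.0))    # strongly direction selective
```

A bimodal histogram of such indices across a V1 sample is what suggests that direction-selective neurons form a distinct subpopulation.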

The speed selectivity of V1 neurons was initially investigated with moving bars (206). This revealed an eccentricity and laminar dependence of the speed sensitivity, with neurons having preferences for faster speeds occurring in laminae 4B and 6 and at larger eccentricities. Thus, even when eccentricities are matched, one has to exercise caution when comparing the speed sensitivities of V1 and MT/V5 neurons: overall V1 neurons respond to slower motion than MT/V5 neurons (177, 178, 198); however, speed ranges are more similar between the populations of direction-selective neurons (235). By using gratings rather than bars, it becomes possible to compare the speed tunings for different spatial frequencies and to disentangle temporal frequency selectivity from speed tuning. Following this strategy, Priebe et al. (235) were able to show that there is a major difference between the direction-selective simple and complex cells of V1. In simple cells, spatial and temporal frequencies are separable, and only in complex cells is there evidence for speed tuning. This speed tuning is relatively similar to that of MT/V5 neurons (233), although the relationship between speed tunings for patterns other than gratings and the predictions derived from the tuning for gratings can differ between V1 complex cells and MT/V5 neurons (235).
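The separability argument can be made concrete with a toy model (the preferred values and bandwidths below are illustrative, not fitted to data): in a separable cell the preferred temporal frequency is the same at every spatial frequency, whereas in a speed-tuned cell the preferred temporal frequency scales with spatial frequency so that the preferred speed (TF/SF) stays constant.

```python
import numpy as np

def separable_response(sf, tf, sf_pref=2.0, tf_pref=8.0, bw=1.0):
    """Separable model: the response factors into independent
    log-Gaussian spatial- and temporal-frequency tunings
    (bandwidth bw in octaves)."""
    return (np.exp(-np.log2(sf / sf_pref) ** 2 / (2 * bw ** 2))
            * np.exp(-np.log2(tf / tf_pref) ** 2 / (2 * bw ** 2)))

def speed_tuned_response(sf, tf, speed_pref=4.0, sf_pref=2.0, bw=1.0):
    """Speed-tuned model: the preferred temporal frequency scales with
    spatial frequency so that preferred TF/SF equals speed_pref."""
    return (np.exp(-np.log2(sf / sf_pref) ** 2 / (2 * bw ** 2))
            * np.exp(-np.log2(tf / (speed_pref * sf)) ** 2 / (2 * bw ** 2)))

# For the separable cell the best TF is constant across SF;
# for the speed-tuned cell it doubles when SF doubles.
tfs = np.logspace(-1, 5, 601, base=2.0)          # 0.5 .. 32 Hz
for sf in (1.0, 2.0, 4.0):                       # cycles/deg
    print("SF", sf,
          "best TF separable:",
          round(float(tfs[np.argmax(separable_response(sf, tfs))]), 1),
          "best TF speed-tuned:",
          round(float(tfs[np.argmax(speed_tuned_response(sf, tfs))]), 1))
```

Plotting either model as a response surface over log SF and log TF shows the signature used in such studies: an upright ridge for separable cells and an oblique ridge of constant TF/SF for speed-tuned cells.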

C. Other Selectivities

V1 neurons are tuned along axes in color space when tested with isoluminant stimuli (51, 91, 152, 312, 337; for review, see Refs. 79, 153). Compared with geniculate neurons, the peaks of the chromatic tunings are much more widely distributed in V1 than in the LGN (152, 337). Chromatic tuning is invariant for contrast (289) and size (290) in only a minority of V1 neurons. Color induction effects, i.e., the shift of the color of a stimulus away from the color of the background, have been documented for V1 neurons in alert animals (337).

Initial studies by Hubel and Wiesel (107) failed to document disparity selectivity in V1 neurons. Later studies in awake animals (229, 230, 236) reported tuning for horizontal disparity as well as asymmetric disparity response curves (far and near cells). Since then, it has been shown that V1 neurons are selective for absolute, not relative, disparity (42), indicating that they only signal position in depth relative to the fixation point, not another stimulus. Furthermore, in central vision, V1 is specialized for horizontal disparity (39), but this vanishes in the more peripheral visual field representation of V1 (61, 62, 80). Although the disparity selectivity present at the level of V1 neurons imposes bounds on stereoscopic perception (192, 193), V1 neurons are far removed from perception. They respond even to anticorrelated stereograms (41) which elicit little or no depth perception (for review, see Ref. 40).
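The absolute/relative distinction is easy to state computationally (sign convention and the positions below are illustrative): absolute disparity is the difference between a feature's horizontal positions on the two retinas, whereas relative disparity, the difference between two absolute disparities, is unchanged when both retinal images shift together, as during a vergence eye movement.

```python
def absolute_disparity(x_left, x_right):
    """Absolute disparity: difference between a feature's horizontal
    retinal positions in the two eyes (deg; sign convention is
    illustrative)."""
    return x_left - x_right

def relative_disparity(feature_a, feature_b):
    """Relative disparity between two features: the difference of
    their absolute disparities, immune to a common shift of the
    retinal image in one eye."""
    return absolute_disparity(*feature_a) - absolute_disparity(*feature_b)

# Two hypothetical features, as (left-eye, right-eye) positions in deg;
# `verge` adds the same offset to every left-eye position, as a
# vergence movement would.
a, b = (1.2, 0.8), (0.5, 0.9)
verge = lambda f, dv: (f[0] + dv, f[1])
print(absolute_disparity(*a), absolute_disparity(*verge(a, 0.3)))  # changes
print(relative_disparity(a, b),
      relative_disparity(verge(a, 0.3), verge(b, 0.3)))            # same value
```

A neuron coding absolute disparity, like those in V1, thus changes its signal with every vergence movement, while a relative-disparity code would not.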

Thus V1 neurons are selective for simple attributes covering all aspects of vision: shape, motion, color, texture, and depth.

D. Receptive Field Issues

Hubel and Wiesel (104, 105) introduced two distinctions in the receptive field organizations of V1 cells with oriented RFs: that between simple and complex cells and that between end-stopped and end-free cells. While the distinction between simple and complex is routinely used in many V1 studies, see, e.g., speed tuning above, the role of end-stopped cells has received far less attention. A substantial fraction (∼25%) of V1 neurons are end-stopped (98, 265, 269), but these neurons are difficult to observe in awake animals because of their exquisite stimulus requirements, and their numbers may even be underestimated by the almost systematic use of extended stimuli such as gratings. Nonetheless, these neurons are most abundant in the superficial layers, those projecting to extrastriate cortex. Initially, these neurons were considered to contribute primarily to the analysis of shape, since they had been shown to respond to corners, end points of lines, and curved stimuli (52, 98, 105, 330, 347). It has been suggested by Bishop and co-workers (172) that end-stopping might be useful for restricting disparity tuning in directions both parallel and orthogonal to the preferred orientation. This mechanism might partially explain the specialization for horizontal disparities in central vision mentioned above, but the role of end-stopped cells in disparity processing in V1 of the monkey still has to be evaluated. End-stopped neurons have been shown to play a similar role in motion processing, insofar as end-stopped neurons can encode direction of motion in two dimensions, while end-free cells can only encode the direction orthogonal to their preferred orientation and thus suffer from the aperture problem (i.e., they signal motion in one direction, whatever the actual motion direction, Ref. 216). 
End-free cells will respond to an infinity of velocity vectors, ranging from the shortest vector orthogonal to the preferred orientation to the longest one nearly parallel to the preferred orientation. Indeed, it was documented long ago (340) that V1 neurons can respond to very fast speeds for bars oriented parallel to the preferred orientation. End-stopped cells should not respond under these conditions, although this has not been explicitly tested.

Finally, it is worth mentioning that even at the level of macaque V1, the classical receptive field from which excitatory responses are evoked is already surrounded by an antagonistic region that suppresses the responses evoked from the center (35, 120, 129, 231, 266, 282), as has also been reported for area 17 of the cat (45, 89, 167, 187) and owl monkey (6). In many studies the surround is considered to include the end-stopping regions (45, 266), but the sensitivity of end-stopping to contrast (347) casts doubt on this view. In addition to inhibitory influences from beyond the classical receptive field, facilitatory influences have also been shown that arise mainly from the end regions of the RFs (122).


The last tabulation of extrastriate regions was published some time ago (65, 328). At that time, roughly 30 extrastriate regions had been described (Fig. 1 A), and each of these areas was, on average, connected reciprocally with a dozen other areas. The complexity of this wiring diagram led Van Essen et al. (328) to suggest that the visual system must be a dynamic system that adapts itself to the needs of the subject, depending on the task at hand. Such task-dependent processing is beyond the scope of the present review, but see Reference 210 for an early review and Reference 133 for a recent demonstration.

FIG. 1.

The extrastriate cortex in macaque monkey. A: lateral, medial, and flattened view of one hemisphere with the eyeball attached. Colors indicate visual regions; green and light blue indicate motor and auditory cortex, respectively. [Modified from Van Essen et al. (328).] B: the two processing pathways of extrastriate cortex, dorsal and ventral. SC (A) and CS (B), superior colliculus.

There is still disagreement, however, about the exact parcellation at higher levels in the temporal and parietal cortex (327). These various possibilities can be visualized in Caret (329), including the new parcellation of the IPS by Lewis and Van Essen (157, 158). In fact, cortical areas are identified by a set of four independent criteria: cyto- and myeloarchitectonic organization, connection pattern with other cortical and subcortical regions, retinotopic organization, and functional properties. In the past, these properties were often documented within disparate studies, and progress towards an exact parcellation of the extrastriate cortex has consequently been slow. This less than satisfactory state of affairs may now change with the advent of fMRI in the awake monkey (324). This technique allows the sampling of a wide range of functional properties in the same set of subjects (186), as well as revealing the retinotopic organization (68) and, even more recently, tracing anatomical connections by means of electrical stimulation (63). A number of additional regions have been documented since the Van Essen et al. (328) compilation. In the parietal cortex, these include V6 and V6A (76, 166), AIP (183, 253), and some of the subdivisions of the inferior parietal lobule (IPL) (253). In the superior temporal sulcus (STS), they include the stereo part of TE (TEs) (118) in the anterior tip of the lower bank, and the lower superior temporal area (LST) and the middle part of the superior temporal polysensory region (STPm) in the lower and upper bank of midlevel STS, respectively (186).

Finally, it is now well established that the extrastriate cortex is organized into two parallel streams, one dorsal or occipito-parietal stream and one ventral or occipito-temporal stream (Fig. 1B). Originally (321), it was suggested that these two streams process different visual attributes, but more recently the difference in the behavioral goal of their processing has been emphasized (81). Indeed, there is increasing evidence that a number of attributes such as two- and three-dimensional shape are processed in both pathways (48, 201), even when there is no arbitrary mapping between shape and response (82).


A. Antagonistic Surrounds in MT/V5

Tanaka et al. (307) described the antagonistic surrounds of MT/V5 neurons and demonstrated their direction selectivity. The surround was maximally suppressive when motion was in the same direction as the preferred direction of the classical receptive field (CRF). Suppression was also maximal when center and surround moved at the same speed. They documented the large size of the surrounds and attempted to demonstrate their isotropy by comparing the suppressive effects of the two halves of the surround. Raiguel et al. (242) documented the strength and size of these surrounds by fitting the integral of a difference-of-Gaussians function to the diameter-response curves of a large number of MT/V5 neurons. The surround radius averaged 3.3 times that of the CRF, a value larger than that for V1 neurons (266). Surrounds were weakest and smallest in the input layers 4 and 6, where they measured only 2.5 times the radius of the CRF, a value close to that of V1 neurons. Surrounds were stronger in the superficial layers and in layer 5, where they reached five times the radius of the CRF. These data provide evidence that the surrounds in MT/V5 are not just a reflection of those contained in the V1 input, but arise from further processing in MT itself (21). Just as in V1, however, the surrounds are contrast dependent and vanish at low contrast (215). As a consequence, MT neurons will respond better to a large stimulus of low contrast than to one of high contrast. How this fits with changes in speed perception with varying contrast is still unclear (142, 159, 215, 234).
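The area-summation analysis behind such estimates fits the integral of a difference of Gaussians to diameter-response curves. The sketch below uses one common one-dimensional form with made-up parameters (so the surround/center ratio is built in rather than estimated from data), simply to show how an antagonistic surround shapes the curve:

```python
import math
import numpy as np

def area_summation(d, ke=60.0, we=3.0, ki=25.0, wi=10.0):
    """Integral-of-difference-of-Gaussians area-summation model (one
    common 1-D form; all parameter values are illustrative):
    excitation of radius we and suppression of radius wi, each
    integrated over a stimulus of diameter d."""
    return ke * math.erf(d / we) - ki * math.erf(d / wi)

diams = np.linspace(0.5, 20.0, 40)               # stimulus diameter (deg)
resp = np.array([area_summation(d) for d in diams])
d_peak = float(diams[np.argmax(resp)])

# The response first grows with diameter, then declines as the
# stimulus invades the suppressive surround.
print("response peaks at diameter", d_peak, "deg")
print("suppression index:", round(1.0 - resp[-1] / resp.max(), 2))
```

Fitting such a function to measured diameter-response curves yields the center and surround radii whose ratio (e.g., the 3.3 average in MT/V5) is then reported.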

B. Speed Gradient Selectivity of MT/V5 Neurons

Initially, two alternative mechanisms were proposed for the detection of speed gradients: hot spots in the CRF itself (314) or a surround-based mechanism (342). These mechanisms differ both in spatial origin and in sign, since the CRF is excitatory and the surround suppressive. Subsequent work has indicated that the main mechanism is surround based. In anesthetized monkeys, about half (27/57) of MT/V5 neurons were selective for the direction of the speed gradient, and different neurons were tuned to different directions of speed gradients corresponding to different tilts in depth (the direction in which the three-dimensional surface is angled away from the fronto-parallel plane, Ref. 341). These authors also showed that the speed gradient selectivity critically depended on the surround, since all of the selective neurons had antagonistic surrounds and the selectivity was strongly reduced when the surround was masked. Further studies (343) showed that the majority of MT/V5 neurons have nonhomogeneous antagonistic surrounds, as predicted by computational models (16, 77). Figure 2 illustrates the stimuli used to map the excitatory CRF (A), to unmask the presence of a surround (spatial summation test, B), and to map the spatial distribution of the surround effects either coarsely (in 1 of 8 positions around the CRF, C) or in detail (same spacing as excitatory mapping, D). The neuron in Figure 2 had a strong surround effect (F), which arose from an antagonistic region located above the CRF (G and H). Such unimodal surrounds allow the neuron to compute a spatial derivative of speed (thus a speed gradient), provided that the inhibitory surround influence is itself speed dependent. In many MT/V5 neurons this is indeed the case (341): their surrounds are strongly asymmetric, but only when the stimuli in the surround move at the same speed as or faster than the dots moving over the CRF.
About half the MT/V5 neurons had asymmetric surrounds with a single major suppressive region on one side of the CRF. Another quarter of MT/V5 neurons had two suppressive regions on either side of the CRF, and a final quarter possessed a suppressive surround that indeed encircled the CRF (343). These latter uniform surrounds could perform more general functions such as contrast gain control or normalization as in V1 (254). They could also remove common motion (22) or provide modulatory signals for segmentation in depth (see below). The double symmetric regions, on the other hand, might be involved in the extraction of curved surfaces from motion (342). Notice that a seemingly obvious function of MT surrounds, the extraction of discontinuities in the velocity field, was not supported by the data: MT/V5 neurons are not selective for either the orientation or the position of a kinetic boundary (169).

FIG. 2.

Asymmetric surround in middle temporal (MT)/V5 neuron. A–D: stimuli used to map the classical receptive field (CRF) and the antagonistic surround. E–H: the result for a single MT/V5 neuron (red, excitation; blue, suppression; yellow, no effect). Single, square patch of moving random dots presented at different positions (A) yields the CRF (E). A circular patch of moving random dots (B) of increasing diameter yields the summation curve in F. The double stimulation with one central patch and one peripheral patch in 8 different positions (C) yields one suppression map (G). The double stimulation with one square central patch and one small peripheral stimulus in 24 positions yields a detailed suppression map (H). In F–H, poststimulus time histograms are shown for selected conditions. In G, the arrow points to the position with strongest suppression, and bottom histograms indicate mean response for complete stimulation of surround and central patch only. [From Orban (200), with permission from Elsevier.]
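This surround-based gradient computation can be caricatured with a center plus a single displaced suppressive lobe (all parameters below are made up, and `drive` is a generic saturating speed nonlinearity, not a fitted tuning curve). Because the suppression samples the speed field at an offset from the CRF, the unit's response varies with the direction of a linear speed gradient:

```python
import numpy as np

def gradient_response(theta_deg, surround_offset=(0.0, 5.0),
                      s0=8.0, g=0.5, w=0.6):
    """Toy center/asymmetric-surround unit: excitatory speed drive at
    the CRF (origin) minus weighted, speed-dependent suppression from
    one surround lobe at `surround_offset`. theta_deg is the direction
    of a linear speed gradient
    s(x, y) = s0 + g * (cos(theta) * x + sin(theta) * y)."""
    th = np.deg2rad(theta_deg)
    ox, oy = surround_offset
    drive = lambda s: s / (s + s0)        # saturating speed nonlinearity
    speed_surr = s0 + g * (np.cos(th) * ox + np.sin(th) * oy)
    return drive(s0) - w * drive(speed_surr)

thetas = np.arange(0, 360, 10)
resp = np.array([gradient_response(t) for t in thetas])
print("preferred gradient direction:", thetas[np.argmax(resp)], "deg")
```

With the suppressive lobe above the CRF, the unit responds best to gradients in which speed decreases toward the surround, i.e., it approximates a directional spatial derivative of the speed field, as required for speed-gradient selectivity.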

These studies in the anesthetized monkey have been recently replicated in the alert monkey by Nguyenkim and DeAngelis (191), who used large stimuli involving the surround, similar to those used by Xiao et al. (341). These authors confirmed the selectivity of MT/V5 neurons for speed gradients and provided an important control test: the selectivity for the speed gradient is invariant with respect to the average speed in the display. Furthermore, they showed that MT/V5 neurons are frequently selective for disparity gradients and occasionally even texture gradients, but to a lesser degree than for speed gradients. Furthermore, the selectivity for the three cues combined reflected largely that for the speed gradients (200; J. D. Nguyenkim, unpublished data). Although the optimal tilt for the different gradients was not always congruent, Nguyenkim and DeAngelis (191) obtained evidence for increased selectivity when cues were combined.

The selectivity of MT/V5 neurons for disparity gradients might be thought to be similarly surround based, since MT/V5 neurons also exhibit surround effects in the disparity domain (24). Nguyenkim and DeAngelis (190), however, provided evidence that it arises instead from heterogeneities in the receptive field. This finding further suggests that the selectivity for disparity gradients reflects spatial variations in position disparity rather than orientation disparity. That the gradient selectivity for disparity arises from the CRF, while that for speed arises from interaction between the surround and the CRF, might reflect the well-documented difference in the coding of speed and disparity at this level. Speed tuning in MT/V5 neurons is rather coarse (145, 174, 233), whereas that for disparity is finer (46, 175). Hence, capturing the value of the gradient may require a larger distance for speed than for disparity, explaining the need to resort to interactions between the surround and the CRF to extract speed gradients. Indeed, as mentioned above, surrounds are on average 3.3 times larger than the CRF in MT/V5 (242). It is worth pointing out that in MT/V5 a columnar organization has been described not only for direction of motion (2) but also for disparity (44). This organization favors the sort of precise readout of disparity values required for building gradient-sensitive hot spots in the CRF.

Finally, one study reported an impairment of structure-from-motion perception after a lesion of MT/V5 (8), indicating that this area is a critical component of the three-dimensional shape-from-motion extraction pathway.

C. Extraction of Speed Gradients Beyond MT/V5

Selectivity for speed gradients has also been documented in MSTd, a region receiving input from MT/V5 and which is known to process optic flow (55, 83, 144, 255, 321). Duffy and Wurtz (58) noted that the speed gradients superimposed on flow components significantly affect the firing of MSTd neurons in the awake monkey. Sugihara et al. (298) superimposed speed gradients onto rotatory flow, producing rotating planes with various three-dimensional orientations. A substantial fraction (43/97) of MSTd neurons were found to be selective for the tilt, i.e., the direction of the speed gradient. Selectivity for slant (the amount by which the 3-dimensional surface is angled away from the fronto-parallel plane) was also observed in nearly half the MSTd neurons.

In the anterior superior temporal polysensory area (STPa), neurons are also selective for optic flow components (10, 186), and a fraction of the population is selective for three-dimensional structure from motion (9). These authors used a transparent sphere similar to the hollow cylinder used in the perceptual experiments of Siegel and Andersen (278). While the sphere stimulus is characterized by second-order speed variations, it does not allow the parameters describing the surface to be easily manipulated. A number of STPa neurons responded at the onset of three-dimensional structure from motion, and some of these were selective for the axis of rotation and were size invariant.


In principle, a small oriented receptive field such as that of a V1 neuron suffers from the aperture problem in the sense that it can measure only the velocity component orthogonal to its preferred orientation, which is compatible with an infinite number of velocity vectors, as the parallel component of the velocity is unknown (Fig. 3 A). Images of real objects are generally bound by several intersecting contours, and these intersections/corners provide additional information with which to determine the velocity of the object motion. MT/V5 neurons have been shown to use both of the sources of information provided by images of moving objects: multiple contours (Fig. 3B) and the intersections between the contours (Fig. 3D).

FIG. 3.

Extraction of moving contours (A) and intersections (C) in V1 and their integration in MT/V5 (B and D). Hatching indicates RFs. The image of the moving object is a diamond moving to the right.
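The constraint geometry of Figure 3 can be written down directly. Each contour measurement fixes only the velocity component along the contour normal; with two non-parallel contours, the intersection-of-constraints solution is a 2 x 2 linear system (a sketch of the geometry, not a claim about how MT/V5 circuitry implements it):

```python
import numpy as np

def pattern_velocity(normal_dirs_deg, normal_speeds):
    """Intersection of constraints: each moving contour constrains only
    the velocity component along its normal (v . n_i = c_i). Two
    non-parallel contours yield a unique 2-D pattern velocity as the
    solution of a 2 x 2 linear system."""
    n = np.array([[np.cos(np.deg2rad(a)), np.sin(np.deg2rad(a))]
                  for a in normal_dirs_deg])
    return np.linalg.solve(n, np.asarray(normal_speeds, dtype=float))

# Diamond translating rightward at 10 deg/s: its two leading edges have
# normals at +45 and -45 deg, and each edge alone reports only the
# component along its normal, 10 * cos(45 deg) ~ 7.07 deg/s.
c = 10.0 * np.cos(np.deg2rad(45.0))
v = pattern_velocity([45.0, -45.0], [c, c])
print(np.round(v, 2))    # ~ (10, 0): the true pattern velocity
```

This is the computation that an end-free cell cannot perform alone (it supplies only one constraint line), but that integration across contours, or a terminator signal from an end-stopped cell, can resolve.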

A. Use of Multiple Contour Motion: Pattern Direction Selectivity

Movshon et al. (180) introduced plaid stimuli, made by the superposition of two sinusoidal gratings angled 90 or 135° apart, to test the integration of multiple moving contours. Testing these stimuli on 108 MT/V5 neurons, they showed that while a large fraction of MT/V5 neurons (∼40%) signaled the direction of motion of each of the component gratings (component direction selective cells), another group of ∼25% of MT neurons was selective for the direction of the pattern (pattern direction selective cells), and a third remained unclassified. These proportions have remained remarkably stable in subsequent studies (286, 295). In contrast, all V1 neurons were component direction selective or remained unclassified. This was later confirmed for V1 neurons identified as projecting to MT/V5 (182). Rodman and Albright (250) showed that pattern and component direction selectivity correlated with the angle between the preferred orientation of a given MT neuron and its preferred direction of motion. Recently, Smith et al. (286) studied the time course of component and pattern direction-selective neurons' responses in MT. Component cells had a 6 ms shorter average latency than pattern cells, and the selectivity of the latter neurons took ∼50–75 ms to develop, suggesting that the integration of two contours takes time.

Rust et al. (254) tested MT neurons with a whole family of plaid stimuli parameterized by the angle between the component gratings. They showed that the range of MT neuron responses, from strong component selectivity to complete pattern selectivity, could be captured by a cascade model including two nonlinearities sandwiching a linearity. The initial nonlinearity is attributed to the direction-selective V1 neurons, and supposedly reflects both nontuned and tuned components of normalization, the latter introduced by, e.g., a surround. The linear operation is the combination (summing) of excitatory and inhibitory inputs from these direction-selective V1 neurons, whereas the second nonlinearity is due to the spike threshold of the MT neurons themselves. The most important factor determining pattern direction selectivity is the pattern of excitatory and inhibitory inputs, underscoring the importance of this mechanism, first proposed by Hubel and Wiesel (104) to account for orientation selectivity of V1 neurons. Notice, however, that the modeling study (254) revealed that the patterns of excitatory and inhibitory weights were equally important, while initially only the pattern of excitatory LGN inputs had been emphasized. It is worth mentioning that the success of the particular model developed by Rust et al. (254) need not imply that the particular implementation of the components used in that model is unique. For example, the first nonlinearity may arise from synchronization between direction-selective inputs, as has been documented for LGN input to simple cells (283), while the second nonlinearity may partially reflect intracortical inhibitory interactions from the surrounds of MT/V5 neurons (242), or other intracortical inhibitory inputs.
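The flavor of such a cascade can be conveyed with a drastically simplified sketch: direction-tuned, rectified V1 channels; a linear pooling stage with broad excitation and untuned subtractive suppression; and an expansive output nonlinearity standing in for spike threshold. All parameters are made up and this is not the fitted model of Rust et al.; it only illustrates how broad pooling plus suppression can move the plaid tuning peak from the component directions to the pattern direction:

```python
import numpy as np

DIRS = np.arange(0, 360, 15)                 # V1 direction channels (deg)

def v1_population(stim_dirs, bw=30.0):
    """Rectified, direction-tuned V1 responses (circular Gaussian
    tuning), summed over the gratings present in the stimulus."""
    r = np.zeros(DIRS.size)
    for d in stim_dirs:
        diff = np.deg2rad(DIRS - d)
        r += np.exp((np.cos(diff) - 1.0) * (180.0 / (np.pi * bw)) ** 2)
    return r

def mt_unit(pref, stim_dirs, pool_bw=100.0, inhib=0.4, expo=2.0):
    """Cascade sketch: broad excitatory pooling of V1 channels around
    `pref`, untuned subtractive suppression proportional to total V1
    activity, then an expansive output nonlinearity."""
    v1 = v1_population(stim_dirs)
    diff = np.deg2rad(DIRS - pref)
    w = np.exp((np.cos(diff) - 1.0) * (180.0 / (np.pi * pool_bw)) ** 2)
    drive = w @ v1 - inhib * v1.mean() * w.sum()
    return max(float(drive), 0.0) ** expo

# Plaid made of two gratings moving 120 deg apart; pattern direction 0.
plaid = (-60.0, 60.0)
tuning = np.array([mt_unit(p, plaid) for p in DIRS])
print("plaid tuning peaks at", DIRS[np.argmax(tuning)], "deg")
```

With narrow pooling instead (pool_bw near the V1 bandwidth), the same unit inherits two peaks at the component directions, the signature of a component cell, so the excitatory/inhibitory weight profile is indeed what moves a unit along the component-to-pattern continuum in this sketch.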

B. Use of Other Information: Moving Terminators

Moving plaids contain only contour information. Indeed, Movshon et al. (181) have emphasized the importance of perfect superposition of the two gratings in the plaids to create additive plaids. Failure to do so, by using nonadditive plaids in which the intersections have the same contrast as the contours, actually generates an additional low-contrast component at the intersections. Hence, the plaid studies reveal only one aspect of the solution to the aperture problem (Fig. 3, A and B). They completely eliminate the contribution from two-dimensional features of an image, such as end points, corners, and intersections, which allow accurate two-dimensional velocity measurements (Fig. 3C). These features, called terminators (278), could also contribute to solving the aperture problem, especially considering that a class of V1 neurons, the end-stopped cells, are able to signal their motion (216). To study the effects of terminators, Pack and Born (213) introduced a field of short lines that moved either at a right angle to their orientation or at an angle of 45°. In response to such a tilted stimulus, all MT neurons gradually altered their preferred direction over a period of 75 ms to eventually coincide with that for orthogonal stimuli. While the time course is similar to that observed by Smith et al. (286) for pattern direction selectivity, this pattern of behavior was observed in all MT neurons, and not just in the 25% that were pattern direction selective. Pack et al. (211) confirmed this result with nonadditive plaids, which also contain terminators. With this stimulus, they obtained 60% pattern-selective neurons and 6% component cells, probably reflecting the low contrast of the terminators. Because all of the Movshon group's studies had been performed on anesthetized animals, Pack et al. also recorded from the same animals under anesthesia and observed only 7% pattern-selective cells and 45% component-selective cells. The latter result has been disputed, however (181, 212).
One possibility is that by using a different anesthetic, isoflurane rather than the sufentanil used by the Movshon group, Pack et al. (211) strengthened the GABA inhibition, which may have upset one or several of the inhibitory mechanisms involved in the integration of the contour or terminator signals. Studies in the LGN have indeed observed differences using the two anesthetics (291).
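The geometry behind the aperture problem, and why terminators resolve it, can be stated in a few lines. The sketch below (illustrative values only) computes the single velocity component a purely contour-driven detector can recover, the component normal to the contour, and contrasts it with the full two-dimensional velocity carried by a terminator.

```python
import numpy as np

def normal_component(true_velocity, contour_orientation_deg):
    """The only velocity a detector viewing a contour through an aperture
    can recover: the component perpendicular to the contour."""
    v = np.asarray(true_velocity, dtype=float)
    th = np.deg2rad(contour_orientation_deg)
    n = np.array([-np.sin(th), np.cos(th)])   # unit normal to the contour
    return np.dot(v, n) * n

# a bar oriented at 45 deg, actually translating rightward at 1 deg/s
v_true = [1.0, 0.0]
v_contour = normal_component(v_true, 45.0)      # what contour signals provide
v_terminator = np.asarray(v_true, dtype=float)  # line ends carry the true 2-D motion
```

For the 45° bar, the contour signal has magnitude 1/√2 and points 45° away from the true direction, which is the deviation MT neurons initially follow before terminator signals take over.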

To study the integration of terminator signals with contour signals, Pack et al. (214) used barber pole-like stimuli whose elongation was manipulated. These stimuli are created by showing a square-wave grating inside a rectangular window at two different orientations. The results confirm that integration of terminator signals takes time. The steady-state angular deviations of the preferred directions of MT neurons were very close to the deviations predicted by integrating the terminator motion vectors alone, for three different aperture geometries. The deviations largely vanished when the straight-edge aperture was replaced by an aperture with 0.4° indentations, a situation in which the barber pole illusion vanishes. The deviations in preferred direction induced by the change in orientation of the rectangular window depended neither on the size of the aperture nor on its position in the receptive field. In conclusion, the use of terminator vectors to compute object motion is useful only when the terminators are intrinsic to the object (Fig. 3, C and D). In the classical barber pole stimulus, terminators are intrinsic. Under these conditions, the MT neurons seem mainly to integrate the terminator vectors at the expense of the contour vectors. It has been shown that the end-stopped direction-selective neurons of V1 can provide the terminator motion vectors to MT neurons. It has also been shown that end-stopping takes time to develop (216), which might at least partially explain the time that MT neurons need to accurately signal the direction of motion of patterns containing terminators. The remainder probably reflects the integration of the V1 signals, since the indentation experiment (214) shows that the scale of these V1 signals is much smaller than the RF size of MT/V5 neurons. Indeed, it has recently been shown that in IT, nonlinear integration of afferent signals develops over a similar time scale (34).

The two mechanisms of motion integration are not incompatible and might, in fact, depend on the two types of inputs from V1 to MT/V5. The terminator signals might reach MT through the projection of layer 4B, in which many of the neurons are end-stopped. The contour information might reach MT via layer 6, where many direction-selective neurons are end free (Fig. 4). It is worth noting that anatomically, the projection from layer 4B dominates the inputs into MT. One would thus expect terminator signals to dominate over contour signals. This has been observed by monitoring MT activity through smooth pursuit eye movements, the control signals of which transit through MT (84, 136, 137, 188). When monkeys track a single bar whose orientation can be tilted relative to the direction of motion, smooth pursuit initially follows the component predictions, but after 50–100 ms gradually turns to the pattern prediction (23).

FIG. 4.

Types of visual signals sent by different laminae in V1 to MT/V5.

C. Limits of Motion Integration

Integration of motion signals is relevant only when these signals belong to the image of the same object. In both of the situations in which MT neurons integrate contour vectors or terminator vectors, this integration is reduced when there is sufficient visual evidence indicating that the contours or terminators belong to different objects. For the plaid stimuli, which test integration of contours, the limit is the transparency of the motions of the two gratings. Any condition that promotes the impression of transparency reduces pattern-direction behavior in all MT neurons (295). For the barber pole stimuli, introducing an occlusion cue indicating that the grating is located behind the aperture makes the terminators extrinsic (or accidental). To make the terminators extrinsic, Pack et al. (214) surrounded the aperture with a bright frame, which introduces an occlusion cue and reduces the barber pole illusion. This manipulation greatly reduced the deviations in the preferred directions but did not abolish them. To contrast the effects of extrinsic and intrinsic terminators, they placed parallel bars next to the aperture and changed the orientation of these flanking bars. This change shifted the preferred directions of the MT neurons as predicted, although not completely. The effect was stronger than that obtained by Duncan et al. (60), who introduced occlusion by a stereo cue on opposite sides of a square aperture. This weaker effect may be due to the larger distance between the occlusion cue and the aperture in the Duncan et al. experiment, in which the occlusion cue could act only through the surround of the MT/V5 neurons. It is interesting to note that MT/V5 receives afferents not just from V1, which is the major input, but also from V2 and V3 (21). We will see that neurons in V2 are sensitive to occlusion cues and might contribute to the ordering of stimuli in depth (see below).
Input from these early extrastriate areas might provide the MT neurons with the signals required to restrict integration of motion signals to those belonging to the same object (Fig. 4).


The signals from MT/V5 are sent in parallel to a number of neighboring regions. Initially, Ungerleider and Desimone (321) distinguished two such regions, the medial superior temporal visual area (MST) and the floor of the superior temporal visual area (FST). Subsequently, Komatsu and Wurtz (136) proposed a further distinction between dorsal and ventral MST, the latter being involved in smooth pursuit. This distinction was further supported by the work of Tanaka et al. (310), reporting that optic flow selectivity was observed mainly in the dorsal part, implying a role in the analysis of self-motion. Neurons of ventral MST (MSTv) were more selective for translation, particularly for small stimuli, confirming this subregion's role in the analysis of visual trajectories and the generation of pursuit. Recent data have extended the scope of this MSTv functionality to include the control of arm trajectories (112). Recent imaging data (186) have shown that FST reacts very differently from the MST regions and that FST could be at the origin of an action-processing pathway veering off ventrally into the STS and involving newly defined/recognized regions LST and STPm. Thus MT dispatches motion signals to its three satellites (Fig. 5) for further processing of self motion (MSTd), trajectories of moving objects (MSTv), and actions/motion of animate entities (FST). In this review we concentrate on MSTd, which has been far better investigated than the other two regions.

FIG. 5.

Different types of motion signals processed by MT/V5 and its satellites.

A. Selectivity of the MSTd for Optic Flow

The seminal observation by Tanaka's group (255) that a number of MSTd neurons are selective for expansion/contraction or for rotation but do not respond to translation was the starting point of the research into the role of MSTd in the processing of optic flow (for review of this early work, see Ref. 199). Next it became clear that MSTd cells selective for only expansion/contraction or rotation were in the minority and that most MSTd cells were selective for multiple flow components (55, 83, 144). Since expansion/contraction and rotation are basically spatial configurations of local translations, it became essential, for any meaningful analysis of those MSTd neurons selective for translation, to define a criterion allowing one to determine whether, in addition to their translation selectivity, they are selective for expansion/contraction or rotation. Position invariance, tested over a wide region of the visual field encompassing the RF for translation, is such a criterion (144). If the neuron is not selective for the flow component but only for translation, the spatial response map for the flow component will be located on either side of the translation RF for the two directions of the flow component. In contrast, a neuron selective for a flow component will have a spatial response map for only one direction of the flow component, and this map will overlap the translation RF. It should be noted that position invariance requires explicit testing; a large RF does not in itself guarantee position invariance. Furthermore, the position invariance criterion draws a sharp distinction between MT/V5 neurons that have no selectivity for radial motion or rotation and MSTd neurons which can be selective for these higher order motions. This criterion established that selectivity for radial motion or rotation is a novel type of selectivity emerging at the level of MSTd (144). 
Given that selectivity was observed for two of the first-order components of optic flow, expansion/contraction and rotation, the question then arose as to whether MSTd decomposes optic flow. The answer to this question was negative, since very few neurons are selective for the third component of flow, deformations (144), and because the response of an MSTd neuron decreases when the component of flow for which the neuron is selective is mixed with increasing amounts of a different component (209). The MSTd neurons do not calculate derivatives of flow but signal how well the flow present on the retina matches their preferred flow component or mixture of components.
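This template-matching view can be caricatured numerically. In the sketch below (an illustration, not a model of actual MSTd responses), expansion and rotation are written as spatial patterns of local translation vectors, and the "response" is a normalized correlation between the retinal flow and a preferred template.

```python
import numpy as np

def flow_field(kind, points):
    """Ideal first-order flow fields written as spatial patterns of local
    translation vectors at the sampled receptive-field locations."""
    x, y = points[:, 0], points[:, 1]
    if kind == "expansion":
        return np.stack([x, y], axis=1)
    if kind == "rotation":
        return np.stack([-y, x], axis=1)
    if kind == "translation":
        return np.tile([1.0, 0.0], (len(points), 1))
    raise ValueError(kind)

def template_match(flow, template):
    """Template-matching read-out: normalized correlation between the
    retinal flow and the preferred flow template (no decomposition)."""
    f, t = flow.ravel(), template.ravel()
    return float(np.dot(f, t) / (np.linalg.norm(f) * np.linalg.norm(t)))

# sample locations on a grid, excluding the singular center point
g = np.linspace(-1.0, 1.0, 5)
pts = np.array([(a, b) for a in g for b in g if (a, b) != (0.0, 0.0)])

exp_t = flow_field("expansion", pts)
rot_t = flow_field("rotation", pts)
print(template_match(exp_t, exp_t))   # perfect match
print(template_match(exp_t, rot_t))   # orthogonal: no response
```

A mixed preference (e.g., spiral motion) is simply a weighted sum of such templates, consistent with the observation that most MSTd cells respond to multiple flow components.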

Generally, the RFs of MSTd neurons are described as very large and difficult to map even in awake animals. Quantitative measurements of the sizes of these RFs have shown that they are indeed large, but do not span the entire visual field (163, 241). This is important, since otherwise translation-selective neurons could not contribute to the determination of heading direction (see below). The Raiguel et al. (241) study also showed that the ipsilateral representation of the visual field was much more extensive in MSTd than in MT/V5 and that RF size did not change with eccentricity in MSTd. MSTd neurons are broadly sensitive to the speed of radial motion or rotation (208, 308) and exhibit strong spatial summation, with antagonistic surrounds being less frequent than in MT/V5 (56, 144, 308). Tanaka and Saito (308) and Geesaman and Andersen (78) showed that the selectivity for radial motion, rotation, or translation of MSTd neurons did not depend on the carrier of the motion, whether it consisted of random dot patterns, windmills, rings, squares, or even non-Fourier stimuli. In the same vein it has recently been reported that MSTd neurons even respond to Glass patterns evoking the perception of rotation (141).

In a very careful study, Tanaka et al. (306) investigated the different cues present in flow stimuli and concluded that the selectivity of MSTd neurons for radial and rotatory motion reflects the spatial pattern of translations present in these motion stimuli. Indeed, a set of eight translations in appropriate positions is indistinguishable from real rotation or radial motion for MSTd neurons. The selectivity of MSTd neurons for optic flow might therefore arise from a combination of excitatory MT inputs. These inputs not only have to be direction selective but must arise from neurons with RFs at the appropriate locations. Thus, to build on the work of Rust et al. (254) and Brincat and Connor (33), it may well be that a NL/L/NL cascade model also applies to flow selectivity in MSTd neurons. Such a model generates some degree of position invariance (33), and it may not be necessary to repeat the configuration of excitatory inputs, as initially suggested by Tanaka et al. (306). In the case of the MSTd neuron model, the initial nonlinearity could easily be provided by the suppressive surrounds of the MT/V5 neurons, which are particularly abundant in the superficial layers of MT/V5 (242). To what extent inhibitory inputs are involved is not yet clear, however (56).

B. Selectivity of MSTd Neurons for Heading Directions

The visual processing of MSTd neurons can be summarized as template matching with radial motion, rotation, and translation, or their combinations. This paved the way for a radical change in the investigation of MSTd processing, initiated by Duffy and Wurtz (57): testing the selectivity for heading direction specified by the position of the focus of expansion (FOE). Initially the position of the FOE was varied in the fronto-parallel plane (57, 147, 217, 218). Only recently was a variation of the FOE position in the horizontal plane added (163), which is essential for capturing the role of cells selective for radial motion. Indeed, if the FOE varies only in the fronto-parallel or vertical direction, the problem can be solved with only translation-selective neurons (147). So far, only one study has completely investigated heading in three-dimensional space using 26 directions of heading (88). Remarkably, this study reveals that nearly all MSTd neurons (251/255) are selective for heading in three dimensions (Fig. 6 A)! As far as we are aware, no effort has been made to investigate the effects of speed along these 26 directions in space. The second important finding of this study is that all directions of heading in space are equally represented (Fig. 6B): there is no preference for straight ahead trajectories or trajectories along the ground plane. This may seem surprising, but only because we tend to think of monkeys as moving around as humans do. In fact, they behave very differently insofar as much of their locomotion consists of jumping and moving about in the treetops rather than walking or running on the ground. Once one recognizes that, for the lifestyle of monkeys, an assessment of self motion in all directions of space is useful, the role of MSTd neurons selective for rotation or mixtures containing rotation becomes clearer. These neurons might analyze rotation around the axis of heading, which itself can take any direction in space.
Such an analysis is, of course, very different from that of rotation with respect to gravity or space, which is analyzed by the vestibular system.
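For concreteness, a heading direction can be parameterized by azimuth and elevation, as in Figure 6. The sketch below enumerates 26 roughly evenly spaced headings (8 azimuths at three elevations, plus straight up and down); this particular sampling is an assumption for illustration and is not necessarily the exact set used in Ref. 88.

```python
import numpy as np

def heading_vector(azimuth_deg, elevation_deg):
    """Unit heading vector from azimuth (in the horizontal plane) and
    elevation (angle above or below that plane)."""
    az, el = np.deg2rad([azimuth_deg, elevation_deg])
    return np.array([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)])

# 26 headings: 8 azimuths at three elevations, plus straight up and down
headings = [heading_vector(az, el)
            for el in (-45.0, 0.0, 45.0)
            for az in np.arange(0.0, 360.0, 45.0)]
headings += [heading_vector(0.0, 90.0), heading_vector(0.0, -90.0)]
print(len(headings))   # 26
```

A preferred-direction map such as Figure 6B is then simply the set of (azimuth, elevation) pairs at which each neuron's tuning function peaks.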

FIG. 6.

Selectivity of the dorsal part of the medial superior temporal visual area (MSTd) neurons for heading direction in 3 dimensions. Elevation and azimuth describe all directions in space. A: response of 3 MSTd neurons as a function of elevation and azimuth. B: preferred directions of headings of all MSTd neurons studied. [Modified from Gu et al. (88), copyright 2006 by the Society for Neuroscience.]

Heading direction is an instantaneous measurement; its integration yields information about the path followed. An initial study found little effect of temporal integration in the selectivity of MSTd neurons (219). Yet using a more natural stimulation, consistent with following the same rotatory path in opposite directions, Froehler and Duffy (72) showed that a number of MSTd neurons are selective for the path followed by the subject. Finally, heading discrimination is impeded by the presence of a moving object. Logan and Duffy (163) investigated the heading coding of MSTd neurons for optic flow stimuli alone, a moving object alone, and the combination of the two when the object moved with the flow or in the opposite direction. While MSTd neurons could derive the heading from the moving object responses, the combined stimulation, when the two agreed, was largely dominated by the optic flow signals. Yet when the object moved in directions opposite to the optic flow, the heading signals arising from the MSTd neurons were strongly reduced, indicating strong interactions between the two types of signals. This result also shows that the presence of moving object responses in MSTd does not necessarily contradict the concept of an MSTd specialized for processing self-motion (Fig. 5).

C. Mixing of Visual With Vestibular and Pursuit Signals in MSTd: Out of Scope

If the heading direction problem is to be solved by using the location of the FOE, this necessarily implies that neurons involved in heading have to be influenced by pursuit eye movements. Indeed, it is well known that the FOE is a function of both the heading and the pursuit eye movements, and extraretinal signals have also been implicated in heading during pursuit (15, 338). A complete review of these studies is beyond the scope of this review. Suffice it to say that there is ample evidence that MSTd neurons are influenced by pursuit (26, 217, 218, 276) and that, in general, the compensation for the shift of the FOE by pursuit is not complete, even if it is improved by the presence of multiple depth planes (323).
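The dependence of the FOE on both heading and pursuit follows from the standard pinhole motion-field equations. The sketch below is an illustration under simplifying assumptions (a single fronto-parallel plane at depth Z; an arbitrary, illustrative pursuit rate): it locates the FOE as the image point of minimal flow and shows that a horizontal pursuit rotation displaces it, which is the shift MSTd compensation must undo.

```python
import numpy as np

def retinal_flow(x, y, T, omega, Z):
    """Image motion (pinhole model, focal length 1, usual sign conventions)
    of a point at depth Z under observer translation T and eye rotation omega."""
    Tx, Ty, Tz = T
    wx, wy, wz = omega
    u = (-Tx + x * Tz) / Z + x * y * wx - (1 + x**2) * wy + y * wz
    v = (-Ty + y * Tz) / Z + (1 + y**2) * wx - x * y * wy - x * wz
    return u, v

def foe(T, omega, Z, span=1.0, n=201):
    """Locate the focus of expansion as the image point of minimal flow."""
    xs = np.linspace(-span, span, n)
    X, Y = np.meshgrid(xs, xs)
    U, V = retinal_flow(X, Y, T, omega, Z)
    i = np.argmin(U**2 + V**2)
    return X.flat[i], Y.flat[i]

T = (0.0, 0.0, 1.0)                        # forward translation
print(foe(T, (0.0, 0.0, 0.0), Z=2.0))      # FOE straight ahead
print(foe(T, (0.0, 0.1, 0.0), Z=2.0))      # horizontal pursuit shifts the FOE
```

Because the rotational flow terms do not depend on depth Z while the translational terms do, stimuli containing multiple depth planes make the pursuit-induced component easier to factor out, consistent with the improved compensation reported with multiple depth planes (323).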

Similarly, the integration of visual and vestibular signals in MSTd is also beyond the scope of this review. MSTd receives vestibular input (32, 54), which may be stronger in the postero-medial part of MSTd (88). The role of this combination of signals might be more about the conversion of heading signals into a head-centered reference frame (88) than the generation of a robust heading representation (218). The combination of these different signals in MSTd has triggered considerable interest in modeling. One view is that MSTd neurons are actually basis functions useful for representing multiple variables: the FOE in different coordinates, eye motion, and position (18). To what extent these signals are used in the brain is unclear. Far greater attention to the output of this region will be necessary to understand the computations in which MSTd is involved, one of these likely being that of self motion.

D. Optic Flow Selectivity of Other Extrastriate Regions

MSTd projects to a number of other extrastriate areas, including STPa, in the upper bank of the anterior STS, VIP in the intraparietal sulcus, and 7a, the posterior part of the IPL. Optic flow selectivity has been observed in all three areas. Recording in STPa, Anderson and Siegel (10) observed a substantial fraction of neurons selective for optic flow, especially expansion. Only a small proportion of these neurons were responsive to translation. It remains unclear whether the region from which they recorded corresponds to STPm defined by Nelissen et al. (186) by, amongst other fMRI criteria, its responsiveness to optic flow.

MSTd and MT/V5 project to VIP in the fundus of the intraparietal sulcus. The properties of the neurons in VIP are surprisingly similar to those of MSTd (30, 267). In fact, the selectivity of VIP neurons for heading is also similar to that of MSTd neurons (349). VIP also receives vestibular inputs (31). Two differences are noteworthy. First, compensation for pursuit seems more complete than in MSTd, suggesting that heading might be coded in head-centered coordinates (349). Second, VIP is more multimodal than MSTd, as it also receives somatosensory input (11, 59) and auditory input (272).

Neurons in area 7a are also selective for optic flow (179, 244, 256, 257, 280, 294). Phinney and Siegel (228) have reported interactions between speed and optic flow selectivity in 7a. Steinmetz et al. (294) and Merchant et al. (176) reported that 7a neurons are also especially sensitive to expansion stimuli.

PEc neurons on the medial side of the superior parietal lobule are also responsive to optic flow stimuli (240), but this region, part of area 5, is only indirectly linked with MSTd and VIP. PEc is an area involved in the control of arm movements (66). An association between radial flow and the control of arm trajectories was suggested earlier by Steinmetz et al. (294). Likewise, VIP projects to area F4 (249), an area involved in the planning of arm movements. STPa is responsive to the observation of actions, including hand actions (225). Thus it may well be that selectivity for optic flow does not necessarily reflect involvement in heading and self-motion processing.


A. Distinction Between Internal and External Contours of Objects: Selectivity for Nonluminance Defined Boundaries

Far in the ventral pathway, in area TE, the anterior part of infero-temporal cortex, neurons are selective for two-dimensional shape. It has been shown that at this high level, neurons are invariant for the cue defining the shape. Shape here refers to two-dimensional shape and more particularly the outline of the object in the image, i.e., the external contour of the object. Many IT neurons encode simple shapes whether defined by motion, texture difference, or luminance difference (264). For a number of IT neurons, this invariance extends to orientation selectivity as measured with gratings defined by this same set of cues (263). It was subsequently shown that IT neurons are also selective for shapes defined by disparity differences (304). It is important to note that the differences in disparity or texture, used in these studies (264, 304), are steep steplike changes or discontinuities and are very different from the smooth changes in those features that constitute cues to the three-dimensional shape of objects, i.e., the curvature or orientation of their surfaces in three dimensions. These discontinuities generally evoke a sharp impression of a contour, although physically no contour is present in the image. Thus these contours are created by the brain very much as color is. The computational importance of this cue convergence is the distinction between internal and external contours of objects. Indeed, at the edge of an object, i.e., its external contours, it is likely that many aspects of the image change: luminance and color, texture, depth, motion, etc. In that sense the extraction of such nonluminance-defined contours is essential for delineating the objects and thus represents a first step in figure-ground segregation. On the other hand, internal contours corresponding to surface markings or shadows will exhibit mainly luminance changes without changes in depth, motion, or texture. 
Thus shape-selective IT neurons onto which different cues converge are likely to encode the shape of objects or object parts. This in turn poses the question as to which of the levels preceding IT first gives rise to this selectivity for nonluminance-defined contours. In principle, this could be V1, but this possibility is generally considered unlikely, since selectivity for many of the features upon which these nonluminance-defined contours are based first emerges only at the level of V1. A notable exception might be contours defined by temporal texture (37), as temporal frequency influences geniculate neurons.

1. Illusory contours (anomalous or subjective contours)

The reports of von der Heydt and co-workers (226, 334, 335) drew attention to early extrastriate cortex, particularly V2, for its role in the processing of nonluminance-defined contours. These authors showed that V2 neurons were selective for the orientation of illusory contours, that this selectivity was similar to that for luminance-defined stimuli, and that the response to illusory contours could not be explained by any spurious stimulation of the RF. They studied two types of contours: those arising from abutting gratings and those bridging gaps as in the Kanizsa triangle. The first type can be seen as a discontinuity in texture reflecting an object boundary. The second type corresponds to a discontinuity in luminance induced by the figural elements (the inducers) at the ends of the contour. Recently, it has been shown that cortical neurons (V2 neurons more than V1 neurons) can even signal illusory contours defined by a step in disparity between two surfaces of equal luminance (99) or illusory contours defined by placing an occluder in the near plane (14). Peterhans and von der Heydt (226) proposed that the selectivity for illusory contours in V2 arose by pooling appropriate signals from V1 and V2 end-stopped neurons (98, 100) and combining them with those from end-free V1 neurons, an operation reminiscent of that for the integration of motion signals in MT (Fig. 4). In V2, neurons selective for illusory contours are located mostly in the pale but also in the thick stripes (98, 227).

In these initial studies, virtually no V1 cell was found to be selective for illusory contours. Subsequent studies have observed small fractions of V1 neurons selective for illusory contours, however. In some of these studies, it could be argued that luminance cues were present in some of the stimuli used (17, 85), but in others that was not the case (150, 243). Alternative explanations might relate to laminar positions in V1 (150), since end-stopped neurons are more frequent in superficial layers (98). Furthermore, it has been argued that the latency of responses to illusory contours is longer in V1 than in V2 and that the V1 responses represent feedback from V2 (150). Whether this feedback is required for the emergence of the percept of a sharp contour is unclear. It might simply be intended to keep V1 and V2 in register. It has also been suggested that illusory contours deactivate a number of V1 neurons, a proposal intended to disentangle responses to real and illusory contours (243).

2. Kinetic and other contours

Kinetic contours are created by differences in the direction or speed of motion between two abutting random dot fields. Such contours, although generated by motion, do not themselves move. Originally it was thought that orientation selectivity for such stimuli would arise in MT/V5, but MT/V5 neurons are not selective for the orientations or positions of kinetic boundaries (169). This result was obtained by testing MT/V5 neurons with edges and gratings in which motion was either orthogonal or parallel to the boundary orientation. Results were very similar for gratings and edges. MT/V5 neurons signal only the local motion present in this spatial pattern of translations, as is the case for flow components (144). The lack of MT/V5 involvement in kinetic boundary processing was confirmed by the lesion study of Lauwers et al. (149). Lesions of MT/V5, even large ones, have little effect on the discrimination of small differences in the orientation of kinetic gratings, while severely impairing the discrimination of small differences in motion direction.

Very few V1 neurons and only a small proportion (13/113) of V2 neurons were selective for kinetic boundary orientation when tested using kinetic gratings in which motion was either orthogonal or parallel to the grating orientation (168). In V4, selectivity for kinetic boundary orientation was more evident: relative to the number of neurons selective for luminance grating orientation, more neurons were selective for this attribute, even though the overall proportion was still small (52/482) (184). As was the case for illusory contours, the higher order (V4) neurons had shorter latencies than the lower order (V2) neurons, suggesting that the kinetic responses in V2 might represent feedback signals from V4. It is difficult to compare latencies across studies, since small changes in stimulus parameters may change the latency values. However, the latency differences between V2 and V4 are differences in relative latencies, taking the latency for luminance-defined stimuli as the reference. In V2, the latency of the response to kinetic boundaries is 50–60 ms longer than that to luminance-defined boundaries in kinetic orientation-selective neurons. In nonselective neurons, this difference was reduced to 20–30 ms, and the latencies of kinetic and uniform motion responses were similar (168). In V4, the response latency for kinetic gratings was only 25–30 ms longer than that for luminance-defined gratings, did not differ between selective and nonselective neurons, and was similar to that for uniform motion (184). These latency data are compatible with the following hypotheses: 1) responses to kinetic patterns in nonselective neurons reflect motion responses in V2 and V4; 2) responses in kinetic selective and nonselective V4 neurons are assembled in parallel from the V2 motion input, according to different combination rules (see below); and 3) the kinetic selective V2 neurons receive their inputs from selective V4 neurons.
Furthermore, it is important to notice that the kinetic orientation-selective V4 neurons exhibit three key properties required for the representation of cue invariant boundaries: 1) orientation selectivity for an impoverished stimulus that 2) was invariant for changes in the carrier, in this case direction of motion, and that 3) matched the selectivity for orientation of luminance-defined stimuli. It is noteworthy that a small number (9/452) of V4 neurons were selective for the orientation of kinetic gratings but not luminance gratings.

Leventhal et al. (156) observed V2 neurons that were orientation selective for texture-defined boundaries. For this type of boundary, as for illusory contours, the preferred orientation and tuning widths were similar to those obtained with luminance-defined stimuli. In addition, the authors showed that the gain of the tuning function increased when the saliency of the border increased and when a subliminal luminance-defined bar was added to a weak texture-defined bar. Von der Heydt et al. (336) reported that a fraction of the V2 neurons is tuned to the orientation of disparity-defined edges. These authors compared responses of V2 neurons to optimally oriented luminance-defined and disparity-defined squares at different positions. They observed edge-selective responses in most V2 neurons for both types of figures and observed a correlation between the preferred orientations of the two types of edge responses: the mean difference between the preferred orientations was only 2.7° (9.2° SD) for the seven neurons tuned for orientation with both types of stimuli. To further test the proposition that these cyclopean edge-selective responses represent a cue invariant boundary signal, Bredfeldt and Cumming (28) tested V2 neurons with single disparity edges at various positions and orientations for both signs of the edge, along with uniform-disparity random dot stereograms. In these tests, the orientation tuning of V2 neurons for disparity steps, although broadly correlated with that for luminance-defined stimuli, was considerably less selective than their orientation tuning for luminance stimuli. More importantly, the disparity edge responses frequently (>50%) originated from different locations in the RF, and often (again >50%) the orientation selectivity depended on the choice of the disparities defining the edge.
Bredfeldt and Cumming (28) concluded that cue invariance is not achieved at the level of V2 and that this first step towards invariance can be accounted for by feed-forward projections using the appropriate combination of excitatory inputs. A similar scheme has been proposed for V2 neurons by Ito and Komatsu (113) to explain corner responses and by Orban and Gulyas (203) to explain selectivity in cortical neurons for kinetic boundary orientation. Because similar proposals have been made for illusory contour orientation selectivity (98, 100), it may be that once again a cascade model, in which two nonlinearities sandwich a linear combination of inputs, is applicable to these selectivities in V2 and V4 neurons.
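In computational terms, such a cascade is compact. The sketch below is a minimal illustration of the general scheme only, not any of the cited models; the function and parameter names are ours. Afferent signals are rectified, combined linearly with excitatory and inhibitory weights, and the output is rectified again:

```python
import numpy as np

def cascade_response(inputs, weights, theta1=0.0, theta2=0.0):
    """Minimal cascade sketch: nonlinearity - linear combination - nonlinearity."""
    # First nonlinearity: half-wave rectification in the afferent area.
    rectified = np.maximum(np.asarray(inputs) - theta1, 0.0)
    # Linear stage: weighted sum of excitatory (w > 0) and inhibitory (w < 0) inputs.
    drive = float(np.dot(weights, rectified))
    # Second nonlinearity: output rectification in the receiving area.
    return max(drive - theta2, 0.0)
```

Because both rectifications discard sub-threshold signals, the output selectivity can be sharper than that of any single afferent, which is the essential point of the cascade proposal.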

In conclusion, there is mounting evidence that as the visual message travels from V1 through V2 to V4, cue invariant boundary representations gradually emerge, perhaps earlier (that is, in V2) for illusory and texture-defined boundaries and rather late (in V4) for kinetic and perhaps disparity-defined edges. For illusory contours as well as for kinetic contours, it has been proposed that the selectivity observed at lower levels (V1 for illusory contours and V2 for kinetic contours) reflects feedback signals.

B. Segmentation or Depth Ordering in Static Images: Border Ownership and Surface Representations

Even if, at the level of V4, the brain manages to distinguish between the external contours or boundaries of objects and internal contours reflecting shadows or internal markings, it has only solved half the problem of figure-ground segregation, i.e., to determine which parts in the image correspond to the figure (object image) and which correspond to the background. Indeed, at the level of V4, the visual information is still carried by local RFs that process only part of the object boundary. It has been shown that even V1 neurons respond to the presence of a figure over their RF (146, 150, 351; but see Ref. 252). V2 neurons respond similarly, and in either area this response is little affected by attention (170). These responses, which need time to develop (between 60 and 150 ms in Marcus and Van Essen, Ref. 170), have been taken as evidence that even V1 neurons contribute to the segmentation of figure from background, although these long-latency signals most likely reflect feedback from higher order areas. Whether these responses provide indications that V1 and V2 neurons process object surfaces (encompassed by the object boundaries) is less clear. This question can be addressed by testing the effect of presenting a figure at different positions relative to the RF, as done by von der Heydt et al. (336) for disparity-defined figures and Friedman et al. (71) for color figures. These tests have revealed that the vast majority of V1 and V2 neurons signal the presence of the edge of the figure (Fig. 7 A) and only ∼20% the presence of the figure itself. The latter neurons may contribute to the analysis of object surfaces by signaling, e.g., their color and/or luminance. This is in marked contrast to findings at a much higher level. In studying three-dimensional shape extracted from disparity in IT neurons, Janssen et al. (115) observed equal numbers of neurons signaling the depth structure of boundaries and surfaces.

FIG. 7.

Edge neurons and border ownership selectivity. A: response of a surface and an edge neuron (V2) as a function of the position of a square figure. [From Friedman et al. (71), with permission from Blackwell Publishing.] B: schematic indication of four types of neurons (stripes indicate RF) signaling the direction of the figure with respect to the edge (b) or not (a) and signaling the polarity of the figure (c) or not (d). C: distribution of contrast polarity discrimination (c-d) and side of ownership discrimination (b-a) in V1, V2, and V4. [Modified from Zhou et al. (350).]

Further experiments, however, have indicated that in V2, even edge cells can provide information about which side of the boundary the actual figure is on. By manipulating the contrast polarity of edges and their inclusion in a square figure, Zhou et al. (350) demonstrated that a number of early extrastriate neurons can signal the side on which the figure is located with respect to the edge (Fig. 7B). Such neurons were observed in V2 and V4, but far fewer were found in V1 (Fig. 7C). A number of these neurons can additionally signal the luminance/color of a figure for which they signal the location with respect to the RF (top right corner in squares of Fig. 7C). It is noteworthy that these border ownership signals emerge rapidly, ∼10–25 ms after response onset, much faster than the figure-ground signals recorded by Marcus and Van Essen (170). Thus, even signals that require integration over long distances, far beyond the classical RF (see review in Albright and Stoner, Ref. 4) may emerge relatively quickly. In fact, surround effects have a relatively similar time course (20–30 ms after response onset, Ref. 12). In a subsequent study, Qiu and von der Heydt (239) confirmed the presence of boundary ownership signals in V2. Indeed, if these V2 neurons signal that a figure is present on a given side of the edge, then these neurons should also respond to the presence, in stereograms, of “near” figures on the same side. By systematically comparing responses to figural occlusion and disparity cues, these authors showed that this convergence indeed occurred in a significant number of V2 neurons, but only rarely in V1.

So far, most studies addressing the ordering of surfaces in depth have studied the simple situation of a single figure on a background, with three notable exceptions. Zhou et al. (350) showed that about half the edge cells in V2 and V4 (plus a small proportion in V1) that signal border ownership for a single figure can also signal border ownership for overlapping figures. Bakin et al. (14) showed that the neural basis for contour completion, that is, the facilitation of neural responses to stimuli located within the RF by contextual lines lying outside the RF, is blocked by an orthogonal line intersecting the contour, but is recovered when the orthogonal line is placed in a “near” depth plane. This recovery was observed more frequently in V2 than in V1. Sugita (299) showed that V1 neurons do not respond to an optimal moving bar when it is partially occluded by a small patch. Response was restored by adding crossed disparity to the patch so that it appeared to be in front of the bar, while adding uncrossed disparity had no effect. Notice that this type of completion is very different from that observed in IT (140), where the addition of disparity is not necessary for neurons to complete the shape.

C. Segmentation of Moving Planes in MT/V5

Objects are generally opaque; thus occlusion is the rule between objects at different depths. A moving object near the observer will therefore dynamically occlude other objects at greater distance from the observer. Hegde et al. (93) have suggested in a psychophysical study that second-order or non-Fourier motion stimuli such as contrast modulated moving stimuli may signal dynamic occlusion. Thus the response of MT/V5 neurons to these non-Fourier motion stimuli (3, 196) might signal the dynamic occlusion of one object by another, even when the objects are not distinguished from one another by luminance differences.

Full transparency is relatively rare in natural scenes. Exceptions include shadows and to some extent foliage, especially fine foliage. To disentangle moving shadows from moving objects, the visual system should be able to process transparent motion signals. Snowden et al. (288) compared the responses of V1 and MT/V5 neurons to moving random dots and to transparent motion in which two sets of random dots moved in opposite directions. V1 neurons responded equally well to random dot and transparent motion, while MT/V5 neurons responded less strongly to transparent motion than to the random dot motion. The inverse relationship found between direction selectivity and response to transparent motion suggested that it was the inhibition responsible for suppressing responses in the nonpreferred direction that decreased the response to the transparent motion. Indeed, this inhibition was stronger in MT/V5 than in V1. These authors noted that for the same reason, MT/V5 neurons responded less strongly to kinetic gratings containing opposed motion in segregated bands than to uniformly moving random dot patterns. This agrees with the findings of Marcar et al. (169) that MT/V5 neurons do not signal kinetic boundary orientation (see above). The distinction between V1 and MT/V5 was further explored by using paired and unpaired dot patterns (238). Both types of stimuli include dots moving in opposite directions, but the former is locally balanced and appears to flicker, while the latter is unbalanced and gives the impression of two transparent surfaces. While V1 neurons respond equally to these two types of stimuli, MT/V5 neurons respond far less well to the paired than to the unpaired dot patterns. This suggested that in MT/V5, the second step in motion processing following V1, suppression occurs when locally different directions of motion are present in the image.
The aim of this suppression is to reduce the response to flicker, a process that is incomplete in MT/V5 but continues in MSTd (144). Subsequently, Bradley et al. (27) reported that the suppression in a transparent display could be decreased by introducing disparity between the two sets of moving dots. Thus MT/V5 neurons, while rejecting motion noise (flicker), can still represent transparent surfaces at different depths. While this does not apply to moving shadows, it might be useful for discerning a moving object through moving foliage. In general, this property will be useful in cluttered dynamic scenes, since it might help resolve the three-dimensional structure of such scenes. This property of MT/V5 neurons has been used in a depth-ordering task (25, 87), sometimes referred to as a structure-from-motion task. In fact, depth ordering is indeed useful in the case of partial occlusions, typical of cluttered scenes.
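The suppression by locally opposed motion can be illustrated with a simple opponent computation, in the spirit of motion-opponency accounts; this is an illustrative sketch, not the authors' model, and the names and weight value are ours:

```python
def mt_opponent_response(pref_signal, null_signal, w_opponent=0.8):
    """Illustrative motion opponency: energy in the nonpreferred direction
    subtracts from the drive, so locally balanced (paired, flicker-like)
    displays are suppressed relative to unpaired displays."""
    return max(pref_signal - w_opponent * null_signal, 0.0)
```

With these hypothetical numbers, an unpaired display (pref_signal=10, null_signal=0) yields 10, whereas a locally paired display (10, 10) yields only 2, mimicking the weaker MT/V5 responses to paired dot patterns.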


Neurons in V1 are selective for horizontal disparity, but this is a selectivity for absolute disparities (42), signaling only position in depth relative to the fixation point. In V2, a fraction of the neurons are selective for relative disparity (311), signaling position in depth with respect to another plane. Such signals probably underlie the precision of stereoacuity. Neurons in V2 are also subject to disparity capture (14). V4 neurons are often selective for near disparities (103), which might relate to the processing of objects segregated from the background. This lower-order disparity selectivity in V1, V2, and V4 neurons signals only position in depth (but see below for V4 neurons). Higher-order disparity refers to gradients of disparity, which signal orientation or curvature in depth. Neuronal selectivity for higher order disparity has been documented mainly in two cortical regions: the caudal part of the lateral bank of the IPS, CIP, explored by Sakata, Taira, and colleagues, and a small region in the lower bank of the STS, TEs, explored by our group (for review, see Refs. 204, 260).

A. Higher Order Disparity Selectivity in TEs, Part of the Infero-Temporal Complex

The Janssen et al. (117) study, reporting that a fraction of IT neurons were selective for three-dimensional shape defined by disparity, was not only the first to report selectivity for second-order disparity stimuli, but also the first to report disparity selectivity as such in the ventral stream. Indeed, stereo has been classically associated with the dorsal stream (322, 328), although lesion studies had indicated some involvement of the ventral stream in stereoscopic processing (38, 237, 268). Many subsequent studies have confirmed that stereo is processed in the ventral stream (95, 96, 101–103, 302–304, 319, 320, 339). In these studies of the Leuven group, vertical disparity gradients were imposed on textured surfaces included in relatively complex outlines, about 5° in diameter. Higher order selectivity was demonstrated in IT neurons by showing that the selectivity for curved surfaces of opposite sign (convex and concave) did not depend on the position in depth of the surfaces. The use of position invariance as a criterion is reminiscent of the test used by Lagae et al. (144) to demonstrate higher order motion selectivity in MSTd neurons.

Subsequent studies (118) indicated that neurons selective for three-dimensional shape defined by disparity were not scattered throughout IT, but were concentrated in a small region in the rostral part of the lower bank of the STS. This region, labeled TEs (119), houses many neurons selective for three-dimensional shape from disparity, in contrast to the lateral convexity of IT. The two parts of IT also differ in their degree of binocular summation, which is stronger in TEs than in lateral TE (118). Since the anatomical connectivity of this lower STS region is also different from that of the remainder of the convexity (165, 261), Janssen et al. (118) proposed that TEs is a separate cortical region linked to the IPS.

Finally, TEs neurons were shown to be endowed with another higher order property that had been frequently postulated but never observed: the rejection of false matches such as those in anticorrelated stereograms (116). In contrast to V1 neurons (41), TEs neurons, which are selective for three-dimensional shape depicted by correlated random dot stereograms (RDS), do not respond selectively to anticorrelated RDS. In this respect anticorrelated RDS are similar to decorrelated RDS, which also evoke no differential responses from TEs neurons. Thus, at the level of TEs, the so-called “stereo correspondence problem” (171) is solved. This need not imply, however, that TEs is the first level at which it is solved. Recent results suggest that the false matches are greatly reduced in V4 (302), but not in V2 (7).

B. Exquisite Coding of Three-Dimensional Shape From Disparity by TEs Neurons

In their initial studies, Janssen et al. (117, 118) emphasized the selectivity of TEs neurons for second-order disparity stimuli. In fact, TEs houses neurons selective for all three orders of depth signaled by disparity. Figure 8 shows an example of zero-order, first-order, and second-order disparity-selective TEs neurons. The defining criterion for a higher order neuron was a selectivity that did not reverse at any position in depth. This criterion supposes that vergence eye movements by the monkey are negligible. Generally, the position of only one eye was recorded, but Janssen et al. (119) have shown that this suffices to detect vergence eye movements, provided enough trials are averaged. Furthermore, a number of higher order neurons were recorded while the positions of both eyes were monitored (115) and the absence of vergence eye movements directly demonstrated. Thus these studies confirmed the validity of our definition of higher order neurons. In the initial study (117), this criterion was implemented by the requirement that the response to the preferred shape at its optimal position should exceed the response to the nonpreferred shape at any position. Subsequently (119), this requirement was quantified by an index comparing the best position for the nonpreferred shape to the worst position of the preferred shape. The ratio of these responses did not exceed a factor of 2 in higher order neurons and was generally smaller than 1.5. For the cell in Figure 8A, the ratio exceeded 5. A simple disparity test with fronto-parallel surfaces sufficed to confirm that this cell was of zero order: the cell was a “near” neuron (229). First-order neurons were position-in-depth invariant and responded as well to the three-dimensional shapes as to a planar three-dimensional surface tilted in depth (Fig. 8B). In the Liu et al. study (160), these neurons were shown to be tuned for the tilt (3-dimensional orientation) in depth. 
Finally, second-order neurons were invariant for position in depth and responded selectively to shapes curved in depth but not to first-order stimuli (Fig. 8C). In about half of these, the first-order approximation, a wedge, evoked a significantly weaker response than the original curved stimulus (Fig. 8C). In the other half, this approximation was as effective as that stimulus, reminiscent of the V4 neurons tuned to the orientation of smooth curves or sharp angles (221). Note that zero-order approximations were generally not effective in higher-order neurons, especially second-order neurons, the few exceptions being first-order neurons, as illustrated in Figure 8B.
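The classification criterion lends itself to a compact statement. The sketch below is our own formulation of the index described above (function and variable names are ours); it takes the responses to the preferred and nonpreferred shapes measured at several positions in depth:

```python
def higher_order_index(pref_responses, nonpref_responses):
    """Best response to the nonpreferred shape across positions in depth,
    divided by the worst response to the preferred shape. Small values
    indicate selectivity that never reverses with position in depth."""
    return max(nonpref_responses) / min(pref_responses)
```

For a higher-order neuron this ratio stays below a factor of 2 (generally below 1.5), whereas for the zero-order cell of Figure 8A it exceeded 5.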

FIG. 8.

Types of TEs neurons. A–C: PSTHs indicating average responses of zero-order (A), first-order (B), and second-order (C) selective neurons. D: responses of TEs neurons selective for the 3-dimensional shape of the edges of surfaces (i) and of texture inside the edges (ii). The horizontal lines below the PSTH indicate stimulus duration. In A, all stimuli are indicated above the corresponding PSTHs. In B–D, icons show only the preferred stimulus polarity. Vertical bars indicate 60 spikes/s (A), 30 spikes/s (B and C), and 65 spikes/s (D). [From Orban et al. (204), copyright 2006 with permission from Elsevier.]

It is worthwhile to emphasize the exquisite sensitivity of TEs neurons for small changes in three-dimensional structure. The difference between curved stimuli and their linear approximations is only one example. Most neurons remained selective for the sign of curvature up to the smallest amplitude of depth variation (0.03°) tested. In addition, most neurons were sensitive to differences in the amplitude of depth variation within convex or concave stimuli. Their response usually decreased monotonically with decreasing amplitude, but in some cases was tuned to a preferred amplitude. Thus TEs neurons can signal very precisely the shape of the object in the third dimension (depth structure), and since they are also two-dimensional shape selective, they provide a complete three-dimensional shape representation of objects.

In the original studies, the depth variation was applied to the outline as well as the texture inside the outline of the surface stimuli. Hence, the selectivity for the curvature of three-dimensional surfaces of TEs neurons could reflect selectivity for the depth structure of either the edges or the texture pattern inside the edges. In fact, TEs neurons can be selective for depth structure of either component of the surface stimuli (115). This was demonstrated by testing TEs neurons with additional stimuli some of which lacked edges in depth (double curved “surface” stimuli in Fig. 8D), others of which lacked texture in depth (decorrelated and solid RDS in Fig. 8D). The neuron in the top part of Figure 8D retains its selectivity with decorrelated RDS and solid stereograms in which only the boundary carries depth information, while losing it when the edges are removed in the doubly curved stimuli. This neuron was thus selective for the three-dimensional shape of the edges (3-dimensional edge neuron). The neuron in the bottom part of Figure 8D reacted in exactly the opposite way and was selective for the depth structure of the texture inside the edges (3-dimensional surface neuron). In the same study, we also showed that TEs neurons can encode the orientation of the three-dimensional curvature and can, in all likelihood, combine selectivity for orthogonally oriented curvatures as captured by the shape index of Koenderink (132). Two orthogonal curvatures define convex or concave ridges (one curvature of zero), convex or concave half spheres (both curvatures of the same sign), or saddles (curvatures of opposite sign).
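The shape index mentioned above combines the two orthogonal principal curvatures into a single number. The sketch below gives one common formulation; the sign convention (convex curvatures positive, convex shapes mapping to positive index values) is our assumption:

```python
import math

def shape_index(k1, k2):
    """Koenderink-style shape index for two orthogonal principal curvatures,
    assuming convex curvatures are positive: +1 convex half sphere,
    +0.5 convex ridge, 0 saddle, -0.5 concave ridge, -1 concave half sphere."""
    if k1 < k2:
        k1, k2 = k2, k1  # enforce k1 >= k2
    return (2.0 / math.pi) * math.atan2(k1 + k2, k1 - k2)
```

Under this convention, curvatures (1, 1) give +1 (convex half sphere), (1, 0) gives +0.5 (convex ridge, one curvature of zero), and (1, −1) gives 0 (saddle, curvatures of opposite sign).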

C. The Invariance of Three-Dimensional Shape Selectivity in TEs

The selectivity for depth structure was found to be invariant in TEs for changes in fronto-parallel position and in size (119), as has been observed for two-dimensional shape selectivity (114, 164, 273, 305, 331, 333). The invariance for fronto-parallel position complements the invariance for position in depth already reported in the first study (117), defining a region in three-dimensional space in which TEs neurons maintain their three-dimensional shape selectivity.

As mentioned earlier, the two-dimensional shape selectivity of IT neurons has been shown to be cue invariant (264, 304). In the same vein, the three-dimensional shape selectivity of TEs neurons has also been shown to be depth-cue invariant. We opted for a comparison of selectivities for the disparity and texture cues. TEs neurons are selective not only for tilts specified by disparity but also for those specified by texture gradients (160), and the preferred tilt is similar for the two cues. In addition, the selectivity for tilt specified by texture was shown to be invariant for texture type, for slant, and for binocular versus monocular presentations.

So far, these properties of TEs neurons have not been modeled. It is clear that some properties, such as the fine sensitivity to curvature magnitude, pose severe challenges for the cascade model that is proposed to operate in many cortical areas. Perhaps in this case a surround-based mechanism is better suited to extracting these disparity gradients. For example, a tuned near neuron having a surround with two opponent regions, as described in MT/V5 for motion, and perhaps with lateral inhibitory connections from a tuned far neuron with a smaller RF, would generate a selectivity for a convex surface, albeit with relatively restricted position invariance.

D. Selectivity of CIP Neurons for First-Order Disparity

Shikata et al. (277) reported that neurons in the caudal part of the lateral bank of IPS were selective for the tilt of stereoscopic surfaces. This caudal region has been referred to as cIPS (258), CIP (300), or posterior LIP (185) and probably corresponds to pIPS as defined by Denys et al. (48) and to LOP as defined by Lewis and Van Essen (157). Although that initial study (277) established the disparity selectivity of the CIP neurons, it is only in Taira et al. (300) that the higher order nature of the selectivity was established by showing invariance for changes in the fixation distance. It has been reported in abstract form that CIP neurons have also solved the correspondence problem (124). Importantly, Tsutsui et al. (317) have demonstrated that inactivation of CIP interferes with judgments about surface tilt.

So far, only first-order selectivity has been demonstrated in CIP, although it has been suggested that second-order selectivity is also present (125). Cue convergence has been documented for CIP neurons, for the combination of texture and disparity (318), as well as for perspective and disparity (317). Unlike in the TEs studies, the outlines were always very simple (squares or circles), and the stimuli were small solid figures (317) or large textured surfaces (318). Finally, CIP neurons, rather than being selective for the orientation in depth of surfaces (surface orientation selective), can be selective for the three-dimensional orientation of elongated stimuli (axis orientation selective, Ref. 259).

E. Three-Dimensional Shape From Disparity Selectivity in Other Cortical Regions

V1 neurons display no higher-order disparity selectivity (192) and are selective for anticorrelated RDS as well as for correlated RDS (41). Thus most of the properties of TEs and CIP neurons reflect processing beyond V1. V4 provides input to IT, and V4 neurons are selective for the orientation in depth of elongated stimuli (102) but not for surfaces curved in depth (95). Thus either TEs neurons acquire their higher order selectivity through local connections in TEs or TEO, or TEs receives its selective input from IPS, presumably AIP (165). Indeed, it has been suggested that selectivity for three-dimensional orientation is a property of neurons along the lateral bank of IPS (185). The selectivity of AIP neurons for real objects supposedly supports their role in the control of grasping (183). Whether or not the selectivity for real three-dimensional objects is based on selectivity for three-dimensional shape is presently being investigated (293).

Some three-dimensional orientation selectivity, based on disparity, has been reported for MT/V5 neurons (see above), which also have intermediate properties with respect to responses to anticorrelated RDS (143, 190). Nguyenkim and DeAngelis (190) reported that the preferred tilt of MT/V5 neurons, specified by disparity, did not depend on slant. Thus the origin of higher order disparity selectivity and the distribution of this selectivity throughout the visual system remain unclear. The stronger selectivity in CIP and TEs compared with MT/V5, however, suggests that MT/V5 represents one of the early stages in the extraction of three-dimensional shape and three-dimensional surface orientation.


It has been known for a considerable time that IT neurons respond to images of complex objects (86), including biological entities such as faces and hands (49, 224). Although face stimuli have received a great deal of attention (128, 173, 297, 315, 348), it is not clear how important natural stimuli, such as faces, body parts, and animals, are to the actual function of infero-temporal cortex.

In a very influential set of experiments, Tanaka et al. (309) showed that images of complex objects can be reduced, without loss of response from infero-temporal neurons, to “critical features” which generally consist of more or less complex geometrical parts of the object image. These experiments supported the view that objects, including many man-made objects, are represented by their parts and that these parts are predominantly defined by their geometrical description in flat images (164, 305). This has gradually shifted studies of IT neurons towards their two-dimensional shape selectivity, although in the initial study (309) some of the elaborate neurons clearly required combinations of shape with texture and/or color to be responsive. Subsequently, the same group (130), using a similar procedure along the entire ventral pathway, showed that the critical features of neurons become increasingly complex as one advances along the ventral pathway, culminating in the features of the so-called elaborate neurons of TE (anterior part of IT cortex). Finally, critical features were found to be clustered in IT (73), and further studies have suggested that the various critical features of object images are represented by the pattern of active and inactive clusters in IT (316), with some clusters specifically representing the links between the object parts (346). In the preceding sections, many of the higher order selectivities were for higher order parameters in the image such as three-dimensional orientation or position in space of the FOE. The notion of critical feature suggests that in IT, the higher order selectivity is rather a selectivity for a complex “configuration” (e.g., association of shape parts or of shape elements) than one for complex parameters. This might be taken as an indication that the processing in IT is different and perhaps more qualitative, although a critical feature can be seen as a point in a high-dimensional space. 
Some support for such a qualitative type of processing is provided by the results of Kayaert et al. (127), who found that IT neurons are more sensitive to changes in nonaccidental properties (invariant for changes of orientation in depth of the object) than to changes in metric properties (dependent on orientation in depth).

The finding that the removal of internal contours, texture, or color does not affect the responses of many IT neurons (138) has further contributed to making two-dimensional shape selectivity, understood as selectivity for the outlines of object images, the canonical property of IT neurons. While this is certainly a key property of IT neurons, one should keep in mind (Fig. 9) that 1) we are still uncertain about exactly what it is that IT neurons represent, since living organisms are far more important to the monkey than most man-made objects; 2) the representation of objects, including animate entities, may not be the only goal of IT processing, and the representation of scenes is perhaps just as important; and 3) the processing of two-dimensional shape is important for building these “object” and scene representations, but the processing of other aspects, such as three-dimensional shape and material properties, including volumetric texture and color, is also important (1, 134, 135, 139, 251, 301). Thus IT neurons might be selective for complex image attributes other than two-dimensional shape, and two-dimensional shape itself may be integrated into even higher attributes.

FIG. 9.

Different aspects of the image processed by V1, V4 and infero-temporal (IT). Both geometry and material properties contribute to the description of animate entities (animals, conspecifics, and body parts) and/or objects and of scenes.

A. The Starting Point of Shape Selectivity in V4

As the visual message reaches V4, the figures have been segregated from the background (see above), and the analysis of these figures, particularly the shape of their boundary, can begin. In an influential set of experiments, Pasupathy and Connor (221) provided strong evidence that curvature, a contour feature present in angles and curves, is represented in V4 as an intermediate step between the orientation and spatial frequency selectivity in V1 and the complex shape selectivity in IT. Using a large parametric set of contours in which the acuteness and orientation of convex and concave angles and curves, both sharp and smooth, were manipulated, they showed 1) that a substantial fraction of V4 neurons are tuned for the orientation of a given angle or curve; 2) that this selectivity could not be accounted for by lower order selectivity for, e.g., the orientation of one of the edges of the angle; and 3) that this selectivity was invariant over position in the RF (221). These results in V4 differ from those obtained by Ito and Komatsu (113) in V2, where cells can be tuned for the orientation of angles but respond just as strongly to the angles as to the component edges, suggesting that these V2 neurons represent an intermediate step leading to V4 angle selectivity. It is worth noting that the selectivity for curvature in V4 is probably based on a mechanism different from the curvature selectivity of end-stopped V1 neurons (52, 330). Instead of resulting from a balance between the extent and strength of the inhibitory end regions (205) and the excitatory RF, the selectivity in V4 more likely arises from the combined input from end-stopped or end-free neurons (220) having the appropriate preferred orientation and RF location (Fig. 10). This sort of representation is more invariant for changes in contrast and possibly for the cues defining the contour.

FIG. 10.

Two definitions of curvature. In V1, a balance between excitation from the classical RF and inhibition from the end regions; in V4, combination of inputs from two RFs located and oriented appropriately (with or without endstopping), yielding selectivity for the angle/curve or for the angle only.

In a second step, Pasupathy and Connor (222) tested V4 neurons selective for curvature with a large parameterized set of 366 stimuli constructed by combining convex and concave boundary elements into closed shapes. They observed that individual neurons were selective for one to several neighboring curvatures, generally convexities, placed at particular angular positions with respect to the shape center. Thus V4 neurons encode complex shapes in terms of moderately complex contour configurations and positions. It is noteworthy that the position is a “relative” position with respect to the shape and its center, which is only possible once the shape has been segregated from the background by the preliminary processing along the V1, V2, V4 path (see above). Finally, Pasupathy and Connor (223) reported that the population of curvature-selective V4 neurons represents complete shapes as aggregates of curved boundary fragments. To estimate the population representation of a shape, they scaled each cell's tuning peak with the response of that cell to that shape, summed across cells, and smoothed. The resulting population surface (coordinates: curvature and angular position) contained several peaks that could be used to reconstruct the original shape.
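The population read-out can be sketched as follows; the Gaussian smoothing around each cell's tuning peak and all parameter values are our simplifying assumptions, not those of the original analysis:

```python
import numpy as np

def population_surface(peaks, responses, curv_grid, ang_grid, sigma=(0.2, 20.0)):
    """Each cell's tuning peak in (curvature, angular position) space is
    scaled by that cell's response to the shape, then summed across cells;
    peaks of the resulting surface indicate the curved boundary fragments."""
    C, A = np.meshgrid(curv_grid, ang_grid, indexing="ij")
    surface = np.zeros_like(C)
    for (c0, a0), r in zip(peaks, responses):
        da = (A - a0 + 180.0) % 360.0 - 180.0   # wrap angular difference to [-180, 180)
        surface += r * np.exp(-0.5 * ((C - c0) / sigma[0]) ** 2
                              - 0.5 * (da / sigma[1]) ** 2)
    return surface
```

A shape activating several such cells would yield a multi-peaked surface from which the boundary could be reconstructed, as in the original study.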

Others have found little difference between areas V1, V2, and V4 using a large set of grating and contour stimuli (97). The stimulus set included radial and hyperbolic gratings that were originally used to demonstrate the responsiveness to complex patterns in V4 neurons (74, 75) and V2 neurons (94). This result is reminiscent of the apparent similarity in responses to optic flow components in MT/V5 and MSTd (144). It was only when further tests, such as the position invariance test, were used that the underlying difference between MT/V5 neurons selective only for translation direction and MSTd neurons selective for flow components became evident. More elaborate tests, such as those reviewed above, are probably necessary to differentiate between areas V1, V2, and V4. Interestingly, Hegdé and Van Essen (97) observed in their multidimensional scaling (MDS) analysis a clear segregation between grating and contour stimuli in V4, which can be seen as an indication that, at this level, shape or contour processing and texture processing begin to diverge.

B. Shape Processing in Posterior IT: Building Simple Shape Parts

Using a strategy similar to that of Pasupathy and Connor (222), Brincat and Connor (33) constructed a parametric stimulus set of two-dimensional silhouette shapes by systematically integrating convex, straight, and concave contour elements at specific orientations and positions. They showed that posterior IT neurons (in TEO and the posterior part of TE) integrate multiple contour elements, such as those coded in V4, using linear and nonlinear mechanisms. Indeed, the responses to the stimulus set were widely distributed but could be modeled by nonlinear integration of one to six subunits, each selective for a contour element in a given relative position. The average (over 109 neurons) correlation between observed responses and those predicted by the model was 0.7, indicating that the model explained about half the variance in the posterior IT responses. Both excitatory and inhibitory inputs were integrated by the IT neurons, but only excitatory inputs were integrated nonlinearly. Nonlinearity was related to the sparseness of the response. Shape tuning, in the sense of relative responses to different shapes, was position and size invariant. Again, this higher order selectivity is that for a “configuration,” not a complex parameter, underscoring the difference in processing between IT and other parts of extrastriate cortex.
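A minimal sketch of such subunit integration, assuming Gaussian subunit tuning for a contour-element parameter, a multiplicative (nonlinear) combination of the excitatory subunits, linear subtraction of the inhibitory subunit, and output rectification; all preferences and weights are invented for illustration:

```python
import numpy as np

def subunit_tuning(x, pref, sigma=0.5):
    """Gaussian tuning of one subunit for a contour-element parameter
    (e.g., curvature) at a fixed relative position."""
    return np.exp(-0.5 * ((x - pref) / sigma) ** 2)

def it_response(fragments, exc_prefs=(0.8, -0.3), inh_pref=0.0, w_inh=0.8):
    """Toy posterior-IT cell: excitatory subunits combine multiplicatively
    (a nonlinear AND over contour elements), the inhibitory subunit is
    subtracted linearly, and the output is rectified."""
    exc = np.prod([subunit_tuning(fragments[i], p)
                   for i, p in enumerate(exc_prefs)])
    inh = w_inh * subunit_tuning(fragments[len(exc_prefs)], inh_pref)
    return max(exc - inh, 0.0)
```

With this arrangement, a mismatch at any single excitatory subunit strongly reduces the response, whereas the inhibitory input only shifts it linearly, mirroring the asymmetry reported for excitatory and inhibitory integration.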

Subsequently, Brincat and Connor (34) studied the time course of the integration. They observed that the linear integration was fast and that nonlinear integration required ∼60 additional milliseconds, a finding reminiscent of observations in MT/V5 for pattern direction selective responses (213, 286). This temporal evolution could be modeled by recurrent connections within the area, which have been shown to produce nonlinear selectivity for a conjunction of inputs that are initially combined linearly (262). These studies underscore the generality of cascade models, in which a linear combination of inputs is sandwiched between two nonlinear stages. On the other hand, they also indicate the diversity of the possible implementations of the nonlinearities.
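A toy version of the recurrent scheme invoked here, with invented weights and threshold: the feedforward drive combines two inputs linearly, and above-threshold recurrent feedback gradually converts this into a superadditive, conjunction-selective response, reproducing the fast linear and delayed nonlinear phases:

```python
def settle(a, b, steps, w=0.9, theta=0.7):
    """Linear feedforward combination of two inputs, amplified by recurrent
    feedback that only engages above a threshold; over iterations this turns
    an additive response into a superadditive (conjunction-like) one."""
    ff = 0.5 * (a + b)                    # fast linear feedforward drive
    r = ff
    for _ in range(steps):
        r = ff + w * max(r - theta, 0.0)  # delayed recurrent amplification
    return r

# At onset the response is additive; after settling, only the conjunction of
# both inputs is amplified (the fixed point is near 3.7 here), while a single
# input stays below the recurrent threshold.
onset_both, onset_one = settle(1.0, 1.0, 0), settle(1.0, 0.0, 0)
late_both, late_one = settle(1.0, 1.0, 50), settle(1.0, 0.0, 50)
```

Because the recurrent gain w is below 1, the dynamics converge geometrically rather than diverging, which is what makes the delayed, sharpened selectivity stable.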

C. Shape Processing in Anterior IT: Manipulating Shape Dimensions

The shape selectivity of anterior TE neurons has been studied with regard to several aspects, such as invariance (114, 264, 313, 333; for reviews, see Refs. 164, 305), similarity between shapes (197), and the influence of training (13, 20, 131) and input statistics (47). An initial study using radial frequency as a global parameter of shape (273) failed to reveal any systematic effects for this single parameter, and it is only recently that various shape parameters have been manipulated systematically and tested in anterior TE neurons. Kayaert et al. (126) manipulated various curvatures in rectangular and triangular shapes and observed monotonic response curves with the strongest responses to the most sharply curved shapes. This is rather similar to results obtained in V4 by Pasupathy and Connor (222), who observed strong responses to maximally curved convex contour fragments. Thus this tuning could simply reflect a selectivity present in the inputs to anterior TE. This simple account is more problematic for the more recent results obtained by De Baene et al. (47). These authors used rapid serial presentation to test a wide range of stimuli divided into five stimulus sets, each of which was parameterized by two variables (Fig. 11). Again, anterior TE neurons responded maximally to the extreme stimuli in each set. These monotonic response curves are difficult to interpret in these IT studies. Indeed, the choice of the stimuli is arbitrary, and a larger stimulus set could, in principle, reveal a tuning for an optimal value. Given the stimuli used, it is unclear, however, exactly what that wider set should be. Even if we accept that these monotonic curves genuinely represent neuronal behavior in IT, it is unclear what they encode. One possibility is that, exactly as at lower levels in the system, they encode stimulus dimensions just as bell-shaped tunings do.
While it is true that V1 neurons are tuned for orientation, direction, and spatial frequency, one should remember that monotonic curves have also been reported for disparity (229, 230) and for speed (199, 206, 207). Furthermore, tuning for spatial frequency might reflect the frequency envelope imposed by the peripheral visual apparatus, leaving only circular variables such as orientation and direction for which tunings have been observed at the level of V1. Thus, at early levels, monotonic curves might be as useful for encoding stimulus dimensions as bell-shaped tuning curves are. On the other hand, at the level of IT, monotonic curves may have a different interpretation. Because extreme values of shape parameters are more likely to be features that define shape parts, they may evoke stronger responses in IT neurons. Thus the smooth variation of shape dimensions would have effects similar to the more stepwise reduction of object images as performed by Tanaka et al. (309).
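The point that monotonic curves can encode a stimulus dimension as faithfully as bell-shaped tunings can be illustrated with an opponent pair of sigmoidal units; the specific readout and the sigmoid slope are illustrative assumptions, not a claim about the actual neural code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def monotonic_pair(s):
    """Two opponent units with monotonic (sigmoidal) response curves along
    a stimulus dimension s (e.g., near vs. far disparity preference)."""
    return sigmoid(2.0 * s), sigmoid(-2.0 * s)

def decode(r_pos, r_neg):
    """Exact readout of s from the opponent pair: the difference of the
    response logits recovers the stimulus value."""
    logit = lambda r: np.log(r / (1.0 - r))
    return 0.25 * (logit(r_pos) - logit(r_neg))
```

Because the pair of monotonic responses determines s exactly, no information about the dimension is lost relative to a bank of bell-shaped tunings; the two schemes differ in readout, not in principle.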

FIG. 11.

Preferences of anterior TE neurons for extreme shapes of a parameterized set. Smoothed average response for all responsive IT neurons to the shapes in four different sets. [Modified from De Baene et al. (47).]

Similar monotonic response curves have been observed by Leopold et al. (154), who tested face-selective TE neurons with stimuli in a face space defined by four different faces and their average. These neurons responded more strongly to the individual faces than to the average and even more so to a caricature that extrapolated beyond the actual face. An explanation in terms of typical shape parts seems difficult at first glance, since a face already contains all the defining parts. Yet it may be that the response variations along the axis from the average face to an identity face in fact reflected some covariation in a face part dimension, such as interocular distance or nose-to-mouth distance (70). An extreme value for such a dimension might be indicative of the identity of the face and therefore encoded by IT neurons.


Neurons in the extrastriate cortex are endowed with selectivity for higher order aspects of the image, a selectivity that is absent or rare in V1. Many examples of higher order selectivity come from near-extrastriate areas, such as V2, V4, or MT/V5. At these levels, higher order selectivity is generally that for a complex parameter. Most of the selectivity described here can be accounted for by a simple pattern of excitatory and inhibitory linear inputs, supplemented with nonlinear mechanisms in the afferent and receiving area. Thus the feed-forward projections seem to largely determine the processing in the visual system, although feedback may play a role in figure-ground segmentation (111). One of the functions to which feedback connections might contribute is the generation of the antagonistic surrounds present at various levels in the visual cortex. These antagonistic surrounds perform a variety of functions, such as gain control, rejection of uniformity, generation of selectivity, or integration of secondary cues.

We know far less about properties in far-extrastriate regions, such as the intraparietal sulcus, or infero-temporal cortex. A possible reason is that we do not know the goals of the visual processing well enough, especially for the species in which we most often investigate the visual system, the monkey. It may well be that the higher we climb into the hierarchy of the system, the more the processing of visual signals is tailored to the specific behavioral needs of the species under study. This is exemplified by the tuning for heading directions in MSTd. It came as a surprise that all directions in space were equally represented rather than emphasizing the forward direction or directions parallel to the ground plane. This is surprising only from the standpoint of our human needs; it is far less surprising when we consider that a monkey with his arboreal life-style can jump or fall in any direction. It may well be that the types of object images we have tested on IT neurons are too human-oriented and that a monkey normally needs to recognize and categorize other types of stimuli, most generally living animals or conspecifics. Thus, while it has proven fruitful to link neurophysiological studies of the visual system with psychophysics, it might equally be useful to consider an ethological perspective to better understand what the monkey uses his vision for.

Even if it turns out that human and nonhuman primates use behaviorally similar visual information, it will still be important to consider the visual system from its output side, the connections with other brain regions. Indeed, vision is useful only when it delivers a message to other parts of the brain. Thus reversing the current trend of considering vision from the perspective of the eye and the input and instead emphasizing the other brain parts receiving visual information and the output will help define the goals of the visual processing. Without such knowledge, attempts to investigate higher order visual processing can have little hope of success. Knowing the behavioral goals will specify the “end products” that vision has to deliver, and one can then trace back how the representations of these end products gradually emerge at the different levels in the extrastriate cortex. Undoubtedly these end products must be as robust as possible and will require the convergence of the different types of information pertinent to their computation. Thus emphasis will shift to the convergence of different types of information, either within the visual system as we have seen for shape, texture, and color in the description of object parts by IT neurons, or between sensory systems in multimodal areas, such as in several parietal regions. We have noted the combination of visual, vestibular, and oculomotor signals in MSTd and the convergence of visual, vestibular, oculomotor, auditory, and somato-sensory signals in VIP.

The results described in this review span 20 years of single-cell recording and are a testimony to the strength of the technique, the only one to date able to give insight into the operations performed in the brain at the neuronal level. Yet, they also show that progress has been slow. As indicated above, there might be conceptual reasons for this lack of headway, but there are also methodological reasons. Single-cell studies are very labor intensive, inclining researchers to go for assured results and to be rather conservative in their choice of recording sites. Indeed, only a small number of areas have been explored well enough to have at least hints of their perceptual role. This may change rapidly, as we now have a scouting technique available: functional imaging in the awake monkey. This technique (324) allows one to test a wide range of novel stimuli over the whole visual system rather quickly and to find out which areas are involved in their processing. Furthermore, this technique also allows the invasive studies in the monkey to be linked in a scientifically sound way to human functional imaging data, enabling full exploitation of the monkey model.


This review was written while the author held the European chair at the Collège de France, Paris.

I am indebted to Y. Celis, A. Verhulst, and G. Meulemans for help with the references and figures and to Dr. S. Raiguel for comments on an earlier version of this manuscript.

Address for reprint requests and other correspondence: G. A. Orban, K.U. Leuven, Medical School, Laboratorium voor Neuro- en Psychofysiologie, Campus Gasthuisberg, Herestraat 49, Bus 1021, BE-3000 Leuven, Belgium (e-mail: guy.orban{at}

