The automatic reconstruction of urban scenes from sensory input data is a daunting task. By and large, the task remains unresolved, although a considerable amount of research has been devoted to its solution. Many of the proposed methods are either too application dependent or address only some aspects of the general problem. Moreover, solutions based on a single sensor source, for example intensity images or laser point clouds, tend to yield only partial results. In this paper we propose the reconstruction of visible surfaces from multi-sensor data, embedded in a fusion framework. We postulate that the reconstructed surface is an intermediate, application-independent representation of the scene, similar to the 2.5D sketch proposed by Marr in his vision paradigm. In contrast to the viewer-based 2.5D sketch, our reconstructed surface is represented in a suitable 3D Cartesian reference system. It contains explicit surface information, including shape and surface discontinuities. We argue that such an explicit description greatly benefits applications such as object recognition, populating or updating GIS, change detection, city modeling, and true-orthophoto generation, because the 3D object space enables more powerful reasoning methods to aid object recognition and image understanding than the traditional approach of reasoning in the 2D image space. Another strong motivation for the proposed application-independent surface reconstruction scheme is the multi-source scenario with imaging and laser point data, and possibly hyperspectral data. These widely disparate data sets contain common (redundant), complementary, and occasionally conflicting information about the surface. The paper discusses the notion of different surfaces and their relationships. Major emphasis is placed on the development of a general, true 3D surface representation scheme that copes with the problem of multi-layer surfaces (e.g., multiple overpasses).