3D Object Manipulation in a Single Photograph using Stock 3D Models
Natasha Kholgade1 Tomas Simon1 Alexei Efros2 Yaser Sheikh1
1Carnegie Mellon University 2University of California, Berkeley
Figure 1 panels: Original Photograph; Object Manipulated in 3D; Hidden Right Side (Appearance from Left Side); Hidden Under Side (Appearance from Stock Model); Estimated Illumination (Before/After); 3D Copy-Paste (Before/After).
Figure 1: Using our approach, a user manipulates the taxi cab in a photograph to do a backflip, and copy-pastes the cabs to create a traffic
jam (right) by aligning a stock 3D model (inset) obtained from an online repository. Such 3D manipulations often reveal hidden parts of the
object. Our approach completes the hidden parts using symmetries and the stock model appearance, while accounting for illumination in 3D.
Photo Credits (leftmost photograph): Flickr user © Lucas Maystre.
Abstract
Photo-editing software restricts the control of objects in a photo-
graph to the 2D image plane. We present a method that enables
users to perform the full range of 3D manipulations, including scal-
ing, rotation, translation, and nonrigid deformations, to an object in
a photograph. As 3D manipulations often reveal parts of the object
that are hidden in the original photograph, our approach uses pub-
licly available 3D models to guide the completion of the geometry
and appearance of the revealed areas of the object. The completion
process leverages the structure and symmetry in the stock 3D model
to factor out the effects of illumination, and to complete the appear-
ance of the object. We demonstrate our system by producing object
manipulations that would be impossible in traditional 2D photo-
editing programs, such as turning a car over, making a paper-crane
flap its wings, or manipulating airplanes in a historical photograph
to change its story.
CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional
Graphics and Realism—Virtual Reality;
Keywords: three-dimensional, photo-editing, 3D models
Links: DL PDF
1 Introduction
One of the central themes of computer graphics is to let the general
public move from being passive consumers of visual information
(e.g., watching movies or browsing photos) to becoming its active
creators and manipulators. One particular area where we have al-
ready achieved success is 2D photo-editing software such as Pho-
toshop. Once mainly a tool of professional photographers and de-
signers, it has become mainstream, so much so that ‘to photoshop’
is now a legitimate English verb [Simpson 2003]. Photoshop lets
a user creatively edit the content of a photograph with image oper-
ations such as recoloring, cut-and-paste, hole-filling, and filtering.
Since the starting point is a real photograph, the final result often
appears quite photorealistic as well. However, while photographs
are depictions of a three-dimensional world, the allowable geomet-
ric operations in photo-editing programs are currently restricted to
2D manipulations in picture space. Three-dimensional manipula-
tions of objects—the sort that we are used to doing naturally in the
real world—are simply not possible with photo-editing software;
the photograph ‘knows’ only the pixels of the object’s 2D projec-
tion, not its actual 3D structure.
Our goal in this paper is to allow users to seamlessly perform 3D
manipulation of objects in a single consumer photograph with the
realism and convenience of Photoshop. Instead of simply editing
‘what we see’ in the photograph, our goal is to manipulate ‘what
we know’ about the scene behind the photograph [Durand 2002].
3D manipulation of essentially a 2D object sprite is highly under-
constrained as it is likely to reveal previously unobserved areas of
the object and produce new, scene-dependent shading and shadows.
One way to achieve a seamless ‘break’ from the original photograph
is to recreate the scene in 3D in the software’s internal representa-
tion. However, this operation requires significant effort that only
large special effects companies can afford. It typically also involves
external scene data such as light probes, multiple images, and cali-
bration objects, not available with most consumer photographs.
Instead, in this paper, we constrain the recreation of the scene’s 3D
geometry, illumination, and appearance from the 2D photograph
using a publicly available 3D model of the manipulated object as
a proxy. Graphics is now entering the age of Big Visual Data:
enormous quantities of images and video are uploaded to the In-
ternet daily. With the move towards model standardization and the
use of 3D scanning and printing technologies, publicly available
3D data (modeled or scanned using 3D sensors like the Kinect) are
also readily available. Public repositories of 3D models (e.g., 3D
Warehouse or Turbosquid) are growing rapidly, and several Inter-
net companies are currently in the process of generating 3D models
for millions of merchandise items such as toys, shoes, clothing, and
household equipment. It is therefore increasingly likely that for
most objects in an average user photograph, a stock 3D model will
soon be available, if it is not already.
However, it is unreasonable to expect such a model to be a per-
fect match to the depicted object—the visual world is too varied to
ever be captured perfectly no matter how large the dataset. There-
fore, our approach deals with several types of mismatch between
the photographed object and the stock 3D model:
Geometry Mismatch. Interestingly, even among standard, mass-
produced household brands (e.g., detergent bottles), there are often
subtle geometric variabilities as manufacturers tweak the shape of
their products. Of course, for natural objects (e.g., a banana), the
geometry of each instance will be slightly different. Even in the
cases when a perfect match could be found (e.g., a car of a specific
make, model, and year), many 3D models are created with artistic
license and their geometry will likely not be metrically accurate, or
there are errors due to scanning.
Appearance Mismatch. Although both artists and scanning tech-
niques often provide detailed descriptions of object appearance
(surface reflectance), these descriptions may not match the colors
and textures (and aging and weathering effects) of the particular
instance of the object in the photograph.
Illumination Mismatch. To perform realistic manipulations in
3D, we need to generate plausible lighting effects, such as shadows
on an object and on contact surfaces. The environment illumination
that generates these effects is not known a priori, and the user may
not have access to the original scene to take illumination measure-
ments (e.g., in dynamic environments or for legacy photographs).
Our approach uses the pixel information in visible parts of the
object to correct the three sources of mismatch. The user semi-
automatically aligns the stock 3D model to the photograph using a
real-time geometry correction interface that preserves symmetries
in the object. Using the aligned model and photograph, our ap-
proach automatically estimates environment illumination and ap-
pearance information in hidden parts of the object. While a photo-
graph and 3D model may still not contain all the information needed
to precisely recreate the scene, our approach sufficiently approxi-
mates the illumination, geometry, and appearance of the underlying
object and scene to produce plausible completion of uncovered ar-
eas. Indeed, as shown by the user study in Section 8, our approach
plausibly reveals hidden areas of manipulated objects.
The ability to manipulate objects in 3D while maintaining realism
greatly expands the repertoire of creative manipulations that can
be performed on a photograph. Users are able to quickly perform
object-level motions that would be time-consuming or simply im-
possible in 2D. For example, from just one photograph, users can
cause grandma’s car to perform a backflip, and fake a baby lifting a
heavy sofa. We tie our approach to standard modeling and anima-
tion software to animate objects from a single photograph. In this
way, we re-imagine typical Photoshop edits—such as object rota-
tion, translation, rescaling, deformation, and copy-paste—as object
manipulations in 3D, and enable users to more directly translate
what they envision into what they can create.
Contributions. Our key contribution is an approach that allows
out-of-plane 3D manipulation of objects in consumer photographs,
while providing a seamless break from the original image. To do
so, our approach leverages approximate object symmetries and a
new non-parametric model of image-based lighting for appearance
completion of hidden object parts and for illumination-aware com-
positing of the manipulated object into the image. We make no as-
sumptions on the structure or nature of the object being manipulated
beyond the fact that an approximate stock 3D model is available.
Assumptions. In this paper, we assume a Lambertian model of
illumination. We do not model material properties such as refrac-
tion, specularities, sub-surface scattering, or inter-reflection. The
user study discussed in Section 8 shows that while for some objects,
these constraints are necessary to produce plausible 3D manipula-
tions, if their effects are not too pronounced, the results can be per-
ceptually plausible without explicit modeling. In addition, we as-
sume that the user provides a stock 3D model with components for
all parts of the objects visible in the original photograph. Finally,
we assume that the appearance of the 3D model is self-consistent,
i.e., the precise colors of the stock model need not match the pho-
tograph, but appearance symmetries should be preserved. For in-
stance, the cliffhanger in Figure 13 is created using the 3D model
of a blueish-grey Audi A4 (shown in the supplementary material)
to manipulate the green Rover 620 Ti in the photograph.
Notation. For the rest of the paper, we refer to known quantities without the overline notation, and we use the overline notation for unknown quantities. For instance, the geometry and appearance of the stock 3D model, known a priori, are referred to as X and T respectively. The geometry and appearance of the 3D model modified to match the photograph are not known a priori and are referred to as X̄ and T̄ respectively. Similarly, the illumination environment, which is not known a priori, is referred to as L̄.
2 Related Work
Modern photo-editing software such as Photoshop provides sophis-
ticated 2D editing operations such as content-aware fill [Barnes
et al. 2009] and content-aware photo resizing [Avidan and Shamir
2007]. Many approaches provide 2D edits using low-level assump-
tions about shape and appearance [Barrett and Cheney 2002; Fang
and Hart 2004]. The classic work of Khan et al. [2006] uses insights
from human perception to edit material properties of photographed
objects, to add transparency, translucency, and gloss, and to change
object textures. Goldberg et al. [2012] provide data-driven tech-
niques to add new objects or manipulate existing objects in images
in 2D. While these techniques can produce surprisingly realistic
results in some cases, their lack of true 3D limits their ability to
perform more invasive edits, such as 3D manipulations.
The seminal work of Oh et al. [2001] uses depth-based segmen-
tation to perform viewpoint changes in a photograph. Chen et
al. [2011] extend this idea to videos. These methods manipulate
visible pixels, and cannot reveal hidden parts of objects. To address
these limitations, several methods place prior assumptions on pho-
tographed objects. Data-driven approaches [Blanz and Vetter 1999]
provide drastic view changes by learning deformable models, how-
ever, they rely on training data. Debevec et al. [1996] use the reg-
ular symmetrical structure of architectural models to reveal novel
views of buildings. Kopf et al. [2008] use georeferenced terrain
and urban 3D models to relight objects and reveal novel viewpoints
in outdoor imagery. Unlike our method, Kopf et al. do not remove
the effects of existing illumination. While this works well outdoors,
it might not be appropriate in indoor settings where objects cast soft
shadows due to area light sources.
Approaches in proxy-based modeling of photographed objects in-
clude cuboid proxies [Zheng et al. 2012] and 3-Sweep [Chen et al.
2013]. Unlike our approach, Zheng et al. and Chen et al. (1) cannot
reveal hidden areas that are visually distinct from visible areas, lim-
iting the full range of 3D manipulation (e.g., the logo of the laptop
from Zheng et al. that we reveal in Figure 7, the underside of the
taxi cab in Figure 1, and the face of the wristwatch in Figure 13),
(2) cannot represent a wide variety of objects precisely, as cuboids
(Zheng et al.) or generalized cylinders (Chen et al.) cannot handle
highly deformable objects such as backpacks, clothing, and stuffed
toys, intricate or indented objects such as the origami crane in Fig-
ure 6 or a pair of scissors, or objects with negative space such as
cups, top hats, and shoes, and (3) cannot produce realistic shad-
ing and shadows (e.g. in the case of the wristwatch, the top hat,
the cliffhanger, the chair, and the fruit in Figure 13, the taxi cab in
Figure 2 panels: (a) Original Photograph; (b) Selected Stock 3D Model; (c) User Input (2D point corrections, mask); (d) 3D Object Manipulation; (e) Geometry Correction; (f) Manipulated Object Geometry; (g) Illumination Estimation; (h) Manipulated Object Illumination; (i) Appearance Completion; (j) Final Result.
Figure 2: Overview: (a) Given a photograph and (b) a 3D model from an online repository, the user (c) interactively aligns the model to the
photograph and provides a mask for the ground and shadow, which we augment with the object mask and use to fill the background using
PatchMatch [Barnes et al. 2009]. (d) The user then performs the desired 3D manipulation. (e) Our approach computes the camera and cor-
rects the 3D geometry, and (f) reveals hidden geometry during the 3D manipulation. (g) It automatically estimates environment illumination
and reflectance, (h) to produce shadows and surface illumination during the 3D manipulation. (i) Our approach completes appearance to
hidden parts revealed during manipulation, and (j) composites the appearance with the illumination to obtain the final photograph.
Figure 1, and the crane in Figure 6) to contribute to perceptually
plausible 3D manipulation. Zheng et al. use a point light source
that does not capture the effect of several area light sources (e.g.,
in a typical indoor environment). Chen et al. provide no explicit
representation of illumination.
An alternative to our proposed method of object manipulation is
object insertion, i.e., to inpaint the photographed object and replace
it with an inserted object, either in 2D from a large ‘photo clip art’
library [Lalonde et al. 2007], or in 3D [Debevec 1998; Karsch et al.
2011]. The classic approach of Debevec [1998] renders synthetic
3D objects into the photograph of a real scene by using illumination
captured with a mirrored sphere. Karsch et al. [2011] remove the
requirement of physical access to the scene by estimating geometry
and illumination from the photograph. However, such an insertion-
based approach discards useful information about environment illu-
mination and appearance contained in object pixels in the original
photograph. In contrast, our approach aims to utilize all information
in the original object pixels to estimate the environment illumina-
tion and appearance from the input photograph. Furthermore, when
creating videos, object insertion methods are unlikely to produce a
seamless break from the original photograph, as peculiarities of the
particular instance that was photographed (e.g., smudges, defects,
or a naturally unique shape) will not exist in a stock 3D model.
Our approach is related to work in the area of geometry alignment,
illumination estimation, texture completion, and symmetry detec-
tion. In the area of geometry alignment, there exist automated or
semi-automated methods [Xu et al. 2011; Prasad et al. 2006; Lim
et al. 2013]. In general, as shown by the comparison to Xu et
al. [2011] in Section 8, they fail to provide exact alignment crucial
for seamless object manipulation. To estimate illumination, we use
a basis of von Mises-Fisher kernels that provide the advantage of
representing high frequency illumination effects over the classical
spherical harmonics representation used in Ramamoorthi and Han-
rahan [2001], while avoiding unnaturally sharp cast shadows that
arise due to the point light representation used in Mei et al. [2009].
In addition, we impose non-negativity constraints on the basis coef-
ficients that ensure non-negativity of illumination, in contrast to the
Haar wavelet basis used in Ng et al. [2003], Okabe et al. [2004],
Haber et al. [2009] and Romeiro et al. [2010]. Mixtures of von
Mises-Fisher kernels have been estimated for single view relight-
ing [Hara et al. 2008; Panagopoulos et al. 2009], however, these
require estimating the number of mixtures separately. Our appear-
ance completion approach is related to methods that texture map 3D
models using images [Kraevoy et al. 2003; Tzur and Tal 2009; Gal
et al. 2010], however, they do not factor out illumination, and may
use multiple images to obtain complete appearance. In using sym-
metries to complete appearance, our work is related to approaches
that extract symmetries from images and 3D models [Hong et al.
2004; Gal and Cohen-Or 2006; Pauly et al. 2005], and that use
symmetries to complete geometry [Terzopoulos et al. 1987; Mitra
et al. 2006; Mitra and Pauly 2008; Bokeloh et al. 2011], and to infer
missing appearance [Kim et al. 2012]. However, our work differs from these approaches in that they are mutually exclusive: approaches focused on symmetries from geometry do not respect appearance constraints, and vice versa. Our approach uses
an intersection of geometric symmetry and appearance similarity,
and prevents appearance completion between geometrically simi-
lar but visually distinct parts, such as the planar underside and top
of a taxi-cab, or between visually similar but geometrically distinct
parts such as the curved surface of a top-hat and its flat brim.
3 Overview
The user manipulates an object in a photograph as shown in Fig-
ure 2(a) by using a stock 3D model. For this photograph, the model
was obtained through a word search on the online repository Tur-
boSquid. Other repositories such as 3D Warehouse (Figure 2(b))
or semi-automated approaches such as those of Xu et al. [2011],
Lim et al. [2013], and Aubry et al. [2014] may also be used. The
user provides a mask image that labels the ground and shadow pix-
els. We compute a mask for the object pixels, and use this mask
to inpaint the background using the PatchMatch algorithm [Barnes
et al. 2009]. For complex backgrounds, the user may touch up the
background image after inpainting. Figure 2(c) shows the mask
with ground pixels in gray, and object and shadow pixels in white.
The user semi-automatically aligns and corrects the stock 3D model
to match the photograph using our symmetry-preserving geometry
correction interface as shown in Figure 2(c). Using the corrected
3D model and the photograph pixels, our approach computes and
factors out the environment illumination (Figure 2(g)), and com-
pletes the appearance of hidden areas (Figure 2(i)). Users can then
perform their desired 3D manipulations as shown in Figure 2(d),
and the illumination, completed appearance, and texture are com-
posited to produce the final output, shown in Figure 2(j).
When a user manipulates an object in the photograph I ∈
RW×H×3, shown in Figure 2(a), from the object’s original pose
Θ to a new pose Ω as in Figure 2(d), our objective is to produce an
edited photograph J shown in Figure 2(j). Here W and H are the
width and height of the photograph. We model I as a function of
the object geometry X, object appearance T, and the environment
illumination L, as
I = f(Θ,X,T,L). (1)
The above equation essentially represents the rendering equation.
The manipulated photograph J can then be produced by replacing
the original pose Θ with the new pose Ω, i.e., J = f(Ω,X,T,L).
However, X, T, and L are not known a priori, and estimating them
from a single photograph without any prior assumptions is highly
ill-posed [Barron 2012].
We overcome this difficulty by bootstrapping the estimation using
the stock 3D model of the object, whose geometry consists of ver-
tices X, and whose appearance is a texture map T (Figure 2(b)). In
general, the stock model geometry and appearance do not precisely
match the geometry X and appearance T of the photographed ob-
ject. We provide a tool through which the user marks 2D point
corrections, shown in Figure 2(c). We deform X to match X using
the 2D corrections, as shown in Figure 2(e) and described in Sec-
tion 4. Additionally, we estimate the ground plane in 3D, using one
of two methods: either using vanishing points from user-marked
parallel lines in the image, or as the plane intersecting three user-
marked points on the base of the object. Ground plane estimation
is described in the supplementary material. Manually correcting
the illumination and the appearance is difficult, as the illumination
sources may not be visible in the photograph. Instead, we present an
algorithm to estimate the illumination and appearance using pixels
on the object and the ground, as shown in Figure 2(g) and described
in Section 5, by optimizing the following objective:

{T̄⋆, L̄⋆} = arg min_{T̄,L̄} ‖ I − f(Θ, X̄, T̄, L̄) ‖₂². (2)
Equation (2) only estimates the appearance for parts of the object
that are visible in the original photograph I as shown in Figure 2(i).
The new pose Ω potentially reveals hidden parts of the object. To
produce the manipulated photograph J, we need to complete the
hidden appearance. After factoring out the effect of illumination on
the appearance in visible areas, we present an algorithm that uses
symmetries to complete the appearance of hidden parts from visible
areas as described in Section 6. The algorithm uses the stock model
appearance for hidden parts of objects that are not symmetric to vis-
ible parts. Given the estimated geometry, appearance, and illumi-
nation, and the user-manipulated pose of the object, we composite
the edited photograph by replacing Θ with Ω in Equation (1) as
shown in Figures 2(f), 2(h), and 2(j).
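To make the order of operations explicit, the following sketch mirrors the pipeline of Figure 2 as a single driver function. It is not the authors' implementation: every stage is passed in as a placeholder callable, and all names (including the user_input.mask field) are illustrative assumptions.

# A high-level sketch of the pipeline in Figure 2. Each stage is supplied as a
# callable; only the order of operations reflects the text above.
def manipulate_photograph(photo, stock_model, user_input, new_pose,
                          correct_geometry, estimate_illumination,
                          complete_appearance, render):
    # (c, e) align the stock model and correct its geometry from 2D point corrections
    pose, geometry = correct_geometry(photo, stock_model, user_input)
    # (g) factor the object/ground pixels into environment illumination and reflectance
    illumination, appearance = estimate_illumination(photo, geometry, pose,
                                                     user_input.mask)
    # (i) complete the appearance of hidden parts via symmetries and the stock texture
    appearance = complete_appearance(appearance, geometry, stock_model, pose)
    # (d, f, h, j) re-render with the manipulated pose and composite into the photo
    return render(new_pose, geometry, appearance, illumination, photo,
                  user_input.mask)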
4 Geometry Correction
We provide a user-guided approach to correct the geometry of the
3D model to match the photographed object, while ensuring that
smoothness and symmetry are preserved over the model. Given the
photograph I and the stock model geometry X ∈ RN×3 (where
N is the number of vertices), we first estimate the original rigid
pose Θ of the object using a set A of user-defined 3D-2D corre-
spondences, Xj ∈ R3 on the model and xj ∈ R2, j ∈ A in the
photograph. Here, Θ = {R, t}, where R ∈ R3×3 is the object
rotation, and t ∈ R3 is the object translation. We use the EPnP al-
gorithm [Lepetit et al. 2009] to estimate R and t. The algorithm
takes as input Xj , xj , and the matrix K ∈ R3×3 of camera param-
eters (i.e., focal length, skew, and pixel aspect ratio). We assume
a zero-skew camera, with square pixels and principal point at the
photograph center. We use the focal length computed from EXIF
tags when available, else we use vanishing points to compute the
focal length. We assume that objects in the photograph are at rest
on a ground plane. We describe focal length extraction using van-
ishing points, and ground plane estimation in the supplementary
material. It should be noted that there exists a scale ambiguity in
computing t. The EPnP algorithm handles the scale ambiguity in
terms of translation along the z-axis of the camera.
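As an illustration of this step, the sketch below recovers {R, t} from the 3D-2D correspondences with OpenCV's EPnP solver (the same algorithm cited above), building K under the stated zero-skew, square-pixel, centered-principal-point assumption. The function and variable names are illustrative, not the authors' code.

import numpy as np
import cv2

def estimate_pose(X3d, x2d, focal, width, height):
    """X3d: Ax3 user-clicked model points; x2d: Ax2 clicks in the photograph."""
    K = np.array([[focal, 0.0, width / 2.0],
                  [0.0, focal, height / 2.0],
                  [0.0, 0.0, 1.0]])
    ok, rvec, t = cv2.solvePnP(X3d.astype(np.float64), x2d.astype(np.float64),
                               K, None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)        # 3x3 rotation matrix
    return R, t.reshape(3), K         # the pose Theta = {R, t} and intrinsics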
As shown in Figure 3(a), after the camera is estimated, the user
provides a set B of start points x_k ∈ R², k ∈ B, on the projection of the stock model, and a corresponding set of end points x̄_k ∈ R² on the photographed object for the purpose of geometry correction.
We used a point-to-point correction approach, as opposed to sketch
or contour-based approaches [Nealen et al. 2005; Kraevoy et al.
2009], as reliably tracing soft edges can be challenging compared
to providing point correspondences. The user only provides the
point corrections in 2D. We use them to correct X to X̄ in 3D by optimizing an objective in X̄ consisting of a correction term E1, a smoothness prior E2, and a symmetry prior E3:

E(X̄) = E1(X̄) + E2(X̄) + E3(X̄). (3)
The correction term E1 forces projections of the points X̄_k^Θ = RX̄_k + t to match the user-provided 2D corrections x̄_k, k ∈ B. Here, X̄^Θ represents X̄ transformed by Θ. As shown in Figure 3(b), we compute the ray v_k = K⁻¹[x̄_k^T 1]^T back-projected through each x̄_k. E1 minimizes the sum of distances between each X̄_k^Θ and the projection (v_k v_k^T / ‖v_k‖₂²) X̄_k^Θ of X̄_k^Θ onto the ray v_k:

E1(X̄) = Σ_{k∈B} ‖ X̄_k^Θ − (v_k v_k^T / ‖v_k‖₂²) X̄_k^Θ ‖₂². (4)
Unlike traditional rotoscoping, the correction term encourages the
vertex coordinates to match the photograph only after geometric
projection into the camera. The corrected vertices Xk are other-
wise free to move along the lines of projection such that the overall
deformation energy E(X) is minimized.
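For concreteness, a direct transcription of the correction term in Equation (4) might look as follows; the array names and shapes are assumptions, and in practice this term is minimized jointly with E2 and E3 rather than evaluated in isolation.

import numpy as np

def correction_energy(X_corr, R, t, K, end_points, idx):
    """X_corr: Nx3 corrected vertices; end_points: Bx2 user end points x_k;
    idx: indices of the corrected vertices paired with each end point."""
    Kinv = np.linalg.inv(K)
    E1 = 0.0
    for k, b in enumerate(idx):
        Xk = R @ X_corr[b] + t                            # X_k^Theta
        v = Kinv @ np.array([*end_points[k], 1.0])        # back-projected ray v_k
        proj = (np.outer(v, v) / np.dot(v, v)) @ Xk       # projection of X_k^Theta onto v_k
        E1 += np.sum((Xk - proj) ** 2)
    return E1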
The smoothness prior E2 preserves local smoothness over the corrected model. As shown in Figure 3(c), the term ensures that points in the neighborhood of the corrected points X̄_k move smoothly. In our work, this term refers to the surface deformation energy from the as-rigid-as-possible framework of Sorkine and Alexa [2007]. The framework requires that local deformations within the 1-ring neighborhood D_i of the i-th point in the corrected model should have nearly the same rotations R_i as on the original model:

E2(X̄) = Σ_{i=1}^{N} Σ_{j∈D_i} ‖ (X̄_i − X̄_j) − R_i(X_i − X_j) ‖₂². (5)
The local rotation Ri in the neighborhood of a vertex Xi is distinct
from the global rigid rotation R.
The symmetry prior E3 preserves the principal symmetry (or bilat-
eral symmetry) of the model, as shown in Figure 3(c). If a point
Figure 3 panels: (a) User 2D Edit; (b) Correction Along Ray in 3D; (c) Smoothness and Symmetry Priors.
Figure 3: Geometry correction. (a) The user makes a 2D correction by marking a start-end pair (x_k, x̄_k) in the photograph. (b) Correction term: the back-projected ray v_k corresponding to x_k is shown in black, and the back-projected ray corresponding to x̄_k is shown in red. The top inset shows the 3D point X_k^Θ for x_k on the stock model, and the bottom inset shows its symmetric pair X_sym(k)^Θ. We deform the stock model geometry (light grey) to the user-specified correction (dark grey) subject to smoothness and symmetry-preserving priors.
on the stock model X_i has a symmetric counterpart X_sym(i), E3 ensures that X̄_i remains symmetric to X̄_sym(i), i.e., that they are related through a symmetric transform,

S̄ = [ I3 − 2n̄n̄^T   2n̄d̄ ], (6)

where I3 is the 3×3 identity matrix, and n̄ and d̄ are the normal and distance of the principal plane of symmetry in the corrected model geometry X̄. E3 is thus given as

E3(X̄) = w Σ_{i=1}^{N} ‖ S̄[X̄_i^T 1]^T − X̄_sym(i) ‖₂². (7)
Here, w is a user-defined weight that the user sets to 1 if the object
has bilateral symmetry, and 0 otherwise.
To determine X_sym(i) for every stock model point X_i, we compute the principal plane π = [n^T −d]^T on the stock model using RANSAC¹. We then reflect every X_i across π, and obtain X_sym(i) as the nearest neighbor to the reflection of X_i across π.
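A small sketch of this reflect-and-match step is shown below, with scipy's k-d tree handling the nearest-neighbor query. The plane parameters (n, d) are assumed to come from the RANSAC procedure in the footnote; the code is illustrative, not the authors' implementation.

import numpy as np
from scipy.spatial import cKDTree

def symmetric_counterparts(X, n, d):
    """X: Nx3 stock-model vertices; n: unit normal of the principal plane;
    d: its distance from the origin. Returns sym_idx with sym_idx[i] = sym(i)."""
    reflected = X - 2.0 * (X @ n - d)[:, None] * n[None, :]   # reflect each vertex across the plane
    _, sym_idx = cKDTree(X).query(reflected)                  # nearest stock vertex to each reflection
    return sym_idx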
The objective function in Equation (3) is non-convex. However, note that given the symmetry S̄ and local rotations R_i, the objective is convex in the geometry X̄, and vice versa. We initialize R_i = I3, where I3 is the 3×3 identity matrix, and S̄ with the original stock model symmetry S. We alternately solve for the geometry, and for the symmetry and local rotations, till convergence to a local minimum. Given S̄ and R_i, we solve for X̄ by setting up a system of linear equations. Given X̄, we solve for R_i through SVD as described by Sorkine and Alexa [2007]. To solve for S̄, we assume that the bilateral plane of symmetry passes through the center of mass of the object (which we can assume without loss of generality to be at the origin), so that d̄ = 0. To obtain n̄, we note that the first three columns of S̄ (which we refer to as S̄3) form an orthogonal matrix, which we extract using SVD, as follows: We create matrices A = [X̄_1, · · · , X̄_N] and B = [X̄_sym(1), · · · , X̄_sym(N)], perform the
¹At each RANSAC iteration, we randomly choose two points X_ir and X_jr on the stock 3D model, and compute their bisector plane π_r with normal n_r = (X_ir − X_jr)/‖X_ir − X_jr‖₂ and distance from the origin d_r = ½ n_r^T (X_ir + X_jr). We maintain a score η_r of the number of points that, when reflected across π_r, have a symmetric neighbor within a small threshold µ. After R iterations, we retain the plane with the maximum score as π.
Figure 4: We represent the environment map as a linear combina-
tion of the von Mises-Fisher (vMF) basis. We enforce constraints of
sparseness and grouping of basis coefficients to mimic area lighting
and produce soft cast shadows.
SVD decomposition of AB^T as UΣV^T, and extract S̄3 = VU^T. Then, we extract n̄ as the principal eigenvector of the matrix (I3 − S̄3)/2. We substitute n̄ and d̄ = 0 in Equation (6) to get S̄.
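The symmetry update above reduces to a small amount of linear algebra. The sketch below follows the text (SVD of AB^T, principal eigenvector of (I3 − S̄3)/2, d̄ = 0); the symmetrization before the eigendecomposition and all names are illustrative additions.

import numpy as np

def update_symmetry(X_corr, sym_idx):
    """X_corr: Nx3 current corrected vertices; sym_idx: symmetric counterpart indices."""
    A = X_corr.T                             # 3xN matrix [X_1, ..., X_N]
    B = X_corr[sym_idx].T                    # 3xN matrix [X_sym(1), ..., X_sym(N)]
    U, _, Vt = np.linalg.svd(A @ B.T)
    S3 = Vt.T @ U.T                          # orthogonal part of the symmetry transform
    M = (np.eye(3) - S3) / 2.0
    M = (M + M.T) / 2.0                      # symmetrize for a stable eigendecomposition
    w, V = np.linalg.eigh(M)
    n = V[:, np.argmax(w)]                   # principal eigenvector = plane normal
    S = np.hstack([np.eye(3) - 2.0 * np.outer(n, n), np.zeros((3, 1))])  # Equation (6), d = 0
    return S, n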
5 Illumination and Appearance Estimation
Given the corrected geometry X̄^Θ, we estimate illumination L̄ and appearance T̄ to produce plausible shadows and lighting effects on the object and the ground. We represent the imaging function f using a Lambertian reflection model. Under this model, the i-th pixel I_i on the object and the ground in the photograph is generated as

f_i(Θ, X̄, T̄, L̄) = P̄_i ∫_Ω n_i · s_i(ω) v_i(ω) L̄(ω) dω + δ̄_i, (8)

where P̄_i ∈ R^3 and δ̄_i ∈ R^3 model the appearance of the i-th pixel. We assume that the appearance T̄ of the object consists of a reflectance map P̄ ∈ R^(U×V×3) and a residual difference map δ̄ ∈ R^(U×V×3). Here, P̄ models the diffuse reflectance of the object's appearance (also termed the albedo). Inspired by Debevec [1998], we include δ̄ to represent the residual difference between the image pixels and the diffuse reflection model, since the model is only an approximation to the BRDF of the object. U and V are the dimensions of the texture map. For the i-th pixel, P̄_i and δ̄_i are interpolated from the maps P̄ and δ̄ at the point X̄_i^Θ. X̄_i^Θ is the 3D point back-projected from the i-th pixel location to the object's 3D geometry X̄ as transformed by Θ. n_i is the normal at point X̄_i^Θ, s_i(ω) is the source vector from X̄_i^Θ towards the light source along the solid angle ω, and v_i(ω) is the visibility of this light source from X̄_i^Θ. L̄(ω) is the intensity of the light source along ω. We assume that the light sources lie on a sphere, i.e., that L̄(ω) is a spherical environment map.
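With the sphere of incoming directions discretized into K samples (as done later for the vMF basis), Equation (8) becomes a weighted sum per pixel. The sketch below evaluates that sum; all per-pixel inputs (normal, visibility, source directions) are placeholders assumed to be computed elsewhere from the aligned model.

import numpy as np

def render_pixel(P_i, delta_i, n_i, S, V_i, L, d_omega):
    """P_i, delta_i: 3-vectors (reflectance, residual); n_i: unit normal;
    S: Kx3 unit source directions; V_i: K visibility values in {0, 1};
    L: Kx3 environment radiance; d_omega: solid angle per sample."""
    cosines = np.clip(S @ n_i, 0.0, None)              # n_i . s_i(w), back-facing clamped
    shading = (cosines * V_i)[:, None] * L * d_omega   # integrand at each sampled direction
    return P_i * shading.sum(axis=0) + delta_i         # Equation (8)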
To estimate these quantities, we optimize the following objective function in P̄ and L̄, consisting of a data term F1, an illumination prior F2, and a reflectance prior F3:

F(P̄, L̄) = F1(P̄, L̄) + F2(L̄) + F3(P̄), (9)

and we obtain δ̄ as the residual of the data term F1 in the objective function. F1 represents the generation of pixels in a single photograph using the illumination model from Equation (8) as follows:

F1(P̄, L̄) = Σ_{i=1}^{N_I} τ_i ‖ I_i − P̄_i ∫_Ω n_i · s_i(ω) v_i(ω) L̄(ω) dω ‖₂²,

where τ_i = 1 for object pixels and shadow pixels on the ground, and τ_i = τ for non-shadow pixels on the ground.
Here, NI is the number of pixels covered by the object and the
ground in the photograph. Ï„ corresponds to the value in the gray
region of the user-provided mask shown in Figure 2(c). Here,
0 ≤ τ ≤ 1, and a small value of τ in non-shadow areas of the
ground emphasizes cast shadows. F2 and F3 represent illumination
and reflectance priors that regularize the ill-posed optimization of
estimating P and L from a single photograph.
To describe the illumination prior F2, we represent L̄(ω) as a linear combination of von Mises-Fisher (vMF) kernels, L̄(ω) = Γ(ω)α. Γ(ω) ∈ R^K is a functional basis, shown in Figure 4, and α is a vector of basis coefficients. The k-th component of Γ corresponds to the k-th vMF kernel, given by

h(u(ω); µ_k, κ) = exp(κ µ_k^T u(ω)) / (4π sinh κ),
where µk is the k-th mean direction vector, u(ω) is a unit vector
along direction ω, and the concentration parameter κ describes the
peakiness of the distribution [Fisher 1953].
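Evaluating the vMF basis on a discretized sphere is straightforward; the sketch below (with assumed array shapes) returns the matrix Γ so that an environment map is the non-negative combination Γα. For large κ the sinh term can overflow, which a practical implementation would handle in log space.

import numpy as np

def vmf_basis(U, mu, kappa):
    """U: Qx3 unit sample directions u(w); mu: Kx3 unit mean directions.
    Returns a QxK matrix Gamma with Gamma[q, k] = h(U[q]; mu[k], kappa)."""
    dots = U @ mu.T                                            # mu_k^T u(w) for every pair
    return np.exp(kappa * dots) / (4.0 * np.pi * np.sinh(kappa))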
Through the illumination prior F2, we force the algorithm to find
a sparse set of light sources using an L1 prior on the coefficients.
In addition, according to the elastic net framework [Zou and Hastie
2005], we place an L2 prior to force groups of correlated coeffi-
cients to be turned on. The L2 prior forces spatially adjacent light
sources to be switched on simultaneously to represent illumination
sources such as area lights or windows. We thus obtain the follow-
ing form for F2:
F2(L̄) = λ1 ‖α‖₁ + λ2 ‖α‖₂². (10)
The reflectance prior F3 enforces piecewise constancy over the deviation of the reflectance P̄ from the original stock model reflectance P:

F3(P̄) = λ3 Σ_{i=1}^{N_I} Σ_{j∈N_i} ‖ (P̄_i − P_i) − (P̄_j − P_j) ‖₁. (11)

P belongs to the stock model appearance T described in Section 3. The prior F3 is related to color constancy assumptions about intrinsic images [Land et al. 1971; Karsch et al. 2011]. N_i represents the 4-neighborhood of the i-th pixel in image space.
We optimize the objective F in Equation (9) subject to non-negativity constraints on α:

{P̄⋆, L̄⋆} = arg min_{P̄,L̄} F(P̄, L̄), s.t. α ≥ 0. (12)
The above optimization is non-convex due to the bilinear interac-
tion of the surface reflectances P with the illumination L. If we
know the reflectances, we can solve a convex optimization for the
illumination, and vice versa. We initialize the reflectances with the
stock model reflectance P for the object, and the median pixel value
for the ground plane. We alternately solve for illumination and re-
flectance until convergence to a local minimum. To represent the
vMF kernels and L, we discretize the sphere into K directions, and
compute K kernels, one per direction. Finally, we compute the ap-
pearance difference as the residual of synthesizing the photograph
using the diffuse reflection model, i.e.,
δ̄⋆_i = I_i − P̄⋆_i ∫_Ω n_i · s_i(ω) v_i(ω) L̄⋆(ω) dω. (13)
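As one concrete, simplified realization of the illumination half of this alternating scheme, the sketch below minimizes the data term plus the elastic-net prior of Equation (10) over α ≥ 0 with a generic bound-constrained solver. It assumes the reflectance is held fixed and that the per-pixel, per-channel shading-basis rows have been stacked into a matrix A with targets b; it is a stand-in for the authors' solver, not a description of it.

import numpy as np
from scipy.optimize import minimize

def solve_illumination(A, b, lam1, lam2):
    """A: rows of tau_i * P_i * (shading basis); b: matching tau_i * I_i values."""
    K = A.shape[1]

    def objective(alpha):
        r = A @ alpha - b
        return r @ r + lam1 * alpha.sum() + lam2 * alpha @ alpha   # data + L1 + L2 (alpha >= 0)

    def gradient(alpha):
        return 2.0 * A.T @ (A @ alpha - b) + lam1 + 2.0 * lam2 * alpha

    res = minimize(objective, np.zeros(K), jac=gradient,
                   bounds=[(0.0, None)] * K, method="L-BFGS-B")
    return res.x                                                   # non-negative vMF coefficients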
Figure 5 panels (lateral and dorsal views): (a) Layer 1; (b) Layer 2; (c) Layer 3; (d) Layer 4; (e) MRF Labeling.
Figure 5: We build an MRF over the object model to complete ap-
pearance. (a) Due to the camera viewpoint, the vertices are parti-
tioned into a visible set Iv shown with the visible appearance, and
a hidden set Ih shown in green. Initially, the graph has a single
layer of appearance candidates, labeled Layer 1, corresponding to
visible parts. At the first iteration, we use the bilateral plane of sym-
metry pi1 to transfer appearance candidates from Layer 1 to Layer
2. At the second iteration, we use an alternate plane of symmetry pi2
to transfer appearance candidates (c) from Layer 1 to Layer 3, and
(d) from Layer 2 to Layer 4. We perform inference over an MRF
to find the best assignment of appearance candidates from several
layers to each vertex. This result was obtained after six iterations.
6 Appearance Completion to Hidden Parts
The appearance T̄ = {P̄⋆, δ̄⋆} computed in Section 5 is only avail-
able for visible parts of the object, as shown in Figure 5(a). We use
multiple planes of symmetry to complete the appearance in hidden
parts of the object using the visible parts. We first establish sym-
metric relationships between hidden and visible parts of the object
via planes of symmetry. These symmetric relationships are used to
suggest multiple appearance candidates for vertices on the object.
The appearance candidates form the labels of a Markov Random
Field (MRF) over the vertices. To create the MRF, we first obtain a
fine mesh of object vertices Xs, s ∈ I created by mapping the uv-
locations on the texture map onto the 3D object geometry. Here I
is a set of indices for all texel locations. We use this fine mesh since
the original 3D model mesh usually does not provide one vertex per
texel location, and cannot be directly used to completely fill the ap-
pearance. We set up the MRF as a graph whose vertices correspond
to Xs, and whose edges consist of links from each Xs to the set
Ks consisting of k nearest neighbors of Xs. As described in Sec-
tion 6.1, we associate each vertex Xs with a set of L appearance
candidates (P̃_i,s, δ̃_i,s), i ∈ {1, 2, · · · , L}, through multiple sym-
metries. Appearance candidates with the same value of i form a
layer, and the algorithm in Section 6.1 grows layers (Figure 5(b)
through 5(d)) by transferring appearance candidates from previous
layers across planes of symmetry. To obtain the completed appear-
ance, shown in Figure 5(e), we find an assignment of appearance
candidates, such that each vertex is assigned one candidate and the
assignment satisfies the constraints of neighborhood smoothness,
consistency of texture, and matching of visible appearance to the
observed pixels. We obtain this assignment by performing infer-
ence over the MRF as described in Section 6.2.
6.1 Relating Hidden & Visible Parts via Symmetries
We establish symmetric relationships between hidden and visible
parts of the object by leveraging the regularities of the stock model
X. However, we identify the hidden and visible areas using the aligned and corrected model X̄^Θ. To do so, we note that a point X_s on the stock model is related to the texture map and to a corresponding point X̄_s^Θ on the corrected model through barycentric coordinates over a triangle mesh specified on the stock model. We compute X̄_s^Θ on the corrected model using the barycentric coordinates of X_s. We then use pose Θ of the object to determine the set
of indices Iv for vertices visible from the camera viewpoint (shown
as textured parts of the banana in Figure 5(a)), and the set of indices
Ih = I \Iv for vertices hidden from the camera viewpoint (shown
in green in Figure 5(a)). While it is possible to pre-compute sym-
metry planes on an object, our objective is to relate visible parts
of an object to hidden parts through symmetries, many of which
turn out to be approximate (for instance, different parts of a banana
with approximately similar curvature are identified as symmetries
using our approach). Pre-computing all such possible symmetries
is computationally prohibitive. We proceed iteratively, and in each
iteration, we compute a symmetric relationship between Ih and
Iv using the stock model X. Specifically, we compute planes of
symmetry, shown as planes π1 and π2 in Figures 5(a) and 5(b).
Through this symmetric relationship, we associate appearance can-
didates (P̃_i,s, δ̃_i,s) to each vertex X_s in the graph by growing out
layers of appearance candidates.
Algorithm 1 details the use of symmetries to associate appearance candidates P̃_i,s and δ̃_i,s through layers. The algorithm initializes the appearance candidates at the first layer for vertices of the object in parts visible in the photograph using the appearance P and δ computed in Section 5. Over M iterations, the algorithm uses M planes of symmetry to populate the graph from layers 1 to L = 2^M. The algorithm uses RANSAC to compute the optimal plane of symmetry π_m at the m-th iteration. Planes π1 and π2 computed for iterations m = 1 and m = 2 are shown in Figures 5(a) and 5(b). At the m-th iteration, the algorithm then generates 2^(m−1) new layers, by transferring appearance candidates from the first 2^(m−1) layers across the plane of symmetry. In Figure 5(b), Layer 2 is generated by transferring appearance candidates from Layer 1 across π1. Using plane π2 obtained in iteration m = 2, the algorithm generates Layers 3 and 4 (in Figures 5(c) and 5(d)) from Layers 1 and 2. The first 2^(m−1) layers are grown in the previous m − 1 iterations. The algorithm uses the criteria of geometric symmetry (captured by the distance ‖ (I3 − 2n_m n_m^T)X_s + 2n_m d_m − X_t⋆ ‖ between the reflection of X_s across π_m and the point X_t⋆ in vertex space) and appearance similarity (represented by the distance ‖P_s − P_t⋆‖ in the image space of the texture map) to compute the symmetry plane π_m and to transfer appearance candidates. For vertices for which
no appearance candidates can be generated using geometric sym-
metry and appearance similarity, as in the case of the underside of
the taxi cab in Figure 1, the algorithm defaults to the stock model
appearance P (for which appearance difference δ is 0).
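The RANSAC search described above (and listed in Algorithm 1 below) can be sketched as follows. The use of a k-d tree, the array layout, and the stock-model reflectance P as the appearance cue are assumptions made for illustration; thresholds µ, ν and the iteration count R follow Section 8.

import numpy as np
from scipy.spatial import cKDTree

def ransac_symmetry_plane(X, P, visible, hidden, R=5000, mu=0.01, nu=0.001, rng=None):
    """X: fine-mesh vertices; P: per-vertex stock reflectance; visible/hidden: index arrays."""
    rng = np.random.default_rng() if rng is None else rng
    tree = cKDTree(X[visible])
    best = (0, None, None)                                  # (score, normal, offset)
    for _ in range(R):
        s, t = X[rng.choice(len(X))], X[rng.choice(hidden)]
        if np.allclose(s, t):
            continue
        n = (s - t) / np.linalg.norm(s - t)                 # bisector-plane normal
        d = 0.5 * n @ (s + t)                               # bisector-plane offset
        refl = X[hidden] - 2.0 * (X[hidden] @ n - d)[:, None] * n[None, :]
        dist, nn = tree.query(refl)                         # nearest visible vertex to each reflection
        inlier = (dist < mu) & (np.linalg.norm(P[hidden] - P[visible][nn], axis=1) < nu)
        if inlier.sum() > best[0]:
            best = (int(inlier.sum()), n, d)
    return best                                             # plane with the highest inlier count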
6.2 Completing Appearance via MRF
To obtain the completed appearance for the entire object from the
appearance candidates, shown in Figure 5(e), we need to select a set
of candidates such that (1) each vertex on the 3D model is assigned
exactly one candidate, (2) the selected candidates satisfy smooth-
ness and consistency constraints, and (3) visible vertices retain
their original appearance. To do this, we perform inference over
the MRF using tree-reweighted message passing (TRW-S) [Kol-
mogorov 2006]. While graph-based inference has been used to
complete texture in images [Kwatra et al. 2003], our approach uses
Algorithm 1 Associating Appearance Candidates Through Layers.
Set user-defined values for µ, ν, and M.
∀i ∈ {1, 2, · · · , 2^M} and ∀s ∈ I, set P̃_i,s ← ∞ and δ̃_i,s ← ∞.
Initialize I_c ← I_v and I_l ← I_h.
∀s ∈ I_v, set P̃_1,s ← P̄⋆_s and δ̃_1,s ← δ̄⋆_s.
for m = 1 to M do
  Initialize η_m ← 0, I_m ← ∅, n_m ← [0 0 0]^T, and d_m ← 0.
  RANSAC for the optimal plane of symmetry π_m:
  for r = 1 to R do
    Randomly select s_r ∈ I and t_r ∈ I_l.
    Compute the bisector plane π_r = [n_r^T −d_r]^T, where
      n_r ← (X_{s_r} − X_{t_r}) / ‖X_{s_r} − X_{t_r}‖ and d_r ← ½ n_r^T (X_{s_r} + X_{t_r}).
    I_r ← {s : s ∈ I_l ∧ ‖P_s − P_t⋆‖ < ν ∧ ‖ (I3 − 2n_r n_r^T)X_s + 2n_r d_r − X_t⋆ ‖ < µ},
      where t⋆ = arg min_{t ∈ I_c} ‖ (I3 − 2n_r n_r^T)X_s + 2n_r d_r − X_t ‖.
    if |I_r| > η_m then
      η_m ← |I_r|, I_m ← I_r, n_m ← n_r, and d_m ← d_r.
    end if
  end for
  Optimal plane of symmetry π_m = [n_m^T −d_m]^T.
  for i = 1 to 2^(m−1) do
    Set P̃_j,s ← P̃_i,t⋆ and δ̃_j,s ← δ̃_i,t⋆,
      where j = i + 2^(m−1), s ∈ I, ‖P_s − P_t⋆‖ < ν,
      ‖ (I3 − 2n_m n_m^T)X_s + 2n_m d_m − X_t⋆ ‖ < µ, and
      t⋆ = arg min_{t ∈ I_c} ‖ (I3 − 2n_m n_m^T)X_s + 2n_m d_m − X_t ‖.
  end for
  Update I_c ← I_c ∪ I_m and I_l ← I_l \ I_m.
end for
if |I_l| > 0 then
  ∀s ∈ I_l, set P̃_1,s ← P_s and δ̃_1,s ← 0.
end if
the layers obtained by geometric and appearance-based symmetries
from Section 6.1.
For every vertex on the 3D model X_s, s ∈ I, we find an assignment of reflectance values that optimizes the following objective:

{i⋆_1, . . . , i⋆_|I|} = arg min_{i_1,··· ,i_|I|} Σ_{s=1}^{|I|} Φ(P̃_{i_s}) + Σ_{s=1}^{|I|} Σ_{t∈K_s} Φ(P̃_{i_s}, P̃_{i_t}). (14)
The pairwise term in the objective function, Φ(·, ·) enforces neigh-
borhood smoothness via the Euclidean distance. Here Ks repre-
sents the set of indices for the k nearest neighbors of Xs. We
bias the algorithm to select candidates from the same layer using
a weighting factor of 0 < β < 1. This provides consistency of
texture. We use the following form for the pairwise terms:
Φ(P̃_{i_s}, P̃_{i_t}) =
  β ‖ P̃_{i_s} − P̃_{i_t} ‖₂²   if i_s = i_t,
  ‖ P̃_{i_s} − P̃_{i_t} ‖₂²     otherwise.  (15)
The unary term Φ(·) forces visible vertices to receive the reflectance computed in Section 5. We set the unary term for visible vertices at the first layer to ζ, where 0 < ζ < 1, and to 1 otherwise:

Φ(P̃_{i_s}) =
  ζ if s ∈ I_v and i_s = 1,
  1 otherwise.  (16)
We use the tree-reweighted message passing algorithm to perform the optimization in Equation (14). We use the computed assignment to obtain the reflectance values P̄⋆ for all vertices in the set I.
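The text above uses TRW-S for inference; purely to illustrate how the unary and pairwise terms of Equations (14)-(16) interact, the sketch below substitutes a much simpler iterated-conditional-modes (ICM) pass. It is not the inference method of the paper, and the shapes and parameter names are illustrative.

import numpy as np

def assign_candidates(cand, neighbors, visible, beta=0.005, zeta=0.01, iters=10):
    """cand: M x L x 3 appearance candidates; neighbors: list of k-NN index arrays;
    visible: boolean mask of length M marking visible vertices."""
    M, L, _ = cand.shape
    labels = np.zeros(M, dtype=int)                 # start every vertex on layer 1
    unary = np.ones((M, L))
    unary[visible, 0] = zeta                        # Equation (16): favor layer 1 when visible
    for _ in range(iters):
        for s in range(M):
            costs = unary[s].copy()
            for t in neighbors[s]:
                diff = np.sum((cand[s] - cand[t, labels[t]]) ** 2, axis=1)
                same_layer = np.arange(L) == labels[t]
                costs += np.where(same_layer, beta * diff, diff)   # Equation (15)
            labels[s] = int(np.argmin(costs))
    return labels                                   # candidate index chosen per vertex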
Figure 6: Top row: 3D manipulation of an origami crane. We show the corrected geometry for the crane in the original photograph, and the
estimated illumination, missing parts, and final output for a manipulation. Bottom row: Our approach uses standard animation software to
create realistic animations such as the flying origami crane. Photo Credits: ©Natasha Kholgade.
We build an analogous graph for the appearance difference δ̄⋆, and solve an analogous optimization to find its assignment. The assign-
ment step covers the entire object with completed appearance. For
areas of the object that do not satisfy the criteria of geometric sym-
metry and appearance similarity, such as the underside of the taxi
cab in Figure 1, the assignment defaults to the stock model appear-
ance. The assignment also defaults to the stock model appearance
when after several iterations, the remaining parts of the object are
partitioned into several small areas where the object lacks structural
symmetries relative to the visible areas. In this case, we allow the
user to fill the appearance in these areas on the texture map of the
3D model using PatchMatch [Barnes et al. 2009].
7 Final Composition
The user manipulates the object pose from Θ to Ω. Given the cor-
rected geometry, estimated illumination, and completed appearance from Sections 4 to 6, we create the result of the manipulation J by replacing Θ with Ω in Equation (1). We use ray-tracing to render each pixel according to Equation (8), using the illumination L̄⋆, geometry X̄, and reflectance P̄⋆. We add the appearance difference δ̄⋆ to the rendering to produce the final pixels on the manipulated object. We render pixels for the object and the ground using this
method, while leaving the rest of the photograph unchanged (such
as the corridor in Figure 6). To handle aliasing, we perform the
illumination estimation, appearance completion, and compositing
using a super-sampled version of the photograph. We filter and
subsample the composite to create an anti-aliased result.
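The compositing step amounts to rendering at a super-sampled resolution, masking in the object and ground pixels, and filtering down; a minimal sketch with assumed inputs is shown below, using a plain box filter for the final subsampling (the ray-traced image itself is a placeholder produced elsewhere).

import numpy as np

def composite(background_hi, render_hi, mask_hi, factor=4):
    """background_hi, render_hi: (fH, fW, 3) super-sampled images; mask_hi: (fH, fW)
    boolean selecting object and ground pixels; factor: super-sampling rate f."""
    out = np.where(mask_hi[..., None], render_hi, background_hi)    # keep untouched pixels
    H, W = out.shape[0] // factor, out.shape[1] // factor
    out = out[:H * factor, :W * factor]
    return out.reshape(H, factor, W, factor, 3).mean(axis=(1, 3))   # filter and subsample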
8 Results
We perform a variety of 3D manipulations to photographs, such as
rigid manipulation of photographed objects, deformation, and 3D
copy-paste. We use a separately captured background photograph
for the chair, while for all other photos, we fill the background using
Content-Aware Fill in Photoshop. The top row of Figure 6 shows
the aligned geometry, estimated illumination, missing parts, and fi-
nal result of a manipulation performed to an origami crane. Though
the illumination may not be accurate, in combination with the cor-
rected reflectance, it produces plausible results such as the shadows
of the wing on the hand and illumination changes on the crane. Our
approach can directly be tied to standard modeling and animation
software to create realistic animations from a single photograph,
such as the flying origami crane in the bottom row of Figure 6. Fig-
ure 7 shows the result of our approach on the laptop from Zheng et
al. [2012]. Unlike their approach, we reveal the hidden logo using
the stock 3D model while maintaining plausible illumination over
the object and the contact surface.
In Figure 13, we show 3D manipulations such as rotation, trans-
lation, copy-paste, and deformation on the chair from Figure 2, a
pen, a subject’s watch, some fruit, a painting, a car on a cliff, a
top hat, and a historical photograph of airplanes shot during World
War II. As shown in the second column, our approach repli-
cates the original photograph in the first column to provide seam-
less transition from the original view to new manipulations of the
object. We show intermediate outputs in supplementary material.
Our approach allows users to create dynamic compositions such as
the levitating chair, the watches flying about the subject, and the
falling fruit in Figure 13, and the taxi jam from Figure 1. Users can
add creative effects to the photograph, such as the pen strokes on
the paper, in conjunction with 3D manipulations. The illumination
estimated in Section 5 can be used to plausibly relight new 3D ob-
jects inserted into the scene as shown in Figure 12. We also perform
non-rigid deformations to objects using the geometry correction ap-
proach of Section 4, such as converting the top hat in Figure 13 into
a magician’s hat. Through our approach, the user changes the story
of the car photograph in Figure 13 from two subjects posing next to
a parked car to them watching the car as it falls off the cliff.
Figure 11 shows an example where the taxi-cab at the top of the
photograph in Figure 1 is manipulated using 3D models of three dif-
ferent cars. Accurate alignment is obtained for the first two models.
Our appearance completion algorithm determines plausible appear-
ance, even when the stock 3D model of the second car deviates from
the original photograph. For significantly different geometry, such
as that of the third car, alignment may be less accurate, leading to
artifacts such as transfer of appearance from the body to the wind-
shield or the wheels. However, the approach maintains plausible
illumination with similar environment maps, as the geometry of the
3D models provides sufficient cues for diffuse illumination.
While our approach is designed for digital photographs, it pro-
vides plausible results for vintage photographs such as the historical
World War II photograph in Figure 13, and for non-photorealistic media such as the painting of vegetables in Figure 13. The appear-
Figure 7: We perform a 3D rotation of the laptop in a photograph from Zheng et al. [2012] (Copyright: ©ACM). Unlike their approach, we
can reveal the hidden cover and logo of the laptop.
(a) Photograph and 3D Stock Model; (b) Alignment using Xu et al. [2011]; (c) Geometry Correction with Our Approach
Figure 8: Comparison of geometry correction by our approach against the alignment of Xu et al. [2011]. As shown in the insets, Xu et al.
do not align the leg and the seat accurately. Through our approach, the user accurately aligns the model to the photograph.
ance estimation described in Section 5 estimates grayscale appear-
ances for the black-and-white photograph of the airplanes. Using
our approach, the user manipulates the airplanes to pose them as
if they were pointing towards the camera, an effect that would be
nearly impossible to capture in the actual scene. In the case of
paintings, our approach maintains the style of the painting and the
grain of the sheet, by transferring these through the fine-scale detail
difference in the appearance completion described in Section 6.
Geometry Evaluation. Figure 8 shows results of geometry cor-
rection through our user-guided approach compared to the semi-
automated approach of Xu et al. [2011] on the chair. In the sys-
tem of Xu et al., the user input involves seeding a graph-cut seg-
mentation algorithm with a bounding box, and rigidly aligning the
model. Their approach automatically segments the photographed
object based on connected components in the model, and deforms
the 3D model to resemble the photograph. Their approach approx-
imates the form of most of the objects. However, as shown by the
insets in Figure 8(b), it fails to exactly match the model to the pho-
tograph. Through our approach (shown in Figure 8(c)), users can
accurately correct the geometry to match the photographed objects.
Supplementary material shows comparisons of our approach with
Xu et al. for photographs of the crane, taxi-cab, banana, and mango.
Illumination Evaluation. To evaluate the illumination estima-
tion, we captured fifteen ground truth photographs of the chair from
Figure 2 in various orientations and locations using a Canon EOS
5d Mark II Digital SLR camera mounted on a tripod and fitted with
an aspheric lens. The photographs are shown in the supplementary
material. We also capture a photograph of the scene without the
object to provide a ground-truth background image. We aligned
the 3D model of the chair to each of the fifteen photographs us-
ing our geometry correction approach from Section 4, and eval-
uated our illumination and reflectance estimation approach from
Section 5 against a ground truth light probe, and three approaches:
(1) Haar wavelets with positivity constraints on coefficients [Haber et al. 2009],
(2) L1-sparse high frequency illumination with spherical harmon-
ics for low-frequency illumination [Mei et al. 2009], and (3) envi-
ronment map completion using projected background [Khan et al.
2006]. We fill the parts of the Khan et al. environment map not seen
in the image using the PatchMatch algorithm [Barnes et al. 2009].
For all experiments, we estimated the illumination using Photo-
graph 1 (shown at the left of Figure 10(a)), and synthesized images
corresponding to the poses of the chair in Photographs 2 to 15 by
applying the estimated illumination to the geometry of each pose.
The synthesized images contain all steps of our approach except
appearance completion on the chair. We computed mean-squared
reconstruction errors between Photographs 2 to 15 and their corre-
sponding synthesized images for three types of subregions in the
photograph: object and ground, ground only, and object only. As
a baseline, we obtained the ground truth illumination by captur-
ing a high dynamic range (HDR) image of a light probe (a 4-inch
chrome sphere inspired by Debevec [1998]). Figure 9 shows the
mean-squared reconstruction error (MSE) for the von Mises-Fisher
basis compared against Haber et al., Mei et al., Khan et al., and
the light probe, for increasing numbers of basis components (i.e.
K). We used the same discretization for the sphere as the num-
ber of components. The number of basis components is a power
of two to ensure that the Haar wavelet basis is orthonormal. We
used λ1 = .01 and λ2 = 1 for our method, a regularization weight
of 1 for the approach of Haber et al., and a regularization weight
of .01 for the method of Mei et al. The reconstruction error for
Haar wavelets increases with increasing number of basis compo-
nents, since in attempting to capture high frequency information
(such as the edges of shadows), they introduce artifacts of negative
illumination such as highlights.
Figure 10 shows environment maps for the light probe, the back-
ground image projected out according to Khan et al., and the three
approaches to estimate illumination. The second and third rows of
Figure 10 show images of the chair in Photograph 2 synthesized
using the light probe, the background image, and the three illumi-
nation estimation approaches with K = 4096. The ground truth
for Photograph 2 is shown at the top-left. As shown in the figure,
the approach of Khan et al. does not capture the influence of ceil-
ing lights which are not observed in the original photograph. The
Haar wavelets’ approach introduces highlights due to negative light,
while the approach of Mei et al. generates sharp cast shadows. Our
approach produces smooth cast shadows and captures the effect of
lights not seen in the original photograph. Additional examples can
be found in the supplementary material.
Geometry Correction Timings. We evaluated the time taken for
four users to perform geometry correction of 3D models using our
user-guided approach. Table 1 shows the times (in minutes) to align
various 3D models to photographs used in this paper. The time
Figure 9 panels: MSE (×10⁻⁵) versus number of basis components (10² to 10⁴) for Object & Ground, Ground Only, and Object Only; curves: vMF, Haar, SpH+L1, Bkgnd, LightProbe.
Figure 9: Plots of mean-squared reconstruction error (MSE) versus number of basis components for the vMF basis in green (method used
in this paper) compared to the Haar basis [Haber et al. 2009], spherical harmonics with L1 prior (Sph+L1) [Mei et al. 2009], background
image projected (Bkgnd) [Khan et al. 2006], and a light probe, on the object and the ground, ground only, and object only.
Table 1: Times taken (minutes) to align 3D models to photographs.
User Banana Mango Top hat Taxi Chair Crane
1 12.43 7.10 8.72 16.09 45.17 32.22
2 6.33 3.08 10.08 7.75 20.32 14.92
3 2.23 2.42 4.93 2.57 6.12 22.07
4 5.63 5.13 6.17 7.52 8.53 19.03
spent on aligning is directly related to the complexity of the model
in terms of the number of connected components. Users spent the
least amount of time on aligning the mango, banana, and top hat
models, as these models consist of a single connected component
each. Though the taxi consists of 24 connected components, its
times are comparable to the simpler models, as the stock model
geometry is close to the photographed taxi. Long alignment times
of the chair and origami crane are due to the number of connected
components (3 and 8 respectively), and the large disparity between
their stock models and their photographs. We show user alignment
examples in the supplementary material. The supplementary video
shows a session to correct the banana model using the tool.
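As a quick check on the trends discussed above, the per-model averages can be computed directly from the timings reported in Table 1. The short script below is only an illustration; the numbers are copied verbatim from the table.

```python
import statistics

# Alignment times in minutes, copied from Table 1 (one list entry per user).
times = {
    "Banana":  [12.43, 6.33, 2.23, 5.63],
    "Mango":   [7.10, 3.08, 2.42, 5.13],
    "Top hat": [8.72, 10.08, 4.93, 6.17],
    "Taxi":    [16.09, 7.75, 2.57, 7.52],
    "Chair":   [45.17, 20.32, 6.12, 8.53],
    "Crane":   [32.22, 14.92, 22.07, 19.03],
}

for model, t in times.items():
    print(f"{model:8s} mean {statistics.mean(t):6.2f} min, "
          f"median {statistics.median(t):6.2f} min")
```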
User Study. We evaluated the perceived realism of 3D object ma-
nipulation in photographs through a two-alternative forced choice
user study. We asked participants to compare and choose the more
realistic image between the original photograph and an edited result
produced using our approach. We recruited 39 participants, mostly graduate students in computer science, and conducted the study using a web-based survey. Each participant
viewed four image pairs. Each image pair presented a choice be-
tween an original photograph and one of two ‘edited’ conditions:
either the final result of our approach (Condition 1), or an interme-
diate result also produced using our method (Condition 2). Specifi-
cally, Condition 2 was generated using the corrected geometry, illu-
mination, and surface reflectance, but without using the appearance
difference. The presentation order was randomized, and each user
saw each of the original photographs once. Across all participants,
we obtained a total of 80 responses for the original photo/final result image pairs, and 76 responses for the original photo/intermediate output image pairs. 71.05% of the time (i.e., on 54 of 76 image
pairs), participants chose the original photograph as being more re-
alistic than Condition 2. This preference was found to be statis-
tically significant: we used a one-sample, one-tailed t-test, which
rejected the null hypothesis that people do not prefer the original
photograph over Condition 2 at the 5% level (p < 0.001). 52.5%
of the time (i.e., on 42 of 80 image pairs), the participants chose
Condition 1 over the original photograph. The difference was not
found to be significant at the 5% level. In some cases, participants
reported choosing the more realistic image based on plausible con-
figurations of the object, e.g., an upright well-placed chair as op-
posed to one fallen on the ground. Images used are shown in the
supplementary material.
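The reported test can be reproduced from the stated counts. The sketch below follows the description above (a one-sample, one-tailed t-test of binary preference indicators against chance, 0.5); it is an illustration rather than the analysis script used for the study.

```python
import numpy as np
from scipy import stats

def afc_ttest(n_chosen, n_total):
    """One-sample, one-tailed t-test of binary choices against chance (0.5)."""
    choices = np.r_[np.ones(n_chosen), np.zeros(n_total - n_chosen)]
    return stats.ttest_1samp(choices, popmean=0.5, alternative="greater")

# Original photograph preferred over Condition 2: 54 of 76 image pairs.
print(afc_ttest(54, 76))  # strongly significant, consistent with p < 0.001
# Condition 1 preferred over the original photograph: 42 of 80 image pairs.
print(afc_ttest(42, 80))  # not significant at the 5% level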
Parameters. During geometry correction, the user can turn on
symmetry by setting w = 1 or turn it off by setting w = 0. For
illumination extraction and appearance correction, we set the illu-
mination parameters λ1 and λ2 to small values for smooth shadows
due to dispersed illumination, i.e., λ1 = .001 to .01, λ2 = .1 to 1.
For strong directed shadows due to sparse illumination concentrated
in one region, we set high values: λ1 = 1 to 5, λ2 = 10 to 20. To emphasize ground shadows, we set τ = .01 to .25. The value of K
is set to 2500. We set λ3 = .5 to emphasize piecewise constancy of
the appearance deviations. In the appearance completion section,
we set the number of nearest neighbors k for all Ks to 5. We set M
to 1 for rigid objects such as the chair, the laptop, and the cars, and
to 7 for flexible objects such as fruit. We set β = .005 to emphasize
consistency of appearance within a layer, and we set ζ to a small
value (.01) to force visible vertices to be assigned the observed ap-
pearance. For all RANSAC operations, we set R = 5000, µ = .01
times the bounding box of the object, and ν = .001. We vary the
geometric symmetry tolerance µ from .01 to .1 times the bounding
box for increasingly approximate symmetries. The parameters we vary most in our algorithm are λ1, λ2, and τ. Parameter settings for the examples used in the paper can be found in the supplementary material.
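For reference, the defaults and ranges listed above can be gathered into a single configuration object, as in the sketch below; the dataclass and its field names are hypothetical and simply mirror the symbols used in the text.

```python
from dataclasses import dataclass

@dataclass
class EditingParams:
    # Illumination priors: small values give smooth shadows from dispersed
    # lighting; strong directed shadows use lambda1 = 1-5, lambda2 = 10-20.
    lambda1: float = 0.01
    lambda2: float = 1.0
    tau: float = 0.1           # ground-shadow emphasis, in the range 0.01-0.25
    K: int = 2500              # number of vMF basis components / sphere directions
    lambda3: float = 0.5       # piecewise constancy of appearance deviations
    k_neighbors: int = 5       # nearest neighbors in appearance completion
    M: int = 1                 # 1 for rigid objects, 7 for flexible objects
    beta: float = 0.005        # consistency of appearance within a layer
    zeta: float = 0.01         # bias visible vertices to the observed appearance
    R: int = 5000              # RANSAC iterations
    mu: float = 0.01           # symmetry tolerance, as a fraction of the bounding box
    nu: float = 0.001          # RANSAC threshold ν
    w: int = 1                 # 1 to enable the bilateral symmetry prior, 0 to disable

# Example: settings for sparse, strongly directed illumination.
params = EditingParams(lambda1=1.0, lambda2=10.0)
```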
9 Discussion
We present an approach for performing intuitive 3D manipulations
on photographs, without requiring users to access the original scene
or the moment when the photograph was taken. 3D manipulation greatly expands the repertoire of creative manipulations that artists
can perform on photographs. For instance, controlling the com-
position of dynamic events (such as Halsman’s Dalí Atomicus) is
particularly challenging for artists. Manipulation of photographed
objects in 3D will allow artists far greater freedom to experiment
with many different compositions. Through our approach, artists
can conveniently create stop motion animations that defy physical
constraints. While we have introduced our approach to improve
the experience of editing photographs, it can be used to correct the
geometry and appearance of 3D models in public repositories.
The fundamental limitation of our approach is related to sampling.
We may lack pixel samples for a photographed object if it is small in image space, in which case manipulating it closer to the camera can produce a blurred result. However, as cameras today offer resolutions of tens of megapixels, this problem is much less of an issue, and it can be addressed with final touch-ups in 2D.
Failures can occur if an object is photographed with the camera’s
look-at vector perpendicular to the normal of the bilateral plane of
symmetry, e.g., if a wine-bottle is photographed from the top. In
these cases, symmetry constraints are difficult to exploit, and the
completion has to rely heavily on the texture provided with the 3D
model. Another class of failures occurs when the illumination model fails to account for certain lighting effects; the algorithm then attempts to explain the residual during appearance completion, which can cause lighting effects to appear as texture artifacts.
tion, not its actual 3D structure.
Our goal in this paper is to allow users to seamlessly perform 3D
manipulation of objects in a single consumer photograph with the
realism and convenience of Photoshop. Instead of simply editing
‘what we see’ in the photograph, our goal is to manipulate ‘what
we know’ about the scene behind the photograph [Durand 2002].
3D manipulation of essentially a 2D object sprite is highly under-
constrained as it is likely to reveal previously unobserved areas of
the object and produce new, scene-dependent shading and shadows.
One way to achieve a seamless ‘break’ from the original photograph
is to recreate the scene in 3D in the software’s internal representa-
tion. However, this operation requires significant effort, that only
large special effects companies can afford. It typically also involves
external scene data such as light probes, multiple images, and cali-
bration objects, not available with most consumer photographs.
Instead, in this paper, we constrain the recreation of the scene’s 3D
geometry, illumination, and appearance from the 2D photograph
using a publicly available 3D model of the manipulated object as
a proxy. Graphics is now entering the age of Big Visual Data:
enormous quantities of images and video are uploaded to the In-
ternet daily. With the move towards model standardization and the
use of 3D scanning and printing technologies, publicly available
3D data (modeled or scanned using 3D sensors like the Kinect) are
also readily available. Public repositories of 3D models (e.g., 3D
Warehouse or Turbosquid) are growing rapidly, and several Inter-
net companies are currently in the process of generating 3D models
for millions of merchandise items such as toys, shoes, clothing, and
household equipment. It is therefore increasingly likely that for
most objects in an average user photograph, a stock 3D model will
soon be available, if it is not already.
However, it is unreasonable to expect such a model to be a per-
fect match to the depicted object—the visual world is too varied to
ever be captured perfectly no matter how large the dataset. There-
fore, our approach deals with several types of mismatch between
the photographed object and the stock 3D model:
Geometry Mismatch. Interestingly, even among standard, mass-
produced household brands (e.g., detergent bottles), there are often
subtle geometric variabilities as manufacturers tweak the shape of
their products. Of course, for natural objects (e.g., a banana), the
geometry of each instance will be slightly different. Even in the
cases when a perfect match could be found (e.g., a car of a specific
make, model, and year), many 3D models are created with artistic
license and their geometry will likely not be metrically accurate, or
there are errors due to scanning.
Apperance Mismatch. Although both artists and scanning tech-
niques often provide detailed descriptions of object appearance
(surface reflectance), these descriptions may not match the colors
and textures (and aging and weathering effects) of the particular
instance of the object in the photograph.
Illumination Mismatch. To perform realistic manipulations in
3D, we need to generate plausible lighting effects, such as shadows
on an object and on contact surfaces. The environment illumination
that generates these effects is not known a priori, and the user may
not have access to the original scene to take illumination measure-
ments (e.g., in dynamic environments or for legacy photographs).
Our approach uses the pixel information in visible parts of the
object to correct the three sources of mismatch. The user semi-
automatically aligns the stock 3D model to the photograph using a
real-time geometry correction interface that preserves symmetries
in the object. Using the aligned model and photograph, our ap-
proach automatically estimates environment illumination and ap-
pearance information in hidden parts of the object. While a photo-
graph and 3D model may still not contain all the information needed
to precisely recreate the scene, our approach sufficiently approxi-
mates the illumination, geometry, and appearance of the underlying
object and scene to produce plausible completion of uncovered ar-
eas. Indeed, as shown by the user study in Section 8, our approach
plausibly reveals hidden areas of manipulated objects.
The ability to manipulate objects in 3D while maintaining realism
greatly expands the repertoire of creative manipulations that can
be performed on a photograph. Users are able to quickly perform
object-level motions that would be time-consuming or simply im-
possible in 2D. For example, from just one photograph, users can
cause grandma’s car to perform a backflip, and fake a baby lifting a
heavy sofa. We tie our approach to standard modeling and anima-
tion software to animate objects from a single photograph. In this
way, we re-imagine typical Photoshop edits—such as object rota-
tion, translation, rescaling, deformation, and copy-paste—as object
manipulations in 3D, and enable users to more directly translate
what they envision into what they can create.
Contributions. Our key contribution is an approach that allows
out-of-plane 3D manipulation of objects in consumer photographs,
while providing a seamless break from the original image. To do
so, our approach leverages approximate object symmetries and a
new non-parametric model of image-based lighting for appearance
completion of hidden object parts and for illumination-aware com-
positing of the manipulated object into the image. We make no as-
sumptions on the structure or nature of the object being manipulated
beyond the fact that an approximate stock 3D model is available.
Assumptions. In this paper, we assume a Lambertian model of
illumination. We do not model material properties such as refrac-
tion, specularities, sub-surface scattering, or inter-reflection. The
user study discussed in Section 8 shows that while for some objects,
these constraints are necessary to produce plausible 3D manipula-
tions, if their effects are not too pronounced, the results can be per-
ceptually plausible without explicit modeling. In addition, we as-
sume that the user provides a stock 3D model with components for
all parts of the objects visible in the original photograph. Finally,
we assume that the appearance of the 3D model is self-consistent,
i.e., the precise colors of the stock model need not match the pho-
tograph, but appearance symmetries should be preserved. For in-
stance, the cliffhanger in Figure 13 is created using the 3D model
of a blueish-grey Audi A4 (shown in the supplementary material)
to manipulate the green Rover 620 Ti in the photograph.
Notation. For the rest of the paper, we refer to known quanti-
ties without using the overline notation, and we use the overline
notation for unknown quantities. For instance, the geometry and
appearance of the stock 3D model known a priori are referred to
as X and T respectively. The geometry and appearance of the 3D
model modified to match the photograph are not known a priori and
are referred to as X and T respectively. Similarly, the illumination
environment which is not known a priori is referred to as L.
2 Related Work
Modern photo-editing software such as Photoshop provides sophis-
ticated 2D editing operations such as content-aware fill [Barnes
et al. 2009] and content-aware photo resizing [Avidan and Shamir
2007]. Many approaches provide 2D edits using low-level assump-
tions about shape and appearance [Barrett and Cheney 2002; Fang
and Hart 2004]. The classic work of Khan et al. [2006] uses insights
from human perception to edit material properties of photographed
objects, to add transparency, translucency, and gloss, and to change
object textures. Goldberg et al. [2012] provide data-driven tech-
niques to add new objects or manipulate existing objects in images
in 2D. While these techniques can produce surprisingly realistic
results in some cases, their lack of true 3D limits their ability to
perform more invasive edits, such as 3D manipulations.
The seminal work of Oh et al. [2001] uses depth-based segmen-
tation to perform viewpoint changes in a photograph. Chen et
al. [2011] extend this idea to videos. These methods manipulate
visible pixels, and cannot reveal hidden parts of objects. To address
these limitations, several methods place prior assumptions on pho-
tographed objects. Data-driven approaches [Blanz and Vetter 1999]
provide drastic view changes by learning deformable models, how-
ever, they rely on training data. Debevec et al. [1996] use the reg-
ular symmetrical structure of architectural models to reveal novel
views of buildings. Kopf et al. [2008] use georeferenced terrain
and urban 3D models to relight objects and reveal novel viewpoints
in outdoor imagery. Unlike our method, Kopf et al. do not remove
the effects of existing illumination. While this works well outdoors,
it might not be appropriate in indoor settings where objects cast soft
shadows due to area light sources.
Approaches in proxy-based modeling of photographed objects in-
clude cuboid proxies [Zheng et al. 2012] and 3-Sweep [Chen et al.
2013]. Unlike our approach, Zheng et al. and Chen et al. (1) cannot
reveal hidden areas that are visually distinct from visible areas, lim-
iting the full range of 3D manipulation (e.g., the logo of the laptop
from Zheng et al. that we reveal in Figure 7, the underside of the
taxi cab in Figure 1, and the face of the wristwatch in Figure 13),
(2) cannot represent a wide variety of objects precisely, as cuboids
(Zheng et al.) or generalized cylinders (Chen et al.) cannot handle
highly deformable objects such as backpacks, clothing, and stuffed
toys, intricate or indented objects such as the origami crane in Fig-
ure 6 or a pair of scissors, or objects with negative space such as
cups, top hats, and shoes, and (3) cannot produce realistic shad-
ing and shadows (e.g. in the case of the wristwatch, the top hat,
the cliffhanger, the chair, and the fruit in Figure 13, the taxi cab in
(b) Selected Stock 3D Model (d) 3D Object Manipulation (f) Manipulated Object Geometry (j) Final Result
z
y
x
(a) Original Photograph (e) Geometry Correction (g) Illumination Estimation (i) Appearance Completion
(h) Manipulated Object Illumination
2D Point
Corrections Mask
(c) User Input
Figure 2: Overview: (a) Given a photograph and (b) a 3D model from an online repository, the user (c) interactively aligns the model to the
photograph and provides a mask for the ground and shadow, which we augment with the object mask and use to fill the background using
PatchMatch [Barnes et al. 2009]. (d) The user then performs the desired 3D manipulation. (e) Our approach computes the camera and cor-
rects the 3D geometry, and (f) reveals hidden geometry during the 3D manipulation. (g) It automatically estimates environment illumination
and reflectance, (h) to produce shadows and surface illumination during the 3D manipulation. (i) Our approach completes appearance to
hidden parts revealed during manipulation, and (j) composites the appearance with the illumination to obtain the final photograph.
Figure 1, and the crane in Figure 6) to contribute to perceptually
plausible 3D manipulation. Zheng et al. use a point light source
that does not capture the effect of several area light sources (e.g.,
in a typical indoor environment). Chen et al. provide no explicit
representation of illumination.
An alternative to our proposed method of object manipulation is
object insertion, i.e., to inpaint the photographed object and replace
it with an inserted object, either in 2D from a large ‘photo clip art’
library [Lalonde et al. 2007], or in 3D [Debevec 1998; Karsch et al.
2011]. The classic approach of Debevec [1998] renders synthetic
3D objects into the photograph of a real scene by using illumination
captured with a mirrored sphere. Karsch et al. [2011] remove the
requirement of physical access to the scene by estimating geometry
and illumination from the photograph. However, such an insertion-
based approach discards useful information about environment illu-
mination and appearance contained in object pixels in the original
photograph. In contrast, our approach aims to utilize all information
in the original object pixels to estimate the environment illumina-
tion and appearance from the input photograph. Furthermore, when
creating videos, object insertion methods are unlikely to produce a
seamless break from the original photograph, as peculiarities of the
particular instance that was photographed (e.g., smudges, defects,
or a naturally unique shape) will not exist in a stock 3D model.
Our approach is related to work in the area of geometry alignment,
illumination estimation, texture completion, and symmetry detec-
tion. In the area of geometry alignment, there exist automated or
semi-automated methods [Xu et al. 2011; Prasad et al. 2006; Lim
et al. 2013]. In general, as shown by the comparison to Xu et
al. [2011] in Section 8, they fail to provide exact alignment crucial
for seamless object manipulation. To estimate illumination, we use
a basis of von Mises-Fisher kernels that provide the advantage of
representing high frequency illumination effects over the classical
spherical harmonics representation used in Ramamoorthi and Han-
rahan [2001], while avoiding unnaturally sharp cast shadows that
arise due to the point light representation used in Mei et al. [2009].
In addition, we impose non-negativity constraints on the basis coef-
ficients that ensure non-negativity of illumination, in contrast to the
Haar wavelet basis used in Ng et al. [2003], Okabe et al. [2004],
Haber et al. [2009] and Romeiro et al. [2010]. Mixtures of von
Mises-Fisher kernels have been estimated for single view relight-
ing [Hara et al. 2008; Panagopoulos et al. 2009], however, these
require estimating the number of mixtures separately. Our appear-
ance completion approach is related to methods that texture map 3D
models using images [Kraevoy et al. 2003; Tzur and Tal 2009; Gal
et al. 2010], however, they do not factor out illumination, and may
use multiple images to obtain complete appearance. In using sym-
metries to complete appearance, our work is related to approaches
that extract symmetries from images and 3D models [Hong et al.
2004; Gal and Cohen-Or 2006; Pauly et al. 2005], and that use
symmetries to complete geometry [Terzopoulos et al. 1987; Mitra
et al. 2006; Mitra and Pauly 2008; Bokeloh et al. 2011], and to infer
missing appearance [Kim et al. 2012]. However, our work differs
from these approaches in that the approaches are mutually exclu-
sive: approaches focused on symmetries from geometry do not re-
spect appearance constraints, and vice versa. Our approach uses
an intersection of geometric symmetry and appearance similarity,
and prevents appearance completion between geometrically simi-
lar but visually distinct parts, such as the planar underside and top
of a taxi-cab, or between visually similar but geometrically distinct
parts such as the curved surface of a top-hat and its flat brim.
3 Overview
The user manipulates an object in a photograph as shown in Fig-
ure 2(a) by using a stock 3D model. For this photograph, the model
was obtained through a word search on the online repository Tur-
boSquid. Other repositories such as 3D Warehouse, (Figure 2(b))
or semi-automated approaches such as those of Xu et al. [2011],
Lim et al. [2013], and Aubry et al. [2014] may also be used. The
user provides a mask image that labels the ground and shadow pix-
els. We compute a mask for the object pixels, and use this mask
to inpaint the background using the PatchMatch algorithm [Barnes
et al. 2009]. For complex backgrounds, the user may touch up the
background image after inpainting. Figure 2(c) shows the mask
with ground pixels in gray, and object and shadow pixels in white.
The user semi-automatically aligns and corrects the stock 3D model
to match the photograph using our symmetry-preserving geometry
correction interface as shown in Figure 2(c). Using the corrected
3D model and the photograph pixels, our approach computes and
factors out the environment illumination (Figure 2(g)), and com-
pletes the appearance of hidden areas (Figure 2(i)). Users can then
perform their desired 3D manipulations as shown in Figure 2(d),
and the illumination, completed appearance, and texture are com-
posited to produce the final output, shown in Figure 2(j).
When a user manipulates an object in the photograph I ∈
RW×H×3, shown in Figure 2(a), from the object’s original pose
Θ to a new pose Ω as in Figure 2(d), our objective is to produce an
edited photograph J shown in Figure 2(j). Here W and H are the
width and height of the photograph. We model I as a function of
the object geometry X, object appearance T, and the environment
illumination L, as
I = f(Θ,X,T,L). (1)
The above equation essentially represents the rendering equation.
The manipulated photograph J can then be produced by replacing
the original pose Θ with the new pose Ω, i.e., J = f(Ω,X,T,L).
However, X, T, and L are not known a priori, and estimating them
from a single photograph without any prior assumptions is highly
ill-posed [Barron 2012].
We overcome this difficulty by bootstrapping the estimation using
the stock 3D model of the object, whose geometry consists of ver-
tices X, and whose appearance is a texture map T (Figure 2(b)). In
general, the stock model geometry and appearance do not precisely
match the geometry X and appearance T of the photographed ob-
ject. We provide a tool through which the user marks 2D point
corrections, shown in Figure 2(c). We deform X to match X using
the 2D corrections, as shown in Figure 2(e) and described in Sec-
tion 4. Additionally, we estimate the ground plane in 3D, using one
of two methods: either using vanishing points from user-marked
parallel lines in the image, or as the plane intersecting three user-
marked points on the base of the object. Ground plane estimation
is described in the supplementary material. Manually correcting
the illumination and the appearance is difficult, as the illumination
sources may not be visible in the photograph. Instead, we present an
algorithm to estimate the illumination and appearance using pixels
on the object and the ground, as shown in Figure 2(g) and described
in Section 5, by optimizing the following objective:{
T
?
,L
?
}
= arg min
T,L
∥∥I− f(Θ,X,T,L)∥∥2
2
. (2)
Equation (2) only estimates the appearance for parts of the object
that are visible in the original photograph I as shown in Figure 2(i).
The new pose Θ′ potentially reveals hidden parts of the object. To
produce the manipulated photograph J, we need to complete the
hidden appearance. After factoring out the effect of illumination on
the appearance in visible areas, we present an algorithm that uses
symmetries to complete the appearance of hidden parts from visible
areas as described in Section 6. The algorithm uses the stock model
appearance for hidden parts of objects that are not symmetric to vis-
ible parts. Given the estimated geometry, appearance, and illumi-
nation, and the user-manipulated pose of the object, we composite
the edited photograph by replacing Θ with Θ′ in Equation (1) as
shown in Figures 2(f), 2(h), and 2(j).
4 Geometry Correction
We provide a user-guided approach to correct the geometry of the
3D model to match the photographed object, while ensuring that
smoothness and symmetry are preserved over the model. Given the
photograph I and the stock model geometry X ∈ RN×3 (where
N is the number of vertices), we first estimate the original rigid
pose Θ of the object using a set A of user-defined 3D-2D corre-
spondences, Xj ∈ R3 on the model and xj ∈ R2, j ∈ A in the
photograph. Here, Θ = {R, t}, where R ∈ R3×3 is the object
rotation, and t ∈ R3 is the object translation. We use the EPnP al-
gorithm [Lepetit et al. 2009] to estimate R and t. The algorithm
takes as input Xj , xj , and the matrix K ∈ R3×3 of camera param-
eters (i.e., focal length, skew, and pixel aspect ratio). We assume
a zero-skew camera, with square pixels and principal point at the
photograph center. We use the focal length computed from EXIF
tags when available, else we use vanishing points to compute the
focal length. We assume that objects in the photograph are at rest
on a ground plane. We describe focal length extraction using van-
ishing points, and ground plane estimation in the supplementary
material. It should be noted that there exists a scale ambiguity in
computing t. The EPnP algorithm handles the scale ambiguity in
terms of translation along the z-axis of the camera.
As shown in Figure 3(a), after the camera is estimated, the user
provides a set B of start points xk ∈ R2, k ∈ B on the projection
of the stock model, and a corresponding set of end points xk ∈ R2
on the photographed object for the purpose of geometry correction.
We used a point-to-point correction approach, as opposed to sketch
or contour-based approaches [Nealen et al. 2005; Kraevoy et al.
2009], as reliably tracing soft edges can be challenging compared
to providing point correspondences. The user only provides the
point corrections in 2D. We use them to correct X to X in 3D by
optimizing an objective in X consisting of a correction term E1, a
symmetry prior E2, and a smoothness prior E3:
E(X) = E1(X) + E2(X) + E3(X). (3)
The correction term E1 forces projections of the points
X
Θ
k = RXk + t to match the user-provided 2D corrections
xk, k ∈ B. Here, XΘ represents X transformed by Θ. As shown in
Figure 3(b), we compute the ray vk = K−1[xTk 1]
T back-projected
through each xk. E1 minimizes the sum of distances between each
X
Θ
k and the projection
vkv
T
k
‖vk‖2 X
Θ
k of X
Θ
k onto the ray vk:
E1(X) =
∑
k∈B
∥∥∥∥∥XΘk − vkvTk‖vk‖22 XΘk
∥∥∥∥∥
2
2
. (4)
Unlike traditional rotoscoping, the correction term encourages the
vertex coordinates to match the photograph only after geometric
projection into the camera. The corrected vertices Xk are other-
wise free to move along the lines of projection such that the overall
deformation energy E(X) is minimized.
The smoothness prior E2 preserves local smoothness over the cor-
rected model. As shown in Figure 3(c), the term ensures that points
in the neigborhood of the corrected points Xk move smoothly. In
our work, this term refers to the surface deformation energy from
the as-rigid-as-possible framework of Sorkine and Alexa [2007].
The framework requires that local deformations within the 1-ring
neighborhoodDi of the ith point in the corrected model should have
nearly the same rotations Ri as on the original model:
E2(X) =
N∑
i=1
∑
j∈Di
∥∥(Xi −Xj)−Ri(Xi −Xj)∥∥22 . (5)
The local rotation Ri in the neighborhood of a vertex Xi is distinct
from the global rigid rotation R.
The symmetry prior E3 preserves the principal symmetry (or bilat-
eral symmetry) of the model, as shown in Figure 3(c). If a point
xk
xk
(a) User 2D Edit (b) Correction Along Ray in 3D (c) Smoothness and Symmetry Priors
vk vk
xk
xk
xk
xk
Camera Camera
X⇥k
X
⇥
k
X
⇥
sym(k)
X⇥sym(k)
X⇥k
X⇥sym(k)
Figure 3: Geometry correction. (a) The user makes a 2D cor-
rection by marking a start-end pair, (xk,xk) in the photograph.
(b) Correction term: The back-projected ray vk corresponding to
xk is shown in black, and the back-projected ray corresponding
to xk is shown in red. The top inset shows the 3D point XΘk for
xk on the stock model, and the bottom inset shows its symmetric
pair XΘsym(k). We deform the stock model geometry (light grey) to
the user-specified correction (dark grey) subject to smoothness and
symmetry-preserving priors.
on the stock model Xi has a symmetric counterpart Xsym(i), E3
ensures that Xi remains symmetric to Xsym(i), i.e., that they are
related through a symmetric transform,
S =
[
I3 − 2nnT 2nd
]
, (6)
where I3 is the 3×3 identity matrix, and n and d are the normal and
distance of the principal plane of symmetry in the corrected model
geometry X. E3 is thus given as
E3(X) = w
N∑
i=1
∥∥∥S[XTi 1]T −Xsym(i)∥∥∥2
2
. (7)
Here, w is a user-defined weight, that the user sets to 1 if the object
has bilateral symmetry, and 0 otherwise.
To determine Xsym(i) for every stock model point Xi, we compute
the principal plane pi = [nT − d]T on the stock model using
RANSAC1. We then reflect every Xi across pi, and obtain Xsym(i)
as the nearest neighbor to the reflection of Xi across pi.
The objective function in Equation (3) is non-convex. However,
note that given the symmetry S and local rotations Ri, the objective
is convex in the geometry X, and vice versa. We initialize Ri = I3,
where I3 is the 3× 3 identity matrix, and S with the original stock
model symmetry S. We alternately solve for the geometry, and the
symmetry and local rotations till convergence to a local minimum.
Given S and Ri, we solve for X by setting up a system of linear
equations. Given X, we solve for Ri through SVD as described
by Sorkine and Alexa [2007]. To solve for S, we assume that the
bilateral plane of symmetry passes through the center of mass of
the object (which we can assume without loss of generality to be
at the origin), so that d = 0. To obtain n, we note that the first
three columns of S (which we refer to as S3) form an orthogonal
matrix, which we extract using SVD, as follows: We create matrices
A = [X1, · · · ,XN ] and B = [Xsym(1) · · ·Xsym(N)], perform the
1At each RANSAC iteration, we randomly choose two points Xir and
Xjr on the stock 3D model, and compute their bisector plane pir with nor-
mal nr =
Xir−Xjr
‖Xir−Xjr‖2
, and distance from origin 1
2
nTr (Xir+Xjr ). We
maintain a score nr of the number of points that when reflected across pir
have a symmetric neighbor within a small threshold µ. After R iterations,
we retain the plane with maximum score as pi.
Light Map von Mises-Fisher Basis
= ↵1 +↵2 + . . .+ ↵K
Figure 4: We represent the environment map as a linear combina-
tion of the von Mises-Fisher (vMF) basis. We enforce constraints of
sparseness and grouping of basis coefficients to mimic area lighting
and produce soft cast shadows.
SVD decomposition of ABT as UΣVT , and extract S3 = VUT .
Then, we extract n as the principal eigenvector of the matrix I3−S3
2
.
We substitute n and d = 0 in Equation (6) to get S.
5 Illumination and Appearance Estimation
Given the corrected geometry X
Θ
, we estimate illumination L and
appearance T to produce plausible shadows and lighting effects on
the object and the ground. We represent the imaging function f
using a Lambertian reflection model. Under this model, the ith pixel
Ii on the object and the ground in the photograph is generated as
fi(Θ,X,T,L) = Pi
∫
Ω
ni · si(ω)vi(ω)L(ω)dω + δi, (8)
where Pi ∈ R3 and δi ∈ R3 model the appearance of the ith
pixel. We assume that the appearance T of the object consists of
a reflectance map P ∈ RU×V×3and a residual difference map
δ ∈ RU×V×3. Here, P models the diffuse reflectance of the
object’s appearance (also termed the albedo). Inspired by De-
bevec [1998]), we include δ to represent the residual difference
between the image pixels and the diffuse reflection model, since
the model is only an approximation to the BRDF of the object. U
and V are the dimensions of the texture map. For the ith pixel, Pi
and δi are interpolated from the maps P and δ at the point X
Θ
i .
X
Θ
i is the 3D point back-projected from the ith pixel location to
the object’s 3D geometry X as transformed by Θ. ni is the nor-
mal at point X
Θ
i , si(ω) is the source vector from X
Θ
i towards the
light source along the solid angle ω, and vi(ω) is the visibility of
this light source from X
Θ
i . L(ω) is the intensity of the light source
along ω. We assume that the light sources lie on a sphere, i.e., that
L(ω) is a spherical environment map.
To estimate these quantities, we optimize the following objective
function in P and L, consisting of a data term F1, an illumination
prior F2, and a reflectance prior F3:
F (P,L) = F1(P,L) + F2(L) + F3(P), (9)
and we obtain δ as the residual of the data term F1 in the objective
function. F1 represents the generation of pixels in a single photo-
graph using the illumination model from Equation 8 as follows:
F1(P,L) =
NI∑
i=1
Ï„i
∥∥∥∥Ii −Pi ∫
Ω
ni · si(ω)vi(ω)L(ω)dω
∥∥∥∥2
2
,
where Ï„i =
 1 for object pixelsand shadow pixels on ground,τ for non-shadow pixels on ground.
Here, NI is the number of pixels covered by the object and the
ground in the photograph. Ï„ corresponds to the value in the gray
region of the user-provided mask shown in Figure 2(c). Here,
0 ≤ τ ≤ 1, and a small value of τ in non-shadow areas of the
ground emphasizes cast shadows. F2 and F3 represent illumination
and reflectance priors that regularize the ill-posed optimization of
estimating P and L from a single photograph.
To describe the illumination prior F2, we represent L(ω) as a linear
combination of von Mises-Fisher (vMF) kernels, L(ω) = Γ(ω)α.
Γ ∈ RK is a functional basis, shown in Figure 4, and α is a vector
of basis coefficients. The k-th component of Γ corresponds to the
k-th vMF kernel, given by
h (u(ω);µk, κ) =
exp
(
κµk
Tu(ω)
)
4pi sinhκ
,
where µk is the k-th mean direction vector, u(ω) is a unit vector
along direction ω, and the concentration parameter κ describes the
peakiness of the distribution [Fisher 1953].
Through the illumination prior F2, we force the algorithm to find
a sparse set of light sources using an L1 prior on the coefficients.
In addition, according to the elastic net framework [Zou and Hastie
2005], we place an L2 prior to force groups of correlated coeffi-
cients to be turned on. The L2 prior forces spatially adjacent light
sources to be switched on simultaneously to represent illumination
sources such as area lights or windows. We thus obtain the follow-
ing form for F2:
F2(L) = λ1 ‖α‖1 + λ2 ‖α‖22 . (10)
The reflectance prior F3 enforces piecewise constancy over the
deviation of the reflectance P from the original stock model re-
flectance P:
F3(P) = λ3
NI∑
i=1
∑
j∈Ni
∥∥(Pi −Pi)− (Pj −Pj)∥∥1 . (11)
P belongs to the stock model appearance T described in Section 3.
The prior F3 is related to color constancy assumptions about intrin-
sic images [Land et al. 1971; Karsch et al. 2011]. Ni represents the
4-neighborhood of the ith pixel in image space.
We optimize the objective F in Equation (9) subject to non-
negativity constraints on α:{
P
?
,L
?
}
= arg min
P,L
F (P,L), s.t. α ≥ 0. (12)
The above optimization is non-convex due to the bilinear interac-
tion of the surface reflectances P with the illumination L. If we
know the reflectances, we can solve a convex optimization for the
illumination, and vice versa. We initialize the reflectances with the
stock model reflectance P for the object, and the median pixel value
for the ground plane. We alternately solve for illumination and re-
flectance until convergence to a local minimum. To represent the
vMF kernels and L, we discretize the sphere into K directions, and
compute K kernels, one per direction. Finally, we compute the ap-
pearance difference as the residual of synthesizing the photograph
using the diffuse reflection model, i.e.,
δ
?
i = Ii −P?i
∫
Ω
ni · si(ω)vi(ω)L?(ω)dω. (13)
⇡1
⇡2 ⇡2
⇡1
⇡2 ⇡2
(a) Layer 1 (b) Layer 2 (c) Layer 3 (d) Layer 4 (e) MRF
Labeling
La
te
ra
l V
ie
w
D
or
sa
l V
ie
w
Figure 5: We build an MRF over the object model to complete ap-
pearance. (a) Due to the camera viewpoint, the vertices are parti-
tioned into a visible set Iv shown with the visible appearance, and
a hidden set Ih shown in green. Initially, the graph has a single
layer of appearance candidates, labeled Layer 1, corresponding to
visible parts. At the first iteration, we use the bilateral plane of sym-
metry pi1 to transfer appearance candidates from Layer 1 to Layer
2. At the second iteration, we use an alternate plane of symmetry pi2
to transfer appearance candidates (c) from Layer 1 to Layer 3, and
(d) from Layer 2 to Layer 4. We perform inference over an MRF
to find the best assignment of appearance candidates from several
layers to each vertex. This result was obtained after six iterations.
6 Appearance Completion to Hidden Parts
The appearance T = {P?, δ?} computed in Section 5 is only avail-
able for visible parts of the object, as shown in Figure 5(a). We use
multiple planes of symmetry to complete the appearance in hidden
parts of the object using the visible parts. We first establish sym-
metric relationships between hidden and visible parts of the object
via planes of symmetry. These symmetric relationships are used to
suggest multiple appearance candidates for vertices on the object.
The appearance candidates form the labels of a Markov Random
Field (MRF) over the vertices. To create the MRF, we first obtain a
fine mesh of object vertices Xs, s ∈ I created by mapping the uv-
locations on the texture map onto the 3D object geometry. Here I
is a set of indices for all texel locations. We use this fine mesh since
the original 3D model mesh usually does not provide one vertex per
texel location, and cannot be directly used to completely fill the ap-
pearance. We set up the MRF as a graph whose vertices correspond
to Xs, and whose edges consist of links from each Xs to the set
Ks consisting of k nearest neighbors of Xs. As described in Sec-
tion 6.1, we associate each vertex Xs with a set of L appearance
candidates (P˜is , δ˜is), i ∈ {1, 2, · · · , L} through multiple sym-
metries. Appearance candidates with the same value of i form a
layer, and the algorithm in Section 6.1 grows layers (Figure 5(b)
through 5(d)) by transferring appearance candidates from previous
layers across planes of symmetry. To obtain the completed appear-
ance, shown in Figure 5(e), we find an assignment of appearance
candidates, such that each vertex is assigned one candidate and the
assignment satisfies the constraints of neighborhood smoothness,
consistency of texture, and matching of visible appearance to the
observed pixels. We obtain this assignment, by performing infer-
ence over the MRF as described in Section 6.2.
6.1 Relating Hidden& Visible Parts via Symmetries
We establish symmetric relationships between hidden and visible
parts of the object by leveraging the regularities of the stock model
X. However, we identify the hidden and visible areas using the
aligned and corrected model X
Θ
. To do so, we note that a point
Xs on the stock model is related to the texture map and to a cor-
responding point X
Θ
s on the corrected model through barycentric
coordinates over a triangle mesh specified on the stock model. We
compute X
Θ
s on the corrected model using the barycentric coordi-
nates of Xs. We then use pose Θ of the object to determine the set
of indices Iv for vertices visible from the camera viewpoint (shown
as textured parts of the banana in Figure 5(a)), and the set of indices
Ih = I \Iv for vertices hidden from the camera viewpoint (shown
in green in Figure 5(a)). While it is possible to pre-compute sym-
metry planes on an object, our objective is to relate visible parts
of an object to hidden parts through symmetries, many of which
turn out to be approximate (for instance, different parts of a banana
with approximately similar curvature are identified as symmetries
using our approach). Pre-computing all such possible symmetries
is computationally prohibitive. We proceed iteratively, and in each
iteration, we compute a symmetric relationship between Ih and
Iv using the stock model X. Specifically, we compute planes of
symmetry, shown as planes pi1 and pi2 in Figures 5(a) and 5(b).
Through this symmetric relationship, we associate appearance can-
didates (P˜is , δ˜is) to each vertex Xs in the graph by growing out
layers of appearance candidates.
Algorithm 1 details the use of symmetries to associate appearance
candidates P˜is and δ˜is through layers. The algorithm initializes
the appearance candidates at the first layer for vertices of the ob-
ject in parts visible in the photograph using the appearance P and
δ computed in Section 5. Over M iterations, the algorithm uses M
planes of symmetry to populate the graph from layers 1 toL = 2M .
The algorithm uses RANSAC to compute the optimal plane of sym-
metry pim at the mth iteration. Planes pi1 and pi2 computed for it-
erations m = 1 and m = 2 are shown in Figures 5(a) and 5(b).
At themth iteration, the algorithm then generates 2m−1 new layers,
by transferring appearance candidates from the first 2m−1 layers
across the plane of symmetry. In Figure 5(b), Layer 2 is generated
by transferring appearance candidates from Layer 1 across pi1. Us-
ing plane pi2 obtained in iteration m = 2, the algorithm generates
Layers 3 and 4 (in Figures 5(c) and 5(d)) from Layers 1 and 2. The
first 2m−1 layers are grown in the previous m − 1 iterations. The
algorithm uses the criteria of geometric symmetry (captured by the
distance
∥∥Xs − 2nmnTmXt? + 2nmdm∥∥ in vertex space) and ap-
pearance similarity (represented by the distance ‖Ps −Pt?‖ in the
image space of the texture map) to compute the symmetry plane
pim and to transfer appearance candidates. For vertices for which
no appearance candidates can be generated using geometric sym-
metry and appearance similarity, as in the case of the underside of
the taxi cab in Figure 1, the algorithm defaults to the stock model
appearance P (for which appearance difference δ is 0).
6.2 Completing Appearance via MRF
To obtain the completed appearance for the entire object from the
appearance candidates, shown in Figure 5(e), we need to select a set
of candidates such that (1) each vertex on the 3D model is assigned
exactly one candidate, (2) the selected candidates satisfy smooth-
ness and consistency constraints, and (3) visible vertices retain
their original appearance. To do this, we perform inference over
the MRF using tree-reweighted message passing (TRW-S) [Kol-
mogorov 2006]. While graph-based inference has been used to
complete texture in images [Kwatra et al. 2003], our approach uses
Algorithm 1 Associating Appearance Candidates Through Layers.
Set user-defined values for µ, ν, and M .
∀i ∈ {1, 2, · · · , 2M} & ∀s ∈ I, set PËœis â†âˆž & δ˜is â†âˆž.
Initialize Ic ↠Iv , and Il ↠Ih.
∀s ∈ Iv , set PËœ1s â†âˆž and δ˜1s â†âˆž.
form = 1 to M do
Initialize nm ↠0, Im ↠∅, nm â†
[
0 0 0
]T , and dm ↠0.
RANSAC for optimal plane of symmetry pim:
for r = 1 to R do
Randomly select sr ∈ I and tr ∈ Il.
Compute bisector plane pir =
[
nTr −dr
]T , where
nr ↠Xsr−Xtr‖Xsr−Xtr‖ and dr â†
1
2
nT (Xsr + Xtr ).
Ir ↠{s : s ∈ Il ∧ ‖Ps −Pt?‖ < ν
∧ ∥∥Xs − 2nrnTr Xt? + 2nrdr∥∥ < µ }.
t? = arg min
t∈Ic
∥∥Xs − 2nrnTr Xt + 2nrdr∥∥.
if |Ir| > nm then
nm ↠|Ir|, Im ↠Ir , nm ↠nr , and dm ↠dr .
end if
end for
Optimal plane of symmetry pim =
[
nTm − dm
]T .
for i = 1 to 2m−1 do
Set P˜js ↠P˜it? and P˜ js ↠P˜ it? ,
where j = i+ 2m−1 ∧ s ∈ I ∧ ‖Ps −Pt?‖ < ν
∧ ∥∥Xs − 2nmnTmXt? + 2nmdm∥∥ < µ.
t? = arg min
t∈Ic
∥∥Xs − 2nmnTmXt + 2nmdm∥∥.
end for
Update Ic ↠Ic ∪ Im and Il ↠Il \ Im.
end for
if |Il| > 0 then
∀s ∈ Il, set P˜1s ↠Ps and δ˜1s ↠0.
end if
the layers obtained by geometric and appearance-based symmetries
from Section 6.1.
For every vertex on the 3D model Xs, s ∈ I, we find an assignment
of reflectance values that optimizes the following objective:
{
i
?
1 , ..., i
?
|I|
}
= arg min
i1,··· ,i|I|
|I|∑
s=1
Φ(P˜is ) +
|I|∑
s=1
∑
t∈Ks
Φ(P˜is , P˜it ). (14)
The pairwise term in the objective function, Φ(·, ·) enforces neigh-
borhood smoothness via the Euclidean distance. Here Ks repre-
sents the set of indices for the k nearest neighbors of Xs. We
bias the algorithm to select candidates from the same layer using
a weighting factor of 0 < β < 1. This provides consistency of
texture. We use the following form for the pairwise terms:
Φ(P˜is , P˜it)

β
∥∥∥P˜is − P˜it∥∥∥2
2
if is = it,∥∥∥P˜is − P˜it∥∥∥2
2
otherwise.
(15)
The unary term Φ(·) forces visible vertices to receive the re-
flectance computed in Section 5. We set the unary term at the first
layer for visible vertices to ζ, where 0 < ζ < 1, else we set it to 1:
Φ(P˜is) =
{
ζ if s ∈ Iv, is = 0,
1 otherwise. (16)
We use the tree-reweighted message passing algorithm to perform
the optimization in Equation (14). We use the computed assign-
ment to obtain the reflectance values P
?
for all vertices in the set I.
Original Photograph Corrected Geometry Estimated Illumination Missing Parts Revealed Output
Figure 6: Top row: 3D manipulation of an origami crane. We show the corrected geometry for the crane in the original photograph, and the
estimated illumination, missing parts, and final output for a manipulation. Bottom row: Our approach uses standard animation software to
create realistic animations such as the flying origami crane. Photo Credits: ©Natasha Kholgade.
We build an analogous graph for the appearance difference δ
?
, and
solve an analogous optimization to find its assignment. The assign-
ment step covers the entire object with completed appearance. For
areas of the object that do not satisfy the criteria of geometric sym-
metry and appearance similarity, such as the underside of the taxi
cab in Figure 1, the assignment defaults to the stock model appear-
ance. The assignment also defaults to the stock model appearance
when after several iterations, the remaining parts of the object are
partitioned into several small areas where the object lacks structural
symmetries relative to the visible areas. In this case, we allow the
user to fill the appearance in these areas on the texture map of the
3D model using PatchMatch [Barnes et al. 2009].
7 Final Composition
The user manipulates the object pose from Θ to Ω. Given the cor-
rected geometry, estimated illumination, and completed appearance
and from Sections 4 to 6, we create the result of the manipulation J
by replacing Θ with Ω in Equation (1). We use ray-tracing to ren-
der each pixel according to Equation (8), using the illumination L
?
,
geometry X, and reflectance P
?
. We add the appearance difference
δ
?
to the rendering to produce the final pixels on the manipulated
object. We render pixels for the object and the ground using this
method, while leaving the rest of the photograph unchanged (such
as the corridor in Figure 6). To handle aliasing, we perform the
illumination estimation, appearance completion, and compositing
using a super-sampled version of the photograph. We filter and
subsample the composite to create an anti-aliased result.
8 Results
We perform a variety of 3D manipulations to photographs, such as
rigid manipulation of photographed objects, deformation, and 3D
copy-paste. We use a separately captured background photograph
for the chair, while for all other photos, we fill the background using
Context-Aware Fill in Photoshop. The top row of Figure 6 shows
the aligned geometry, estimated illumination, missing parts, and fi-
nal result of a manipulation performed to an origami crane. Though
the illumination may not be accurate, in combination with the cor-
rected reflectance, it produces plausible results such as the shadows
of the wing on the hand and illumination changes on the crane. Our
approach can directly be tied to standard modeling and animation
software to create realistic animations from a single photograph,
such as the flying origami crane in the bottom row of Figure 6. Fig-
ure 7 shows the result of our approach on the laptop from Zheng et
al. [2012]. Unlike their approach, we reveal the hidden logo using
the stock 3D model while maintaining plausible illumination over
the object and the contact surface.
In Figure 13, we show 3D manipulations such as rotation, trans-
lation, copy-paste, and deformation on the chair from Figure 2, a
pen, a subject’s watch, some fruit, a painting, a car on a cliff, a
top hat, and a historical photograph of airplanes shot during World
War II. As the shown in the second column, our approach repli-
cates the original photograph in the first column to provide seam-
less transition from the original view to new manipulations of the
object. We show intermediate outputs in supplementary material.
Our approach allows users to create dynamic compositions such as
the levitating chair, the watches flying about the subject, and the
falling fruit in Figure 13, and the taxi jam from Figure 1. Users can
add creative effects to the photograph, such as the pen strokes on
the paper, in conjunction with 3D manipulations. The illumination
estimated in Section 5 can be used to plausibly relight new 3D ob-
jects inserted into the scene as shown in Figure 12. We also perform
non-rigid deformations to objects using the geometry correction ap-
proach of Section 4, such as converting the top hat in Figure 13 into
a magician’s hat. Through our approach, the user changes the story
of the car photograph in Figure 13 from two subjects posing next to
a parked car to them watching the car as it falls off the cliff.
Figure 11 shows an example where the taxi-cab at the top of the
photograph in Figure 1 is manipulated using 3D models of three dif-
ferent cars. Accurate alignment is obtained for the first two models.
Our appearance completion algorithm determines plausible appear-
ance, even when the stock 3D model of the second car deviates from
the original photograph. For significantly different geometry, such
as that of the third car, alignment may be less accurate, leading to
artifacts such as transfer of appearance from the body to the wind-
shield or the wheels. However, the approach maintains plausible
illumination with similar environment maps, as the geometry of the
3D models provides sufficient cues for diffuse illumination.
While our approach is designed for digital photographs, it pro-
vides plausible results for vintage photographs such as the historical
World War II photograph in Figure 13 and the non-photorealistic
media such as the vegetables’ painting in Figure 13. The appear-
Original Photograph from Zheng et al. [2012] Object manipulation using Zheng et al. [2012] Object manipulation using our approach Revealing unseen parts using our approach
Figure 7: We perform a 3D rotation of the laptop in a photograph from Zheng et al. [2012] (Copyright: ©ACM). Unlike their approach, we
can reveal the hidden cover and logo of the laptop.
(b) Alignment using Xu et al. [2011] (c) Geometry Correction with Our Approach(a) Photograph and 3D Stock Model
Figure 8: Comparison of geometry correction by our approach against the alignment of Xu et al. [2011]. As shown in the insets, Xu et al.
do not align the leg and the seat accurately. Through our approach, the user accurately aligns the model to the photograph.
ance estimation described in Section 5 estimates grayscale appear-
ances for the black-and-white photograph of the airplanes. Using
our approach, the user manipulates the airplanes to pose them as
if they were pointing towards the camera, an effect that would be
nearly impossible to capture in the actual scene. In the case of
paintings, our approach maintains the style of the painting and the
grain of the sheet, by transferring these through the fine-scale detail
difference in the appearance completion described in Section 6.
Geometry Evaluation. Figure 8 shows results of geometry cor-
rection through our user-guided approach compared to the semi-
automated approach of Xu et al. [2011] on the chair. In the sys-
tem of Xu et al., the user input involves seeding a graph-cut seg-
mentation algorithm with a bounding box, and rigidly aligning the
model. Their approach automatically segments the photographed
object based on connected components in the model, and deforms
the 3D model to resemble the photograph. Their approach approx-
imates the form of most of the objects. However, as shown by the
insets in Figure 8(b), it fails to exactly match the model to the pho-
tograph. Through our approach (shown in Figure 8(c)), users can
accurately correct the geometry to match the photographed objects.
Supplementary material shows comparisons of our approach with
Xu et al. for photographs of the crane, taxi-cab, banana, and mango.
Illumination Evaluation. To evaluate the illumination estima-
tion, we captured fifteen ground truth photographs of the chair from
Figure 2 in various orientations and locations using a Canon EOS
5D Mark II digital SLR camera mounted on a tripod and fitted with
an aspheric lens. The photographs are shown in the supplementary
material. We also captured a photograph of the scene without the
object to provide a ground-truth background image. We aligned
the 3D model of the chair to each of the fifteen photographs us-
ing our geometry correction approach from Section 4, and eval-
uated our illumination and reflectance estimation approach from
Section 5 against a ground truth light probe, and three approaches:
(1) Haar wavelets with positivity constraints on coefficients [Haber et al. 2009],
(2) L1-sparse high frequency illumination with spherical harmon-
ics for low-frequency illumination [Mei et al. 2009], and (3) envi-
ronment map completion using projected background [Khan et al.
2006]. We fill the parts of the Khan et al. environment map not seen
in the image using the PatchMatch algorithm [Barnes et al. 2009].
For all experiments, we estimated the illumination using Photo-
graph 1 (shown at the left of Figure 10(a)), and synthesized images
corresponding to the poses of the chair in Photographs 2 to 15 by
applying the estimated illumination to the geometry of each pose.
The synthesized images contain all steps of our approach except
appearance completion on the chair. We computed mean-squared
reconstruction errors between Photographs 2 to 15 and their corre-
sponding synthesized images for three types of subregions in the
photograph: object and ground, ground only, and object only. As
a baseline, we obtained the ground truth illumination by captur-
ing a high dynamic range (HDR) image of a light probe (a 4-inch
chrome sphere, following Debevec [1998]). Figure 9 shows the
mean-squared reconstruction error (MSE) for the von Mises-Fisher
basis compared against Haber et al., Mei et al., Khan et al., and
the light probe, for increasing numbers of basis components (i.e.
K). We discretized the sphere into the same number of cells as there
are basis components. The number of basis components is a power
of two to ensure that the Haar wavelet basis is orthonormal. We
used λ1 = .01 and λ2 = 1 for our method, a regularization weight
of 1 for the approach of Haber et al., and a regularization weight
of .01 for the method of Mei et al. The reconstruction error for
Haar wavelets increases with the number of basis components: in
attempting to capture high-frequency information (such as the edges
of shadows), they introduce negative illumination, which appears as
spurious highlights.
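The per-region errors in Figure 9 amount to a masked mean-squared difference between each ground-truth photograph and its synthesized counterpart; a minimal sketch follows, with mask names of our choosing.

import numpy as np

def masked_mse(photo, synthesized, mask):
    """Mean-squared error restricted to the pixels selected by a boolean mask
    (e.g. object & ground, ground only, or object only)."""
    diff = (photo.astype(np.float64) - synthesized.astype(np.float64)) ** 2
    return diff[mask].mean()

# e.g. errors for one pose, assuming masks derived from the aligned 3D model:
# mse_obj_ground = masked_mse(photo_k, synth_k, object_mask | ground_mask)
# mse_ground     = masked_mse(photo_k, synth_k, ground_mask)
# mse_object     = masked_mse(photo_k, synth_k, object_mask)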
Figure 10 shows environment maps for the light probe, the back-
ground image projected out according to Khan et al., and the three
illumination estimation approaches (the Haar wavelet basis, Mei et al.,
and our vMF basis). The second and third rows of
Figure 10 show images of the chair in Photograph 2 synthesized
using the light probe, the background image, and the three illumi-
nation estimation approaches with K = 4096. The ground truth
for Photograph 2 is shown at the top-left. As shown in the figure,
the approach of Khan et al. does not capture the influence of ceil-
ing lights which are not observed in the original photograph. The
Haar wavelets’ approach introduces highlights due to negative light,
while the approach of Mei et al. generates sharp cast shadows. Our
approach produces smooth cast shadows and captures the effect of
lights not seen in the original photograph. Additional examples can
be found in the supplementary material.
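Since the environment maps produced by our method are mixtures of von Mises-Fisher lobes, reconstructing a map from the fitted basis is a direct evaluation of those lobes on a grid of directions. The sketch below assumes unit lobe directions, a shared concentration, and nonnegative weights; the variable names and the (omitted) normalization are ours, chosen for illustration.

import numpy as np

def vmf_environment_map(dirs, lobe_mu, kappa, weights):
    """Evaluate a von Mises-Fisher mixture as an environment map.

    dirs:    (D, 3) unit directions at which to evaluate the map
    lobe_mu: (K, 3) unit mean directions of the lobes
    kappa:   scalar concentration shared by all lobes
    weights: (K,)   nonnegative per-lobe intensities
    """
    # unnormalized vMF kernels, shifted by -kappa for numerical stability
    kernels = np.exp(kappa * (dirs @ lobe_mu.T - 1.0))   # (D, K)
    return kernels @ weights                             # (D,) radiance per direction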
Geometry Correction Timings. We evaluated the time taken for
four users to perform geometry correction of 3D models using our
user-guided approach. Table 1 shows the times (in minutes) to align
various 3D models to photographs used in this paper.
[Figure 9 plots: mean-squared error (MSE), ranging from 0 to 1.5 x 10^-5, versus number of basis components, from 10^2 to 10^4, in three panels (Object & Ground, Ground Only, Object Only); curves: vMF, Haar, Sph+L1, Bkgnd, LightProbe.]
Figure 9: Plots of mean-squared reconstruction error (MSE) versus number of basis components for the vMF basis in green (method used
in this paper) compared to the Haar basis [Haber et al. 2009], spherical harmonics with L1 prior (Sph+L1) [Mei et al. 2009], background
image projected (Bkgnd) [Khan et al. 2006], and a light probe, on the object and the ground, ground only, and object only.
Table 1: Times taken (minutes) to align 3D models to photographs.
User   Banana   Mango   Top hat   Taxi    Chair   Crane
1       12.43    7.10      8.72   16.09   45.17   32.22
2        6.33    3.08     10.08    7.75   20.32   14.92
3        2.23    2.42      4.93    2.57    6.12   22.07
4        5.63    5.13      6.17    7.52    8.53   19.03
The time spent on aligning is directly related to the complexity of the
model in terms of its number of connected components. Users spent the
least amount of time on aligning the mango, banana, and top hat
models, as these models consist of a single connected component
each. Though the taxi consists of 24 connected components, its
times are comparable to those of the simpler models, as the stock model
geometry is close to the photographed taxi. The long alignment times
of the chair and origami crane are due to the number of connected
components (3 and 8 respectively), and the large disparity between
their stock models and their photographs. We show user alignment
examples in the supplementary material. The supplementary video
shows a session to correct the banana model using the tool.
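As a quick summary of Table 1, the per-model averages across the four users can be computed directly from the listed times, confirming that the single-component models took the least time and the chair and crane the most; the snippet below simply reproduces the table values and averages them.

import numpy as np

# Times (minutes) from Table 1, rows = users 1-4
times = {
    "Banana":  [12.43, 6.33, 2.23, 5.63],
    "Mango":   [7.10, 3.08, 2.42, 5.13],
    "Top hat": [8.72, 10.08, 4.93, 6.17],
    "Taxi":    [16.09, 7.75, 2.57, 7.52],
    "Chair":   [45.17, 20.32, 6.12, 8.53],
    "Crane":   [32.22, 14.92, 22.07, 19.03],
}
for model, t in times.items():
    print(f"{model:8s} mean {np.mean(t):5.2f} min")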
User Study. We evaluated the perceived realism of 3D object ma-
nipulation in photographs through a two-alternative forced choice
user study. We asked participants to compare and choose the more
realistic image between the original photograph and an edited result
produced using our approach. We recruited 39 participants mostly
from a pool of graduate students in computer science, and we con-
ducted the study using a webpage-based survey. Each participant
viewed four image pairs. Each image pair presented a choice be-
tween an original photograph and one of two ‘edited’ conditions:
either the final result of our approach (Condition 1), or an interme-
diate result also produced using our method (Condition 2). Specifi-
cally, Condition 2 was generated using the corrected geometry, illu-
mination, and surface reflectance, but without using the appearance
difference. The presentation order was randomized, and each user
saw each of the original photographs once. Across all participants,
we obtained a total of 80 responses for the original photo/final result
image pairs, and 76 responses for the original photo/intermediate
output image pairs. 71.05% of the time (i.e., on 54 of 76 image
pairs), participants chose the original photograph as being more re-
alistic than Condition 2. This preference was found to be statis-
tically significant: we used a one-sample, one-tailed t-test, which
rejected the null hypothesis that people do not prefer the original
photograph over Condition 2 at the 5% level (p < 0.001). 52.5%
of the time (i.e., on 42 of 80 image pairs), the participants chose
Condition 1 over the original photograph. The difference was not
found to be significant at the 5% level. In some cases, participants
reported choosing the more realistic image based on plausible con-
figurations of the object, e.g., an upright well-placed chair as op-
posed to one fallen on the ground. Images used are shown in the
supplementary material.
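The significance test can be reproduced in spirit by coding each response as 1 when the participant chose the original photograph and 0 otherwise, and testing whether the mean exceeds 0.5. The sketch below uses SciPy's one-sample t-test as a stand-in for our analysis; an exact binomial test would be a reasonable alternative.

import numpy as np
from scipy import stats

def prefers_original_test(n_chose_original, n_total):
    """One-sample, one-tailed t-test of H0: preference for the original <= 0.5.
    Requires SciPy >= 1.6 for the `alternative` argument."""
    choices = np.zeros(n_total)
    choices[:n_chose_original] = 1.0          # 1 = chose the original photograph
    t, p = stats.ttest_1samp(choices, popmean=0.5, alternative="greater")
    return t, p

# Condition 2 (intermediate result): 54 of 76 responses chose the original photograph
print(prefers_original_test(54, 76))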
Parameters. During geometry correction, the user can turn on
symmetry by setting w = 1 or turn it off by setting w = 0. For
illumination extraction and appearance correction, we set the illu-
mination parameters λ1 and λ2 to small values for smooth shadows
due to dispersed illumination, i.e., λ1 = .001 to .01, λ2 = .1 to 1.
For strong directed shadows due to sparse illumination concentrated
in one region, we set high values: λ1 = 1 to 5, λ2 = 10 to 20. To
emphasize ground shadows, we set τ = .01 to .25. The value of K
is set to 2500. We set λ3 = .5 to emphasize piecewise constancy of
the appearance deviations. In the appearance completion section,
we set the number of nearest neighbors k for all Ks to 5. We set M
to 1 for rigid objects such as the chair, the laptop, and the cars, and
to 7 for flexible objects such as fruit. We set β = .005 to emphasize
consistency of appearance within a layer, and we set ζ to a small
value (.01) to force visible vertices to be assigned the observed ap-
pearance. For all RANSAC operations, we set R = 5000, µ = .01
times the bounding box of the object, and ν = .001. We vary the
geometric symmetry tolerance µ from .01 to .1 times the bounding
box for increasingly approximate symmetries. The parameters we vary
the most in our algorithm are λ1, λ2, and τ. Parameter settings for the
examples used in the paper can be found in the supplementary material.
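For convenience, the settings above can be gathered into a single configuration; the values and ranges below simply restate this section, and the key names are ours.

# Parameter values and ranges stated in this section
PARAMS = {
    "symmetry_w":         (0, 1),         # geometry correction: 0 = symmetry off, 1 = on
    "lambda1":            (0.001, 5.0),   # low end for dispersed light, high end for directed light
    "lambda2":            (0.1, 20.0),
    "tau":                (0.01, 0.25),   # emphasis on ground shadows
    "K":                  2500,           # number of illumination basis components
    "lambda3":            0.5,            # piecewise constancy of appearance deviations
    "knn_k":              5,              # nearest neighbors in appearance completion
    "M_rigid":            1,              # rigid objects (chair, laptop, cars)
    "M_flexible":         7,              # flexible objects (fruit)
    "beta":               0.005,          # within-layer appearance consistency
    "zeta":               0.01,           # pins visible vertices to the observed appearance
    "ransac_R":           5000,
    "ransac_mu_bbox":     0.01,           # inlier threshold, times the object bounding box
    "symmetry_tol_bbox":  (0.01, 0.1),    # geometric symmetry tolerance, times the bounding box
    "ransac_nu":          0.001,
}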
9 Discussion
We present an approach for performing intuitive 3D manipulations
on photographs, without requiring users to access the original scene
or the moment when the photograph was taken. 3D manipulation
greatly expands the repertoire of creative edits that artists
can perform on photographs. For instance, controlling the com-
position of dynamic events (such as Halsman's Dalí Atomicus) is
particularly challenging for artists. Manipulation of photographed
objects in 3D will allow artists far greater freedom to experiment
with many different compositions. Through our approach, artists
can conveniently create stop motion animations that defy physical
constraints. While we have introduced our approach to improve
the experience of editing photographs, it can be used to correct the
geometry and appearance of 3D models in public repositories.
The fundamental limitation of our approach is related to sampling.
We may lack pixel samples for a photographed object if it is small
in image-space; in that case, moving the object closer to the camera
can produce a blurred result. However, as cameras today offer tens
of megapixels of resolution, this problem is increasingly minor and
may be addressed via final touch-ups in 2D.
Failures can occur if an object is photographed with the camera’s
look-at vector perpendicular to the normal of the bilateral plane of
symmetry, e.g., if a wine-bottle is photographed from the top. In
these cases, symmetry constraints are difficult to exploit, and the
completion has to rely heavily on the texture provided with the 3D
model. Another class of failures occurs when the illumination
model fails to account for some lighting effects and the algorithm
attempts to explain the residual effect during appearance comple-
tion. This can result in lighting effects appearing as texture artifacts.