Neural 3D Mesh Renderer

3D Mesh Reconstruction

2D-to-3D Style Transfer

3D DeepDream

These applications are realized by redefining the “backward pass” of a 3D mesh renderer and incorporating it into neural networks.

Short introduction

We propose Neural Renderer, a 3D mesh renderer that can be integrated into neural networks.

We applied this renderer to (a) 3D mesh reconstruction from a single image and (b) 2D-to-3D image style transfer and 3D DeepDream.


For modeling the 3D world behind 2D images, which 3D representation is most appropriate? A polygon mesh is a promising candidate for its compactness and geometric properties. However, it is not straightforward to model a polygon mesh from 2D images using neural networks because the conversion from a mesh to an image, or rendering, involves a discrete operation called rasterization, which prevents back-propagation. Therefore, in this work, we propose an approximate gradient for rasterization that enables the integration of rendering into neural networks. Using this renderer, we perform single-image 3D mesh reconstruction with silhouette image supervision and our system outperforms the existing voxel-based approach. Additionally, we perform gradient-based 3D mesh editing operations, such as 2D-to-3D style transfer and 3D DeepDream, with 2D supervision for the first time. These applications demonstrate the potential of the integration of a mesh renderer into neural networks and the effectiveness of our proposed renderer.


Full paper is available at


Single-image 3D reconstruction

A 3D mesh can be correctly reconstructed from a single image using our method.

Comparison with voxel-based method [1]

Mesh reconstruction does not suffer from the low resolution and cubic artifacts that affect voxel reconstruction.

Our approach outperforms the voxel-based approach [1] in 10 out of 13 categories on the voxel IoU metric.
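For readers unfamiliar with the metric, voxel IoU is the intersection over union of the occupied voxels of two shapes. The sketch below is only an illustration: the function name `voxel_iou`, the set-of-coordinates representation, and the convention for two empty grids are assumptions, and the voxelization of a reconstructed mesh that would precede this step is left out.

```python
def voxel_iou(pred, target):
    """Intersection over union of two collections of occupied voxel coordinates."""
    pred, target = set(pred), set(target)
    union = pred | target
    if not union:
        # Both grids are empty; treating this as a perfect match is a convention.
        return 1.0
    return len(pred & target) / len(union)


# Toy shapes: two voxels are shared, four are occupied in total.
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 0)}
target = {(0, 0, 0), (0, 0, 1), (1, 0, 0)}
iou = voxel_iou(pred, target)  # 2 shared / 4 occupied
```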

2D-to-3D style transfer

The styles of the paintings are accurately transferred to the textures and shapes by our method. Please pay attention to the outline of the bunny and the lid of the teapot.

The style images are Thomson No. 5 (Yellow Sunset) (D. Coupland, 2011), The Tower of Babel (P. Bruegel the Elder, 1563), The Scream (E. Munch, 1910), and Portrait of Pablo Picasso (J. Gris, 1912).

3D DeepDream

This is a 3D version of DeepDream.

Technical overview

Understanding the 3D world from 2D images is one of the fundamental problems in computer vision. Rendering (3D-to-2D conversion) lies on the boundary between the 3D world and 2D images, and a polygon mesh is an efficient, rich, and intuitive 3D representation. Therefore, the “backward pass” of a 3D mesh renderer is worth pursuing.

Rendering cannot be integrated into neural networks without modification because the renderer blocks back-propagation. In this work, we propose an approximate gradient for rendering, which enables end-to-end training of neural networks that include rendering. Please read the paper for the details of our renderer.
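To see why a rasterizer blocks back-propagation, consider a toy one-dimensional case: a pixel switches on once a face edge passes its center, so its intensity is a step function of the edge position and the true derivative is zero almost everywhere. The sketch below is not the renderer from the paper, which is more involved. It only illustrates the general idea of substituting a finite slope for the zero gradient. All function names and the 1D setup are assumptions made for illustration.

```python
def pixel_intensity(edge_x, pixel_x):
    """Hard rasterization of one pixel: 1.0 once the face edge reaches the pixel center."""
    return 1.0 if edge_x >= pixel_x else 0.0


def hard_gradient(edge_x, pixel_x, eps=1e-4):
    """Central finite difference of the step function; zero unless the edge sits on the pixel."""
    hi = pixel_intensity(edge_x + eps, pixel_x)
    lo = pixel_intensity(edge_x - eps, pixel_x)
    return (hi - lo) / (2 * eps)


def approximate_gradient(edge_x, pixel_x):
    """Slope of the line joining the current state to the state where the pixel flips."""
    if edge_x == pixel_x:
        return 0.0  # already at the transition; no finite slope is defined in this toy
    current = pixel_intensity(edge_x, pixel_x)
    flipped = 1.0 - current  # intensity once the edge has crossed the pixel center
    return (flipped - current) / (pixel_x - edge_x)


# The edge is 3 units left of the pixel center.
g_hard = hard_gradient(2.0, 5.0)           # no learning signal
g_approx = approximate_gradient(2.0, 5.0)  # positive: moving the edge right covers the pixel
```

With the surrogate slope, a loss that wants the pixel covered produces a nonzero signal that pushes the edge toward the pixel, whereas the hard gradient gives no signal at all.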

The applications demonstrated above were performed using this renderer. The figure below shows the pipelines.

The 3D mesh generator was trained with silhouette images. During training, the generator tries to minimize the difference between the silhouettes of the reconstructed 3D shape and the true silhouettes.
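One way to measure the difference between a rendered silhouette and a true one is a soft intersection-over-union loss, sketched below. The choice of this particular loss, the name `silhouette_loss`, and the flat-list image representation are assumptions made for illustration; this page only states that the silhouette difference is minimized.

```python
def silhouette_loss(rendered, target):
    """1 - soft IoU between two silhouettes given as flat lists of values in [0, 1]."""
    intersection = sum(r * t for r, t in zip(rendered, target))
    union = sum(r + t - r * t for r, t in zip(rendered, target))
    if union == 0:
        return 0.0  # both silhouettes are empty; treated here as a perfect match
    return 1.0 - intersection / union


# Toy 2x2 silhouettes, flattened row by row.
target = [0.0, 1.0, 1.0, 0.0]
perfect = silhouette_loss([0.0, 1.0, 1.0, 0.0], target)  # identical silhouettes
shifted = silhouette_loss([1.0, 1.0, 0.0, 0.0], target)  # one of three covered pixels overlaps
```

Because the expression is built from sums and products, it stays differentiable when the rendered silhouette takes fractional values, which is what lets a gradient flow back through a differentiable renderer to the mesh generator.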

2D-to-3D style transfer was performed by optimizing the shape and texture of a mesh to minimize style loss defined on the images. 3D DeepDream was also performed in a similar way.
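A style loss on images is commonly defined through Gram matrices of convolutional feature maps, as in image style transfer. The sketch below assumes that definition. The normalization, the function names, and the toy feature values are all hypothetical, and in practice the features would come from a pretrained network applied to the rendered image and to the style image.

```python
def gram_matrix(features):
    """Channel-by-channel inner products, normalized by the number of spatial positions.

    `features` is a list of C channels, each a flat list of N activations.
    """
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]


def style_loss(rendered_features, style_features):
    """Sum of squared differences between the two Gram matrices."""
    g_r = gram_matrix(rendered_features)
    g_s = gram_matrix(style_features)
    return sum((x - y) ** 2
               for row_r, row_s in zip(g_r, g_s)
               for x, y in zip(row_r, row_s))


# Two channels, two spatial positions each (toy values).
rendered = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]   # same activations at swapped positions
style = [[1.0, 1.0], [1.0, 1.0]]

loss_same_layout = style_loss(rendered, swapped)  # Gram matrices ignore spatial layout
loss_to_style = style_loss(rendered, style)
```

The Gram matrix discards where activations occur and keeps only how channels co-occur, which is why it captures style rather than content. Minimizing such a loss with respect to mesh vertices and texture requires gradients to pass through the renderer.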

Both applications were realized by propagating information from 2D image space into 3D space through our renderer.

More details can be found in the paper.


  • Neural Renderer
  • Applications
    • 3D Reconstruction (in preparation)
    • Style Transfer (in preparation)
    • DeepDream (in preparation)


    title={Neural 3D Mesh Renderer},
    author={Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada},
    booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},


  1. X. Yan et al. “Perspective Transformer Nets: Learning Single-view 3D Object Reconstruction without 3D Supervision.” Advances in Neural Information Processing Systems (NIPS). 2016.

Papers that use neural renderer

  1. 3D Human Texture Estimation from a Single Image with Transformers [Xu & Loy ICCV 2021]
  2. HPOF: 3D Human Pose Recovery from Monocular Video with Optical Flow [Ji et al. ICMR 2021]
  3. De-rendering the World’s Revolutionary Artefacts [Wu et al. CVPR 2021]
  4. Model-based 3D Hand Reconstruction via Self-Supervised Learning [Chen et al. CVPR 2021]
  5. Probabilistic 3D Human Shape and Pose Estimation from Multiple Unconstrained Images in the Wild [Sengupta et al. CVPR 2021]
  6. Learning To Aggregate and Personalize 3D Face From In-the-Wild Photo Collection [Zhang et al. CVPR 2021]
  7. Learning 3D Shape Feature for Texture-insensitive Person Re-identification [Chen et al. CVPR 2021]
  8. Towards In-Field Phenotyping Exploiting Differentiable Rendering with Self-Consistency Loss [Magistri et al. ICRA 2021]
  9. Do 2D GANs Know 3D Shape? Unsupervised 3D Shape Reconstruction from 2D Image GANs [Pan et al. ICLR 2021]
  10. Invisible for both Camera and LiDAR: Security of Multi-Sensor Fusion based Perception in Autonomous Driving Under Physical-World Attacks [Cao et al. S&P 2021]
  11. Differentiable Rendering-based Pose-Conditioned Human Image Generation [Horiuchi et al. CVPRW 2021]
  12. PARE: Part Attention Regressor for 3D Human Body Estimation [Kocabas et al. arXiv 2021]
  13. Exemplar-Based 3D Portrait Stylization [Han et al. arXiv 2021]
  14. Physical world assistive signals for deep neural network classifiers – neither defense nor attack [Pestana et al. arXiv 2021]
  15. ContourRender: Detecting Arbitrary Contour Shape For Instance Segmentation In One Pass [Tang et al. arXiv 2021]
  16. An Online Robot Teaching Method using Static Hand Gestures and Poses [Sun et al. arXiv 2021]
  17. Semantically Controllable Scene Generation with Guidance of Explicit Knowledge [Ding et al. arXiv 2021]
  18. Towards Better Adversarial Synthesis of Human Images from Text [Briq et al. arXiv 2021]
  19. LASOR: Learning Accurate 3D Human Pose and Shape Via Synthetic Occlusion-Aware Data and Neural Mesh Rendering [Yang et al. arXiv 2021]
  20. DexMV: Imitation Learning for Dexterous Manipulation from Human Videos [Qin et al. arXiv 2021]
  21. Towards unconstrained joint hand-object reconstruction from RGB videos [Hasson et al. arXiv 2021]
  22. Toward Realistic Single-View 3D Object Reconstruction with Unsupervised Learning from Multiple Images [Ho et al. arXiv 2021]
  23. Deformation representation based convolutional mesh autoencoder for 3D hand generation [Zheng et al. Neurocomputing 2020]
  24. SUNNet: A novel framework for simultaneous human parsing and pose estimation [Xu et al. Neurocomputing 2020]
  25. Weakly-supervised Reconstruction of 3D Objects with Large Shape Variation from Single In-the-Wild Images [Sun et al. ACCV 2020]
  26. Learning Object Manipulation Skills via Approximate State Estimation from Real Videos [Petrik et al. CoRL 2020]
  27. Human Parsing Based Texture Transfer from Single Image to 3D Human via Cross-View Consistency [Zhao et al. NeurIPS 2020]
  28. AOT: Appearance Optimal Transport Based Identity Swapping for Forgery Detection [Zhu et al. NeurIPS 2020]
  29. MeshSDF: Differentiable Iso-Surface Extraction [Remelli et al. NeurIPS 2020]
  30. Introducing Pose Consistency and Warp-Alignment for Self-Supervised 6D Object Pose Estimation in Color Images [Sock et al. 3DV 2020]
  31. Deep Learning And Interactivity For Video Rotoscoping [Saboo et al. ICIP 2020]
  32. Monocular Differentiable Rendering for Self-Supervised 3D Object Detection [Beker et al. ECCV 2020]
  33. DeepHandMesh: A Weakly-supervised Deep Encoder-Decoder Framework for High-fidelity Hand Mesh Modeling [Moon et al. ECCV 2020]
  34. Deep Feedback Inverse Problem Solver [Ma et al. ECCV 2020]
  35. Who Left the Dogs Out? 3D Animal Reconstruction with Expectation Maximization in the Loop [Biggs et al. ECCV 2020]
  36. Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild [Zhang et al. ECCV 2020]
  37. Spatiotemporal Attacks for Embodied Agents [Liu et al. ECCV 2020]
  38. 3D Bird Reconstruction: a Dataset, Model, and Shape Recovery from a Single View [Badger et al. ECCV 2020]
  39. BCNet: Learning Body and Cloth Shape from A Single Image [Jiang et al. ECCV 2020]
  40. Weakly-Supervised Domain Adaptation via GAN and Mesh Model for Estimating 3D Hand Poses Interacting Objects [Baek et al. CVPR 2020]
  41. Coherent Reconstruction of Multiple Humans From a Single Image [Jiang et al. CVPR 2020]
  42. End-to-End Optimization of Scene Layout [Luo et al. CVPR 2020]
  43. Rotate-and-Render: Unsupervised Photorealistic Face Rotation from Single-View Images [Zhou et al. CVPR 2020]
  44. Unsupervised Learning of Probably Symmetric Deformable 3D Objects from Images in the Wild [Wu et al. CVPR 2020]
  45. Leveraging Photometric Consistency over Time for Sparsely Supervised Hand-Object Reconstruction [Hasson et al. CVPR 2020]
  46. End to End Trainable Active Contours via Differentiable Rendering [Gur et al. ICLR 2020]
  47. Neural Puppet: Generative Layered Cartoon Characters [Poursaeed et al. WACV 2020]
  48. Changing clothing on people images using generative adversarial networks [Pozdniakov, Master thesis, Ukrainian Catholic University, 2020]
  49. Unsupervised Domain Adaptation with Temporal-Consistent Self-Training for 3D Hand-Object Joint Reconstruction [Qi et al. arXiv 2020]
  50. Reconstructing Hand-Object Interactions in the Wild [Cao et al. arXiv 2020]
  51. Temporal-Aware Self-Supervised Learning for 3D Hand Pose and Mesh Estimation in Videos [Chen et al. arXiv 2020]
  52. Semantic Correspondence via 2D-3D-2D Cycle [You et al. arXiv 2020]
  53. Learning Pose-invariant 3D Object Reconstruction from Single-view Images [Peng et al. arXiv 2020]
  54. EllipBody: A Light-weight and Part-based Representation for Human Pose and Shape Recovery [Wang et al. arXiv 2020]
  55. Neural Mesh Refiner for 6-DoF Pose Estimation [Wu et al. arXiv 2020]
  56. Reconstruct, Rasterize and Backprop: Dense shape and pose estimation from a single image [Pokale et al. arXiv 2020]
  57. Learning View Priors for Single-view 3D Reconstruction [Kato and Harada. CVPR 2019]
  58. Strike (with) a Pose: Neural Networks Are Easily Fooled by Strange Poses of Familiar Objects [Alcorn et al. CVPR 2019]
  59. MeshAdv: Adversarial Meshes for Visual Recognition [Xiao et al. CVPR 2019]
  60. Pushing the Envelope for RGB-Based Dense 3D Hand Pose Estimation via Neural Rendering [Baek et al. CVPR 2019]
  61. Canonical Surface Mapping via Geometric Cycle Consistency [Kulkarni et al. ICCV 2019]
  62. Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis [Liu et al. ICCV 2019]
  63. Three-D Safari: Learning to Estimate Zebra Pose, Shape, and Texture from Images “In the Wild” [Zuffi et al. ICCV 2019]
  64. End-to-end Hand Mesh Recovery from a Monocular RGB Image [Zhang et al. ICCV 2019]
  65. FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape from Single RGB Images [Zimmermann et al. ICCV 2019]
  66. Localization and Mapping using Instance-specific Mesh Models [Feng et al. IROS 2019]
  67. Human Motion Generation Based on GAN Toward Unsupervised 3D Human Pose Estimation [Yamane et al. ACPR 2019]
  68. Single-image Mesh Reconstruction and Pose Estimation via Generative Normal Map [Xiang et al. CASA 2019]
  69. Towards Analyzing Semantic Robustness of Deep Neural Networks [Abdullah & Ghanem ICCVW 2019]
  70. Lifting AutoEncoders: Unsupervised Learning of a Fully-Disentangled 3D Morphable Model using Deep Non-Rigid Structure from Motion [Sahasrabudhe et al. ICCVW 2019]
  71. TriDepth: Triangular Patch-based Deep Depth Prediction [Kaneko et al. ICCVW 2019]
  72. Transporting Real World Rigid and Articulated Objects into Egocentric VR Experiences [IEEEVR 2019 poster]
  73. Generating 3D Human Animations from Single Monocular Images [Marwah, Master thesis, CMU, 2019]
  74. Self-supervised Learning of 3D Objects from Natural Images [Kato & Harada, arXiv 2019]
  75. STA: Adversarial Attacks on Siamese Trackers [Wu et al. arXiv 2019]
  76. 3D-Aware Scene Manipulation via Inverse Graphics [Yao et al. NIPS 2018]
  77. Learning Category-Specific Mesh Reconstruction from Image Collections [Kanazawa et al. ECCV 2018]