Perceiving Systems, Computer Vision

Department Talks

Geometric Regularizations for 3D Shape Generation

Talk
  • 13 March 2024 • 15:00—16:00
  • Qixing Huang
  • N3.022

Generative models, which map a latent parameter space to instances in an ambient space, enjoy various applications in 3D Vision and related domains. A standard scheme for these models is probabilistic: it aligns the ambient distribution induced by pushing a prior distribution on the latent space through the generator with the empirical distribution of the training instances. While this paradigm has proven quite successful on images, its current applications in 3D generation face fundamental challenges stemming from limited training data and poor generalization behavior. The key difference between image generation and shape generation is that 3D shapes possess rich priors in geometry, topology, and physical properties. Existing probabilistic 3D generative approaches do not preserve these desired properties, resulting in synthesized shapes with various types of distortions. In this talk, I will discuss recent work that seeks to establish a novel geometric framework for learning shape generators. The key idea is to model various geometric, physical, and topological priors of 3D shapes as suitable regularization losses by developing computational tools in differential geometry and computational topology. We will discuss applications in deformable shape generation, latent space design, joint shape matching, and 3D man-made shape generation.
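
To make the idea of priors-as-losses concrete, below is a minimal sketch (in PyTorch) of one such geometric regularizer: an edge-length preservation term that penalizes local stretching of generated meshes relative to a template. It is an illustrative assumption, not the formulation used in the talk; the names edge_length_regularizer, lambda_geo, and G are hypothetical.

```python
import torch

def edge_length_regularizer(verts_pred, verts_ref, edges):
    """Penalize deviation of generated edge lengths from a reference mesh,
    one simple way to express a geometric prior as a regularization loss.
    verts_pred, verts_ref: (B, V, 3) vertex positions; edges: (E, 2) long tensor."""
    def lengths(v):
        return (v[:, edges[:, 0]] - v[:, edges[:, 1]]).norm(dim=-1)  # (B, E)
    return ((lengths(verts_pred) - lengths(verts_ref)) ** 2).mean()

# Hypothetical use inside generator training (template: (V, 3) vertices):
#   loss = generative_loss + lambda_geo * edge_length_regularizer(G(z), template[None], edges)
```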

Organizers: Yuliang Xiu


Mining Visual Knowledge from Large Pre-trained Models

Talk
  • 18 January 2024 • 15:00—16:00
  • Luming Tang
  • N3.022

Computer vision has made huge progress in the past decade with the dominant supervised learning paradigm, that is, training large-scale neural networks for each task on ever larger datasets. In many cases, however, scalable data or annotation collection is intractable. In contrast, humans can easily adapt to new vision tasks with very little data or few labels. To bridge this gap, we find that rich visual knowledge already exists in large pre-trained models, i.e., models trained on internet-scale images with either self-supervised or generative objectives. We propose different techniques to extract this implicit knowledge and use it to accomplish specific downstream tasks where data is constrained, including recognition, dense prediction, and generation. Specifically, I will mainly present the following three works. First, I will introduce an efficient and effective way to adapt pre-trained vision transformers to a variety of low-shot downstream tasks while tuning less than 1% of the model parameters. Second, I will show that accurate visual correspondences emerge from a strong generative model (i.e., diffusion models) without any supervision. Finally, I will demonstrate that an adapted diffusion model can complete a photo with true scene contents using only a few casually captured reference images.
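
As a rough illustration of how visual correspondences can be mined from a pretrained diffusion model without supervision, the sketch below matches pixel locations of two images by nearest-neighbour cosine similarity between intermediate denoiser features. The feature hook extract_features and the choice of timestep t are assumptions for illustration, not the actual pipeline presented in the talk.

```python
import torch
import torch.nn.functional as F

def diffusion_correspondences(extract_features, img_a, img_b, t):
    """Match every location of img_a to its nearest neighbour in img_b using
    intermediate features of a pretrained diffusion denoiser, without any
    supervision.  `extract_features(img, t)` is a hypothetical hook that noises
    the image to timestep t, runs the U-Net, and returns a (C, H, W) feature map."""
    fa = F.normalize(extract_features(img_a, t), dim=0)    # (C, H, W), unit-norm features
    fb = F.normalize(extract_features(img_b, t), dim=0)
    C, H, W = fa.shape
    sim = fa.flatten(1).T @ fb.flatten(1)                  # (H*W, H*W) cosine similarities
    idx = sim.argmax(dim=1)                                # best match in img_b per location in img_a
    return torch.stack([idx // W, idx % W], dim=1)         # matched (row, col) coordinates
```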

Organizers: Yuliang Xiu, Yandong Wen


  • Partha Ghosh
  • N3.022 Aquarium and Zoom

We present a novel unconditional video generative model designed to address long-term spatial and temporal dependencies. To capture these dependencies, our approach incorporates a hybrid explicit-implicit tri-plane representation, inspired by 3D-aware generative frameworks developed for three-dimensional object representation, and employs a single latent code to model an entire video sequence. Individual video frames are then synthesized from an intermediate tri-plane representation, which itself is derived from the primary latent code. This strategy reduces computational complexity by a factor of 2, as measured in FLOPs, and thus facilitates the efficient and temporally coherent generation of videos. Moreover, our joint frame modeling approach, in contrast to autoregressive methods, mitigates the generation of visual artifacts. We further enhance the model's capabilities by integrating an optical-flow-based module within our Generative Adversarial Network (GAN)-based generator architecture, thereby compensating for the constraints imposed by a smaller generator size. As a result, our model is capable of synthesizing high-fidelity video clips at a resolution of 256×256 pixels, with durations extending to more than 5 seconds at a frame rate of 30 fps. The efficacy and versatility of our approach are empirically validated through qualitative and quantitative assessments on three different datasets comprising both synthetic and real video clips.
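
For intuition, here is a minimal sketch of how a space-time tri-plane can be queried: a point (x, y, t) is projected onto three feature planes and the bilinearly sampled features are summed before being decoded into a frame. The plane layout and decoder are assumptions; the talk's architecture may factorize things differently.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, coords):
    """Query a hybrid explicit-implicit tri-plane at space-time points.
    planes: (3, C, H, W) feature maps for the (x,y), (x,t) and (y,t) planes.
    coords: (N, 3) points with x, y, t each normalized to [-1, 1].
    Returns (N, C) features, the sum of bilinear samples from the three planes."""
    x, y, t = coords[:, 0], coords[:, 1], coords[:, 2]
    projections = torch.stack([
        torch.stack([x, y], dim=-1),   # xy-plane (spatial)
        torch.stack([x, t], dim=-1),   # xt-plane (space-time)
        torch.stack([y, t], dim=-1),   # yt-plane (space-time)
    ])                                  # (3, N, 2)
    grid = projections.unsqueeze(2)     # (3, N, 1, 2) layout expected by grid_sample
    feats = F.grid_sample(planes, grid, align_corners=True)  # (3, C, N, 1)
    return feats.squeeze(-1).sum(dim=0).transpose(0, 1)      # (N, C)
```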

Organizers: Yandong Wen


  • Weiyang Liu
  • N3.022 Aquarium and Zoom

Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this work, we study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. Applying this parameterization to OFT yields a novel parameter-efficient finetuning method called Orthogonal Butterfly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in computer vision and natural language processing. The results validate the effectiveness of BOFT as a generic finetuning method.
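
As a rough sketch of why butterfly structures are parameter-efficient, the snippet below composes log2(d) sparse orthogonal factors built from 2x2 Givens rotations, yielding an orthogonal matrix with O(d log d) parameters instead of O(d^2). BOFT's actual parameterization (block sizes, number of factors) differs; this is only an illustration of the idea.

```python
import math
import torch

def butterfly_orthogonal(thetas, d):
    """Build a d x d orthogonal matrix as a product of log2(d) sparse butterfly
    factors, each made of disjoint 2x2 Givens rotations with trainable angles.
    thetas: tensor of shape (log2(d), d // 2); d must be a power of two."""
    m = int(math.log2(d))
    Q = torch.eye(d)
    for level in range(m):
        stride = 2 ** level                       # distance between paired indices
        B = torch.zeros(d, d)
        pair = 0
        for start in range(0, d, 2 * stride):
            for off in range(stride):
                i, j = start + off, start + off + stride
                c = torch.cos(thetas[level, pair])
                s = torch.sin(thetas[level, pair])
                B[i, i], B[i, j] = c, -s
                B[j, i], B[j, j] = s, c
                pair += 1
        Q = B @ Q                                 # product of orthogonal factors stays orthogonal
    return Q

# Parameter count: (d // 2) * log2(d) angles vs. d * (d - 1) / 2 for a dense orthogonal
# matrix. In OFT-style finetuning, Q would multiply a frozen pretrained weight, W' = Q @ W.
```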

Organizers: Yandong Wen


Ghost on the Shell: An Expressive Representation of General 3D Shapes

Talk
  • 12 October 2023 • 10:00—11:00
  • Zhen Liu
  • Hybrid

The creation of photorealistic virtual worlds requires accurate modeling of 3D surface geometry for a wide range of objects. For this, meshes are appealing since they 1) enable fast physics-based rendering with realistic materials and lighting, 2) support physical simulation, and 3) are memory-efficient for modern graphics pipelines. Recent work on reconstructing and statistically modeling 3D shape, however, has critiqued meshes as being topologically inflexible. To capture a wide range of object shapes, a 3D representation must be able to model solid, watertight shapes as well as thin, open surfaces. Recent work has focused on the former, and methods for reconstructing open surfaces do not support fast reconstruction with materials and lighting or unconditional generative modelling. Inspired by the observation that open surfaces can be seen as islands floating on watertight surfaces, we parametrize open surfaces by defining a manifold signed distance field on watertight templates. With this parametrization, we further develop a grid-based and differentiable representation that parametrizes both watertight and non-watertight meshes of arbitrary topology. Our new representation, called Ghost-on-the-Shell (G-Shell), enables two important applications: differentiable rasterization-based reconstruction from multiview images and generative modelling of non-watertight meshes. We empirically demonstrate that G-Shell achieves state-of-the-art performance on non-watertight mesh reconstruction and generation tasks, while also performing effectively for watertight meshes.
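
To illustrate the "islands on a watertight template" idea, here is a crude sketch that extracts an open sub-mesh as the set of template faces lying where a vertex-wise manifold signed distance field is non-positive. G-Shell itself cuts triangles at the zero level set and does so differentiably; the simplified version below only keeps whole faces.

```python
import numpy as np

def extract_open_surface(verts, faces, msdf, threshold=0.0):
    """Keep the triangles of a watertight template whose vertices lie where the
    manifold signed distance field (mSDF) is <= threshold, yielding an open
    'island' sub-mesh.  verts: (V, 3) float, faces: (F, 3) int, msdf: (V,) float."""
    keep = np.all(msdf[faces] <= threshold, axis=1)   # (F,) mask of faces fully inside
    kept = faces[keep]
    used = np.unique(kept)                            # vertices referenced by kept faces
    remap = np.full(len(verts), -1, dtype=np.int64)
    remap[used] = np.arange(len(used))
    return verts[used], remap[kept]                   # re-indexed open sub-mesh
```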

Organizers: Yandong Wen


  • Jun Gao
  • Virtual (Zoom)

Creating 3D virtual worlds requires generating diverse, high-quality 3D content that mimics the intricacies of the real 3D world. While machine learning has achieved significant success in image and video generation, its application to 3D content generation faces fundamental challenges: the scarcity of 3D training data and the increased complexity inherent in three dimensions. We approach the problem of 3D content generation by revisiting the 3D grounding of the representation, the data, and the algorithms. First, we introduce a differentiable 3D representation that bridges neural fields and meshes via differentiable isosurfacing. This enables us not only to generate 3D meshes with varying topologies but also to regularize neural fields through the mesh. Second, we exploit 2D data priors to facilitate text-to-3D generation with a coarse-to-fine generation recipe. Specifically, we use our differentiable isosurfacing to extract 3D meshes and differentiably render high-resolution images, which enables the generation of high-frequency geometric and texture detail from text. Lastly, we develop a 3D generative algorithm that generates high-quality textured meshes by enforcing a 3D bottleneck in the generation process while supervising 2D images through differentiable rendering.
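
A minimal sketch of the differentiable-isosurfacing ingredient: when the implicit field changes sign along a grid edge, the surface vertex is placed by a linear interpolation whose weights depend on the field values, so gradients of any mesh-based loss flow back into the field. Names and the epsilon guard are illustrative assumptions, not the exact formulation from the talk.

```python
import torch

def edge_crossing_vertices(xa, xb, sa, sb):
    """Differentiably place isosurface vertices on grid edges whose endpoint field
    values sa, sb have opposite signs.  xa, xb: (E, 3) endpoint positions;
    sa, sb: (E,) field values.  Gradients of a mesh-based loss flow back into the
    field through the interpolation weights."""
    w = sb / (sb - sa + 1e-8)                    # weight toward xa; w=1 if sa==0, w=0 if sb==0
    return w[:, None] * xa + (1.0 - w)[:, None] * xb
```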

Organizers: Yao Feng


  • Yifan Wang
  • Virtual (Zoom)

A light stage acquires the shape and material properties of a face in high detail using a series of images captured under synchronized cameras and lights. This captured information can be used to synthesize novel images of the subject under arbitrary lighting conditions or from arbitrary viewpoints, enabling a number of visual effects, such as creating digital replicas of actors for movies or high-quality post-production relighting. In many cases, however, it is infeasible to get access to a light stage for capturing a particular subject: light stages are not easy to find, they are expensive, and they require significant technical expertise (often teams of people) to build and operate. In this talk, we will delve into a lightweight alternative to a light stage that captures comparable data using only a smartphone camera and the sun, which we dub SunStage. Our method only requires the user to capture a selfie video outdoors while rotating in place, and uses the varying angles between the sun and the face as guidance in the joint reconstruction of facial geometry, reflectance, camera pose, and lighting parameters. Despite the in-the-wild, uncalibrated setting, SunStage is able to reconstruct detailed facial appearance and geometry, enabling compelling effects such as relighting, novel-view synthesis, and reflectance editing.
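
As a toy illustration of the kind of lighting model such a capture could jointly optimize, the sketch below shades a face with a single directional sun light plus an ambient term under a Lambertian assumption. This is an assumption for illustration only, not SunStage's actual reflectance or lighting model.

```python
import torch

def sun_shading(normals, albedo, sun_dir, sun_intensity, ambient):
    """Minimal Lambertian shading with one directional (sun) light, the sort of
    simple model that could be optimized jointly with geometry, reflectance, and
    camera pose.  normals: (N, 3) unit normals, albedo: (N, 3) RGB reflectance,
    sun_dir: (3,) unit vector, sun_intensity and ambient: scalars or (3,) RGB."""
    n_dot_l = (normals * sun_dir).sum(dim=-1, keepdim=True).clamp(min=0.0)
    return albedo * (ambient + sun_intensity * n_dot_l)   # shaded per-point color
```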

Organizers: Yao Feng


Face Exploration - Capture all Degrees of Freedom of the Face

Talk
  • 17 August 2023 • 10:00—11:00
  • Claudia Gallatz
  • N3.022 Aquarium and Zoom

High-quality data capture is decisive for scientific work. As a member of the data team, it is a core part of my daily routine to ensure good quality standards in this area. My talk will shed light on the background of this work, from scanner set-up to the resulting data, with a focus on the Face Scanner; this is work every scientist can benefit from in their own projects. I will take the occasion to present our most recent face capture study, named FACE EXPLORATION, led by Timo Bolkart. A selection of representative sequences, including facial movements and expressions, will be demonstrated along with general information on the protocol and the participating subjects. Further, I would like to point out some parallels between actors and computer scientists by approaching the topic of facial expression from my experience as an actress prior to my work at PS.

Organizers: Yandong Wen


Full-body avatars from single images and textual guidance

Talk
  • 13 July 2023 • 10:00—11:00
  • Yangyi Huang

The reconstruction of the full-body appearance of clothed humans from single-view RGB images is a crucial yet challenging task, primarily due to depth ambiguities and the absence of observations of unseen regions. While existing methods have shown impressive results, they still suffer from limitations such as over-smooth surfaces and blurry textures, and they particularly lack detail on the back side of the avatar. In this talk, I will delve into how we address these limitations by leveraging text guidance and pretrained text-image models, introducing two novel methods. First, I will present ELICIT, a data-efficient approach that uses an SMPL-based human body prior and a CLIP-based semantic prior to create an animatable human NeRF from a single image; it tackles the challenge of creating a detailed back-side appearance through a CLIP embedding loss. Second, I will introduce TeCH, our latest project for reconstructing high-fidelity 3D clothed humans with consistent texture maps and detailed geometry. This approach employs a hybrid mesh representation and pretrained 2D text-to-image diffusion models to achieve remarkable results. Through these advancements, we aim to push the boundaries of creating digital humans, bridging the gap between single-image inputs and fully textured, realistic 3D avatars.
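
As an illustration of a CLIP embedding loss of the kind mentioned for ELICIT, the sketch below penalizes the cosine distance between the CLIP embedding of a rendered back view and a reference embedding (e.g., from the visible view or a text prompt). The callable encode_image and the choice of reference signal are assumptions, not the paper's exact loss.

```python
import torch.nn.functional as F

def clip_embedding_loss(encode_image, rendered_back, reference_embedding):
    """Semantic-consistency loss: pull the CLIP embedding of a rendered, unseen
    (back) view toward a reference embedding.  `encode_image` is any callable
    returning CLIP image features for a batch of rendered images."""
    z = F.normalize(encode_image(rendered_back), dim=-1)
    ref = F.normalize(reference_embedding, dim=-1)
    return 1.0 - (z * ref).sum(dim=-1).mean()   # mean cosine distance
```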

Organizers: Hongwei Yi


Pose, Kinematics, and Dynamics

Talk
  • 13 April 2023 • 10:00—11:00
  • Bian Siyuan
  • Aquarium

Recovering accurate 3D human pose and shape from monocular input remains a challenging problem despite the rapid advances powered by deep neural networks. Existing methods struggle to achieve both robustness and mesh-image alignment, and the estimated pose suffers from physical artifacts such as foot sliding and body leaning. In this talk, we present two new methods to address these limitations. First, we introduce NIKI, an inverse kinematics algorithm that uses an invertible neural network to model both the forward and the inverse kinematics processes. With explicit bi-directional error modeling, NIKI achieves pixel-aligned estimation accuracy and is robust to occlusions. Second, we present a new method, named D&D, that uses human body dynamics to generate physically plausible body motion and global trajectories in varied scenarios.
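
For intuition about why an invertible neural network supports bidirectional forward/inverse kinematics modeling, here is a generic RealNVP-style affine coupling block that can be evaluated forward and inverted exactly. It is a standard building block shown for illustration, not NIKI's actual architecture or error-embedding scheme.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """A generic invertible coupling block.  An INN built from such blocks can be
    run forward (e.g., pose -> keypoints plus error terms) and inverted exactly
    (keypoints -> pose), which is the bidirectional property an IK/FK INN relies on."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=-1)
        y2 = x2 * torch.exp(torch.tanh(s)) + t   # invertible affine transform of x2
        return torch.cat([x1, y2], dim=-1)

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.net(y1).chunk(2, dim=-1)
        x2 = (y2 - t) * torch.exp(-torch.tanh(s))  # exact analytic inverse
        return torch.cat([y1, x2], dim=-1)
```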

Organizers: Michael Black