
부스트캠프 AI Tech/[Week7] Computer Vision

[Week7] Conditional generative model [Day3]

1. Conditional generative model

  • Translating an image given "condition"
  • We can explicitly generate an image corresponding to a given "condition"!
  • The probability of an image X given a "sketch of a bag" as the condition, i.e., P(X | sketch of a bag)

 

1.1 Generative model vs. Conditional generative model

  • A generative model generates random samples
  • A conditional generative model generates random samples given a condition
  • Example of conditional generative model - audio super resolution
    • P (high resolution audio | low resolution audio)
    • P (English sentence | Chinese sentence)
    • P (A full article | An article's title and subtitle)

 

*Generative Adversarial Network

  • "Criminal" (Generator) crafts, and "Police" (Discriminator) detects counterfeit
  • Adversarial training : the generator learns to produce better fake data while the discriminator learns to better distinguish fake data from real data, so both improve together
  • (Basic) GAN vs. Conditional GAN
    • C : Conditional term
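The only structural change from a basic GAN is that the condition term C is fed into both networks alongside their usual inputs. A minimal numpy sketch (shapes, names, and the one-hot encoding are illustrative assumptions, not a specific paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def generator_input(z, c):
    """A basic GAN generator sees only the noise z; a conditional GAN
    concatenates the condition c (here a one-hot class label) to z."""
    return np.concatenate([z, c])

def discriminator_input(x, c):
    """The conditional discriminator also receives c, so it judges
    whether x is realistic *for that particular condition*."""
    return np.concatenate([x.ravel(), c])

z = rng.standard_normal(100)      # noise vector
c = np.eye(10)[3]                 # condition: one-hot encoding of class 3
x = rng.standard_normal((8, 8))   # a (real or fake) sample
g_in = generator_input(z, c)      # shape (110,)
d_in = discriminator_input(x, c)  # shape (74,)
```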



1.2 Conditional GAN and image translation

  • Image-to-Image translation
  • Translating an image into another image
  • Many applications : Style transfer, Super resolution, Colorization ...!
    • Style transfer

 

1.3 Example : Super resolution

  • Super resolution - low resolution to high resolution
  • An example of conditional GAN
    • input : low resolution image
    • output : fake high resolution image
    • discriminator : distinguishes real HR images from fake HR images
  • Naive Regression model
  • Comparison of MAE, MSE and GAN losses in an image manifold
    • MAE/MSE measure per-pixel intensity differences, and many plausible patches are similarly close under such a metric
    • The model therefore produces a "safe" average image : the loss is low, but the result is somewhat blurry, and patches far from the average incur a large distance
    • The GAN loss does not suffer from this averaging effect
    • Because the generator is trained by comparison against real data, an output close to any real example already achieves a low loss
  • What "averaging answers" means
    • Conditions
      • Task : Colorizing the given image
      • Real images contain only two colors, "black" or "white"
    • L1 loss is likely to produce a gray output, because it settles on the average of white and black
    • GAN loss produces black or white outputs : trained against real data, it is unlikely to generate data it has never seen (gray)
  • GAN loss for Super Resolution (SRGAN)
    • SRGAN generates more "realistic" and sharper images than SRResNet (trained with MSE loss)
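The averaging effect above can be checked numerically: when real pixels are only ever black (0.0) or white (1.0), the gray value 0.5 minimizes MSE even though it never occurs in real data. A small numpy check (toy values only):

```python
import numpy as np

# Real pixels take only two values: black (0.0) or white (1.0), 50/50.
targets = np.array([0.0, 1.0] * 500)

def mse(pred):
    """MSE of predicting the constant value `pred` for every pixel."""
    return np.mean((targets - pred) ** 2)

# The "safe" gray average beats both pure colors under MSE...
assert mse(0.5) < mse(0.0) and mse(0.5) < mse(1.0)
# ...even though gray never appears in the real data, so a GAN
# discriminator trained on these pixels would flag it as fake.
print(mse(0.5), mse(0.0))  # 0.25 0.5
```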



 

 

2. Image translation GANs

 

2.1 Pix2Pix

  • Translating an image to a corresponding image in another domain (e.g., style)
  • Example of a conditional GAN where the condition is given as an input image


  • Loss function of Pix2Pix
    • L1 loss produces blur, but it works as a reasonable guide toward the ground truth
    • GAN loss produces realistic outputs
    • Pix2Pix combines the two
    • L1 loss compares the output directly against the ground truth y (supervised learning); the GAN loss never compares the output to y directly, so on its own it cannot guarantee the output matches the paired ground truth
    • Hence the L1 loss term keeps the output close to the ground truth, and the GAN loss term makes it more realistic
  • Role of GAN loss in Pix2Pix
    • Pix2Pix generates realistic images by using both GAN loss and L1 loss
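The combined generator objective can be sketched in a few lines of numpy. The non-saturating log form of the GAN term and the helper names are illustrative choices here; λ = 100 is the L1 weighting reported in the Pix2Pix paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def l1_loss(y_hat, y):
    """L1 term: compare the output directly against the paired ground truth."""
    return np.mean(np.abs(y_hat - y))

def pix2pix_g_loss(d_fake_logit, y_hat, y, lam=100.0):
    """Generator loss = GAN term (fool the discriminator)
    + lam * L1 term (stay close to the ground truth y)."""
    gan_term = -np.log(sigmoid(d_fake_logit) + 1e-12)
    return gan_term + lam * l1_loss(y_hat, y)

y = np.ones((4, 4))                 # paired ground truth
y_hat = np.full((4, 4), 0.9)        # generator output, close to y
loss = pix2pix_g_loss(d_fake_logit=2.0, y_hat=y_hat, y=y)
```

With the large λ, the L1 term dominates early training (keeping outputs near the ground truth), while the GAN term sharpens details.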

 

 

 

2.2 CycleGAN

  • Pix2Pix requires "pairwise data", which is what made supervised learning possible
  • However, obtaining a paired dataset is hard and sometimes impossible
  • CycleGAN was proposed to solve this problem



*CycleGAN

  • CycleGAN enables the translation between domains with non-pairwise datasets
  • Does not require direct correspondences between pairs (e.g., a Monet painting and a real photo)
  • Given large unpaired datasets, it learns the relationship by translating back and forth between the two domains


  • Loss function of CycleGAN
    • CycleGAN loss = GAN loss (in both direction) + Cycle-consistency loss
    • GAN loss : Translate an image in domain A to B, and vice versa
    • Cycle-consistency loss : Enforces that an image and its version translated back and forth should be the same -> going from A to B and then back from B to A must reproduce the original A
  • GAN loss in CycleGAN
    • GAN loss does the translation (generate X -> Y, judged by discriminator D_Y; the opposite direction works the same way)
    • CycleGAN has two GAN losses, one per direction (X -> Y, Y -> X)
    • GAN loss : L(D_X) + L(D_Y) + L(G) + L(F)
    • G, F : generators
    • D_X, D_Y : discriminators
    • What if only the GAN loss is used?
      • Mode collapse occurs : the generator keeps emitting a single output regardless of the input
      • Once the X -> Y generator's images fool the discriminator, learning stalls
      • The effect is similar to falling into a local minimum
      • Solution : Cycle-consistency loss
  • Cycle-consistency loss to preserve contents

    • Looks not only at the style but also at the contents inside
    • After X -> Y -> X, the image that comes back must equal the original X (the goal is to preserve the internal contents)
    • No supervision required (i.e., self-supervision)
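The cycle-consistency term follows directly from its definition, ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1. A numpy sketch; the toy "generators" below are only illustrative stand-ins for the real networks:

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    """||F(G(x)) - x||_1 + ||G(F(y)) - y||_1 : translating X -> Y -> X
    (and Y -> X -> Y) must reproduce the original, preserving contents."""
    forward = np.mean(np.abs(F(G(x)) - x))
    backward = np.mean(np.abs(G(F(y)) - y))
    return forward + backward

# Toy check: if G and F happen to be exact inverses, the loss is zero.
G = lambda a: a + 1.0   # stand-in for the X -> Y generator
F = lambda a: a - 1.0   # stand-in for the Y -> X generator
x, y = np.zeros(4), np.ones(4)
print(cycle_consistency_loss(x, y, G, F))  # 0.0
```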

 

 

 

2.3 Perceptual loss

  • GAN is hard to train : alternating training is required (generator update, discriminator update)
  • Is there another way to get a high-quality image without a GAN?
  • GAN loss
    • Relatively hard to train and code (Generator & Discriminator adversarially improve)
    • Does not require any pre-trained network
    • Since no pre-trained network is needed, it can be applied to various applications
  • Perceptual loss
    • Simple to train and code (trained only with simple forward & backward computation)
    • Requiring a pre-trained network to measure a learned loss
    • Observation: Pre-trained classifiers have filter responses similar to humans’ visual perception
    • By utilizing such pre-trained “perception,” we may transform an image to a perceptual space
    • Style transfer examples with the perceptual loss

*Perceptual loss

  • Image Transform Net : outputs a transformed image from the input; one style is fixed per network, determined by the training data
  • Loss Network : Compute style and feature losses between a generated image and targets 
    • Typically, the VGG model pre-trained on ImageNet is used
    • Fixed while training the Image Transform Net
  • The input is transformed into y_hat, which is fed into an image classification network (VGG-16) to extract features
  • Losses against the Style Target and the Content Target are computed on these features
  • During backpropagation, the Loss Network's weights stay fixed; only the Image Transform Net is updated so that y_hat changes
  • Feature reconstruction loss (for the Content Target)
    • The output image and target image are fed into the loss network
    • Compute L2 loss between the feature maps of output and target images
    • In f(X) -> y_hat, it is f (the Image Transform Net) that gets trained
    • This loss measures how well the contents are preserved
    • The loss is measured between the features of y_hat and the features of the content target X


  • Style reconstruction loss
    • Similarly, the output image and target image are fed into the loss network
    • Compute L2 loss between the gram matrices generated from the feature maps
    • The loss is measured between the Gram matrices of y_hat's features and of the style target's features
    • The style needs some representation -> Gram matrices : channel-wise feature statistics computed over the spatial dimensions of a feature map
    • The style is encoded through these Gram matrices
    • Gradients are computed so that the transformed image's Gram matrices follow the style target's Gram matrices
    • The style reconstruction loss is sometimes omitted (e.g., for super resolution)
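The Gram matrix is small enough to sketch directly. A numpy version under the assumption of a (C, H, W) feature-map layout, normalized by C*H*W as in Johnson et al.'s perceptual-loss paper; the key property is that it discards spatial layout:

```python
import numpy as np

def gram_matrix(feat):
    """feat : (C, H, W) feature map taken from the loss network (VGG).
    Entry [i, j] correlates channels i and j summed over all spatial
    positions, so spatial layout is discarded and only a "style"
    statistic remains. Normalized by C*H*W."""
    C, H, W = feat.shape
    f = feat.reshape(C, H * W)
    return f @ f.T / (C * H * W)       # shape (C, C), symmetric

def style_loss(feat_hat, feat_style):
    """Squared Frobenius distance between the two Gram matrices."""
    return np.sum((gram_matrix(feat_hat) - gram_matrix(feat_style)) ** 2)

rng = np.random.default_rng(0)
feat = rng.standard_normal((3, 4, 4))  # toy feature map
G = gram_matrix(feat)                  # (3, 3)
```

Because the spatial positions are summed out, rearranging pixels of the feature map leaves the Gram matrix unchanged, which is exactly why it captures style rather than contents.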

 

 

 

 

3. Various GAN applications

  • Application examples of conditional GANs
  • Deepfake
    • Converting human face or voice in video into another face or voice
  • Face de-identification
    • Protecting privacy by slightly modifying human face images
    • Results look similar to humans but are hard for computers to identify as the same person
  • Face anonymization with passcode
    • De-identifying human face with a specific passcode
    • Only those authorized with the passcode are able to decrypt and recover the original image

3.3 Video translation (Manipulation)

  • Pose transfer
  • Video-to-video translation
  • Video-to-game: controllable characters