[Week7] Multi-modal Learning [Day4]

1. Overview of multi-modal learning

Multi-modal learning : a learning approach that jointly uses data types with different characteristics (e.g. text and sound)

Challenge (1) - Different representations between modalities
- Audio - 1D Signal
- Image - 2D Array
- Text - Embedding vector
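Challenge (2) - Imbalance between heterogeneous feature spaces
Challenge (3) - A model may become biased toward a specific modality
- The model leans on the modality that is easy to learn from and ends up not using the harder ones
Despite these challenges, jointly using data that comes from different sensors is an important problem
- Matching : map different types of data into a common space and match them there
- Translating : translate one modality into another
- Referencing : modalities interact by referring to one another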

2. Multi-modal tasks (1) - Visual data & Text

2.1 Text embedding

Characters are hard to use in machine learning
Map to dense vectors
Surprisingly, generalization power is obtained by learning dense representation

*Embedding technique

word2vec - Skip-gram model
- Trained to learn W and W'
- Rows in W represent word embedding vectors
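- How the skip-gram model works
  - Each node in the input layer corresponds to one word
  - Multiplying the one-hot input vector by W selects a single row of W, which is that word's embedding
  - After applying W', the output layer learns which words tend to appear before and after the selected word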
  - Learning to predict neighboring words for understanding relationships between words
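
The row-selection step and the output layer described above can be sketched with NumPy. This is a toy forward pass only, with an assumed five-word vocabulary, random (untrained) weights, and a hypothetical helper name `skip_gram_forward`; it is not a full word2vec implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]  # toy vocabulary (assumption)
V, D = len(vocab), 3                          # vocabulary size, embedding dimension

W = rng.normal(size=(V, D))        # input-side weights: row i is the embedding of word i
W_prime = rng.normal(size=(D, V))  # output-side weights

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def skip_gram_forward(center_idx):
    """Return the embedding of the center word and a distribution over context words."""
    x = one_hot(center_idx, V)
    h = x @ W                        # one-hot times W picks out row center_idx of W
    scores = h @ W_prime             # one score per vocabulary word
    exp = np.exp(scores - scores.max())
    probs = exp / exp.sum()          # softmax over possible neighboring words
    return h, probs

h, probs = skip_gram_forward(vocab.index("cat"))
```

Training would adjust W and W' so that `probs` puts high mass on words that actually occur next to the center word.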

2.2 Joint embedding (Matching)

*Joint embedding? - Learning a common embedding space in which vectors from different modalities can be matched

Image tagging
- Can generate tags of a given image, and retrieve images by a tag keyword as well
- Combining pre-trained unimodal models
  - Text data and image data are each passed through a pre-trained model to produce feature vectors
  - Both feature vectors are brought to the same dimension
  - A joint embedding is then trained to learn the relationship between the two feature vectors
    Metric learning in visual-semantic space
  - Metric learning (repeated push & pull) : matching pairs are trained to have a shorter distance, and non-matching pairs are penalized so that their distance grows
- Interesting property
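
One common way to write the push & pull objective is a triplet margin loss. The notes do not say which loss the lecture used, so the form below, the `margin` value, and the toy 2-D vectors are all assumptions for illustration.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Pull the matching pair together, push the non-matching pair apart."""
    d_pos = np.linalg.norm(anchor - positive)   # distance to the matching item
    d_neg = np.linalg.norm(anchor - negative)   # distance to the non-matching item
    # Zero loss once the non-match is farther away than the match by at least `margin`.
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings in a shared 2-D space (hypothetical values).
image_vec = np.array([1.0, 0.0])
matching_tag = np.array([0.9, 0.1])
other_tag = np.array([0.0, 1.0])

loss_good = triplet_margin_loss(image_vec, matching_tag, other_tag)
loss_bad = triplet_margin_loss(image_vec, other_tag, matching_tag)
```

`loss_good` is zero because the matching tag is already much closer than the other tag; `loss_bad` is positive, which is the penalty that drives the embeddings apart during training.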
Image & food recipe retrieval
- Recipe text (sentence) vs. food image
  - A recipe is text, but its steps are ordered, so an RNN-family network is used to produce a fixed-size feature vector
  - cosine similarity loss : measures how well a recipe and an image match
  - semantic regularization loss : guides the cases that the cosine similarity loss alone does not resolve
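
A minimal sketch of a cosine similarity loss between a recipe vector and an image vector. The margin for non-matching pairs and the function names are assumptions; the notes only state that cosine similarity is used to relate recipe and image.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def cosine_embedding_loss(a, b, is_match, margin=0.1):
    """Low loss when matching pairs point the same way and non-matching pairs do not."""
    sim = cosine_similarity(a, b)
    if is_match:
        return 1.0 - sim               # matching pair: push similarity toward 1
    return max(0.0, sim - margin)      # non-matching pair: penalize similarity above the margin

recipe_vec = np.array([1.0, 2.0, 3.0])
image_vec = np.array([2.0, 4.0, 6.0])  # same direction as recipe_vec

loss_match = cosine_embedding_loss(recipe_vec, image_vec, is_match=True)
loss_mismatch = cosine_embedding_loss(recipe_vec, image_vec, is_match=False)
```

Because cosine similarity ignores vector length, the two vectors above count as a perfect match even though their norms differ.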

2.3 Cross modal translation (translating)

Image captioning : output a text description of an image
Captioning as image-to-sentence - CNN for image & RNN for sentence
How to combine them?
- Show and tell
  - Encoder : CNN model pre-trained on ImageNet
  - Decoder : LSTM module
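  - The feature vector from the CNN is given to the LSTM as its condition
  - A start token is fed to the LSTM to generate the first word, which then becomes the input of the next step
  - This repeats until the end token is produced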
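
The decoding loop above (start token in, predicted word fed back, stop at the end token) can be sketched as follows. To stay short, a plain tanh recurrent cell with random weights stands in for the LSTM, and the vocabulary, sizes, and function names are hypothetical. The output is therefore meaningless text; only the control flow is the point.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<start>", "<end>", "a", "dog", "runs"]   # toy vocabulary (assumption)
START, END = 0, 1
V, H = len(VOCAB), 8

# Random, untrained weights; a real model would learn these.
W_embed = rng.normal(scale=0.5, size=(V, H))
W_hh = rng.normal(scale=0.5, size=(H, H))
W_out = rng.normal(scale=0.5, size=(H, V))

def rnn_step(h, token):
    """Simplified stand-in for one LSTM step: new state and word scores."""
    h_new = np.tanh(W_embed[token] + h @ W_hh)
    return h_new, h_new @ W_out

def greedy_caption(image_feature, max_len=10):
    h = np.tanh(image_feature)      # the image feature conditions the initial state
    token = START                   # decoding begins from the start token
    words = []
    for _ in range(max_len):
        h, logits = rnn_step(h, token)
        token = int(np.argmax(logits))   # pick the most likely word
        if token == END:                 # stop once the end token is produced
            break
        words.append(VOCAB[token])       # the chosen word is the next step's input
    return words

caption = greedy_caption(rng.normal(size=H))
```

The `max_len` cap is a safeguard for the case where the end token is never produced.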
- Show, attend, and tell
  - Proposes attending only to the local features in the CNN feature map that are needed for the text being generated
    Show, attend, and tell Architecture
  - Show, attend, and tell - Soft Attention Embedding
    - Origin of attention : repeatedly visiting the features of specific regions
    - soft attention embedding : z is generated as the weighted sum of the feature map, using the attention heatmap as weights
  - Show, attend, and tell - Inference
    - step1 : the features, which carry spatial information, are first given to the LSTM as its condition (Features -> h0)
    - step2 : the LSTM outputs which parts are important, i.e. where to attend (h0 -> s1)
    - step3 : the features and s1 go through soft attention embedding, and the result is fed to the next step together with the start word token (z1, y1 -> h1)
    - step4 : the model outputs s2 (where to attend next) and the first word d1
    - step5 : repeat step3 ~ step4
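
The soft attention embedding can be sketched as a softmax over spatial locations followed by a weighted sum. The bilinear scoring function (`features @ W_att @ h`) is an assumption chosen for brevity, not necessarily the scoring network used in the original model, and the shapes are toy values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(features, h, W_att):
    """features: (L, D) for L spatial locations; h: (H,) decoder state; W_att: (D, H)."""
    scores = features @ W_att @ h   # one relevance score per location, shape (L,)
    s = softmax(scores)             # attention heatmap, non-negative and sums to 1
    z = s @ features                # weighted sum of location features, shape (D,)
    return s, z

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))  # 4 locations, 3 channels (toy sizes)
W_att = rng.normal(size=(3, 5))
h = rng.normal(size=5)

s, z = soft_attention(features, h, W_att)
```

If all scores are equal the heatmap is uniform and z reduces to the plain average of the feature map, which is a useful sanity check.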

How is the reverse of the image-to-text task performed?
- Text-to-image by generative model
  - N images can be described by one representative text
  - but producing N possible images from one text requires a generative model
  - Architecture
    - Generator
    - step1 : a network encodes the whole text into a fixed-dimensional vector
    - step2 : a Gaussian random code (which prevents identical outputs) is combined with the text feature vector and used as the input (cGAN)
    - Discriminator
    - step1 : when a generated image comes in as input, spatial features are extracted and combined with the text information, and the discriminator is trained by judging True or False
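
A small sketch of how the generator input in a conditional GAN can be assembled: the fixed text vector is concatenated with a fresh Gaussian code on every call. The dimensions and the helper name `generator_input` are hypothetical, and the generator network itself is omitted.

```python
import numpy as np

def generator_input(text_embedding, rng, noise_dim=4):
    """Concatenate the text condition with a Gaussian random code."""
    z = rng.standard_normal(noise_dim)       # new noise on every call
    return np.concatenate([text_embedding, z])

rng = np.random.default_rng(0)
text_embedding = np.array([0.2, -0.5, 0.9])  # stand-in for an encoded sentence

first = generator_input(text_embedding, rng)
second = generator_input(text_embedding, rng)
```

Both inputs share the same text part but differ in the noise part, which is what lets one sentence map to many different images.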

2.4 Cross modal reasoning (Referencing)

*Visual question answering

A task in which an image and a question are given and the model derives the answer
Question stream : the question is a sequence of text, so it is encoded with an RNN into a fixed-dimensional vector
Image stream : a pre-trained neural network is used to produce a fixed-dimensional vector
The two vectors are combined by point-wise multiplication so that the two embeddings interact. This can be viewed as a kind of joint embedding
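
The fusion step can be sketched in a few lines. The point-wise multiplication comes from the notes; the linear answer classifier `W_cls` on top, the sizes, and the random vectors are assumptions added so the example produces an output.

```python
import numpy as np

rng = np.random.default_rng(0)
D, NUM_ANSWERS = 6, 3
W_cls = rng.normal(size=(D, NUM_ANSWERS))   # hypothetical answer classifier

def vqa_scores(question_vec, image_vec):
    joint = question_vec * image_vec        # point-wise multiplication of the two streams
    logits = joint @ W_cls
    e = np.exp(logits - logits.max())
    return joint, e / e.sum()               # softmax over candidate answers

question_vec = rng.normal(size=D)   # stand-in for the RNN-encoded question
image_vec = rng.normal(size=D)      # stand-in for the CNN image feature

joint, answer_probs = vqa_scores(question_vec, image_vec)
```

Both streams must have the same dimension for the element-wise product to be defined, which is why each is first mapped to a fixed-dimensional vector.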

3. Multi-modal tasks (2) - Visual data & Audio

3.1 Sound representation

In ML and DL, sound is converted into an acoustic feature (spectrogram) before use

Fourier transform : time domain to frequency domain -> the time axis can no longer be represented, so the STFT is used instead
Short-time Fourier transform (STFT) : Fourier transform on windowed waveform results in frequency-magnitude graph
- A : window size
- B : offset
FT decomposes an input signal into constituent frequencies
- Transforms the signal so that its frequency components are well represented
Spectrogram : A stack of spectrums along the time axis
- x : time , y : frequency
- Shows how the frequency content changes over time
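
A minimal STFT magnitude spectrogram in NumPy, where `window_size` plays the role of A and `hop` the role of B above. The Hann window, the 8 kHz sample rate, and the 1000 Hz test tone are assumptions for the example.

```python
import numpy as np

def spectrogram(signal, window_size, hop):
    """Magnitude STFT: rows are frequency bins, columns are time frames."""
    window = np.hanning(window_size)
    frames = [
        signal[start:start + window_size] * window          # windowed slice of the waveform
        for start in range(0, len(signal) - window_size + 1, hop)
    ]
    spectra = np.abs(np.fft.rfft(np.stack(frames), axis=1))  # one spectrum per frame
    return spectra.T                                          # stack spectra along the time axis

sr = 8000                                  # sample rate in Hz
t = np.arange(sr) / sr                     # one second of samples
tone = np.sin(2 * np.pi * 1000 * t)        # pure 1000 Hz tone

spec = spectrogram(tone, window_size=256, hop=128)
peak_bin = int(spec[:, 0].argmax())        # strongest frequency bin in the first frame
peak_hz = peak_bin * sr / 256              # convert bin index to Hz
```

With these settings each bin spans 8000 / 256 = 31.25 Hz, so the tone lands on bin 32. A larger window gives finer frequency resolution but coarser time resolution.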

3.2 Joint embedding (Matching)

Scene recognition by sound -> sound tagging
SoundNet -> proposes a method for learning how to represent audio
- Learn audio representation from synchronized RGB frames in the same videos
- Train by the teacher-student manner
  - Transfer visual knowledge from pre-trained visual recognition models into sound modality
  - The pre-trained models output an object distribution (which objects are present) and a scene distribution (what kind of scene the video is being shot in)
  - The audio is fed to the CNN as a raw waveform
  - The last layer splits into two heads, each matched to one of the two distributions
  - A raw waveform is used instead of a spectrogram
- For a downstream target task, the pool5 feature, which carries a lot of generalizable semantic information, is the typical choice, because the last layers are optimized for the object/scene distributions
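
The two-head distribution matching can be sketched as a sum of two KL divergences between the teacher (visual) distributions and the student (audio) predictions. The notes only say the heads are "matched" to the distributions, so the use of KL divergence here, along with the function names and toy sizes, is an assumption.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def two_head_loss(teacher_obj, teacher_scene, student_obj_logits, student_scene_logits):
    # One term per head: object distribution and scene distribution.
    return (kl_divergence(teacher_obj, softmax(student_obj_logits))
            + kl_divergence(teacher_scene, softmax(student_scene_logits)))

teacher_obj = np.array([0.7, 0.2, 0.1])   # from the visual object network (toy values)
teacher_scene = np.array([0.1, 0.9])      # from the visual scene network (toy values)

# A student that reproduces the teacher exactly versus an uninformed one.
loss_perfect = two_head_loss(teacher_obj, teacher_scene,
                             np.log(teacher_obj), np.log(teacher_scene))
loss_uniform = two_head_loss(teacher_obj, teacher_scene,
                             np.zeros(3), np.zeros(2))
```

No labels appear anywhere: the targets come entirely from the visual models run on the synchronized frames.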

3.3 Cross modal translation (Translating)

Speech2Face : a model that listens to a voice and imagines the speaker's face
Module networks
- VGG-Face Model
- Face Decoder, trained in advance
Training
- Training by feature matching loss (self-supervised manner) for making features compatible
- In a video, image and audio are already paired, so no annotation is needed
Image-to-speech synthesis
- Module networks
  - Image-to-Unit Model
  - Unit-to-Speech Model
  - The two models must be made compatible with each other

3.4 Cross modal reasoning (Referencing)

Sound source localization : given a sound (e.g. people's voices) and an image, localize where in the image the sound comes from, in the form of a heatmap
- The visual net does not reduce the image to a fixed-dimensional vector; it passes spatial features on to the attention net
- Using the spatial information, the inner product between the visual feature at each location and the sound feature is taken to learn their relationship
- The inner-product value serves as the localization score
- Fully supervised version : if GT exists, a loss is applied between the localization score and the GT for supervised learning
- Unsupervised version : image and sound are already paired in a video, so this pairing is used without annotation
  - The spatial feature from the visual net is multiplied element-wise with the localization score to form the attended visual feature
  - Unsupervised metric learning can then be done between the attended visual feature and the sound feature
  - Why compare the attended visual feature with the sound feature?
    - The sound feature carries the context of the whole sound, so the visual features kept in the attended visual feature can be trained to resemble the information in the sound feature
  - Semi-supervised learning is also possible
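
The inner-product localization and the attended visual feature can be sketched as below. The softmax normalization of the scores and the toy 2x2 feature map are assumptions; the notes specify only the inner product and the element-wise weighting.

```python
import numpy as np

def localize(visual_map, sound_vec):
    """visual_map: (H, W, D) spatial features; sound_vec: (D,) sound feature."""
    scores = visual_map @ sound_vec            # inner product at every location, shape (H, W)
    e = np.exp(scores - scores.max())
    attention = e / e.sum()                    # normalized localization heatmap
    # Weight each location's feature by its score and pool into one vector.
    attended = (visual_map * attention[..., None]).sum(axis=(0, 1))
    return attention, attended

visual_map = np.zeros((2, 2, 3))
visual_map[1, 0] = [5.0, 0.0, 0.0]             # only this location resembles the sound
sound_vec = np.array([1.0, 0.0, 0.0])

attention, attended = localize(visual_map, sound_vec)
peak = np.unravel_index(attention.argmax(), attention.shape)
```

In the unsupervised setting, `attended` is what gets compared against `sound_vec` by the metric learning loss.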

Speech separation : separate speech using visual information
- Visual stream : if there are N faces, each one is embedded to extract a face feature
- Audio stream : sound features are extracted from the spectrogram
- The face features and sound features are concatenated
- The network outputs complex masks that specify how the spectrogram should be split
- Each mask is multiplied with the original spectrogram and the result is converted back to a waveform to obtain the separated speech
- Training uses an L2 loss between the clean spectrogram and the enhanced spectrogram (GT is required)
- But GT is hard to extract from a video whose audio is already mixed
- So training data is made by taking two clean-speech videos, whose GT is known, and playing them at the same time
Applications : Lip movements generation - Synthesizing Obama example
- Lip movements are generated from speech, and the face is then referenced again to produce a suitable animation
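
A toy version of the synthetic training setup and the mask-based loss. Random complex arrays stand in for real spectrograms, and mixing is done by adding spectrograms directly (valid because the STFT is linear); both choices, and the function names, are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_complex(shape):
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)

# Two clean speech spectrograms with known ground truth (toy 4x5 arrays).
clean_a = random_complex((4, 5))
clean_b = random_complex((4, 5))
mixture = clean_a + clean_b          # "playing both videos at once"

def apply_mask(mask, mixed):
    return mask * mixed              # element-wise complex multiplication

def l2_loss(pred, target):
    return float(np.mean(np.abs(pred - target) ** 2))

# The mask a perfect network would output for speaker A.
ideal_mask_a = clean_a / mixture

loss_ideal = l2_loss(apply_mask(ideal_mask_a, mixture), clean_a)
loss_no_separation = l2_loss(apply_mask(np.ones((4, 5)), mixture), clean_a)
```

An all-ones mask leaves the mixture untouched, so its loss equals the energy of the other speaker; training pushes the predicted mask toward the ideal one.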

백chef