[Data viz(3)] Seaborn

*Seaborn

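Seaborn is a statistical visualization library built on Matplotlib
- Statistical information : composition, distribution, relationships, etc.
- Because it is built on Matplotlib, plots can be customized with Matplotlib
- Known for simple syntax and clean default styling
Rather than detailed customization, these notes focus on
- new plotting methods
- connections to the theory covered earlier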
pip install seaborn==0.11
import seaborn as sns
A variety of APIs
- The APIs are grouped by visualization purpose and method
  - Categorical API
  - Distribution API
  - Relational API
  - Regression API
  - Multiples API
  - Theme API

*Seaborn basics

countplot - the representative plot of seaborn's Categorical API; it draws a bar chart of the number of rows in each discrete category

ex) sns.countplot(x='race/ethnicity', data = student)

The order of the categories can be set with order

sns.countplot(x='race/ethnicity', data = student, order=sorted(student['race/ethnicity'].unique()))

hue splits the data into groups and plots each group in its own color

sns.countplot(x='race/ethnicity',data=student,
              hue='gender', 
              order=sorted(student['race/ethnicity'].unique())
             )

The colors can be changed with palette

sns.countplot(x='race/ethnicity',data=student,
              hue='gender', palette='Set2'
             )

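Passing a single color together with hue draws the hue groups as gradient shades of that color (behavior of seaborn 0.11, the version installed above).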

sns.countplot(x='gender',data=student,
              hue='race/ethnicity', color='red'
             )

To combine seaborn with matplotlib, pass ax to draw a seaborn plot on a specific Axes

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.countplot(x='race/ethnicity',data=student,
              hue='gender', 
              ax=axes[0]
             )

sns.countplot(x='gender',data=student,
              hue='race/ethnicity', 
              hue_order=sorted(student['race/ethnicity'].unique()), 
              ax=axes[1]
             )

plt.show()
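
The snippets above assume a `student` DataFrame that these notes never define. As a minimal runnable sketch, the block below builds a small made-up stand-in (the rows are hypothetical, only the column names follow the notes) and combines hue, palette, order and ax in one call.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the `student` DataFrame used in these notes.
student = pd.DataFrame({
    "race/ethnicity": ["group B", "group C", "group A", "group C",
                       "group B", "group A", "group C", "group B"],
    "gender": ["female", "male", "female", "female",
               "male", "male", "female", "male"],
})

# Sort the category labels once and reuse them for `order`.
group_order = sorted(student["race/ethnicity"].unique())

fig, ax = plt.subplots(figsize=(6, 4))
sns.countplot(x="race/ethnicity", data=student,
              hue="gender", palette="Set2",
              order=group_order, ax=ax)
ax.set_title("Students per group, split by gender")

# The bar heights correspond to this table of counts.
counts = student.groupby(["race/ethnicity", "gender"]).size()
print(counts)
# plt.show() would display the figure.
```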

*Categorical API

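Summary statistics of the data
- count : number of values, which also reveals missing values
- mean
- std
- quartiles : the values that split the sorted data into four equal parts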
  - min
  - 25% (lower quartile)
  - 50% (median)
  - 75% (upper quartile)
  - max
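Box plot - a standard way to examine a distribution; the box in the middle marks the 25%, median (50%) and 75% values
- interquartile range (IQR) : the range from the 25th to the 75th percentile (Q3 - Q1)
- whisker : the lines that show the range outside the box
- outlier : a value below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
- min : the smallest value that is greater than or equal to Q1 - 1.5*IQR
- max : the largest value that is less than or equal to Q3 + 1.5*IQR
- The plot can be customized with the following parameters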
  - width
  - linewidth
  - fliersize
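
The whisker and outlier rules above can be checked by hand. The sketch below uses ten made-up scores (hypothetical data), computes the quartiles with NumPy's default linear interpolation, and then draws the same values with `sns.boxplot` using the three customization parameters listed above.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Made-up scores; 120 is deliberately far from the rest.
scores = np.array([50, 55, 60, 62, 65, 68, 70, 72, 75, 120])

q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme values that are still inside the fences.
inside = scores[(scores >= lower_fence) & (scores <= upper_fence)]
whisker_min = inside.min()
whisker_max = inside.max()

# Everything outside the fences is drawn as an individual outlier point.
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]

print(q1, median, q3, iqr)
print(whisker_min, whisker_max, outliers)

fig, ax = plt.subplots(figsize=(6, 2.5))
sns.boxplot(x=scores, width=0.4, linewidth=1.5, fliersize=4, ax=ax)
# plt.show() would display the figure.
```

For these ten values the quartiles are 60.5 and 71.5, so the IQR is 11 and the fences sit at 44 and 88. The whiskers therefore end at 50 and 75, and 120 is the only outlier.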

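Violin Plot - a box plot shows the representative values well but is weak at showing the actual distribution; in a violin plot the white dot marks the median (50%) and the thick black bar in the middle marks the IQR.
- A violin plot can be misleading
  - The data itself is not continuous
  - The continuous (smoothed) representation introduces information loss and error
  - The curve can extend into ranges where there is no data
- Ways to reduce these misreadings
  - bw : how finely the distribution is drawn (ex. scott , silverman, float)
  - cut : how far the ends extend past the data (ex. float)
  - inner : how the inside is drawn (ex. box, quartile , point , stick, None)
  - scale : how the width of each violin is scaled (ex. area , count , width)
  - split : draw two hue groups as the two halves of one violin for direct comparison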
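
As a minimal sketch of these options, the block below draws a split violin plot on randomly generated scores. The data and the `course` column are hypothetical. Only cut, inner and split are set; bw and scale are left at their defaults.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
n = 40  # rows per gender

# Hypothetical data: two genders, two course groups, normally distributed scores.
student = pd.DataFrame({
    "gender": np.repeat(["female", "male"], n),
    "course": np.tile(["none", "completed"], n),
    "math score": np.concatenate([rng.normal(64, 12, n),
                                  rng.normal(69, 12, n)]),
})

fig, ax = plt.subplots(figsize=(6, 4))
sns.violinplot(
    x="course", y="math score", data=student,
    hue="gender",
    split=True,        # one half-violin per gender, sharing a center line
    inner="quartile",  # draw the quartiles as lines instead of a mini box plot
    cut=0,             # stop the density at the observed min and max
    ax=ax,
)
# plt.show() would display the figure.
```

Setting `cut=0` addresses the last misreading in the list above: the violin no longer extends beyond the observed range.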
ETC
- boxen plot
- swarmplot
- stripplot

*Distribution

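Distribution plots that can examine both categorical and continuous data
Univariate Distribution
- histplot : histogram
- kdeplot : Kernel Density Estimate
- ecdfplot : empirical cumulative distribution function
- rugplot : marks each observation with a short tick along the axis
Bivariate Distribution - examines the distribution of two variables together.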

*Relation & Regression

scatter plot
Line plot
Regplot - a scatter plot with a fitted regression line added

*Matrix plots

Heatmap - often used to visualize correlations
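
A minimal sketch of a correlation heatmap, assuming three made-up score columns: the correlation matrix comes from pandas and `sns.heatmap` only draws it.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
math = rng.normal(66, 15, 200)

# Hypothetical scores with built-in correlation between the columns.
student = pd.DataFrame({
    "math score": math,
    "reading score": math + rng.normal(3, 8, 200),
    "writing score": math + rng.normal(2, 10, 200),
})

corr = student.corr()  # pairwise Pearson correlation

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(
    corr,
    annot=True, fmt=".2f",  # print each coefficient in its cell
    vmin=-1, vmax=1,        # fix the color scale to the full correlation range
    cmap="coolwarm",
    square=True,
    ax=ax,
)
# plt.show() would display the figure.
```

Fixing `vmin` and `vmax` to -1 and 1 keeps the colors comparable across different datasets.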

*Seaborn Advanced

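Ways to increase the amount of information by combining several charts

The earlier functions drew one plot on a given ax (axes-level); the APIs in this section are figure-level and build the whole figure themselves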

*Joint Plot - shows the joint distribution of two features together with the distribution of each feature on its own

The kind option accepts hist, scatter , hex, kde, reg and resid

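*Pair Plot - a function that visualizes the pairwise relationships in a dataset

Setting hue to Species colors the points by species, which makes clusters and values easier to tell apart
Two parameters control which plot types are used
- kind : {'scatter' , 'kde', 'hist' , 'reg'}
- diag_kind : {'auto' , 'hist', 'kde' , None}
By default the pairwise grid is symmetric about the diagonal. With corner=True the upper triangle is not drawn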

*Facet Grid

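Like pairplot, a visualization that uses multiple panels
pairplot looks at feature-to-feature relationships, whereas Facet Grid can also look at relationships between the categories within features (feature's category to feature's category)
Four main functions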
- catplot : Categorical
- displot : Distribution
- relplot : Relational
- lmplot : Regression