[DACON] Wine Quality Classification Competition

DACON: Wine Quality Classification Competition

dacon.io

Among the hands-on practice projects hosted on DACON, this post works through one of the most basic classification projects: the 'Wine Quality Classification Project'.

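Learning objectives

Data preprocessing suited to the analysis goal
EDA (exploratory data analysis) through data visualization
Feature engineering suited to the analysis goal
Class prediction with classification models (bagging)
Model performance evaluation
Hyperparameter tuning
Ensembling classification models through voting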

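Data description

index: row identifier
quality: quality score (0-10)
fixed acidity: fixed (non-volatile) acidity
volatile acidity: volatile acidity
citric acid: citric acid content
residual sugar: sugar remaining in the wine after fermentation
chlorides: chloride content
free sulfur dioxide: free sulfur dioxide
total sulfur dioxide: total sulfur dioxide
density: density
pH: pH (hydrogen ion concentration)
sulphates: sulphate content
alcohol: alcohol content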

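The learning objectives and data description are listed above.

Using the datasets provided by DACON, the goal is to predict the wine's 'quality' as the target.

The language is Python, and everything was run on Google Colab.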

Install & Import Packages

☑️ For convenience, install and load all the packages used in this post up front.

!pip install pandas-profiling==3.3.0
!pip install xgboost

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import scipy.stats as stats

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

Data

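☑️ Take a quick first look at the data.

# mount_dir is the folder that holds the competition CSV files (for example a mounted Google Drive folder); it must be set before this cell runs
train = pd.read_csv(mount_dir + "/train.csv")
test = pd.read_csv(mount_dir + "/test.csv")
submission = pd.read_csv(mount_dir + "/sample_submission.csv")

🚫 Note that training and validation must use train only; test and submission are reserved for the final evaluation.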

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5497 entries, 0 to 5496
Data columns (total 14 columns):
#   Column                Non-Null Count  Dtype
---  ------                --------------  -----
0   index                 5497 non-null   int64
1   quality               5497 non-null   int64
2   fixed acidity         5497 non-null   float64
3   volatile acidity      5497 non-null   float64
4   citric acid           5497 non-null   float64
5   residual sugar        5497 non-null   float64
6   chlorides             5497 non-null   float64
7   free sulfur dioxide   5497 non-null   float64
8   total sulfur dioxide  5497 non-null   float64
9   density               5497 non-null   float64
10  pH                    5497 non-null   float64
11  sulphates             5497 non-null   float64
12  alcohol               5497 non-null   float64
13  type                  5497 non-null   object
dtypes: float64(11), int64(2), object(1)
memory usage: 601.4+ KB

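train.head()

# The outputs from here on no longer list 'index', so the identifier column is assumed to be dropped at this point
train = train.drop(columns = ['index'])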

train.nunique()

quality                   7
fixed acidity           106
volatile acidity        179
citric acid              89
residual sugar          309
chlorides               205
free sulfur dioxide     127
total sulfur dioxide    271
density                 970
pH                      107
sulphates               106
alcohol                 103
type                      2
dtype: int64

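➡️ 'quality' and 'type', which have 7 and 2 unique values respectively, are treated as categorical variables.

➡️ The independent variables therefore consist of 11 continuous variables and 1 categorical variable, and the dependent variable is a single categorical variable.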

train['quality'].value_counts()

6    2416
5    1788
7     924
4     186
8     152
3      26
9       5
Name: quality, dtype: int64

➡️ The distribution of quality, the dependent variable to analyze and predict, is shown above.

EDA

🚫 The dependent variable quality is heavily class-imbalanced.

plt.figure(figsize = (6, 5))
plt.suptitle("Target: Countplot", fontsize = 20)
plt.style.use('ggplot')

# Distribution of the target variable
sns.countplot(x = train['quality'])

plt.tight_layout()
plt.show()
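To put a number on the imbalance, here is a minimal sketch that turns the class counts printed by `value_counts()` into shares and computes the accuracy a model would reach by always predicting the most frequent class. The counts are copied from the output above; the variable names are my own.

```python
# Class counts copied from the train['quality'].value_counts() output.
quality_counts = {6: 2416, 5: 1788, 7: 924, 4: 186, 8: 152, 3: 26, 9: 5}

total = sum(quality_counts.values())
shares = {label: count / total for label, count in quality_counts.items()}

for label in sorted(shares):
    print(f"quality {label}: {shares[label]:.2%}")

# Accuracy of a model that always predicts the most frequent class.
majority_baseline = max(shares.values())
print(f"majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any accuracy reported later is easier to judge against this baseline, since a classifier that ignores the features entirely already gets about 44% of the rows right.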

🚫 Most of the independent variables contain many outliers and do not follow a normal distribution, so rescaling will likely be needed later.

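# numdata: the 11 continuous (float64) features; its definition is not shown in the post, so this line is an assumption
numdata = train.select_dtypes(include = 'float64')

plt.figure(figsize = (20, 15))
plt.suptitle("Boxplots", fontsize = 40)
plt.style.use('ggplot')

for i, col in enumerate(numdata):
    plt.subplot(3,4,i+1)
    plt.title(col)
    plt.boxplot(numdata[col])

plt.tight_layout()
plt.show()

plt.figure(figsize = (20, 15))
plt.suptitle('Distplots', fontsize = 40)
plt.style.use('ggplot')

for i, col in enumerate(numdata):
    ax = plt.subplot(3,4,i+1)
    # histplot with a KDE curve stands in for the deprecated sns.distplot
    sns.histplot(numdata[col], kde = True, ax = ax)

plt.show()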

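🚫 Some of the features appear to be correlated with each other.

# 'type' is still a string column here, so only numeric columns go into the correlation matrix
corr = train.select_dtypes(include = 'number').corr()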

plt.figure(figsize = (8,8))
sns.heatmap(corr, annot = True, fmt = '.2f', cmap = 'Blues' )

Strong positive correlation between free sulfur dioxide & total sulfur dioxide
Strong negative correlation between density & alcohol
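Reading pairs off a heatmap by eye gets tedious with a dozen columns. As a rough sketch, the helper below lists every column pair whose absolute Pearson correlation reaches a threshold. The function name `strong_pairs`, the 0.6 threshold, and the synthetic demo frame are all my own assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def strong_pairs(df, threshold=0.6):
    """Return (col_a, col_b, r) for column pairs with |Pearson r| >= threshold."""
    corr = df.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Synthetic stand-in data, NOT the competition data.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.3, size=500),   # strongly positive with a
    "c": -a + rng.normal(scale=0.3, size=500),  # strongly negative with a
    "d": rng.normal(size=500),                  # unrelated noise
})

for col_a, col_b, r in strong_pairs(demo):
    print(f"{col_a} & {col_b}: r = {r:+.2f}")
```

On the wine data the same call would be `strong_pairs(train.select_dtypes(include='number'))`, assuming `train` is loaded as in this post.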

Feature Engineering

☑️ First, encode the categorical independent variable type.

le = LabelEncoder()
train['type'] = le.fit_transform(train['type'])
train['type'].value_counts()

1 4159
0 1338
Name: type, dtype: int64

☑️ As noted earlier, most of the independent variables contain many outliers, so the next steps are log scaling, standard scaling, and outlier removal.

def logarithm_scaler(df):
    # log is only defined for values greater than 0, so columns whose minimum is 0 get a square-root transform instead
    for col in df.columns:
        if df[col].min() == 0:
            df[col] = np.sqrt(df[col])
        else:
            df[col] = np.log(df[col])
    return df

# Columns excluded from the log transform: quality, total sulfur dioxide, type
transform_data = train.drop(columns = ['quality', 'total sulfur dioxide', 'type'])

train[transform_data.columns] = logarithm_scaler(transform_data)
train.head()
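The square-root fallback exists because `log(0)` is undefined. One alternative, which this post does not use, is `np.log1p`, which computes `log(1 + x)` and therefore maps 0 to 0 without a special case. The sketch below compares the two on a made-up right-skewed array that contains an exact zero.

```python
import numpy as np

# Hypothetical right-skewed feature that contains an exact zero.
values = np.array([0.0, 0.1, 0.5, 2.0, 10.0, 60.0])

with np.errstate(divide="ignore"):
    plain_log = np.log(values)      # log(0) evaluates to -inf
shifted_log = np.log1p(values)      # log(1 + x), finite for every x >= 0

print(plain_log[0])     # -inf
print(shifted_log[0])   # 0.0
print(np.round(shifted_log, 3))
```

The trade-off is that `log1p` compresses small values less than a plain log does, so whether it suits a given feature is something to check on the plots rather than assume.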

transform_data = train.drop(columns = ['quality', 'type'])

standard_scaler = StandardScaler()
std_data = standard_scaler.fit_transform(transform_data)

train[transform_data.columns] = std_data
train.head()

before_remove = len(train)
transform_data = train.drop(columns = ['quality', 'type'])

# Replace standardized values more than 3 standard deviations from the mean with NaN and keep the rest,
# then drop every row that contains a NaN
train[transform_data.columns] = np.where(abs(std_data) > 3, np.nan, std_data)
train = train.dropna()
after_remove = len(train)

print("Removed outliers: {0} rows".format(before_remove - after_remove))

Removed outliers: 367 rows
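The same 3-sigma rule can be wrapped in a small function that works on unscaled data as well. This is only a sketch: the name `drop_zscore_outliers` and the synthetic frame are my own, and the rule itself (drop a row if any selected column is more than 3 standard deviations from that column's mean) mirrors the step above.

```python
import numpy as np
import pandas as pd


def drop_zscore_outliers(df, cols, z_max=3.0):
    """Drop rows where any column in `cols` lies more than z_max std devs from its mean."""
    z = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
    keep = (z.abs() <= z_max).all(axis=1)
    return df[keep]


# Synthetic stand-in data: 200 ordinary rows plus one extreme row.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "x": rng.uniform(-1, 1, size=200),
    "y": rng.uniform(-1, 1, size=200),
})
demo.loc[len(demo)] = [25.0, 0.0]   # obvious outlier in column x

cleaned = drop_zscore_outliers(demo, ["x", "y"])
print(len(demo), "->", len(cleaned))
```

Note that a row is removed as soon as one feature is extreme, which is why a fairly large number of rows can disappear when there are 11 features.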

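☑️ Let's look at the EDA plots again for the variables after scaling and outlier removal.

➡️ Compared with the earlier plots, the distributions are clearly closer to normal and more stable, and there are fewer conspicuous outliers.

☑️ Because the dependent variable 'quality' is class-imbalanced, apply a simple data augmentation step to compensate.
(Data imbalance: quality = 3 and 9 have markedly few rows.)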

# Simply double the rows of the rare classes (done twice, so they end up at 4x)
before_aug = len(train)
train = pd.concat([train, train[train['quality'].isin([3,9])]])
train = pd.concat([train, train[train['quality'].isin([3,9])]])
print('Augmented : +', len(train) - before_aug)

☑️ With feature engineering done, randomly split the training data 0.75/0.25 into a training portion and a held-out portion, to be used later for evaluating model performance.

X = train.drop(['quality'], axis = 1)
y = train['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
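Because the rare-class rows are duplicated before this split, copies of the same row can end up on both sides of it, which tends to make the held-out accuracy look better than it is. A common alternative ordering, sketched below on synthetic data, is to split first with `stratify` and then duplicate rare rows only inside the training portion. The frame, the class sizes, and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a rare class, NOT the competition data.
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "quality": np.r_[np.full(600, 6), np.full(380, 5), np.full(20, 9)],
})

# 1) Hold out the evaluation rows first, keeping class ratios with stratify.
train_part, holdout = train_test_split(
    demo, test_size=0.25, stratify=demo["quality"], random_state=0
)

# 2) Duplicate the rare class inside the training portion only.
rare = train_part[train_part["quality"] == 9]
train_balanced = pd.concat([train_part, rare, rare])

print(train_part["quality"].value_counts().to_dict())
print(train_balanced["quality"].value_counts().to_dict())
print(holdout["quality"].value_counts().to_dict())
```

With this ordering the held-out rows never appear in the training data, so the score measured on them reflects unseen wines.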

Model

☑️ The algorithms used are Logistic Regression, K-Nearest Neighbor, Decision Tree, Random Forest, and Gradient Boosting Model.

These five are among the most basic classification algorithms.

Logistic Regression

lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)

lr_acc = accuracy_score(y_test, lr_pred)
print("Logistic Regression Accuracy: {:0.4f}".format(lr_acc))

Logistic Regression Accuracy: 0.5385

K-Nearest Neighbor

knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)

knn_acc = accuracy_score(y_test, knn_pred)
print("K-Nearest Neighbor Accuracy: {:0.4f}".format(knn_acc))

K-Nearest Neighbor Accuracy: 0.5639

Decision Tree

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)

dt_acc = accuracy_score(y_test, dt_pred)
print("Decision Tree Accuracy: {:0.4f}".format(dt_acc))

Decision Tree Accuracy: 0.5794

Random Forest

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

rf_acc = accuracy_score(y_test, rf_pred)
print("Random Forest Accuracy: {:0.4f}".format(rf_acc))

Random Forest Accuracy: 0.7011

Gradient Boosting Model

gbm = GradientBoostingClassifier()
gbm.fit(X_train, y_train)
gbm_pred = gbm.predict(X_test)

gbm_acc = accuracy_score(y_test, gbm_pred)
print("Gradient Boosting Model Accuracy: {:0.4f}".format(gbm_acc))

Gradient Boosting Model Accuracy: 0.5909

➡️ The best-performing model is Random Forest. Let's start tuning its hyperparameters.
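Each accuracy above comes from a single random split, so the ranking can shift from run to run. A hedged sketch of a steadier comparison uses `cross_val_score`, which is imported at the top of this post but not used. The data here is synthetic (`make_classification`), and the two models and their settings are picked only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, NOT the competition data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

cv_results = {}
for name, model in models.items():
    # Five stratified folds; each score is the accuracy on one held-out fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Reporting the mean together with the spread across folds makes it clearer whether one model is actually ahead or just got a lucky split.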

Results

☑️ Use Grid Search to find the best hyperparameters, then build the final model.

rf_clf = RandomForestClassifier(random_state = 111)

params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [1, 8, 24],
    'min_samples_leaf' : [1, 6, 12],
    'min_samples_split' : [2, 8, 16]
}

grid_cv = GridSearchCV(rf_clf, param_grid = params, cv = 5, n_jobs = -1)
grid_cv.fit(X_train, y_train)

print('Best hyperparameters:\n', grid_cv.best_params_)
print('Best cross-validation accuracy: {0:.4f}'.format(grid_cv.best_score_)) # validation dataset!

Best hyperparameters:
{'max_depth': 24, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best cross-validation accuracy: 0.6480

rf_clf = RandomForestClassifier(n_estimators = 200, max_depth = 24,  min_samples_leaf = 1,
                                 min_samples_split = 2, random_state = 0, n_jobs = -1)
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)
print('Accuracy: {0:.4f}'.format(accuracy_score(y_test , rf_pred))) # test dataset!

Accuracy: 0.6903
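Retyping the best parameters into a new `RandomForestClassifier` works, but it is easy to mistype a value or change the seed along the way. With the default `refit=True`, `GridSearchCV` already retrains the best configuration on the full training data and exposes it as `best_estimator_`. The sketch below shows this on synthetic data with a deliberately tiny grid; none of the values are taken from the wine experiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data, NOT the competition data.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=111),
    param_grid={"n_estimators": [20, 50], "max_depth": [2, 8]},
    cv=3,
)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already retrained on all of
# X_tr using best_params_, so nothing has to be copied over by hand.
best_model = search.best_estimator_
print(search.best_params_)
print(f"hold-out accuracy: {best_model.score(X_te, y_te):.4f}")
```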

Conclusion

☑️ To produce the results, apply the same scaling and encoding steps to the test set.

Test Set Preprocessing

test_X = test.drop(columns = ['index'])

# Log Scaling
transform_data = test_X.drop(columns = ['total sulfur dioxide', 'type'])
test_X[transform_data.columns] = logarithm_scaler(transform_data)

# Standard Scaling: reuse the scaler fitted on train (transform, not fit_transform)
transform_data = test_X.drop(columns = ['type'])
std_data = standard_scaler.transform(transform_data)
test_X[transform_data.columns] = std_data

# Label Encoding: reuse the encoder fitted on train so the category codes match
test_X['type'] = le.transform(test_X['type'])
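Whether the scaler is reused (`transform`) or refit on the test data (`fit_transform`) changes what the model sees. As a minimal sketch on synthetic numbers, the example below fits a `StandardScaler` on a "train" array and applies it to a "test" array whose mean is shifted, then compares that with refitting on the test array. The arrays and their means are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: the "test" feature is shifted relative to "train".
rng = np.random.default_rng(3)
train_demo = rng.normal(loc=10.0, scale=2.0, size=(500, 1))
test_demo = rng.normal(loc=12.0, scale=2.0, size=(200, 1))

scaler = StandardScaler().fit(train_demo)

# Reusing the fitted scaler keeps test values on the training scale.
test_reused = scaler.transform(test_demo)

# Refitting on the test set recentres it to mean 0 and hides the shift.
test_refit = StandardScaler().fit_transform(test_demo)

print(f"reused scaler, test mean: {test_reused.mean():+.2f}")
print(f"refit scaler,  test mean: {test_refit.mean():+.2f}")
```

A model trained on train-scaled features expects the first version, because there a given standardized value always corresponds to the same raw measurement.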

Test Predict & Submission

test_pred = rf_clf.predict(test_X)
submission['quality'] = test_pred

Wrapping up...

The final result is shown below.

The code above draws heavily on the code that Grateful shared on the code-sharing board of the DACON Basic Wine Quality Classification Competition.

Ensemble and Data Preprocessing(Scaling..)

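This was my first submission, so I wasn't concerned with the score or the ranking; the goal was to get a feel for the overall workflow.

The areas I plan to focus on going forward are:

Hyperparameter tuning
Scaling and outlier removal

That concludes this post.

Thank you.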