Titanic - Machine Learning from Disaster

  1. [Baseline] Logistic Regression with Only Simple Data Preprocessing
    • Complete missing values
    • Drop features
    • Convert categorical features
    • Model : Logistic Regression

Score : 0.76555

Rank : 12142 (Top 78%)
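
A minimal sketch of what such a baseline might look like. The tiny DataFrame, the column that is dropped, and the fill strategies (median Age, most frequent Embarked) are assumptions for illustration, not the exact steps behind the score above.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical miniature stand-in for the Kaggle train.csv (same column names).
train = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
    "Pclass":   [3, 1, 3, 3, 1, 2, 3, 2],
    "Name":     ["a", "b", "c", "d", "e", "f", "g", "h"],
    "Sex":      ["male", "female", "female", "male", "female", "male", "male", "female"],
    "Age":      [22.0, 38.0, None, 35.0, 54.0, None, 2.0, 27.0],
    "Fare":     [7.25, 71.28, 7.92, 8.05, 51.86, 13.0, 21.07, 11.13],
    "Embarked": ["S", "C", "S", None, "S", "Q", "S", "C"],
})

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Complete missing values (assumed strategies).
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    # Drop a feature the baseline does not use (assumed choice).
    df = df.drop(columns=["Name"])
    # Convert categorical features to integers.
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    return df

clean = preprocess(train)
X, y = clean.drop(columns=["Survived"]), clean["Survived"]
model = LogisticRegression(max_iter=1000).fit(X, y)
train_accuracy = model.score(X, y)
```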

  2. Logistic Regression with Feature Engineering

Score : 0.77272

Rank : 10045 (Top 64%)

  3. Various Models with Deeper Feature Engineering

Support Vector Machine

Score : 0.76794

K - Nearest Neighbors

Score : 0.76794

Decision Tree

Score : 0.76076

Random Forest

Score : 0.76076

  4. Ensemble (Voting) of RandomForest, ExtraTrees, SVC, AdaBoost, and GradientBoosting, Tuned with GridSearchCV, with Deeper Feature Engineering

Score : 0.78229

Rank : 2707 (Top 16%)
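
The ensemble step could be sketched roughly as follows. The synthetic data, the deliberately tiny parameter grids, and the use of hard (majority) voting are all assumptions; the post does not state the actual grids or voting mode.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the engineered Titanic feature matrix.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grids, kept small so the sketch runs quickly.
candidates = {
    "rf":  (RandomForestClassifier(random_state=0), {"n_estimators": [50], "max_depth": [3, 5]}),
    "et":  (ExtraTreesClassifier(random_state=0), {"n_estimators": [50], "max_depth": [3, 5]}),
    "svc": (SVC(random_state=0), {"C": [0.1, 1.0]}),
    "ada": (AdaBoostClassifier(random_state=0), {"n_estimators": [25, 50]}),
    "gb":  (GradientBoostingClassifier(random_state=0), {"n_estimators": [50], "max_depth": [2, 3]}),
}

# Tune each model separately and keep its best estimator.
best = []
for name, (clf, grid) in candidates.items():
    search = GridSearchCV(clf, grid, cv=3, scoring="accuracy")
    search.fit(X, y)
    best.append((name, search.best_estimator_))

# Hard voting: each tuned model casts one vote and the majority class wins.
voting = VotingClassifier(estimators=best, voting="hard")
voting.fit(X, y)
predictions = voting.predict(X)
```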

Useful Techniques

Analyze by pivoting features

  • Grouping the data by a specific feature was useful for understanding relationships in the data.
  • Here it was used to check the correlation between each feature (Pclass, Sex, SibSp, Parch) and the target (Survived), that is, whether the feature affects survival.
  • It was also used to check whether the newly created Has_Cabin feature was likely to help.
train[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
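
The Has_Cabin check mentioned above might look like the following sketch. The sample rows are made up, and defining Has_Cabin as "Cabin is not missing" is an assumption about how the feature was built.

```python
import pandas as pd

# Hypothetical rows standing in for the Kaggle training data.
train = pd.DataFrame({
    "Cabin":    ["C85", None, "E46", None, None, "B28"],
    "Survived": [1, 0, 1, 0, 1, 1],
})

# 1 when a cabin is recorded, 0 when the value is missing (assumed definition).
train["Has_Cabin"] = train["Cabin"].notna().astype(int)

# Same pivot pattern as above: mean survival rate per group.
has_cabin_pivot = (
    train[["Has_Cabin", "Survived"]]
    .groupby("Has_Cabin", as_index=False)
    .mean()
    .sort_values(by="Survived", ascending=False)
)
print(has_cabin_pivot)
```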

FacetGrid

  • FacetGrid creates multiple subplots in a grid and visualizes data in each one, which was useful for examining several aspects of the dataset.
  • Here it was used to examine the distribution of Age by Pclass and Survived, so that the null values of the Age feature could be replaced with values estimated from related features.
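import matplotlib.pyplot as plt
import seaborn as sns
# One Age histogram per (Pclass, Survived) combination.
grid = sns.FacetGrid(train, col="Survived", row="Pclass", height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()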

Null value estimation

  • To fill the null values of the Age feature, a simple imputation was not used. Instead, each null Age was replaced with the median of the Age distribution for the matching Pclass and Survived values.
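
One way to express this group-wise median fill in pandas is sketched below. The sample rows are invented, and because Survived exists only in the training data, the sketch fills the training frame only.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; two Age values are missing.
train = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 0],
    "Age":      [30.0, 40.0, np.nan, 20.0, 24.0, 28.0, np.nan],
})

# Median Age within each (Pclass, Survived) group, aligned to the original rows.
group_median = train.groupby(["Pclass", "Survived"])["Age"].transform("median")

# Fill only the missing ages; observed ages are left untouched.
train["Age"] = train["Age"].fillna(group_median)
```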

Creating Band features from numerical features

  • Rather than using numerical features such as Age and Fare as they are, their values were grouped into ranges, and Age Band and Fare Band features were created and used in place of the original features.
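
A small sketch of the banding idea. The column names AgeBand and FareBand, the number of bins, and the choice of equal-width bins for Age versus quantile bins for Fare are assumptions rather than details confirmed by the post.

```python
import pandas as pd

# Hypothetical values covering a wide range of ages and fares.
train = pd.DataFrame({
    "Age":  [2.0, 15.0, 22.0, 35.0, 47.0, 58.0, 71.0, 80.0],
    "Fare": [7.25, 7.90, 8.05, 13.0, 26.0, 31.0, 71.3, 512.3],
})

# Equal-width bins for Age, equal-frequency (quantile) bins for Fare.
# labels=False returns the integer bin index instead of an interval.
train["AgeBand"] = pd.cut(train["Age"], bins=5, labels=False)
train["FareBand"] = pd.qcut(train["Fare"], q=4, labels=False)

# Use the bands in place of the raw numerical columns.
train = train.drop(columns=["Age", "Fare"])
```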

Creating new features by combining existing features

  • A FamilySize feature was created from the Parch and SibSp features, and an IsAlone feature was then derived from it.
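
A minimal sketch of the two derived features. Adding 1 to count the passenger themselves is a common convention and is assumed here.

```python
import pandas as pd

# Hypothetical passengers.
train = pd.DataFrame({"SibSp": [1, 0, 0, 3], "Parch": [0, 0, 2, 1]})

# Siblings/spouses + parents/children + the passenger themselves.
train["FamilySize"] = train["SibSp"] + train["Parch"] + 1

# A passenger travelling alone has a family size of exactly 1.
train["IsAlone"] = (train["FamilySize"] == 1).astype(int)
```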

Information extraction

  • The Name feature is usually dropped because it seems unlikely to help training, but Title information (honorifics such as Mr, Mrs, Dr) was extracted from it and used as a new feature.
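
The extraction can be sketched with a regular expression. The pattern below is an assumption about how the Title was pulled out; it relies on names following the "Surname, Title. Given names" layout.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" layout.
train = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Minahan, Dr. William Edward",
]})

# Capture the first word that is preceded by a space and followed by a period.
train["Title"] = train["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
```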

pd.crosstab

  • pd.crosstab builds a cross-tabulation table from DataFrame columns, which made it possible to check the frequency of each combination of feature values.
  • Here it was used to check Survived for each value of the Title feature.
print(pd.crosstab(train['Title'],train['Survived']))

Defining the SklearnHelper class

  • The parts shared across the training of each model were defined as a single class, which made the code more efficient to write.
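class SklearnHelper(object):
    """Thin wrapper exposing a common interface over scikit-learn classifiers."""

    def __init__(self, clf, seed=0, params=None):
        # Copy so the caller's dict is not mutated; also tolerates params=None.
        params = dict(params or {})
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)

    def fit(self, x, y):
        return self.clf.fit(x, y)

    def feature_importances(self, x, y):
        # Refit, then return the fitted model's importances.
        return self.clf.fit(x, y).feature_importances_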

Plotly scatterplots

  • The Plotly package was used to visualize each model's feature importances.

Plot learning curves

  • A plot_learning_curve function was defined to visualize learning curves and check for overfitting and underfitting.
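
A function of this kind could be sketched on top of scikit-learn's learning_curve. The signature, the synthetic data, and the plotting details are assumptions, not the function used in the post.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, cv=5):
    """Plot mean training and cross-validation scores against training-set size."""
    sizes, train_scores, valid_scores = learning_curve(
        estimator, X, y, cv=cv, train_sizes=np.linspace(0.1, 1.0, 5)
    )
    fig, ax = plt.subplots()
    ax.plot(sizes, train_scores.mean(axis=1), "o-", label="Training score")
    ax.plot(sizes, valid_scores.mean(axis=1), "o-", label="Cross-validation score")
    ax.set_title(title)
    ax.set_xlabel("Training examples")
    ax.set_ylabel("Score")
    ax.legend(loc="best")
    return fig

# Synthetic stand-in for the engineered Titanic features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
fig = plot_learning_curve(LogisticRegression(max_iter=1000), "Logistic Regression", X, y)
```

A training score that stays well above the cross-validation score suggests overfitting, while two curves that both plateau at a low score suggest underfitting.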

Reference

Titanic Data Science Solutions