Bike Sharing Demand

Day 1

  • Ridge : 1.04966
  • Lasso : 1.04126
  • Random Forest : 0.43310
  • Gradient Boosting : 0.42744
  • Rank 535 (Top 15%)

Score : 0.42744 Rank : 535 (Top 16.49%)

Day 2

When training the models, I used sklearn's train_test_split as usual. On reflection, splitting time series data this way looked problematic, since a random split lets the model train on rows that come later than the rows it is validated on, so I decided to revise the code.

Method 1 - Sort the data by time, then use the first 80% as the train dataset and the remaining 20% as the validation dataset.
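
A minimal sketch of this chronological 80/20 split, assuming the data sits in a pandas DataFrame with a `datetime` column. The helper name `chronological_split` and the toy data are hypothetical and only stand in for the competition's training file.

```python
import numpy as np
import pandas as pd

def chronological_split(df, time_col="datetime", train_frac=0.8):
    """Sort by time and cut once: earliest rows train, latest rows validate."""
    ordered = df.sort_values(time_col).reset_index(drop=True)
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Toy data (hypothetical), shuffled on purpose to show that sorting matters
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "datetime": pd.date_range("2011-01-01", periods=100, freq="D"),
    "count": rng.integers(1, 200, size=100),
}).sample(frac=1.0, random_state=0)

train_part, valid_part = chronological_split(toy)
print(len(train_part), len(valid_part))
# Every training row is older than every validation row
print(train_part["datetime"].max() < valid_part["datetime"].min())
```

Unlike a random split, no validation row is older than any training row, so the validation score is closer to how the model is used on future data.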

  • Ridge : 1.04723 (+0.23%)
  • Lasso : 1.05059 (-0.89%)
  • Random Forest : 0.41850 (+3.38%)
  • Gradient Boosting : 0.43600 (-2%)
  • Rank 535 (Top 15%)

Score : 0.41850 Rank : 428 (Top 13.19%)

Method 2 - Combine sklearn's TimeSeriesSplit and GridSearchCV and wrap them in a single class.

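import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# rmsle is the custom error function defined elsewhere in the notebook.
class TimeSeries_GridCV_Model:
    def __init__(self, n_splits=5):
        self.n_splits = n_splits
        self.tscv = TimeSeriesSplit(n_splits=n_splits)

    def fit_and_evaluate(self, model, params, scorer, X, y):
        # Each fold trains on an expanding window of past rows and tests on the rows after it.
        for train_index, test_index in self.tscv.split(X):
            X_train, X_test = X.iloc[train_index], X.iloc[test_index]
            y_train, y_test = y.iloc[train_index], y.iloc[test_index]

            grid_search = GridSearchCV(model, params, scoring=scorer, cv=5, error_score='raise')
            # The target is fitted on the log1p scale.
            grid_search.fit(X_train.values, np.log1p(y_train))

            y_pred = grid_search.predict(X_test.values)

            # Undo log1p with expm1 so both arguments are on the original count scale.
            error = rmsle(y_test, np.expm1(y_pred), False)
            print(grid_search.best_params_)
            print("RMSLE Value:", error)

        # Only the search fitted on the last (largest) training window is returned.
        return grid_search

    def plot_parameters(self, grid_search):
        # Assumes the parameter grid contains "alpha" (Ridge, Lasso).
        fig, ax = plt.subplots()
        fig.set_size_inches(12, 5)
        df = pd.DataFrame(grid_search.cv_results_)
        df["alpha"] = df["param_alpha"]
        # The scorer returns negated errors, so flip the sign back.
        df["rmsle"] = -df["mean_test_score"]
        sns.pointplot(data=df, x='alpha', y='rmsle', ax=ax)

    def predict(self, grid_search, dataTest):
        predsTest = grid_search.predict(X=dataTest)
        return predsTest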
  • Ridge : 1.04990
  • Lasso : 1.04370
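
To make the outer loop of the class above more concrete, here is a small sketch of the folds that TimeSeriesSplit yields. The ten-row array `X_toy` is a hypothetical stand-in for time-ordered training data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten rows that are assumed to be already sorted by time
X_toy = np.arange(10).reshape(-1, 1)

folds = list(TimeSeriesSplit(n_splits=4).split(X_toy))
for train_idx, test_idx in folds:
    # The training window grows; the test rows always come right after it
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

Each fold only ever tests on rows that come after its training rows, which is the property a plain train_test_split does not guarantee.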

  • The 0 values in windspeed and humidity were not real zeros but null values recorded as 0 → filled the null values using RandomForestClassifier

Useful Techniques

pd.DatetimeIndex

  • Extract hour, day, month, and year information from the dataset's datetime feature.
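
A short sketch of that extraction, assuming the column is named `datetime` and holds timestamp strings. The three rows are made-up examples.

```python
import pandas as pd

# Hypothetical rows in the "YYYY-MM-DD HH:MM:SS" format
df = pd.DataFrame({
    "datetime": [
        "2011-01-01 00:00:00",
        "2011-07-15 13:00:00",
        "2012-12-19 23:00:00",
    ]
})

# DatetimeIndex parses the strings and exposes the calendar parts as attributes
dt = pd.DatetimeIndex(df["datetime"])
df["hour"] = dt.hour
df["day"] = dt.day
df["month"] = dt.month
df["year"] = dt.year
print(df)
```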

Detect special missing values such as 0

  • The data appeared to have no null values, but the 0 values found in the windspeed and humidity features cannot occur in the real world, so they can be identified as missing values.
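
One way to surface such placeholder zeros is sketched below, assuming the columns are named `windspeed` and `humidity`. The toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; 0 plays the role of a disguised missing value
toy = pd.DataFrame({
    "windspeed": [0.0, 8.9981, 0.0, 12.998, 16.9979, 0.0],
    "humidity": [81, 80, 0, 75, 60, 55],
    "temp": [9.84, 9.02, 9.02, 9.84, 13.12, 14.76],
})

# A plain null check reports nothing missing
print(toy.isnull().sum())

# Counting exact zeros reveals the suspicious entries
suspect_cols = ["windspeed", "humidity"]
zero_counts = (toy[suspect_cols] == 0).sum()
print(zero_counts)

# Turn the zeros into NaN so later steps treat them as missing
cleaned = toy.copy()
cleaned[suspect_cols] = toy[suspect_cols].mask(toy[suspect_cols] == 0)
print(cleaned.isnull().sum())
```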

Fill missing values with RandomForestClassifier

  • A feature with missing values can be filled in by training a RandomForestClassifier on the features that have no missing values.
  • This can produce more meaningful values than filling with simple summary statistics alone.
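
A rough sketch of this imputation for windspeed only, on synthetic data. The feature list, the toy values, and the choice to cast windspeed to string class labels are assumptions made for illustration, not the exact code used for the submission.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; windspeed takes a few discrete values, 0 means missing
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "season": rng.integers(1, 5, n),
    "temp": rng.uniform(0, 35, n).round(2),
    "humidity": rng.integers(20, 100, n),
    "windspeed": rng.choice([0.0, 6.0032, 8.9981, 12.998, 16.9979], n),
})
feature_cols = ["season", "temp", "humidity"]  # features without missing values

known = toy[toy["windspeed"] != 0]
missing = toy[toy["windspeed"] == 0]

# A classifier needs discrete labels, so each distinct windspeed becomes a class
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(known[feature_cols], known["windspeed"].astype(str))

# Predict a class for each missing row and convert the label back to a number
filled = toy.copy()
filled.loc[missing.index, "windspeed"] = clf.predict(missing[feature_cols]).astype(float)
print((filled["windspeed"] == 0).sum())
```

Because the model is a classifier, every filled value is one of the windspeed values already observed in the data. A regressor would be the alternative if in-between values were wanted.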

metrics.make_scorer

  • With metrics.make_scorer, an error function I defined myself can be used as the scorer for GridSearchCV.
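
A minimal sketch of wiring a custom metric into GridSearchCV. The function `rmsle_metric`, the Ridge grid, and the synthetic data are assumptions for illustration; the notebook's own rmsle helper may have a different signature.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

def rmsle_metric(y_true, y_pred):
    """Root mean squared log error; negative predictions are clipped to 0."""
    y_pred = np.clip(y_pred, 0, None)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# greater_is_better=False makes sklearn negate the error so that higher is better
rmsle_scorer = make_scorer(rmsle_metric, greater_is_better=False)

# Synthetic positive targets, only to make the example runnable
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = np.exp(X @ np.array([0.5, -0.2, 0.1]) + 3)

search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, scoring=rmsle_scorer, cv=3)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Since the scorer reports negated errors, the sign has to be flipped back when reading `best_score_` or `mean_test_score`.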

Check distribution

  • Visualizing the distribution of the train data's label values next to the distribution of the model's predicted values and comparing the two gives a sense of how well the model has learned.
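
One possible way to draw that comparison with matplotlib histograms. The helper `plot_distributions` and the log-normal toy arrays are hypothetical placeholders for the real labels and predictions.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_distributions(y_train, y_pred, bins=30):
    """Draw the label and prediction histograms side by side."""
    fig, (ax_left, ax_right) = plt.subplots(ncols=2, figsize=(12, 5))
    ax_left.hist(y_train, bins=bins)
    ax_left.set_title("train labels")
    ax_right.hist(y_pred, bins=bins)
    ax_right.set_title("model predictions")
    return fig

# Placeholder arrays standing in for the real labels and predictions
rng = np.random.default_rng(0)
toy_labels = rng.lognormal(mean=4.0, sigma=1.0, size=1000)
toy_preds = rng.lognormal(mean=4.0, sigma=0.8, size=1000)

fig = plot_distributions(toy_labels, toy_preds)
```

Calling `plt.show()` afterwards displays the figure. If the two shapes differ a lot, for example if the predictions miss the long right tail of the labels, the model is probably underfitting that part of the range.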
