One-hot encoding = converting categorical data into numeric form so the model can interpret it
onehot_sex = pd.get_dummies(ndf['sex'])
ndf = pd.concat([ndf, onehot_sex], axis = 1)
onehot_embarked = pd.get_dummies(ndf['embarked'], prefix = 'town')
ndf = pd.concat([ndf, onehot_embarked], axis = 1)
ndf.drop(['sex', 'embarked'], axis=1, inplace=True)
print(ndf.head())
   survived  pclass   age  sibsp  parch  female  male  town_C  town_Q  town_S
0         0       3  22.0      1      0       0     1       0       0       1
1         1       1  38.0      1      0       1     0       1       0       0
2         1       3  26.0      0      0       1     0       0       0       1
3         1       1  35.0      1      0       1     0       0       0       1
4         0       3  35.0      0      0       0     1       0       0       1
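The same encoding steps can be tried on a small frame without the Titanic data. This is a minimal sketch: the `toy` frame and its three rows are made up for illustration. One assumption worth flagging: pandas 2.0 and later return boolean dummy columns by default, so `dtype=int` is passed here to get the 0/1 values shown in the output above.

```python
import pandas as pd

# Hypothetical toy frame standing in for ndf (not the real Titanic data)
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
})

# dtype=int forces 0/1 columns; newer pandas versions default to True/False
sex_dummies = pd.get_dummies(toy["sex"], dtype=int)
town_dummies = pd.get_dummies(toy["embarked"], prefix="town", dtype=int)

# Attach the dummy columns and drop the original categorical columns
encoded = pd.concat([toy, sex_dummies, town_dummies], axis=1)
encoded = encoded.drop(columns=["sex", "embarked"])
print(encoded)
```

Each category becomes its own column, and exactly one column per original variable is 1 in every row.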
Splitting the dataset - training / test
X=ndf[['pclass', 'age','sibsp','parch', 'female', 'male', 'town_C', 'town_Q', 'town_S']]
y=ndf['survived']
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(X).transform(X)
# Split into train data and test data (30% held out for testing)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=10)
print('train data shape', X_train.shape)
print('test data shape', X_test.shape)
train data shape (499, 9)
test data shape (215, 9)
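The code above fits `StandardScaler` on all rows before splitting, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on randomly generated data; `X_demo`, `y_demo`, and the other names are placeholders, and the numbers it produces are unrelated to the results in these notes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 100 rows, 3 numeric features)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first, so the test rows stay unseen during scaling
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10
)

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training statistics

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this order the training columns have mean 0 and standard deviation 1, while the test columns are only approximately standardized, which is expected.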
Importing and fitting the KNN classification model
from sklearn.neighbors import KNeighborsClassifier
knn= KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_hat = knn.predict(X_test)
print(y_hat[0:10])
print(y_test.values[0:10])
[0 0 1 0 0 1 1 1 0 0]
[0 0 1 0 0 1 1 1 0 0]
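To make the prediction step less of a black box, here is a rough sketch of the idea behind k-nearest neighbors: find the k closest training points by Euclidean distance and take a majority vote of their labels. The function `knn_predict_one` and the toy arrays are hypothetical and written for illustration only. `KNeighborsClassifier` uses Euclidean distance and uniform votes by default, but its neighbor search and tie handling are more elaborate than this.

```python
from collections import Counter

import numpy as np


def knn_predict_one(X_train, y_train, x, k=5):
    """Predict the label of one point by majority vote of its k nearest neighbors."""
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Most frequent label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data: two well-separated clusters (made up for this sketch)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict_one(X_toy, y_toy, np.array([0.05, 0.05]), k=3))
print(knn_predict_one(X_toy, y_toy, np.array([5.0, 5.1]), k=3))
```

Because the vote depends on distances, features on larger scales would dominate, which is why the features are standardized before fitting.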
from sklearn import metrics
knn_matrix = metrics.confusion_matrix(y_test, y_hat)
print(knn_matrix)
[[109  16]
 [ 25  65]]
knn_report = metrics.classification_report(y_test, y_hat)
print(knn_report)
              precision    recall  f1-score   support

           0       0.81      0.87      0.84       125
           1       0.80      0.72      0.76        90

    accuracy                           0.81       215
   macro avg       0.81      0.80      0.80       215
weighted avg       0.81      0.81      0.81       215
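The figures in the report follow directly from the confusion matrix printed earlier. In scikit-learn's layout the rows are the true labels and the columns are the predicted labels, so for class 1 (survived) the matrix gives TN = 109, FP = 16, FN = 25, TP = 65. The short check below recomputes the class-1 row and the accuracy by hand. The variable names are just for this sketch.

```python
# Confusion matrix from the output above: rows = true label, columns = predicted
cm = [[109, 16],
      [25, 65]]

tn, fp = cm[0]  # true class 0: correctly rejected, wrongly predicted as 1
fn, tp = cm[1]  # true class 1: missed, correctly predicted as 1

precision_1 = tp / (tp + fp)   # 65 / 81
recall_1 = tp / (tp + fn)      # 65 / 90
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tn + tp) / (tn + fp + fn + tp)  # 174 / 215

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.2f}")
```

The recall of 0.72 for class 1 means the model misses 25 of the 90 actual survivors in the test set, which is the weakest number in the report.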