(스마트인재개발원) Machine Learning KNN Practice
앨런튜링_
2021. 6. 4. 19:21
Goal
- Use the iris dataset
- Features: petal length, petal width, sepal length, sepal width
- Tune the number of neighbors of a KNN model (hyperparameter tuning)
2. Data Collection
- Use the iris dataset provided by sklearn
(1) Definition and Concept of the kNN Model
kNN is short for k-Nearest Neighbors.
It does no training in advance; classification is carried out only when a prediction for new data is requested (lazy learning). The basic idea is that, to decide which group a new data point belongs to, we look at the groups of the training points closest to it.
Looking at k=1, the single training point closest to the test point belongs to the circle class, so the test point is assigned the circle class. With k=3, a majority (2 of the 3 nearest neighbors) belong to the triangle class, so the test point is assigned the triangle class. With k=9, 5 of the 9 nearest neighbors belong to the triangle class, so the test point is again assigned the triangle class.
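To make the voting rule concrete, here is a minimal from-scratch sketch of the decision step. The function name knn_predict and the toy points are purely illustrative; this is not scikit-learn's implementation.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    # Euclidean distance from the new point to every training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority vote among the k neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# toy data: class 0 around the origin, class 1 around (5, 5)
X_toy = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [5, 6], [6, 5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3))   # -> 0
With k=3, the three points closest to (0.5, 0.5) are all from class 0, so the vote returns 0.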
1. Euclidean Distance
The usual straight-line distance between two points.
2. Manhattan Distance
Not the straight-line distance, but the distance traveled along the X and Y axes.
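As a quick sketch (the variable names are illustrative), the two metrics differ only in how the coordinate differences are aggregated:
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Euclidean: square root of the sum of squared differences
print(np.sqrt(np.sum((p - q) ** 2)))   # 5.0

# Manhattan: sum of absolute differences
print(np.sum(np.abs(p - q)))           # 7.0
KNeighborsClassifier uses the Minkowski metric with p=2 (Euclidean) by default; passing p=1 switches it to Manhattan distance.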
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
Data Loading
iris_data = load_iris()
iris_data
{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        ...,
        [6.5, 3. , 5.2, 2. ],
        [6.2, 3.4, 5.4, 2.3],
        [5.9, 3. , 5.1, 1.8]]),
 'target': array([0, 0, 0, ..., 2, 2, 2]),
 'frame': None,
 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
 'DESCR': '.. _iris_dataset:\n\nIris plants dataset\n-------------------- ...',
 'feature_names': ['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)'],
 'filename': 'C:\\Users\\SM2130\\anaconda3\\lib\\site-packages\\sklearn\\datasets\\data\\iris.csv'}
The full arrays are long, so each key is examined individually below.
iris_data.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
iris_data['data']
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2],
[5.7, 4.4, 1.5, 0.4],
[5.4, 3.9, 1.3, 0.4],
[5.1, 3.5, 1.4, 0.3],
[5.7, 3.8, 1.7, 0.3],
[5.1, 3.8, 1.5, 0.3],
[5.4, 3.4, 1.7, 0.2],
[5.1, 3.7, 1.5, 0.4],
[4.6, 3.6, 1. , 0.2],
[5.1, 3.3, 1.7, 0.5],
[4.8, 3.4, 1.9, 0.2],
[5. , 3. , 1.6, 0.2],
[5. , 3.4, 1.6, 0.4],
[5.2, 3.5, 1.5, 0.2],
[5.2, 3.4, 1.4, 0.2],
[4.7, 3.2, 1.6, 0.2],
[4.8, 3.1, 1.6, 0.2],
[5.4, 3.4, 1.5, 0.4],
[5.2, 4.1, 1.5, 0.1],
[5.5, 4.2, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.2],
[5. , 3.2, 1.2, 0.2],
[5.5, 3.5, 1.3, 0.2],
[4.9, 3.6, 1.4, 0.1],
[4.4, 3. , 1.3, 0.2],
[5.1, 3.4, 1.5, 0.2],
[5. , 3.5, 1.3, 0.3],
[4.5, 2.3, 1.3, 0.3],
[4.4, 3.2, 1.3, 0.2],
[5. , 3.5, 1.6, 0.6],
[5.1, 3.8, 1.9, 0.4],
[4.8, 3. , 1.4, 0.3],
[5.1, 3.8, 1.6, 0.2],
[4.6, 3.2, 1.4, 0.2],
[5.3, 3.7, 1.5, 0.2],
[5. , 3.3, 1.4, 0.2],
[7. , 3.2, 4.7, 1.4],
[6.4, 3.2, 4.5, 1.5],
[6.9, 3.1, 4.9, 1.5],
[5.5, 2.3, 4. , 1.3],
[6.5, 2.8, 4.6, 1.5],
[5.7, 2.8, 4.5, 1.3],
[6.3, 3.3, 4.7, 1.6],
[4.9, 2.4, 3.3, 1. ],
[6.6, 2.9, 4.6, 1.3],
[5.2, 2.7, 3.9, 1.4],
[5. , 2. , 3.5, 1. ],
[5.9, 3. , 4.2, 1.5],
[6. , 2.2, 4. , 1. ],
[6.1, 2.9, 4.7, 1.4],
[5.6, 2.9, 3.6, 1.3],
[6.7, 3.1, 4.4, 1.4],
[5.6, 3. , 4.5, 1.5],
[5.8, 2.7, 4.1, 1. ],
[6.2, 2.2, 4.5, 1.5],
[5.6, 2.5, 3.9, 1.1],
[5.9, 3.2, 4.8, 1.8],
[6.1, 2.8, 4. , 1.3],
[6.3, 2.5, 4.9, 1.5],
[6.1, 2.8, 4.7, 1.2],
[6.4, 2.9, 4.3, 1.3],
[6.6, 3. , 4.4, 1.4],
[6.8, 2.8, 4.8, 1.4],
[6.7, 3. , 5. , 1.7],
[6. , 2.9, 4.5, 1.5],
[5.7, 2.6, 3.5, 1. ],
[5.5, 2.4, 3.8, 1.1],
[5.5, 2.4, 3.7, 1. ],
[5.8, 2.7, 3.9, 1.2],
[6. , 2.7, 5.1, 1.6],
[5.4, 3. , 4.5, 1.5],
[6. , 3.4, 4.5, 1.6],
[6.7, 3.1, 4.7, 1.5],
[6.3, 2.3, 4.4, 1.3],
[5.6, 3. , 4.1, 1.3],
[5.5, 2.5, 4. , 1.3],
[5.5, 2.6, 4.4, 1.2],
[6.1, 3. , 4.6, 1.4],
[5.8, 2.6, 4. , 1.2],
[5. , 2.3, 3.3, 1. ],
[5.6, 2.7, 4.2, 1.3],
[5.7, 3. , 4.2, 1.2],
[5.7, 2.9, 4.2, 1.3],
[6.2, 2.9, 4.3, 1.3],
[5.1, 2.5, 3. , 1.1],
[5.7, 2.8, 4.1, 1.3],
[6.3, 3.3, 6. , 2.5],
[5.8, 2.7, 5.1, 1.9],
[7.1, 3. , 5.9, 2.1],
[6.3, 2.9, 5.6, 1.8],
[6.5, 3. , 5.8, 2.2],
[7.6, 3. , 6.6, 2.1],
[4.9, 2.5, 4.5, 1.7],
[7.3, 2.9, 6.3, 1.8],
[6.7, 2.5, 5.8, 1.8],
[7.2, 3.6, 6.1, 2.5],
[6.5, 3.2, 5.1, 2. ],
[6.4, 2.7, 5.3, 1.9],
[6.8, 3. , 5.5, 2.1],
[5.7, 2.5, 5. , 2. ],
[5.8, 2.8, 5.1, 2.4],
[6.4, 3.2, 5.3, 2.3],
[6.5, 3. , 5.5, 1.8],
[7.7, 3.8, 6.7, 2.2],
[7.7, 2.6, 6.9, 2.3],
[6. , 2.2, 5. , 1.5],
[6.9, 3.2, 5.7, 2.3],
[5.6, 2.8, 4.9, 2. ],
[7.7, 2.8, 6.7, 2. ],
[6.3, 2.7, 4.9, 1.8],
[6.7, 3.3, 5.7, 2.1],
[7.2, 3.2, 6. , 1.8],
[6.2, 2.8, 4.8, 1.8],
[6.1, 3. , 4.9, 1.8],
[6.4, 2.8, 5.6, 2.1],
[7.2, 3. , 5.8, 1.6],
[7.4, 2.8, 6.1, 1.9],
[7.9, 3.8, 6.4, 2. ],
[6.4, 2.8, 5.6, 2.2],
[6.3, 2.8, 5.1, 1.5],
[6.1, 2.6, 5.6, 1.4],
[7.7, 3. , 6.1, 2.3],
[6.3, 3.4, 5.6, 2.4],
[6.4, 3.1, 5.5, 1.8],
[6. , 3. , 4.8, 1.8],
[6.9, 3.1, 5.4, 2.1],
[6.7, 3.1, 5.6, 2.4],
[6.9, 3.1, 5.1, 2.3],
[5.8, 2.7, 5.1, 1.9],
[6.8, 3.2, 5.9, 2.3],
[6.7, 3.3, 5.7, 2.5],
[6.7, 3. , 5.2, 2.3],
[6.3, 2.5, 5. , 1.9],
[6.5, 3. , 5.2, 2. ],
[6.2, 3.4, 5.4, 2.3],
[5.9, 3. , 5.1, 1.8]])
Each sample has four features.
pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
The data still needs preprocessing: the rows are ordered by class, so they must be shuffled when splitting, otherwise the test set will not contain a balanced mix of the classes.
iris_data['target']
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
# 0 : setosa
# 1 : versicolor
# 2 : virginica
# Machine learning works on numeric data, so the classes are encoded as integers
iris_data['target_names']
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
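Since the integer labels index into target_names, they can be decoded with NumPy fancy indexing (a quick illustrative check, continuing the same session):
iris_data['target_names'][iris_data['target'][:5]]
# -> array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa'], dtype='<U10')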
iris_data['feature_names']
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
print(iris_data['DESCR'])
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
iris_data['filename']
'C:\\Users\\SM2130\\anaconda3\\lib\\site-packages\\sklearn\\datasets\\data\\iris.csv'
3. Data Preprocessing
# Building the features and labels
iris_df = pd.DataFrame(iris_data['data'], columns=iris_data['feature_names'])
X = iris_df               # features (2-D: samples x columns)
y = iris_data['target']   # labels (1-D)
# Split into train and test sets at a 70 : 30 ratio,
# shuffling the data in the process.
# train_test_split returns X_train, X_test, y_train, y_test.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3, test_size=0.3)
y_train
array([1, 0, 1, 2, 1, 0, 0, 2, 1, 1, 0, 2, 0, 2, 1, 0, 0, 2, 1, 0, 0, 1,
2, 2, 0, 2, 1, 0, 0, 2, 2, 2, 1, 1, 1, 0, 0, 2, 2, 1, 2, 1, 2, 0,
2, 0, 1, 1, 2, 2, 0, 1, 0, 1, 1, 1, 0, 2, 0, 2, 1, 2, 1, 2, 1, 0,
2, 1, 2, 1, 0, 1, 2, 0, 1, 0, 0, 0, 1, 2, 0, 0, 2, 0, 1, 2, 1, 2,
2, 1, 1, 2, 1, 0, 1, 1, 0, 1, 2, 2, 2, 0, 0, 2, 2])
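Before fitting, it is worth confirming the split sizes. With test_size=0.3 on 150 samples we expect 105 training rows and 45 test rows (a quick check, continuing the same session):
print(X_train.shape, X_test.shape)   # (105, 4) (45, 4)
print(y_train.shape, y_test.shape)   # (105,) (45,)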
X_train
5. Model Selection and Hyperparameter Tuning
knn = KNeighborsClassifier()   # n_neighbors defaults to 5
knn.fit(X_train, y_train)
KNeighborsClassifier()
7. Evaluation
# score(X_test, y_test) reports the accuracy:
# it predicts on X_test and compares the predictions against the true answers y_test
# predict() returns the predicted labels themselves
knn.score(X_test, y_test)
0.9555555555555556
knn.predict(X_test)
array([0, 0, 0, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 2, 0, 2, 2, 2, 0, 2, 2,
2, 1, 0, 2, 2, 1, 1, 1, 0, 0, 2, 1, 0, 0, 2, 0, 2, 1, 2, 1, 0, 0,
2])
y_test
array([0, 0, 0, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 2, 0, 1, 2, 2, 0, 2, 2,
2, 1, 0, 2, 2, 1, 1, 1, 0, 0, 2, 1, 0, 0, 1, 0, 2, 1, 2, 1, 0, 0,
2])
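The score above is plain accuracy: the fraction of test predictions that match the true labels. The two arrays above differ in exactly two positions, which reproduces the figure; a short sketch of the same computation:
import numpy as np
pred = knn.predict(X_test)       # predicted labels for the test set
print(np.mean(pred == y_test))   # 0.9555555555555556 (43 of 45 correct)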
Now sweep k from 1 to 100 and record the test-set accuracy for each value:
test_list = []   # test-set accuracy for each k
for i in range(1, 101):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    s = knn.score(X_test, y_test)    # accuracy on the test set
    test_list.append(s)
    print("{} neighbors : {}".format(i, s))
1 neighbors : 0.9555555555555556
2 neighbors : 0.9555555555555556
3 neighbors : 0.9555555555555556
4 neighbors : 0.9555555555555556
5 neighbors : 0.9555555555555556
6 neighbors : 0.9555555555555556
7 neighbors : 0.9555555555555556
8 neighbors : 0.9555555555555556
9 neighbors : 0.9777777777777777
10 neighbors : 0.9555555555555556
11 neighbors : 0.9555555555555556
12 neighbors : 0.9333333333333333
13 neighbors : 0.9777777777777777
14 neighbors : 0.9555555555555556
15 neighbors : 0.9555555555555556
16 neighbors : 0.9777777777777777
17 neighbors : 0.9777777777777777
18 neighbors : 0.9555555555555556
19 neighbors : 0.9777777777777777
20 neighbors : 0.9333333333333333
21 neighbors : 0.9555555555555556
22 neighbors : 0.9333333333333333
23 neighbors : 0.9555555555555556
24 neighbors : 0.9555555555555556
25 neighbors : 0.9555555555555556
26 neighbors : 0.9777777777777777
27 neighbors : 0.9555555555555556
28 neighbors : 0.9333333333333333
29 neighbors : 0.9333333333333333
30 neighbors : 0.9555555555555556
31 neighbors : 0.9555555555555556
32 neighbors : 0.9333333333333333
33 neighbors : 0.9333333333333333
34 neighbors : 0.9111111111111111
35 neighbors : 0.9333333333333333
36 neighbors : 0.9333333333333333
37 neighbors : 0.9333333333333333
38 neighbors : 0.9111111111111111
39 neighbors : 0.9111111111111111
40 neighbors : 0.9111111111111111
41 neighbors : 0.9111111111111111
42 neighbors : 0.9111111111111111
43 neighbors : 0.9111111111111111
44 neighbors : 0.9111111111111111
45 neighbors : 0.9111111111111111
46 neighbors : 0.9111111111111111
47 neighbors : 0.9111111111111111
48 neighbors : 0.9111111111111111
49 neighbors : 0.9111111111111111
50 neighbors : 0.9111111111111111
51 neighbors : 0.9111111111111111
52 neighbors : 0.9111111111111111
53 neighbors : 0.9333333333333333
54 neighbors : 0.9111111111111111
55 neighbors : 0.8666666666666667
56 neighbors : 0.8888888888888888
57 neighbors : 0.9111111111111111
58 neighbors : 0.8888888888888888
59 neighbors : 0.8666666666666667
60 neighbors : 0.9111111111111111
61 neighbors : 0.9111111111111111
62 neighbors : 0.9111111111111111
63 neighbors : 0.9111111111111111
64 neighbors : 0.9111111111111111
65 neighbors : 0.9111111111111111
66 neighbors : 0.9333333333333333
67 neighbors : 0.8222222222222222
68 neighbors : 0.8222222222222222
69 neighbors : 0.7333333333333333
70 neighbors : 0.7111111111111111
71 neighbors : 0.5333333333333333
72 neighbors : 0.3111111111111111
73 neighbors : 0.3111111111111111
74 neighbors : 0.3111111111111111
75 neighbors : 0.3111111111111111
76 neighbors : 0.3111111111111111
77 neighbors : 0.3111111111111111
78 neighbors : 0.3111111111111111
79 neighbors : 0.3111111111111111
80 neighbors : 0.3111111111111111
81 neighbors : 0.3111111111111111
82 neighbors : 0.3111111111111111
83 neighbors : 0.3111111111111111
84 neighbors : 0.3111111111111111
85 neighbors : 0.3111111111111111
86 neighbors : 0.3111111111111111
87 neighbors : 0.3111111111111111
88 neighbors : 0.3111111111111111
89 neighbors : 0.3111111111111111
90 neighbors : 0.3111111111111111
91 neighbors : 0.3111111111111111
92 neighbors : 0.3111111111111111
93 neighbors : 0.3111111111111111
94 neighbors : 0.3111111111111111
95 neighbors : 0.3111111111111111
96 neighbors : 0.3111111111111111
97 neighbors : 0.3111111111111111
98 neighbors : 0.3111111111111111
99 neighbors : 0.3111111111111111
100 neighbors : 0.3111111111111111
Repeat the sweep, this time scoring on the training data:
train_list = []   # training-set accuracy for each k
for i in range(1, 101):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    s = knn.score(X_train, y_train)   # accuracy on the training set
    train_list.append(s)
    print("{} neighbors : {}".format(i, s))
1 neighbors : 1.0
2 neighbors : 0.9619047619047619
3 neighbors : 0.9714285714285714
4 neighbors : 0.9714285714285714
5 neighbors : 0.9809523809523809
6 neighbors : 0.9619047619047619
7 neighbors : 0.9714285714285714
8 neighbors : 0.9619047619047619
9 neighbors : 0.9714285714285714
10 neighbors : 0.9714285714285714
11 neighbors : 0.9714285714285714
12 neighbors : 0.9619047619047619
13 neighbors : 0.9714285714285714
14 neighbors : 0.9714285714285714
15 neighbors : 0.9714285714285714
16 neighbors : 0.9619047619047619
17 neighbors : 0.9714285714285714
18 neighbors : 0.9619047619047619
19 neighbors : 0.9619047619047619
20 neighbors : 0.9619047619047619
21 neighbors : 0.9619047619047619
22 neighbors : 0.9523809523809523
23 neighbors : 0.9523809523809523
24 neighbors : 0.9428571428571428
25 neighbors : 0.9428571428571428
26 neighbors : 0.9428571428571428
27 neighbors : 0.9428571428571428
28 neighbors : 0.9428571428571428
29 neighbors : 0.9428571428571428
30 neighbors : 0.9523809523809523
31 neighbors : 0.9619047619047619
32 neighbors : 0.9523809523809523
33 neighbors : 0.9428571428571428
34 neighbors : 0.9333333333333333
35 neighbors : 0.9333333333333333
36 neighbors : 0.9333333333333333
37 neighbors : 0.9333333333333333
38 neighbors : 0.9333333333333333
39 neighbors : 0.9523809523809523
40 neighbors : 0.9333333333333333
41 neighbors : 0.9428571428571428
42 neighbors : 0.9238095238095239
43 neighbors : 0.9238095238095239
44 neighbors : 0.8952380952380953
45 neighbors : 0.8952380952380953
46 neighbors : 0.8952380952380953
47 neighbors : 0.8857142857142857
48 neighbors : 0.8857142857142857
49 neighbors : 0.8857142857142857
50 neighbors : 0.8857142857142857
51 neighbors : 0.9047619047619048
52 neighbors : 0.9047619047619048
53 neighbors : 0.9333333333333333
54 neighbors : 0.9047619047619048
55 neighbors : 0.9142857142857143
56 neighbors : 0.9142857142857143
57 neighbors : 0.9238095238095239
58 neighbors : 0.9142857142857143
59 neighbors : 0.9238095238095239
60 neighbors : 0.8952380952380953
61 neighbors : 0.8952380952380953
62 neighbors : 0.9047619047619048
63 neighbors : 0.9142857142857143
64 neighbors : 0.8857142857142857
65 neighbors : 0.8857142857142857
66 neighbors : 0.8952380952380953
67 neighbors : 0.8952380952380953
68 neighbors : 0.8571428571428571
69 neighbors : 0.7333333333333333
70 neighbors : 0.6476190476190476
71 neighbors : 0.580952380952381
72 neighbors : 0.34285714285714286
73 neighbors : 0.34285714285714286
74 neighbors : 0.34285714285714286
75 neighbors : 0.34285714285714286
76 neighbors : 0.34285714285714286
77 neighbors : 0.34285714285714286
78 neighbors : 0.34285714285714286
79 neighbors : 0.34285714285714286
80 neighbors : 0.34285714285714286
81 neighbors : 0.34285714285714286
82 neighbors : 0.34285714285714286
83 neighbors : 0.34285714285714286
84 neighbors : 0.34285714285714286
85 neighbors : 0.34285714285714286
86 neighbors : 0.34285714285714286
87 neighbors : 0.34285714285714286
88 neighbors : 0.34285714285714286
89 neighbors : 0.34285714285714286
90 neighbors : 0.34285714285714286
91 neighbors : 0.34285714285714286
92 neighbors : 0.34285714285714286
93 neighbors : 0.34285714285714286
94 neighbors : 0.34285714285714286
95 neighbors : 0.34285714285714286
96 neighbors : 0.34285714285714286
97 neighbors : 0.34285714285714286
98 neighbors : 0.34285714285714286
99 neighbors : 0.34285714285714286
100 neighbors : 0.34285714285714286
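From the collected scores, the best k on the test set can be read off programmatically (a sketch; note that choosing k against the test set is itself a form of tuning on the test data, so a separate validation set or cross-validation would be more rigorous):
import numpy as np
best_k = int(np.argmax(test_list)) + 1   # +1 because k started at 1
print(best_k, test_list[best_k - 1])     # 9 0.9777777777777777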
Data Visualization
plt.figure(figsize=(25, 5))   # size of the figure to draw
plt.plot(range(1, 101),       # x values: k from 1 to 100
         test_list,
         c='red', label='test accuracy')
plt.plot(range(1, 101),
         train_list,
         c='blue', label='train accuracy')
plt.xticks(range(1, 101))
plt.legend()
plt.grid()
plt.show()
"스마트인재개발원에서 진행된 수업내용입니다"