
(스마트인재개발원) Machine Learning KNN Practice

앨런튜링_ 2021. 6. 4. 19:21

Goals
- Use the iris (flower) dataset
- Features: petal length, petal width, sepal length, sepal width
- Tune the number of neighbors in the KNN model (hyperparameter tuning)

2. Data Collection
- Use the iris dataset provided by sklearn

 

(1) kNN Model: Definition and Concept

kNN is short for the k-Nearest Neighbors model.

Instead of being trained ahead of time, a kNN model performs classification only when a prediction for a new data point is requested (lazy learning). The basic idea is to decide which group the new point belongs to by looking at the groups of the training points closest to it.

[Figure: a test point and its 1, 3, and 9 nearest neighbors among circle-class and triangle-class training points.]

When k=1, the single training point closest to the test point belongs to the circle class, so the test point is assigned to the circle class. When k=3, two of the three nearest neighbors (a majority) belong to the triangle class, so the test point is assigned to the triangle class. When k=9, five of the nine nearest neighbors belong to the triangle class, so the test point is again assigned to the triangle class.
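A minimal sketch of this majority-vote rule (the helper function and toy points below are made up for illustration; they are not part of the lesson):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    # distances from the new point to every training point (Euclidean)
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # labels of the k closest training points
    nearest = y_train[np.argsort(dists)[:k]]
    # majority vote among those labels
    return Counter(nearest).most_common(1)[0][0]

# toy 2D data: class 0 = "circle", class 1 = "triangle"
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.0], [3.2, 2.8], [2.1, 2.0]])
y_toy = np.array([0, 0, 1, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([2.0, 1.9]), k=1))  # 1 : the single nearest point decides
print(knn_predict(X_toy, y_toy, np.array([2.0, 1.9]), k=3))  # 0 : the majority of 3 decides

Note that the prediction can flip as k changes, which is exactly why k is worth tuning.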

 

1. Euclidean Distance

The usual straight-line distance between two points.

2. Manhattan Distance

Not the straight-line distance between two points, but the distance traveled along the axes (along X, then along Y).
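For example, both distances between two made-up points can be computed with numpy:

import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

print(np.sqrt(((p - q) ** 2).sum()))  # Euclidean: sqrt(3^2 + 4^2) = 5.0
print(np.abs(p - q).sum())            # Manhattan: 3 + 4 = 7.0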

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

Data Loading

iris_data = load_iris()
iris_data

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        ...,
        [6.5, 3. , 5.2, 2. ],
        [6.2, 3.4, 5.4, 2.3],
        [5.9, 3. , 5.1, 1.8]]),
 'target': array([0, 0, 0, ..., 2, 2, 2]),
 'frame': None,
 'target_names': array(['setosa', 'versicolor', 'virginica'], dtype='<U10'),
 'DESCR': '.. _iris_dataset:\n\nIris plants dataset\n--------------------\n...',
 'feature_names': ['sepal length (cm)',
  'sepal width (cm)',
  'petal length (cm)',
  'petal width (cm)'],
 'filename': 'C:\\Users\\SM2130\\anaconda3\\lib\\site-packages\\sklearn\\datasets\\data\\iris.csv'}
iris_data.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
iris_data['data']
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.2],
       [5. , 3.2, 1.2, 0.2],
       [5.5, 3.5, 1.3, 0.2],
       [4.9, 3.6, 1.4, 0.1],
       [4.4, 3. , 1.3, 0.2],
       [5.1, 3.4, 1.5, 0.2],
       [5. , 3.5, 1.3, 0.3],
       [4.5, 2.3, 1.3, 0.3],
       [4.4, 3.2, 1.3, 0.2],
       [5. , 3.5, 1.6, 0.6],
       [5.1, 3.8, 1.9, 0.4],
       [4.8, 3. , 1.4, 0.3],
       [5.1, 3.8, 1.6, 0.2],
       [4.6, 3.2, 1.4, 0.2],
       [5.3, 3.7, 1.5, 0.2],
       [5. , 3.3, 1.4, 0.2],
       [7. , 3.2, 4.7, 1.4],
       [6.4, 3.2, 4.5, 1.5],
       [6.9, 3.1, 4.9, 1.5],
       [5.5, 2.3, 4. , 1.3],
       [6.5, 2.8, 4.6, 1.5],
       [5.7, 2.8, 4.5, 1.3],
       [6.3, 3.3, 4.7, 1.6],
       [4.9, 2.4, 3.3, 1. ],
       [6.6, 2.9, 4.6, 1.3],
       [5.2, 2.7, 3.9, 1.4],
       [5. , 2. , 3.5, 1. ],
       [5.9, 3. , 4.2, 1.5],
       [6. , 2.2, 4. , 1. ],
       [6.1, 2.9, 4.7, 1.4],
       [5.6, 2.9, 3.6, 1.3],
       [6.7, 3.1, 4.4, 1.4],
       [5.6, 3. , 4.5, 1.5],
       [5.8, 2.7, 4.1, 1. ],
       [6.2, 2.2, 4.5, 1.5],
       [5.6, 2.5, 3.9, 1.1],
       [5.9, 3.2, 4.8, 1.8],
       [6.1, 2.8, 4. , 1.3],
       [6.3, 2.5, 4.9, 1.5],
       [6.1, 2.8, 4.7, 1.2],
       [6.4, 2.9, 4.3, 1.3],
       [6.6, 3. , 4.4, 1.4],
       [6.8, 2.8, 4.8, 1.4],
       [6.7, 3. , 5. , 1.7],
       [6. , 2.9, 4.5, 1.5],
       [5.7, 2.6, 3.5, 1. ],
       [5.5, 2.4, 3.8, 1.1],
       [5.5, 2.4, 3.7, 1. ],
       [5.8, 2.7, 3.9, 1.2],
       [6. , 2.7, 5.1, 1.6],
       [5.4, 3. , 4.5, 1.5],
       [6. , 3.4, 4.5, 1.6],
       [6.7, 3.1, 4.7, 1.5],
       [6.3, 2.3, 4.4, 1.3],
       [5.6, 3. , 4.1, 1.3],
       [5.5, 2.5, 4. , 1.3],
       [5.5, 2.6, 4.4, 1.2],
       [6.1, 3. , 4.6, 1.4],
       [5.8, 2.6, 4. , 1.2],
       [5. , 2.3, 3.3, 1. ],
       [5.6, 2.7, 4.2, 1.3],
       [5.7, 3. , 4.2, 1.2],
       [5.7, 2.9, 4.2, 1.3],
       [6.2, 2.9, 4.3, 1.3],
       [5.1, 2.5, 3. , 1.1],
       [5.7, 2.8, 4.1, 1.3],
       [6.3, 3.3, 6. , 2.5],
       [5.8, 2.7, 5.1, 1.9],
       [7.1, 3. , 5.9, 2.1],
       [6.3, 2.9, 5.6, 1.8],
       [6.5, 3. , 5.8, 2.2],
       [7.6, 3. , 6.6, 2.1],
       [4.9, 2.5, 4.5, 1.7],
       [7.3, 2.9, 6.3, 1.8],
       [6.7, 2.5, 5.8, 1.8],
       [7.2, 3.6, 6.1, 2.5],
       [6.5, 3.2, 5.1, 2. ],
       [6.4, 2.7, 5.3, 1.9],
       [6.8, 3. , 5.5, 2.1],
       [5.7, 2.5, 5. , 2. ],
       [5.8, 2.8, 5.1, 2.4],
       [6.4, 3.2, 5.3, 2.3],
       [6.5, 3. , 5.5, 1.8],
       [7.7, 3.8, 6.7, 2.2],
       [7.7, 2.6, 6.9, 2.3],
       [6. , 2.2, 5. , 1.5],
       [6.9, 3.2, 5.7, 2.3],
       [5.6, 2.8, 4.9, 2. ],
       [7.7, 2.8, 6.7, 2. ],
       [6.3, 2.7, 4.9, 1.8],
       [6.7, 3.3, 5.7, 2.1],
       [7.2, 3.2, 6. , 1.8],
       [6.2, 2.8, 4.8, 1.8],
       [6.1, 3. , 4.9, 1.8],
       [6.4, 2.8, 5.6, 2.1],
       [7.2, 3. , 5.8, 1.6],
       [7.4, 2.8, 6.1, 1.9],
       [7.9, 3.8, 6.4, 2. ],
       [6.4, 2.8, 5.6, 2.2],
       [6.3, 2.8, 5.1, 1.5],
       [6.1, 2.6, 5.6, 1.4],
       [7.7, 3. , 6.1, 2.3],
       [6.3, 3.4, 5.6, 2.4],
       [6.4, 3.1, 5.5, 1.8],
       [6. , 3. , 4.8, 1.8],
       [6.9, 3.1, 5.4, 2.1],
       [6.7, 3.1, 5.6, 2.4],
       [6.9, 3.1, 5.1, 2.3],
       [5.8, 2.7, 5.1, 1.9],
       [6.8, 3.2, 5.9, 2.3],
       [6.7, 3.3, 5.7, 2.5],
       [6.7, 3. , 5.2, 2.3],
       [6.3, 2.5, 5. , 1.9],
       [6.5, 3. , 5.2, 2. ],
       [6.2, 3.4, 5.4, 2.3],
       [5.9, 3. , 5.1, 1.8]])

Each sample has four features.

pd.DataFrame(iris_data.data, columns=iris_data.feature_names)

The data needs preprocessing: the samples are ordered by class, so they must be shuffled before splitting; otherwise the test set would not contain a proper mix of all three classes (see the class-count check after the split below).

iris_data['target']
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

# 0 : setosa
# 1 : versicolor
# 2 : virginica
# Machine learning models work on numbers, so the classes are encoded as integers.

iris_data['target_names']
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

iris_data['feature_names']
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
 
print(iris_data['DESCR'])
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
iris_data['filename']
'C:\\Users\\SM2130\\anaconda3\\lib\\site-packages\\sklearn\\datasets\\data\\iris.csv'

3. Data Preprocessing

# Construct the dataset
# X = the features (the "questions"), y = the labels (the "answers")

iris_df = pd.DataFrame(iris_data['data'], columns=iris_data['feature_names'])
X = iris_df               # features: 2-dimensional (150 rows x 4 columns)
y = iris_data['target']   # labels: 1-dimensional

# Split into training and test sets at a 70 : 30 ratio,
# shuffling the data in the process.
# train_test_split returns X_train, X_test, y_train, y_test.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=3, test_size=0.3)
y_train
array([1, 0, 1, 2, 1, 0, 0, 2, 1, 1, 0, 2, 0, 2, 1, 0, 0, 2, 1, 0, 0, 1,
       2, 2, 0, 2, 1, 0, 0, 2, 2, 2, 1, 1, 1, 0, 0, 2, 2, 1, 2, 1, 2, 0,
       2, 0, 1, 1, 2, 2, 0, 1, 0, 1, 1, 1, 0, 2, 0, 2, 1, 2, 1, 2, 1, 0,
       2, 1, 2, 1, 0, 1, 2, 0, 1, 0, 0, 0, 1, 2, 0, 0, 2, 0, 1, 2, 1, 2,
       2, 1, 1, 2, 1, 0, 1, 1, 0, 1, 2, 2, 2, 0, 0, 2, 2])
X_train
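As a quick sanity check (not in the original notebook), we can count how many samples of each class ended up in each split; passing stratify=y to train_test_split would make the class ratios exactly equal:

import numpy as np

print(np.bincount(y_train))  # class counts in the shuffled training split

# With stratification, the 50/50/50 class ratio is preserved exactly:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3,
                                          test_size=0.3, stratify=y)
print(np.bincount(y_tr), np.bincount(y_te))  # [35 35 35] [15 15 15]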

Model Selection and Hyperparameter Tuning

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
KNeighborsClassifier()
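Called with no arguments, KNeighborsClassifier falls back on its defaults: n_neighbors=5, uniform vote weights, and Euclidean distance (metric='minkowski' with p=2). Written out explicitly, the model above is equivalent to:

knn = KNeighborsClassifier(n_neighbors=5, weights='uniform', metric='minkowski', p=2)
knn.fit(X_train, y_train)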

7. Evaluation

# score(X_test, y_test) reports the prediction accuracy:
# it predicts labels for X_test and compares them against the true labels y_test.
# predict() returns the predicted labels themselves.

knn.score(X_test, y_test)
0.9555555555555556
knn.predict(X_test)
array([0, 0, 0, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 2, 0, 2, 2, 2, 0, 2, 2,
       2, 1, 0, 2, 2, 1, 1, 1, 0, 0, 2, 1, 0, 0, 2, 0, 2, 1, 2, 1, 0, 0,
       2])
y_test
array([0, 0, 0, 0, 0, 2, 1, 0, 2, 1, 1, 0, 1, 1, 2, 0, 1, 2, 2, 0, 2, 2,
       2, 1, 0, 2, 2, 1, 1, 1, 0, 0, 2, 1, 0, 0, 1, 0, 2, 1, 2, 1, 0, 0,
       2])
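The predictions differ from y_test in exactly two positions, which matches the score of 43/45 ≈ 0.9556. A quick way to verify this (an extra check, not in the original notebook):

import numpy as np

pred = knn.predict(X_test)
print((pred != y_test).sum())   # 2 misclassified samples
print((pred == y_test).mean())  # 0.9555..., the same value knn.score returns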
test_list = []   # accuracy on the test set for each k
for i in range(1,101):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    s = knn.score(X_test, y_test)
    test_list.append(s)
    print("{} neighbors : {}".format(i, s))
1 neighbors : 0.9555555555555556
2 neighbors : 0.9555555555555556
3 neighbors : 0.9555555555555556
4 neighbors : 0.9555555555555556
5 neighbors : 0.9555555555555556
6 neighbors : 0.9555555555555556
7 neighbors : 0.9555555555555556
8 neighbors : 0.9555555555555556
9 neighbors : 0.9777777777777777
10 neighbors : 0.9555555555555556
11 neighbors : 0.9555555555555556
12 neighbors : 0.9333333333333333
13 neighbors : 0.9777777777777777
14 neighbors : 0.9555555555555556
15 neighbors : 0.9555555555555556
16 neighbors : 0.9777777777777777
17 neighbors : 0.9777777777777777
18 neighbors : 0.9555555555555556
19 neighbors : 0.9777777777777777
20 neighbors : 0.9333333333333333
21 neighbors : 0.9555555555555556
22 neighbors : 0.9333333333333333
23 neighbors : 0.9555555555555556
24 neighbors : 0.9555555555555556
25 neighbors : 0.9555555555555556
26 neighbors : 0.9777777777777777
27 neighbors : 0.9555555555555556
28 neighbors : 0.9333333333333333
29 neighbors : 0.9333333333333333
30 neighbors : 0.9555555555555556
31 neighbors : 0.9555555555555556
32 neighbors : 0.9333333333333333
33 neighbors : 0.9333333333333333
34 neighbors : 0.9111111111111111
35 neighbors : 0.9333333333333333
36 neighbors : 0.9333333333333333
37 neighbors : 0.9333333333333333
38 neighbors : 0.9111111111111111
39 neighbors : 0.9111111111111111
40 neighbors : 0.9111111111111111
41 neighbors : 0.9111111111111111
42 neighbors : 0.9111111111111111
43 neighbors : 0.9111111111111111
44 neighbors : 0.9111111111111111
45 neighbors : 0.9111111111111111
46 neighbors : 0.9111111111111111
47 neighbors : 0.9111111111111111
48 neighbors : 0.9111111111111111
49 neighbors : 0.9111111111111111
50 neighbors : 0.9111111111111111
51 neighbors : 0.9111111111111111
52 neighbors : 0.9111111111111111
53 neighbors : 0.9333333333333333
54 neighbors : 0.9111111111111111
55 neighbors : 0.8666666666666667
56 neighbors : 0.8888888888888888
57 neighbors : 0.9111111111111111
58 neighbors : 0.8888888888888888
59 neighbors : 0.8666666666666667
60 neighbors : 0.9111111111111111
61 neighbors : 0.9111111111111111
62 neighbors : 0.9111111111111111
63 neighbors : 0.9111111111111111
64 neighbors : 0.9111111111111111
65 neighbors : 0.9111111111111111
66 neighbors : 0.9333333333333333
67 neighbors : 0.8222222222222222
68 neighbors : 0.8222222222222222
69 neighbors : 0.7333333333333333
70 neighbors : 0.7111111111111111
71 neighbors : 0.5333333333333333
72 neighbors : 0.3111111111111111
73 neighbors : 0.3111111111111111
74 neighbors : 0.3111111111111111
75 neighbors : 0.3111111111111111
76 neighbors : 0.3111111111111111
77 neighbors : 0.3111111111111111
78 neighbors : 0.3111111111111111
79 neighbors : 0.3111111111111111
80 neighbors : 0.3111111111111111
81 neighbors : 0.3111111111111111
82 neighbors : 0.3111111111111111
83 neighbors : 0.3111111111111111
84 neighbors : 0.3111111111111111
85 neighbors : 0.3111111111111111
86 neighbors : 0.3111111111111111
87 neighbors : 0.3111111111111111
88 neighbors : 0.3111111111111111
89 neighbors : 0.3111111111111111
90 neighbors : 0.3111111111111111
91 neighbors : 0.3111111111111111
92 neighbors : 0.3111111111111111
93 neighbors : 0.3111111111111111
94 neighbors : 0.3111111111111111
95 neighbors : 0.3111111111111111
96 neighbors : 0.3111111111111111
97 neighbors : 0.3111111111111111
98 neighbors : 0.3111111111111111
99 neighbors : 0.3111111111111111
100 neighbors : 0.3111111111111111
train_list = []  # accuracy on the training set for each k
for i in range(1,101):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    s = knn.score(X_train, y_train)
    train_list.append(s)
    print("{} neighbors : {}".format(i, s))
1 neighbors : 1.0
2 neighbors : 0.9619047619047619
3 neighbors : 0.9714285714285714
4 neighbors : 0.9714285714285714
5 neighbors : 0.9809523809523809
6 neighbors : 0.9619047619047619
7 neighbors : 0.9714285714285714
8 neighbors : 0.9619047619047619
9 neighbors : 0.9714285714285714
10 neighbors : 0.9714285714285714
11 neighbors : 0.9714285714285714
12 neighbors : 0.9619047619047619
13 neighbors : 0.9714285714285714
14 neighbors : 0.9714285714285714
15 neighbors : 0.9714285714285714
16 neighbors : 0.9619047619047619
17 neighbors : 0.9714285714285714
18 neighbors : 0.9619047619047619
19 neighbors : 0.9619047619047619
20 neighbors : 0.9619047619047619
21 neighbors : 0.9619047619047619
22 neighbors : 0.9523809523809523
23 neighbors : 0.9523809523809523
24 neighbors : 0.9428571428571428
25 neighbors : 0.9428571428571428
26 neighbors : 0.9428571428571428
27 neighbors : 0.9428571428571428
28 neighbors : 0.9428571428571428
29 neighbors : 0.9428571428571428
30 neighbors : 0.9523809523809523
31 neighbors : 0.9619047619047619
32 neighbors : 0.9523809523809523
33 neighbors : 0.9428571428571428
34 neighbors : 0.9333333333333333
35 neighbors : 0.9333333333333333
36 neighbors : 0.9333333333333333
37 neighbors : 0.9333333333333333
38 neighbors : 0.9333333333333333
39 neighbors : 0.9523809523809523
40 neighbors : 0.9333333333333333
41 neighbors : 0.9428571428571428
42 neighbors : 0.9238095238095239
43 neighbors : 0.9238095238095239
44 neighbors : 0.8952380952380953
45 neighbors : 0.8952380952380953
46 neighbors : 0.8952380952380953
47 neighbors : 0.8857142857142857
48 neighbors : 0.8857142857142857
49 neighbors : 0.8857142857142857
50 neighbors : 0.8857142857142857
51 neighbors : 0.9047619047619048
52 neighbors : 0.9047619047619048
53 neighbors : 0.9333333333333333
54 neighbors : 0.9047619047619048
55 neighbors : 0.9142857142857143
56 neighbors : 0.9142857142857143
57 neighbors : 0.9238095238095239
58 neighbors : 0.9142857142857143
59 neighbors : 0.9238095238095239
60 neighbors : 0.8952380952380953
61 neighbors : 0.8952380952380953
62 neighbors : 0.9047619047619048
63 neighbors : 0.9142857142857143
64 neighbors : 0.8857142857142857
65 neighbors : 0.8857142857142857
66 neighbors : 0.8952380952380953
67 neighbors : 0.8952380952380953
68 neighbors : 0.8571428571428571
69 neighbors : 0.7333333333333333
70 neighbors : 0.6476190476190476
71 neighbors : 0.580952380952381
72 neighbors : 0.34285714285714286
73 neighbors : 0.34285714285714286
74 neighbors : 0.34285714285714286
75 neighbors : 0.34285714285714286
76 neighbors : 0.34285714285714286
77 neighbors : 0.34285714285714286
78 neighbors : 0.34285714285714286
79 neighbors : 0.34285714285714286
80 neighbors : 0.34285714285714286
81 neighbors : 0.34285714285714286
82 neighbors : 0.34285714285714286
83 neighbors : 0.34285714285714286
84 neighbors : 0.34285714285714286
85 neighbors : 0.34285714285714286
86 neighbors : 0.34285714285714286
87 neighbors : 0.34285714285714286
88 neighbors : 0.34285714285714286
89 neighbors : 0.34285714285714286
90 neighbors : 0.34285714285714286
91 neighbors : 0.34285714285714286
92 neighbors : 0.34285714285714286
93 neighbors : 0.34285714285714286
94 neighbors : 0.34285714285714286
95 neighbors : 0.34285714285714286
96 neighbors : 0.34285714285714286
97 neighbors : 0.34285714285714286
98 neighbors : 0.34285714285714286
99 neighbors : 0.34285714285714286
100 neighbors : 0.34285714285714286
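The loops above rate each k on a single fixed split, so the best-looking k can depend on how the data happened to be divided. A common, more robust alternative (not covered in this lesson) is a cross-validated grid search:

from sklearn.model_selection import GridSearchCV

# 5-fold cross-validated search over k = 1..100, using only the training set
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': list(range(1, 101))},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)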

Data Visualization

plt.figure(figsize=(25,5))  # size of the figure to draw
plt.plot(range(1,101),      # x values are k = 1..100
         train_list,
         c='red', label='train accuracy')
plt.plot(range(1,101),
         test_list,
         c='blue', label='test accuracy')
plt.xticks(range(1,101))
plt.legend()
plt.grid()

plt.show()

[Plot: training accuracy (red) and test accuracy (blue) versus the number of neighbors k.]

At k=1 the model memorizes the training set (training accuracy 1.0), a classic sign of overfitting. As k grows, both curves drift downward, and past roughly k=70 they collapse to the majority-class rate (about 0.34 on the training set and 0.31 on the test set): with only 105 training samples, a neighborhood that large covers most of the data, and every prediction degenerates into a global majority vote.

"스마트인재개발원에서 진행된 수업내용입니다"

https://www.smhrd.or.kr/
