ラビットチャレンジ: Phase.2 機械学習

本ページはラビットチャレンジの、 Phase.2 機械学習のレポート提出を兼ねた受講記録です。提出指示に対して、下記の方針でまとめました

動画講義の要点まとめ
- 自分がその手法を思い出して実装するのに必要な最低限の情報 (モデル定義・評価関数・解放の着想があれば十分かなぁ.. )
- 講義で口頭補足されたトピックなどあれば。

実装演習(ハンズオン実行)
- 基本的には scikit-learn のノートの内容に準じる(より実践的?っぽいので)
  - 対応する numpy実装を見て、sikit-learn側に応用できそうなことがあればやってみる (コアアルゴリズム部分を numpy 実装に置き換えてみるとか)
  - その他パラメータのチューニングなど。その他、何か追加でやりたくなったら、適当に追加。
- ノートブックを jupyter nbconvert で markdown に変換しその抜粋を掲載
  - markdown /コード/出力のそれぞれに関して、blog上での表示を考慮してあとから微調整を行う
  - 手法に関する一般的な説明・数式展開などは、要点のまとめ側に移動する場合も

目次
各論前の講義内容　
線形回帰モデル
1. 動画講義メモ(要点のまとめ)
2. ハンズオン: 線形回帰モデル-Boston Hausing Data-
非線形回帰
1. 動画講義メモ(要点のまとめ)
2. ハンズオン
ロジスティック回帰
1. 動画講義メモ(要点のまとめ)
2. ハンズオン
主成分分析 (PCA)
1. 動画講義メモ(要点のまとめ)
2. ハンズオン
k-nn/means 法 (レポート提出課題の "アルゴリズム" の単元に該当?)
1. 動画講義メモ(要点のまとめ)
2. ハンズオン
  1. k-nn で分類
  2. k-means クラスタリング
サポートベクターマシン(SVM)
取り組み概要 & 感想
1. 取り組みの記録
2. 感想ほか　
計画の見直し (2021/05/06 時点)

各論前の講義内容　

プロローグ ( なぜ非ディープラーニングの機械学習の勉強をするのか? )
- 数学的な背景を元にした機械学習の基礎を抑えていないエンジニアは、フレームワークを使って機械学習モデルを組める「程度」の人材にしかならない

機械学習のモデリングプロセスをしっかりと抑えることが重要
1. 問題設定
  - 最終的な使われ方をイメージしよう
  - 必ずしも機械学習を使う必要は無く、ルールベースで解けるならそっちの方が楽
    - 技術的負債になりやすい
      - 自分が開発した技術を運用チームに移管するとして、移管先に専門知識を持った人間がいなければSLA(Service Level Agreement)を担保できない
    - テストしにくい・考慮すべきことが多い
    - データ集めるの大変
2. データ選定
  - GIGO(garbage in garbage out) = ゴミを突っ込んだらゴミが出てくる
    - ex. 集めたデータに(意図しない)バイアスがかかっているなど。
3. データの前処理
  - 開発時間の殆どは、このプロセスに費やされるといっても過言でない
  - 実務経験がモノをいうところ。練習のチャンスがない人は、kaggleなどやるといい
4. モデルの選定
  - ディープラーニング(で扱われるモデル)も、機械学習モデルの一部に過ぎず、このプロセスの具体的な作業が異なるだけ
5. モデルパラメータの決定
6. モデルの評価

ルールベースモデルと機械学習モデル
- 技術者として、機械学習とは?と聞かれたら何らかの形で答えられるようになっておくこと
  - 講師の体感では、機械学習ってなに? と聞かれるよりも、人工知能って何? と聞かれることの方が多いらしい

線形回帰モデル

動画講義メモ(要点のまとめ)

回帰問題 = 入力から出力を予測する問題。線形回帰は、回帰式が線形。
- 線形回帰モデルの形 (線形とは?)
  - ざっくり説明 = 入出力が比例関係である (直線・平面・超平面)
  - $y = w_0 + \sum_{i=1}^{n-1} w_i x_i$
    - $= \sum_{i=0}^{n-1} w_i x_i$
    - $= \bm{w}^T \bm{x}$ ただし $\bm{w}^T = (w_0, w_1, ... , w_{n-1})$ , $\bm{x}^T = (1, x_1, ... , x_{n-1})$
  - 記法としてはベクトルが便利、訳わからなったら sigma 記法にする
    - このベクトル/行列と要素ごとの記法の行き来ができることがポイント
    - この後の記述とあわせて係数は $w$ にした
- 出力について
  - 連続値とスライドに書かれているが、離散値でもいい
    - 例: 諸条件から、競馬の順位を予想する
      - cf. "vapnik の原理" には反する
        
        順位 = 大小関係がわかるだけでいいのに、もっと難しい回帰問題を解くべきではないという話。
        
        論文
  - スカラーと書いてあるが、多次元のベクトルにしてもいい(マルチタスク学習など)
  - データ分割・学習
    - 未知のデータに対する予測精度をあげたいので、テスト用のデータと、学習用のデータは分ける
    - 最尤法による解とMSEの最小化は等価と書いてあったが、それは誤差を確率変数としたとき、これが正規分布に従う場合では? (他の分布なら、別の解でてくる場合もあるよね?)
- 誤差について
  - 必ずしも偶発的な誤差だけではなく、モデル化誤差もある
    - y = 来店者数、x = 気温で予測ができる? いや、曜日にも影響する。このとき、曜日の項の影響は誤差になる
- 未知数の数と方程式の数について
  - 今、重みw は m+1 次元なので、基本的には、 m+1 以上の方程式がないと厳しい
    - データの方が少ないケースを扱う手法もあるが、それは advanced
    - DLの場合パラメータがたくさんあるので、データが必要
- パラメータ推定方法 : 最小二乗法
  - $\mathrm{MSE_{train}} = \frac{1}{n} \sum_{i=1}^{n_{train}} (\hat{y_i}^{(train)}-y_i^{(train)})^2 = \sum \epsilon^2 := J(w)$
    - MSE = mean square error
  - これをパラメータについて微分して、勾配が0になる点を求める
    - $\bm{\hat{w}}= (X^T X) ^{-1} X^T \bm{y}$ : (train) は省略
      - $\bm{y}$ に一般化逆行列をかけた形
    - 必ずしも誤差関数として二乗誤差は最適ではないことに注意
      - 基本的にハズレ値の存在にかなり弱くなる
      - Huber loss, Tukey loss とか　を使うとハズレ値に強くなる
  - よく使うベクトルの微分の形
    - $\frac{\partial}{\partial \bm{w}} \bm{w} ^T \bm{x} = x$
    - $\frac{\partial}{\partial \bm{w}} \bm{w}^T A \bm{w} = (A+A^T) \bm{x} = 2A\bm{x}$ (Aが対称行列のとき)
    - 参考図書: Matrix Cook Book

ハンズオンに関するコメント
- 全部の説明変数を使う必要はないし、使うべきではないことがある
- 12番目の "アフリカ系アメリカ人居住者の割合" など
- 現実のデータセットを扱う上では、よくデータをみる必要がある
  - 得られたデータの中に使用すべきでないもの(ハズレ値など) があるかもしれない
  - 要約統計量などを適宜活用
  - pandas で 12行目まで観察 ( 頭の数行みて、怪しい所があったから。最初の5行 CHAS全部0じゃね?)
- エイヤで学習すると、マイナスの価格が出てしまった件
  - こういう現象が出た時点でおかしい、と気付かなければいけない
  - 何がダメだったんだろう -> 1部屋のケースなどがデータに入ってないんだろうな
    - 外挿問題にDLは基本的には弱い 5~10部屋の範囲のデータから学習したとき、11部屋, 2部屋の時の予測は上手く行かない
- scikit-learn や tensorflow で動かすのは小学生でもできる。なんとなく動かすんじゃなくて、数式に対する理解を持つこと
- 多重回帰の場合も試しに実行してみてね
  - 学習されたモデルを評価する
  - 部屋を増やしてみる(価格増えるはず)
  - 犯罪率増やしてみる(価格下がるはず)

ハンズオン: 線形回帰モデル-Boston Hausing Data-

必要モジュールとデータのインポート

#from モジュール名 import クラス名（もしくは関数名や変数名）
from sklearn.datasets import load_boston
from pandas import DataFrame
import numpy as np

# ボストンデータを"boston"というインスタンスにインポート
boston = load_boston()

# Bunch オブジェクト
#  https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html#sklearn.utils.Bunch
print(type(boston))
print(boston.keys())

<class 'sklearn.utils.Bunch'> dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

#DESCR変数の中身を確認
print(boston['DESCR'])

.. _boston_dataset:
    
    Boston house prices dataset
    ---------------------------
    
    **Data Set Characteristics:**  
    
        :Number of Instances: 506 
    
        :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
    
        :Attribute Information (in order):
            - CRIM     per capita crime rate by town
            - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
            - INDUS    proportion of non-retail business acres per town
            - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
            - NOX      nitric oxides concentration (parts per 10 million)
            - RM       average number of rooms per dwelling
            - AGE      proportion of owner-occupied units built prior to 1940
            - DIS      weighted distances to five Boston employment centres
            - RAD      index of accessibility to radial highways
            - TAX      full-value property-tax rate per $10,000
            - PTRATIO  pupil-teacher ratio by town
            - B        1000(Bk - 0.63)^2 where Bk is the proportion of black people by town
            - LSTAT    % lower status of the population
            - MEDV     Median value of owner-occupied homes in $1000's
    
        :Missing Attribute Values: None
    
        :Creator: Harrison, D. and Rubinfeld, D.L.
    
    This is a copy of UCI ML housing dataset.
    https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
    
    
    This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
    
    The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
    prices and the demand for clean air', J. Environ. Economics & Management,
    vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
    ...', Wiley, 1980.   N.B. Various transformations are used in the table on
    pages 244-261 of the latter.
    
    The Boston house-price data has been used in many machine learning papers that address regression
    problems.   
         
    .. topic:: References
    
       - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
       - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

#feature_names変数の中身を確認
#カラム名
print(boston['feature_names'])

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']

#data変数(説明変数)の中身を確認
print(boston['data'])

[[6.3200e-03 1.8000e+01 2.3100e+00 ... 1.5300e+01 3.9690e+02 4.9800e+00] [2.7310e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9690e+02 9.1400e+00] [2.7290e-02 0.0000e+00 7.0700e+00 ... 1.7800e+01 3.9283e+02 4.0300e+00] ... [6.0760e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 5.6400e+00] [1.0959e-01 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9345e+02 6.4800e+00] [4.7410e-02 0.0000e+00 1.1930e+01 ... 2.1000e+01 3.9690e+02 7.8800e+00]]

#target変数(目的変数)の中身を確認
print(boston['target'])

[24. 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 15. 18.9 21.7 20.4 18.2 19.9 23.1 17.5 20.2 18.2 13.6 19.6 15.2 14.5 15.6 13.9 16.6 14.8 18.4 21. 12.7 14.5 13.2 13.1 13.5 18.9 20. 21. 24.7 30.8 34.9 26.6 25.3 24.7 21.2 19.3 20. 16.6 14.4 19.4 19.7 20.5 25. 23.4 18.9 35.4 24.7 31.6 23.3 19.6 18.7 16. 22.2 25. 33. 23.5 19.4 22. 17.4 20.9 24.2 21.7 22.8 23.4 24.1 21.4 20. 20.8 21.2 20.3 28. 23.9 24.8 22.9 23.9 26.6 22.5 22.2 23.6 28.7 22.6 22. 22.9 25. 20.6 28.4 21.4 38.7 43.8 33.2 27.5 26.5 18.6 19.3 20.1 19.5 19.5 20.4 19.8 19.4 21.7 22.8 18.8 18.7 18.5 18.3 21.2 19.2 20.4 19.3 22. 20.3 20.5 17.3 18.8 21.4 15.7 16.2 18. 14.3 19.2 19.6 23. 18.4 15.6 18.1 17.4 17.1 13.3 17.8 14. 14.4 13.4 15.6 11.8 13.8 15.6 14.6 17.8 15.4 21.5 19.6 15.3 19.4 17. 15.6 13.1 41.3 24.3 23.3 27. 50. 50. 50. 22.7 25. 50. 23.8 23.8 22.3 17.4 19.1 23.1 23.6 22.6 29.4 23.2 24.6 29.9 37.2 39.8 36.2 37.9 32.5 26.4 29.6 50. 32. 29.8 34.9 37. 30.5 36.4 31.1 29.1 50. 33.3 30.3 34.6 34.9 32.9 24.1 42.3 48.5 50. 22.6 24.4 22.5 24.4 20. 21.7 19.3 22.4 28.1 23.7 25. 23.3 28.7 21.5 23. 26.7 21.7 27.5 30.1 44.8 50. 37.6 31.6 46.7 31.5 24.3 31.7 41.7 48.3 29. 24. 25.1 31.5 23.7 23.3 22. 20.1 22.2 23.7 17.6 18.5 24.3 20.5 24.5 26.2 24.4 24.8 29.6 42.8 21.9 20.9 44. 50. 36. 30.1 33.8 43.1 48.8 31. 36.5 22.8 30.7 50. 43.5 20.7 21.1 25.2 24.4 35.2 32.4 32. 33.2 33.1 29.1 35.1 45.4 35.4 46. 50. 32.2 22. 20.1 23.2 22.3 24.8 28.5 37.3 27.9 23.9 21.7 28.6 27.1 20.3 22.5 29. 24.8 22. 26.4 33.1 36.1 28.4 33.4 28.2 22.8 20.3 16.1 22.1 19.4 21.6 23.8 16.2 17.8 19.8 23.1 21. 23.8 23.1 20.4 18.5 25. 24.6 23. 22.2 19.3 22.6 19.8 17.1 19.4 22.2 20.7 21.1 19.5 18.5 20.6 19. 18.7 32.7 16.5 23.9 31.2 17.5 17.2 23.1 24.5 26.6 22.9 24.1 18.6 30.1 18.2 20.6 17.8 21.7 22.7 22.6 25. 19.9 20.8 16.8 21.9 27.5 21.9 23.1 50. 50. 50. 50. 50. 13.8 13.8 15. 13.9 13.3 13.1 10.2 10.4 10.9 11.3 12.3 8.8 7.2 10.5 7.4 10.2 11.5 15.1 23.2 9.7 13.8 12.7 13.1 12.5 8.5 5. 6.3 5.6 7.2 12.1 8.3 8.5 5. 11.9 27.9 17.2 27.5 15. 17.2 17.9 16.3 7. 7.2 7.5 10.4 8.8 8.4 16.7 14.2 20.8 13.4 11.7 8.3 10.2 10.9 11. 9.5 14.5 14.1 16.1 14.3 11.7 13.4 9.6 8.7 8.4 12.8 10.5 17.1 18.4 15.4 10.8 11.8 14.9 12.6 14.1 13. 13.4 15.2 16.1 17.8 14.9 14.1 12.7 13.5 14.9 20. 16.4 17.7 19.5 20.2 21.4 19.9 19. 19.1 19.1 20.1 19.9 19.6 23.2 29.8 13.8 13.3 16.7 12. 14.6 21.4 23. 23.7 25. 21.8 20.6 21.2 19.1 20.6 15.2 7. 8.1 13.6 20.1 21.8 24.5 23.1 19.7 18.3 21.2 17.5 16.8 22.4 20.6 23.9 22. 11.9]

データフレームの作成

#　説明変数らをDataFrameへ変換
df = DataFrame(data=boston.data, columns = boston.feature_names)

# 目的変数をDataFrameへ追加
df['PRICE'] = np.array(boston.target)

# 最初の5行を表示
df.head(5)

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	PRICE
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2

単回帰分析

# 説明変数
data = df.loc[:, ['RM']].values
# 目的変数
target = df.loc[:, 'PRICE'].values
# 予測したいケース rm=1
sample = np.array([[1]])

#dataリストの表示(1-5)
data[0:5]

array([[6.575], [6.421], [7.185], [6.998], [7.147]])

target[0:5]

array([24. , 21.6, 34.7, 33.4, 36.2])

## sklearnモジュールからLinearRegressionをインポート
from sklearn.linear_model import LinearRegression

# オブジェクト生成
model = LinearRegression()
# model.get_params()

# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
# コメントアウトされていた以下のコードは、default のパラメータなので、上記と同じ結果になる 
# fit_intercept を false にすると、切片を使わない=原点を通る直線でフィットする
# model = LinearRegression(fit_intercept = True, normalize = False, copy_X = True,  n_jobs = 1)

# fit関数でパラメータ推定
model.fit(data, target)

print("coef: {}".format(model.coef_))
print("intercept: {}".format(model.intercept_))

# モデルを使って予測
price = model.predict(sample)[0]
print("price: {}".format(price))

coef: [9.10210898] intercept: -34.670620776438554 price: -25.568511795258246

# numpy実装の回帰 (とりあえず 1次元前提の実装)
def train(xs, ys):
    cov = np.cov(xs, ys, ddof=0)
    a = cov[0, 1] / cov[0, 0]
    b = np.mean(ys) - a * np.mean(xs)
    return cov, a, b

def predict(x,a,b):
    return a * x + b

_,  a1, b1 = train(data[:,0], target)

print("coef: {}".format(a1))
print("intercept: {}".format(b1))
price = predict(sample,a1,b1)
print("price: {}".format(price))

coef: 9.102108981180313 intercept: -34.67062077643858 price: [[-25.5685118]]

どちらの実装でも同等の結果が出ているようだが、価格がマイナスになってしまった。

重回帰分析(2変数)

# 説明変数
data2 = df.loc[:, ['CRIM', 'RM']].values
# 目的変数
target2 = df.loc[:, 'PRICE'].values

# 予測したいケース crim = 0.2, rm=7
sample2 = np.array([[0.2, 7]])

# オブジェクト生成
model2 = LinearRegression()

# fit関数でパラメータ推定
model2.fit(data2, target2)

print("coef: {}".format(model2.coef_))
print("intercept: {}".format(model2.intercept_))

# モデルを使って予測
price = model2.predict(sample2)[0]
print("predict: {}".format(price))

coef: [-0.26491325 8.39106825] intercept: -29.24471945192992 predict: 29.43977562281461

# numpy実装の回帰 (N次元の実装)
# 普通に行列で微分すればOK
def train(xs, ys):
    X = np.concatenate([np.ones((xs.shape[0],1), dtype=np.float64),xs],1)
    w = np.dot( np.linalg.pinv(X), ys)
    return w

def predict(x, w):
    X = np.concatenate([np.ones((x.shape[0],1), dtype=np.float64),x],1)
    return np.dot(X,w)

w2 = train(data2,target)
print("coef: {}".format(w2[1:3]))
print("intercept: {}".format(w2[0]))
price = predict(sample2,w2)
print("price: {}".format(price))

coef: [-0.26491325 8.39106825] intercept: -29.24471945192995 price: [29.43977562]

単回帰・重回帰どちらでも、似たような結果が出ているので、モデルの検証は scikit-learnの結果を用いて行うので十分そう

モデルの検証

決定係数の定義は、調べてみると

R^2 = 1 - \frac{ \sum_i (y_i- \hat{y_i})^2}{ \sum_i (y_i - \bar{y})^2 }

らしい。

第二項目に着目してみると、分子分母をデータ数 $n$ で割ると　MSEを目的変数自体の分散で正規化したような値になっている。

# 決定係数
print('単回帰決定係数: %.3f, 重回帰決定係数 : %.3f' % (model.score(data,target), model2.score(data2,target2)))

単回帰決定係数: 0.484, 重回帰決定係数 : 0.542

オリジナルのノートブックには、単回帰に関する検証しか行われていないので、重回帰でも行う

def evaluate(data,target):
    from sklearn.model_selection import train_test_split

    # 70%を学習用、30%を検証用データにするよう分割
    X_train, X_test, y_train, y_test = train_test_split(data, target, test_size = 0.3, random_state = 666)

    # matplotlibをインポート
    import matplotlib.pyplot as plt
    # Jupyterを利用していたら、以下のおまじないを書くとnotebook上に図が表示
    %matplotlib inline

    # 学習用データでパラメータ推定
    model.fit(X_train, y_train)
    # 作成したモデルから予測（学習用、検証用モデル使用）
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # 学習用、検証用それぞれで残差をプロット
    plt.scatter(y_train_pred, y_train_pred - y_train, c = 'blue', marker = 'o', label = 'Train Data')
    plt.scatter(y_test_pred, y_test_pred - y_test, c = 'lightgreen', marker = 's', label = 'Test Data')
    plt.xlabel('Predicted Values')
    plt.ylabel('Residuals')
    # 凡例を左上に表示
    plt.legend(loc = 'upper left')
    # y = 0に直線を引く
    plt.hlines(y = 0, xmin = -10, xmax = 50, lw = 2, color = 'red')
    plt.xlim([10, 50])
    plt.ylim([-40, 40])
    
    plt.show()

    # 平均二乗誤差を評価するためのメソッドを呼び出し
    from sklearn.metrics import mean_squared_error
    # 学習用、検証用データに関して平均二乗誤差を出力
    print('MSE Train : %.3f, Test : %.3f' % (mean_squared_error(y_train, y_train_pred), mean_squared_error(y_test, y_test_pred)))
    # 学習用、検証用データに関してR^2を出力
    print('R^2 Train : %.3f, Test : %.3f' % (model.score(X_train, y_train), model.score(X_test, y_test)))

## 単回帰の場合
evaluate(data,target)

svg

MSE Train : 44.983, Test : 40.412 R^2 Train : 0.500, Test : 0.434

## 重回帰の場合
evaluate(data2,target)

svg

MSE Train : 40.586, Test : 34.377 R^2 Train : 0.549, Test : 0.518

# 別の変数でも試してみる
evaluate(df.loc[:, ['RM','LSTAT']].values,target)

svg

MSE Train : 32.228, Test : 26.711 R^2 Train : 0.642, Test : 0.626

非線形回帰

動画講義メモ(要点のまとめ)

モデルの形
- 線形回帰 $y = \sum w_i x_i$ で、 $x_i$ の代わりに非線形関数 $\phi_i (\bm{x})$ を使う。
  - $\phi_0 = 1, w_0 =1$
  - $w$ については線形 (linear-in-parameter) のまま
  - -> 二乗誤差に対する解としてパラメータは求められるってことね
  - 旧動画では、これを基底関数法と呼んでいた
未学習(underfitting)・過学習(overfitting)について
- 多項式近似の例: 4次以上はフィッティング具合ほとんどかわらない
  - 左のグラフ(1,2次ぐらい) -> 未学習
  - 真ん中(4次ぐらい) -> 望ましい状態
  - 右のグラフ(もっと次数をあげた) -> 過学習
- 訓練誤差とテスト誤差を比較することで、判断ができる
  - cf. 2018年の DL 論文で、過学習可と思いきや、ずっと学習続けてると、またロスが下がり始める現象が報告されていた(double descent)
過学習の対策
1. 学習データを増やす
2. 不要な基底関数を削除して表現力を抑止
- 情報量基準などを使う
- 愚直にやると、組合せ爆発を起こすので大変
1. 正則化法を利用して表現力を抑止
- モデルの複雑さに伴って、その値が大きくなるようなペナルティを関数に課す
  - 今回の場合は重みパラメータの大きさ
    - 4次まで誤差を十分小さくできており、7次以上の項による誤差低減効果はほぼない。この状態で、正則化を行えば、7次以上の重みパラネータが小さく抑えられるはず。
  - $min. \ MSE + \lambda R(\bm{w})$ を解けば良い。( $R(\bm{w})$ が正則化項)
    - KKT条件によって、 $min. \ \mathrm{MSE} \ s.t. \ R(\bm{w}) \le r$ を一つの目的関数の最小化問題に変換
      - 今回の場合、必ずしも上記不等式制約を満たす必要はない(なるべく小さくすればいいだけ)なので上記の議論で問題ない
      - $r$ は $\bm{w}$ によらないので目的関数からは削除されてる
- 正則化項の例
  - Ridge推定量　- 正則化項にL2ノルムを使う (パラメータを0に近づける)
  - Lasso推定量 - 正則化項にL1ノルムを使う (スパース推定)
- 正則化項にかかるパラメータは hyper parameter 　
モデル選択
- 評価するときには、学習に使ってないデータ(検証データ)に対する評価値を見る
- ホールドアウト法
  - 学習に使うデータと、検証に使うデータを固定する
    - 大量にデータが手元にあるときはこれをやることがおおい
  - 注意点 :
    - 検証用データに "はずれ値"が入ってしまうと、はずれ値にフィットするモデルが選ばれる、ということが起きる
- 交差検証法 (クロスバリデーション)
  - データを学習用と評価用に分割するのだが、その分割の仕方をローテーションする。例えば、5分割したうちの1つが検証、残りが学習、みたいな
  - 1つのモデルに対して、 5回の評価が行われる。精度の平均値 CV値で評価する
  - Q&A 精度ってどうやって計算するの? -> 基本的には学習時のloss をそのまま使う
  - 基本的には交差検証法の方の値を報告する
ハイパーパラメータの調整をどうしよう? という話　
- グリッドサーチは実装の練習はしてもいいと思うけど、実践的にはあまり使わない。
- ベイズ最適化(BO)が最近だとよく使われる?
  - PFN の optunaなど

ハンズオン

# Local実行のため、Google ドライブのマウントは行わない
# from google.colab import drive
# drive.mount('/content/drive')

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

#seaborn設定
sns.set()
#背景変更
sns.set_style("darkgrid", {'grid.linestyle': '--'})
#大きさ(スケール変更)
sns.set_context("paper")

n=100
def true_func(x):
    z = 1-48*x+218*x**2-315*x**3+145*x**4
    return z 

def linear_func(x):
    z = x
    return z

# 真の関数からノイズを伴うデータを生成

# 真の関数からデータ生成
data = np.random.rand(n).astype(np.float32)
data = np.sort(data)
target = true_func(data)

# 　ノイズを加える
noise = 0.5 * np.random.randn(n) 
target = target  + noise

# ノイズ付きデータを描画
plt.scatter(data, target)
plt.title('NonLinear Regression')
# plt.legend(loc=2)

Text(0.5, 1.0, 'NonLinear Regression')

svg

線形回帰では十分にフィットできない(未学習の状態)。

from sklearn.linear_model import LinearRegression

clf = LinearRegression()
data = data.reshape(-1,1)
target = target.reshape(-1,1)
clf.fit(data, target)

p_lin = clf.predict(data)

plt.scatter(data, target, label='data')
plt.plot(data, p_lin, color='darkorange', marker='', linestyle='-', linewidth=1, markersize=6, label='linear regression')
plt.legend(loc=2)
print(clf.score(data, target))

0.40814627626298416

svg

多項式回帰

回帰モデルの形は、

f(x) = \sum_{n=0}^{N-1} w_n x^n

入力 $x$ から $\phi_n(x) = x^n, n=0,1,2,\ldots$ を計算しこれを $n$ 次元ベクトルの説明変数だと思って、重回帰を行えばよい。動画講義のスライド同様の結果が得られている

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
deg = [1,2,3,4,5,6,7,8,9,10]

plt.figure(figsize=(8,5))
plt.scatter(data, target, label='data')
plt.title("polinomial regression")
for d in deg:
    regr = Pipeline([
        ('poly', PolynomialFeatures(degree=d)),
        ('linear', LinearRegression())
    ])
    regr.fit(data, target)
    # make predictions
    p_poly = regr.predict(data)
    # plot regression result
    plt.plot(data, p_poly, label='degree %d' % (d))
plt.legend(loc="lower right")

<matplotlib.legend.Legend at 0x7f749c2b7f40>

svg

ガウシアン基底 + 正則化項

次はRBFカーネル(ガウシアン基底)を使う。(RBFカーネルにも色々あるらしいが、scikit-learnの関数を追っかけていくとガウシアンだと分かる)

つまり fit 時に与えたデータ $x_0, x_1,\ldots, x_{N-1}$ に対して、

f(x) =\sum_{i=0}^{N-1} w_i \exp\bigl(-\gamma (x-x_i)^2\bigr)

で回帰を行うということ。

回帰の方式は、Ridge回帰ということで、評価関数は

J(\mathrm{w}) = \sum_i \bigl( y - f(x_i) \bigr)^2 + \alpha \sum_i w_i^2

となる。

まずは、sklearn の KernelRidge を使った簡単な実装で、

$\gamma=1$ (デフォルト値)
$alpha=0.0002$

の場合。

from sklearn.kernel_ridge import KernelRidge

clf = KernelRidge(alpha=0.0002, kernel='rbf')
clf.fit(data, target)
p_kridge = clf.predict(data)

rbf_kernel() 関数によって、 $\mathrm{X} \rightarrow \Phi(\mathrm{X})$ の変換ができるので、多項式回帰と同様に線形回帰との組み合わせで実装することもできる。以下は、

$\alpha=30$
$\gamma=50$

の場合。はじめのモデルと比べてペナルティを強くかけている分、係数が0に集まっていることが分かる

from sklearn.metrics.pairwise import rbf_kernel
from sklearn.linear_model import Ridge

# 学習
kx = rbf_kernel(X=data, Y=data, gamma=50)
clf2 = Ridge(alpha=30)
clf2.fit(kx, target)
p_ridge = clf2.predict(kx)

# plot　
plt.scatter(data, target,label='data')
for i in range(len(kx)):
    if i == 0:
        plt.plot(data, kx[i], color='black', linestyle='-', linewidth=1, markersize=3, label='rbf kernels for each data', alpha=0.2)
    else:
        plt.plot(data, kx[i], color='black', linestyle='-', linewidth=1, markersize=3, alpha=0.2)

plt.plot(data, p_ridge, color='green', linestyle='-', linewidth=1, markersize=3,label='ridge regression #2 (alpha=30,gamma=50)')
plt.plot(data, p_kridge, color='orange', linestyle='-', linewidth=1, markersize=3,label='ridge regression #1 (alpha=0.0002,gamma=1)')
plt.legend()
plt.show()
# score
print("ridge regression #1 : %f" % (clf.score(data,target)))
print("ridge regression #2 : %f" % (clf2.score(kx, target)))

# weight coeff histogram

plt.hist([clf.dual_coef_[:,0], clf2.coef_[0,:]],label=['ridge regression 1','ridge regression 2'], bins=10)
plt.title("coef. histgram")
plt.legend()

svg

ridge regression #1 : 0.857598 ridge regression #2 : 0.836881 <matplotlib.legend.Legend at 0x7f7497a3d190>

svg

つぎに Lasso 回帰の効果を確認する。ハンズオンに最初から入力されていた $alpha=10000$ はペナルティが強すぎてほとんど、誤差が減らない。トライアル&エラーで $alpha=0.002$ がそれなりに fit することを見つけた。

実際に学習された係数を確認してみると、100のうち 4つの重みのみが非ゼロとなった。スパース推定ができていそう。

#Lasso
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.linear_model import Lasso

kx = rbf_kernel(X=data, Y=data, gamma=5)

lasso_clf = Lasso(alpha=0.002, max_iter=50000)
lasso_clf.fit(kx, target)
p_lasso = lasso_clf.predict(kx)

lasso_clf2 = Lasso(alpha=10000, max_iter=50000)
lasso_clf2.fit(kx, target)
p_lasso2 = lasso_clf2.predict(kx)

# plot
plt.scatter(data, target)
plt.plot(data, p_lasso, label="alpha=0.002")
plt.plot(data, p_lasso2,label="alpha=10000")
plt.title("Lasso Regression")
plt.legend()
plt.show()

# score
print("alpha=0.002 :%f" % (lasso_clf.score(kx, target)))
print("alpha=10000 :%f" % (lasso_clf2.score(kx, target)))

print("weight for the model with alpha=0.002")
print(lasso_clf.coef_)

svg

alpha=0.002 :0.848822
    alpha=10000 :-0.000000
    weight for the model with alpha=0.002
    [-0.         -0.         -0.         -0.         -0.         -0.
     -0.         -0.         -0.         -0.         -0.         -2.4409611
     -7.19585    -1.6620084  -4.9211154  -0.         -0.         -0.
     -0.         -0.         -0.         -0.         -0.         -0.
     -0.         -0.         -0.         -0.         -0.         -0.
     -0.         -0.         -0.         -0.         -0.         -0.
     -0.         -0.         -0.         -0.         -0.         -0.
     -0.         -0.         -0.         -0.         -0.          0.
      0.          0.          0.          0.          0.          0.
      0.          0.          0.          0.          0.          0.
      0.          0.          0.          0.          0.         -0.
     -0.         -0.         -0.         -0.         -0.         -0.
     -0.         -0.         -0.         -0.         -0.         -0.
     -0.         -0.         -0.         -0.         -0.         -0.
     -0.         -0.         -0.         -0.         -0.         -0.
     -0.         -0.         -0.         -0.         -0.         -7.282311
     -6.780508   -0.03432941 -0.         -0.        ]

サポートベクトル回帰(SVR)

名前だけ見ると、これは後に学習するサポートベクトルマシンを回帰問題に応用したもの？調べてみると

モデル $f(x)$ は基底関数法と同じ
L2ノルムをペナルティにもつ
誤差関数が二乗誤差和でなく $\max(0, |f(x) - y|-\epsilon)$

らしい。ハンズオンに元々入力されていた $gamma=0.1$ ではうまくフィットしなかったため、 $gamma=1$ で学習。

from sklearn import model_selection, preprocessing, linear_model, svm

# SVR-rbf
clf_svr = svm.SVR(kernel='rbf', C=1e3, gamma=1, epsilon=0.1)
clf_svr.fit(data, target[:,0])
y_rbf = clf_svr.fit(data, target[:,0]).predict(data)

clf_svr2 = svm.SVR(kernel='rbf', C=1e3, gamma=0.1, epsilon=0.1)
clf_svr2.fit(data, target[:,0])
y_rbf2 = clf_svr2.fit(data, target[:,0]).predict(data)


# plot
plt.scatter(data, target, color='darkorange', label='data')
plt.title("Support Vector Regression (RBF)")
plt.plot(data, y_rbf, label='gamma=1')
plt.plot(data, y_rbf2, label='gamma=0.1')

plt.legend()
plt.show()

svg

多層パーセプトロンによる回帰

Kerasによる回帰の例も示されている。が、これはきっと Stage3移行で改めて勉強するので、実行して結果を確認する程度にする

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.1, random_state=0)

!mkdir -p ./out/checkpoints 
!mkdir -p ./out/tensorBoard

from keras.callbacks import EarlyStopping, TensorBoard, ModelCheckpoint

cb_cp = ModelCheckpoint('./out/checkpoints/weights.{epoch:02d}-{val_loss:.2f}.hdf5', verbose=1, save_weights_only=True)
cb_tf  = TensorBoard(log_dir='./out/tensorBoard', histogram_freq=0)

def relu_reg_model():
    model = Sequential()
    model.add(Dense(10, input_dim=1, activation='relu'))
    model.add(Dense(1000, activation='relu'))
    model.add(Dense(1000, activation='relu'))
    model.add(Dense(1000, activation='relu'))
    model.add(Dense(1000, activation='relu'))
    model.add(Dense(1000, activation='relu'))
    model.add(Dense(1000, activation='relu'))
    model.add(Dense(1000, activation='relu'))
    model.add(Dense(1000, activation='linear'))
#     model.add(Dense(100, activation='relu'))
#     model.add(Dense(100, activation='relu'))
#     model.add(Dense(100, activation='relu'))
#     model.add(Dense(100, activation='relu'))
    model.add(Dense(1))

    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

from keras.models import Sequential
from keras.layers import Input, Dense, Dropout, BatchNormalization
from keras.wrappers.scikit_learn import KerasRegressor

# use data split and fit to run the model
estimator = KerasRegressor(build_fn=relu_reg_model, epochs=100, batch_size=5, verbose=1)

history = estimator.fit(x_train, y_train, callbacks=[cb_cp, cb_tf], validation_data=(x_test, y_test))

出力は抜粋(本当は1エポック毎出力されているのだが)

Epoch 1/100
    18/18 [==============================] - 1s 25ms/step - loss: 1.6244 - val_loss: 1.5469

    ...

    Epoch 00009: saving model to ./out/checkpoints/weights.09-1.04.hdf5
    Epoch 10/100
    18/18 [==============================] - 0s 11ms/step - loss: 1.0322 - val_loss: 0.8556
    
    ...
    
    Epoch 00049: saving model to ./out/checkpoints/weights.49-0.47.hdf5
    Epoch 50/100
    18/18 [==============================] - 0s 11ms/step - loss: 0.3095 - val_loss: 0.4469

    ...

    Epoch 00099: saving model to ./out/checkpoints/weights.99-0.49.hdf5
    Epoch 100/100
    18/18 [==============================] - 0s 12ms/step - loss: 0.2995 - val_loss: 0.3197
    
    Epoch 00100: saving model to ./out/checkpoints/weights.100-0.32.hdf5

y_pred = estimator.predict(x_train)

18/18 [==============================] - 0s 3ms/step

plt.title('NonLiner Regressions via DL by ReLU')
plt.plot(data, target, 'o')
plt.plot(data, true_func(data), '.')
plt.plot(x_train, y_pred, "o", label='predicted: deep learning')
#plt.legend(loc=2)

[<matplotlib.lines.Line2D at 0x7f746052cf70>]

svg

plt.figure(figsize=(8,5))
plt.title('compare all regression models')
x=np.linspace(0,1,num=100)
plt.plot(x, true_func(x),label="true function")

plt.plot(x_train, y_pred, "o", label='DL')
plt.plot(data, y_rbf ,label="SVR(RBF)")
plt.plot(data, p_lasso ,label="Lasso(RBF)")
plt.plot(data, p_kridge ,label="Ridge(RBF)")
plt.legend(loc=2)
plt.show()

#plt.plot(data, clf ,label="Polinomial")


#plt.legend(loc=2)

svg

ロジスティック回帰

動画講義メモ(要点のまとめ)

参考: google AI blog の記事
- DNNじゃなくてあえてlogistic 回帰使いました
- using ML to Predict Parking Difficulty (2017年 2月)

分類問題に対するアプローチ
1. 識別的アプローチ
- $p(C_k | x )$ を直接モデル化する方法
  - logistic 回帰は、これ。
- 識別関数をつかう方法
  - $f(x) > 0$ なら $C = 1$ , $f(x) < 0$ なら $C = 0$ みたいな
  - SVMはこれ
1. 生成的アプローチ
- $p(C_k)$ と $P(x|C_k)$ を model化して、その後 Bayesの定理を用いて、 $p(C_k | x )$ を求める
  - ハズレ値の検出をもとめたり、生成モデルをつくれたりする
- ロジスティック回帰での使い方
  - 一般に実数全体をとり得る入力の線形結合に sigmoid 関数をかけて値域を0~1にすることで確率とみなせるようにしている
  - 確率が0.5以上ならば、1, 未満なら、0と予測
- なんでわざわざ確率にする? -> 判断の保留などが可能
  - 識別関数だと、識別結果をそのまま使用するしかない
最尤推定: ベルヌーイ分布のパラメータ推定
- データ-> ベールヌーイ分布の計算
  - 1回の試行 $P(y) = p^y (1-p)^{1-y}$
  - n回の試行で、 $y_1,..,y_n$ が同時に起こる確率 $P(y_1,..,y_n;p)= \prod_{i=0}^{i=n-1} p^{y_i} (1-p)^{1-y_i}$
- パラメータである p をデータ $y_1,..,y_n$ から推定したい。
  - pを色々変えて、 $y_1,..,y_n$ が得られる確率が最大になるような $p$ を求める->最尤推定
- ロジスティク回帰モデルの最尤推定
  - $P(Y=y_1 | \bm{x}_1 ) =p_1^{y_1} (1-p_1)^{1-y_1}$
    - 確率はロジスティック回帰 $p_1 = \sigma(\bm{w}^T \bm{x}_1)$ で計算できるので、尤度が最大になる $\bm{w}$ を探す問題
更新式の導出: ひたすら微分。もう何度もやったのでここでは省略.
- 対数尤度を使う理由
  - 微分の計算が簡単ってのは、正直どうでもいい (それならそう書かないでくれ)
  - 尤度 : 確率を何度も掛け算する -> 桁落ちの可能性
- 微分のchane ルールを理解して使いこなすことが重要
確率的勾配法
- Iteration によってデータを入れ替えることで、パラメータ更新の勾配方向をある程度不規則にして、初期値依存制を低減する効果を狙う
- ただし、ロジスティック回帰なら単峰制なので局所最適に陥る心配はないのだが
- p.70 の参考文献
  - 層を深くすること近似できる関数の自由度がどう増えていくか? とか興味深いがそもそも、どんな意図で貼られているんだろう?
モデルの評価
- 混合行列でマトリックスを使って整理.
  - 行 = モデルの予測結果の Positive/Negative
  - 列 = 検証データの Positive/Negative
  - 対角成分が正解(True), 非対角成分が不正解(False)となる
- FalsePotsitve と FalseNegativeとを分けて評価する
  - Link: http://ibisforest.org/index.php?F%E5%80%A4
  - 検証データ中の Posi/Nega のバランスが取れてないとき何をみてるかわからなくなる
    - 間違い方によって何が嫌かが違う
  - 再現率(Recall): Positiveデータに対する正答率 = TP/(TP+FN)
    - 過剰検出を許容して、ヌケモレのなさを重視する評価
  - 適合率(Precision): モデルがPositiveと予測したものに対する正答率 = TP/(TP+FP)
    - 見逃しを許容して、慎重な検出を評価する
  - F値 : Recall と Precision の調和平均
    - どっちにも振れない微妙なタスクにつかう?
  - 現場でよくやるのは、上記3つ全部並べておくこと

ハンズオン

データ読み込み & 前処理

# from google.colab import drive
# drive.mount('/content/drive')

#from モジュール名 import クラス名（もしくは関数名や変数名）
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#matplotlibをinlineで表示するためのおまじない (plt.show()しなくていい)
%matplotlib inline

# titanic data csvファイルの読み込み
titanic_df = pd.read_csv('../data/titanic_train.csv')

# ファイルの先頭部を表示し、データセットを確認する
titanic_df.head(5)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

#予測に不要と考えるカラムをドロップ (本当はここの情報もしっかり使うべきだと思っています)
titanic_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

#一部カラムをドロップしたデータを表示
# titanic_df.head()

#nullを含んでいる行を表示
titanic_df[titanic_df.isnull().any(1)].head(10)

	Survived	Pclass	Sex	Age	SibSp	Fare	Embarked
5	0	3	male	NaN	0	8.4583	Q
17	1	2	male	NaN	0	13.0000	S
19	1	3	female	NaN	0	7.2250	C
26	0	3	male	NaN	0	7.2250	C
28	1	3	female	NaN	0	7.8792	Q
29	0	3	male	NaN	0	7.8958	S
31	1	1	female	NaN	1	146.5208	C
32	1	3	female	NaN	0	7.7500	Q
36	1	3	male	NaN	0	7.2292	C
42	0	3	male	NaN	0	7.8958	C

#Ageカラムのnullを中央値で補完
titanic_df['AgeFill'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())

#再度nullを含んでいる行を表示 (Ageのnullは補完されている)
titanic_df[titanic_df.isnull().any(1)]

#titanic_df.dtypes

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked	AgeFill
5	0	3	male	NaN	0	0	8.4583	Q	29.699118
17	1	2	male	NaN	0	0	13.0000	S	29.699118
19	1	3	female	NaN	0	0	7.2250	C	29.699118
26	0	3	male	NaN	0	0	7.2250	C	29.699118
28	1	3	female	NaN	0	0	7.8792	Q	29.699118
...	...	...	...	...	...	...	...	...	...
859	0	3	male	NaN	0	0	7.2292	C	29.699118
863	0	3	female	NaN	8	2	69.5500	S	29.699118
868	0	3	male	NaN	0	0	9.5000	S	29.699118
878	0	3	male	NaN	0	0	7.8958	S	29.699118
888	0	3	female	NaN	1	2	23.4500	S	29.699118

179 rows × 9 columns

実装(チケット価格から生死を判別)

#運賃だけのリストを作成
data1 = titanic_df.loc[:, ["Fare"]].values

#生死フラグのみのリストを作成
label1 =  titanic_df.loc[:,["Survived"]].values

from sklearn.linear_model import LogisticRegression

# 学習
model=LogisticRegression()
model.fit(data1, label1[:,0])

# 学習結果の出力
print("モデルパラメータ")
print (model.intercept_)
print (model.coef_)

# 試しにいくつか予測 : 運賃 61USD, 62USD の間で生死の確率が逆転する
print("運賃 61USD")
print( model.predict([[61]]))
print( model.predict_proba([[61]]))

print("運賃 62USD")
print( model.predict([[62]]))
print( model.predict_proba([[62]]))

モデルパラメータ
    [-0.94131796]
    [[0.01519666]]
    運賃 61USD
    [0]
    [[0.50358033 0.49641967]]
    運賃 62USD
    [1]
    [[0.49978123 0.50021877]]

## 可視化 (おまけ程度なので飛ばします、だそう)
X_test_value = model.decision_function(data1) 

# # 決定関数値（絶対値が大きいほど識別境界から離れている）
# X_test_value = model.decision_function(X_test) 
# # 決定関数値をシグモイド関数で確率に変換
# X_test_prob = normal_sigmoid(X_test_value) 

w_0 = model.intercept_[0]
w_1 = model.coef_[0,0]

def sigmoid(x):
    return 1 / (1+np.exp(-(w_1*x+w_0)))

x_range = np.linspace(-1, 500, 3000)

plt.figure(figsize=(9,5))

plt.plot(data1, model.predict_proba(data1)[:,0], 'o', label="predicted prob. (survive)")
plt.plot(data1, model.predict_proba(data1)[:,1], 'o', label="predicted prob. (not survive)")
plt.plot(x_range, sigmoid(x_range), '--', label="sigmoid function with param")
plt.legend()
plt.show()

svg

実装(2変数から生死を判別)

ただ変数を増やすだけでなく、特徴量エンジニアリングを経験するのがポイント

#AgeFillの欠損値を埋めたので
#titanic_df = titanic_df.drop(['Age'], axis=1)

# 'Gender' の male,female を それぞれ 0,1 に置き換え
titanic_df['Gender'] = titanic_df['Sex'].map({'female': 0, 'male': 1}).astype(int)
titanic_df.head(3)

	Survived	Pclass	Sex	Age	SibSp	Fare	Embarked	AgeFill	Gender
0	0	3	male	22.0	1	7.2500	S	22.0	1
1	1	1	female	38.0	1	71.2833	C	38.0	0
2	1	3	female	26.0	0	7.9250	S	26.0	0

# Pclass と Gender を足し合わせた新たな特徴量 'Pclass_Gender' をつくｒ
titanic_df['Pclass_Gender'] = titanic_df['Pclass'] + titanic_df['Gender']
titanic_df.head()

	Survived	Pclass	Sex	Age	SibSp	Fare	Embarked	AgeFill	Gender	Pclass_Gender
0	0	3	male	22.0	1	7.2500	S	22.0	1	4
1	1	1	female	38.0	1	71.2833	C	38.0	0	1
2	1	3	female	26.0	0	7.9250	S	26.0	0	3
3	1	1	female	35.0	1	53.1000	S	35.0	0	1
4	0	3	male	35.0	0	8.0500	S	35.0	1	4

# 不要になった特徴量を drop
titanic_df = titanic_df.drop(['Pclass', 'Sex', 'Gender','Age'], axis=1)
titanic_df.head()

	Survived	SibSp	Fare	Embarked	AgeFill	Pclass_Gender
0	0	1	7.2500	S	22.0	4
1	1	1	71.2833	C	38.0	1
2	1	0	7.9250	S	26.0	3
3	1	1	53.1000	S	35.0	1
4	0	0	8.0500	S	35.0	4

#　使用する2変数だけのリストを作成
data2 = titanic_df.loc[:, ["AgeFill", "Pclass_Gender"]].values

#生死フラグのみのリストを作成
label2 =  titanic_df.loc[:,["Survived"]].values

# skleanr 実装

# 学習
model2 = LogisticRegression()
model2.fit(data2, label2[:,0])

# 予測　
print("10歳でPclass_Gender=1")
print(model2.predict([[10,1]]))
print(model2.predict_proba([[10,1]]))

# 予測
print("10歳でPclass_Gender=4")
print(model2.predict([[10,4]]))
print(model2.predict_proba([[10,4]]))

10歳でPclass_Gender=1 [1] [[0.03754749 0.96245251]] 10歳でPclass_Gender=4 [0] [[0.78415473 0.21584527]]

# numpy 実装
def add_one(x):
    return np.concatenate([np.ones(len(x))[:, None], x], axis=1)

# overflow,underflow 対策 のため素直な実装はしない
def sigmoid(x):
    return (np.tanh(x/2) + 1)/2

# うまく収束しているかどうか確認するための loss計算
def loss(X,y,w):
    yhat = sigmoid(np.dot(X,w))
    # これだと、yhat = 0,1の時にエラーがでる。
    # temp = y * np.log(yhat) + (1-y) * np.log(1-yhat)
    temp = np.log( (1-y)-(-1)**y * yhat)
    return -sum(temp)/len(temp)

# バッチの共役勾配法
def sgd(X_train, y_train, max_iter, eta):
    # w = np.random.rand(X_train.shape[1])
    w = np.zeros(X_train.shape[1])
    for i in range(max_iter):
        if i % 100000 == 0:
            print(loss(X_train, y_train, w))
        w_prev = np.copy(w)
        yhat = sigmoid(np.dot(X_train, w))
        w -= eta * np.dot(X_train.T, (yhat - y_train))
        #if np.allclose(w, w_prev):
        #    return w
    return w

# ヒューリスティックにパラメータチューニング
max_iter=2000000 
eta = 0.000001
w = sgd(add_one(data2), label2[:,0], max_iter, eta)

0.6931471805599373
    0.5079276216927482
    0.4905915541820542
    0.4862090047552061
    0.48489864641172786
    0.48447258191832004
    0.48432764482843516
    0.48427706879224364
    0.48425915804405467
    0.48425275998902756
    0.4842504626951552
    0.4842496352899579
    0.4842493367393749
    0.4842492288953244
    0.48424918991352967
    0.4842491758173902
    0.48424917071889134
    0.484249168874524
    0.4842491682072763
    0.4842491679658649

print("numpy実装で得られたパラメータ")
print(w)

print("sklearn実装で得られたパラメータ")
print(model2.intercept_)
print(model2.coef_)

numpy実装で得られたパラメータ
    [ 5.27828007 -0.04669006 -1.52804692]
    sklearn実装で得られたパラメータ
    [5.2174269]
    [[-0.04622413 -1.51130754]]

titanic_df.head(3)

	Survived	SibSp	Fare	Embarked	AgeFill	Pclass_Gender
0	0	1	7.2500	S	22.0	4
1	1	1	71.2833	C	38.0	1
2	1	0	7.9250	S	26.0	3

特徴量空間上にサンプルと、決定境界をプロットしてみよう。

ロジスティック回帰の場合、確率0.5が境界。

$\begin{array}{rcl} \frac{1}{1+e^{\boldsymbol{w}^T \boldsymbol{x} + w_0}} &> \frac{1}{2} \\ e^{\boldsymbol{w}^T \boldsymbol{x} + w_0} &< 1 \\ \boldsymbol{w}^T \boldsymbol{x} + w_0 &< 0 \end{array}$ が決定境界になる。

from matplotlib.colors import ListedColormap

np.random.seed = 0
h = 0.02
xmin, xmax = -5, 85
ymin, ymax = 0.5, 4.5
index_survived = titanic_df[titanic_df["Survived"]==0].index
index_notsurvived = titanic_df[titanic_df["Survived"]==1].index

xx, yy = np.meshgrid(np.arange(xmin, xmax, h), np.arange(ymin, ymax, h))
Z = model2.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)

fig, ax = plt.subplots()
levels = np.linspace(0, 1.0)
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
#contour = ax.contourf(xx, yy, Z, cmap=cm, levels=levels, alpha=0.5)

sc = ax.scatter(titanic_df.loc[index_survived, 'AgeFill'],
                titanic_df.loc[index_survived, 'Pclass_Gender']+(np.random.rand(len(index_survived))-0.5)*0.1,
                color='r', label='Not Survived', alpha=0.3)
sc = ax.scatter(titanic_df.loc[index_notsurvived, 'AgeFill'],
                titanic_df.loc[index_notsurvived, 'Pclass_Gender']+(np.random.rand(len(index_notsurvived))-0.5)*0.1,
                color='b', label='Survived', alpha=0.3)

ax.set_xlabel('AgeFill')
ax.set_ylabel('Pclass_Gender')
ax.set_xlim(xmin, xmax)
ax.set_ylim(ymin, ymax)
#fig.colorbar(contour)

# 決定境界
x1 = xmin
x2 = xmax
y1 = -1*(model2.intercept_[0]+model2.coef_[0][0]*xmin)/model2.coef_[0][1]
y2 = -1*(model2.intercept_[0]+model2.coef_[0][0]*xmax)/model2.coef_[0][1]
ax.plot([x1, x2] ,[y1, y2], 'g--', label="decision boundary(sklearn)")

y1 = -1*(w[0]+w[1]*xmin)/w[2]
y2 = -1*(w[0]+w[1]*xmax)/w[2]
ax.plot([x1, x2] ,[y1, y2], 'y--', label="decision boundary(numpy)")

# ax.legend(bbox_to_anchor=(1.4, 1.03))
ax.legend(bbox_to_anchor=(1.0, 1.0))

svg

モデル評価

ここでは sklearn, seaborn などを使ったモデル評価の方法をいくつか使ってみる。可視化はおまけ、ということなので、何をやってるか簡単に理解するにとどめておく。

confusion matrix や、Recall, Precision などはビデオ講義で学んだ
seaborn の point plot では $x$ で指定したクラス分類それぞれに付いて、 $y$ の点推定・信頼区間をプロットする.
- 今回 estimator には何も指定していないので母平均の推定を行っている
seaborn の lmplot は、回帰+その結果の可視化をしてくれるもの。
- 今回は logistic に True を指定しているため、ロジスティック回帰を行うが、線形回帰や多項式回帰なども行えるようだ。
- hue,col の指定はそのままでは回帰に使えない, カテゴリ変数を扱うためのもの? グラフを分けたり、色を分けたりする. (この機能を使うことをさして faceted と言っているもよう)
- 薄い範囲は信頼区間(デフォルト95%)を示すようだ。

# データの分割 (本来は同じデータセットを分割しなければいけない。(簡易的に別々に分割している)
from sklearn.model_selection import train_test_split
traindata1, testdata1, trainlabel1, testlabel1 = train_test_split(data1, label1, test_size=0.2)
traindata2, testdata2, trainlabel2, testlabel2 = train_test_split(data2, label2, test_size=0.2)

# 学習
eval_model1=LogisticRegression()
predictor_eval1=eval_model1.fit(traindata1, trainlabel1[:,0]).predict(testdata1)

eval_model2=LogisticRegression()
predictor_eval2=eval_model2.fit(traindata2, trainlabel2[:,0]).predict(testdata2)

# 評価
print("")
print("score:")
print("(model 1, train) = %f" % (eval_model1.score(traindata1, trainlabel1)))
print("(model 1, test)  = %f" % (eval_model1.score(testdata1,testlabel1)))
print("(model 2, train)  = %f" % (eval_model2.score(traindata2, trainlabel2)))
print("(model 1, test)  = %f" % (eval_model2.score(testdata2,testlabel2)))

from sklearn import metrics
print("")
print("metrics:")
print("model1")
print(metrics.classification_report(testlabel1, predictor_eval1))
print("model2")
print(metrics.classification_report(testlabel2, predictor_eval2))

score:
    (model 1, train) = 0.653090
    (model 1, test)  = 0.715084
    (model 2, train)  = 0.761236
    (model 1, test)  = 0.821229
    
    metrics:
    model1
                  precision    recall  f1-score   support
    
               0       0.70      0.97      0.81       113
               1       0.86      0.27      0.41        66
    
        accuracy                           0.72       179
       macro avg       0.78      0.62      0.61       179
    weighted avg       0.76      0.72      0.67       179
    
    model2
                  precision    recall  f1-score   support
    
               0       0.82      0.91      0.86       108
               1       0.83      0.69      0.75        71
    
        accuracy                           0.82       179
       macro avg       0.82      0.80      0.81       179
    weighted avg       0.82      0.82      0.82       179

def plot_confusion_matrix(cm,title):
    fig = plt.figure(figsize = (7,7))
    plt.title(title)
    sns.heatmap(
        cm,
        vmin=None,
        vmax=None,
        cmap="Blues",
        center=None,
        robust=False,
        annot=True, fmt='.2g',
        annot_kws=None,
        linewidths=0,
        linecolor='white',
        cbar=True,
        cbar_kws=None,
        cbar_ax=None,
        square=True, ax=None, 
        #xticklabels=columns,
        #yticklabels=columns,
        mask=None)
    return 


from sklearn.metrics import confusion_matrix
confusion_matrix1=confusion_matrix(testlabel1, predictor_eval1)
confusion_matrix2=confusion_matrix(testlabel2, predictor_eval2)


plot_confusion_matrix(confusion_matrix1, "model1")
plot_confusion_matrix(confusion_matrix2, "model2")

svg

#Paired categorical plots

import seaborn as sns
sns.set(style="whitegrid")

# Load the example Titanic dataset
titanic = sns.load_dataset("titanic")

# Set up a grid to plot survival probability against several variables
g = sns.PairGrid(titanic, y_vars="survived",
                 x_vars=["class", "sex", "who", "alone"],
                 size=5, aspect=.5)

# Draw a seaborn pointplot onto each Axes
g.map(sns.pointplot, color=sns.xkcd_rgb["plum"])
g.set(ylim=(0, 1))
sns.despine(fig=g.fig, left=True)

plt.show()

実際には warning が表示されていたが、割愛。ローカルPC内のパスが含まれていたため

svg

#Faceted logistic regression

import seaborn as sns
sns.set(style="darkgrid")

# Load the example titanic dataset
df = sns.load_dataset("titanic")

# Make a custom palette with gendered colors
pal = dict(male="#6495ED", female="#F08080")

# Show the survival proability as a function of age and sex
g = sns.lmplot(x="age", y="survived", col="sex", hue="sex", data=df,
               palette=pal, y_jitter=.02, logistic=True)
g.set(xlim=(0, 80), ylim=(-.05, 1.05))
plt.show()

svg

主成分分析 (PCA)

動画講義メモ(要点のまとめ)

次元圧縮(変量の個数を減らす)のに使われる方法
- 線形変換によって、 $N$ 個の $M$ 次元のデータを $J$ 次元にしたい
  - データ行列から、データを引いた $\bar{X} := X - \bar{x}$ として、 $\bm{s}_j = \bar{X} \bm{a}_j$ が各データの $j$ 次元めの値となる。
- 変換前後のデータの散らばり具合を維持できれば、データの情報量が減らないと考える
  - $\bm{a}_j j=1,2,\ldots, J$ は、元のデータの広がりが大きい方向。
  - $\bm{a}_j$ の見つけ方
    - 目的関数: 変換後の分散最大化
      - $\argmax_{a} = \bm{a}_j^T Var(\bar{X}) \bm{a}_j$
    - 制約条件: ノルム一定の制約で解の任意制を取り除く
      - $\bm{a}_j^T \bm{a}_j =1$
    - ラグランジュ未定乗数法をつかって解く -> 解: $Var(\bar{X}) \bm{a}_j = \lambda \bm{a}_j$
      - 取り出すべき軸の方向は、固有ベクトルの方向
      - 射影先の分散は固有値と一致
  - 寄与率 : 圧縮前後の情報のロスを定量化したもの
    - 第 $k$ 成分の寄与率は、 $c_k = \frac{\lambda_k}{\sum_{m=1}^{M} \lambda_m}$
    - 第 1 ~ $k$ 主成分までの累積寄与率は $r_k = \frac{\sum_{j=1}^{J}\lambda_j}{\sum_{m=1}^{M} \lambda_m}$

ハンズオン

https://ohke.hateblo.jp/entry/2017/08/11/230000 を参考に利用した、乳がんデータセットの主成分分析

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
%matplotlib inline

cancer_df = pd.read_csv('../data/cancer.csv')
print('cancer df shape: {}'.format(cancer_df.shape))

cancer df shape: (569, 33)

cancer_df

	id	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	...	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst	Unnamed: 32
0	842302	M	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.30010	0.14710	...	17.33	184.60	2019.0	0.16220	0.66560	0.7119	0.2654	0.4601	0.11890	NaN
1	842517	M	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.08690	0.07017	...	23.41	158.80	1956.0	0.12380	0.18660	0.2416	0.1860	0.2750	0.08902	NaN
2	84300903	M	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.19740	0.12790	...	25.53	152.50	1709.0	0.14440	0.42450	0.4504	0.2430	0.3613	0.08758	NaN
3	84348301	M	11.42	20.38	77.58	386.1	0.14250	0.28390	0.24140	0.10520	...	26.50	98.87	567.7	0.20980	0.86630	0.6869	0.2575	0.6638	0.17300	NaN
4	84358402	M	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.19800	0.10430	...	16.67	152.20	1575.0	0.13740	0.20500	0.4000	0.1625	0.2364	0.07678	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
564	926424	M	21.56	22.39	142.00	1479.0	0.11100	0.11590	0.24390	0.13890	...	26.40	166.10	2027.0	0.14100	0.21130	0.4107	0.2216	0.2060	0.07115	NaN
565	926682	M	20.13	28.25	131.20	1261.0	0.09780	0.10340	0.14400	0.09791	...	38.25	155.00	1731.0	0.11660	0.19220	0.3215	0.1628	0.2572	0.06637	NaN
566	926954	M	16.60	28.08	108.30	858.1	0.08455	0.10230	0.09251	0.05302	...	34.12	126.70	1124.0	0.11390	0.30940	0.3403	0.1418	0.2218	0.07820	NaN
567	927241	M	20.60	29.33	140.10	1265.0	0.11780	0.27700	0.35140	0.15200	...	39.42	184.60	1821.0	0.16500	0.86810	0.9387	0.2650	0.4087	0.12400	NaN
568	92751	B	7.76	24.54	47.92	181.0	0.05263	0.04362	0.00000	0.00000	...	30.37	59.16	268.6	0.08996	0.06444	0.0000	0.0000	0.2871	0.07039	NaN

569 rows × 33 columns

cancer_df.drop('Unnamed: 32', axis=1, inplace=True)
cancer_df

	id	diagnosis	radius_mean	texture_mean	perimeter_mean	area_mean	smoothness_mean	compactness_mean	concavity_mean	concave points_mean	...	radius_worst	texture_worst	perimeter_worst	area_worst	smoothness_worst	compactness_worst	concavity_worst	concave points_worst	symmetry_worst	fractal_dimension_worst
0	842302	M	17.99	10.38	122.80	1001.0	0.11840	0.27760	0.30010	0.14710	...	25.380	17.33	184.60	2019.0	0.16220	0.66560	0.7119	0.2654	0.4601	0.11890
1	842517	M	20.57	17.77	132.90	1326.0	0.08474	0.07864	0.08690	0.07017	...	24.990	23.41	158.80	1956.0	0.12380	0.18660	0.2416	0.1860	0.2750	0.08902
2	84300903	M	19.69	21.25	130.00	1203.0	0.10960	0.15990	0.19740	0.12790	...	23.570	25.53	152.50	1709.0	0.14440	0.42450	0.4504	0.2430	0.3613	0.08758
3	84348301	M	11.42	20.38	77.58	386.1	0.14250	0.28390	0.24140	0.10520	...	14.910	26.50	98.87	567.7	0.20980	0.86630	0.6869	0.2575	0.6638	0.17300
4	84358402	M	20.29	14.34	135.10	1297.0	0.10030	0.13280	0.19800	0.10430	...	22.540	16.67	152.20	1575.0	0.13740	0.20500	0.4000	0.1625	0.2364	0.07678
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
564	926424	M	21.56	22.39	142.00	1479.0	0.11100	0.11590	0.24390	0.13890	...	25.450	26.40	166.10	2027.0	0.14100	0.21130	0.4107	0.2216	0.2060	0.07115
565	926682	M	20.13	28.25	131.20	1261.0	0.09780	0.10340	0.14400	0.09791	...	23.690	38.25	155.00	1731.0	0.11660	0.19220	0.3215	0.1628	0.2572	0.06637
566	926954	M	16.60	28.08	108.30	858.1	0.08455	0.10230	0.09251	0.05302	...	18.980	34.12	126.70	1124.0	0.11390	0.30940	0.3403	0.1418	0.2218	0.07820
567	927241	M	20.60	29.33	140.10	1265.0	0.11780	0.27700	0.35140	0.15200	...	25.740	39.42	184.60	1821.0	0.16500	0.86810	0.9387	0.2650	0.4087	0.12400
568	92751	B	7.76	24.54	47.92	181.0	0.05263	0.04362	0.00000	0.00000	...	9.456	30.37	59.16	268.6	0.08996	0.06444	0.0000	0.0000	0.2871	0.07039

569 rows × 32 columns

# ・diagnosis: 診断結果 (良性がB / 悪性がM)
# ・説明変数は 3列以降、目的変数を2列目としロジスティック回帰で分類

# 目的変数の抽出
y = cancer_df.diagnosis.apply(lambda d: 1 if d == 'M' else 0)

# 説明変数の抽出
X = cancer_df.loc[:, 'radius_mean':]

まずは、すべての説明変数を用いたロジスティック回帰で分類問題をといてみる。_ 検証スコア97で分類できた。

# 学習用とテスト用でデータを分離
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 標準化
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# ロジスティック回帰で学習
#  https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html
# cv : 指定した cross validator によって、hyper parameter を自動で選択
#  int値が指定されたときは、指定された分割数で、 k-fold cross validation を行う. 
logistic = LogisticRegressionCV(cv=10, random_state=0, max_iter=200) # 100回だと収束条件を満たす前に終わってしまったようなので増やした　
logistic.fit(X_train_scaled, y_train)

# 検証
print('Train score: {:.3f}'.format(logistic.score(X_train_scaled, y_train)))
print('Test score: {:.3f}'.format(logistic.score(X_test_scaled, y_test)))
print('Confustion matrix:\n{}'.format(confusion_matrix(y_true=y_test, y_pred=logistic.predict(X_test_scaled))))

Train score: 0.988 Test score: 0.972 Confustion matrix: [[89 1] [ 3 50]]

つぎに、PCAによる次元圧縮を行った上で精度を落とさずに分類問題を解けるか? やってみる各主成分の寄与率は下図の通り。第四主成分くらいまでで、寄与率8割になるので、そのあたりがいいところか?

pca = PCA(n_components=30)
pca.fit(X_train_scaled)
plt.bar([n for n in range(1, len(pca.explained_variance_ratio_)+1)], pca.explained_variance_ratio_)

svg

次元圧縮後の特徴量空間上でのサンプルの分布を直接みることも可能だ。わかりやすさのために、第二主成分までを使ってやってみる。 ( numpy 実装を試してみたが、sklearnと同様の結果になることが確認できた。)

# PCA
# 次元数2まで圧縮
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
print('X_train_pca shape: {}'.format(X_train_pca.shape))
# X_train_pca shape: (426, 2)

# 寄与率
print('explained variance ratio: {}'.format(pca.explained_variance_ratio_))
# explained variance ratio: [ 0.43315126  0.19586506]

# 散布図にプロット
temp = pd.DataFrame(X_train_pca)
temp['Outcome'] = y_train.values
b = temp[temp['Outcome'] == 0]
m = temp[temp['Outcome'] == 1]
plt.scatter(x=b[0], y=b[1], marker='o') # 良性は○でマーク
plt.scatter(x=m[0], y=m[1], marker='^') # 悪性は△でマーク
plt.xlabel('PC 1') # 第1主成分をx軸
plt.ylabel('PC 2') # 第2主成分をy軸

X_train_pca shape: (426, 2) explained variance ratio: [0.43315126 0.19586506] Text(0, 0.5, 'PC 2')

svg

# PCA の numpy 実装をためしてみる
n_components=2

def get_moments(X):
    mean = X.mean(axis=0)
    stan_cov = np.dot((X - mean).T, X - mean) / (len(X) - 1)
    return mean, stan_cov
    
def get_components(eigenvectors, n_components):
    # 固有ベクトル・固有値両方 昇順なので、逆転させるてから、取り出す
    W = eigenvectors[:, ::-1][:, :n_components]
    return W.T

def transform_by_pca(X, pca):
    mean = X.mean(axis=0)
    return np.dot(X-mean, pca)

## 実際にやってみる
n_components = 2
mean, stan_cov = get_moments(X_train_scaled)
eigenvalues, eigenvectors = np.linalg.eigh(stan_cov)
components = get_components(eigenvectors, n_components)

X_train_pca_np = transform_by_pca(X_train_scaled, -components.T) ## sklearn の結果と合わせるために符号を逆にした

# 散布図にプロット
temp = pd.DataFrame(X_train_pca_np)
temp['Outcome'] = y_train.values
b = temp[temp['Outcome'] == 0]
m = temp[temp['Outcome'] == 1]
plt.scatter(x=b[0], y=b[1], marker='o') # 良性は○でマーク
plt.scatter(x=m[0], y=m[1], marker='^') # 悪性は△でマーク
plt.xlabel('PC 1') # 第1主成分をx軸
plt.ylabel('PC 2') # 第2主成分をy軸

Text(0, 0.5, 'PC 2')

svg

さほど時間のかかる計算ではないので、第何主成分まで使うとどれくらいの精度、というのを虱潰しにためしてみる。実際には問題の要求される精度や、演算量制約にもよるのだが、ざっくり寄与率から判断した第4主成分あたりまで使う、というのは悪くない選択に思われる。

import numpy as np
train_scores = np.zeros(30) 
test_scores = np.zeros(30)

for k in range(1,31):
    pca = PCA(n_components=k)
    pca.fit(X_train_scaled)

    X_train_pca = pca.transform(X_train_scaled)
    X_test_pca  = pca.transform(X_test_scaled)

    logistic2 = LogisticRegressionCV(cv=10, random_state=0, max_iter=500) # 100回だと収束条件を満たす前に終わってしまったようなので増やした　
    logistic2.fit(X_train_pca, y_train)

    # 検証
    train_scores[k-1] = logistic2.score(X_train_pca, y_train)
    test_scores[k-1]  = logistic2.score(X_test_pca, y_test)
    # print('----------------------------')
    # print('n_components: %d' % ( k ))
    # print('Train score: {:.3f}'.format(train_scores[k-1]))
    # print('Test score: {:.3f}'.format(test_scores[k-1]))
    # print('Confustion matrix:\n{}'.format(confusion_matrix(y_true=y_test, y_pred=logistic2.predict(X_test_pca))))

plt.figure(figsize=(8,4))
plt.plot(range(30), train_scores,label="train")
plt.plot(range(30), test_scores,label="test")
plt.legend()
plt.xlabel("the number of components used")
plt.ylabel("score")
plt.grid()

svg

k-nn/means 法 (レポート提出課題の "アルゴリズム" の単元に該当?)

動画講義メモ(要点のまとめ)

k近傍法
- 分類問題の機械学習手法
- 最近傍のデータをk個撮ってきて、それらがもっとも多く所属するクラスに識別
  - ということは、2クラス分類なら、k は奇数にするということか
- k はハイパパラメータ
  - k を大きくすると決定境界がなめらかになる(多すぎると過学習気味)
k-平均法(k-means)
- 教師なし学習, クラスタリングを行う手法
- アルゴリズム
  1. 各クラスタについて、中心の初期値を設定
  2. 各データ点に対して、各クラスタ中心との距離を計算し、最も距離が近いクラスタを割り当てる
  3. 各クラスたの平均ベクトルを計算しあらたな中心とする
  4. 収束するまで2,3の処理を繰り返す
- 初期値重要
  - 初期値を変えるとクラスタリング結果も変わりうる
    - 初期値が離れていたほうがいい?
    - cf. k-means++ (試験でも出たらしい)
  - クラスタ数 k を変えてももちろん結果が変わる

ハンズオン

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix

from sklearn import cluster, preprocessing, datasets
from sklearn.cluster import KMeans

#https://datahexa.com/kmeans-clustering-with-wine-dataset
wine = datasets.load_wine()
print(wine["DESCR"])

.. _wine_dataset:
    
    Wine recognition dataset
    ------------------------
    
    **Data Set Characteristics:**
    
        :Number of Instances: 178 (50 in each of three classes)
        :Number of Attributes: 13 numeric, predictive attributes and the class
        :Attribute Information:
     		- Alcohol
     		- Malic acid
     		- Ash
    		- Alcalinity of ash  
     		- Magnesium
    		- Total phenols
     		- Flavanoids
     		- Nonflavanoid phenols
     		- Proanthocyanins
    		- Color intensity
     		- Hue
     		- OD280/OD315 of diluted wines
     		- Proline
    
        - class:
                - class_0
                - class_1
                - class_2
    		
        :Summary Statistics:
        
        ============================= ==== ===== ======= =====
                                       Min   Max   Mean     SD
        ============================= ==== ===== ======= =====
        Alcohol:                      11.0  14.8    13.0   0.8
        Malic Acid:                   0.74  5.80    2.34  1.12
        Ash:                          1.36  3.23    2.36  0.27
        Alcalinity of Ash:            10.6  30.0    19.5   3.3
        Magnesium:                    70.0 162.0    99.7  14.3
        Total Phenols:                0.98  3.88    2.29  0.63
        Flavanoids:                   0.34  5.08    2.03  1.00
        Nonflavanoid Phenols:         0.13  0.66    0.36  0.12
        Proanthocyanins:              0.41  3.58    1.59  0.57
        Colour Intensity:              1.3  13.0     5.1   2.3
        Hue:                          0.48  1.71    0.96  0.23
        OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71
        Proline:                       278  1680     746   315
        ============================= ==== ===== ======= =====
    
        :Missing Attribute Values: None
        :Class Distribution: class_0 (59), class_1 (71), class_2 (48)
        :Creator: R.A. Fisher
        :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
        :Date: July, 1988
    
    This is a copy of UCI ML Wine recognition datasets.
    https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
    
    The data is the results of a chemical analysis of wines grown in the same
    region in Italy by three different cultivators. There are thirteen different
    measurements taken for different constituents found in the three types of
    wine.
    
    Original Owners: 
    
    Forina, M. et al, PARVUS - 
    An Extendible Package for Data Exploration, Classification and Correlation. 
    Institute of Pharmaceutical and Food Analysis and Technologies,
    Via Brigata Salerno, 16147 Genoa, Italy.
    
    Citation:
    
    Lichman, M. (2013). UCI Machine Learning Repository
    [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
    School of Information and Computer Science. 
    
    .. topic:: References
    
      (1) S. Aeberhard, D. Coomans and O. de Vel, 
      Comparison of Classifiers in High Dimensional Settings, 
      Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of  
      Mathematics and Statistics, James Cook University of North Queensland. 
      (Also submitted to Technometrics). 
    
      The data was used with many others for comparing various 
      classifiers. The classes are separable, though only RDA 
      has achieved 100% correct classification. 
      (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) 
      (All results using the leave-one-out technique) 
    
      (2) S. Aeberhard, D. Coomans and O. de Vel, 
      "THE CLASSIFICATION PERFORMANCE OF RDA" 
      Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of 
      Mathematics and Statistics, James Cook University of North Queensland. 
      (Also submitted to Journal of Chemometrics).

df = pd.DataFrame(wine.data, columns=wine.feature_names)
df.head()

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	color_intensity	hue	od280/od315_of_diluted_wines	proline
0	14.23	1.71	2.43	15.6	127.0	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065.0
1	13.20	1.78	2.14	11.2	100.0	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050.0
2	13.16	2.36	2.67	18.6	101.0	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185.0
3	14.37	1.95	2.50	16.8	113.0	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480.0
4	13.24	2.59	2.87	21.0	118.0	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735.0

df.describe()

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	color_intensity	hue	od280/od315_of_diluted_wines	proline
count	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000	178.000000
mean	13.000618	2.336348	2.366517	19.494944	99.741573	2.295112	2.029270	0.361854	1.590899	5.058090	0.957449	2.611685	746.893258
std	0.811827	1.117146	0.274344	3.339564	14.282484	0.625851	0.998859	0.124453	0.572359	2.318286	0.228572	0.709990	314.907474
min	11.030000	0.740000	1.360000	10.600000	70.000000	0.980000	0.340000	0.130000	0.410000	1.280000	0.480000	1.270000	278.000000
25%	12.362500	1.602500	2.210000	17.200000	88.000000	1.742500	1.205000	0.270000	1.250000	3.220000	0.782500	1.937500	500.500000
50%	13.050000	1.865000	2.360000	19.500000	98.000000	2.355000	2.135000	0.340000	1.555000	4.690000	0.965000	2.780000	673.500000
75%	13.677500	3.082500	2.557500	21.500000	107.000000	2.800000	2.875000	0.437500	1.950000	6.200000	1.120000	3.170000	985.000000
max	14.830000	5.800000	3.230000	30.000000	162.000000	3.880000	5.080000	0.660000	3.580000	13.000000	1.710000	4.000000	1680.000000

df.isnull().any()

alcohol                         False
    malic_acid                      False
    ash                             False
    alcalinity_of_ash               False
    magnesium                       False
    total_phenols                   False
    flavanoids                      False
    nonflavanoid_phenols            False
    proanthocyanins                 False
    color_intensity                 False
    hue                             False
    od280/od315_of_diluted_wines    False
    proline                         False
    dtype: bool

X=wine.data
y=wine.target

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3, random_state=55)

def species_label(theta):
    if theta == 0:
        return wine.target_names[0]
    if theta == 1:
        return wine.target_names[1]
    if theta == 2:
        return wine.target_names[2]

print(wine.target_names)
y_label = [species_label(theta) for theta in wine.target]
df['species']=y_label

['class_0' 'class_1' 'class_2']

def plot_confusion_matrix(cm,title):
    fig = plt.figure(figsize = (7,7))
    plt.title(title)
    sns.heatmap(
        cm,
        vmin=None,
        vmax=None,
        cmap="Blues",
        center=None,
        robust=False,
        annot=True, fmt='.2g',
        annot_kws=None,
        linewidths=0,
        linecolor='white',
        cbar=True,
        cbar_kws=None,
        cbar_ax=None,
        square=True, ax=None, 
        #xticklabels=columns,
        #yticklabels=columns,
        mask=None)
    return

k-nn で分類

k-nn は学習のプロセスは無いが、識別のために事前にクラスが判明しているサンプルが必要。上で分割したデータを使って試してみる。つまり、Xtrain, ytrain の情報を用いて、X_test の各サンプルのクラスを判別する。

from scipy import stats

def distance(x1, x2):
    return np.sum((x1 - x2)**2, axis=1)

def knc_predict(n_neighbors, x_train, y_train, X_test):
    y_pred = np.empty(len(X_test), dtype=y_train.dtype)
    for i, x in enumerate(X_test):
        distances = distance(x, X_train)
        nearest_index = distances.argsort()[:n_neighbors]
        mode, _ = stats.mode(y_train[nearest_index])
        y_pred[i] = mode
    return y_pred

# X_train, y_train のデータをつかって y_test の識別を行う
n_neighbors = 3
y_pred = knc_predict(n_neighbors, X_train, y_train, X_test)

print(metrics.classification_report(y_test, y_pred))
plot_confusion_matrix(confusion_matrix(y_test,y_pred),"knn")

precision    recall  f1-score   support
    
               0       0.87      1.00      0.93        13
               1       0.74      0.74      0.74        23
               2       0.69      0.61      0.65        18
    
        accuracy                           0.76        54
       macro avg       0.76      0.78      0.77        54
    weighted avg       0.75      0.76      0.75        54

svg

k-means クラスタリング

k-means クラスタリングは、教師なしのクラスタリング手法として知られるので、まずはデータ全体に対してクラスタリングを試してみた。

numpy 実装、sklearn 実装ともに同様のクラスタリング結果が得られたようだ (k-means により割り当てられるクラスタの番号が、正解データの class 番号と一致する保証は全くないが、対角成分が最大になるように何度か実行した)

# skelearn で　k-means
model = KMeans(n_clusters=3)
labels = model.fit_predict(X) 

df['species_sklearn'] = labels
pd.crosstab(df['species_sklearn'], df['species'])

species	class_0	class_1	class_2
species_sklearn
0	46	1	0
1	0	50	19
2	13	20	29

# numpy 実装の k-means
# distance 関数は k-nn で定義したものがそのまま使い回せる
#  毎イタレーションで 配列を確保して速度的には効率が悪いが目をつぶる

def predict(x, centers, n_clusters):
    D = np.zeros((len(x), n_clusters))
    for i, x in enumerate(x):
        D[i] = distance(x, centers)
        cluster_index = np.argmin(D, axis=1)
    return cluster_index

def fit(x, cluster_index, n_clusters):
    centers = np.zeros( (n_clusters,x.shape[1]) )
    # 3) 各クラスタの平均ベクトル（中心）を計算する
    for k in range(n_clusters):
        index_k = cluster_index == k
        centers[k] = np.mean(x[index_k], axis=0)
    return centers

n_clusters = 3
iter_max = 100

# 1) 各クラスタ中心の初期値を設定する ここでは ランダム
centers = X[np.random.choice(len(X), n_clusters, replace=False)]
for _ in range(iter_max):
    prev_centers = np.copy(centers)

    # 2) 各データ点に対して、各クラスタ中心との距離を計算し、最も距離が近いクラスタを割り当てる
    cluster_index = predict(X, centers, n_clusters)

    # 3) 各クラスタの平均ベクトル（中心）を計算する
    centers = fit(X, cluster_index, n_clusters)

    # 4) 収束するまで2, 3の処理を繰り返す    
    if np.allclose(prev_centers, centers):
        break

# 
df['species_np'] = predict(X,centers,n_clusters)
pd.crosstab(df['species_np'], df['species'])

species	class_0	class_1	class_2
species_np
0	46	1	0
1	0	50	19
2	13	20	29

# クラスタリング結果の比較
pd.crosstab(df['species_np'], df['species_sklearn'])

# 各クラスのセントロイド間の ユークリッド距離
print(distance(model.cluster_centers_, centers))

[2.20190800e-28 1.29254120e-26 2.32245581e-28]

ちょっとナンセンスな気もするが、あえて教師あり学習することも可能なのでやってみる。

centers_train = fit(X_train, y_train, n_clusters)
y_pred_kmeans = predict(X_test, centers_train, n_clusters)

print(metrics.classification_report(y_test, y_pred_kmeans))
plot_confusion_matrix(confusion_matrix(y_test,y_pred_kmeans),"k-means")

precision    recall  f1-score   support
    
               0       0.86      0.92      0.89        13
               1       0.74      0.87      0.80        23
               2       0.77      0.56      0.65        18
    
        accuracy                           0.78        54
       macro avg       0.79      0.78      0.78        54
    weighted avg       0.78      0.78      0.77        54

svg

サポートベクターマシン(SVM)

動画講義メモ(要点のまとめ)

2クラス分類を解くための教師あり学習の手法
- 基本の手法
  - 線形判別を前提に境界面(識別関数)を直接的に求める
    - $t = \mathrm{sign} ( \bm{w}^T \bm{x} + b)$ の取る値 +1,-1 が分類クラスに対応
  - 問題設定: マージン(境界面と境界面に最も近い点の距離)が最大になるケース汎化性能が高い、という仮定で境界面を決める
    - ということは、最終的に最近接点(これをサポートベクターと呼ぶ) のみが学習に必要ということになる
    - $\bm{w}, b$ について "最小化"すべき目的関数: $\frac{1}{2} ||\bm{w} ||^{2}$
      - マージン最大の時、境界面から各クラスの最近接点までの距離は等しくなる (どちらかに偏ると最大でなくなる)
      - 最近接点に対しては、 $t_i (\bm{w}^T \bm{x} + b) = 1$ が成立
        
        右辺1 とすることで、 $\bm{w}, b$ のスケールの任意性が取り除かれる
      - 上記2点からマージンは、 $\frac{1}{|| \bm{w} ||}$ と分かる (境界面の法線ベクトル $\bm{w}$ への射影を使う)
      - 逆数にすれば最小化問題。問題が解きやすいように 2乗した上で 1/2 しておく
    - 制約条件: $t_{i}(\bm{w}^T\bm{x}_{i} + b) \ge 1 \quad (i=1, 2, \cdots, n)$
      - この制約条件は、問題の定義から自明
  - 主問題 <-> 双対問題 (結果だけ)
    - 上記の問題設定は解きにくいので問題を解きやすい形に変換する。ここでは結果だけしめすと..
      - $\bm{a}$ について"最大化" すべき目的関数 : $\tilde{L}(\bm{a}) = \sum_{i=1}^{n} a_{i} - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{i} a_{j} t_{i} t_{j} \bm{x}_i^T \bm{x}_i = \bm{a}^{\mathrm{T}} \bm{1} - \frac{1}{2} \bm{a}^{\mathrm{T}} H \bm{a}$
      - 制約条件: $\bm{a}^T \bm{t} = 0, \quad a_i \ge 0 \ (i=1,2,\ldots n)$
      - $\bm{a}$ と $\bm{w}, b$ の関係:
        
        $\bm{w} = \sum_{i=1}^{n} a_{i} t_{i} \bm{x}_{i}$
        
        境界面に最も近い点 $s$ に対して、 $b =\frac{1}{t_i} (t_s - \bm{w}^T \bm{x_s})$
      - 主問題と双対問題の変換は、ラグランジュ, KKT条件等を用いた議論が必要だが、"要点のまとめ"としては割愛
    - 数値解法 (ざっくりの理解)
      - 基本的には勾配法
      - 双対問題の目的関数の勾配つかって更新 (最大化)
        
        $\bm{a} \leftarrow \bm{a} + \eta_{1} (\bm{1} - H \bm{a})$
      - 等式制約条件の左辺=0になるよう、こっちはこっちで勾配使って更新 (最小化)
        
        $\bm{a} \leftarrow \bm{a} - \eta_{2} (\bm{a}^{\mathrm{T}} \bm{t}) \bm{t}$
      - 更新の途中で、 $a_i \ge 0$ を満たさない場合は、無理やり0に射影する。
        
        非負制約は凸な集合なので、そんな悪いことは怒らないはず..
- カーネル関数を使って判別関数を曲面にする
  - 線形判別できない場合も、 $\phi(\bm{x})$ と予め変換した別の特徴量空間に飛ばせばうまくいく?
    - パット思いつくのは単位円が境界となっているような場合? $r = \phi(\bm{x}) = x_1^2 + x_2^2$ と原点からの距離に変換してあげれば、 $r = 1$ というめちゃくちゃシンプルな直線で判別できる
    - $\bm{\tilde{x}}_i = \phi(\bm{x}_i)$ と変換してあげれば、定式化は基本のSVMと同じになるはず
  - 式を展開していくと、必ず $\phi(\bm{x}_i)^T \phi(\bm{x}_j)$ とまとまった形で現れる。これを 2サンプル間の類似度を計算する"カーネル関数" として使う。
    - カーネルのサイズは、 $\phi$ の次元数によらず、サンプル数で演算量が抑えられる
    - $\phi(x)$ の設計・計算経由することなく、直接カーネル関数を扱える
      - RBFカーネルに対応する $phi$ は無限次元!
- ソフトマージンSVM
  - 境界面できれいに2クラス分類できないときに、はみ出てしまうサンプルがあることを許容する
  - はみ出し具合: $\xi_i = \mathrm{max} ( 1 - t_i (\bm{w}^T \bm{x}_i + b), 0 )$ としてこれをペナルティ項として元の目的関数に加える
  - 元の問題設定:
    - $\text{min}_{\bm{w}, b} \qquad \frac{1}{2} || \bm{w} ||^{2} + C \sum_{i=1}^{n} \xi_{i}$
      - ペナルティ項の重み係数 C はハイパパラメータ
    - $\text{subject to} \qquad t_{i}(\bm{w}^T \bm{x}_{i} + b) \ge 1 - \xi_{i} \quad (i=1, 2, \cdots, n)$
  - 双対問題
    - 基本は通常のSVMと同じ。ただし、
    - $a$ の非負制約が $0 \le \bm{a}_i \le C \quad (i=1,2,\ldots,n)$ になる

ハンズオン

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

線形分離可能なケース

def fit(X_train, t, n_iter=500,  eta1=0.01, eta2=0.001, C=0):
    K = X_train.dot(X_train.T) 
    H = np.outer(t, t) * K
    n_samples = len(t)

    a = np.ones(n_samples) # 初期値
    for _ in range(n_iter):
        grad = 1 - H.dot(a)

        # 基本の更新式
        a += eta1 * grad
        # 制約部分
        a -= eta2 * a.dot(t) * t
        # 未定定数 a が全部0以上というのは凸なので、
        # 無理やり射影してもまあ悪いことにはならない
        if C == 0:
            a = np.where(a > 0, a, 0) 
        elif C>0: # ソフトマージンのときはここだけ違う
            a = np.clip(a, 0, C)


    # w,b を求める
    index = a > 1e-6 # epsilon
    support_vectors = X_train[index]
    support_vector_t = t[index]
    support_vector_a = a[index]

    term2 = K[index][:, index].dot(support_vector_a * support_vector_t)
    b = (support_vector_t - term2).mean()
    
    w = np.zeros(X_train.shape[1]) 
    for a, sv_t, sv in zip(support_vector_a, support_vector_t, support_vectors):
        w += a * sv_t * sv

    return w, b, support_vectors

def predict(X,w,b):
    y_project = np.zeros(len(X))
    for i in range(len(X)):
         y_project[i] = w.dot(X[i]) + b
    return np.sign(y_project), y_project

# (2,2) と (-2,-2) の周辺にランダムに点を発生させる
def gen_data():
    x0 = np.random.normal(size=50).reshape(-1, 2) - 2.
    x1 = np.random.normal(size=50).reshape(-1, 2) + 2.
    X_train = np.concatenate([x0, x1])
    ys_train = np.concatenate([np.zeros(25), np.ones(25)]).astype(np.int)
    return X_train, ys_train

# -------- データ生成
X_train, ys_train = gen_data()
# plt.scatter(X_train[:, 0], X_train[:, 1], c=ys_train)
t = np.where(ys_train == 1.0, 1.0, -1.0)

# --------- 学習
w,b,support_vectors = fit(X_train,t)


# --------- 学習結果の確認
# 領域可視化のために mesh 上の各点を入力データとして生成 
xx0, xx1 = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
xx = np.array([xx0, xx1]).reshape(2, -1).T
X_test = xx

y_pred, y_project = predict(X_test,w,b)

# 訓練データを可視化
plt.scatter(X_train[:, 0], X_train[:, 1], c=ys_train)

# サポートベクトルを可視化
plt.scatter(support_vectors[:, 0], support_vectors[:, 1],
                    s=100, facecolors='none', edgecolors='k')

# 領域を可視化
plt.contourf(xx0, xx1, y_pred.reshape(100, 100), alpha=0.2, levels=np.linspace(0, 1, 3))
# マージンと決定境界を可視化
plt.contour(xx0, xx1, y_project.reshape(100, 100), colors='k',
                     levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])

# マージンと決定境界を可視化
plt.quiver(0, 0, 0.1, 0.35, width=0.01, scale=1, color='pink')

<matplotlib.quiver.Quiver at 0x7f1ce7904070>

SVG

線形分離不可能なケース

元のデータ空間では線形分離は出来ないケースに対して、特徴空間上で線形分離することを考える。今回はカーネルとしてRBFカーネル（ガウシアンカーネル）を利用する。

def fit_k(K, X_train, t, n_iter=500,  eta1=0.01, eta2=0.001):
    n_samples = len(t)
    eta1 = 0.01
    eta2 = 0.001
    n_iter = 5000

    H = np.outer(t, t) * K
    # a の学習
    a = np.ones(n_samples)
    for _ in range(n_iter):
        grad = 1 - H.dot(a)
        a += eta1 * grad
        a -= eta2 * a.dot(t) * t
        a = np.where(a > 0, a, 0)

    # support vector の情報取得
    index = a > 1e-6
    support_vectors = X_train[index]
    support_vector_t = t[index]
    support_vector_a = a[index]

    # b の計算
    term2 = K[index][:, index].dot(support_vector_a * support_vector_t)
    b = (support_vector_t - term2).mean()

    return a, b, support_vectors, support_vector_t, support_vector_a


def predict_k(X, b, support_vectors, support_vector_t, support_vector_a):
    y_project = np.ones(len(X)) * b
    for i in range(len(X)):
        for a, sv_t, sv in zip(support_vector_a, support_vector_t, support_vectors):
            y_project[i] += a * sv_t * rbf(X[i], sv)
    y_pred = np.sign(y_project)
    return y_pred,  y_project

# 半径の異なる円周上に点をばらつかせ, 内側を1, 外側を0 に分類
factor = .2
n_samples = 50
linspace = np.linspace(0, 2 * np.pi, n_samples // 2 + 1)[:-1]
outer_circ_x = np.cos(linspace)
outer_circ_y = np.sin(linspace)
inner_circ_x = outer_circ_x * factor
inner_circ_y = outer_circ_y * factor

X = np.vstack((np.append(outer_circ_x, inner_circ_x),
               np.append(outer_circ_y, inner_circ_y))).T
y = np.hstack([np.zeros(n_samples // 2, dtype=np.intp),
               np.ones(n_samples // 2, dtype=np.intp)])
X += np.random.normal(scale=0.15, size=X.shape)
x_train = X
y_train = y

# RBFカーネルの計算
def rbf(u, v):
        sigma = 0.8
        return np.exp(-0.5 * ((u - v)**2).sum() / sigma**2)

X_train = x_train
t = np.where(y_train == 1.0, 1.0, -1.0)

K = np.zeros((n_samples, n_samples))
for i in range(n_samples):
    for j in range(n_samples):
        K[i, j] = rbf(X_train[i], X_train[j])

# 学習
a,b,support_vectors,support_vector_t, support_vector_a = fit_k(K, X_train, t)

# 予測
xx0, xx1 = np.meshgrid(np.linspace(-1.5, 1.5, 100), np.linspace(-1.5, 1.5, 100))
xx = np.array([xx0, xx1]).reshape(2, -1).T
X_test = xx

y_pred, y_project = predict_k(X_test, b, support_vectors, support_vector_t, support_vector_a)

# 訓練データを可視化
plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train)
# サポートベクトルを可視化
plt.scatter(support_vectors[:, 0], support_vectors[:, 1],
                    s=100, facecolors='none', edgecolors='k')
# 領域を可視化
plt.contourf(xx0, xx1, y_pred.reshape(100, 100), alpha=0.2, levels=np.linspace(0, 1, 3))
# マージンと決定境界を可視化
plt.contour(xx0, xx1, y_project.reshape(100, 100), colors='k',
                     levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])

<matplotlib.contour.QuadContourSet at 0x7f1cec0fb0a0>

SVG

重なりのあるケースでのソフトマージンSVM

要点まとめに書いた通り、最終的にパラメータ更新における、 $a$ の不等式制約が少し違うだけ。関数をまとめて、Cの値によって場合分け実装した (オリジナルの実装だと epsilon が異なっていたがまぁ、共通でもいいだろう。)

if C == 0: # 通常のSVM
    a = np.where(a > 0, a, 0) 
elif C>0: # ソフトマージンのとき
    a = np.clip(a, 0, C)

# 通常のSVMで使用したデータよりも集合が近い (1,1)と(-1,-1)の中心で散らばりをもたせたデータ
x0 = np.random.normal(size=50).reshape(-1, 2) - 1.
x1 = np.random.normal(size=50).reshape(-1, 2) + 1.
x_train = np.concatenate([x0, x1])
y_train = np.concatenate([np.zeros(25), np.ones(25)]).astype(np.int)

plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train)

# 学習
X_train = x_train
t = np.where(y_train == 1.0, 1.0, -1.0)
w, b, support_vectors  = fit(X_train, t, n_iter=1000, C=1)

# 領域可視化のため、mesh上の点を生成
xx0, xx1 = np.meshgrid(np.linspace(-4, 4, 100), np.linspace(-4, 4, 100))
xx = np.array([xx0, xx1]).reshape(2, -1).T
X_test = xx 
y_pred, y_project = predict(X_test,w,b)

# 訓練データを可視化
plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train)

# サポートベクトルを可視化
plt.scatter(support_vectors[:, 0], support_vectors[:, 1],
                    s=100, facecolors='none', edgecolors='k')
# 領域を可視化
plt.contourf(xx0, xx1, y_pred.reshape(100, 100), alpha=0.2, levels=np.linspace(0, 1, 3))

# マージンと決定境界を可視化
plt.contour(xx0, xx1, y_project.reshape(100, 100), colors='k',
                     levels=[-1, 0, 1], alpha=0.5, linestyles=['--', '-', '--'])

<matplotlib.contour.QuadContourSet at 0x7f1ce7e997c0>

SVG

双対問題の導出 (通常のSVMのケースについて)

自分もSVMよく覚えてなかったので、ざっくり思い出したい
ハンズオンの説明でも双対問題への変換の説明はすっ飛ばされているように感じる
SVMのケースに特化して、納得した気になるだけならそこまで難しくなさそう

ということで、厳密性 < 直感的理解重視で、自分なりの納得の仕方を記す。

ラビットチャレンジの同期生で、早々と Stage.2 を終えた方はたくさんいるようだが、双対問題がどうやって導かれたかわからん、という人が一定数いるようなので、参考になれば幸い。(厳密に理解したい人は教科書をちゃんと読むのがいいと思います)

最小化問題と不等式制約の２つを分けて扱うのは大変なので、「不等式制約のことは一旦わすれて、最小化問題を解いていたら、勝手に不等式制約もみたされていた」という状況を作れると嬉しい。「目的関数を小さくしつつ、こういう性質を満たして欲しい」みたいな解を探す方法は色々あるが、(ラビットチャレンジ勢的に) 最初に思いつくのはペナルティ項の追加(正則化)ですよね。不等式制約が満たされないと大きな値をとるような関数 $f(\bm{w},b)$ を足してやればよさそう。

J(\bm{w},b) = \frac{1}{2} ||\bm{w} ||^{2} + f(\bm{w},b)

ただし、今まで扱ってきた正則化では、「なるべくノルムが小さくなるように」でヨカッタのだが、不等式制約は絶対に満たさなければいけないので、下記注意が必要。

不等式制約が満たされているときに $f(\bm{w},b)$ が何らかの値を持ってしまうとすると、

この新たな目的関数の勾配は、元の目的関数 $\frac{1}{2} ||\bm{w} ||^2$ をもっとも減少させる勾配とは少しズレた方向を向くことになって問題だ。つまり、不等式制約を満たす時、 $f=0$ でなければならない。

また制約を満たさないペナルティに対して、制約を犯したときの第一項目が減少量が大きいければ、不等式制約を満たす保証ができない。

つまり、不等式制約を満たさないとき、 $f=+\infty$ でないとならない。

このような性質をもった関数はあるだろうか? ぱっと考え着くのは、微分不可能・不連続点のある関数だが、これを扱うのは数学的に難しそう。こういうちょっと特殊な関数の振る舞いを、再現するときは、新しい変数・次元を増やしてやるとうまくいくことがある。(カーネルSVMもそうだよね) ここでは、新しい"非負の" 変数を導入して、それに対する max をとる関数として定義してみよう。

f(\bm{w},b) = \mathrm{max}_{\bm{a}} \sum_i a_i \bigl(-(t_i (\bm{w}^T \bm{x}_i + b) -1)\bigr)

不等式制約を満たすときつまり $(t_i (\bm{w}^T \bm{x}_i + b) -1) \ge 0$ の時、 $a_i$ と掛け算する値が負なので、最大化するには $a_i=0$ 　で $f=0$ ※ これKKT条件等式制約に一致する
不等式制約を満たさないときつまり $(t_i (\bm{w}^T \bm{x}_i + b) -1) < 0$ の時 $a_i$ と掛け算する値が生なので、最大化するには $a_i \rightarrow \infty$ で $f=\infty$

うん、よさそうだ。ということで解くべき問題は、

\mathrm{min}_{\bm{w},b} \biggl( \frac{1}{2} ||\bm{w} ||^{2} +　\mathrm{max}_{\bm{a}>0} \sum_i a_i \bigl(-(t_i (\bm{w}^T \bm{x}_i + b) -1)\bigr) \biggr) \qquad a_i \ge 0 \quad (i=1,2,\ldots)

ということになるが、 $\bm{w}, b, \bm{a}$ は本来独立に動かせる変数で、第一項目に $\bm{a}$ は含まれないので、

\mathrm{min}_{\bm{w},b} \mathrm{max}_{\bm{a}>0} 　 \biggl( \frac{1}{2} ||\bm{w} ||^{2} + \sum_i a_i \bigl(-(t_i (\bm{w}^T \bm{x}_i + b) -1)\bigr) \biggr) \qquad a_i \ge 0 \quad (i=1,2,\ldots) := \mathrm{min}_{\bm{w},b} \mathrm{max}_{\bm{a}>0} \tilde{L}(\bm{w},b,\bm{a})

と書き換えてもよい。

ここで、内側で $\mathrm{max}_{\bm{a}>0}$ の操作をすることは、より小さな最適値を見つけるための可能性を狭めることにつながるので、一般には

\mathrm{min}_{\bm{w},b} \mathrm{max}_{\bm{a}>0} \tilde{L}(\bm{w},b,\bm{a}) \ge \mathrm{max}_{\bm{a}>0} \mathrm{min}_{\bm{w},b} \tilde{L}(\bm{w},b,\bm{a})

の関係が成り立ち、この等号が成立するかどうかは自明ではない。(cf. 弱双対定理)

しかしながら、今、各パラメータによる $\tilde{L}$ の勾配 = 0 から得られる式は、基本的に各パラメータに対しての1次ないし0次の式。また制約式も線形だ。最大化、最小化を行う順番を変える、というのは、勾配=0の式を使って、そのパラメータを消去する順番が違うということに相当するわけだが、問題がずっと線形のままならば、どういう順番でパラメータを消去していったところで、直感的には最適値が変わることはなさそうである。

ということで、問題は、

\mathrm{max}_{\bm{a}>0} \mathrm{min}_{\bm{w},b} \tilde{L}(\bm{w},b,\bm{a})

と書き換えられる事になり、ハンズオンの議論につながる。

取り組み概要 & 感想

取り組みの記録

3/31, 4/1: まずはざっと新しい動画を視聴(1.5倍速)しながらメモをとり・微分計算。基本的に動画を途中でストップすることはなかった。約200分で完了。まぁ、問題なさそうだなぁ。という感触だけ得る。

4/1~28 : しばらく塩漬け状態.. (GTC21の聴講などをしていたので)

4/29 : ハンズオンの実行環境の構築。Collaborately はセル実行のレスポンスが遅いのでローカルに環境を整える。win10上に pyenv + poetry でを仮想環境を作る。30分程度。

4/30-5/1 : 線形回帰旧動画の方をみながらハンズオン。新旧動画をみながら残したメモを適当に要点まとめとしてblog記事にする。その後ハンズオンのノートブックファイルを markdown 化し、課題提出用blogの記事ty中にinclude。ハンズオンが 45分, 要点まとめや記事の整形が 1h.

5/1 : 非線形回帰。旧動画をみながら、新動画をみたときのメモと見比べていくつかコメント追加し、要点まとめ。その後、ハンズオンを実行。scikit-learn で RBFカーネルといったときには何をさすのか? SVRってなんだ? など調べていたらすこし時間をくった。合計 3h。

5/2~5/5 : 非線形回帰と同様の取り組み方で、残りの単元を1日1つずつ仕上げる。各日3h程度なので、合計12hくらい

5/6 : ステージテスト。所要時間は 30 分程度で、結果は14点/15点。ハヤトチリしました。。スタートテストと違って制限時間が無いのでリラックスして受けられた。

感想ほか　

講義動画について。旧動画と新動画でけっこう趣が違う。時間に余裕があれば、両方見てもよいのではなかろうか。

新動画
- 「基本きっちり押さえれば、後の細かい所は頑張れるでしょ」的な解説。講師の男性はもともとは研究寄りの人な印象を受ける
- 手計算してるところをみせてくれる (ロジスティック回帰の勾配計算とかやったことない人はやりましょう)
- 実戦的な注意点などは、参考情報が旧動画より多め
- その他資料に載っていない背景説明などもある。結果的に時間の都合で資料の説明を一部すっとばすことがある
- 「お前らコレ難しいからわかってねーだろ」的な発言がチラホラ。人によってはイラッとするかも?

旧動画のいいところ
- 基本的に資料に沿った説明をたんたんとしてくれる感じで、講師の女性はどちらかというとソフトウェア開発寄の人な印象
- ハンズオンの説明はこちらの方が丁寧

ハンズオンは基本的に、資料/動画で学習した手法を使ったデータ分析で、自分で編集しなくてもそのまま実行できてしまう。コードから学ぶことに慣れている人は、買ってに勘所を見つけて得るべきものを得るが、人によっては、「へーこんな結果になるんだぁ」で終わってしまうんでなかろうか? それだとちょっともったいない。例えば、大事なところを数カ所穴埋めにして、考えるきっかけにしてもらう、などやっても良い気がした。

テストについての感想。(他の受講生もあとから受ける可能性があるため具体的な内容には触れない) 15問中13問はビデオ講義のどこかででやった内容そのままだった。単純なミスや早とちりこそあれ、これらの問題に自信を持って回答できないのであれば、しっかり復習をした方がよさそうだ。残り2問は、ラビットチャレンジで初めて勉強を始めた人にとっては、未知の手法が出題されたように感じるかもしれない。しかし、この2問のうち先に出題される問題が、その未知の手法自体を問題文で説明してくれているので、問題文の意味をしっかり捉えれば正解は導ける。また、それができれば、芋づる式に残り1問も計算問題として解けるので、難しすぎるとか、講義内容とのミスマッチなどを感じることは個人的にはなかった。

計画の見直し (2021/05/06 時点)

4/18に終わらせるはずだったステージ2が半月遅れてしまった。実際に学習に費やしている時間はそこまで長くないが、取り組む時間を確保するのが難しい。

これまでステージテストはスムーズに合格しているので、各ステージうまく理解ができていると思いたい。この調子ですすめれば、修了テストにかかる時間は少し縮めてもよいだろう。スケジュールを下記の様に修正する。

~2021/2/15 : スタートテスト (2021/02/07完了)
~2021/3/30 : ステージ1 (2021/03/30完了)
~2021/5/6 : ステージ2 (2021/05/06完了)
~2021/5/30 : ステージ3
~2021/6/27 : ステージ4
~2021/7/4 : 復習 -> 修了テスト
~2021/7/15 : Eもぎライト -> 今後の計画具体化
~2021/7/30 : シラバスの未習箇所の学習
~2021/8/26 : 全体の復習
2021/8/27,28: E資格受験

だんだん 8月に万全を期しての受験が厳しくなってきた気もするが、、淡々と進めていこう。