Getting Started with TensorFlow: Building a Binary Classification Model

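This post shows how to use Keras to build a model for a binary classification problem. Such problems come up often: for example, predicting whether a patient is at risk of a particular disease, or whether an email is spam.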

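Prerequisites

If you have no background in ML, you may want to read a few of my earlier articles first:

You will also need basic Python and some familiarity with Google Colab or Jupyter notebooks to follow the code in this post.

For how to use Google Colab, see the following resources: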

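Model-Building Workflow

First, a quick recap of the steps involved in building a machine learning model:

Get the data: load a toy dataset from Sklearn
Explore and prepare the data: check the data's shape and split it into train / test sets
Build the model: build a binary classification model
Tune and evaluate the model: ways to judge how good the model is

The rest of this post follows this workflow.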

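Importing Libraries

The first step is to import the libraries. You can import a library anywhere in a notebook, but putting all imports at the top avoids errors such as NameError when cells are moved around or run out of order.

Code: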

import tensorflow as tf
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import numpy as np
import itertools

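A brief explanation of the libraries above:

tensorflow: builds and trains the model
matplotlib: plotting
pandas: builds the DataFrame (data table)
sklearn: machine learning utilities (dataset, train / test split, metrics)
numpy: numerical arrays (used here to round predictions and position tick marks)
itertools: iteration helpers (used here to loop over the cells of the confusion matrix when plotting)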

Getting the Data

This time the model is trained on the Breast Cancer data, which comes from Sklearn's toy datasets.

sample_dataset = load_breast_cancer()

For details, see:
https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

Unlike the TensorFlow datasets used in earlier posts, this loads the whole dataset as a single piece; it is not pre-split into train / test sets.
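
As a quick illustration of what the loader hands back, here is a minimal sketch. The variable names are my own; `return_X_y=True` is an optional argument of `load_breast_cancer` that returns just the two arrays instead of the dictionary-like object.

```python
from sklearn.datasets import load_breast_cancer

# The default call returns a dictionary-like object holding everything.
dataset = load_breast_cancer()
print(type(dataset).__name__)
print("Keys:", list(dataset.keys()))
print("Class names:", dataset.target_names)

# return_X_y=True skips the wrapper and returns (features, target) directly.
features, target = load_breast_cancer(return_X_y=True)
print("Features:", features.shape, "Target:", target.shape)
```

Either form gives the same single, unsplit block of data.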

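Exploring and Preparing the Data

Once you have the data, the next step is to understand it. The following code can be used to explore it: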

sample_features = sample_dataset.data
sample_target = sample_dataset.target
column_names = sample_dataset.feature_names


print("Dataset Information")
print("Features Shape: ", sample_features.shape)
print("Target Shape: ", sample_target.shape)
print("Number of Feature: ", sample_features.shape[1])
print("Number of Features Row: ", sample_features.shape[0])
print("Number of Target Row: ", sample_target.shape[0])
print("Features Col Names: ", column_names)

feature_data_df = pd.DataFrame(sample_features, columns=column_names)
print(feature_data_df.head())

feature_target_df = pd.DataFrame(sample_target, columns=["Target"])
print(feature_target_df.head())

# Distribution of the target
plt.hist(sample_target)
plt.title("Breast Cancer Target Distribution")
plt.show()

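There are a few things you must check:

The shape of the feature data
The shape of the target data
The feature column names

The shape tells you how many features there are and how many samples there are.

You can also use a histogram (plt.hist) to look at the distribution of the target.
For classification problems the target distribution matters a lot: if the two classes are badly imbalanced, it is hard to train a model that performs well.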
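
If you prefer numbers to a histogram, the class balance can also be counted directly. This is a small sketch using `np.bincount`, which is my choice here rather than something from the original notebook:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# np.bincount counts how often each integer label (0, 1) occurs.
counts = np.bincount(dataset.target)

for name, count in zip(dataset.target_names, counts):
    print(f"{name}: {count} samples ({count / counts.sum():.1%})")
```

The two classes are not perfectly even, but neither is rare, so plain accuracy is still a meaningful first check for this dataset.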

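Train / Test Data

The data we loaded is not pre-split into "training" and "test" sets, so we have to do that ourselves. In general, the data you collect comes as one undivided whole, often without even the target (Target / y) separated from the features.

Splitting the data into "training" and "test" sets is actually simple. Sklearn provides a handy function for it, used like this: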

train_data, test_data, train_target, test_target = train_test_split(feature_dataset, target_dataset)

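Inputs:

feature_dataset: the feature data, also called "X"
target_dataset: the target data, also called "y" or "label"

Outputs:

train_data: feature data for training
test_data: feature data for testing (evaluating the model)
train_target: target data for training
test_target: target data for testing

The following options are also available:

test_size: the proportion of the data held out for testing, typically around 0.2 to 0.3
random_state: the random seed. With the same input and the same seed the split is identical every time, so changes in the error rate cannot be blamed on a different random split.
shuffle: whether to shuffle the order of the data before splitting. It is on by default, and shuffling is usually the better choice.

The code is as follows: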

# Split Train and Test

train_data, test_data, train_target, test_target = train_test_split(
    sample_features, sample_target, random_state=104,
    test_size=0.25, shuffle=True
)

# Train & Test data information
print("Train Data Shape: ", train_data.shape)
print("Test Data Shape: ", test_data.shape)
print("Train Target Shape: ", train_target.shape)
print("Test Target Shape: ", test_target.shape)
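
To see what random_state buys you, here is a small sketch on made-up data (the arrays `X` and `y` below are toy values, not the breast cancer data). Two calls with the same seed should produce exactly the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 2 features, alternating labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Same inputs and same random_state, called twice.
split_a = train_test_split(X, y, test_size=0.25, random_state=104, shuffle=True)
split_b = train_test_split(X, y, test_size=0.25, random_state=104, shuffle=True)

# Each result is (X_train, X_test, y_train, y_test); compare them pairwise.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same)
print("Train rows:", split_a[0].shape[0], "Test rows:", split_a[1].shape[0])
```

Without a fixed seed, each run shuffles differently, so the reported metrics move slightly from run to run even when the model code is unchanged.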

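Building the Model

The approach is much the same as for Linear Regression. You need to:

Define the architecture: model = tf.keras.Sequential
Compile the model: model.compile
Train the model: model.fit

The difference is that classification uses a different loss and different activation functions.

The code is as follows: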

# model setup 
# Set random seed
tf.random.set_seed(42)

model = tf.keras.Sequential(
        [
            tf.keras.layers.Dense(10, activation="relu"), 
            tf.keras.layers.Dense(1, activation="sigmoid")
        ])

The hidden layer uses the relu activation, and the output layer uses sigmoid, which squashes the output into a value between 0 and 1.
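
For intuition, the two activations can be written out in a few lines of numpy. This is only a sketch of the math; the helper functions are my own and are not how Keras implements them internally:

```python
import numpy as np

def relu(x):
    # Negative inputs become 0, positive inputs pass through unchanged.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-2.0, 0.0, 2.0])
print("relu:   ", relu(z))
print("sigmoid:", sigmoid(z))
```

Because sigmoid always lands between 0 and 1, the single output unit can be read as the predicted probability of the positive class.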

# model compile 
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.05),
    metrics=["accuracy"]
)

BinaryCrossentropy is used as the loss. It measures how far the predicted probabilities are from the true labels.
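
To make the loss less of a black box, here is a minimal numpy sketch of the binary cross-entropy formula, -mean(y * log(p) + (1 - y) * log(1 - p)). The function name and the `eps` clipping value are my own assumptions, added to avoid log(0); this is not Keras's actual implementation.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log() never sees exactly 0 or 1 (eps is an assumed value).
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))

# Confident and correct predictions give a small loss.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
# Confident but wrong predictions are punished heavily.
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])

print("Confident and right:", round(confident_right, 4))
print("Confident and wrong:", round(confident_wrong, 4))
```

The loss grows quickly as the model becomes confidently wrong, which is what pushes the sigmoid output toward the true label during training.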

The last step is to train the model:

epochs = 100
history = model.fit(train_data, train_target, epochs=epochs)

If you want to see how training went, you can use the following code:

model.summary()
pd.DataFrame(history.history).plot()
plt.show()

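Tuning and Evaluating the Model

Once training is done, the next step is to evaluate the model.

For classification problems, these 3 metrics (Metrics) are the main ones to check:

Precision: of the samples the model predicted as positive, the fraction that really are positive.
Recall: of the samples that really are positive, the fraction the model managed to identify.
F1-Score: the harmonic mean of Precision and Recall, combining both into one number.

For all three, higher is better. For details, see:
https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers
https://www.analyticsvidhya.com/blog/2021/07/metrics-to-evaluate-your-classification-model-to-take-the-right-decisions/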
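
The three metrics are easy to work out by hand on a tiny example. The labels below are made up purely for illustration:

```python
import numpy as np

# Made-up labels: 4 real positives, 4 real negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # predicted positive, really positive
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted positive, really negative
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # predicted negative, really positive

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn}")
print(f"Precision={precision:.3f} Recall={recall:.3f} F1={f1:.3f}")
```

Here the model finds most of the real positives (decent recall) but also raises two false alarms, which is what pulls precision down.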

This is the code for getting these metrics:

## Evaluate the model
model.evaluate(test_data, test_target)

## Predict on the test set and round the probabilities to 0 / 1 labels
y_pred = model.predict(test_data)
y_pred = np.round(y_pred)

# Precision and Recall
# Recall = TP / (TP + FN)
# Precision = TP / (TP + FP)
# F1 Score = 2 * (precision * recall) / (precision + recall)
precision = sklearn.metrics.precision_score(test_target, y_pred)
recall = sklearn.metrics.recall_score(test_target, y_pred)
f1 = sklearn.metrics.f1_score(test_target, y_pred)
print("Precision: ", precision)
print("Recall: ", recall)
print("F1 Score: ", f1)
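
A note on the np.round step above: rounding a sigmoid output amounts to cutting at 0.5. The sketch below uses made-up probabilities in the same (n, 1) shape that model.predict returns, and shows an explicit comparison that does the same thing while also letting you pick another cut-off. The 0.3 threshold is only an example value.

```python
import numpy as np

# Made-up sigmoid outputs, shaped (n, 1) like model.predict's result.
probs = np.array([[0.08], [0.50], [0.51], [0.97]])

# Explicit threshold at 0.5, flattened to a 1-D array of int labels.
labels = (probs > 0.5).astype(int).ravel()
print("Threshold 0.5:", labels)

# np.round gives the same labels here, as floats.
print("np.round:     ", np.round(probs).ravel())

# A lower threshold trades precision for recall (0.3 is an arbitrary example).
cautious_labels = (probs > 0.3).astype(int).ravel()
print("Threshold 0.3:", cautious_labels)
```

For a screening task such as this one, missing a real case is often worse than a false alarm, so a threshold below 0.5 can be worth trying.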

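Confusion Matrix

Finally, let's talk about one of the more important charts in classification, the Confusion Matrix. It shows at a glance how many "False Positives" and "False Negatives" there are, and so how serious each kind of error is.

This is what a Confusion Matrix looks like:

See the figure below for a detailed explanation:

In short, the top-left and bottom-right cells are correct predictions, and the other two corners are errors.
If the model performs well, the top-left and bottom-right cells show up in dark colors.
Otherwise, the model's predictions are not yet good enough.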
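
The layout is easier to remember with a tiny example. With sklearn's confusion_matrix, rows are the true labels and columns are the predicted labels, so for 0 / 1 labels the four cells can be unpacked in the order TN, FP, FN, TP. The labels below are made up for illustration:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 3 real negatives, 4 real positives.
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)

# Row 0 = true negatives' row, row 1 = true positives' row.
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

The diagonal (TN and TP) holds the correct predictions; the off-diagonal cells (FP and FN) are the two kinds of mistakes.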

The code is as follows:

print("Confusion Matrix")
print(confusion_matrix(test_target, y_pred))


def confusion_matrix_graph(test_target, y_pred):
    cm = confusion_matrix(test_target, y_pred)
    classes = ["Negative", "Positive"]
    plt.imshow(cm, interpolation="nearest", cmap=plt.cm.Blues)
    plt.title("Confusion Matrix")
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    threshold = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center", color="white" if cm[i, j] > threshold else "black")

    plt.ylabel("True Label")
    plt.xlabel("Predicted Label")
    plt.show()


confusion_matrix_graph(test_target, y_pred)

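First, y_pred = model.predict(test_data) gets the model's predictions (rounded to 0 / 1 with np.round), which are then passed together with test_target to confusion_matrix_graph for display.
As for the logic inside confusion_matrix_graph, I will leave that for a future post on matplotlib.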

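Summary

The complete code is available in my Google Colab:
https://colab.research.google.com/drive/11pPRaoN0LRkcFeNLOkDAHQry53V9EaK1?usp=sharing

After working through this post, you should be able to:

Load a toy dataset with sklearn
Build and train a binary classification model
Evaluate a binary classifier
Produce a Confusion Matrix

Finally, if you have any questions, feel free to ask me directly on Facebook or Twitter.

Also, please follow my FB page so you hear about new posts as soon as they are published.
Facebook page: https://www.facebook.com/kencoder1024
Twitter: https://twitter.com/kenlakoo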
