Key Terms:
1. Training Set
2. Training set cost function Jtrain(theta)
3. Cross Validation Set
4. Cross-validation cost function Jcv(theta)
5. Test Set
6. Test set cost function Jtest(theta)
7. Prediction error / cost (the computed value of the cost function)
8. Bias
9. Variance
10. Under-Fitting
11. Over-Fitting
12. Regularisation
13. Precision
14. Recall
15. F Score

Advice for Applying Learning Algorithms
1. Machine Learning Diagnostic
1. To improve performance, you may:
1. Collect more training examples
2. Get additional features
3. Add polynomial features
4. Reduce the number of features
5. Increase lambda
6. Decrease lambda
2. However, an option such as collecting more data may take six months and gain you nothing.
3. A Machine Learning Diagnostic is:
1. A test you can run to gain insight into what is / isn't working in a learning algorithm, and to gain guidance on how best to improve its performance.
2. It takes time to implement.
3. It can sometimes rule out certain courses of action (changes to your learning algorithm) as unlikely to improve performance significantly.
4. Training/Testing Procedure
1. Split the dataset 70/30: 70% becomes the training set, 30% the test set.
2. Learn parameters theta from the training data.
3. Compute the test set error:
1. Jtest(theta) = the squared-error cost function evaluated on the test data
4. Misclassification error (0/1 misclassification error): count each test example as 1 if misclassified and 0 otherwise, then average over the test set.
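The split-and-evaluate procedure above can be sketched in Python. This is a minimal illustration: the 70/30 ratio comes from the notes, while the helper names and the shuffle seed are my own choices.

```python
import random

def train_test_split(data, train_frac=0.7, seed=0):
    """Shuffle a dataset and split it 70/30 into training and test sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def misclassification_error(predictions, labels):
    """0/1 misclassification error: the fraction of wrongly predicted examples."""
    wrong = sum(1 for p, y in zip(predictions, labels) if p != y)
    return wrong / len(labels)
```

For example, splitting a 10-example dataset yields 7 training and 3 test examples, and a prediction list that gets 2 of 4 labels wrong has a misclassification error of 0.5.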
2. Model Selection and Train/Validation/Test Sets
1. Degree of Polynomial
2. Evaluating your hypothesis
1. Training Set: 60% of the data
2. Cross Validation Set (CV): 20% of the data
3. Test Set: 20% of the data
3. Use the validation set to select the model
1. Compute the cost function Jcv(theta) on the CV set for each candidate degree
2. Choose the degree with the minimal Jcv; in the course example, d = 4
3. Estimate the generalisation error on the test set: Jtest(theta(4))
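The degree-selection loop above can be sketched as follows. This assumes NumPy is available and uses `np.polyfit`/`np.polyval` as a stand-in for the course's gradient-descent training; the function name and toy data are my own.

```python
import numpy as np

def select_degree(x_train, y_train, x_cv, y_cv, max_degree=10):
    """Fit a polynomial of each degree d on the training set, then pick
    the d whose cross-validation squared error Jcv is smallest."""
    best_d, best_jcv = None, float("inf")
    for d in range(1, max_degree + 1):
        theta = np.polyfit(x_train, y_train, d)           # train on the training set
        jcv = np.mean((np.polyval(theta, x_cv) - y_cv) ** 2) / 2  # Jcv on the CV set
        if jcv < best_jcv:
            best_d, best_jcv = d, jcv
    return best_d, best_jcv
```

On data drawn from a quadratic target, the chosen degree drives Jcv essentially to zero, mirroring the "pick the minimal Jcv" step (the course example's d = 4 depends on its own data).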
3. Bias vs. Variance
1. Bias => the part of the error that comes from a hypothesis too simple for the data
1. High bias => lambda too large => under-fitting
2. Variance => the part of the error that comes from fitting the training data too closely
1. High variance => lambda too small => over-fitting
3. Diagnosing Bias vs. Variance
1. Set up a plot with the degree of polynomial d on the x-axis and the prediction error on the y-axis.
2. Train on the 60% training set and compute Jtrain(theta); as d grows, the training error shrinks towards 0.
3. Evaluate the trained model on the cross-validation set; past some degree the model over-fits and Jcv(theta) keeps rising.
4. Question
1. How can we tell whether the model suffers from bias or from variance?
1. Bias => under-fitting, the low-degree region of the plot (too few features)
1. Jtrain(theta) will be high
2. Jcv(theta) ≈ Jtrain(theta), i.e. also high
2. Variance => over-fitting, the high-degree region of the plot (too many features)
1. Jtrain(theta) will be low
2. Jcv(theta) >> (much higher than) Jtrain(theta): after passing its minimum, Jcv rises and ends up far above the training error. This is the classic over-fitting signature: the model generalises poorly to new data.
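The Jtrain/Jcv patterns just described can be captured in a small helper. This is a hypothetical sketch: the `acceptable_error` and `gap_factor` thresholds are illustrative choices of mine, not course values.

```python
def diagnose(j_train, j_cv, acceptable_error, gap_factor=2.0):
    """Classify a model's problem from its training and CV errors:
    - high bias: Jtrain itself is high, and Jcv sits close to it;
    - high variance: Jtrain is low, but Jcv is much higher."""
    if j_train > acceptable_error:
        return "high bias (under-fitting)"
    if j_cv > gap_factor * j_train:
        return "high variance (over-fitting)"
    return "ok"
```

For instance, Jtrain = 5.0 with Jcv = 5.5 against a target error of 1.0 reads as high bias, while Jtrain = 0.2 with Jcv = 3.0 reads as high variance.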
5. Choosing the regularisation parameter lambda
1. Try lambda from small to large, e.g.:
1. 0
2. 0.01
3. 0.02
4. 0.04
5. 0.08
6. ...
7. 10
2. Pick the lambda whose trained parameters best fit the cross-validation set (lowest Jcv), say theta(5)
3. Compute the cost J(theta) on the test data set
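The doubling schedule and the CV-based pick can be sketched as below; the function names are mine, and `cv_error` stands in for "train with this lambda, then evaluate Jcv".

```python
def lambda_candidates(smallest=0.01, largest=10.0):
    """Build the trial list 0, 0.01, 0.02, 0.04, 0.08, ... by doubling
    the candidate lambda until it exceeds `largest`."""
    grid, lam = [0.0], smallest
    while lam <= largest:
        grid.append(lam)
        lam *= 2
    return grid

def choose_lambda(cv_error, candidates):
    """Return the lambda with the lowest cross-validation error Jcv.
    `cv_error(lam)` should train with regularisation lam and return Jcv."""
    return min(candidates, key=cv_error)
```

If Jcv happens to bottom out near lambda = 0.08, `choose_lambda` picks 0.08 from the grid.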
4. Bias/Variance as a function of the regularisation parameter lambda
1. When lambda is 0
1. You can fit the training set relatively well, since there is no regularisation.
2. When lambda is small
1. You get a small value of Jtrain
3. When lambda grows large
1. The bias grows, so Jtrain becomes much larger
5. Learning Curves
1. When the training set size m grows
1. It gets harder to fit every example, so the training error grows
2. With more examples the model generalises better to new examples, so the cross-validation error decreases
2. High Bias
1. High bias means the hypothesis h(theta) is a low-degree function that cannot fit the dataset well
2. When m is small, Jcv is high and Jtrain is low; both converge to a similar value once the dataset grows large enough
3. The value both Jcv and Jtrain converge to is fairly HIGH
4. Conclusion
1. If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much: the error converges to a high plateau and stops falling
3. High Variance
1. High variance means lambda is small and the polynomial hypothesis has very many features
2. When m is small
1. Jtrain is small; as m grows, fitting gets harder and Jtrain rises, but it stays fairly low
2. Jcv is large (over-fitting makes predictions on out-of-sample points poor); as m grows, Jcv gradually falls towards Jtrain
3. The indicative diagnostic of a high-variance problem:
1. Even as m grows, a large gap remains between the training error and the cross-validation error
4. Conclusion
1. If a learning algorithm is suffering from high variance, getting more training data is likely to help
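The high-bias learning-curve shapes can be reproduced with a tiny experiment of my own construction: a straight line (degree d = 1, too simple, hence high bias) fitted to data from a quadratic target, with NumPy's `polyfit` standing in for the course's training routine.

```python
import numpy as np

x = np.linspace(-1, 1, 200)
y = x ** 2                       # the true target is quadratic
x_tr, y_tr = x[::2], y[::2]      # 100 training examples
x_cv, y_cv = x[1::2], y[1::2]    # 100 cross-validation examples

def errors_at(m, degree=1):
    """Fit a degree-1 line to m evenly spread training examples and
    return (Jtrain, Jcv), each as a halved mean squared error."""
    pick = np.linspace(0, len(x_tr) - 1, m).astype(int)
    theta = np.polyfit(x_tr[pick], y_tr[pick], degree)
    j_train = np.mean((np.polyval(theta, x_tr[pick]) - y_tr[pick]) ** 2) / 2
    j_cv = np.mean((np.polyval(theta, x_cv) - y_cv) ** 2) / 2
    return j_train, j_cv
```

With m = 2 the line passes exactly through its two training points (Jtrain ≈ 0) while Jcv is large; by m = 100, Jtrain has risen, Jcv has fallen, and both sit at a similarly high plateau, exactly the high-bias picture where more data stops helping.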
6. Deciding What to Try Next (Revisited)
1. When debugging a learning algorithm, you find your model makes unacceptably large prediction errors. What should you do next?
1. Get more training examples
1. When Jcv is much higher than Jtrain, this fixes high variance
2. Try smaller sets of features
1. Dropping features that carry little information also fixes high variance
3. Try getting additional features
1. The model may be under-fitting (high bias); additional features help it fit the training set better
4. Try adding polynomial features (x1 * x1, x2 * x2, x1 * x2, etc.)
1. This also addresses high bias (under-fitting)
5. Try decreasing lambda
1. This also addresses high bias (under-fitting)
6. Try increasing lambda
1. This addresses high variance (over-fitting)
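The six remedies above pair off neatly by problem, which can be written down as a lookup. The pairing is from the notes; the function and its `gap_factor` threshold are illustrative assumptions of mine.

```python
REMEDIES = {
    "high variance": [
        "get more training examples",
        "try smaller sets of features",
        "try increasing lambda",
    ],
    "high bias": [
        "try getting additional features",
        "try adding polynomial features",
        "try decreasing lambda",
    ],
}

def suggest_fixes(j_train, j_cv, gap_factor=2.0):
    """If Jcv is far above Jtrain, treat the problem as high variance;
    otherwise assume the error comes from high bias."""
    problem = "high variance" if j_cv > gap_factor * j_train else "high bias"
    return problem, REMEDIES[problem]
```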
2. Neural Networks and Over-fitting
1. Small neural network
1. Fewer layers and units, hence fewer parameters; more prone to under-fitting
2. Computationally cheaper
2. Large neural network
1. Type 1: few layers, many units per layer
2. Type 2: many layers, few units per layer
3. More parameters; more prone to over-fitting
4. Computationally more expensive
5. Use regularisation to address over-fitting
6. Try one layer, two layers, three layers... and compute Jcv(theta) for each to decide how many layers to use

Machine Learning System Design
1. Building a Spam Classifier
1. Prioritising What to Work On
2. Recommended Approach 推荐的方法
2. Handling Skewed Data
1. Error Metrics for Skewed Data
1. Precision
1. Precision = TruePositive / (no. of predicted positive)
2. No. of predicted positive = TruePositive + FalsePositive
2. Recall
1. Recall = TruePositive / (no. of actual positive)
2. No. of actual positive = TruePositive + FalseNegative
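The two formulas translate directly to code (function names are mine):

```python
def precision(tp, fp):
    """Precision = TruePositive / (TruePositive + FalsePositive):
    of everything predicted positive, the share that really was positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Recall = TruePositive / (TruePositive + FalseNegative):
    of everything actually positive, the share we caught."""
    return tp / (tp + fn)
```

For example, 80 true positives with 20 false positives gives precision 0.8; the same 80 true positives with 120 false negatives gives recall 0.4.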
2. Trading Off Precision and Recall
1. High precision, low recall: suppose we want to predict y = 1 only if very confident
1. Predict 1 if h(x) >= 0.9
2. Predict 0 if h(x) < 0.9
2. High recall, low precision: suppose we want to avoid missing positive cases
1. Predict 1 if h(x) >= 0.3
2. Predict 0 if h(x) < 0.3
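The threshold effect can be checked on a toy set of hypothesis outputs; the scores and labels below are made up for illustration.

```python
def predict(scores, threshold):
    """Predict y = 1 whenever h(x) >= threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]

def precision_recall(preds, labels):
    """Compute (precision, recall) from parallel prediction/label lists."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec
```

Raising the threshold from 0.3 to 0.9 trades recall away for precision: fewer positives are predicted, so those that are tend to be right, but more true positives are missed.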