import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
house = fetch_california_housing()
X = pd.DataFrame(house.data, columns=house.feature_names)
X.head()
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |
Column descriptions:

- MedInc: median household income in the block
- HouseAge: median house age in the block
- AveRooms: average number of rooms per household
- AveBedrms: average number of bedrooms per household
- Population: block population
- AveOccup: average number of household members (occupancy)
- Latitude: latitude of the block
- Longitude: longitude of the block

The target variable is the median house value of the block:
house.target # array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894])
sns.heatmap(X.corr(), annot=True, center=0, linewidths=0.8)
Latitude and longitude are strongly correlated, so to keep the features (approximately) linearly independent we retain only one of them.

Drop Latitude:
X.drop(columns='Latitude',inplace=True)
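This reasoning can be sanity-checked on a small synthetic sketch (made-up coordinates, not the housing data): two near-linearly related columns show a correlation close to -1, and one of the pair is dropped.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
lat = rng.uniform(32, 42, size=200)
# Make longitude an almost-linear function of latitude, mimicking strong correlation
lon = -0.9 * lat - 85 + rng.normal(scale=0.1, size=200)
df = pd.DataFrame({"Latitude": lat, "Longitude": lon})

corr = df.corr().loc["Latitude", "Longitude"]
print(corr)  # close to -1

# Keep only one of the correlated pair
df = df.drop(columns="Latitude")
```

In practice, a heuristic cutoff such as |corr| > 0.8 is often used when deciding whether to drop one of a correlated pair.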
Split the dataset:
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X, house.target)
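By default `train_test_split` holds out 25% of the samples for testing; passing `random_state` (not done above) makes the split reproducible. A minimal sketch on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Default test_size is 0.25; random_state fixes the shuffle
xtr, xte, ytr, yte = train_test_split(X, y, random_state=42)
print(len(xtr), len(xte))  # 15 5
```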
Standardization (z-score):
$\frac{x_i-\bar{x}}{\mathrm{std}}$
Min-max (0-1) scaling:
$\frac{x_i-x_{min}}{x_{max}-x_{min}}$
from sklearn.preprocessing import StandardScaler
The fitted scaler `std` stores the mean and standard deviation of every feature:
std = StandardScaler().fit(xtrain)
xtrain_ = pd.DataFrame(std.transform(xtrain), columns=X.columns)
xtrain_
Note that the test set is transformed with the mean and standard deviation learned from the training set; fitting a separate scaler on the test set would leak test-set information.

xtest_ = pd.DataFrame(std.transform(xtest), columns=X.columns)
xtest_
The pattern: instantiate the scaler, fit() it on the training data, then transform() the data.
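The same three-step pattern can be sketched on toy data; `fit_transform` combines the last two steps into one call (the data here is made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Instantiate, fit on the data, then transform it
scaler = StandardScaler().fit(X)
Xa = scaler.transform(X)

# fit_transform collapses fit + transform into one call
Xb = StandardScaler().fit_transform(X)

print(np.allclose(Xa, Xb))  # True
print(Xa.mean(axis=0))      # each column now has (near-)zero mean
print(Xa.std(axis=0))       # and unit standard deviation
```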
from sklearn.preprocessing import MinMaxScaler
mm = MinMaxScaler().fit(xtrain)
xtrain_mm = mm.transform(xtrain)
xtest_mm = mm.transform(xtest)
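One detail worth noting (an observation added here, not from the original post): because the scaler's min and max come from the training set, transformed test values can fall outside [0, 1]. A tiny sketch:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[1.0], [3.0], [5.0]])
test = np.array([[0.0], [6.0]])  # values outside the training range

mm = MinMaxScaler().fit(train)
print(mm.transform(train).ravel())  # [0.   0.5  1. ] -- training data lands in [0, 1]
print(mm.transform(test).ravel())   # [-0.25  1.25] -- test data may not
```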
Import the models:
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
Initialize:
Lasso_ = Lasso(alpha=1.0, max_iter=5000, tol=0.0001)  # L1 regularization; max_iter: maximum number of coordinate-descent iterations; tol: convergence tolerance
Ridge_ = Ridge(alpha=1.0, tol=0.0001)  # L2 regularization
LR = LinearRegression(fit_intercept=True)  # ordinary linear regression
Lasso regression:
Lasso_.fit(xtrain_, ytrain)  # standardized xtrain
# Lasso(max_iter=5000)
Lasso_.coef_ # array([ 0., 0., 0., -0., -0., -0., -0.])
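Every coefficient has been shrunk to zero: with standardized features, alpha=1.0 is a heavy L1 penalty. A small synthetic sketch (made-up data, not the housing set) shows how alpha controls sparsity:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features actually matter
y = X @ np.array([0.6, 0.3, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

strong = Lasso(alpha=1.0).fit(X, y)   # heavy L1 penalty
weak = Lasso(alpha=0.01).fit(X, y)    # mild L1 penalty

print(strong.coef_)  # every coefficient driven to zero
print(weak.coef_)    # the two informative features survive
```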
Ridge regression:
Ridge_.fit(xtrain_,ytrain) # Ridge(tol=0.0001)
Ridge_.coef_ # array([ 1.01918834, 0.20957514, -0.4967906 , 0.44822796, 0.02923409, -0.06038656, -0.03551321])
Ordinary linear regression:
LR.fit(xtrain_,ytrain) # LinearRegression()
LR.coef_  # array([ 1.01948766, 0.20957077, -0.4973972 , 0.44878819, 0.02922805, -0.06039858, -0.03553571])
from sklearn.metrics import mean_squared_error
MSE of the Lasso model:
mean_squared_error(ytest,Lasso_.predict(xtest_))
Result: 1.3495937394960946

MSE of the Ridge model:
mean_squared_error(ytest,Ridge_.predict(xtest_))
Result: 0.6598541912416058

MSE of ordinary linear regression:
mean_squared_error(ytest,LR.predict(xtest_))
Result: 0.6598326497351307
The `score` method returns R² (the coefficient of determination) on the test set:

Lasso_.score(xtest_,ytest)

Result: -0.00012336012803904062

Ridge_.score(xtest_,ytest)

Result: 0.5110116684554803

LR.score(xtest_,ytest)

Result: 0.5110276318992988
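Lasso's near-zero R² reflects the too-large alpha=1.0 rather than a failure of L1 regularization itself. One standard remedy (not used in the original post) is to choose alpha by cross-validation with `LassoCV`; a sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([0.6, 0.3, 0.0, 0.0, 0.0]) + rng.normal(scale=0.1, size=200)

# Cross-validate over a grid of alphas and refit with the best one
cv_model = LassoCV(cv=5).fit(X, y)
print(cv_model.alpha_)       # chosen penalty, far below the default 1.0
print(cv_model.score(X, y))  # R^2 on the data
```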
Reprinted from: http://xkbvn.baihongyu.com/