Case Study: Kaggle Titanic Survival Prediction

xiaoxiao2021-03-01 28

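Data Collection and Understanding

    # Setting ast_node_interactivity = "all" makes a notebook cell display
    # the result of every statement, not just the last one
    from IPython.core.interactiveshell import InteractiveShell
    InteractiveShell.ast_node_interactivity = "all"

    # Import packages
    import pandas as pd
    import numpy as np

    # Load the data
    train = pd.read_csv(r'E:\python\data\titanic\train.csv')
    test = pd.read_csv(r'E:\python\data\titanic\test.csv')
    print('Training set shape: {}'.format(train.shape))
    print('Test set shape: {}'.format(test.shape))

Output:

    Training set shape: (891, 12)
    Test set shape: (418, 11)

The training set has one more column than the test set: Survived. Since Survived is the value to be predicted, it is absent from the test set.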

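    # Inspect the first rows of each data set
    train.head()
    test.head()

To make cleaning easier, the training data and the test data are merged into a single data set.

    # With ignore_index=True the merged data set gets a newly generated index
    full = pd.concat([train, test], ignore_index=True)
    full.head()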
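The effect of ignore_index can be seen on two small frames. This is a minimal sketch; the names toy_train and toy_test and their values are made up for illustration and are not the Titanic data.

```python
import pandas as pd

# Toy stand-ins for train/test (hypothetical values, not the real data)
toy_train = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
toy_test = pd.DataFrame({"PassengerId": [3, 4]})

# Without ignore_index the original row labels 0, 1 are repeated
kept = pd.concat([toy_train, toy_test])
# With ignore_index=True the result gets a fresh 0..n-1 index
fresh = pd.concat([toy_train, toy_test], ignore_index=True)

print(list(kept.index))   # [0, 1, 0, 1]
print(list(fresh.index))  # [0, 1, 2, 3]
# Rows that came from toy_test have no Survived value, so it becomes NaN
print(fresh["Survived"].isnull().sum())  # 2
```

The NaN values introduced for the test rows are also why Survived ends up as a float column in the merged data.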

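A brief explanation of each field:

PassengerId: passenger ID;

Survived: survival outcome, 0 = died, 1 = survived;

Pclass: ticket class, 1 = first class, 2 = second class, 3 = third class;

Name: passenger name;

Sex: sex;

Age: age;

SibSp: number of siblings/spouses aboard (immediate relatives of the same generation);

Parch: number of parents/children aboard (immediate relatives of a different generation);

Ticket: ticket number;

Fare: ticket price;

Cabin: cabin number;

Embarked: port of embarkation (S: Southampton; C: Cherbourg; Q: Queenstown)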

    # Descriptive statistics
    full.describe()

By default, describe() summarizes only the numeric columns and does not cover string (object) columns, so info() is used to inspect every column.

    full.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1309 entries, 0 to 1308
    Data columns (total 12 columns):
    Age            1046 non-null float64
    Cabin          295 non-null object
    Embarked       1307 non-null object
    Fare           1308 non-null float64
    Name           1309 non-null object
    Parch          1309 non-null int64
    PassengerId    1309 non-null int64
    Pclass         1309 non-null int64
    Sex            1309 non-null object
    SibSp          1309 non-null int64
    Survived       891 non-null float64
    Ticket         1309 non-null object
    dtypes: float64(3), int64(4), object(5)
    memory usage: 122.8+ KB

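The data set has 1309 rows. Age is missing in 263 rows and Fare in 1 row. Survived has only 891 non-null values, which matches the number of rows in the training set. Besides Age and Fare, Cabin and Embarked also have missing values. The missing counts can also be listed directly with another command:

    full.isnull().sum()

Output:

    Age             263
    Cabin          1014
    Embarked          2
    Fare              1
    Name              0
    Parch             0
    PassengerId       0
    Pclass            0
    Sex               0
    SibSp             0
    Survived        418
    Ticket            0
    dtype: int64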
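Missing counts are easier to compare as rates. A minimal sketch on a made-up frame (the values are illustrative, not taken from the Titanic files):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing entries
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives booleans; their mean is the fraction of missing rows
missing_rate = toy.isnull().mean().sort_values(ascending=False)
print(missing_rate)
```

Applied to the merged data, the same expression would put Cabin at the top, since 1014 of its 1309 values are missing.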

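Data Cleaning

For numeric columns, fill missing values with the mean or the median.

Age: the minimum is 0.17 and there are no zero values. The missing rate is 263/1309 = 20.09%. Because the mean and the median of Age are close, the mean is used to fill the missing values.

    full['Age'] = full['Age'].fillna(full['Age'].mean())

Fare has only one missing value, but some records have a fare of 0:

    full.loc[full['Fare'] == 0, :]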

First, look at the average fare per ticket class among the records whose fare is not 0.

    full.loc[full['Fare'] != 0, :].groupby('Pclass')['Fare'].mean()

Output:

    Pclass
    1    89.447482
    2    21.648108
    3    13.378473
    Name: Fare, dtype: float64

These three means are used to replace the zero fares in the corresponding ticket class, and the mean over all records is used to fill the missing value.

    full.loc[(full['Fare'] == 0) & (full['Pclass'] == 1), 'Fare'] = 89.4
    full.loc[(full['Fare'] == 0) & (full['Pclass'] == 2), 'Fare'] = 21.6
    full.loc[(full['Fare'] == 0) & (full['Pclass'] == 3), 'Fare'] = 13.4
    full['Fare'] = full['Fare'].fillna(full['Fare'].mean())
    full.describe()
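The per-class means do not have to be typed in by hand. The sketch below computes them with groupby and transform on a made-up frame. Note one assumption that differs from the code above: here the missing fare is also filled with its class mean rather than the overall mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two zero fares and one missing fare
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Fare": [80.0, 100.0, 0.0, 10.0, 0.0, np.nan],
})

# Treat a fare of 0 as missing so it does not drag the class mean down
fare = toy["Fare"].replace(0, np.nan)
# Mean of the known fares within each class, broadcast back to every row
class_mean = fare.groupby(toy["Pclass"]).transform("mean")
toy["Fare"] = fare.fillna(class_mean)
print(toy["Fare"].tolist())  # [80.0, 100.0, 90.0, 10.0, 10.0, 10.0]
```

This avoids rounding the means and keeps working if the data changes.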

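For categorical columns, replace missing values with the most common category.

    # Count each value in the Embarked column
    full['Embarked'].value_counts()

Output:

    S    914
    C    270
    Q    123
    Name: Embarked, dtype: int64

The most common port of embarkation is "S", so it is used to fill the missing values.

    full['Embarked'] = full['Embarked'].fillna('S')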
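The most common category can also be looked up programmatically instead of being read off the counts. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical port codes with two missing entries
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() returns a Series (there can be ties); take the first entry
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)
print(most_common)            # S
print(filled.isnull().sum())  # 0
```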

For string columns, fill in values according to the actual situation; information that cannot be recovered is filled with "Unknown". Missing Cabin values are filled with 'U', which stands for Unknown.

    full['Cabin'] = full['Cabin'].fillna('U')
    full.describe()

    full.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1309 entries, 0 to 1308
    Data columns (total 12 columns):
    Age            1309 non-null float64
    Cabin          1309 non-null object
    Embarked       1309 non-null object
    Fare           1309 non-null float64
    Name           1309 non-null object
    Parch          1309 non-null int64
    PassengerId    1309 non-null int64
    Pclass         1309 non-null int64
    Sex            1309 non-null object
    SibSp          1309 non-null int64
    Survived       891 non-null float64
    Ticket         1309 non-null object
    dtypes: float64(3), int64(4), object(5)
    memory usage: 122.8+ KB

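Feature Extraction

How do we know which features matter? Usually this means talking to people who know the business logic, expressing the features they describe in code, and producing new features through repeated experiments and experience.

Sex:

    # Map the Sex values to numbers
    # male maps to 1, female maps to 0
    sex_dict = {'male': 1, 'female': 0}
    full['Sex'] = full['Sex'].map(sex_dict)
    full['Sex'].head()

Output:

    0    1
    1    0
    2    0
    3    0
    4    1
    Name: Sex, dtype: int64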

Embarked (port of embarkation):

Use get_dummies for one-hot encoding.

    # One-hot encode with get_dummies to produce dummy variables;
    # the column name prefix is Embarked
    EmbarkedDf = pd.get_dummies(full['Embarked'], prefix='Embarked')
    EmbarkedDf.head()

    # Add the EmbarkedDf features to the full data set
    full = pd.concat([full, EmbarkedDf], axis=1)  # axis=1 appends columns
    full.head()

Because Embarked has already been one-hot encoded into dummy variables, the original Embarked column is dropped.

    full = full.drop('Embarked', axis=1)
    full.head()
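What get_dummies produces is easiest to see on a tiny input. A minimal sketch with a made-up Series of port codes:

```python
import pandas as pd

# Hypothetical port codes
ports = pd.Series(["S", "C", "Q", "S"], name="Embarked")
dummies = pd.get_dummies(ports, prefix="Embarked")

# One indicator column per category, named prefix_category
print(list(dummies.columns))  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
# Exactly one indicator is set in each row
print(dummies.astype(int).sum(axis=1).tolist())  # [1, 1, 1, 1]
```

Depending on the pandas version, the indicator columns are stored as uint8 or bool; the astype(int) call makes the row sums behave the same either way.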

Pclass (ticket class):

Same approach as above.

    PclassDf = pd.get_dummies(full['Pclass'], prefix='Pclass')
    PclassDf.head()

    full = pd.concat([full, PclassDf], axis=1)
    full = full.drop('Pclass', axis=1)
    full.head()

Name (passenger name):

    full['Name'].head()

Output:

    0                              Braund, Mr. Owen Harris
    1    Cumings, Mrs. John Bradley (Florence Briggs Th...
    2                               Heikkinen, Miss. Laina
    3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
    4                             Allen, Mr. William Henry
    Name: Name, dtype: object

Every name string above contains a title. Extracting each passenger's title gives us more useful information to analyze.

    def getTitle(name):
        s1 = name.split(',')[1]   # the part after the surname
        s2 = s1.split('.')[0]     # the part before the first period
        return s2.strip()         # strip leading and trailing whitespace

    full['Title'] = full['Name'].map(getTitle)
    full['Title'].value_counts()

Output:

    Mr              757
    Miss            260
    Mrs             197
    Master           61
    Rev               8
    Dr                8
    Col               4
    Ms                2
    Major             2
    Mlle              2
    Capt              1
    Don               1
    Mme               1
    Jonkheer          1
    Dona              1
    Sir               1
    the Countess      1
    Lady              1
    Name: Title, dtype: int64
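The same extraction can be written without a helper function, using a regular expression on the whole column. This is an alternative sketch, not the method used above. The first two names come from the output shown earlier; the third is an illustrative multi-word title.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the text between the first comma and the following period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())  # ['Mr', 'Miss', 'the Countess']
```

A name without a comma or period yields NaN here rather than raising an IndexError, which can make malformed rows easier to spot.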

Map the titles above to the following categories:

Officer: government official; Royalty: royalty or nobility; Mr: adult man; Mrs: married woman; Miss: young unmarried woman; Master: boy (a title used for young males)

    title_dict = {
        "Capt": "Officer", "Col": "Officer", "Major": "Officer",
        "Dr": "Officer", "Rev": "Officer",
        "Jonkheer": "Royalty", "Don": "Royalty", "Sir": "Royalty",
        "the Countess": "Royalty", "Dona": "Royalty", "Lady": "Royalty",
        "Mme": "Mrs", "Mlle": "Miss", "Ms": "Mrs",
        "Mr": "Mr", "Mrs": "Mrs", "Miss": "Miss", "Master": "Master",
    }
    full['Title'] = full['Title'].map(title_dict)
    full.head()

    full['Title'].value_counts()

Output:

    Mr         757
    Miss       262
    Mrs        200
    Master      61
    Officer     23
    Royalty      6
    Name: Title, dtype: int64
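The six counts above add up to 1309, so every title was covered by the dictionary. That matters because map() silently returns NaN for keys it does not find. A minimal sketch of how one might check for unmapped values; the partial dictionary and the title "Baron" are made up for illustration:

```python
import pandas as pd

# "Baron" is a made-up title that the dictionary below does not cover
titles = pd.Series(["Mr", "Mlle", "Capt", "Baron"])
partial_dict = {"Mr": "Mr", "Mlle": "Miss", "Capt": "Officer"}

mapped = titles.map(partial_dict)
# Titles absent from the dictionary come back as NaN
unmapped = titles[mapped.isnull()].unique().tolist()
print(unmapped)  # ['Baron']
```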

One-hot encode the title column.

    TitleDf = pd.get_dummies(full['Title'])      # one-hot encoding
    full = pd.concat([full, TitleDf], axis=1)    # add the features to the data set
    full.head()

    full = full.drop(['Name', 'Title'], axis=1)  # drop columns that are no longer needed
    full.head()

Cabin (cabin number):

The category of a cabin number is its first letter, so the first letter is extracted as the feature.

    full['Cabin'] = full['Cabin'].map(lambda x: x[0])
    full['Cabin'].value_counts()

Output:

    U    1014
    C      94
    B      65
    D      46
    E      41
    A      22
    F      21
    G       5
    T       1
    Name: Cabin, dtype: int64

    CabinDf = pd.get_dummies(full['Cabin'], prefix='Cabin')
    full = pd.concat([full, CabinDf], axis=1)
    full = full.drop('Cabin', axis=1)
    full.head()

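Family size and family category:

Family size = number of same-generation relatives (SibSp) + number of different-generation relatives (Parch) + the passenger (the passenger is also a family member, hence the +1)

Single (family_singel in the code): family size = 1

Small family (family_small): 2 <= family size <= 4

Large family (family_large): family size >= 5

    full['familysize'] = full['Parch'] + full['SibSp'] + 1
    full['family_singel'] = np.where(full['familysize'] == 1, 1, 0)
    full['family_small'] = np.where((full['familysize'] >= 2) & (full['familysize'] <= 4), 1, 0)
    full['family_large'] = np.where(full['familysize'] >= 5, 1, 0)
    full.head()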
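The three indicator columns can also be derived in one step by binning the family size. This is an alternative sketch on made-up sizes; the label names single/small/large are assumptions and produce column names that differ from the ones used in this article.

```python
import pandas as pd

# Hypothetical family sizes
familysize = pd.Series([1, 2, 4, 5, 11], name="familysize")

# Right-closed bins: (0, 1] -> single, (1, 4] -> small, (4, inf) -> large
family_type = pd.cut(
    familysize,
    bins=[0, 1, 4, float("inf")],
    labels=["single", "small", "large"],
)
family_dummies = pd.get_dummies(family_type, prefix="family")

print(family_type.tolist())  # ['single', 'small', 'small', 'large', 'large']
print(list(family_dummies.columns))  # ['family_single', 'family_small', 'family_large']
```

Keeping the bin edges in one list makes it harder for the three conditions to overlap or leave a gap.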

    full.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1309 entries, 0 to 1308
    Data columns (total 33 columns):
    Age              1309 non-null float64
    Fare             1309 non-null float64
    Parch            1309 non-null int64
    PassengerId      1309 non-null int64
    Sex              1309 non-null int64
    SibSp            1309 non-null int64
    Survived         891 non-null float64
    Ticket           1309 non-null object
    Embarked_C       1309 non-null uint8
    Embarked_Q       1309 non-null uint8
    Embarked_S       1309 non-null uint8
    Pclass_1         1309 non-null uint8
    Pclass_2         1309 non-null uint8
    Pclass_3         1309 non-null uint8
    Master           1309 non-null uint8
    Miss             1309 non-null uint8
    Mr               1309 non-null uint8
    Mrs              1309 non-null uint8
    Officer          1309 non-null uint8
    Royalty          1309 non-null uint8
    Cabin_A          1309 non-null uint8
    Cabin_B          1309 non-null uint8
    Cabin_C          1309 non-null uint8
    Cabin_D          1309 non-null uint8
    Cabin_E          1309 non-null uint8
    Cabin_F          1309 non-null uint8
    Cabin_G          1309 non-null uint8
    Cabin_T          1309 non-null uint8
    Cabin_U          1309 non-null uint8
    familysize       1309 non-null int64
    family_singel    1309 non-null int32
    family_small     1309 non-null int32
    family_large     1309 non-null int32
    dtypes: float64(3), int32(3), int64(5), object(1), uint8(21)
    memory usage: 134.3+ KB

    full.loc[889:898, :]

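Feature Selection and Dimensionality Reduction

The feature extraction above yields 32 candidate features (the 33 columns minus the Survived label). The correlation coefficient method is used below to select among them.

    # Compute the correlation matrix
    # Ticket is a string column, so it is excluded; recent pandas versions
    # raise an error when corr() meets non-numeric columns
    corr_df = full.drop('Ticket', axis=1).corr()
    corr_df

    # Extract each feature's correlation with Survived, sorted in descending order
    corr_df['Survived'].sort_values(ascending=False)

Output:

    Survived         1.000000
    Mrs              0.344935
    Miss             0.332795
    Pclass_1         0.285904
    family_small     0.279855
    Fare             0.246552
    Cabin_B          0.175095
    Embarked_C       0.168240
    Cabin_D          0.150716
    Cabin_E          0.145321
    Cabin_C          0.114652
    Pclass_2         0.093349
    Master           0.085221
    Parch            0.081629
    Cabin_F          0.057935
    Royalty          0.033391
    Cabin_A          0.022287
    familysize       0.016639
    Cabin_G          0.016040
    Embarked_Q       0.003650
    PassengerId     -0.005007
    Cabin_T         -0.026456
    Officer         -0.031316
    SibSp           -0.035322
    Age             -0.070323
    family_large    -0.125147
    Embarked_S      -0.149683
    family_singel   -0.203367
    Cabin_U         -0.316912
    Pclass_3        -0.322308
    Sex             -0.543351
    Mr              -0.549199
    Name: Survived, dtype: float64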
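A plain descending sort puts the strongest negative correlations (Sex, Mr) at the bottom even though they are the most informative. Ranking by absolute value brings them to the top. A minimal sketch on a made-up frame; the numbers are illustrative only:

```python
import pandas as pd

# Hypothetical toy data, not the Titanic set
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex":      [1, 0, 0, 1, 0, 1],   # perfectly anti-correlated with Survived
    "Fare":     [7.0, 70.0, 30.0, 8.0, 15.0, 9.0],
    "Id":       [1, 2, 3, 4, 5, 6],
})

corr_with_target = toy.corr()["Survived"].drop("Survived")
# Order by magnitude so strong negative correlations rank high too
ranked = corr_with_target.reindex(
    corr_with_target.abs().sort_values(ascending=False).index
)
print(ranked)
```

The sign is kept in the result, so the direction of each relationship is still visible after reordering.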

Based on the magnitude of each feature's correlation with Survived, the following features are selected for modeling: title (the TitleDf data set built earlier), ticket class (PclassDf), fare (Fare), cabin (CabinDf), port of embarkation (EmbarkedDf), sex (Sex), and family size and category (familysize, family_small, family_large, family_singel).

    full_x = pd.concat([TitleDf, PclassDf, CabinDf, EmbarkedDf,
                        full['Fare'], full['Sex'], full['familysize'],
                        full['family_small'], full['family_large'],
                        full['family_singel']], axis=1)
    full_x.head()

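Building the Model

Creating the training and test data sets

As seen earlier, train.csv contains the Survived label, so it is used for model training and must itself be split into a training set and a test set. test.csv has no Survived label and serves as the prediction set.

    # The first 891 rows are the original training data
    source_x = full_x.loc[0:890, :]          # features
    source_y = full.loc[0:890, 'Survived']   # labels
    # The last 418 rows are the prediction data
    pred_x = full_x.loc[891:, :]
    source_x.shape
    source_y.shape
    pred_x.shape

Output:

    (891, 27)
    (891,)
    (418, 27)

    # Split into training and test sets for modeling: 80% training, 20% test
    # (sklearn.cross_validation was removed in scikit-learn 0.20;
    # train_test_split now lives in sklearn.model_selection)
    from sklearn.model_selection import train_test_split
    train_x, test_x, train_y, test_y = train_test_split(source_x, source_y, train_size=0.8)
    print('Training features: {0}, training labels: {1}'.format(train_x.shape, train_y.shape))
    print('Test features: {0}, test labels: {1}'.format(test_x.shape, test_y.shape))

Output:

    Training features: (712, 27), training labels: (712,)
    Test features: (179, 27), test labels: (179,)

    # Standardize train_x and test_x
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    train_x_std = sc.fit_transform(train_x)
    test_x_std = sc.transform(test_x)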
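The split above is random, so the rows in each part (and the score later on) change from run to run. A minimal sketch of a repeatable, class-balanced split on synthetic data; the seed values and the arrays X and y are assumptions made for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic features
y = np.array([0] * 60 + [1] * 40)                  # synthetic labels

# random_state makes the split repeatable; stratify keeps the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.8, random_state=42, stratify=y
)

scaler = StandardScaler()
X_tr_std = scaler.fit_transform(X_tr)  # learn mean/std from the training rows only
X_te_std = scaler.transform(X_te)      # reuse those statistics on the held-out rows

print(X_tr.shape, X_te.shape)  # (80, 3) (20, 3)
```

Fitting the scaler on the training rows only keeps information about the held-out rows from leaking into the model.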

Choosing an algorithm and training the model

Logistic regression is used here.

    # Step 1: choose an algorithm and import the corresponding package
    from sklearn.linear_model import LogisticRegression
    # Step 2: create the model
    model = LogisticRegression()
    # Step 3: train the model
    model.fit(train_x_std, train_y)

Output:

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
              intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
              penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
              verbose=0, warm_start=False)

Evaluating the model

    # Accuracy of the model on the test set
    model.score(test_x_std, test_y)

Output:

    0.8212290502793296
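A single score from one random split can be optimistic or pessimistic by chance. Cross-validation averages over several splits. The sketch below runs on synthetic data, and the pipeline and the choice of 5 folds are assumptions rather than part of the original workflow:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
# Synthetic labels that depend on the first feature plus noise
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# The pipeline refits the scaler inside every fold, so no fold is scaled
# with statistics computed from its own held-out rows
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean(), scores.std())
```

The spread of the fold scores gives a rough sense of how much a single accuracy figure can move.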

Applying the Model

    # Use the trained model to predict survival for pred_x
    # Reuse the scaler fitted on the training data instead of refitting it
    pred_x_std = sc.transform(pred_x)
    pred_y = model.predict(pred_x_std)
    pred_y

Output (from the original run; the split is random, so results vary between runs):

    array([0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0., 0., 1., 1., 0., 1., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 1., 1., 0., 0., 1., 1., 0., 1., 1., 1., 0., 0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 0., 0., 1., 1., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 1., 1., 0., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 1., 1., 0., 1., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 1., 1., 0., 1., 0., 1., 1., 0., 0., 0., 1., 1., 1., 0., 0., 1., 0., 1., 1., 0., 1., 0., 0., 1., 1., 0., 0., 1., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 1., 0., 1., 0., 0., 1., 0., 1., 0., 1., 1., 0., 0., 1., 1., 1., 1., 1., 0., 1., 0., 0., 1.])

    pred_df = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': pred_y})
    pred_df.head()
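Scaling the prediction rows with the statistics learned from the training data keeps them on the scale the model was trained on; refitting the scaler on the new rows shifts them. A small sketch with made-up numbers shows the difference:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature data
train_part = np.array([[0.0], [10.0]])   # mean 5, standard deviation 5
new_rows = np.array([[10.0], [20.0]])

fitted = StandardScaler().fit(train_part)
reused = fitted.transform(new_rows)               # scaled with training statistics
refit = StandardScaler().fit_transform(new_rows)  # scaled with its own statistics

print(reused.ravel().tolist())  # [1.0, 3.0]
print(refit.ravel().tolist())   # [-1.0, 1.0]
```

The same raw value 10.0 maps to 1.0 in one case and to -1.0 in the other, so a model trained on the first scale would misread the second.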

    pred_df.shape

Output:

    (418, 2)

    pred_df['Survived'] = pred_df['Survived'].astype('int')
    pred_df.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 418 entries, 0 to 417
    Data columns (total 2 columns):
    PassengerId    418 non-null int64
    Survived       418 non-null int32
    dtypes: int32(1), int64(1)
    memory usage: 5.0 KB

    # Save the result
    pred_df.to_csv(r'E:\python\data\titanic\predict.csv', index=False)

When reposting, please cite the original URL: https://www.6miu.com/read-3100126.html
