文章插图
gm_regression_df['bathrooms_text'] =gm_regression_df['bathrooms_text'].str.replace("private bath", "pb", case=False)gm_regression_df['bathrooms_text'] =gm_regression_df['bathrooms_text'].str.replace("private baths", "pbs", case=False)gm_regression_df['bathrooms_text'] =gm_regression_df['bathrooms_text'].str.replace("shared bath", "sb", case=False)gm_regression_df['bathrooms_text'] =gm_regression_df['bathrooms_text'].str.replace("shared baths", "sb", case=False)gm_regression_df['bathrooms_text'] =gm_regression_df['bathrooms_text'].str.replace("shared half-bath", "sb", case=False)gm_regression_df['bathrooms_text'] =gm_regression_df['bathrooms_text'].str.replace("private half-bath", "sb", case=False)gm_regression_df = split_bathroom(gm_regression_df, column='bathrooms_text', text='bath', new_column='bathrooms_new')gm_regression_df['shared_bath'] = gm_regression_df['shared_bath'].str.split(" ", expand=True)gm_regression_df['private_bath'] = gm_regression_df['private_bath'].str.split(" ", expand=True)gm_regression_df['bathrooms_new'] = gm_regression_df['bathrooms_new'].str.split(" ", expand=True)# 填充缺失值为0gm_regression_df = gm_regression_df.fillna(0)gm_regression_df['shared_bath'] = gm_regression_df['shared_bath'].replace(to_replace='Shared', value=https://www.huyubaike.com/biancheng/0.5)gm_regression_df['private_bath'] = gm_regression_df['private_bath'].replace(to_replace='Private', value=https://www.huyubaike.com/biancheng/0.5)gm_regression_df['bathrooms_new'] = gm_regression_df['bathrooms_new'].replace(to_replace='Half-bath', value=https://www.huyubaike.com/biancheng/0.5)# 转成数值型gm_regression_df['shared_bath'] = pd.to_numeric(gm_regression_df['shared_bath']).astype(int)gm_regression_df['private_bath'] = pd.to_numeric(gm_regression_df['private_bath']).astype(int)gm_regression_df['bathrooms_new'] =pd.to_numeric(gm_regression_df['bathrooms_new']).astype(int)# 查看处理后的字段gm_regression_df[['shared_bath', 'private_bath', 'bathrooms_new']].head()
文章插图
下面我们对类别型字段进行编码 , 根据字段含义的不同,我们使用「序号编码」和「独热向量编码」等方法来完成 。
# 序号编码def encoder(df):for column in df[['neighbourhood_group_cleansed', 'property_type']].columns:labels = df[column].astype('category').cat.categories.tolist()replace_map = {column : {k: v for k,v in zip(labels,list(range(1,len(labels)+1)))}}df.replace(replace_map, inplace=True)print(replace_map)return df gm_regression_df = encoder(gm_regression_df)
文章插图
我们对于
host_response_time
和room_type
字段 , 使用独热向量编码(哑变量变换)host_dummy = pd.get_dummies(gm_regression_df['host_response_time'], prefix='host_response')room_dummy = pd.get_dummies(gm_regression_df['room_type'], prefix='room_type')# 拼接编码后的字段gm_regression_df = pd.concat([gm_regression_df, host_dummy, room_dummy], axis=1)# 剔除原始字段gm_regression_df = gm_regression_df.drop(columns=['host_response_time', 'room_type'], axis=1)
我们再把之前处理过的df_amenities做一点处理,再拼接到数据特征里df_3 = pd.DataFrame(df_amenities.sum())features = df_3['amenities'][:150].to_list()amenities_updated = df_amenities.filter(items=(features))gm_regression_df = pd.concat([gm_regression_df, amenities_updated], axis=1)
查看一下最终数据的维度gm_regression_df.shape# (3584, 198)
我们最后得到了198个字段,为了避免特征之间的多重共线性,使用方差因子法(VIF)来选择机器学习模型的特征 。VIF 大于 10 的特征被删除 , 因为这些特征的方差可以由数据集中的其他特征表示和解释 。# 计算VIFvif_model = gm_regression_df.drop(['price'], axis=1)vif_df = pd.DataFrame()vif_df['feature'] = vif_model.columnsvif_df['VIF'] = [variance_inflation_factor(vif_model.values, i) for i in range(len(vif_model.columns))]# 选出小于10的特征vif_df_new = vif_df[vif_df['VIF']<=10]feature_list =vif_df_new['feature'].to_list()# 选出这些特征对应的数据model_df = gm_regression_df.filter(items=(feature_list))model_df.head()
文章插图
我们拼接上
price
目标标签字段,可以构建完整的数据集price_col = gm_regression_df['price']model_df = model_df.join(price_col)
机器学习算法我们在这里使用几个典型的回归算法,包括线性回归、RandomForestRegression、Lasso Regression 和 GradientBoostingRegression 。关于机器学习算法的应用方法,欢迎大家查阅ShowMeAI对应的教程与文章,快学快用 。推荐阅读
- 一篇文章带你了解NoSql数据库——Redis简单入门
- 一篇文章带你了解服务器操作系统——Linux简单入门
- 一步一图带你深入理解 Linux 虚拟内存管理
- 一篇文章带你了解热门版本控制系统——Git
- 一篇文章带你了解网页框架——Vue简单入门
- 我用canvas带你看一场流星雨
- 一篇文章带你掌握主流办公框架——SpringBoot
- 带你认识JDK8中超nice的Native Memory Tracking
- SpringBoot+Vue3 AgileBoot - 手把手一步一步带你Run起全栈项目
- 带你读AI论文丨ACGAN-动漫头像生成