In [1]:
import pandas as pd
import numpy as np
df=pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006],
"date":pd.date_range('20130102', periods=6),
"city":['Beijing ', 'SH', ' guangzhou ', 'Shenzhen', 'shanghai', 'BEIJING '],
"age":[23,44,54,32,34,32],
"category":['100-A','100-B','110-A','110-C','210-A','130-F'],
"price":[1200,np.nan,2133,5433,np.nan,4432]},columns =['id','date','city','category','age','price'])
df1=pd.DataFrame({"id":[1001,1002,1003,1004,1005,1006,1007,1008],
"gender":['male','female','male','female','male','female','male','female'],
"pay":['Y','N','Y','Y','N','Y','N','Y',],
"m-point":[10,12,20,40,40,40,30,20]})
df_inner=pd.merge(left=df,right=df1,how='inner',on='id')
# Data cleaning: convert city to lowercase and strip surrounding whitespace
df_inner['city'] = df_inner['city'].str.lower()
df_inner['city'] = df_inner['city'].map(str.strip)
# Data cleaning: set date as the index
df_inner.set_index(keys=['date'],inplace=True)
df_inner
Out[1]:
id city category age price gender pay m-point
date
2013-01-02 1001 beijing 100-A 23 1200.0 male Y 10
2013-01-03 1002 sh 100-B 44 NaN female N 12
2013-01-04 1003 guangzhou 110-A 54 2133.0 male Y 20
2013-01-05 1004 shenzhen 110-C 32 5433.0 female Y 40
2013-01-06 1005 shanghai 210-A 34 NaN male N 40
2013-01-07 1006 beijing 130-F 32 4432.0 female Y 40
In [2]:
# Random sampling: draw 4 rows
df_inner.sample(n=4)
Out[2]:
id city category age price gender pay m-point
date
2013-01-05 1004 shenzhen 110-C 32 5433.0 female Y 40
2013-01-06 1005 shanghai 210-A 34 NaN male N 40
2013-01-02 1001 beijing 100-A 23 1200.0 male Y 10
2013-01-03 1002 sh 100-B 44 NaN female N 12

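The weights parameter sets the sampling weight of each row. Changing the weights changes the sampling result: rows with higher weights are more likely to be selected. Here the weights for the 6 rows are set by hand, with the first 4 set to 0 and the last two set to 0.5 each, so only the last two rows can be drawn.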

In [7]:
weights = [0,0,0,0,0.5,0.5]
df_inner.sample(n=2,weights=weights) # replace controls sampling with replacement; the default, False, samples without replacement
Out[7]:
id city category age price gender pay m-point
date
2013-01-06 1005 shanghai 210-A 34 NaN male N 40
2013-01-07 1006 beijing 130-F 32 4432.0 female Y 40
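
The weighted sampling described above can be sketched on a small stand-in table. The `demo` frame, the weight values, and `random_state=0` are assumptions chosen for illustration and are not part of the original notebook. The sketch shows two behaviours of `DataFrame.sample`: weights that do not sum to 1 are normalized by pandas, and `replace=True` allows the same row to be drawn more than once.

```python
import pandas as pd

# Hypothetical stand-in for df_inner: six rows identified by id.
demo = pd.DataFrame({"id": [1001, 1002, 1003, 1004, 1005, 1006]})

# Weights need not sum to 1; pandas normalizes them internally.
# Rows with weight 0 can never be selected.
weights = [0, 0, 0, 0, 2, 6]

# Without replacement: with n=2 and only two non-zero weights,
# both of those rows are always returned (in some order).
picked = demo.sample(n=2, weights=weights, random_state=0)

# With replacement: the same row may appear several times, so n can
# exceed the number of rows that have a non-zero weight.
picked_repl = demo.sample(n=5, weights=weights, replace=True, random_state=0)

print(sorted(picked["id"]))
print(picked_repl["id"].tolist())
```

Fixing `random_state` makes the draw reproducible, which is convenient when a sampled subset has to be compared across runs.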
In [13]:
# Descriptive statistics with describe(); .T transposes rows and columns
df_inner.describe().round(2).T
Out[13]:
count mean std min 25% 50% 75% max
id 6.0 1003.5 1.87 1001.0 1002.25 1003.5 1004.75 1006.0
age 6.0 36.5 10.88 23.0 32.00 33.0 41.50 54.0
price 4.0 3299.5 1966.64 1200.0 1899.75 3282.5 4682.25 5433.0
m-point 6.0 27.0 14.63 10.0 14.00 30.0 40.00 40.0
In [14]:
# std computes the sample standard deviation (NaN values are skipped)
df_inner.price.std()
Out[14]:
1966.6385026231944
In [17]:
# cov computes the covariance between two columns
df_inner['price'].cov(df_inner['age'])
Out[17]:
-2255.833333333333
In [20]:
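# cov computes the covariance between all numeric columns
# (numeric_only=True is needed on pandas 2.x, where non-numeric columns such as city raise an error)
df_inner.cov(numeric_only=True).round(2)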
Out[20]:
id age price m-point
id 3.50 -0.70 3243.33 25.40
age -0.70 118.30 -2255.83 -31.00
price 3243.33 -2255.83 3867667.00 28771.67
m-point 25.40 -31.00 28771.67 214.00

Correlation analysis with the corr function

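The corr function computes the correlation coefficient between data series. It can be applied to two specific columns or to every numeric column in the table at once. The coefficient lies between -1 and 1: values near 1 indicate a positive correlation, values near -1 a negative correlation, and 0 no linear correlation.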

In [15]:
df_inner['price'].corr(df_inner['age'])
Out[15]:
-0.08689525836897034
In [16]:
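# Correlation analysis across all numeric columns
# (numeric_only=True is needed on pandas 2.x, where non-numeric columns such as city raise an error)
df_inner.corr(numeric_only=True)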
Out[16]:
id age price m-point
id 1.000000 -0.034401 0.792239 0.928096
age -0.034401 1.000000 -0.086895 -0.194833
price 0.792239 -0.086895 1.000000 0.975325
m-point 0.928096 -0.194833 0.975325 1.000000
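
The link between the covariance and correlation results above can be checked by hand: the Pearson coefficient is the covariance divided by the product of the two standard deviations. The sketch below rebuilds the price and age columns from the table at the top of this section. The variable names (`mask`, `manual_r`, `pandas_r`) are illustrative only. It assumes the default Pearson method and restricts both series to rows where price is present, since `Series.corr` and `Series.cov` drop incomplete pairs.

```python
import numpy as np
import pandas as pd

# Same price and age values as the six rows of df_inner.
price = pd.Series([1200, np.nan, 2133, 5433, np.nan, 4432])
age = pd.Series([23, 44, 54, 32, 34, 32])

# Keep only rows where both values are present, matching what
# Series.cov and Series.corr do internally.
mask = price.notna() & age.notna()
p, a = price[mask], age[mask]

# Pearson r = cov(x, y) / (std(x) * std(y)), all using ddof=1.
manual_r = p.cov(a) / (p.std() * a.std())
pandas_r = price.corr(age)

print(round(manual_r, 6), round(pandas_r, 6))
```

Both values should agree with the -0.086895 shown in Out[15]. Note that using the standard deviation of age over all six rows would give a different denominator, because two of those rows have no price.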