In [11]: g[['v1', 'v2']].mean() Out[11]: v1 v2 by1 by2 1955.055.0 995.055.0 2957.077.0 99 NaN NaN big damp 3.033.0 blue dry 3.033.0 red red 4.044.0 wet 1.011.0
In [16]: baseball = pd.DataFrame( ....: {'team': ["team %d" % (x + 1) for x inrange(5)] * 5, ....: 'player': random.sample(list(string.ascii_lowercase), 25), ....: 'batting avg': np.random.uniform(.200, .400, 25)}) ....:
In [17]: baseball.pivot_table(values='batting avg', columns='team', aggfunc=np.max) Out[17]: team team 1 team 2 team 3 team 4 team 5 batting avg 0.3521340.2953270.3971910.3944570.396194
In [33]: pd.melt(cheese, id_vars=['first', 'last']) Out[33]: first last variable value 0 John Doe height 5.5 1 Mary Bo height 6.0 2 John Doe weight 130.0 3 Mary Bo weight 150.0
In [34]: cheese.set_index(['first', 'last']).stack() # alternative way Out[34]: first last John Doe height 5.5 weight 130.0 Mary Bo height 6.0 weight 150.0 dtype: float64
In [39]: df.pivot_table(values='Amount', index='Animal', columns='FeedType', ....: aggfunc='sum') ....: Out[39]: FeedType A B Animal Animal1 10.05.0 Animal2 2.013.0 Animal3 6.0 NaN
In [40]: df.groupby(['Animal', 'FeedType'])['Amount'].sum() Out[40]: Animal FeedType Animal1 A 10 B 5 Animal2 A 2 B 13 Animal3 A 6 Name: Amount, dtype: int64
In [3]: url = ('https://raw.github.com/pandas-dev' ...: '/pandas/master/pandas/tests/data/tips.csv') ...:
In [4]: tips = pd.read_csv(url)
In [5]: tips.head() Out[5]: total_bill tip sex smoker day time size 016.991.01 Female No Sun Dinner 2 110.341.66 Male No Sun Dinner 3 221.013.50 Male No Sun Dinner 3 323.683.31 Male No Sun Dinner 2 424.593.61 Female No Sun Dinner 4
SELECT
在SQL中,使用您要选择的以逗号分隔的列列表(或* 选择所有列)来完成选择:
1 2 3
SELECT total_bill, tip, smoker, time FROM tips LIMIT 5;
使用pandas,通过将列名列表传递给DataFrame来完成列选择:
1 2 3 4 5 6 7 8
In [6]: tips[['total_bill', 'tip', 'smoker', 'time']].head(5) Out[6]: total_bill tip smoker time 016.991.01 No Dinner 110.341.66 No Dinner 221.013.50 No Dinner 323.683.31 No Dinner 424.593.61 No Dinner
In [7]: tips[tips['time'] == 'Dinner'].head(5) Out[7]: total_bill tip sex smoker day time size 016.991.01 Female No Sun Dinner 2 110.341.66 Male No Sun Dinner 3 221.013.50 Male No Sun Dinner 3 323.683.31 Male No Sun Dinner 2 424.593.61 Female No Sun Dinner 4
上面的语句只是将一个 Series 的 True / False 对象传递给 DataFrame,返回所有带有True的行。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
In [8]: is_dinner = tips['time'] == 'Dinner'
In [9]: is_dinner.value_counts() Out[9]: True176 False68 Name: time, dtype: int64
In [10]: tips[is_dinner].head(5) Out[10]: total_bill tip sex smoker day time size 016.991.01 Female No Sun Dinner 2 110.341.66 Male No Sun Dinner 3 221.013.50 Male No Sun Dinner 3 323.683.31 Male No Sun Dinner 2 424.593.61 Female No Sun Dinner 4
-- tips of more than $5.00 at Dinner meals SELECT* FROM tips WHEREtime='Dinner'AND tip >5.00;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
# tips of more than $5.00 at Dinner meals In [11]: tips[(tips['time'] == 'Dinner') & (tips['tip'] > 5.00)] Out[11]: total_bill tip sex smoker day time size 2339.427.58 Male No Sat Dinner 4 4430.405.60 Male No Sun Dinner 4 4732.406.00 Male No Sun Dinner 4 5234.815.20 Female No Sun Dinner 4 5948.276.73 Male No Sat Dinner 4 11629.935.07 Male No Sun Dinner 4 15529.855.14 Female No Sun Dinner 5 17050.8110.00 Male Yes Sat Dinner 3 1727.255.15 Male Yes Sun Dinner 2 18123.335.65 Male Yes Sun Dinner 2 18323.176.50 Male Yes Sun Dinner 4 21125.895.16 Male Yes Sat Dinner 4 21248.339.00 Male No Sat Dinner 4 21428.176.50 Female Yes Sat Dinner 3 23929.035.92 Male No Sat Dinner 3
1 2 3 4
-- tips by parties of at least 5 diners OR bill total was more than $45 SELECT* FROM tips WHERE size >=5OR total_bill >45;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
# tips by parties of at least 5 diners OR bill total was more than $45 In [12]: tips[(tips['size'] >= 5) | (tips['total_bill'] > 45)] Out[12]: total_bill tip sex smoker day time size 5948.276.73 Male No Sat Dinner 4 12529.804.20 Female No Thur Lunch 6 14134.306.70 Male No Thur Lunch 6 14241.195.00 Male No Thur Lunch 5 14327.055.00 Female No Thur Lunch 6 15529.855.14 Female No Sun Dinner 5 15648.175.00 Male No Sun Dinner 6 17050.8110.00 Male Yes Sat Dinner 3 18245.353.50 Male Yes Sun Dinner 3 18520.695.00 Male No Sun Dinner 5 18730.462.00 Male Yes Sun Dinner 5 21248.339.00 Male No Sat Dinner 4 21628.153.00 Male Yes Sat Dinner 5
SELECT smoker, day, COUNT(*), AVG(tip) FROM tips GROUPBY smoker, day; /* smoker day No Fri 4 2.812500 Sat 45 3.102889 Sun 57 3.167895 Thur 45 2.673778 Yes Fri 15 2.714000 Sat 42 2.875476 Sun 19 3.516842 Thur 17 3.030000 */
1 2 3 4 5 6 7 8 9 10 11 12 13
In [21]: tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]}) Out[21]: tip size mean smoker day No Fri 4.02.812500 Sat 45.03.102889 Sun 57.03.167895 Thur 45.02.673778 Yes Fri 15.02.714000 Sat 42.02.875476 Sun 19.03.516842 Thur 17.03.030000
SELECT* FROM df1 INNERJOIN df2 ON df1.key = df2.key;
1 2 3 4 5 6 7
# merge performs an INNER JOIN by default In [24]: pd.merge(df1, df2, on='key') Out[24]: key value_x value_y 0 B -0.2828631.212112 1 D -1.135632 -0.173215 2 D -1.1356320.119209
In [26]: pd.merge(df1, indexed_df2, left_on='key', right_index=True) Out[26]: key value_x value_y 1 B -0.2828631.212112 3 D -1.135632 -0.173215 3 D -1.1356320.119209
LEFT OUTER JOIN
1 2 3 4 5
-- show all records from df1 SELECT* FROM df1 LEFTOUTERJOIN df2 ON df1.key = df2.key;
1 2 3 4 5 6 7 8 9
# show all records from df1 In [27]: pd.merge(df1, df2, on='key', how='left') Out[27]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.509059 NaN 3 D -1.135632 -0.173215 4 D -1.1356320.119209
RIGHT JOIN
1 2 3 4 5
-- show all records from df2 SELECT* FROM df1 RIGHTOUTERJOIN df2 ON df1.key = df2.key;
1 2 3 4 5 6 7 8
# show all records from df2 In [28]: pd.merge(df1, df2, on='key', how='right') Out[28]: key value_x value_y 0 B -0.2828631.212112 1 D -1.135632 -0.173215 2 D -1.1356320.119209 3 E NaN -1.044236
-- show all records from both tables SELECT* FROM df1 FULLOUTERJOIN df2 ON df1.key = df2.key;
1 2 3 4 5 6 7 8 9 10
# show all records from both frames In [29]: pd.merge(df1, df2, on='key', how='outer') Out[29]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.509059 NaN 3 D -1.135632 -0.173215 4 D -1.1356320.119209 5 E NaN -1.044236
SELECT city, rank FROM df1 UNIONALL SELECT city, rank FROM df2; /* city rank Chicago 1 San Francisco 2 New York City 3 Chicago 1 Boston 4 Los Angeles 5 */
1 2 3 4 5 6 7 8 9
In [32]: pd.concat([df1, df2]) Out[32]: city rank 0 Chicago 1 1 San Francisco 2 2 New York City 3 0 Chicago 1 1 Boston 4 2 Los Angeles 5
SQL的UNION类似于UNION ALL,但是UNION将删除重复的行。
1 2 3 4 5 6 7 8 9 10 11 12 13 14
SELECT city, rank FROM df1 UNION SELECT city, rank FROM df2; -- notice that there is only one Chicago record this time /* city rank Chicago 1 San Francisco 2 New York City 3 Boston 4 Los Angeles 5 */
In [33]: pd.concat([df1, df2]).drop_duplicates() Out[33]: city rank 0 Chicago 1 1 San Francisco 2 2 New York City 3 1 Boston 4 2 Los Angeles 5
Pandas等同于某些SQL分析和聚合函数
带有偏移量的前N行
1 2 3 4
-- MySQL SELECT*FROM tips ORDERBY tip DESC LIMIT 10OFFSET5;
1 2 3 4 5 6 7 8 9 10 11 12 13
In [34]: tips.nlargest(10 + 5, columns='tip').tail(10) Out[34]: total_bill tip sex smoker day time size 18323.176.50 Male Yes Sun Dinner 4 21428.176.50 Female Yes Sat Dinner 3 4732.406.00 Male No Sun Dinner 4 23929.035.92 Male No Sat Dinner 3 8824.715.85 Male No Thur Lunch 2 18123.335.65 Male Yes Sun Dinner 2 4430.405.60 Male No Sun Dinner 4 5234.815.20 Female No Sun Dinner 4 8534.835.17 Female No Thur Lunch 4 21125.895.16 Male Yes Sat Dinner 4
每组前N行
1 2 3 4 5 6 7 8 9
-- Oracle's ROW_NUMBER() analytic function SELECT*FROM ( SELECT t.*, ROW_NUMBER() OVER(PARTITIONBYdayORDERBY total_bill DESC) AS rn FROM tips t ) WHERE rn <3 ORDERBYday, rn;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
In [35]: (tips.assign(rn=tips.sort_values(['total_bill'], ascending=False) ....: .groupby(['day']) ....: .cumcount() + 1) ....: .query('rn < 3') ....: .sort_values(['day', 'rn'])) ....: Out[35]: total_bill tip sex smoker day time size rn 9540.174.73 Male Yes Fri Dinner 41 9028.973.00 Male Yes Fri Dinner 22 17050.8110.00 Male Yes Sat Dinner 31 21248.339.00 Male No Sat Dinner 42 15648.175.00 Male No Sun Dinner 61 18245.353.50 Male Yes Sun Dinner 32 19743.115.00 Female Yes Thur Lunch 41 14241.195.00 Male No Thur Lunch 52
同样使用 rank (method =’first’) 函数
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
In [36]: (tips.assign(rnk=tips.groupby(['day'])['total_bill'] ....: .rank(method='first', ascending=False)) ....: .query('rnk < 3') ....: .sort_values(['day', 'rnk'])) ....: Out[36]: total_bill tip sex smoker day time size rnk 9540.174.73 Male Yes Fri Dinner 41.0 9028.973.00 Male Yes Fri Dinner 22.0 17050.8110.00 Male Yes Sat Dinner 31.0 21248.339.00 Male No Sat Dinner 42.0 15648.175.00 Male No Sun Dinner 61.0 18245.353.50 Male Yes Sun Dinner 32.0 19743.115.00 Female Yes Thur Lunch 41.0 14241.195.00 Male No Thur Lunch 52.0
1 2 3 4 5 6 7 8 9 10
-- Oracle's RANK() analytic function SELECT*FROM ( SELECT t.*, RANK() OVER(PARTITIONBY sex ORDERBY tip) AS rnk FROM tips t WHERE tip <2 ) WHERE rnk <3 ORDERBY sex, rnk;
In [37]: (tips[tips['tip'] < 2] ....: .assign(rnk_min=tips.groupby(['sex'])['tip'] ....: .rank(method='min')) ....: .query('rnk_min < 3') ....: .sort_values(['sex', 'rnk_min'])) ....: Out[37]: total_bill tip sex smoker day time size rnk_min 673.071.00 Female Yes Sat Dinner 11.0 925.751.00 Female Yes Fri Dinner 21.0 1117.251.00 Female No Sat Dinner 11.0 23612.601.00 Male Yes Sat Dinner 21.0 23732.831.17 Male Yes Sat Dinner 22.0
In [5]: url = ('https://raw.github.com/pandas-dev/' ...: 'pandas/master/pandas/tests/data/tips.csv') ...:
In [6]: tips = pd.read_csv(url)
In [7]: tips.head() Out[7]: total_bill tip sex smoker day time size 016.991.01 Female No Sun Dinner 2 110.341.66 Male No Sun Dinner 3 221.013.50 Male No Sun Dinner 3 323.683.31 Male No Sun Dinner 2 424.593.61 Female No Sun Dinner 4
In [8]: tips['total_bill'] = tips['total_bill'] - 2
In [9]: tips['new_bill'] = tips['total_bill'] / 2.0
In [10]: tips.head() Out[10]: total_bill tip sex smoker day time size new_bill 014.991.01 Female No Sun Dinner 27.495 18.341.66 Male No Sun Dinner 34.170 219.013.50 Male No Sun Dinner 39.505 321.683.31 Male No Sun Dinner 210.840 422.593.61 Female No Sun Dinner 411.295
过滤
SAS中的过滤是通过一个或多个列上的if或where语句完成的。
1 2 3 4 5 6 7 8 9 10 11
data tips; set tips; if total_bill > 10; run; data tips; set tips; where total_bill > 10; /* equivalent in this case - where happens before the DATA step begins and can also be used in PROC statements */ run;
In [11]: tips[tips['total_bill'] > 10].head() Out[11]: total_bill tip sex smoker day time size 014.991.01 Female No Sun Dinner 2 219.013.50 Male No Sun Dinner 3 321.683.31 Male No Sun Dinner 2 422.593.61 Female No Sun Dinner 4 523.294.71 Male No Sun Dinner 4
如果/那么逻辑
在SAS中,if / then逻辑可用于创建新列。
1 2 3 4 5 6 7
data tips; set tips; format bucket $4.;
if total_bill < 10 then bucket = 'low'; else bucket = 'high'; run;
Pandas 中的相同操作可以使用where来自的方法来完成numpy。
1 2 3 4 5 6 7 8 9 10
In [12]: tips['bucket'] = np.where(tips['total_bill'] < 10, 'low', 'high')
In [13]: tips.head() Out[13]: total_bill tip sex smoker day time size bucket 014.991.01 Female No Sun Dinner 2 high 18.341.66 Male No Sun Dinner 3 low 219.013.50 Male No Sun Dinner 3 high 321.683.31 Male No Sun Dinner 2 high 422.593.61 Female No Sun Dinner 4 high
日期功能
SAS提供了各种功能来对日期/日期时间列进行操作。
1 2 3 4 5 6 7 8 9 10 11 12
data tips; set tips; format date1 date2 date1_plusmonth mmddyy10.; date1 = mdy(1, 15, 2013); date2 = mdy(2, 15, 2015); date1_year = year(date1); date2_month = month(date2); * shift date to beginning of next interval; date1_next = intnx('MONTH', date1, 1); * count intervals between dates; months_between = intck('MONTH', date1, date2); run;
# keep In [21]: tips[['sex', 'total_bill', 'tip']].head() Out[21]: sex total_bill tip 0 Female 14.991.01 1 Male 8.341.66 2 Male 19.013.50 3 Male 21.683.31 4 Female 22.593.61
# drop In [22]: tips.drop('sex', axis=1).head() Out[22]: total_bill tip smoker day time size 014.991.01 No Sun Dinner 2 18.341.66 No Sun Dinner 3 219.013.50 No Sun Dinner 3 321.683.31 No Sun Dinner 2 422.593.61 No Sun Dinner 4
# rename In [23]: tips.rename(columns={'total_bill': 'total_bill_2'}).head() Out[23]: total_bill_2 tip sex smoker day time size 014.991.01 Female No Sun Dinner 2 18.341.66 Male No Sun Dinner 3 219.013.50 Male No Sun Dinner 3 321.683.31 Male No Sun Dinner 2 422.593.61 Female No Sun Dinner 4
In [24]: tips = tips.sort_values(['sex', 'total_bill'])
In [25]: tips.head() Out[25]: total_bill tip sex smoker day time size 671.071.00 Female Yes Sat Dinner 1 923.751.00 Female Yes Fri Dinner 2 1115.251.00 Female No Sat Dinner 1 1456.351.50 Female No Thur Lunch 2 1356.511.25 Female No Thur Lunch 2
data firstlast; input String $60.; string_up = UPCASE(string); string_low = LOWCASE(string); string_prop = PROPCASE(string); datalines2; John Smith; Jane Cook; ;;; run;
等效Python的功能upper,lower和title。
1 2 3 4 5 6 7 8 9 10 11 12 13
In [34]: firstlast = pd.DataFrame({'String': ['John Smith', 'Jane Cook']})
In [35]: firstlast['string_up'] = firstlast['String'].str.upper()
In [36]: firstlast['string_low'] = firstlast['String'].str.lower()
In [37]: firstlast['string_prop'] = firstlast['String'].str.title()
In [38]: firstlast Out[38]: String string_up string_low string_prop 0 John Smith JOHN SMITH john smith John Smith 1 Jane Cook JANE COOK jane cook Jane Cook
In [43]: inner_join = df1.merge(df2, on=['key'], how='inner')
In [44]: inner_join Out[44]: key value_x value_y 0 B -0.2828631.212112 1 D -1.135632 -0.173215 2 D -1.1356320.119209
In [45]: left_join = df1.merge(df2, on=['key'], how='left')
In [46]: left_join Out[46]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.509059 NaN 3 D -1.135632 -0.173215 4 D -1.1356320.119209
In [47]: right_join = df1.merge(df2, on=['key'], how='right')
In [48]: right_join Out[48]: key value_x value_y 0 B -0.2828631.212112 1 D -1.135632 -0.173215 2 D -1.1356320.119209 3 E NaN -1.044236
In [49]: outer_join = df1.merge(df2, on=['key'], how='outer')
In [50]: outer_join Out[50]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.509059 NaN 3 D -1.135632 -0.173215 4 D -1.1356320.119209 5 E NaN -1.044236
In [51]: outer_join Out[51]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.509059 NaN 3 D -1.135632 -0.173215 4 D -1.1356320.119209 5 E NaN -1.044236
In [52]: outer_join['value_x'] + outer_join['value_y'] Out[52]: 0 NaN 10.929249 2 NaN 3 -1.308847 4 -1.016424 5 NaN dtype: float64
In [53]: outer_join['value_x'].sum() Out[53]: -3.5940742896293765
一个区别是丢失的数据无法与其哨兵值进行比较。例如,在SAS中,您可以执行此操作以过滤缺失值。
1 2 3 4 5 6 7 8 9
data outer_join_nulls; set outer_join; if value_x = .; run; data outer_join_no_nulls; set outer_join; if value_x ^= .; run;
这在 Pandas 中不起作用。相反,应使用pd.isna或pd.notna函数进行比较。
1 2 3 4 5 6 7 8 9 10 11 12 13
In [54]: outer_join[pd.isna(outer_join['value_x'])] Out[54]: key value_x value_y 5 E NaN -1.044236
In [55]: outer_join[pd.notna(outer_join['value_x'])] Out[55]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.509059 NaN 3 D -1.135632 -0.173215 4 D -1.1356320.119209
In [56]: outer_join.dropna() Out[56]: key value_x value_y 1 B -0.2828631.212112 3 D -1.135632 -0.173215 4 D -1.1356320.119209
In [57]: outer_join.fillna(method='ffill') Out[57]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.5090591.212112 3 D -1.135632 -0.173215 4 D -1.1356320.119209 5 E -1.135632 -1.044236
proc summary data=tips missing nway; class smoker; var total_bill; outputout=smoker_means mean(total_bill)=group_bill; run; proc sort data=tips; by smoker; run; data tips; merge tips(in=a) smoker_means(in=b); by smoker; adj_total_bill = total_bill - group_bill; if a and b; run;
In [61]: gb = tips.groupby('smoker')['total_bill']
In [62]: tips['adj_total_bill'] = tips['total_bill'] - gb.transform('mean')
In [63]: tips.head() Out[63]: total_bill tip sex smoker day time size adj_total_bill 671.071.00 Female Yes Sat Dinner 1 -17.686344 923.751.00 Female Yes Fri Dinner 2 -15.006344 1115.251.00 Female No Sat Dinner 1 -11.938278 1456.351.50 Female No Thur Lunch 2 -10.838278 1356.511.25 Female No Thur Lunch 2 -10.678278
proc sort data=tips; by sex smoker; run; data tips_first; set tips; by sex smoker; if FIRST.sex or FIRST.smoker thenoutput; run;
在 Pandas 中,这将写成:
1 2 3 4 5 6 7 8
In [64]: tips.groupby(['sex', 'smoker']).first() Out[64]: total_bill tip day time size adj_total_bill sex smoker Female No 5.251.00 Sat Dinner 1 -11.938278 Yes 1.071.00 Sat Dinner 1 -17.686344 Male No 5.512.00 Thur Lunch 2 -11.678278 Yes 5.255.15 Sun Dinner 2 -13.506344
In [5]: url = ('https://raw.github.com/pandas-dev' ...: '/pandas/master/pandas/tests/data/tips.csv') ...:
In [6]: tips = pd.read_csv(url)
In [7]: tips.head() Out[7]: total_bill tip sex smoker day time size 016.991.01 Female No Sun Dinner 2 110.341.66 Male No Sun Dinner 3 221.013.50 Male No Sun Dinner 3 323.683.31 Male No Sun Dinner 2 424.593.61 Female No Sun Dinner 4
In [8]: tips['total_bill'] = tips['total_bill'] - 2
In [9]: tips['new_bill'] = tips['total_bill'] / 2
In [10]: tips.head() Out[10]: total_bill tip sex smoker day time size new_bill 014.991.01 Female No Sun Dinner 27.495 18.341.66 Male No Sun Dinner 34.170 219.013.50 Male No Sun Dinner 39.505 321.683.31 Male No Sun Dinner 210.840 422.593.61 Female No Sun Dinner 411.295
In [12]: tips[tips['total_bill'] > 10].head() Out[12]: total_bill tip sex smoker day time size 014.991.01 Female No Sun Dinner 2 219.013.50 Male No Sun Dinner 3 321.683.31 Male No Sun Dinner 2 422.593.61 Female No Sun Dinner 4 523.294.71 Male No Sun Dinner 4
如果/那么逻辑
在Stata中,if子句也可用于创建新列。
1 2
generate bucket = "low" if total_bill < 10 replace bucket = "high" if total_bill >= 10
使用 numpy 的 where 方法可以在 pandas 中完成相同的操作。
1 2 3 4 5 6 7 8 9 10
In [13]: tips['bucket'] = np.where(tips['total_bill'] < 10, 'low', 'high')
In [14]: tips.head() Out[14]: total_bill tip sex smoker day time size bucket 014.991.01 Female No Sun Dinner 2 high 18.341.66 Male No Sun Dinner 3 low 219.013.50 Male No Sun Dinner 3 high 321.683.31 Male No Sun Dinner 2 high 422.593.61 Female No Sun Dinner 4 high
# keep In [22]: tips[['sex', 'total_bill', 'tip']].head() Out[22]: sex total_bill tip 0 Female 14.991.01 1 Male 8.341.66 2 Male 19.013.50 3 Male 21.683.31 4 Female 22.593.61
# drop In [23]: tips.drop('sex', axis=1).head() Out[23]: total_bill tip smoker day time size 014.991.01 No Sun Dinner 2 18.341.66 No Sun Dinner 3 219.013.50 No Sun Dinner 3 321.683.31 No Sun Dinner 2 422.593.61 No Sun Dinner 4
# rename In [24]: tips.rename(columns={'total_bill': 'total_bill_2'}).head() Out[24]: total_bill_2 tip sex smoker day time size 014.991.01 Female No Sun Dinner 2 18.341.66 Male No Sun Dinner 3 219.013.50 Male No Sun Dinner 3 321.683.31 Male No Sun Dinner 2 422.593.61 Female No Sun Dinner 4
In [25]: tips = tips.sort_values(['sex', 'total_bill'])
In [26]: tips.head() Out[26]: total_bill tip sex smoker day time size 671.071.00 Female Yes Sat Dinner 1 923.751.00 Female Yes Fri Dinner 2 1115.251.00 Female No Sat Dinner 1 1456.351.50 Female No Thur Lunch 2 1356.511.25 Female No Thur Lunch 2
字符串处理
查找字符串的长度
Stata 分别使用ASCII和Unicode字符串 strlen() 和 ustrlen() 函数确定字符串的长度。
In [44]: inner_join = df1.merge(df2, on=['key'], how='inner')
In [45]: inner_join Out[45]: key value_x value_y 0 B -0.2828631.212112 1 D -1.135632 -0.173215 2 D -1.1356320.119209
In [46]: left_join = df1.merge(df2, on=['key'], how='left')
In [47]: left_join Out[47]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.509059 NaN 3 D -1.135632 -0.173215 4 D -1.1356320.119209
In [48]: right_join = df1.merge(df2, on=['key'], how='right')
In [49]: right_join Out[49]: key value_x value_y 0 B -0.2828631.212112 1 D -1.135632 -0.173215 2 D -1.1356320.119209 3 E NaN -1.044236
In [50]: outer_join = df1.merge(df2, on=['key'], how='outer')
In [51]: outer_join Out[51]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.509059 NaN 3 D -1.135632 -0.173215 4 D -1.1356320.119209 5 E NaN -1.044236
In [52]: outer_join Out[52]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.509059 NaN 3 D -1.135632 -0.173215 4 D -1.1356320.119209 5 E NaN -1.044236
In [53]: outer_join['value_x'] + outer_join['value_y'] Out[53]: 0 NaN 10.929249 2 NaN 3 -1.308847 4 -1.016424 5 NaN dtype: float64
In [54]: outer_join['value_x'].sum() Out[54]: -3.5940742896293765
一个区别是丢失的数据无法与其哨兵值进行比较。例如,在 Stata 中,您可以执行此操作以过滤缺失值。
1 2 3 4
* Keep missing values list if value_x == . * Keep non-missing values list if value_x != .
这在 pandas 中不起作用。相反,应使用pd.isna()或pd.notna()函数进行比较。
1 2 3 4 5 6 7 8 9 10 11 12 13
In [55]: outer_join[pd.isna(outer_join['value_x'])] Out[55]: key value_x value_y 5 E NaN -1.044236
In [56]: outer_join[pd.notna(outer_join['value_x'])] Out[56]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.509059 NaN 3 D -1.135632 -0.173215 4 D -1.1356320.119209
# Drop rows with any missing value In [57]: outer_join.dropna() Out[57]: key value_x value_y 1 B -0.2828631.212112 3 D -1.135632 -0.173215 4 D -1.1356320.119209
# Fill forwards In [58]: outer_join.fillna(method='ffill') Out[58]: key value_x value_y 0 A 0.469112 NaN 1 B -0.2828631.212112 2 C -1.5090591.212112 3 D -1.135632 -0.173215 4 D -1.1356320.119209 5 E -1.135632 -1.044236
# Impute missing values with the mean In [59]: outer_join['value_x'].fillna(outer_join['value_x'].mean()) Out[59]: 00.469112 1 -0.282863 2 -1.509059 3 -1.135632 4 -1.135632 5 -0.718815 Name: value_x, dtype: float64
In [62]: gb = tips.groupby('smoker')['total_bill']
In [63]: tips['adj_total_bill'] = tips['total_bill'] - gb.transform('mean')
In [64]: tips.head() Out[64]: total_bill tip sex smoker day time size adj_total_bill 671.071.00 Female Yes Sat Dinner 1 -17.686344 923.751.00 Female Yes Fri Dinner 2 -15.006344 1115.251.00 Female No Sat Dinner 1 -11.938278 1456.351.50 Female No Thur Lunch 2 -10.838278 1356.511.25 Female No Thur Lunch 2 -10.678278
In [65]: tips.groupby(['sex', 'smoker']).first() Out[65]: total_bill tip day time size adj_total_bill sex smoker Female No 5.251.00 Sat Dinner 1 -11.938278 Yes 1.071.00 Sat Dinner 1 -17.686344 Male No 5.512.00 Thur Lunch 2 -11.678278 Yes 5.255.15 Sun Dinner 2 -13.506344
其他注意事项
磁盘与内存
pandas 和 Stata 都只在内存中运行。这意味着能够在 pandas 中加载的数据大小受机器内存的限制。如果需要进行核心处理,则有一种可能性是dask.dataframe 库,它为磁盘上的pandas功能提供了一个子集DataFrame。