In [10]: df2 Out[10]: A B C D E F 01.02013-01-02 1.03 test foo 11.02013-01-02 1.03 train foo 21.02013-01-02 1.03 test foo 31.02013-01-02 1.03 train foo
In [13]: df.head() Out[13]: A B C D 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.1732150.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.4949291.071804 2013-01-04 0.721555 -0.706771 -1.0395750.271860 2013-01-05 -0.4249720.5670200.276232 -1.087401
In [14]: df.tail(3) Out[14]: A B C D 2013-01-04 0.721555 -0.706771 -1.0395750.271860 2013-01-05 -0.4249720.5670200.276232 -1.087401 2013-01-06 -0.6736900.113648 -1.4784270.524988
In [24]: df[0:3] Out[24]: A B C D 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.1732150.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.4949291.071804
In [25]: df['20130102':'20130104'] Out[25]: A B C D 2013-01-02 1.212112 -0.1732150.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.4949291.071804 2013-01-04 0.721555 -0.706771 -1.0395750.271860
In [32]: df.iloc[3] Out[32]: A 0.721555 B -0.706771 C -1.039575 D 0.271860 Name: 2013-01-04 00:00:00, dtype: float64
类似 NumPy / Python,用整数切片:
1 2 3 4 5
In [33]: df.iloc[3:5, 0:2] Out[33]: A B 2013-01-04 0.721555 -0.706771 2013-01-05 -0.4249720.567020
类似 NumPy / Python,用整数列表按位置切片:
1 2 3 4 5 6
In [34]: df.iloc[[1, 2, 4], [0, 2]] Out[34]: A C 2013-01-02 1.2121120.119209 2013-01-03 -0.861849 -0.494929 2013-01-05 -0.4249720.276232
显式整行切片:
1 2 3 4 5
In [35]: df.iloc[1:3, :] Out[35]: A B C D 2013-01-02 1.212112 -0.1732150.119209 -1.044236 2013-01-03 -0.861849 -2.104569 -0.4949291.071804
显式整列切片:
1 2 3 4 5 6 7 8 9
In [36]: df.iloc[:, 1:3] Out[36]: B C 2013-01-01 -0.282863 -1.509059 2013-01-02 -0.1732150.119209 2013-01-03 -2.104569 -0.494929 2013-01-04 -0.706771 -1.039575 2013-01-05 0.5670200.276232 2013-01-06 0.113648 -1.478427
显式提取值:
1 2
In [37]: df.iloc[1, 1] Out[37]: -0.17321464905330858
快速访问标量,与上述方法等效:
1 2
In [38]: df.iat[1, 1] Out[38]: -0.17321464905330858
布尔索引
用单列的值选择数据:
1 2 3 4 5 6
In [39]: df[df.A > 0] Out[39]: A B C D 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 2013-01-02 1.212112 -0.1732150.119209 -1.044236 2013-01-04 0.721555 -0.706771 -1.0395750.271860
选择 DataFrame 里满足条件的值:
1 2 3 4 5 6 7 8 9
In [40]: df[df > 0] Out[40]: A B C D 2013-01-01 0.469112 NaN NaN NaN 2013-01-02 1.212112 NaN 0.119209 NaN 2013-01-03 NaN NaN NaN 1.071804 2013-01-04 0.721555 NaN NaN 0.271860 2013-01-05 NaN 0.5670200.276232 NaN 2013-01-06 NaN 0.113648 NaN 0.524988
In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
In [43]: df2 Out[43]: A B C D E 2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 one 2013-01-02 1.212112 -0.1732150.119209 -1.044236 one 2013-01-03 -0.861849 -2.104569 -0.4949291.071804 two 2013-01-04 0.721555 -0.706771 -1.0395750.271860 three 2013-01-05 -0.4249720.5670200.276232 -1.087401 four 2013-01-06 -0.6736900.113648 -1.4784270.524988 three
In [44]: df2[df2['E'].isin(['two', 'four'])] Out[44]: A B C D E 2013-01-03 -0.861849 -2.104569 -0.4949291.071804 two 2013-01-05 -0.4249720.5670200.276232 -1.087401 four
In [51]: df Out[51]: A B C D F 2013-01-01 0.0000000.000000 -1.5090595 NaN 2013-01-02 1.212112 -0.1732150.11920951.0 2013-01-03 -0.861849 -2.104569 -0.49492952.0 2013-01-04 0.721555 -0.706771 -1.03957553.0 2013-01-05 -0.4249720.5670200.27623254.0 2013-01-06 -0.6736900.113648 -1.47842755.0
用 where 条件赋值:
1 2 3 4 5 6 7 8 9 10 11 12 13
In [52]: df2 = df.copy()
In [53]: df2[df2 > 0] = -df2
In [54]: df2 Out[54]: A B C D F 2013-01-01 0.0000000.000000 -1.509059 -5 NaN 2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1.0 2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2.0 2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3.0 2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4.0 2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5.0
In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
In [56]: df1.loc[dates[0]:dates[1], 'E'] = 1
In [57]: df1 Out[57]: A B C D F E 2013-01-01 0.0000000.000000 -1.5090595 NaN 1.0 2013-01-02 1.212112 -0.1732150.11920951.01.0 2013-01-03 -0.861849 -2.104569 -0.49492952.0 NaN 2013-01-04 0.721555 -0.706771 -1.03957553.0 NaN
删除所有含缺失值的行:
1 2 3 4
In [58]: df1.dropna(how='any') Out[58]: A B C D F E 2013-01-02 1.212112 -0.1732150.11920951.01.0
填充缺失值:
1 2 3 4 5 6 7
In [59]: df1.fillna(value=5) Out[59]: A B C D F E 2013-01-01 0.0000000.000000 -1.50905955.01.0 2013-01-02 1.212112 -0.1732150.11920951.01.0 2013-01-03 -0.861849 -2.104569 -0.49492952.05.0 2013-01-04 0.721555 -0.706771 -1.03957553.05.0
提取 nan 值的布尔掩码:
1 2 3 4 5 6 7
In [60]: pd.isna(df1) Out[60]: A B C D F E 2013-01-01 FalseFalseFalseFalseTrueFalse 2013-01-02 FalseFalseFalseFalseFalseFalse 2013-01-03 FalseFalseFalseFalseFalseTrue 2013-01-04 FalseFalseFalseFalseFalseTrue
In [63]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
In [64]: s Out[64]: 2013-01-01 NaN 2013-01-02 NaN 2013-01-03 1.0 2013-01-04 3.0 2013-01-05 5.0 2013-01-06 NaN Freq: D, dtype: float64
In [65]: df.sub(s, axis='index') Out[65]: A B C D F 2013-01-01 NaN NaN NaN NaN NaN 2013-01-02 NaN NaN NaN NaN NaN 2013-01-03 -1.861849 -3.104569 -1.4949294.01.0 2013-01-04 -2.278445 -3.706771 -4.0395752.00.0 2013-01-05 -5.424972 -4.432980 -4.7237680.0 -1.0 2013-01-06 NaN NaN NaN NaN NaN
Apply 函数
Apply 函数处理数据:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
In [66]: df.apply(np.cumsum) Out[66]: A B C D F 2013-01-01 0.0000000.000000 -1.5090595 NaN 2013-01-02 1.212112 -0.173215 -1.389850101.0 2013-01-03 0.350263 -2.277784 -1.884779153.0 2013-01-04 1.071818 -2.984555 -2.924354206.0 2013-01-05 0.646846 -2.417535 -2.6481222510.0 2013-01-06 -0.026844 -2.303886 -4.1265493015.0
In [67]: df.apply(lambda x: x.max() - x.min()) Out[67]: A 2.073961 B 2.671590 C 1.785291 D 0.000000 F 4.000000 dtype: float64
In [92]: df Out[92]: A B C D 0 foo one -1.202872 -0.055224 1 bar one -1.8144702.395985 2 foo two 1.0186011.552825 3 bar three -0.5954470.166599 4 foo two 1.3954330.047609 5 bar two -0.392670 -0.136473 6 foo one 0.007207 -0.561757 7 foo three 1.928123 -1.623033
In [93]: df.groupby('A').sum() Out[93]: C D A bar -2.8025882.42611 foo 3.146492 -0.63958
多列分组后,生成多层索引,也可以应用 sum 函数:
1 2 3 4 5 6 7 8 9 10
In [94]: df.groupby(['A', 'B']).sum() Out[94]: C D A B bar one -1.8144702.395985 three -0.5954470.166599 two -0.392670 -0.136473 foo one -1.195665 -0.616981 three 1.928123 -1.623033 two 2.4140341.600434
In [101]: stacked Out[101]: first second B -0.542108 two A 0.282696 B -0.087302 baz one A -1.575170 B 1.771208 two A 0.816482 B 1.100230 dtype: float64
压缩后的 DataFrame 或 Series 具有多层索引, stack() 的逆操作是 unstack(),默认为拆叠最后一层:
In [102]: stacked.unstack() Out[102]: A B first second bar one 0.029399 -0.542108 two 0.282696 -0.087302 baz one -1.5751701.771208 two 0.8164821.100230
In [103]: stacked.unstack(1) Out[103]: second one two first bar A 0.0293990.282696 B -0.542108 -0.087302 baz A -1.5751700.816482 B 1.7712081.100230
In [104]: stacked.unstack(0) Out[104]: first bar baz second one A 0.029399 -1.575170 B -0.5421081.771208 two A 0.2826960.816482 B -0.0873021.100230
In [106]: df Out[106]: A B C D E 0 one A foo 1.418757 -0.179666 1 one B foo -1.8790241.291836 2 two C foo 0.536826 -0.009614 3 three A bar 1.0061600.392149 4 one B bar -0.0297160.264599 5 one C bar -1.146178 -0.057409 6 two A foo 0.100900 -1.425638 7 three B foo -1.0350181.024098 8 one C foo 0.314665 -0.106062 9 one A bar -0.7737231.824375 10 two B bar -1.1706530.595974 11 three C bar 0.6487401.167115
用上述数据生成数据透视表非常简单:
1 2 3 4 5 6 7 8 9 10 11 12 13
In [107]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C']) Out[107]: C bar foo A B one A -0.7737231.418757 B -0.029716 -1.879024 C -1.1461780.314665 three A 1.006160 NaN B NaN -1.035018 C 0.648740 NaN two A NaN 0.100900 B -1.170653 NaN C NaN 0.536826
In [132]: df["grade"] Out[132]: 0 very good 1 good 2 good 3 very good 4 very good 5 very bad Name: grade, dtype: category Categories (5, object): [very bad, bad, medium, good, very good]
注意,这里是按生成类别时的顺序排序,不是按词汇排序:
1 2 3 4 5 6 7 8 9
In [133]: df.sort_values(by="grade") Out[133]: id raw_grade grade 56 e very bad 12 b good 23 b good 01 a very good 34 a very good 45 a very good
按类列分组(groupby)时,即便某类别为空,也会显示:
1 2 3 4 5 6 7 8 9
In [134]: df.groupby("grade").size() Out[134]: grade very bad 1 bad 0 medium 0 good 2 very good 3 dtype: int64
>>> if pd.Series([False, True, False]): ... print("I was true") Traceback ... ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().