In [1]: df = pd.DataFrame({'one': [1., 2., 3.]})

In [2]: df.two = [4, 5, 6]
UserWarning: Pandas doesn't allow Series to be assigned into nonexistent columns - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute_access

In [3]: df
Out[3]:
   one
0  1.0
1  2.0
2  3.0
In [33]: df[:3]
Out[33]:
                   A         B         C         D
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
In [34]: df[::-1]
Out[34]:
                   A         B         C         D
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885
2000-01-07  0.404705  0.577046 -1.715002 -1.039268
2000-01-06 -0.673690  0.113648 -1.478427  0.524988
2000-01-05 -0.424972  0.567020  0.276232 -1.087401
2000-01-04  0.721555 -0.706771 -1.039575  0.271860
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804
2000-01-02  1.212112 -0.173215  0.119209 -1.044236
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632
In [36]: dfl
Out[36]:
                   A         B         C         D
2013-01-01  1.075770 -0.109050  1.643563 -1.469388
2013-01-02  0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524  0.413738  0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
2013-01-05  0.895717  0.805244 -1.206412  2.565646
In [4]: dfl.loc[2:3]
TypeError: cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with these indexers [2] of <type 'int'>
String likes in slicing can be convertible to the type of the index and lead to natural slicing.
In [37]: dfl.loc['20130102':'20130104']
Out[37]:
                   A         B         C         D
2013-01-02  0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524  0.413738  0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
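The same behavior can be checked with a self-contained sketch. The frame below is a made-up stand-in for `dfl` above (the values are arbitrary), assuming only that the index is a `DatetimeIndex`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a DatetimeIndex, mirroring dfl above
dfl = pd.DataFrame(np.arange(10).reshape(5, 2), columns=['A', 'B'],
                   index=pd.date_range('20130101', periods=5))

# Date-like strings are converted to Timestamps and slice naturally;
# both endpoints are included in the result.
subset = dfl.loc['20130102':'20130104']
assert list(subset.index) == list(pd.date_range('20130102', periods=3))
```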
In [45]: df1
Out[45]:
          A         B         C         D
a  0.132003 -0.827317 -0.076467 -1.187678
b  1.130127 -1.436737 -1.413681  1.607920
c  1.024180  0.569605  0.875906 -2.211372
d  0.974466 -2.006747 -0.410001 -0.078638
e  0.545952 -1.219217 -1.226825  0.769804
f -1.281247 -0.727707 -0.121306 -0.097883
In [46]: df1.loc[['a', 'b', 'd'], :]
Out[46]:
          A         B         C         D
a  0.132003 -0.827317 -0.076467 -1.187678
b  1.130127 -1.436737 -1.413681  1.607920
d  0.974466 -2.006747 -0.410001 -0.078638
Accessing via label slices:
In [47]: df1.loc['d':, 'A':'C']
Out[47]:
          A         B         C
d  0.974466 -2.006747 -0.410001
e  0.545952 -1.219217 -1.226825
f -1.281247 -0.727707 -0.121306
For getting a cross section using a label (equivalent to df.xs('a')):
In [48]: df1.loc['a']
Out[48]:
A    0.132003
B   -0.827317
C   -0.076467
D   -1.187678
Name: a, dtype: float64
For getting values with a boolean array:
In [49]: df1.loc['a'] > 0
Out[49]:
A     True
B    False
C    False
D    False
Name: a, dtype: bool

In [50]: df1.loc[:, df1.loc['a'] > 0]
Out[50]:
          A
a  0.132003
b  1.130127
c  1.024180
d  0.974466
e  0.545952
f -1.281247
For getting a value explicitly (equivalent to the deprecated df.get_value('a', 'A')):
# this is also equivalent to ``df1.at['a','A']``
In [51]: df1.loc['a', 'A']
Out[51]: 0.13200317033032932
Slicing with labels
When using .loc with slices, if both the start and the stop labels are present in the index, then elements located between the two (including them) are returned:
In [52]: s = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])
In [53]: s.loc[3:5]
Out[53]:
3    b
2    c
5    d
dtype: object
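The inclusive-endpoint behavior can be verified directly; a minimal sketch contrasting .loc (label-based, inclusive stop) with .iloc (position-based, exclusive stop):

```python
import pandas as pd

s = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])

# .loc slices from label 3 up to and including label 5,
# following the order the labels appear in the index.
assert list(s.loc[3:5]) == ['b', 'c', 'd']

# .iloc slices by integer position and excludes the stop, as usual.
assert list(s.iloc[1:3]) == ['b', 'c']
```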
In [85]: df1
Out[85]:
          A         B         C         D
a -0.023688  2.410179  1.450520  0.206053
b -0.251905 -2.213588  1.063327  1.266143
c  0.299368 -0.863838  0.408204 -1.048089
d -0.025747 -0.988387  0.094055  1.262731
e  1.289997  0.082423 -0.055758  0.536580
f -0.489682  0.369374 -0.034571 -2.484478
In [86]: df1.loc[lambda df: df.A > 0, :]
Out[86]:
          A         B         C         D
c  0.299368 -0.863838  0.408204 -1.048089
e  1.289997  0.082423 -0.055758  0.536580
In [87]: df1.loc[:, lambda df: ['A', 'B']]
Out[87]:
          A         B
a -0.023688  2.410179
b -0.251905 -2.213588
c  0.299368 -0.863838
d -0.025747 -0.988387
e  1.289997  0.082423
f -0.489682  0.369374
In [88]: df1.iloc[:, lambda df: [0, 1]]
Out[88]:
          A         B
a -0.023688  2.410179
b -0.251905 -2.213588
c  0.299368 -0.863838
d -0.025747 -0.988387
e  1.289997  0.082423
f -0.489682  0.369374
In [89]: df1[lambda df: df.columns[0]]
Out[89]:
a   -0.023688
b   -0.251905
c    0.299368
d   -0.025747
e    1.289997
f   -0.489682
Name: A, dtype: float64
You can use callable indexing in Series.
In [90]: df1.A.loc[lambda s: s > 0]
Out[90]:
c    0.299368
e    1.289997
Name: A, dtype: float64
Using these methods / indexers, you can chain data selection operations without using a temporary variable.
In [91]: bb = pd.read_csv('data/baseball.csv', index_col='id')
In [92]: (bb.groupby(['year', 'team']).sum()
   ....:    .loc[lambda df: df.r > 100])
   ....:
Out[92]:
           stint    g    ab    r    h  X2b  X3b  hr    rbi    sb   cs   bb     so   ibb   hbp    sh    sf  gidp
year team
2007 CIN       6  379   745  101  203   35    2  36  125.0  10.0  1.0  105  127.0  14.0   1.0   1.0  15.0  18.0
     DET       5  301  1062  162  283   54    4  37  144.0  24.0  7.0   97  176.0   3.0  10.0   4.0   8.0  28.0
     HOU       4  311   926  109  218   47    6  14   77.0  10.0  4.0   60  212.0   3.0   9.0  16.0   6.0  17.0
     LAN      11  413  1021  153  293   61    3  36  154.0   7.0  5.0  114  141.0   8.0   9.0   3.0   8.0  29.0
     NYN      13  622  1854  240  509  101    3  61  243.0  22.0  4.0  174  310.0  24.0  23.0  18.0  15.0  48.0
     SFN       5  482  1305  198  337   67    6  40  171.0  26.0  7.0  235  188.0  51.0   8.0  16.0   6.0  41.0
     TEX       2  198   729  115  200   40    4  28  115.0  21.0  4.0   73  140.0   4.0   5.0   2.0   8.0  16.0
     TOR       4  459  1408  187  378   96    2  58  223.0   4.0  2.0  190  265.0  16.0  12.0   4.0  16.0  38.0
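Since `data/baseball.csv` is not shipped with this page, the same chaining pattern can be sketched on a tiny made-up frame (column names and values below are invented for illustration only):

```python
import pandas as pd

# A tiny stand-in for the baseball data above (values are made up)
scores = pd.DataFrame({'year': [2007, 2007, 2008, 2008],
                       'team': ['CIN', 'CIN', 'DET', 'DET'],
                       'r': [60, 55, 40, 45]})

# Group, aggregate, then filter the aggregate -- all in one chain,
# with no temporary variable for the grouped result.
big = (scores.groupby(['year', 'team']).sum()
             .loc[lambda df: df.r > 100])
assert big.r.tolist() == [115]
```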
In [100]: s.loc[[1, 2]]
Out[100]:
1    2
2    3
dtype: int64
Previous behavior
In [4]: s.loc[[1, 2, 3]]
Out[4]:
1    2.0
2    3.0
3    NaN
dtype: float64
Current behavior
In [4]: s.loc[[1, 2, 3]]
Passing list-likes to .loc with any non-matching elements will raise
KeyError in the future, you can use .reindex() as an alternative.
See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
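The suggested `.reindex()` alternative can be sketched directly (assuming `s` is the default-indexed Series from the examples above):

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# .reindex keeps every requested label and fills missing ones with
# NaN, where .loc with a missing label will raise a KeyError.
result = s.reindex([1, 2, 3])
assert result.tolist()[:2] == [2.0, 3.0]
assert result.isna().tolist() == [False, False, True]
```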
In [137]: df.at[dates[5], 'A']
Out[137]: -0.6736897080883706

In [138]: df.iat[3, 0]
Out[138]: 0.7215551622443669
You can also set using these same indexers.
In [139]: df.at[dates[5], 'E'] = 7
In [140]: df.iat[3, 0] = 7
at may enlarge the object in-place as above if the indexer is missing.
In [141]: df.at[dates[-1] + pd.Timedelta('1 day'), 0] = 7
In [142]: df
Out[142]:
                   A         B         C         D    E    0
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632  NaN  NaN
2000-01-02  1.212112 -0.173215  0.119209 -1.044236  NaN  NaN
2000-01-03 -0.861849 -2.104569 -0.494929  1.071804  NaN  NaN
2000-01-04  7.000000 -0.706771 -1.039575  0.271860  NaN  NaN
2000-01-05 -0.424972  0.567020  0.276232 -1.087401  NaN  NaN
2000-01-06 -0.673690  0.113648 -1.478427  0.524988  7.0  NaN
2000-01-07  0.404705  0.577046 -1.715002 -1.039268  NaN  NaN
2000-01-08 -0.370647 -1.157892 -1.344312  0.844885  NaN  NaN
2000-01-09       NaN       NaN       NaN       NaN  NaN  7.0
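A self-contained sketch of setting with enlargement, on a small made-up frame (two rows, one column) rather than the `df` above:

```python
import pandas as pd

# Hypothetical two-row frame with a DatetimeIndex
df = pd.DataFrame({'A': [1.0, 2.0]},
                  index=pd.date_range('20000101', periods=2))

# Assigning at a missing column label enlarges the frame with a new
# column 'E'; assigning at a missing row label appends a new row.
# Unset cells are padded with NaN.
df.at[df.index[0], 'E'] = 7
df.at[df.index[-1] + pd.Timedelta('1 day'), 'A'] = 5

assert df.shape == (3, 2)
assert df.at[pd.Timestamp('2000-01-03'), 'A'] == 5
```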
In [148]: df[df['A'] > 0]
Out[148]:
                   A         B         C         D    E    0
2000-01-01  0.469112 -0.282863 -1.509059 -1.135632  NaN  NaN
2000-01-02  1.212112 -0.173215  0.119209 -1.044236  NaN  NaN
2000-01-04  7.000000 -0.706771 -1.039575  0.271860  NaN  NaN
2000-01-07  0.404705  0.577046 -1.715002 -1.039268  NaN  NaN
In [175]: df[df < 0]
Out[175]:
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838
In [182]: df2
Out[182]:
                   A         B         C         D
2000-01-01  0.000000  0.000000  0.485855  0.245166
2000-01-02  0.000000  0.390389  0.000000  1.655824
2000-01-03  0.000000  0.299674  0.000000  0.281059
2000-01-04  0.846958  0.000000  0.600705  0.000000
2000-01-05  0.669692  0.000000  0.000000  0.342416
2000-01-06  0.868584  0.000000  2.297780  0.000000
2000-01-07  0.000000  0.000000  0.168904  0.000000
2000-01-08  0.801196  1.392071  0.000000  0.000000
In [184]: df_orig.where(df > 0, -df, inplace=True)
In [185]: df_orig
Out[185]:
                   A         B         C         D
2000-01-01  2.104139  1.309525  0.485855  0.245166
2000-01-02  0.352480  0.390389  1.192319  1.655824
2000-01-03  0.864883  0.299674  0.227870  0.281059
2000-01-04  0.846958  1.222082  0.600705  1.233203
2000-01-05  0.669692  0.605656  1.169184  0.342416
2000-01-06  0.868584  0.948458  2.297780  0.684718
2000-01-07  2.670153  0.114722  0.168904  0.048048
2000-01-08  0.801196  1.392071  0.048788  0.808838
In [196]: s.mask(s >= 0)
Out[196]:
4   NaN
3   NaN
2   NaN
1   NaN
0   NaN
dtype: float64
In [197]: df.mask(df >= 0)
Out[197]:
                   A         B         C         D
2000-01-01 -2.104139 -1.309525       NaN       NaN
2000-01-02 -0.352480       NaN -1.192319       NaN
2000-01-03 -0.864883       NaN -0.227870       NaN
2000-01-04       NaN -1.222082       NaN -1.233203
2000-01-05       NaN -0.605656 -1.169184       NaN
2000-01-06       NaN -0.948458       NaN -0.684718
2000-01-07 -2.670153 -0.114722       NaN -0.048048
2000-01-08       NaN       NaN -0.048788 -0.808838
In [199]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [200]: df
Out[200]:
          a         b         c
0  0.438921  0.118680  0.863670
1  0.138138  0.577363  0.686602
2  0.595307  0.564592  0.520630
3  0.913052  0.926075  0.616184
4  0.078718  0.854477  0.898725
5  0.076404  0.523211  0.591538
6  0.792342  0.216974  0.564056
7  0.397890  0.454131  0.915716
8  0.074315  0.437913  0.019794
9  0.559209  0.502065  0.026437
# pure python
In [201]: df[(df.a < df.b) & (df.b < df.c)]
Out[201]:
          a         b         c
1  0.138138  0.577363  0.686602
4  0.078718  0.854477  0.898725
5  0.076404  0.523211  0.591538
7  0.397890  0.454131  0.915716
# query
In [202]: df.query('(a < b) & (b < c)')
Out[202]:
          a         b         c
1  0.138138  0.577363  0.686602
4  0.078718  0.854477  0.898725
5  0.076404  0.523211  0.591538
7  0.397890  0.454131  0.915716
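The equivalence of the two selections can be checked with a reproducible sketch (a seeded random frame standing in for the one above):

```python
import numpy as np
import pandas as pd

# Seeded random data so the example is reproducible
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 3)), columns=list('abc'))

# query() parses the expression string into the same boolean mask
# that the pure-Python comparison builds, so both selections match.
via_mask = df[(df.a < df.b) & (df.b < df.c)]
via_query = df.query('(a < b) & (b < c)')
assert via_mask.equals(via_query)
```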
In [219]: index = pd.MultiIndex.from_arrays([colors, foods], names=['color', 'food'])
In [220]: df = pd.DataFrame(np.random.randn(n, 2), index=index)
In [221]: df
Out[221]:
                   0         1
color food
red   ham   0.194889 -0.381994
      ham   0.318587  2.089075
      eggs -0.728293 -0.090255
green eggs -0.748199  1.318931
      eggs -2.029766  0.792652
      ham   0.461007 -0.542749
      ham  -0.305384 -0.479195
      eggs  0.095031 -0.270099
      eggs -0.707140 -0.773882
      eggs  0.229453  0.304418
In [222]: df.query('color == "red"')
Out[222]:
                   0         1
color food
red   ham   0.194889 -0.381994
      ham   0.318587  2.089075
      eggs -0.728293 -0.090255
In [224]: df
Out[224]:
                   0         1
red   ham   0.194889 -0.381994
      ham   0.318587  2.089075
      eggs -0.728293 -0.090255
green eggs -0.748199  1.318931
      eggs -2.029766  0.792652
      ham   0.461007 -0.542749
      ham  -0.305384 -0.479195
      eggs  0.095031 -0.270099
      eggs -0.707140 -0.773882
      eggs  0.229453  0.304418
In [225]: df.query('ilevel_0 == "red"')
Out[225]:
                 0         1
red ham   0.194889 -0.381994
    ham   0.318587  2.089075
    eggs -0.728293 -0.090255
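A runnable sketch of the `ilevel_0` fallback, on a small invented MultiIndex frame (the column `x` and its values are made up):

```python
import pandas as pd

# A MultiIndex with *unnamed* levels, mirroring the frame above
index = pd.MultiIndex.from_arrays([['red', 'red', 'green'],
                                   ['ham', 'eggs', 'eggs']])
df = pd.DataFrame({'x': [1, 2, 3]}, index=index)

# With no level names, query() falls back to the special names
# ilevel_0, ilevel_1, ... ("index level 0", and so on).
red = df.query('ilevel_0 == "red"')
assert red.x.tolist() == [1, 2]
```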
In [226]: df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
In [227]: df
Out[227]:
          a         b         c
0  0.224283  0.736107  0.139168
1  0.302827  0.657803  0.713897
2  0.611185  0.136624  0.984960
3  0.195246  0.123436  0.627712
4  0.618673  0.371660  0.047902
5  0.480088  0.062993  0.185760
6  0.568018  0.483467  0.445289
7  0.309040  0.274580  0.587101
8  0.258993  0.477769  0.370255
9  0.550459  0.840870  0.304611
In [228]: df2 = pd.DataFrame(np.random.rand(n + 2, 3), columns=df.columns)
In [229]: df2
Out[229]:
           a         b         c
0   0.357579  0.229800  0.596001
1   0.309059  0.957923  0.965663
2   0.123102  0.336914  0.318616
3   0.526506  0.323321  0.860813
4   0.518736  0.486514  0.384724
5   0.190804  0.505723  0.614533
6   0.891939  0.623977  0.676639
7   0.480559  0.378528  0.460858
8   0.420223  0.136404  0.141295
9   0.732206  0.419540  0.604675
10  0.604466  0.848974  0.896165
11  0.589168  0.920046  0.732716
In [230]: expr = '0.0 <= a <= c <= 0.5'
In [231]: map(lambda frame: frame.query(expr), [df, df2])
Out[231]: <map at 0x7f65f7952d30>
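Note that in Python 3 `map` is lazy, which is why `Out[231]` shows a map object rather than the filtered frames; materialize it with `list()` or a comprehension. A sketch with made-up data:

```python
import pandas as pd

# Tiny invented frame; only columns 'a' and 'c' matter to the expression
df = pd.DataFrame({'a': [0.1, 0.6], 'c': [0.4, 0.9]})
expr = '0.0 <= a <= c <= 0.5'

# A list comprehension (or list(map(...))) actually runs the queries;
# query() supports chained comparisons like the one in expr.
frames = [frame.query(expr) for frame in [df, df]]
assert all(len(frame) == 1 for frame in frames)
```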
# get all rows where columns "a" and "b" have overlapping values
In [239]: df = pd.DataFrame({'a': list('aabbccddeeff'),
   .....:                    'b': list('aaaabbbbcccc'),
   .....:                    'c': np.random.randint(5, size=12),
   .....:                    'd': np.random.randint(9, size=12)})
   .....:
In [240]: df
Out[240]:
    a  b  c  d
0   a  a  2  6
1   a  a  4  7
2   b  a  1  6
3   b  a  2  1
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2
In [241]: df.query('a in b')
Out[241]:
   a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
3  b  a  2  1
4  c  b  3  6
5  c  b  0  2
# How you'd do it in pure Python
In [242]: df[df.a.isin(df.b)]
Out[242]:
   a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
3  b  a  2  1
4  c  b  3  6
5  c  b  0  2
In [243]: df.query('a not in b')
Out[243]:
    a  b  c  d
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2
# pure Python
In [244]: df[~df.a.isin(df.b)]
Out[244]:
    a  b  c  d
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2
# rows where cols a and b have overlapping values
# and col c's values are less than col d's
In [245]: df.query('a in b and c < d')
Out[245]:
   a  b  c  d
0  a  a  2  6
1  a  a  4  7
2  b  a  1  6
4  c  b  3  6
5  c  b  0  2
# pure Python
In [246]: df[df.b.isin(df.a) & (df.c < df.d)]
Out[246]:
    a  b  c  d
0   a  a  2  6
1   a  a  4  7
2   b  a  1  6
4   c  b  3  6
5   c  b  0  2
10  f  c  0  6
11  f  c  1  2
::: tip Note
Note that `in` and `not in` are evaluated in Python, since `numexpr` has no equivalent of this operation. However, only the `in`/`not in` expression itself is evaluated in vanilla Python. For example, in the expression
df.query('a in b + c + d')
(b + c + d) is evaluated by numexpr and then the in operation is evaluated in plain Python. In general, any operations that can be evaluated using numexpr will be.
In [247]: df.query('b == ["a", "b", "c"]')
Out[247]:
    a  b  c  d
0   a  a  2  6
1   a  a  4  7
2   b  a  1  6
3   b  a  2  1
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2
# pure Python
In [248]: df[df.b.isin(["a", "b", "c"])]
Out[248]:
    a  b  c  d
0   a  a  2  6
1   a  a  4  7
2   b  a  1  6
3   b  a  2  1
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
7   d  b  2  1
8   e  c  4  3
9   e  c  2  0
10  f  c  0  6
11  f  c  1  2
In [249]: df.query('c == [1, 2]')
Out[249]:
    a  b  c  d
0   a  a  2  6
2   b  a  1  6
3   b  a  2  1
7   d  b  2  1
9   e  c  2  0
11  f  c  1  2
In [250]: df.query('c != [1, 2]')
Out[250]:
    a  b  c  d
1   a  a  4  7
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
8   e  c  4  3
10  f  c  0  6
# using in/not in
In [251]: df.query('[1, 2] in c')
Out[251]:
    a  b  c  d
0   a  a  2  6
2   b  a  1  6
3   b  a  2  1
7   d  b  2  1
9   e  c  2  0
11  f  c  1  2
In [252]: df.query('[1, 2] not in c')
Out[252]:
    a  b  c  d
1   a  a  4  7
4   c  b  3  6
5   c  b  0  2
6   d  b  3  3
8   e  c  4  3
10  f  c  0  6
# pure Python
In [253]: df[df.c.isin([1, 2])]
Out[253]:
    a  b  c  d
0   a  a  2  6
2   b  a  1  6
3   b  a  2  1
7   d  b  2  1
9   e  c  2  0
11  f  c  1  2
In [265]: df2
Out[265]:
       a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
4    two  x -0.390820
5  three  x -1.964475
6   four  x  1.298329
In [273]: df2.drop_duplicates(['a', 'b'])
Out[273]:
       a  b         c
0    one  x -1.067137
1    one  y  0.309500
2    two  x -0.211056
3    two  y -1.842023
5  three  x -1.964475
6   four  x  1.298329
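The `keep` parameter controls which of each duplicate group survives; a minimal sketch on a small invented frame:

```python
import pandas as pd

# Rows 0 and 1 are duplicates on ('a', 'b'); row 2 is unique
df = pd.DataFrame({'a': ['one', 'one', 'two'], 'b': ['x', 'x', 'y']})

# Default keeps the first of each duplicate group; keep='last' keeps
# the last; keep=False drops every row that has any duplicate.
assert df.drop_duplicates(['a', 'b']).index.tolist() == [0, 2]
assert df.drop_duplicates(['a', 'b'], keep='last').index.tolist() == [1, 2]
assert df.drop_duplicates(['a', 'b'], keep=False).index.tolist() == [2]
```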
def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    # We don't know whether this will modify df or not!
    foo['quux'] = value
    return foo
# This will show the SettingWithCopyWarning
# but the frame values will be set
In [343]: dfb['c'][dfb.a.str.startswith('o')] = 42
This however is operating on a copy, and will not work.
>>> pd.set_option('mode.chained_assignment', 'warn')
>>> dfb[dfb.a.str.startswith('o')]['c'] = 42
Traceback (most recent call last)
...
SettingWithCopyWarning:
     A value is trying to be set on a copy of a slice from a DataFrame.
     Try using .loc[row_index,col_indexer] = value instead
>>> pd.set_option('mode.chained_assignment', 'raise')
>>> dfc.loc[0]['A'] = 1111
Traceback (most recent call last)
...
SettingWithCopyException:
     A value is trying to be set on a copy of a slice from a DataFrame.
     Try using .loc[row_index,col_indexer] = value instead
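The reliable pattern suggested by the warning is a single .loc call with the row mask and the column label, instead of chained [] indexing. A self-contained sketch (the frame below is invented):

```python
import pandas as pd

dfb = pd.DataFrame({'a': ['one', 'two', 'one'], 'c': [1, 2, 3]})

# One .loc call: the mask selects the rows, 'c' selects the column,
# and the assignment is guaranteed to hit the original frame.
dfb.loc[dfb.a.str.startswith('o'), 'c'] = 42
assert dfb.c.tolist() == [42, 2, 42]
```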