Estom's Blog

Posted on 2020-09-26 | Python

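Cookbook
This section lists short, focused pandas examples and links. We hope pandas users will actively contribute more content to this document. Adding a link to a useful example, or the code itself, is an excellent choice for a user's first pull request. The examples here are simple, concise, and beginner-friendly, and the Stack Overflow or GitHub links lead to more detail on each one. pd and np are the usual abbreviations for pandas and NumPy. To keep things clear for newcomers, all other modules are imported explicitly. The examples are written for Python 3; minor changes make them work on earlier Python versions.
Idioms
These are some common pandas idioms. An if-then / if-then-else on one column, with the result assigned to one or more columns:
In [1]: df = pd.DataFrame({'AAA': [4, 5, 6, 7],
   ...:                    'BBB': [10, 20, 30, 40],
   ...: ...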
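The if-then / if-then-else idiom mentioned above can be sketched as follows. This is a minimal illustration, not the cookbook's own code: the threshold values and the `logic` column name are assumptions made for the example, and only the `AAA` and `BBB` columns come from the excerpt.

```python
import numpy as np
import pandas as pd

# Columns AAA and BBB come from the excerpt; everything else is illustrative.
df = pd.DataFrame({"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40]})

# if-then: where AAA >= 5, overwrite BBB with -1.
df.loc[df["AAA"] >= 5, "BBB"] = -1

# if-then-else: build a new column from a condition with np.where.
# The column name "logic" and the labels are hypothetical.
df["logic"] = np.where(df["AAA"] > 5, "high", "low")

print(df)
```

Using `.loc` with a boolean mask keeps the assignment in a single indexing step, which avoids writing to a temporary copy.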

enhancingperf

Posted on 2020-09-26 | Python

Enhancing performance
In this part of the tutorial, we will investigate how to speed up certain functions operating on pandas DataFrames using three different techniques: Cython, Numba and pandas.eval(). We will see a speed improvement of ~200x when we use Cython and Numba on a test function operating row-wise on the DataFrame. Using pandas.eval() we will speed up a sum by a factor of ~2.
Cython (writing C extensions for pandas)
For many use cases writing pandas in pure Python and NumPy is sufficient...
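As a rough sketch of the pandas.eval() technique named above, the snippet below evaluates a sum of two frames from a string expression and checks it against the ordinary operator. The frame sizes and the names `df1`, `df2` are assumptions for illustration; no timing is claimed here, since any speedup depends on frame size and on whether numexpr is installed.

```python
import numpy as np
import pandas as pd

# Two illustrative frames; shapes and seed are arbitrary assumptions.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((1000, 4)))
df2 = pd.DataFrame(rng.standard_normal((1000, 4)))

# Ordinary element-wise addition.
plain = df1 + df2

# The same sum expressed as a string; the frames are passed in explicitly
# through local_dict so the lookup does not depend on the calling scope.
evaluated = pd.eval("df1 + df2", local_dict={"df1": df1, "df2": df2})

print(np.allclose(plain.to_numpy(), evaluated.to_numpy()))
```

For small frames like these the string form is usually not faster; it is shown only to illustrate the call.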

gotchas

Posted on 2020-09-26 | Python

Frequently Asked Questions (FAQ)
DataFrame memory usage
The memory usage of a DataFrame (including the index) is shown when calling info(). A configuration option, display.memory_usage (see the list of options), specifies if the DataFrame’s memory usage will be displayed when invoking the df.info() method. For example, the memory usage of the DataFrame below is shown when calling info():
In [1]: dtypes = ['int64', 'float64', 'da...
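A small sketch of inspecting memory usage, under the assumption of a made-up three-column frame (the column names and sizes are not from the original post). It captures the text that info() prints and also asks for per-column byte counts with memory_usage().

```python
import io

import numpy as np
import pandas as pd

# Hypothetical frame: 1000 rows of int64, float64 and Python strings.
df = pd.DataFrame({
    "ints": np.arange(1000, dtype="int64"),
    "floats": np.zeros(1000, dtype="float64"),
    "labels": ["x"] * 1000,
})

# info() writes to stdout by default; a buffer lets us keep the text.
buf = io.StringIO()
df.info(buf=buf)
summary = buf.getvalue()

# Per-column byte counts; deep=True also measures the string payloads.
per_column = df.memory_usage(index=True, deep=True)

print(summary)
print(per_column)
```

The two numeric columns each take 8 bytes per row, so 8000 bytes apiece; the string column's size depends on the pandas version and string storage.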

groupby

Posted on 2020-09-26 | Python

Group By: split-apply-combine
By “group by” we are referring to a process involving one or more of the following steps: Splitting the data into groups based on some criteria. Applying a function to each group independently. Combining the results into a data structure. Out of these, the split step is the most straightforward. In fact, in many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish to do one of the following: A...
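The three steps described above can be sketched on a tiny frame. The column names `key` and `value` and the data are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical data: two groups, "a" and "b".
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Split: iterate over the groups to see what each one contains.
groups = {name: group["value"].tolist() for name, group in df.groupby("key")}

# Apply + combine: sum each group and collect the results into a Series.
totals = df.groupby("key")["value"].sum()

print(groups)
print(totals)
```

In everyday use the split is rarely materialized like `groups` here; calling an aggregation on the groupby object does all three steps at once.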

indexing

Posted on 2020-09-26 | Python

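Indexing and selecting data
The axis labeling information in pandas objects serves many purposes: It identifies data (that is, provides metadata) using known indicators, which is important for analysis, visualization, and interactive console display. It enables automatic and explicit data alignment. It allows intuitive getting and setting of subsets of the data set. In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of pandas objects. The primary focus will be on Series and DataFrame, as they have received more development attention in this area. ::: tip Note The Python and NumPy indexing operators [] and attribute operator . provide quick and easy access to pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there is little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed is not known in advance, directly using standard operators has some optimization limits. For production code, we recommend that you take advantage of the optimized pandas data access methods exposed in this chapter. ::: ::: danger Warning Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy. ::: ::: danger Warning Using float...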
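A minimal sketch of label-based and position-based access, plus a single-step assignment that sidesteps the chained-assignment problem mentioned in the warning. The frame, its labels, and the condition are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]}, index=["x", "y", "z"])

# Label-based access: row "y", column "B".
by_label = df.loc["y", "B"]

# Position-based access: first row, second column.
by_position = df.iloc[0, 1]

# Setting values in one .loc call, rather than df[df["A"] > 1]["B"] = 0,
# which would chain two indexing operations.
df.loc[df["A"] > 1, "B"] = 0

print(by_label, by_position)
print(df)
```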

integer_na

Posted on 2020-09-26 | Python

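Nullable integer data type
New in version 0.24.0. ::: tip Tip IntegerArray is currently experimental, so its API or usage may change without warning. ::: In Working with missing data, we saw that pandas primarily uses NaN to represent missing data. Because NaN is a float, an array of integers with any missing values is forced to become floating point. In some cases this may not matter much, but if your integer column is, say, an identifier, the cast to float can be problematic. Some integers cannot even be represented as floating point numbers. pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers, and pandas will not use it automatically, so if you want this dtype you must specify it explicitly in the dtype argument when calling array() or Series.
In [1]: arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
In [2]: arr
Out[2]: <...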
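To make the contrast concrete, here is a short sketch comparing the default float conversion with an explicitly requested nullable integer dtype. The `pd.array` call mirrors the one in the excerpt; the two Series are added assumptions for comparison.

```python
import numpy as np
import pandas as pd

# Same call as in the excerpt: an explicitly typed nullable integer array.
arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())

# Without a dtype, the missing value forces the data to float64.
s_float = pd.Series([1, 2, np.nan])

# With the "Int64" string alias, the integers stay integers.
s_int = pd.Series([1, 2, np.nan], dtype="Int64")

print(s_float.dtype, s_int.dtype)
print(s_int.isna().tolist())
```

Note the capital "I" in `"Int64"`: the lowercase `"int64"` is the ordinary NumPy dtype, which cannot hold missing values.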

computation

Posted on 2020-09-26 | Python

Computational tools
Statistical functions
Percent change
Series and DataFrame have a method pct_change() to compute the percent change over a given number of periods (using fill_method to fill NA/null values before computing the percent change).
In [1]: ser = pd.Series(np.random.randn(8))
In [2]: ser.pct_change()
Out[2]:
0         NaN
1   -1.602976
2    4.334938
3   -0.247456
4   -2.067345
5   -1.142903
6   -1.688214
7   -9.759729
dtype: float64
In [3]: df = pd.Dat...
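Because the excerpt uses random data, the numbers are hard to check by eye. The sketch below uses a fixed, made-up series and compares pct_change() with the equivalent manual formula, current value divided by the previous value, minus one.

```python
import numpy as np
import pandas as pd

# Fixed illustrative values so the result is easy to verify by hand.
ser = pd.Series([100.0, 110.0, 99.0, 99.0])

# Built-in percent change over one period.
change = ser.pct_change()

# The same quantity written out: x[t] / x[t-1] - 1.
manual = ser / ser.shift(1) - 1

# Percent change over two periods: x[t] / x[t-2] - 1.
change_2 = ser.pct_change(periods=2)

print(change)
print(change_2)
```

The first element is NaN because there is no earlier value to compare against; with `periods=2` the first two elements are NaN.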

merging

Posted on 2020-09-26 | Python

Merge, join, and concatenate
pandas provides various facilities for easily combining together Series or DataFrame with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
Concatenating objects
The concat() function (in the main pandas namespace) does all of the heavy lifting of performing concatenation operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other a...
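The union-or-intersection behavior described for concat() can be sketched with two small frames that share only one column. The frames and column names are assumptions for the example.

```python
import pandas as pd

# Hypothetical frames: only column "B" is common to both.
a = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
b = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

# Stack rows; join="outer" keeps the union of the columns,
# filling the gaps with NaN.
outer = pd.concat([a, b], join="outer", ignore_index=True)

# join="inner" keeps only the columns present in every input.
inner = pd.concat([a, b], join="inner", ignore_index=True)

print(outer)
print(inner)
```

`ignore_index=True` renumbers the rows 0 to 3 instead of repeating each frame's original labels.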

Posted on 2020-09-26 | Python

IO tools (text, CSV, HDF5, …)
The pandas I/O API is a set of reader functions, such as pandas.read_csv(), that return a pandas object. The corresponding writer functions are object methods such as DataFrame.to_csv(). Below is a table listing the available readers and writers.
Format Type   Data Description   Reader           Writer
text          CSV                read_csv         to_csv
text          JSON               read_json        to_json
text          HTML               read_html        to_html
text          Local clipboard    read_clipboard   to_clipboard
binary        MS Excel           read_excel       to_excel
binary        OpenDocument       read_excel
binary        HDF5 Format        read_hdf         to_hdf
binary        Feather Format     read_feather     ...
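A minimal round trip through the CSV reader and writer pair, using an in-memory buffer so no file is touched. The frame contents and column names are made up for the example.

```python
import io

import pandas as pd

# Hypothetical data to write out and read back.
df = pd.DataFrame({"name": ["a", "b"], "score": [1.5, 2.5]})

# Writer: with no path given, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=False)

# Reader: read_csv accepts any file-like object, here a StringIO buffer.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```

Passing `index=False` leaves the row labels out of the file, so the restored frame has the same two columns as the original.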

missing_data

Posted on 2020-09-26 | Python

Working with missing data
In this section, we will discuss missing (also referred to as NA) values in pandas. ::: tip Note The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons. It differs from the MaskedArray approach of, for example, scikits.timeseries. We are hopeful that NumPy will soon be able to provide a native NA type solution (similar to R) performant enough to be used in pandas. ::: See the cookbook for some advanced strategies. Value...
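A short sketch of how NaN behaves as the missing-value marker in a float Series: detecting it, skipping it in a reduction, and filling or dropping it. The data is an assumption made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

# Detect missing entries.
mask = s.isna()

# Reductions such as sum() skip NaN by default.
total = s.sum()

# Replace missing values, or drop them entirely.
filled = s.fillna(0.0)
dropped = s.dropna()

print(mask.tolist(), total)
print(filled.tolist(), dropped.tolist())
```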