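Learning Polars 08: Handling Categorical Data
Background
This is the eighth article in the Learning Polars series: handling categorical data (Categorical data).
The series is also shared on GitHub, where you can download the Jupyter notebooks and follow along.
The author's runtime environment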
import sys
import polars as pl
print("Python version:", sys.version.split("|")[0])  # Python version: 3.11.9
print("Polars version:", pl.__version__)  # Polars version: 0.20.22
Categorical data
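Categorical data is data whose values can be encoded, as is commonly done in databases: gender, age, country, city, occupation, and so on. Encoding these values saves storage space.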
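Polars supports two data types for categorical data: Enum and Categorical.
- Use Enum when the categories are known in advance; all categories must be provided up front.
- Use Categorical when the categories are unknown or not fixed.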
enum_dtype = pl.Enum(["Polar", "Panda", "Brown"])
enum_series = pl.Series(["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=enum_dtype)
cat_series = pl.Series(["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical)
The Categorical type
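Categorical is the more flexible of the two: the categories do not need to be known in advance, and any new category is encoded automatically when it appears.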
Directly concatenating two separately created Categorical columns, as shown below, is relatively slow. Polars assigns codes to strings in order of first appearance, so the same string can receive different codes in different series. A direct merge therefore forces Polars to re-encode the data globally, which is expensive:
cat_series = pl.Series(["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical)
cat2_series = pl.Series(["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical)
# CategoricalRemappingWarning: Local categoricals have different encodings,
# expensive re-encoding is done to perform this merge operation.
# Consider using a StringCache or an Enum type if the categories are known in advance
print(cat_series.append(cat2_series))
The global string cache that Polars provides, StringCache, avoids this re-encoding and makes the operation more efficient:
with pl.StringCache():
    cat_series = pl.Series(["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical)
    cat2_series = pl.Series(["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical)
    print(cat_series.append(cat2_series))
Enum
The slow concatenation of two separately encoded columns described above does not occur with Enum, because all categories are known in advance:
dtype = pl.Enum(["Polar", "Panda", "Brown"])
cat_series = pl.Series(["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=dtype)
cat2_series = pl.Series(["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=dtype)
print(cat_series.append(cat2_series))
# shape: (10,)
# Series: '' [enum]
# [
#     "Polar"
#     "Panda"
#     "Brown"
#     "Brown"
#     "Polar"
#     "Panda"
#     "Brown"
#     "Brown"
#     "Polar"
#     "Polar"
# ]
If a string value is not among the categories declared for the Enum, Polars raises an error (OutOfBounds):
dtype = pl.Enum(["Polar", "Panda", "Brown"])
try:
    cat_series = pl.Series(["Polar", "Panda", "Brown", "Black"], dtype=dtype)
except Exception as e:
    print(e)
# conversion from `str` to `enum` failed
# in column '' for 1 out of 4 values: ["Black"]
# Ensure that all values in the input column are present
# in the categories of the enum datatype.
Comparisons
The following comparisons are supported:
- Categorical vs Categorical
- Categorical vs String
- Enum vs Enum
- Enum vs String (the string must be one of the categories declared for the Enum)
Categorical vs Categorical
with pl.StringCache():
    cat_series = pl.Series(["Brown", "Panda", "Polar"], dtype=pl.Categorical)
    cat_series2 = pl.Series(["Polar", "Panda", "Black"], dtype=pl.Categorical)
    print(cat_series == cat_series2)
# shape: (3,)
# Series: '' [bool]
# [
#     false
#     true
#     false
# ]
Categorical vs String
cat_series = pl.Series(["Brown", "Panda", "Polar"], dtype=pl.Categorical)
print(cat_series <= "Cat")
# shape: (3,)
# Series: '' [bool]
# [
#     true
#     false
#     false
# ]
cat_series = pl.Series(["Brown", "Panda", "Polar"], dtype=pl.Categorical)
cat_series_utf = pl.Series(["Panda", "Panda", "A Polar"])
print(cat_series <= cat_series_utf)
# shape: (3,)
# Series: '' [bool]
# [
#     true
#     true
#     false
# ]
Enum vs Enum
dtype = pl.Enum(["Polar", "Panda", "Brown"])
cat_series = pl.Series(["Brown", "Panda", "Polar"], dtype=dtype)
cat_series2 = pl.Series(["Polar", "Panda", "Brown"], dtype=dtype)
print(cat_series == cat_series2)
# shape: (3,)
# Series: '' [bool]
# [
#     false
#     true
#     false
# ]
Enum vs String (the string must be one of the categories declared for the Enum)
try:
    cat_series = pl.Series(["Low", "Medium", "High"], dtype=pl.Enum(["Low", "Medium", "High"]))
    cat_series <= "Excellent"
except Exception as e:
    print(e)
# conversion from `str` to `enum` failed
# in column '' for 1 out of 1 values: ["Excellent"]
# Ensure that all values in the input column are present
# in the categories of the enum datatype.
dtype = pl.Enum(["Low", "Medium", "High"])
cat_series = pl.Series(["Low", "Medium", "High"], dtype=dtype)
print(cat_series <= "Medium")
# shape: (3,)
# Series: '' [bool]
# [
#     true
#     true
#     false
# ]
dtype = pl.Enum(["Low", "Medium", "High"])
cat_series = pl.Series(["Low", "Medium", "High"], dtype=dtype)
cat_series2 = pl.Series(["High", "High", "Low"])
print(cat_series <= cat_series2)
# shape: (3,)
# Series: '' [bool]
# [
#     true
#     true
#     false
# ]
Previous articles in this series
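- Python Learning Polars 01: Reading and Writing Files
- Python Learning Polars 02: Contexts and Expressions
- Python Learning Polars 03: Data Type Conversion
- Python Learning Polars 04: String Data Processing
- Python Learning Polars 05: Data Structures
- Python Learning Polars 06: Lazy / Eager API
- Python Learning Polars 07: Missing Values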
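These are some of the issues I ran into in my own practice, shared here for reference. You are welcome to follow the WeChat official account DataShare, which posts practical content from time to time.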