Learning Python polars 08: Handling Categorical Data

Background

This is part 8 of the polars learning series: handling categorical data.

The series is shared on GitHub, where you can download the Jupyter notebooks for reference.

Repository: https://github.com/DataShare-duo/polars_learn

The author's environment

import sys

print('Python version:', sys.version.split('|')[0])
#Python version: 3.11.9

import polars as pl

print("polars version:", pl.__version__)
#polars version: 0.20.22

Categorical data

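Categorical data is data that would typically be stored as codes in a database, such as gender, age, country, city, or occupation. Each distinct value is mapped to a code, and storing the codes instead of the repeated strings saves space.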
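To make the idea of "encoding" concrete, here is a minimal pure-Python sketch of dictionary encoding. The function name `dictionary_encode` and the sample city values are made up for illustration, and this is not how polars implements it internally.

```python
def dictionary_encode(values):
    """Return (categories, codes).

    categories: the distinct values, in order of first appearance.
    codes: one integer per input value, pointing into categories.
    """
    code_of = {}
    codes = []
    for value in values:
        if value not in code_of:
            # A new category gets the next unused integer code.
            code_of[value] = len(code_of)
        codes.append(code_of[value])
    return list(code_of), codes


cities = ["Beijing", "Shanghai", "Beijing", "Beijing", "Shanghai"]
categories, codes = dictionary_encode(cities)
print(categories)  # ['Beijing', 'Shanghai']
print(codes)       # [0, 1, 0, 0, 1]

# Decoding is a simple lookup, so no information is lost.
decoded = [categories[code] for code in codes]
```

Each string is stored once, and every row holds only a small integer, which is where the space saving comes from.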

Polars supports two data types for handling categorical data: Enum and Categorical.

Use Enum when the categories are known in advance; all categories must be supplied up front.
Use Categorical when the categories are unknown or not fixed.

enum_dtype = pl.Enum(["Polar", "Panda", "Brown"])
enum_series = pl.Series(
    ["Polar", "Panda", "Brown", "Brown", "Polar"], 
    dtype=enum_dtype)

cat_series = pl.Series(
    ["Polar", "Panda", "Brown", "Brown", "Polar"], 
    dtype=pl.Categorical
)

The Categorical type

Categorical is more flexible: the categories do not need to be known in advance, and new categories are encoded automatically as they appear.

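When two Categorical columns that were built separately are concatenated directly, the approach below is slow. polars assigns codes to strings in order of first appearance, so the same string can receive different codes in different series; merging them directly forces a re-encoding against a common mapping, which is costly: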

cat_series = pl.Series(
    ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical
)
cat2_series = pl.Series(
    ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical
)

#CategoricalRemappingWarning: Local categoricals have different encodings, 
#expensive re-encoding is done to perform this merge operation. 
#Consider using a StringCache or an Enum type if the categories are known in advance
print(cat_series.append(cat2_series))

The global string cache that polars provides, StringCache, avoids this re-encoding and improves processing efficiency:

with pl.StringCache():
    cat_series = pl.Series(
        ["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=pl.Categorical
    )
    cat2_series = pl.Series(
        ["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=pl.Categorical
    )
    print(cat_series.append(cat2_series))

Enum

The costly concatenation of two separately built columns described above does not arise with Enum, because all categories are known up front:

dtype = pl.Enum(["Polar", "Panda", "Brown"])
cat_series = pl.Series(["Polar", "Panda", "Brown", "Brown", "Polar"], dtype=dtype)
cat2_series = pl.Series(["Panda", "Brown", "Brown", "Polar", "Polar"], dtype=dtype)

print(cat_series.append(cat2_series))
#shape: (10,)
#Series: '' [enum]
#[
#	"Polar"
#	"Panda"
#	"Brown"
#	"Brown"
#	"Polar"
#	"Panda"
#	"Brown"
#	"Brown"
#	"Polar"
#	"Polar"
#]

If a string to be encoded is not among the categories declared for the Enum, an error is raised (OutOfBounds):

dtype = pl.Enum(["Polar", "Panda", "Brown"])
try:
    cat_series = pl.Series(["Polar", "Panda", "Brown", "Black"], dtype=dtype)
except Exception as e:
    print(e)
#conversion from `str` to `enum` failed 
#in column '' for 1 out of 4 values: ["Black"]
#Ensure that all values in the input column are present 
#in the categories of the enum datatype.

Comparisons

The following comparisons are supported:

Categorical vs Categorical
Categorical vs String
Enum vs Enum
Enum vs String (the string must be one of the Enum's declared categories)

Categorical vs Categorical

with pl.StringCache():
    cat_series = pl.Series(["Brown", "Panda", "Polar"], dtype=pl.Categorical)
    cat_series2 = pl.Series(["Polar", "Panda", "Black"], dtype=pl.Categorical)
    print(cat_series == cat_series2)
#shape: (3,)
#Series: '' [bool]
#[
#	false
#	true
#	false
#]

Categorical vs String

cat_series = pl.Series(["Brown", "Panda", "Polar"], dtype=pl.Categorical)
print(cat_series <= "Cat")
#shape: (3,)
#Series: '' [bool]
#[
#	true
#	false
#	false
#]

cat_series = pl.Series(["Brown", "Panda", "Polar"], dtype=pl.Categorical)
cat_series_utf = pl.Series(["Panda", "Panda", "A Polar"])
print(cat_series <= cat_series_utf)
#shape: (3,)
#Series: '' [bool]
#[
#	true
#	true
#	false
#]

Enum vs Enum

dtype = pl.Enum(["Polar", "Panda", "Brown"])
cat_series = pl.Series(["Brown", "Panda", "Polar"], dtype=dtype)
cat_series2 = pl.Series(["Polar", "Panda", "Brown"], dtype=dtype)
print(cat_series == cat_series2)
#shape: (3,)
#Series: '' [bool]
#[
#	false
#	true
#	false
#]

Enum vs String (the string must be one of the Enum's declared categories)

try:
    cat_series = pl.Series(
        ["Low", "Medium", "High"], dtype=pl.Enum(["Low", "Medium", "High"])
    )
    cat_series <= "Excellent"
except Exception as e:
    print(e)
#conversion from `str` to `enum` failed 
#in column '' for 1 out of 1 values: ["Excellent"]
#Ensure that all values in the input column are present 
#in the categories of the enum datatype.

dtype = pl.Enum(["Low", "Medium", "High"])
cat_series = pl.Series(["Low", "Medium", "High"], dtype=dtype)
print(cat_series <= "Medium")
#shape: (3,)
#Series: '' [bool]
#[
#	true
#	true
#	false
#]

dtype = pl.Enum(["Low", "Medium", "High"])
cat_series = pl.Series(["Low", "Medium", "High"], dtype=dtype)
cat_series2 = pl.Series(["High", "High", "Low"])
print(cat_series <= cat_series2)
#shape: (3,)
#Series: '' [bool]
#[
#	true
#	true
#	false
#]

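These are some of the issues I ran into in practice, shared here for reference. You are welcome to follow the WeChat official account DataShare, which posts useful material from time to time.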