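Background#

This is the 11th article in the polars learning series. It covers user-defined functions: how to use custom Python functions together with polars.

The library has now reached version 1.37.1. Releases have come very quickly over the past year; the previous 10 articles in this series were written against version 1.2.1.

The articles in this series are shared on GitHub, where you can download the Jupyter notebooks for reference. Repository: https://github.com/DataShare-duo/polars_learn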

Runtime environment#

import sys

print('python version:', sys.version.split('|')[0])
# python version: 3.11.11

import polars as pl

print("polars version:", pl.__version__)
# polars version: 1.37.1

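Available API functions#

map_elements: calls the function once for each value in the column, similar to map in pandas
map_batches: passes the whole column to the function in one call, similar to apply in pandas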

Sample data#

df = pl.DataFrame(
    {
        "keys": ["a", "a", "b", "b"],
        "values": [10, 7, 1, 23],
    }
)
print(df)

shape: (4, 2)
┌──────┬────────┐
│ keys ┆ values │
│ ---  ┆ ---    │
│ str  ┆ i64    │
╞══════╪════════╡
│ a    ┆ 10     │
│ a    ┆ 7      │
│ b    ┆ 1      │
│ b    ┆ 23     │
└──────┴────────┘

Using map_elements#

import math

def my_log(value):
    return math.log(value)  # math.log is applied to each value

out = df.select(pl.col("values").map_elements(my_log, return_dtype=pl.Float64))
print(out)

shape: (4, 1)
┌──────────┐
│ values   │
│ ---      │
│ f64      │
╞══════════╡
│ 2.302585 │
│ 1.94591  │
│ 0.0      │
│ 3.135494 │
└──────────┘

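Drawbacks:

Limited to single items: the function only ever sees one value at a time and cannot operate on the whole column at once
Performance overhead: calling a Python function for every individual item is slow, and all those extra calls add a lot of overhead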

Using map_batches#

def diff_from_mean(series):
    total = 0
    for value in series:
        total += value
    mean = total / len(series)
    return pl.Series([value - mean for value in series])

out = df.select(pl.col("values").map_batches(diff_from_mean, return_dtype=pl.Float64))
print("== select() with UDF ==")
print(out)

== select() with UDF ==
shape: (4, 1)
┌────────┐
│ values │
│ ---    │
│ f64    │
╞════════╡
│ -0.25  │
│ -3.25  │
│ -9.25  │
│ 12.75  │
└────────┘

print("== group_by() with UDF ==")
out = df.group_by("keys").agg(
    pl.col("values").map_batches(diff_from_mean, return_dtype=pl.Float64)
)
print(out)

== group_by() with UDF ==
shape: (2, 2)
┌──────┬───────────────┐
│ keys ┆ values        │
│ ---  ┆ ---           │
│ str  ┆ list[f64]     │
╞══════╪═══════════════╡
│ a    ┆ [1.5, -1.5]   │
│ b    ┆ [-11.0, 11.0] │
└──────┴───────────────┘

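Improving the performance of user-defined functions#

NumPy universal functions#

Custom functions written in pure Python are generally slow, so keep the amount of work done in Python code to a minimum. Where possible, call NumPy's universal functions (ufuncs) instead; the speedup comes from running compiled C code rather than the Python interpreter.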

import numpy as np

out = df.select(pl.col("values").map_batches(np.log, return_dtype=pl.Float64))
print(out)

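Speeding up user-defined functions with Numba#

If NumPy has no suitable function, the custom function can be sped up with Numba, which compiles it just in time (JIT).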

from numba import guvectorize, int64, float64

@guvectorize([(int64[:], float64[:])], "(n)->(n)")
def diff_from_mean_numba(arr, result):
    total = 0
    for value in arr:
        total += value
    mean = total / len(arr)
    for i, value in enumerate(arr):
        result[i] = value - mean


out = df.select(
    pl.col("values").map_batches(diff_from_mean_numba, return_dtype=pl.Float64)
)
print("== select() with UDF ==")
print(out)

out = df.group_by("keys").agg(
    pl.col("values").map_batches(diff_from_mean_numba, return_dtype=pl.Float64)
)
print("== group_by() with UDF ==")
print(out)

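Caveats#

Missing data is not allowed when accelerating a function this way. Before using a function decorated with Numba's @guvectorize, either fill or drop the missing values; otherwise polars raises an error.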

Combining multiple columns#

from numba import guvectorize, int64, float64

@guvectorize([(int64[:], int64[:], float64[:])], "(n),(n)->(n)")
def add(arr, arr2, result):
    for i in range(len(arr)):
        result[i] = arr[i] + arr2[i]


df3 = pl.DataFrame({"values_1": [1, 2, 3], "values_2": [10, 20, 30]})

# Pack both columns into one struct column, then unpack the fields inside the UDF
out = df3.select(
    pl.struct(["values_1", "values_2"])
    .map_batches(
        lambda combined: add(
            combined.struct.field("values_1"), combined.struct.field("values_2")
        ),
        return_dtype=pl.Float64,
    )
    .alias("add_columns")
)
print(out)

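Streaming#

Setting is_elementwise=True on map_batches lets polars stream results into the function.

Only enable this when the function computes each value independently of the others. In that case streaming uses less memory.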

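Return data types#

If no return type is specified, it is inferred automatically: the type of the first non-null value becomes the result type.

Mapping from Python types to polars data types:

int -> Int64
float -> Float64
bool -> Boolean
str -> String
list[tp] -> List[tp]
dict[str, [tp]] -> struct
any -> object (avoid this whenever possible)

You can also pass the return_dtype parameter to map_batches to set the result type explicitly.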

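Related earlier articles#

These are some of the issues I ran into in practice, shared here for reference. You are welcome to follow the WeChat official account DataShare, which posts practical content from time to time.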