Background

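This is the fourth article in the polars learning series, covering string data processing.
The series is shared on GitHub, where you can download the Jupyter notebooks to follow along.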

Repository: https://github.com/DataShare-duo/polars_learn

Author's runtime environment

import sys

print("Python version:", sys.version.split('|')[0])
# Python version: 3.11.9

import polars as pl

print("polars version:", pl.__version__)
# polars version: 0.20.22

String length

You can get either the number of characters or the number of bytes in a string.

df = pl.DataFrame({"animal": ["Crab", "cat and dog", "rab$bit", '张', None]})

out = df.select(
    pl.col("animal").str.len_bytes().alias("byte_count"),  # number of bytes
    pl.col("animal").str.len_chars().alias("letter_count"),  # number of characters
)
print(out)

shape: (5, 2)
┌────────────┬──────────────┐
│ byte_count ┆ letter_count │
│ ---        ┆ ---          │
│ u32        ┆ u32          │
╞════════════╪══════════════╡
│ 4          ┆ 4            │
│ 11         ┆ 11           │
│ 7          ┆ 7            │
│ 3          ┆ 1            │
│ null       ┆ null         │
└────────────┴──────────────┘
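The two counts differ for '张' because polars stores strings as UTF-8, where a single Chinese character takes three bytes. As a rough illustration only, the same numbers can be reproduced in plain Python without polars; the helper name `byte_and_char_counts` below is hypothetical and not part of any library.

```python
# Plain-Python sketch (not polars): UTF-8 byte length vs. character count.
animals = ["Crab", "cat and dog", "rab$bit", "张", None]


def byte_and_char_counts(value):
    """Return (byte_count, char_count), or (None, None) for a missing value."""
    if value is None:
        return (None, None)
    # encode() yields the UTF-8 bytes; len() on a str counts code points.
    return (len(value.encode("utf-8")), len(value))


counts = [byte_and_char_counts(a) for a in animals]
for animal, (n_bytes, n_chars) in zip(animals, counts):
    print(animal, n_bytes, n_chars)
```

For ASCII-only text the two counts are identical, so the distinction only matters once the column may contain non-ASCII characters.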

Checking whether a string contains a literal string or regex pattern

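contains: checks whether the string contains the given literal string or regex pattern; returns true/false
starts_with: checks whether the string starts with the given string; returns true/false
ends_with: checks whether the string ends with the given string; returns true/false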

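If the pattern contains special characters but is not meant to be a regular expression, set literal=True. literal defaults to False, which means the pattern is interpreted as a regular expression.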

out = df.select(
    pl.col("animal"),
    pl.col("animal").str.contains("cat|bit").alias("regex"),
    pl.col("animal").str.contains("rab$", literal=True).alias("literal"),  # match $ as a literal character
    pl.col("animal").str.contains("rab$").alias("regex_pattern"),
    pl.col("animal").str.starts_with("rab").alias("starts_with"),
    pl.col("animal").str.ends_with("dog").alias("ends_with"),
)
print(out)

shape: (5, 6)
┌─────────────┬───────┬─────────┬───────────────┬─────────────┬───────────┐
│ animal      ┆ regex ┆ literal ┆ regex_pattern ┆ starts_with ┆ ends_with │
│ ---         ┆ ---   ┆ ---     ┆ ---           ┆ ---         ┆ ---       │
│ str         ┆ bool  ┆ bool    ┆ bool          ┆ bool        ┆ bool      │
╞═════════════╪═══════╪═════════╪═══════════════╪═════════════╪═══════════╡
│ Crab        ┆ false ┆ false   ┆ true          ┆ false       ┆ false     │
│ cat and dog ┆ true  ┆ false   ┆ false         ┆ false       ┆ true      │
│ rab$bit     ┆ true  ┆ true    ┆ false         ┆ true        ┆ false     │
│ 张          ┆ false ┆ false   ┆ false         ┆ false       ┆ false     │
│ null        ┆ null  ┆ null    ┆ null          ┆ null        ┆ null      │
└─────────────┴───────┴─────────┴───────────────┴─────────────┴───────────┘

Regex flags must be written at the start of the pattern, wrapped in parentheses, using the inline syntax (?iLmsuxU).

out = pl.DataFrame({"s": ["AAA", "aAa", "aaa"]}).with_columns(
    default_match=pl.col("s").str.contains("AA"),
    insensitive_match=pl.col("s").str.contains("(?i)AA"),  # ignore case
)

print(out)

shape: (3, 3)
┌─────┬───────────────┬───────────────────┐
│ s   ┆ default_match ┆ insensitive_match │
│ --- ┆ ---           ┆ ---               │
│ str ┆ bool          ┆ bool              │
╞═════╪═══════════════╪═══════════════════╡
│ AAA ┆ true          ┆ true              │
│ aAa ┆ false         ┆ true              │
│ aaa ┆ false         ┆ true              │
└─────┴───────────────┴───────────────────┘

Extracting substrings with a regular expression

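Use the extract method to pull out the substring matched by the given regex pattern. The group_index parameter selects which capture group to return; it defaults to 1 (the first capture group), while 0 returns the entire match.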

df = pl.DataFrame(
    {
        "a": [
            "http://vote.com/ballon_dor?candidate=messi&ref=polars",
            "http://vote.com/ballon_dor?candidat=jorginho&ref=polars",
            "http://vote.com/ballon_dor?candidate=ronaldo&ref=polars",
        ]
    }
)
out = df.select(
    a1=pl.col("a").str.extract(r"candidate=(\w+)", group_index=1),
    a2=pl.col("a").str.extract(r"candidate=(\w+)", group_index=0),
    a3=pl.col("a").str.extract(r"candidate=(\w+)"),  # defaults to the first capture group
)
print(out)

shape: (3, 3)
┌─────────┬───────────────────┬─────────┐
│ a1      ┆ a2                ┆ a3      │
│ ---     ┆ ---               ┆ ---     │
│ str     ┆ str               ┆ str     │
╞═════════╪═══════════════════╪═════════╡
│ messi   ┆ candidate=messi   ┆ messi   │
│ null    ┆ null              ┆ null    │
│ ronaldo ┆ candidate=ronaldo ┆ ronaldo │
└─────────┴───────────────────┴─────────┘

To get every substring that matches the regular expression, use the extract_all method; the result is a list.

df = pl.DataFrame({"foo": ["123 bla 45 asd", "xyz 678 910t"]})
out = df.select(
    pl.col("foo").str.extract_all(r"(\d+)").alias("extracted_nrs"),
)
print(out)

shape: (2, 1)
┌────────────────┐
│ extracted_nrs  │
│ ---            │
│ list[str]      │
╞════════════════╡
│ ["123", "45"]  │
│ ["678", "910"] │
└────────────────┘

String replacement

replace: replaces the first match with the new string
replace_all: replaces every match with the new string

df = pl.DataFrame({"id": [1, 2], "text": ["abc123abc", "abc456"]})
out = df.with_columns(
    s1=pl.col("text").str.replace(r"abc\b", "ABC"),  # \b is a word boundary, so only the abc at the end of the string matches
    s2=pl.col("text").str.replace("a", "-"),  # replace only the first a
    s3=pl.col("text").str.replace_all("a", "-", literal=True),  # replace every a
)
print(out)

shape: (2, 5)
┌─────┬───────────┬───────────┬───────────┬───────────┐
│ id  ┆ text      ┆ s1        ┆ s2        ┆ s3        │
│ --- ┆ ---       ┆ ---       ┆ ---       ┆ ---       │
│ i64 ┆ str       ┆ str       ┆ str       ┆ str       │
╞═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 1   ┆ abc123abc ┆ abc123ABC ┆ -bc123abc ┆ -bc123-bc │
│ 2   ┆ abc456    ┆ abc456    ┆ -bc456    ┆ -bc456    │
└─────┴───────────┴───────────┴───────────┴───────────┘


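These are some of the issues I ran into in my own practice, shared here for reference and learning. Feel free to follow the WeChat official account DataShare, where useful content is posted from time to time.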