PySpark UDF

PySpark UDF

  • Integrates native Python libraries with Spark
  • A UDF runs in a separate (Python worker) process
  • Data is passed back and forth between Python and Java (the JVM)
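The points above can be sketched with a row-at-a-time UDF. This is a minimal illustration, not code from the original post: the names `str_length` and `run_demo`, the sample data, and the local session settings are all assumptions. The Spark imports are kept inside `run_demo` so the plain Python function can be tried without a Spark installation.

```python
def str_length(s):
    """Plain Python function; as a UDF, Spark calls it once per row."""
    return None if s is None else len(s)


def run_demo():
    # Hypothetical demo: requires pyspark and starts a local session.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("udf-sketch")
        .getOrCreate()
    )
    df = spark.createDataFrame([("spark",), ("udf",), (None,)], ["word"])

    # Wrapping the function makes Spark ship each row's value to a
    # Python worker process, run str_length there, and send the result back.
    length_udf = udf(str_length, IntegerType())
    df.withColumn("length", length_udf("word")).show()
    spark.stop()
```

Calling `run_demo()` runs the UDF against a small in-memory DataFrame. Every value crosses the JVM/Python boundary individually, which is the overhead the comparison with Pandas UDFs is about.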

PySpark UDF vs Pandas UDF

Existing UDF            | Pandas UDF
Function on Row         | Function on Row, Group and Window
Pickle serialization    | Arrow serialization
Data as Python objects  | Data as pd.Series (for column) and pd.DataFrame (for table)
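To make the column-at-a-time difference concrete, here is a hedged sketch of a scalar Pandas UDF. It assumes Spark 3.0 or later (type-hint style `pandas_udf`) with pandas and pyarrow installed; `celsius_to_fahrenheit`, `run_demo`, and the sample data are hypothetical names chosen for illustration.

```python
import pandas as pd


def celsius_to_fahrenheit(c: pd.Series) -> pd.Series:
    """Vectorized: receives a batch of column values as one pd.Series."""
    return c * 9.0 / 5.0 + 32.0


def run_demo():
    # Hypothetical demo: requires pyspark and pyarrow.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("pandas-udf-sketch")
        .getOrCreate()
    )
    df = spark.createDataFrame([(0.0,), (100.0,)], ["celsius"])

    # The Series -> Series type hints tell Spark this is a scalar
    # Pandas UDF; batches are transferred with Arrow instead of pickle.
    to_f = pandas_udf(celsius_to_fahrenheit, returnType="double")
    df.withColumn("fahrenheit", to_f("celsius")).show()
    spark.stop()
```

The function body is ordinary pandas code, so it can be checked on a plain `pd.Series` before it is registered with Spark.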

Pandas UDF limitations

  • the data must be split into groups before the function is applied
  • each group must fit entirely in memory
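Both limitations show up in a grouped Pandas function, sketched below under some assumptions: it uses `groupBy(...).applyInPandas(...)`, which is available in Spark 3.0 and later (the linked talk predates it and used the older grouped-map `pandas_udf` form), and the names `subtract_group_mean`, `run_demo`, and the columns `key` and `value` are hypothetical.

```python
import pandas as pd


def subtract_group_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    """Receives ALL rows of one group as a single pd.DataFrame."""
    return pdf.assign(value=pdf["value"] - pdf["value"].mean())


def run_demo():
    # Hypothetical demo: requires pyspark and pyarrow.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("grouped-pandas-sketch")
        .getOrCreate()
    )
    df = spark.createDataFrame(
        [("a", 1.0), ("a", 3.0), ("b", 10.0)], ["key", "value"]
    )

    # groupBy splits the data; each group is then materialized in full
    # as one pandas DataFrame, so a very large group can exhaust memory.
    result = df.groupBy("key").applyInPandas(
        subtract_group_mean, schema="key string, value double"
    )
    result.show()
    spark.stop()
```

Because the function only ever sees one complete group at a time, heavily skewed keys are the usual way to run into the memory limit.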

Source

https://databricks.com/session/vectorized-udf-scalable-analysis-with-python-and-pyspark

This post is licensed under CC BY 4.0 by the author.