Technical research memo: Dataflow Python support status (2020-09-17)

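We are building an in-house data platform, and I looked into whether our Dataflow pipelines could be rewritten in Python, mainly because Python makes data transformations easier to express and the code easier to write.
As far as I could tell, the features we want to use are already supported.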

Dynamic Destinations

Used to write to BigQuery dynamically, choosing the destination table by name for each record. This already appears to be usable from Python.

https://beam.apache.org/releases/pydoc/2.23.0/apache_beam.io.gcp.bigquery.html#writing-data-to-bigquery
https://beam.apache.org/releases/pydoc/2.23.0/apache_beam.io.gcp.bigquery.html#apache_beam.io.gcp.bigquery.WriteToBigQuery
https://github.com/apache/beam/pull/7677/files

↓ The page below has not been updated to reflect this, though.
https://beam.apache.org/documentation/io/built-in/google-bigquery/#using-dynamic-destinations
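
As a rough illustration of dynamic destinations in Python, here is a minimal sketch that passes a callable as the `table` argument of `WriteToBigQuery`, so the table is chosen per element. The project, dataset, table prefix, and the `type` / `payload` fields are hypothetical names made up for this memo, and I have not run this against our environment.

```python
def table_for(row):
    """Return the destination table spec for one row.

    The naming scheme (my-project:my_dataset.events_<type>) is hypothetical.
    """
    return "my-project:my_dataset.events_{}".format(row["type"])


def build_pipeline(rows):
    """Build a pipeline that routes each row to the table chosen by table_for."""
    # Imported lazily so table_for can be used and tested without Beam installed.
    import apache_beam as beam

    pipeline = beam.Pipeline()
    (
        pipeline
        | "CreateRows" >> beam.Create(rows)
        | "WriteDynamic" >> beam.io.WriteToBigQuery(
            table=table_for,  # a callable: called once per element
            schema="type:STRING,payload:STRING",  # assumed schema, same for every table
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
    return pipeline
```

For example, `table_for({"type": "click", "payload": "x"})` yields `my-project:my_dataset.events_click`. Keeping the routing logic in a plain function makes it easy to unit test separately from the pipeline.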

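Output to a BigQuery error_record table

A way to detect records that fail on BigQuery insert in Python and write them out to a separate table. I confirmed that the Google-provided templates have a similar method.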

https://beam.apache.org/documentation/patterns/bigqueryio/
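
A minimal sketch of this pattern, following the Python example on the patterns page above: read the `FailedRows` output of a streaming-insert write and send it to an error table. The helper names, the error table schema, and the assumption that each failed element is a `(destination, row)` tuple are mine and should be checked against the SDK version in use.

```python
import json


def to_error_record(failed):
    """Convert one failed insert into a row for the error table.

    Assumes the element is a (destination, row) tuple.
    """
    destination, row = failed
    return {
        "destination": str(destination),
        # default=str keeps values that JSON cannot encode natively.
        "payload": json.dumps(row, sort_keys=True, default=str),
    }


def write_with_error_table(rows, table, error_table, schema):
    """Write rows to `table` and route rows that fail to `error_table`."""
    # Imported lazily so to_error_record can be tested without Beam installed.
    import apache_beam as beam

    result = rows | "WriteMain" >> beam.io.WriteToBigQuery(
        table,
        schema=schema,
        method="STREAMING_INSERTS",  # failed rows come from streaming inserts
        insert_retry_strategy="RETRY_ON_TRANSIENT_ERROR",
    )
    return (
        result["FailedRows"]
        | "ToErrorRecord" >> beam.Map(to_error_record)
        | "WriteErrors" >> beam.io.WriteToBigQuery(
            error_table,
            schema="destination:STRING,payload:STRING",  # hypothetical schema
        )
    )
```

Storing the original row as a JSON string keeps the error table schema independent of the schemas of the source tables.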

Streaming Engine

Supported in Python.

Note: Streaming Engine requires the Apache Beam SDK for Python, version 2.16.0 or later.

https://beam.apache.org/releases/pydoc/2.23.0/apache_beam.io.gcp.bigquery.html#writing-data-to-bigquery
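
A small sketch of how this could be wired up: a version check based on the 2.16.0 requirement noted above, plus a list of pipeline flags. The helper names are hypothetical, and the `--enable_streaming_engine` flag is my understanding of the Python option, so it should be confirmed against the Dataflow documentation for the SDK version actually used.

```python
def sdk_supports_streaming_engine(version):
    """Return True if a Beam Python SDK version string is 2.16.0 or later."""
    nums = [int(p) for p in version.split(".")[:3]]
    nums += [0] * (3 - len(nums))  # pad "2.16" to [2, 16, 0]
    return tuple(nums) >= (2, 16, 0)


def streaming_engine_args(project, region, temp_location):
    """Build command-line style flags for a streaming Dataflow job."""
    return [
        "--runner=DataflowRunner",
        "--project={}".format(project),
        "--region={}".format(region),
        "--temp_location={}".format(temp_location),
        "--streaming",
        "--enable_streaming_engine",  # assumed flag name, verify before use
    ]
```

The returned list can be passed to `apache_beam.options.pipeline_options.PipelineOptions(...)` when constructing the pipeline.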

Message deduplication

Supported.

id_label: The attribute on incoming Pub/Sub messages to use as a unique record identifier. When specified, the value of this attribute (which can be any string that uniquely identifies the record) will be used for deduplication of messages. If not provided, we cannot guarantee that no duplicate data will be delivered on the Pub/Sub stream. In this case, deduplication of the stream will be strictly best effort.

https://beam.apache.org/releases/pydoc/2.23.0/_modules/apache_beam/io/gcp/pubsub.html
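
A minimal sketch of using `id_label` with `ReadFromPubSub`. The attribute name `event_id` and the helper names are hypothetical; the publisher would need to set that attribute on every message. As far as I know, this relies on the Dataflow runner, so local runners may not accept `id_label`.

```python
def pubsub_read_kwargs(subscription, id_attribute="event_id"):
    """Build keyword arguments for ReadFromPubSub with deduplication enabled.

    `id_attribute` is the message attribute that holds a unique record id
    (hypothetical default name).
    """
    return {
        "subscription": subscription,
        "id_label": id_attribute,
        "with_attributes": True,  # keep attributes available downstream
    }


def read_deduplicated(pipeline, subscription):
    """Attach a Pub/Sub read that deduplicates on the id attribute."""
    # Imported lazily so pubsub_read_kwargs can be tested without Beam installed.
    import apache_beam as beam

    return pipeline | "ReadPubSub" >> beam.io.ReadFromPubSub(
        **pubsub_read_kwargs(subscription)
    )
```

If `id_label` is omitted, the quoted documentation says deduplication is best effort only.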