diff --git a/website/www/site/content/en/documentation/io/managed-io.md b/website/www/site/content/en/documentation/io/managed-io.md index 46f20c5e11ab..ec3bb90183d4 100644 --- a/website/www/site/content/en/documentation/io/managed-io.md +++ b/website/www/site/content/en/documentation/io/managed-io.md @@ -121,23 +121,25 @@ and Beam SQL is invoked via the Managed API under the hood. table (str)
+ autosharding (boolean)
catalog_name (str)
catalog_properties (map[str, str])
config_properties (map[str, str])
direct_write_byte_limit (int32)
+ distribution_mode (str)
drop (list[str])
keep (list[str])
only (str)
partition_fields (list[str])
+ sort_fields (list[str])
table_properties (map[str, str])
triggering_frequency_seconds (int32)
- MYSQL + SQLSERVER jdbc_url (str)
- connection_init_sql (list[str])
connection_properties (str)
disable_auto_commit (boolean)
fetch_size (int32)
@@ -153,7 +155,6 @@ and Beam SQL is invoked via the Managed API under the hood. jdbc_url (str)
autosharding (boolean)
batch_size (int64)
- connection_init_sql (list[str])
connection_properties (str)
location (str)
password (str)
@@ -187,9 +188,28 @@ and Beam SQL is invoked via the Managed API under the hood. - SQLSERVER + BIGQUERY + + kms_key (str)
+ query (str)
+ row_restriction (str)
+ fields (list[str])
+ table (str)
+ + + table (str)
+ drop (list[str])
+ keep (list[str])
+ kms_key (str)
+ only (str)
+ triggering_frequency_seconds (int64)
+ + + + MYSQL jdbc_url (str)
+ connection_init_sql (list[str])
connection_properties (str)
disable_auto_commit (boolean)
fetch_size (int32)
@@ -205,6 +225,7 @@ and Beam SQL is invoked via the Managed API under the hood. jdbc_url (str)
autosharding (boolean)
batch_size (int64)
+ connection_init_sql (list[str])
connection_properties (str)
location (str)
password (str)
@@ -212,24 +233,6 @@ and Beam SQL is invoked via the Managed API under the hood. write_statement (str)
- - BIGQUERY - - kms_key (str)
- query (str)
- row_restriction (str)
- fields (list[str])
- table (str)
- - - table (str)
- drop (list[str])
- keep (list[str])
- kms_key (str)
- only (str)
- triggering_frequency_seconds (int64)
- - @@ -401,7 +404,7 @@ and Beam SQL is invoked via the Managed API under the hood. -### `KAFKA` Write +### `KAFKA` Read
@@ -418,228 +421,228 @@ and Beam SQL is invoked via the Managed API under the hood. str -
- A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. | Format: host1:port1,host2:port2,... + A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form `host1:port1,host2:port2,...`
- format + topic str - The encoding format for the data stored in Kafka. Valid options are: RAW,JSON,AVRO,PROTO + n/a
- topic + allow_duplicates - str + boolean - n/a + If the Kafka read allows duplicates.
- file_descriptor_path + confluent_schema_registry_subject str - The path to the Protocol Buffer File Descriptor Set file. This file is used for schema definition and message serialization. + n/a
- message_name + confluent_schema_registry_url str - The name of the Protocol Buffer message to be used for schema extraction and data conversion. + n/a
- producer_config_updates + consumer_config_updates map[str, str] - A list of key-value pairs that act as configuration parameters for Kafka producers. Most of these configurations will not be needed, but if you need to customize your Kafka producer, you may use this. See a detailed list: https://docs.confluent.io/platform/current/installation/configuration/producer-configs.html + A list of key-value pairs that act as configuration parameters for Kafka consumers. Most of these configurations will not be needed, but if you need to customize your Kafka consumer, you may use this. See a detailed list: https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html
- schema + file_descriptor_path str - n/a + The path to the Protocol Buffer File Descriptor Set file. This file is used for schema definition and message serialization.
-
- -### `KAFKA` Read - -
- - - - - - +
ConfigurationTypeDescription
- bootstrap_servers + format str - A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form `host1:port1,host2:port2,...` + The encoding format for the data stored in Kafka. Valid options are: RAW,STRING,AVRO,JSON,PROTO
- topic + message_name str - n/a + The name of the Protocol Buffer message to be used for schema extraction and data conversion.
- allow_duplicates + offset_deduplication boolean - If the Kafka read allows duplicates. + If the redistribute is using offset deduplication mode.
- confluent_schema_registry_subject + redistribute_by_record_key - str + boolean - n/a + If the redistribute keys by the Kafka record key.
- confluent_schema_registry_url + redistribute_num_keys - str + int32 - n/a + The number of keys for redistributing Kafka inputs.
- consumer_config_updates + redistributed - map[str, str] + boolean - A list of key-value pairs that act as configuration parameters for Kafka consumers. Most of these configurations will not be needed, but if you need to customize your Kafka consumer, you may use this. See a detailed list: https://docs.confluent.io/platform/current/installation/configuration/consumer-configs.html + If the Kafka read should be redistributed.
- file_descriptor_path + schema str - The path to the Protocol Buffer File Descriptor Set file. This file is used for schema definition and message serialization. + The schema in which the data is encoded in the Kafka topic. For AVRO data, this is a schema defined with AVRO schema syntax (https://avro.apache.org/docs/1.10.2/spec.html#schemas). For JSON data, this is a schema defined with JSON-schema syntax (https://json-schema.org/). If a URL to Confluent Schema Registry is provided, then this field is ignored, and the schema is fetched from Confluent Schema Registry.
+
+ +### `KAFKA` Write + +
+ + + + + + @@ -650,7 +653,7 @@ and Beam SQL is invoked via the Managed API under the hood. str
ConfigurationTypeDescription
- format + bootstrap_servers str - The encoding format for the data stored in Kafka. Valid options are: RAW,STRING,AVRO,JSON,PROTO + A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. | Format: host1:port1,host2:port2,...
- message_name + format str - The name of the Protocol Buffer message to be used for schema extraction and data conversion. + The encoding format for the data stored in Kafka. Valid options are: RAW,JSON,AVRO,PROTO
- offset_deduplication + topic - boolean + str - If the redistribute is using offset deduplication mode. + n/a
- redistribute_by_record_key + file_descriptor_path - boolean + str - If the redistribute keys by the Kafka record key. + The path to the Protocol Buffer File Descriptor Set file. This file is used for schema definition and message serialization.
- redistribute_num_keys + message_name - int32 + str - The number of keys for redistributing Kafka inputs. + The name of the Protocol Buffer message to be used for schema extraction and data conversion.
- redistributed + producer_config_updates - boolean + map[str, str] - If the Kafka read should be redistributed. + A list of key-value pairs that act as configuration parameters for Kafka producers. Most of these configurations will not be needed, but if you need to customize your Kafka producer, you may use this. See a detailed list: https://docs.confluent.io/platform/current/installation/configuration/producer-configs.html
- The schema in which the data is encoded in the Kafka topic. For AVRO data, this is a schema defined with AVRO schema syntax (https://avro.apache.org/docs/1.10.2/spec.html#schemas). For JSON data, this is a schema defined with JSON-schema syntax (https://json-schema.org/). If a URL to Confluent Schema Registry is provided, then this field is ignored, and the schema is fetched from Confluent Schema Registry. + n/a
@@ -765,6 +768,17 @@ and Beam SQL is invoked via the Managed API under the hood. A fully-qualified table identifier. You may also provide a template to write to multiple dynamic destinations, for example: `dataset.my_{col1}_{col2.nested}_table`. + + + autosharding + + + boolean + + + Enables dynamic sharding to automatically adjust the number of parallel writers based on data volume. It handles data skew by further sub-dividing partitions into multiple shards to prevent bottlenecks during high-throughput writes. Only available with 'hash' distribution mode. + + catalog_name @@ -809,6 +823,19 @@ and Beam SQL is invoked via the Managed API under the hood. For a streaming pipeline, sets the limit for lifting bundles into the direct write path. + + + distribution_mode + + + str + + + Defines distribution of write data. Supported distributions: +- none: don't shuffle rows (default) +- hash: shuffle rows by partition key before writing data + + drop @@ -864,6 +891,18 @@ and Beam SQL is invoked via the Managed API under the hood. For more information on partition transforms, please visit https://iceberg.apache.org/spec/#partition-transforms. + + + sort_fields + + + list[str] + + + Fields used to set the table's sort order, applied when the table is created. Each entry has the form ` [asc|desc] [nulls first|nulls last]`, where `` is a field name or one of the partition transforms (e.g. `bucket(col, 4)`, `day(ts)`). Direction defaults to ascending; null order defaults to nulls-first for ascending and nulls-last for descending. Note: this sets the table's declared sort order as metadata; it does not cause Beam to physically sort records before writing. +For more information on sort orders, please visit https://iceberg.apache.org/spec/#sort-orders. + + table_properties @@ -890,7 +929,7 @@ For more information on table properties, please visit https://iceberg.apache.or
-### `MYSQL` Read +### `SQLSERVER` Read
@@ -910,17 +949,6 @@ For more information on table properties, please visit https://iceberg.apache.or Connection URL for the JDBC source. - - - - -
- connection_init_sql - - list[str] - - Sets the connection init sql statements used by the Driver. Only MySQL and MariaDB support this. -
connection_properties @@ -1034,7 +1062,7 @@ For more information on table properties, please visit https://iceberg.apache.or
-### `MYSQL` Write +### `SQLSERVER` Write
@@ -1076,17 +1104,6 @@ For more information on table properties, please visit https://iceberg.apache.or n/a - - - - -
- connection_init_sql - - list[str] - - Sets the connection init sql statements used by the Driver. Only MySQL and MariaDB support this. -
connection_properties @@ -1145,7 +1162,7 @@ For more information on table properties, please visit https://iceberg.apache.or
-### `POSTGRES` Write +### `POSTGRES` Read
@@ -1162,173 +1179,173 @@ For more information on table properties, please visit https://iceberg.apache.or str -
- Connection URL for the JDBC sink. + Connection URL for the JDBC source.
- autosharding + connection_properties - boolean + str - If true, enables using a dynamically determined number of shards to write. + Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;".
- batch_size + fetch_size - int64 + int32 - n/a + This method is used to override the size of the data that is going to be fetched and loaded in memory per every database call. It should ONLY be used if the default value throws memory errors.
- connection_properties + location str - Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;". + Name of the table to read from.
- location + num_partitions - str + int32 - Name of the table to write to. + The number of partitions
- password + output_parallelization - str + boolean - Password for the JDBC source. + Whether to reshuffle the resulting PCollection so results are distributed to all workers.
- username + partition_column str - Username for the JDBC source. + Name of a column of numeric type that will be used for partitioning.
- write_statement + password str - SQL query used to insert records into the JDBC sink. + Password for the JDBC source.
-
- -### `POSTGRES` Read - -
- - - - - - +
ConfigurationTypeDescription
- jdbc_url + read_query str - Connection URL for the JDBC source. + SQL query used to query the JDBC source.
- connection_properties + username str - Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;". + Username for the JDBC source.
+
+ +### `POSTGRES` Write + +
+ + + + + + @@ -1344,30 +1361,30 @@ For more information on table properties, please visit https://iceberg.apache.or
ConfigurationTypeDescription
- fetch_size + jdbc_url - int32 + str - This method is used to override the size of the data that is going to be fetched and loaded in memory per every database call. It should ONLY be used if the default value throws memory errors. + Connection URL for the JDBC sink.
- location + autosharding - str + boolean - Name of the table to read from. + If true, enables using a dynamically determined number of shards to write.
- num_partitions + batch_size - int32 + int64 - The number of partitions + n/a
- output_parallelization + connection_properties - boolean + str - Whether to reshuffle the resulting PCollection so results are distributed to all workers. + Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;".
- partition_column + location str - Name of a column of numeric type that will be used for partitioning. + Name of the table to write to.
- read_query + username str - SQL query used to query the JDBC source. + Username for the JDBC source.
- username + write_statement str - Username for the JDBC source. + SQL query used to insert records into the JDBC sink.
-### `SQLSERVER` Read +### `BIGQUERY` Write
@@ -1378,129 +1395,141 @@ For more information on table properties, please visit https://iceberg.apache.or +
- jdbc_url + table str - Connection URL for the JDBC source. + The bigquery table to write to. Format: [${PROJECT}:]${DATASET}.${TABLE}
- connection_properties + drop - str + list[str] - Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;". + A list of field names to drop from the input record before writing. Is mutually exclusive with 'keep' and 'only'.
- disable_auto_commit + keep - boolean + list[str] - Whether to disable auto commit on read. Defaults to true if not provided. The need for this config varies depending on the database platform. Informix requires this to be set to false while Postgres requires this to be set to true. + A list of field names to keep in the input record. All other fields are dropped before writing. Is mutually exclusive with 'drop' and 'only'.
- fetch_size + kms_key - int32 + str - This method is used to override the size of the data that is going to be fetched and loaded in memory per every database call. It should ONLY be used if the default value throws memory errors. + Use this Cloud KMS key to encrypt your data
- location + only str - Name of the table to read from. + The name of a single record field that should be written. Is mutually exclusive with 'keep' and 'drop'.
- num_partitions + triggering_frequency_seconds - int32 + int64 - The number of partitions + Determines how often to 'commit' progress into BigQuery. Default is every 5 seconds.
+
+ +### `BIGQUERY` Read + +
+ + + + + +
ConfigurationTypeDescription
- output_parallelization + kms_key - boolean + str - Whether to reshuffle the resulting PCollection so results are distributed to all workers. + Use this Cloud KMS key to encrypt your data
- partition_column + query str - Name of a column of numeric type that will be used for partitioning. + The SQL query to be executed to read from the BigQuery table.
- password + row_restriction str - Password for the JDBC source. + Read only rows that match this filter, which must be compatible with Google standard SQL. This is not supported when reading via query.
- read_query + fields - str + list[str] - SQL query used to query the JDBC source. + Read only the specified fields (columns) from a BigQuery table. Fields may not be returned in the order specified. If no value is specified, then all fields are returned. Example: "col1, col2, col3"
- username + table str - Username for the JDBC source. + The fully-qualified name of the BigQuery table to read from. Format: [${PROJECT}:]${DATASET}.${TABLE}
-### `SQLSERVER` Write +### `MYSQL` Read
@@ -1517,29 +1546,18 @@ For more information on table properties, please visit https://iceberg.apache.or str - - - - - @@ -1555,119 +1573,107 @@ For more information on table properties, please visit https://iceberg.apache.or -
- Connection URL for the JDBC sink. -
- autosharding - - boolean - - If true, enables using a dynamically determined number of shards to write. + Connection URL for the JDBC source.
- batch_size + connection_init_sql - int64 + list[str] - n/a + Sets the connection init sql statements used by the Driver. Only MySQL and MariaDB support this.
- location + disable_auto_commit - str + boolean - Name of the table to write to. + Whether to disable auto commit on read. Defaults to true if not provided. The need for this config varies depending on the database platform. Informix requires this to be set to false while Postgres requires this to be set to true.
- password + fetch_size - str + int32 - Password for the JDBC source. + This method is used to override the size of the data that is going to be fetched and loaded in memory per every database call. It should ONLY be used if the default value throws memory errors.
- username + location str - Username for the JDBC source. + Name of the table to read from.
- write_statement + num_partitions - str + int32 - SQL query used to insert records into the JDBC sink. + The number of partitions
-
- -### `BIGQUERY` Read - -
- - - - - -
ConfigurationTypeDescription
- kms_key + output_parallelization - str + boolean - Use this Cloud KMS key to encrypt your data + Whether to reshuffle the resulting PCollection so results are distributed to all workers.
- query + partition_column str - The SQL query to be executed to read from the BigQuery table. + Name of a column of numeric type that will be used for partitioning.
- row_restriction + password str - Read only rows that match this filter, which must be compatible with Google standard SQL. This is not supported when reading via query. + Password for the JDBC source.
- fields + read_query - list[str] + str - Read only the specified fields (columns) from a BigQuery table. Fields may not be returned in the order specified. If no value is specified, then all fields are returned. Example: "col1, col2, col3" + SQL query used to query the JDBC source.
- table + username str - The fully-qualified name of the BigQuery table to read from. Format: [${PROJECT}:]${DATASET}.${TABLE} + Username for the JDBC source.
-### `BIGQUERY` Write +### `MYSQL` Write
@@ -1678,68 +1684,101 @@ For more information on table properties, please visit https://iceberg.apache.or + + + + + + + + + + + + + + +
- table + jdbc_url str - The bigquery table to write to. Format: [${PROJECT}:]${DATASET}.${TABLE} + Connection URL for the JDBC sink.
- drop + autosharding - list[str] + boolean - A list of field names to drop from the input record before writing. Is mutually exclusive with 'keep' and 'only'. + If true, enables using a dynamically determined number of shards to write.
- keep + batch_size + + int64 + + n/a +
+ connection_init_sql list[str] - A list of field names to keep in the input record. All other fields are dropped before writing. Is mutually exclusive with 'drop' and 'only'. + Sets the connection init sql statements used by the Driver. Only MySQL and MariaDB support this.
- kms_key + connection_properties str - Use this Cloud KMS key to encrypt your data + Used to set connection properties passed to the JDBC driver not already defined as standalone parameter (e.g. username and password can be set using parameters above accordingly). Format of the string must be "key1=value1;key2=value2;".
- only + location str - The name of a single record field that should be written. Is mutually exclusive with 'keep' and 'drop'. + Name of the table to write to.
- triggering_frequency_seconds + password - int64 + str - Determines how often to 'commit' progress into BigQuery. Default is every 5 seconds. + Password for the JDBC source. +
+ username + + str + + Username for the JDBC source. +
+ write_statement + + str + + SQL query used to insert records into the JDBC sink.