Data parsing by TFXIO sources is configured by the Schema. Among other things, the schema is converted to a "feature spec" (sometimes called a "parsing spec"), which is a dict whose keys are feature names and whose values are one of tf.io.FixedLenFeature, tf.io.VarLenFeature, tf.io.SparseFeature, or tf.io.RaggedFeature.

A feature spec is conventionally used by tf.io.parse_example and tf.io.parse_sequence_example to convert serialized Example and SequenceExample protos into batched tensors. This doc describes how the conversion and parsing are performed.

Although TFXIOs don't typically use the TensorFlow ops for parsing in a Beam pipeline and the data format is not restricted to the two proto types, the result is always consistent with that of the parsing ops. Therefore, we express the schema interpretation logic in terms of feature specs that essentially describe how tensors are constructed from primitives.

The key schema objects that describe parsing are Feature, SparseFeature, and TensorRepresentation. Normally, only the first two are set by TFXIO users to control the produced tensors, but the last one can also be used for more fine-grained control. The following sections describe how these are mapped to a feature spec and give some examples of advanced usage.

A Feature is mapped to a feature spec entry as follows:

- If the feature has a fixed shape (i.e. all dimensions have non-negative integer sizes), then the feature must always be present (presence.min_fraction == 1.0), and a tf.io.FixedLenFeature with the given shape and type will be produced for it.
- If the feature has no fixed shape and represent_variable_length_as_ragged is set to true (defaults to false), then a tf.io.RaggedFeature with no partitions and value key being the feature name will be produced.
- Otherwise, a tf.io.VarLenFeature of the corresponding type will be produced.

TensorFlow types and default values are derived according to the following table:

| Schema type | TF type | Default value |
|---|---|---|
| BYTES | tf.string | b'' |
| INT | tf.int64 | -1 |
| FLOAT | tf.float32 | -1.0 |
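
The mapping rules and the type table above can be sketched in plain Python. This is a minimal illustration, not the TFXIO implementation: `FeatureInfo` and `describe_feature_spec` are hypothetical names, the shape and presence fields are simplified stand-ins for the schema proto, and the result is a descriptive tuple rather than an actual `tf.io` object.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Schema type -> (TensorFlow dtype name, default value), copied from the table.
TYPE_TABLE = {
    "BYTES": ("tf.string", b""),
    "INT": ("tf.int64", -1),
    "FLOAT": ("tf.float32", -1.0),
}


@dataclass
class FeatureInfo:
    """Hypothetical, simplified stand-in for a schema Feature."""
    name: str
    type: str                                # "BYTES", "INT" or "FLOAT"
    shape: Optional[Tuple[int, ...]] = None  # None means "no fixed shape"
    min_fraction: float = 0.0                # stands in for presence.min_fraction


def describe_feature_spec(feature, represent_variable_length_as_ragged=False):
    """Returns (spec kind, dtype name, detail) following the three rules."""
    dtype, _default = TYPE_TABLE[feature.type]
    has_fixed_shape = feature.shape is not None and all(
        isinstance(d, int) and d >= 0 for d in feature.shape)
    if has_fixed_shape:
        # Rule 1: a fixed shape requires the feature to always be present.
        if feature.min_fraction != 1.0:
            raise ValueError(
                f"{feature.name}: fixed shape requires min_fraction == 1.0")
        return ("FixedLenFeature", dtype, feature.shape)
    if represent_variable_length_as_ragged:
        # Rule 2: ragged, no partitions, value key is the feature name.
        return ("RaggedFeature", dtype, feature.name)
    # Rule 3: everything else is variable-length.
    return ("VarLenFeature", dtype, None)


print(describe_feature_spec(FeatureInfo("x", "FLOAT", (2, 3), 1.0)))
print(describe_feature_spec(FeatureInfo("y", "BYTES"), True))
print(describe_feature_spec(FeatureInfo("z", "INT")))
```

The three calls exercise the fixed-shape, ragged, and variable-length branches in that order.
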
STRUCT features are typically used for SequenceExample inputs. At the moment only one level of nesting is supported, and leaf features from the StructDomain of the feature result in top-level tf.io.RaggedFeatures.
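
As a rough sketch of that flattening, the snippet below lifts the leaves of a single STRUCT feature into top-level entries. The dict layout and the name `flatten_struct_feature` are hypothetical and only mirror the schema fields loosely; the output describes the resulting entries instead of building real `tf.io.RaggedFeature` objects.

```python
# Schema type -> TensorFlow dtype name.
DTYPES = {"BYTES": "tf.string", "INT": "tf.int64", "FLOAT": "tf.float32"}


def flatten_struct_feature(feature):
    """Maps the leaves of one STRUCT feature to top-level ragged entries.

    `feature` is a hypothetical dict mirror of a schema Feature:
    {"name": ..., "type": "STRUCT", "struct_domain": [leaf, ...]}.
    """
    if feature["type"] != "STRUCT":
        raise ValueError("expected a STRUCT feature")
    spec = {}
    for leaf in feature["struct_domain"]:
        if leaf["type"] == "STRUCT":
            raise ValueError("only one level of nesting is supported")
        # Each leaf becomes a top-level entry keyed by its own name.
        spec[leaf["name"]] = {
            "kind": "RaggedFeature",
            "dtype": DTYPES[leaf["type"]],
            "value_key": leaf["name"],
        }
    return spec


sequence_feature = {
    "name": "##SEQUENCE##",
    "type": "STRUCT",
    "struct_domain": [
        {"name": "seq_int_feature", "type": "INT"},
        {"name": "seq_string_feature", "type": "BYTES"},
    ],
}
flattened = flatten_struct_feature(sequence_feature)
print(sorted(flattened))
```
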

A SparseFeature results in a tf.io.SparseFeature with the given value and index columns. The dense shape is derived from the index columns' int_domains.
SparseFeatures are only supported with generate_legacy_feature_spec: false.
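
The snippet below sketches how a dense shape might be derived from index domains and how index columns line up with values for one example. It rests on an assumption that the text above does not state: indices are zero-based, so the size of each dimension is taken as max + 1. Both function names are hypothetical and are not part of TFXIO.

```python
def dense_shape_from_int_domains(int_domains):
    """int_domains: one (min, max) pair per index column, in order.

    Assumption: indices are zero-based, so each dimension has size max + 1.
    """
    shape = []
    for lo, hi in int_domains:
        if lo != 0 or hi < lo:
            raise ValueError(f"unsupported int_domain: min={lo}, max={hi}")
        shape.append(hi + 1)
    return shape


def to_coordinates(index_columns, values, dense_shape):
    """Pairs each value of one example with its multi-dimensional index."""
    entries = {}
    for *index, value in zip(*index_columns, values):
        if any(not 0 <= i < size for i, size in zip(index, dense_shape)):
            raise ValueError(f"index {index} is outside shape {dense_shape}")
        entries[tuple(index)] = value
    return entries


# Two index columns with domains [0, 9] and [0, 19].
shape = dense_shape_from_int_domains([(0, 9), (0, 19)])
entries = to_coordinates([[1, 3], [0, 19]], [0.5, 2.0], shape)
print(shape, entries)
```

Under the stated assumption, the two domains give a 10 by 20 dense shape, and the two values land at positions (1, 0) and (3, 19).
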

TFXIO uses TensorRepresentations to infer the parsing spec from the schema Features and SparseFeatures. Manually specifying TensorRepresentations in the schema allows more refined control of the input data representation. Here are a few examples:

feature {
  name: "varlen"
  type: BYTES
}
tensor_representation_group {
  key: ""
  value {
    tensor_representation {
      key: "varlen"
      value {
        ragged_tensor {
          feature_path { step: "varlen" }
          row_partition_dtype: INT64
        }
      }
    }
  }
}

Produced feature spec:

{
    "varlen":
        tf.io.RaggedFeature(
            value_key="varlen",
            dtype=tf.string,
            partitions=(),
            row_splits_dtype=tf.int64)
}

feature {
  name: "value"
  type: BYTES
}
feature {
  name: "row_length"
  type: INT
}
tensor_representation_group {
  key: ""
  value {
    tensor_representation {
      key: "ragged"
      value {
        ragged_tensor {
          feature_path { step: "value" }
          partition { row_length: "row_length" }
          row_partition_dtype: INT64
        }
      }
    }
  }
}

Produced feature spec:

{
    "ragged":
        tf.io.RaggedFeature(
            value_key="value",
            dtype=tf.string,
            partitions=(tf.io.RaggedFeature.RowLengths("row_length"),),
            row_splits_dtype=tf.int64)
}

feature {
  name: "value"
  type: FLOAT
}
feature {
  name: "index0"
  type: INT
}
feature {
  name: "index1"
  type: INT
}
tensor_representation_group {
  key: ""
  value {
    tensor_representation {
      key: "sparse"
      value {
        sparse_tensor {
          index_column_names: ["index0", "index1"]
          value_column_name: "value"
          dense_shape {
            dim {
              size: 10
            }
            dim {
              size: 20
            }
          }
          already_sorted: true
        }
      }
    }
  }
}

Produced feature spec:

{
    "sparse":
        tf.io.SparseFeature(
            index_key=["index0", "index1"],
            value_key="value",
            dtype=tf.float32,
            size=[10, 20],
            already_sorted=True)
}

One of the applications of STRUCT features is SequenceExample reading and parsing. In this case, context features should be declared at the top level, while sequence features should be declared in the struct_domain of a feature named "##SEQUENCE##":

feature {
  name: "##SEQUENCE##"
  type: STRUCT
  struct_domain {
    feature {
      name: "seq_int_feature"
      type: INT
      value_count {
        min: 0
        max: 2
      }
    }
    feature {
      name: "seq_string_feature"
      type: BYTES
      value_count {
        min: 0
        max: 2
      }
    }
  }
}
tensor_representation_group {
  key: ""
  value {
    tensor_representation {
      key: "seq_string_feature"
      value {
        ragged_tensor {
          feature_path {
            step: "##SEQUENCE##" step: "seq_string_feature"
          }
        }
      }
    }
    tensor_representation {
      key: "seq_int_feature"
      value {
        ragged_tensor {
          feature_path {
            step: "##SEQUENCE##" step: "seq_int_feature"
          }
        }
      }
    }
  }
}

Resulting parsing spec:

{
    "seq_string_feature":
        tf.io.RaggedFeature(
            dtype=tf.string,
            value_key="seq_string_feature",
            row_splits_dtype=tf.int64,
            partitions=[]),
    "seq_int_feature":
        tf.io.RaggedFeature(
            dtype=tf.int64,
            value_key="seq_int_feature",
            row_splits_dtype=tf.int64,
            partitions=[])
}
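
The row-length partition used in the earlier "ragged" example pairs a flat list of values with a list of row lengths. The sketch below shows that idea for a single example in plain Python. `split_by_row_lengths` is a hypothetical helper, not part of TFXIO or TensorFlow, and the sample values are made up for illustration.

```python
def split_by_row_lengths(values, row_lengths):
    """Splits a flat list of values into rows of the given lengths."""
    if sum(row_lengths) != len(values):
        raise ValueError("row lengths must add up to the number of values")
    rows, start = [], 0
    for length in row_lengths:
        rows.append(values[start:start + length])
        start += length
    return rows


# "value" holds three strings, "row_length" says how to group them.
example_rows = split_by_row_lengths([b"a", b"b", b"c"], [2, 0, 1])
print(example_rows)  # [[b'a', b'b'], [], [b'c']]
```

Parsing a batch of examples adds one more outer dimension on top of the rows shown here.
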