TFXIO's schema interpretation

[TOC]

Data parsing by TFXIO sources is configured by a Schema. Among other things, the schema is converted to a "feature spec" (sometimes called a "parsing spec"), which is a dict whose keys are feature names and whose values are one of tf.io.FixedLenFeature, tf.io.VarLenFeature, tf.io.SparseFeature, or tf.io.RaggedFeature.

A feature spec is conventionally passed to tf.io.parse_example and tf.io.parse_sequence_example to convert serialized Example and SequenceExample protos into batched tensors. This doc describes how the schema is converted to a feature spec and how the parsing is performed.

Although TFXIOs don't typically use the TensorFlow ops for parsing in a Beam pipeline and the data format is not restricted to the two proto types, the result is always consistent with that of the parsing ops. Therefore, we express the schema interpretation logic in terms of feature specs that essentially describe how tensors are constructed from primitives.

The key schema objects that describe parsing are Feature, SparseFeature, and TensorRepresentation. Normally, TFXIO users set only the first two to control the produced tensors, but the last one can be used for finer-grained control. The following sections describe how these are mapped to a feature spec and give some examples of advanced usage.

Feature

Primitive types

If the feature has a fixed shape (i.e. all dimensions have non-negative integer sizes), then the feature must always be present (presence.min_fraction == 1.0), and a tf.io.FixedLenFeature with the given shape and type will be produced for it.
If the feature has no fixed shape and represent_variable_length_as_ragged is set to true (defaults to false), then tf.io.RaggedFeature with no partitions and value key being the feature name will be produced.
Otherwise, a tf.io.VarLenFeature of the corresponding type will be produced.

TensorFlow types and default values are derived from the schema type according to the following table:

Schema type	TF type	Default value
BYTES	tf.string	b''
INT	tf.int64	-1
FLOAT	tf.float32	-1.0
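
The three rules and the type table above can be summarised as a small decision function. This is a minimal pure-Python sketch, not TFXIO's implementation: `primitive_feature_spec`, `TYPE_TABLE`, and the dictionary it returns are hypothetical names, and spec kinds are given by name rather than as real tf.io objects so that the snippet runs without TensorFlow.

```python
# Hypothetical sketch of the rules above; not the TFXIO implementation.
# Mirrors the table: schema type -> (TF dtype name, default value).
TYPE_TABLE = {
    "BYTES": ("tf.string", b""),
    "INT": ("tf.int64", -1),
    "FLOAT": ("tf.float32", -1.0),
}


def primitive_feature_spec(name, schema_type, shape=None,
                           represent_variable_length_as_ragged=False):
    """Names the tf.io spec kind a primitive schema feature would map to.

    `shape` is a list of dimension sizes, or None when the schema feature
    declares no shape.
    """
    dtype, _default = TYPE_TABLE[schema_type]
    # A shape is "fixed" when every dimension is a non-negative integer.
    fixed = shape is not None and all(
        isinstance(d, int) and d >= 0 for d in shape)
    if fixed:
        return {"kind": "tf.io.FixedLenFeature",
                "shape": list(shape), "dtype": dtype}
    if represent_variable_length_as_ragged:
        return {"kind": "tf.io.RaggedFeature", "value_key": name,
                "dtype": dtype, "partitions": ()}
    return {"kind": "tf.io.VarLenFeature", "dtype": dtype}


print(primitive_feature_spec("x", "INT", shape=[2, 3])["kind"])
print(primitive_feature_spec("y", "FLOAT")["kind"])
```

The sketch checks the rules in the same order as the text: a fixed shape wins, then the ragged flag, and a tf.io.VarLenFeature is the fallback.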

`STRUCT` type

STRUCT features are typically used for SequenceExample inputs. At the moment only one level of nesting is supported, and the leaf features in the feature's StructDomain result in top-level tf.io.RaggedFeatures.

SparseFeature

A SparseFeature results in a tf.io.SparseFeature with the given value and index columns. The dense shape is derived from the int_domains of the index columns.

SparseFeatures are only supported with generate_legacy_feature_spec: false.

Advanced usage

TFXIO uses TensorRepresentations to infer the parsing spec from the schema Features and SparseFeatures. Manually specifying TensorRepresentations in the schema allows finer control over the input data representation. Here are a few examples:

Parsing variable length feature as ragged tensor

feature {
  name: "varlen"
  type: BYTES
}
tensor_representation_group {
  key: ""
  value {
    tensor_representation {
      key: "varlen"
      value {
        ragged_tensor {
          feature_path { step: "varlen" }
          row_partition_dtype: INT64
        }
      }
    }
  }
}

Produced feature spec:

{
  "varlen":
      tf.io.RaggedFeature(
          value_key="varlen",
          dtype=tf.string,
          partitions=(),
          row_splits_dtype=tf.int64)
}
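
With no partitions, a batch parsed with this spec has a single ragged dimension, which a ragged tensor stores as flat values plus row splits (`row_splits_dtype` selects the integer type of the splits). The following pure-Python sketch illustrates that encoding only; `to_values_and_row_splits` is a hypothetical helper and no TensorFlow code is involved.

```python
def to_values_and_row_splits(batch):
    """Flattens a batch of variable-length lists into (values, row_splits).

    Row i of the batch is values[row_splits[i]:row_splits[i + 1]].
    """
    values, row_splits = [], [0]
    for row in batch:
        values.extend(row)
        row_splits.append(len(values))
    return values, row_splits


# Three examples holding 2, 0, and 1 values of the "varlen" feature.
batch = [[b"a", b"b"], [], [b"c"]]
values, row_splits = to_values_and_row_splits(batch)
print(values)      # [b'a', b'b', b'c']
print(row_splits)  # [0, 2, 2, 3]
```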

Parsing as a nested ragged tensor

feature {
  name: "value"
  type: BYTES
}
feature {
  name: "row_length"
  type: INT
}
tensor_representation_group {
  key: ""
  value {
    tensor_representation {
      key: "ragged"
      value {
        ragged_tensor {
          feature_path { step: "value" }
          partition { row_length: "row_length" }
          row_partition_dtype: INT64
        }
      }
    }
  }
}

Produced feature spec:

{
  "ragged":
      tf.io.RaggedFeature(
          value_key="value",
          dtype=tf.string,
          partitions=(tf.io.RaggedFeature.RowLengths("row_length"),),
          row_splits_dtype=tf.int64)
}
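
Within a single example, the row-lengths partition splits the flat values of "value" into rows whose sizes come from "row_length". A minimal pure-Python sketch of that step follows; `apply_row_lengths` and the sample data are hypothetical, and the length check is an illustrative choice rather than a statement about how TensorFlow reports the mismatch.

```python
def apply_row_lengths(values, row_lengths):
    """Splits flat `values` into consecutive rows of the given lengths."""
    if sum(row_lengths) != len(values):
        raise ValueError("row lengths must sum to the number of values")
    rows, start = [], 0
    for n in row_lengths:
        rows.append(values[start:start + n])
        start += n
    return rows


# One example: three values split into a row of 2 and a row of 1.
example = {"value": [b"a", b"b", b"c"], "row_length": [2, 1]}
print(apply_row_lengths(example["value"], example["row_length"]))
# [[b'a', b'b'], [b'c']]
```

Batching several such examples adds one more outer dimension, which is why the result is a nested ragged tensor.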

Parsing as a sparse tensor

feature {
  name: "value"
  type: FLOAT
}
feature {
  name: "index0"
  type: INT
}
feature {
  name: "index1"
  type: INT
}
tensor_representation_group {
  key: ""
  value {
    tensor_representation {
      key: "sparse"
      value {
        sparse_tensor {
          index_column_names: ["index0", "index1"]
          value_column_name: "value"
          dense_shape {
            dim {
              size: 10
            }
            dim {
              size: 20
            }
          }
          already_sorted: true
        }
      }
    }
  }
}

Produced feature spec:

{
  "sparse":
      tf.io.SparseFeature(
          index_key=["index0", "index1"],
          value_key="value",
          dtype=tf.float32,
          size=[10, 20],
          already_sorted=True)
}
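
For a single example, this spec pairs the i-th entries of "index0" and "index1" into the coordinate of the i-th entry of "value" inside a 10 x 20 dense shape. The sketch below shows that pairing in pure Python; `to_sparse` and the sample data are hypothetical, and the bounds check is an illustrative assumption, not a description of TensorFlow's behavior.

```python
def to_sparse(example, index_keys, value_key, size):
    """Builds (indices, values, dense_shape) from index and value columns."""
    columns = [example[k] for k in index_keys]
    values = list(example[value_key])
    if any(len(c) != len(values) for c in columns):
        raise ValueError("index and value columns must have equal length")
    # Zip the index columns so each value gets one coordinate per dimension.
    indices = [list(coord) for coord in zip(*columns)]
    for coord in indices:
        for i, n in zip(coord, size):
            if not 0 <= i < n:
                raise ValueError("index out of range for dense shape")
    return indices, values, list(size)


example = {"index0": [0, 3], "index1": [5, 19], "value": [1.5, 2.5]}
indices, values, dense_shape = to_sparse(
    example, ["index0", "index1"], "value", [10, 20])
print(indices)      # [[0, 5], [3, 19]]
print(values)       # [1.5, 2.5]
print(dense_shape)  # [10, 20]
```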

Parsing `STRUCT` feature

One application of STRUCT features is reading and parsing SequenceExamples. In this case, context features should be declared at the top level, while sequence features should be declared in the struct_domain of a feature named "##SEQUENCE##":

feature {
  name: "##SEQUENCE##"
  type: STRUCT
  struct_domain {
    feature {
      name: "seq_int_feature"
      type: INT
      value_count {
        min: 0
        max: 2
      }
    }
    feature {
      name: "seq_string_feature"
      type: BYTES
      value_count {
        min: 0
        max: 2
      }
    }
  }
}
tensor_representation_group {
  key: ""
  value {
    tensor_representation {
      key: "seq_string_feature"
      value {
        ragged_tensor {
          feature_path {
            step: "##SEQUENCE##" step: "seq_string_feature"
          }
        }
      }
    }
    tensor_representation {
      key: "seq_int_feature"
      value {
        ragged_tensor {
          feature_path {
            step: "##SEQUENCE##" step: "seq_int_feature"
          }
        }
      }
    }
  }
}

Resulting parsing spec:

{
  "seq_string_feature":
      tf.io.RaggedFeature(
          dtype=tf.string,
          value_key="seq_string_feature",
          row_splits_dtype=tf.int64,
          partitions=[]),
  "seq_int_feature":
      tf.io.RaggedFeature(
          dtype=tf.int64,
          value_key="seq_int_feature",
          row_splits_dtype=tf.int64,
          partitions=[])
}
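
Each feature_path above names the STRUCT feature first and the leaf feature second. As a rough illustration of how such a path selects a leaf, and hence the dtype in the resulting spec, here is a pure-Python sketch. The nested dict standing in for the schema proto and the `resolve_feature_path` helper are both hypothetical.

```python
# Hypothetical dict model of the schema above (not the real proto API).
schema = {
    "##SEQUENCE##": {
        "type": "STRUCT",
        "struct_domain": {
            "seq_int_feature": {"type": "INT"},
            "seq_string_feature": {"type": "BYTES"},
        },
    },
}


def resolve_feature_path(features, steps):
    """Follows `steps` through nested struct_domains to the leaf feature."""
    if not steps:
        raise ValueError("feature path must have at least one step")
    feature = None
    for i, step in enumerate(steps):
        feature = features[step]
        if i < len(steps) - 1:
            # Only STRUCT features can have child features.
            if feature["type"] != "STRUCT":
                raise ValueError(f"{step!r} is not a STRUCT feature")
            features = feature["struct_domain"]
    return feature


leaf = resolve_feature_path(schema, ["##SEQUENCE##", "seq_string_feature"])
print(leaf["type"])  # BYTES
```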
