tf.raw_ops.UnicodeDecodeWithOffsets
Decodes each string in input into a sequence of Unicode code points.
tf.raw_ops.UnicodeDecodeWithOffsets(
    input, input_encoding, errors='replace', replacement_char=65533,
    replace_control_characters=False, Tsplits=tf.dtypes.int64, name=None
)
  The character codepoints for all strings are returned using a single vector char_values, with strings expanded to characters in row-major order. Similarly, the character start byte offsets are returned using a single vector char_to_byte_starts, with strings expanded in row-major order.
The row_splits tensor indicates where the codepoints and start offsets for each input string begin and end within the char_values and char_to_byte_starts tensors. In particular, the values for the ith string (in row-major order) are stored in the slice [row_splits[i]:row_splits[i+1]]. Thus:
- char_values[row_splits[i]+j] is the Unicode codepoint for the jth character in the ith string (in row-major order).
- char_to_byte_starts[row_splits[i]+j] is the start byte offset for the jth character in the ith string (in row-major order).
- row_splits[i+1] - row_splits[i] is the number of characters in the ith string (in row-major order).
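The flat layout described above can be sketched in plain Python. This is only an illustration, not the op's implementation: the helper name decode_with_offsets is hypothetical, it assumes valid UTF-8 input (no error handling or replacement characters), and it takes a flat list of byte strings rather than a tensor of arbitrary shape.

```python
from typing import List, Tuple


def decode_with_offsets(strings: List[bytes]) -> Tuple[List[int], List[int], List[int]]:
    """Illustrative stand-in for the op's output layout (valid UTF-8 only)."""
    row_splits = [0]          # row_splits[i]:row_splits[i+1] delimits string i
    char_values = []          # codepoints of all strings, concatenated
    char_to_byte_starts = []  # start byte of each character within its own string
    for raw in strings:
        byte_offset = 0
        for ch in raw.decode("utf-8"):
            char_values.append(ord(ch))
            char_to_byte_starts.append(byte_offset)
            # Advance by the number of bytes this character occupies in UTF-8.
            byte_offset += len(ch.encode("utf-8"))
        row_splits.append(len(char_values))
    return row_splits, char_values, char_to_byte_starts


row_splits, char_values, char_to_byte_starts = decode_with_offsets(
    ["héllo".encode("utf-8"), "日本".encode("utf-8")]
)
print(row_splits)           # [0, 5, 7]
print(char_values)          # [104, 233, 108, 108, 111, 26085, 26412]
print(char_to_byte_starts)  # [0, 1, 3, 4, 5, 0, 3]

# Values for string i live in the slice [row_splits[i]:row_splits[i+1]].
i = 1
print(char_values[row_splits[i]:row_splits[i + 1]])  # [26085, 26412]
```

Note that the byte offsets restart at 0 for each string, while row_splits indexes into the concatenated vectors.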
| Args | |
|---|---|
| input | A Tensor of type string. The text to be decoded. Can have any shape. Note that the output is flattened to a vector of char values. |
| input_encoding | A string. Text encoding of the input strings. This is any of the encodings supported by ICU ucnv algorithmic converters. Examples: "UTF-16", "US ASCII", "UTF-8". |
| errors | An optional string from: "strict", "replace", "ignore". Defaults to "replace". Error handling policy when invalid formatting is found in the input. A value of 'strict' will cause the operation to produce an InvalidArgument error on any invalid input formatting. A value of 'replace' (the default) will cause the operation to replace any invalid formatting in the input with the replacement_char codepoint. A value of 'ignore' will cause the operation to skip any invalid formatting in the input and produce no corresponding output character. |
| replacement_char | An optional int. Defaults to 65533. The replacement character codepoint to be used in place of any invalid formatting in the input when errors='replace'. Any valid Unicode codepoint may be used. The default value is the Unicode replacement character, U+FFFD (decimal 65533). |
| replace_control_characters | An optional bool. Defaults to False. Whether to replace the C0 control characters (00-1F) with the replacement_char. |
| Tsplits | An optional tf.DType from: tf.int32, tf.int64. Defaults to tf.int64. The dtype of the returned row_splits tensor. |
| name | A name for the operation (optional). |
| Returns | |
|---|---|
| A tuple of Tensor objects (row_splits, char_values, char_to_byte_starts). | |
| row_splits | A Tensor of type Tsplits. |
| char_values | A Tensor of type int32. |
| char_to_byte_starts | A Tensor of type int64. |
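
A minimal usage sketch, assuming TensorFlow 2.x with eager execution enabled. Raw ops take keyword arguments only; the sample strings are illustrative, and all optional arguments are left at their defaults.

```python
import tensorflow as tf

# Two UTF-8 encoded strings; "é" takes 2 bytes, each CJK character takes 3.
inputs = tf.constant(["héllo".encode("utf-8"), "日本".encode("utf-8")])

row_splits, char_values, char_to_byte_starts = tf.raw_ops.UnicodeDecodeWithOffsets(
    input=inputs,
    input_encoding="UTF-8",
)

print(row_splits.numpy().tolist())           # [0, 5, 7]
print(char_values.numpy().tolist())          # [104, 233, 108, 108, 111, 26085, 26412]
print(char_to_byte_starts.numpy().tolist())  # [0, 1, 3, 4, 5, 0, 3]
print(row_splits.dtype, char_values.dtype, char_to_byte_starts.dtype)
```

Passing Tsplits=tf.int32 changes only the dtype of row_splits; char_values stays int32 and char_to_byte_starts stays int64.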
     
© 2020 The TensorFlow Authors. All rights reserved.
Licensed under the Creative Commons Attribution License 3.0.
Code samples licensed under the Apache 2.0 License.
 https://www.tensorflow.org/versions/r2.3/api_docs/python/tf/raw_ops/UnicodeDecodeWithOffsets