Stream Feature View
A StreamFeatureView is used for simple row-level streaming feature transformations. It processes raw data from a streaming source (e.g. Kafka or Kinesis) and can be backfilled from any StreamDataSource that contains a historical log of events.
Use a StreamFeatureView if:
- your use case requires very fresh features (<1 minute) that update whenever a new raw event is available on the stream
- you want to run simple row-level transformations on the raw data, or ingest raw data without further transformation
- you have your raw events available on a stream
Common Examples:
- Last transaction amount of a user's transaction stream
- Stream ingesting pre-computed feature values from an existing Kafka or Kinesis stream
See StreamWindowAggregateFeatureView for a specialized StreamFeatureView that supports efficiently computed time window aggregations.
Example
For more examples, see Examples.
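As an illustrative sketch, a spark_sql-mode transformation is a Python function that returns a SQL string selecting the entity ID, timestamp, and feature columns. The decorator and its parameters vary across Tecton SDK versions, so they are omitted here, and the source and column names below are hypothetical:

```python
# Hypothetical body of a spark_sql-mode StreamFeatureView transformation.
# In a real Tecton repo this function would be wrapped by the stream feature
# view decorator, and `transactions` would refer to a registered
# StreamDataSource; both names are illustrative.
def last_transaction_amount(transactions):
    return f"""
        SELECT
            user_id,                      -- entity ID column
            amount AS last_amount,        -- feature column
            timestamp                     -- event timestamp column
        FROM {transactions}
    """
```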
Parameters
See the API reference for the full list of parameters.
Transformation Pipeline
Stream Feature Views can use pyspark or spark_sql transformation types. You can configure mode=pipeline to construct a pipeline of those transformations, or use mode=pyspark or mode=spark_sql to define an inline transformation.
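The mode=pipeline idea can be sketched outside of Tecton as a composition of row-level transformations, where each step consumes the previous step's output (plain Python with hypothetical names, not the Tecton API):

```python
def parse(raw):
    # First transformation: convert cents into dollars.
    return {**raw, "amount_usd": raw["amount_cents"] / 100}

def enrich(row):
    # Second transformation: derive a flag from the parsed value.
    return {**row, "is_large": row["amount_usd"] > 1000}

def pipeline(raw):
    # mode=pipeline chains transformations into a single feature pipeline.
    return enrich(parse(raw))

out = pipeline({"user_id": "u1", "amount_cents": 250_000})
```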
The output of your transformation must include columns for the entity IDs and a timestamp. All other columns will be treated as features.
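For instance, using plain Python dictionaries to stand in for rows (a sketch of the schema contract, not Tecton or Spark code; all column names are hypothetical), the output must carry the entity ID and a timestamp, and every remaining column becomes a feature:

```python
from datetime import datetime, timezone

def transform(raw_event):
    """Row-level transformation whose output carries the entity ID
    ('user_id') and a timestamp; all other columns are features."""
    return {
        "user_id": raw_event["user_id"],                          # entity ID
        "timestamp": raw_event["timestamp"],                      # event time
        "amount_usd": raw_event["amount_cents"] / 100,            # feature
        "is_large": raw_event["amount_cents"] > 100_000,          # feature
    }

event = {"user_id": "u1",
         "timestamp": datetime(2023, 1, 1, tzinfo=timezone.utc),
         "amount_cents": 250_000}
row = transform(event)

# Feature columns are everything except the entity ID and timestamp.
feature_columns = set(row) - {"user_id", "timestamp"}
```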
Productionizing a Stream
For a Stream Feature View used in production where late data loss is unacceptable, it's recommended to set default_watermark_delay_threshold to your stream's retention period, or to at least 24 hours. This configures Spark Structured Streaming not to drop data when events are processed late or out of order. The tradeoff of a longer watermark delay is a greater amount of in-memory state held by the streaming job.
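To see why the watermark delay matters, here is a minimal stdlib simulation of the watermark mechanism (a sketch of the concept, not Spark code): the engine tracks the maximum event time seen so far and drops any event older than that maximum minus the delay threshold.

```python
from datetime import datetime, timedelta

def process(event_times, watermark_delay):
    """Simulate a streaming watermark: an event is dropped if its
    timestamp is older than (max event time seen) - watermark_delay."""
    max_event_time = datetime.min
    kept, dropped = [], []
    for ts in event_times:
        max_event_time = max(max_event_time, ts)
        if ts >= max_event_time - watermark_delay:
            kept.append(ts)
        else:
            dropped.append(ts)
    return kept, dropped

t0 = datetime(2023, 1, 1)
# The third event arrives 4 hours behind the latest event already seen.
events = [t0, t0 + timedelta(hours=5), t0 + timedelta(hours=1)]

# With a 1-hour watermark the late event is dropped...
_, dropped_short = process(events, timedelta(hours=1))
# ...while a 24-hour watermark keeps it, at the cost of holding more state.
_, dropped_long = process(events, timedelta(hours=24))
```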
Usage Example
See how to use a Stream Feature View in a notebook here.
How it works
When materialized online, Tecton will run the StreamFeatureView transformation on each event that arrives from the underlying stream source and write the result to the online store. Any previous values are overwritten, so the online store holds only the most recent value.
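The online-store behavior described above is last-write-wins per entity key. A minimal sketch of that semantics (plain Python, hypothetical field names, not the Tecton API):

```python
def apply_stream(events):
    """Simulate the online store: each incoming event's transformed row
    overwrites the previous row for its entity key, so only the most
    recent value remains."""
    online_store = {}
    for event in events:
        row = {"last_amount": event["amount"]}   # row-level transformation
        online_store[event["user_id"]] = row     # overwrite previous value
    return online_store

store = apply_stream([
    {"user_id": "u1", "amount": 10},
    {"user_id": "u2", "amount": 7},
    {"user_id": "u1", "amount": 25},   # overwrites u1's earlier value
])
```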
Streaming Transformations are executed as Spark Structured Streaming jobs (additional compute will be supported soon).
Additionally, Tecton runs the same Stream Feature View transformation pipeline against the StreamDataSource's batch source (a historical log of stream events) when materializing feature values to the offline store. This offline batch source lets you create training data sets using the same feature definition that is used online.
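Because the same transformation runs over both the live stream and the historical batch log, online and offline feature values stay consistent. Sketched with a single shared transform function (plain Python, hypothetical names):

```python
def transform(event):
    # The same row-level transformation is applied online and offline.
    return {"user_id": event["user_id"],
            "timestamp": event["timestamp"],
            "amount": event["amount"]}

historical_log = [
    {"user_id": "u1", "timestamp": 1, "amount": 10},
    {"user_id": "u1", "timestamp": 2, "amount": 25},
]

# Offline: backfill over the full historical log to build training data.
offline_rows = [transform(e) for e in historical_log]

# Online: apply the identical function to each live event as it arrives,
# so the newest offline row matches what the online store would serve.
online_value = transform(historical_log[-1])
```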
Stream vs. Stream Window Aggregate Feature Views
A StreamFeatureView is the more generic but less specialized sibling of a StreamWindowAggregateFeatureView. A StreamFeatureView is an abstraction on top of Spark Structured Streaming. Use a StreamWindowAggregateFeatureView whenever you need time window aggregations. See the StreamWindowAggregateFeatureView documentation for an explanation of how Tecton supports these types of features under the hood by combining Spark Structured Streaming with on-demand transformation.