Stream Window Aggregate Feature View

A StreamWindowAggregateFeatureView is used for streaming time-window aggregation features, such as a 10-minute rolling count of per-user transactions. It processes raw data from a StreamDataSource and can be backfilled from any BatchDataSource that contains a historical log of events.

Use a StreamWindowAggregateFeatureView if:

  • your use case requires very fresh features (<1 second) that update whenever a new raw event arrives on the stream
  • you need tumbling, hopping, or rolling time-window aggregations of type count, sum, mean, max, min, or last-n
  • your raw events are available on a stream

Common Examples:

  • 30-minute rolling click count of a user
  • 5-minute rolling transaction sum
  • Last 10 transactions of a user
  • Max transaction amount of a user

A StreamWindowAggregateFeatureView is a specialized implementation for time-window aggregations that is more efficient and performant than what a normal StreamFeatureView can accomplish. Tecton achieves this higher efficiency and feature freshness because it stores partial feature values in tiles that are rolled up at feature request time (see How it works below).

Example

A minimal definition is sketched below. For more examples, see the Examples page.
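
The following is a minimal sketch of a Stream Window Aggregate Feature View definition. It assumes an SDK version that exposes the stream_window_aggregate_feature_view decorator, FeatureAggregation, and Input; the transactions_stream data source and user entity are hypothetical objects defined elsewhere in the feature repository, and exact parameter names may differ between SDK versions, so check the API reference for your version.

```python
from datetime import datetime

# Assumes an SDK version that exposes these symbols; check the API reference.
from tecton import stream_window_aggregate_feature_view, FeatureAggregation, Input

# Hypothetical objects defined elsewhere in the feature repository:
# a StreamDataSource (with a batch source for backfills) and an Entity.
from fraud.data_sources import transactions_stream
from fraud.entities import user


@stream_window_aggregate_feature_view(
    inputs={"transactions": Input(transactions_stream)},
    entities=[user],
    mode="spark_sql",
    online=True,
    offline=True,
    feature_start_time=datetime(2021, 1, 1),
    # Tile size for the partial aggregations described under "How it works".
    aggregation_slide_period="10m",
    aggregations=[
        # One feature is produced per (column, function, time window) combination.
        FeatureAggregation(column="amount", function="sum", time_windows=["30m", "1h", "24h"]),
        FeatureAggregation(column="amount", function="mean", time_windows=["30m", "1h", "24h"]),
    ],
    description="Rolling sum and mean of a user's transaction amounts.",
)
def user_transaction_amount_metrics(transactions):
    # Row-level transformation: select the entity key, the column(s) to
    # aggregate, and the event timestamp.
    return f"""
        SELECT
            user_id,
            amount,
            timestamp
        FROM {transactions}
        """
```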

Parameters

See the API reference for the full list of parameters.

Transformation

In the body of your Python function, you'll define row-level transformations that will then be aggregated according to the FeatureAggregation parameter.

Your transformation must output a column for each entity and a timestamp column. Each additional column must be aggregated by at least one FeatureAggregation. The final number of features depends on how many aggregations and time windows you configure, as illustrated below.
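
As a small illustration, reusing the hypothetical aggregations from the Example above, each FeatureAggregation yields one feature per time window, so two aggregations over three windows each produce six features:

```python
# Hypothetical aggregation configuration matching the Example above:
# (column, function, time_windows)
aggregations = [
    ("amount", "sum", ["30m", "1h", "24h"]),
    ("amount", "mean", ["30m", "1h", "24h"]),
]

# Each (column, function) pair produces one feature per time window.
num_features = sum(len(windows) for _, _, windows in aggregations)
print(num_features)  # 6
```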

Usage Example

See the example notebook for how to use a Stream Window Aggregate Feature View.

How it works

StreamWindowAggregateFeatureViews use Spark Structured Streaming jobs under the hood. They operate either on a sliding time window or with continuous processing. When using a sliding time window, Tecton updates the feature value in the online store after each slide period has elapsed, assuming there was at least one event for that key. With continuous processing, each new event is persisted as soon as it's processed.

Tecton stores partial aggregations in the form of tiles. The tile size is defined by the aggregation_slide_period parameter. At feature request time, Tecton's online and offline feature serving capabilities automatically roll up the persisted tiles (or, with continuous processing, the persisted event projections); a simplified sketch of this roll-up follows the list below. This design has several key benefits:

  • Significantly reduced storage requirements if you define several time windows.
  • Reduced stream job memory requirements.
  • Streaming features with a time window that exceeds the streaming source's retention period can be backfilled from the batch source. Tecton backfills historical tiles from the batch source and, at request time, transparently combines them with tiles written by the streaming source.
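
As a simplified illustration of the roll-up (not Tecton's actual implementation), partial counts stored per 10-minute tile can be summed at request time to serve any requested window, so several time windows are served from the same stored tiles:

```python
from datetime import datetime, timedelta

# Simplified illustration of tile roll-up -- not Tecton's implementation.
# Each tile holds a partial count of one key's events for one 10-minute slide period.
tiles = {
    datetime(2021, 1, 1, 12, 0): 3,
    datetime(2021, 1, 1, 12, 10): 1,
    datetime(2021, 1, 1, 12, 20): 4,
}

def rolling_count(request_time: datetime, window: timedelta) -> int:
    """Roll up the tiles that fall inside [request_time - window, request_time)."""
    return sum(
        count
        for tile_start, count in tiles.items()
        if request_time - window <= tile_start < request_time
    )

# A 30-minute and a 1-hour window are both served from the same stored tiles,
# which is why adding more time windows barely increases storage.
now = datetime(2021, 1, 1, 12, 30)
print(rolling_count(now, timedelta(minutes=30)))  # 8
print(rolling_count(now, timedelta(hours=1)))     # 8
```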

See this blog post for more technical details.

When to use continuous processing

Continuous processing for Stream Window Aggregate Feature Views can update feature values in less than a second after the event is available in the stream data source, rather than waiting for the slide period to complete.

You should use continuous mode if your model performance depends on extremely fresh features. For example, a fraud detection use case may require features to reflect transactions made even a few seconds earlier.

You should not use continuous mode if your model can tolerate features updating every few minutes and you are trying to optimize costs. Continuous mode may lead to higher infrastructure costs due to more frequent feature writes and checkpointing updates. Using a longer slide period can lower costs, especially if a single key may have multiple events within a short period that can be grouped into a single slide period.
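
As an assumption-heavy sketch of how the two modes are typically selected: continuous processing is assumed here to be enabled by passing "continuous" instead of a fixed duration for the slide period; verify the exact parameter for your SDK version in the API reference.

```python
# Hypothetical decorator arguments, continuing the Example above.  Whether
# continuous mode is selected via aggregation_slide_period (as assumed here)
# or a dedicated parameter depends on the SDK version.
sliding_config = {"aggregation_slide_period": "10m"}            # update once per slide period
continuous_config = {"aggregation_slide_period": "continuous"}  # update per event, higher cost
```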

Checkpointing costs with continuous processing on EMR

Continuous mode can cause significant S3 costs for customers using Tecton with EMR due to the frequency of writing Spark Streaming checkpoints to S3. The Databricks implementation of Spark Streaming has much lower checkpointing costs.