Batch Feature View
A BatchFeatureView is used to define row-level or aggregate transformations against a BatchDataSource. Batch Feature Views run automatic backfills and can be scheduled to publish new feature data to the Online and Offline Feature Stores on a regular cadence.
Note: Many aggregations are already supported out of the box by a Batch Window Aggregate Feature View. These aggregations have been optimized for cost and efficiency and are a good place to start if you are looking to define time-windowed aggregations.
Use a BatchFeatureView if:
- you have your raw events available in a Batch Data Source
- you want to run simple row-level transformations on the raw data, or simply ingest raw data without further transformation
- you want to define custom join and aggregation transformations
- your use case can tolerate a feature freshness of > 1 hour
- you want to ingest a dimension table (e.g. a user's attributes) for feature consumption
Common Examples:
- determining if a user's credit score is over a pre-defined threshold
- counting distinct transactions over a time window
- batch ingesting pre-computed feature values from an existing batch data source
- batch ingesting a user's date of birth
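For instance, the credit-score example above amounts to a plain row-level function. A minimal sketch in plain Python (the threshold value of 670 is assumed for illustration, not taken from Tecton):

```python
CREDIT_SCORE_THRESHOLD = 670  # assumed threshold, chosen for illustration

def credit_score_over_threshold(score: int) -> bool:
    # A row-level transformation: each input row (one credit score) maps to
    # exactly one output row (a boolean feature value).
    return score >= CREDIT_SCORE_THRESHOLD
```

In a real Feature View, this comparison would live inside the SQL or PySpark transformation body rather than a standalone function.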
Examples
To create a Batch Feature View, use the @batch_feature_view decorator on your Python function.
Row-Level Transformation
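A minimal sketch of a row-level Batch Feature View, assuming a transactions_batch BatchDataSource and a user Entity are defined elsewhere in the feature repository (the exact decorator signature varies across Tecton SDK versions):

```python
from datetime import datetime
from tecton import batch_feature_view, Input

# Sketch only: `transactions_batch` and `user` are assumed to exist in the
# feature repo; parameter names may differ across SDK versions.
@batch_feature_view(
    mode='spark_sql',
    inputs={'transactions': Input(transactions_batch)},
    entities=[user],
    online=True,
    offline=True,
    batch_schedule='1d',
    feature_start_time=datetime(2021, 4, 1),
    ttl='30d',
)
def user_transaction_amount(transactions):
    return f'''
        SELECT nameorig AS user_id, amount, timestamp
        FROM {transactions}
    '''
```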
Custom Aggregation Transformation
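For a custom aggregation, the Feature View typically runs in pipeline mode and composes a sliding-window transformation with a grouping transformation, as described in the tecton_sliding_window section below. A sketch under the same assumptions (transactions_batch and user defined elsewhere in the repo; import paths vary by SDK version):

```python
from tecton import batch_feature_view, Input, const

# Sketch only; decorator parameters and import paths vary by SDK version.
@batch_feature_view(
    mode='pipeline',
    inputs={'transactions_batch': Input(transactions_batch, window='30d')},
    entities=[user],
    batch_schedule='1d',
    ttl='2d',
)
def user_distinct_merchant_transaction_count_30d(transactions_batch):
    return user_distinct_merchant_transaction_count_transformation(
        tecton_sliding_window(transactions_batch,
                              timestamp_key=const('timestamp'),
                              window_size=const('30d')))
```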
Parameters
See the API reference for the full list of parameters.
The backfill_config parameter (under development) controls how Tecton groups the backfill jobs it spins up, and requires a matching form of the transformation. Currently, the only available value is BackfillConfig("multiple_batch_schedule_intervals_per_job"); more values will be supported in the future.
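For illustration, backfill_config is passed in the decorator alongside the other materialization parameters (a sketch; the surrounding parameters are abbreviated):

```python
from tecton import batch_feature_view, BackfillConfig

# Sketch only: sources, entities, and schedule parameters omitted for brevity.
@batch_feature_view(
    mode='spark_sql',
    backfill_config=BackfillConfig("multiple_batch_schedule_intervals_per_job"),
)
def my_feature_view(transactions):
    ...
```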
How it works
When materializing features online and offline, Tecton runs the BatchFeatureView transformation according to the defined batch_schedule. It publishes the latest feature values per entity key to the Online Feature Store and all historical values to the Offline Feature Store.
These parameters in a Batch Feature View definition configure how Tecton will run the materialization jobs:
- batch_schedule (e.g. "1d"): Controls how often Tecton will materialize new feature values to the Feature Store.
- feature_start_time (e.g. datetime(2021, 4, 1)): Controls how far back Tecton will backfill feature data to the Feature Store once a new Feature View transformation is registered.
- window (e.g. "7d"): An optional parameter on each data source Input, which defaults to the Feature View's batch_schedule and determines the time range of raw data Tecton will supply to the transformation for a given materialization run (e.g. the most recent 7 days' worth of data). Tecton automatically filters out data outside this window based on the Data Source's timestamp_key.
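The interaction of these parameters can be illustrated with a small helper (plain Python, not part of the Tecton SDK) that computes the raw-data time range a single materialization run sees:

```python
from datetime import datetime, timedelta

def raw_data_time_range(run_end, window):
    """Return the (start, end] raw-data time range for one materialization run.

    Illustrative helper only: for a run ending at `run_end`, Tecton supplies
    the trailing `window` of raw data (defaulting to batch_schedule) and
    filters out rows whose timestamp_key falls outside it.
    """
    return (run_end - window, run_end)

# With window="7d", a run ending 2021-04-08 reads raw data back to 2021-04-01.
start, end = raw_data_time_range(datetime(2021, 4, 8), timedelta(days=7))
```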
Using tecton_sliding_window for windowed aggregations
Note
tecton_sliding_window is deprecated in later versions of Tecton. We recommend upgrading to the latest version of Tecton and using its alternate functionality for windowed aggregations.
When aggregating over a time window with window, we recommend using the tecton_sliding_window() transformation. See this notebook for more details on how tecton_sliding_window() works.
First, add the tecton_sliding_window() transformation to your transformation pipeline. tecton_sliding_window() has 3 primary inputs:
- df: the input data.
- timestamp_key: the timestamp column in your input data that represents the time of the event.
- window_size: how far back in time the window should go. For example, if your feature is the number of distinct IDs in the last 30 days, then the window size is 30 days. Typically this value should match the window on your Input.
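As a mental model (a toy re-implementation, not Tecton's actual code), tecton_sliding_window can be pictured as exploding each event row into one copy per sliding window that contains it, attaching a window_end column for the later GROUP BY:

```python
from datetime import datetime, timedelta

def explode_for_windows(rows, timestamp_key, window_size, window_ends):
    # Toy illustration of the idea behind tecton_sliding_window: each event
    # row is duplicated once per window (window_end - window_size, window_end]
    # that contains it, so grouping by window_end later aggregates over the
    # full trailing window.
    out = []
    for row in rows:
        ts = row[timestamp_key]
        for window_end in window_ends:
            if window_end - window_size < ts <= window_end:
                out.append({**row, "window_end": window_end})
    return out

rows = [
    {"nameorig": "user_1", "timestamp": datetime(2021, 1, 1, 12)},
    {"nameorig": "user_2", "timestamp": datetime(2021, 1, 2, 12)},
]
# Daily window ends with a 2-day window: each row lands in exactly 2 windows.
window_ends = [datetime(2021, 1, d) for d in (2, 3, 4)]
exploded = explode_for_windows(rows, "timestamp", timedelta(days=2), window_ends)
```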
In the example above, our transformation pipeline now looks like this:
```python
def user_distinct_merchant_transaction_count_30d(transactions_batch):
    return user_distinct_merchant_transaction_count_transformation(
        tecton_sliding_window(transactions_batch,
                              timestamp_key=const('timestamp'),
                              window_size=const('30d')))
```
The downstream transformation then groups by the window_end column, alongside any entity columns. In the example above, the second transformation looks like this:
```python
@transformation(mode='spark_sql')
def user_distinct_merchant_transaction_count_transformation(window_input_df):
    return f'''
        SELECT
            nameorig AS user_id,
            COUNT(DISTINCT namedest) AS distinct_merchant_count,
            window_end AS timestamp
        FROM {window_input_df}
        GROUP BY
            nameorig,
            window_end
    '''
```
And that's it! Tecton will now be able to calculate your feature that aggregates over the trailing 30 days.
Batch vs. Batch Window Aggregate Feature Views
A BatchFeatureView is the more flexible, but less specialized, alternative to a BatchWindowAggregateFeatureView. BatchWindowAggregateFeatureViews are highly recommended when running supported time-window aggregations. See the BatchWindowAggregateFeatureView documentation for a quick explanation of how Tecton supports these types of features by leveraging pre-computed and on-demand transformations.