Tecton 0.4
Overview
Tecton 0.4 was released in June 2022. It includes the following framework improvements and changes:
- Snowflake support
- API simplification & improvements
- Materialization info diffs
Snowflake Support
Tecton 0.4 includes compatibility with Snowflake for processing and storing features. Once connected to a Snowflake warehouse, users can define features in Snowflake SQL or Snowpark.
```python
@batch_feature_view(
    sources=[transactions],
    entities=[user],
    mode='snowflake_sql',
    aggregation_interval=timedelta(days=1),
    aggregations=[
        Aggregation(column='TRANSACTION', function='sum', time_window=timedelta(days=1)),
        Aggregation(column='TRANSACTION', function='sum', time_window=timedelta(days=7)),
        Aggregation(column='TRANSACTION', function='sum', time_window=timedelta(days=40)),
        Aggregation(column='AMT', function='mean', time_window=timedelta(days=1)),
        Aggregation(column='AMT', function='mean', time_window=timedelta(days=7)),
        Aggregation(column='AMT', function='mean', time_window=timedelta(days=40)),
    ],
    online=True,
    feature_start_time=datetime(2020, 10, 10),
    description='User transaction totals over a series of time windows, updated daily.'
)
def user_transaction_metrics(transactions):
    return f'''
        SELECT
            USER_ID,
            1 AS TRANSACTION,
            AMT,
            TIMESTAMP
        FROM
            {transactions}
        '''
```
API Simplification and Improvements
0.4 includes a large set of changes to simplify and improve Tecton's declarative Feature Repository API.

SDK 0.4 maintains backwards compatibility via the `tecton.compat` submodule. Users can migrate from 0.3 to 0.4 without changing their Feature Repo by importing Tecton objects from `tecton.compat` instead of `tecton`.
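For example, here is a minimal sketch of a 0.3-style Data Source definition running unchanged on SDK 0.4, where only the import line changes. The class and parameter names come from the rename tables below; the database and table values are illustrative:

```python
# A 0.3-style definition running on SDK 0.4 via the compatibility module.
from tecton.compat import BatchDataSource, HiveDSConfig

transactions = BatchDataSource(
    name='transactions',
    batch_ds_config=HiveDSConfig(
        database='fraud',          # illustrative values
        table='transactions',
        timestamp_column_name='TIMESTAMP',
    ),
)
```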
Functional Changes
- Removed the `batch_window_aggregate_feature_view` and `stream_window_aggregate_feature_view` types. `batch_feature_view` and `stream_feature_view` now support Tecton window aggregations.
  - Rationale: These object types overlapped significantly and unnecessarily increased the number of concepts that new users had to learn.
- Changes to materialization timestamp filtering.
  - During materialization, the output of Feature Views will now be automatically filtered to the materialization period (i.e. the window of time that is being backfilled or updated incrementally at steady state).
  - Data Sources no longer require a timestamp column to be defined, because the time filter is now applied to the output of the Feature View.
  - Users have two options for optimizing query performance by pushing down timestamp filtering:
    - Handle time filtering with custom logic using the `materialization_context`.
    - Use `FilteredSource` to have Tecton automatically filter the Data Source to the correct period before the Feature View transformation is applied (see the sketch below).
  - Rationale: Tecton's previous timestamp filtering logic worked well when a Feature View had exactly one Data Source and that Data Source had a timestamp column that was used directly as the Feature View feature time. Outside of that case, Tecton's timestamp filtering logic was unintuitive and a frequent source of bugs. The new logic should be simpler for most users while simultaneously providing more flexibility for power users.
  - See the batch feature view overview for more information.
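  - For illustration, a minimal sketch of the `FilteredSource` option, assuming the `transactions` source and `user` entity from the Snowflake example above (the `ttl` value is illustrative):

    ```python
    from datetime import datetime, timedelta
    from tecton import batch_feature_view, FilteredSource

    @batch_feature_view(
        # Tecton filters the source down to the materialization period before
        # this query runs, so no explicit time predicate is needed below.
        sources=[FilteredSource(transactions)],
        entities=[user],
        mode='snowflake_sql',
        batch_schedule=timedelta(days=1),
        feature_start_time=datetime(2020, 10, 10),
        ttl=timedelta(days=30),
    )
    def user_transactions(transactions):
        return f'''
            SELECT USER_ID, AMT, TIMESTAMP
            FROM {transactions}
        '''
    ```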
- Introduced “Incremental Backfilling” for Batch Feature Views.
  - `incremental_backfills` is a new parameter for Batch Feature Views that changes how Tecton backfills the feature view. If set to `True`, Tecton will backfill every period in the backfill window in its own job. In some cases (e.g. custom aggregations), this can lead to much simpler query definitions.
  - Rationale: Provide a means for users to easily and correctly implement Feature Views with custom aggregations.
  - See this guide for more info.
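  - For illustration, a sketch of a daily custom aggregation using `incremental_backfills`, assuming the `transactions` source and `user` entity from above. Stamping rows with `context.end_time` is one common pattern, not the only one:

    ```python
    from datetime import datetime, timedelta
    from tecton import batch_feature_view, FilteredSource, materialization_context

    @batch_feature_view(
        sources=[FilteredSource(transactions)],
        entities=[user],
        mode='snowflake_sql',
        batch_schedule=timedelta(days=1),
        incremental_backfills=True,  # backfill each day in its own job
        feature_start_time=datetime(2020, 10, 10),
        ttl=timedelta(days=2),
    )
    def user_daily_transaction_count(transactions, context=materialization_context()):
        # Each job sees exactly one day of data, so a plain GROUP BY produces a
        # correct daily aggregate; the row is stamped just inside the window end.
        return f'''
            SELECT
                USER_ID,
                COUNT(*) AS DAILY_TRANSACTION_COUNT,
                TO_TIMESTAMP('{context.end_time}') - INTERVAL '1 MICROSECOND' AS TIMESTAMP
            FROM {transactions}
            GROUP BY USER_ID
        '''
    ```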
- Configurable `data_delay` on Data Sources.
  - Replaces `schedule_offset`, a Feature View parameter.
  - By default, incremental (i.e. non-backfill) materialization jobs run immediately at the end of the batch schedule period. `data_delay` configures how long materialization jobs should wait to run after the end of a period, typically to ensure that all data has landed. For example, if a feature view has a `batch_schedule` of 1 day and one of the data source inputs has a `data_delay` of 1 hour, then incremental materialization jobs will run at `01:00` UTC (one hour after the period has ended).
  - Rationale: This parameter delays materialization due to upstream data delays, which logically fits as a Data Source property. Feature Views now inherit data delays from all dependent Data Sources.
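  - For illustration, a sketch of a batch source that waits one hour for upstream data to land, assuming a Hive-backed table with illustrative names:

    ```python
    from datetime import timedelta
    from tecton import BatchSource, HiveConfig

    transactions = BatchSource(
        name='transactions',
        batch_config=HiveConfig(
            database='fraud',          # illustrative values
            table='transactions',
            timestamp_field='TIMESTAMP',
            # Materialization jobs for dependent Feature Views wait one hour
            # after each period ends, giving upstream pipelines time to land data.
            data_delay=timedelta(hours=1),
        ),
    )
    ```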
- Support custom names for aggregate features.
  - Allow users to set custom names for aggregate features. (Previously, users had to use Tecton auto-generated names like `amount_mean_7d_1d`.)
  - Example:

    ```python
    @batch_feature_view(
        ...
        aggregations=[
            Aggregation(name='transaction_amount_daily_avg', column='amount', function='mean', time_window=timedelta(days=1)),
            Aggregation(name='transaction_amount_weekly_avg', column='amount', function='mean', time_window=timedelta(days=7)),
        ]
    )
    def user_transaction_counts(transactions):
        return f'''
            SELECT user_id, timestamp, amount
            FROM {transactions}
            '''
    ```
Non-functional Changes
- Tecton data types
  - Tecton now uses `tecton.types` when defining Feature View schemas and Request Data Sources.
  - Example:

    ```python
    from tecton import on_demand_feature_view, RequestSource
    from tecton.types import Int64, Bool, Field

    # The request source carries the raw transaction amount.
    transaction_request = RequestSource(schema=[Field('amount', Int64)])

    @on_demand_feature_view(
        sources=[transaction_request],
        mode='python',
        schema=[Field('transaction_amount_is_high', Bool)],
    )
    def transaction_amount_is_high(transaction_request):
        return {'transaction_amount_is_high': transaction_request['amount'] >= 10000}
    ```

  - Rationale: Previously, Tecton used PySpark data types to define all schemas. This made PySpark a required dependency for the Tecton SDK, but with Snowflake support, Tecton can now be used without Spark. Tecton will continue to use native data types (PySpark, Snowflake, etc.) in data platform specific contexts, e.g. when providing an explicit schema for a Spark Data Source.
- Use `timedelta` for duration parameters instead of pytime strings.
  - E.g. `time_window=timedelta(hours=12)` instead of `time_window="12h"`.
  - Rationale: Consistent with the API's usage of `datetime` objects, removes an API dependency on the PyTime implementation, and is less ambiguous.
- Use functional style to define Feature View overrides in Feature Services.
  - Example:

    ```python
    transaction_fraud_service = FeatureService(
        name="transaction_fraud_service",
        features=[
            # Select a subset of features from a feature view.
            transaction_features[["amount"]],
            # Rename a feature view and/or rebind its join keys. In this example, we want user features
            # for both the transaction sender and recipient, so include the feature view twice and bind
            # it to two different feature service join keys.
            user_features.with_name("sender_features").with_join_key_map({"user_id": "sender_id"}),
            user_features.with_name("recipient_features").with_join_key_map({"user_id": "recipient_id"}),
        ],
    )
    ```
Parameter/Class Changes
Class Renames/Changes
| 0.3 Definition | 0.4 Definition |
| --- | --- |
| **Data Sources** | |
| `BatchDataSource` | `BatchSource` |
| `StreamDataSource` | `StreamSource` |
| `FileDSConfig` | `FileConfig` |
| `HiveDSConfig` | `HiveConfig` |
| `KafkaDSConfig` | `KafkaConfig` |
| `KinesisDSConfig` | `KinesisConfig` |
| `RedshiftDSConfig` | `RedshiftConfig` |
| `RequestDataSource` | `RequestSource` |
| `SnowflakeDSConfig` | `SnowflakeConfig` |
| **Feature Views** | |
| `@batch_window_aggregate_feature_view` | `@batch_feature_view` |
| `@stream_window_aggregate_feature_view` | `@stream_feature_view` |
| **Misc Classes** | |
| `FeatureAggregation` | `Aggregation` |
| **New Classes** | |
| - | `AggregationMode` |
| - | `KafkaOutputStream` |
| - | `KinesisOutputStream` |
| - | `FilteredSource` |
| **Deprecated Classes in 0.3** | |
| `Input` | - |
| `BackfillConfig` | - |
| `MonitoringConfig` | - |
Feature View/Table Parameter Changes
| 0.3 Definition | 0.4 Definition |
| --- | --- |
| `inputs` | `sources` |
| `name_override` | `name` |
| `aggregation_slide_period` | `aggregation_interval` |
| `timestamp_key` | `timestamp_field` |
| `batch_cluster_config` | `batch_compute` |
| `stream_cluster_config` | `stream_compute` |
| `online_config` | `online_store` |
| `offline_config` | `offline_store` |
| `output_schema` | `schema` |
| `family` | - (removed) |
| `schedule_offset` | - (removed, see Data Source `data_delay`) |
| `monitoring.alert_email` (nested) | `alert_email` |
| `monitoring.monitor_freshness` (nested) | `monitor_freshness` |
| `monitoring.expected_freshness` (nested) | `expected_freshness` |
Data Source Parameter Changes
| 0.3 Definition | 0.4 Definition |
| --- | --- |
| `timestamp_column_name` | `timestamp_field` |
| `batch_ds_config` | `batch_config` |
| `stream_ds_config` | `stream_config` |
| `raw_batch_translator` | `post_processor` |
| `default_watermark_delay_threshold` | `watermark_delay_threshold` |
| `default_initial_stream_position` | `initial_stream_position` |
Materialization info in `tecton plan`
`tecton plan` will now print a summary of the backfill and incremental materialization jobs that will result from applying a plan. This feature should help users avoid applying changes that trigger more new jobs than expected.

```
$ tecton apply
...
+ Create FeatureView
  name:            user_transaction_counts
  owner:           matt@tecton.ai
  description:     User transaction totals over a series of time windows, updated daily.
  materialization: 10 backfills, 1 recurring batch job
  > backfill:      9 Backfill jobs 2020-10-03 00:00:00 UTC to 2022-04-14 00:00:00 UTC writing to the Offline Store
                   1 Backfill job 2022-04-14 00:00:00 UTC to 2022-06-06 00:00:00 UTC writing to both the Online and Offline Store
  > incremental:   1 Recurring Batch job scheduled every 1 day writing to both the Online and Offline Store
```
Patch Updates
0.4.14
- Support all aggregation modes (`disabled`, `partial`, `full`) for `FeatureView.run()` for stream feature views with `aggregation_mode=AggregationMode.CONTINUOUS`.
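For illustration, a sketch of running a continuous-mode stream feature view at each level. The `aggregation_level` parameter name and the `to_pandas()` call follow the SDK's `run()` interface, but treat the exact signature as an assumption and check the SDK reference:

```python
from datetime import datetime

# user_click_stream_metrics is assumed to be a stream feature view defined with
# aggregation_mode=AggregationMode.CONTINUOUS elsewhere in the repo.
result = user_click_stream_metrics.run(
    start_time=datetime(2022, 6, 1),
    end_time=datetime(2022, 6, 2),
    aggregation_level='partial',  # also accepts 'disabled' and 'full'
)
df = result.to_pandas()
```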
0.4.13
- The `spark_config` parameter to `DatabricksClusterConfig` and `EMRClusterConfig` now allows arbitrary configs to be set (previously only a small list was allowed).
0.4.11
- Release `--suppress-recreates` to suppress rematerialization for one-off refactors and migrations. See docs for full instructions on usage.
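For example, a typical invocation (consult the docs for when suppressing recreates is safe):

```
$ tecton apply --suppress-recreates
```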
0.4.9
- Improve point-in-time accuracy for `get_historical_features(spine)` on feature views with Tecton aggregations.
  - Joins for features in a Batch Feature View now account for the data delay on the input data sources. If a Batch Feature View has multiple sources, then the maximum data delay is used.
  - Joins for features in a Stream Feature View ignore the data delay for their source's batch configuration.
0.4.8
- Update check to allow using Data Sources that set `data_delay` with compat Feature Views that set `schedule_offset`, as long as the two values are equal.
0.4.7
- Add support for the `format_string` parameter in `DatetimePartitionColumn` (see the sketch after this list).
- Improve type hints and docstrings for public Python classes and decorators.
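For illustration, a sketch of `format_string` on a date-partitioned Hive source; the partition column name and format are illustrative:

```python
from tecton import DatetimePartitionColumn, HiveConfig

# Assumes the table is partitioned like dt=2022-06-01; format_string controls
# how Tecton renders the partition value when building partition filters.
partition = DatetimePartitionColumn(
    column_name='dt',
    datepart='date',
    format_string='%Y-%m-%d',
)
batch_config = HiveConfig(
    database='fraud',
    table='transactions',
    timestamp_field='TIMESTAMP',
    datetime_partition_columns=[partition],
)
```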
0.4.6
- Improve error message handling for permission errors.
0.4.4
- Add check to prevent using `compat` Feature Views with non-`compat` Data Sources that have a `data_delay` configured.
0.4.3
- Add support for `instance_availability=SPOT_WITH_FALLBACK` in `EMRClusterConfig`.
0.4.2
- Fix bug that prevented some CLI commands from working in Windows environments.
0.4.1
- Include all 0.3 declarative classes and functions in `tecton.compat` so that 0.3 repos can be updated by a single find and replace: replace `from tecton import` with `from tecton.compat import`.