Reading Batch Features for Inference using Spark

Overview

This example demonstrates how to perform batch inference in Tecton. Batch inference in Tecton is very similar to generating training data.

Fetch a Batch of Data from Tecton

If your model was trained with data from Tecton, you already created a FeatureService to generate that training data. Use the same FeatureService to fetch a batch of data for inference.

Similar to how you built training data, you'll need to generate a DataFrame that represents the data you wish to retrieve from Tecton. This DataFrame should be composed of rows containing:

  • The join keys associated with each of your features
  • Timestamps at which you'd like to retrieve data
  • Columns corresponding to the RequestSource of any OnDemandFeatureView features, if your FeatureService includes one or more OnDemandFeatureViews

If you're not sure which join keys are associated with your features, the page corresponding to your FeatureService in the Web UI will list the entities associated with all of your features. Each entity maps to a join key that you will need.

Example: Building a Prediction Context for Fraud Detection

In this example, let's imagine we have a fraud detection model that we would like to run nightly on the last 24 hours of transactions. The features for our model describe transactions, users, and merchants. To create our prediction context, we fetch a log of the transactions in the last day, which should look like this:

transaction_id  user_id      merchant_id  timestamp
51812359        C1231006815  M1979787155  2020-12-01 01:00:02.595066019
51812360        C1666544295  M2044282225  2020-12-01 01:00:02.940659192
51812361        C1305486145  M5532624065  2020-12-01 01:00:03.336173880
51812362        C840083671   M3899427010  2020-12-01 01:00:06.033070635
51812363        C2048537720  M1230701703  2020-12-01 01:00:06.711752585
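
The nightly window described above can be sketched in plain Python. This is only an illustration: the records are dictionaries standing in for DataFrame rows, `last_24_hours` is a hypothetical helper, and the first transaction ID is made up to show a row that falls outside the window. In practice you would apply the same filter to your transaction log with Spark or pandas before passing it to Tecton.

```python
from datetime import datetime, timedelta


def last_24_hours(transactions, run_time):
    """Keep transactions whose timestamp falls in the 24 hours before run_time."""
    window_start = run_time - timedelta(hours=24)
    return [
        row for row in transactions
        if window_start <= row["timestamp"] < run_time
    ]


# Hypothetical log entries using the columns from the table above.
transactions = [
    {"transaction_id": 51812358, "user_id": "C1231006815",
     "merchant_id": "M1979787155",
     "timestamp": datetime(2020, 11, 29, 23, 59, 59)},  # too old, dropped
    {"transaction_id": 51812359, "user_id": "C1231006815",
     "merchant_id": "M1979787155",
     "timestamp": datetime(2020, 12, 1, 1, 0, 2)},
    {"transaction_id": 51812360, "user_id": "C1666544295",
     "merchant_id": "M2044282225",
     "timestamp": datetime(2020, 12, 1, 1, 0, 2)},
]

# A nightly run at midnight on 2020-12-02 covers all of 2020-12-01.
prediction_context = last_24_hours(transactions, datetime(2020, 12, 2))
```

Using a half-open window (start included, end excluded) keeps consecutive nightly runs from scoring the same transaction twice.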

Retrieve Data with the Prediction Context

Now that you have a prediction context, you can use the Tecton SDK to retrieve features for inference. This will be the same code you used to generate a dataset:

import tecton

# transaction_log is a DataFrame containing the prediction context built above
ws = tecton.get_workspace('prod')
fs = ws.get_feature_service('demo_fraud_model')
batch_data = fs.get_historical_features(transaction_log, timestamp_key="timestamp")

The call to get_historical_features returns a Tecton DataFrame in which your feature values have been joined onto the prediction context. With a single feature joined onto the context above, the result would look like this:

transaction_id  user_id      merchant_id  timestamp                      transaction_details.amount
51812359        C1231006815  M1979787155  2020-12-01 01:00:02.595066019  35.0
51812360        C1666544295  M2044282225  2020-12-01 01:00:02.940659192  522.2
51812361        C1305486145  M5532624065  2020-12-01 01:00:03.336173880  1.2
51812362        C840083671   M3899427010  2020-12-01 01:00:06.033070635  90.2
51812363        C2048537720  M1230701703  2020-12-01 01:00:06.711752585  555.6
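
To make the role of the timestamp column concrete: each row of the prediction context receives the feature value as of that row's timestamp. The sketch below illustrates that idea in plain Python for a single join key. It is not Tecton's implementation, and `value_as_of` and the sample feature history are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime


def value_as_of(history, timestamp):
    """Return the latest value recorded at or before timestamp, or None.

    history is a list of (timestamp, value) pairs sorted by timestamp.
    """
    times = [t for t, _ in history]
    idx = bisect_right(times, timestamp)
    return history[idx - 1][1] if idx else None


# Hypothetical feature history for one user: (time the value became available, value).
history = [
    (datetime(2020, 11, 30, 0, 0), 3),
    (datetime(2020, 12, 1, 0, 0), 5),
    (datetime(2020, 12, 2, 0, 0), 8),
]

# A context row stamped 2020-12-01 01:00:02 sees the 2020-12-01 value,
# never the later 2020-12-02 value.
row_time = datetime(2020, 12, 1, 1, 0, 2)
feature_value = value_as_of(history, row_time)
```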

Perform Inference

To perform batch inference with a model that accepts pandas input, convert the Tecton DataFrame above to a pandas DataFrame:

batch_data_pandas = batch_data.to_pandas()
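
From there, scoring might look like the following sketch. It assumes a model object with a scikit-learn-style `predict` method; `score_batch`, `model`, and `feature_columns` are hypothetical names that do not come from Tecton.

```python
def score_batch(batch_data, model, feature_columns):
    """Convert retrieved features to pandas and attach model predictions.

    Assumes batch_data is the Tecton DataFrame returned by
    get_historical_features and model exposes predict(features).
    """
    features = batch_data.to_pandas()
    # Pass only the feature columns to the model, not the join keys or timestamp.
    features["prediction"] = model.predict(features[feature_columns])
    return features
```

Keeping the join keys and timestamp in the returned frame makes it straightforward to write each prediction back against its transaction.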

For other inference frameworks, you can persist your data to a file using Spark, then perform inference by loading from this file.
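
One way to do this is sketched below. It assumes the Tecton DataFrame exposes `to_spark()` returning a Spark DataFrame and uses Spark's standard Parquet writer; `persist_for_inference` and the output path are hypothetical, and Parquet is only one possible file format.

```python
def persist_for_inference(batch_data, output_path):
    """Write the retrieved features to Parquet so another framework can load them.

    Assumes batch_data is a Tecton DataFrame whose to_spark() returns a
    Spark DataFrame; output_path is a location your Spark cluster can write to.
    """
    spark_df = batch_data.to_spark()
    # Overwrite so that each nightly run replaces the previous batch.
    spark_df.write.mode("overwrite").parquet(output_path)
    return output_path
```

You would call it with the result of get_historical_features and a path of your choosing, then point your inference job at that path.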