Neural Collaborative Filtering and Two-Tower Architecture for Recommendation Engines
In today’s digital world, recommendation engines play a key role in personalizing our online experiences, like helping us find the next movie to watch or suggesting products to buy. One advanced method used to build these recommendation systems is called Neural Collaborative Filtering (NCF). A popular and efficient way to use this method is through something called the two-tower architecture.
What is Collaborative Filtering?
At the core of recommendation engines, collaborative filtering tries to predict what a user would like based on their interactions or similarities with other users. The most common example is a user-item interaction matrix, where:
Rows represent users.
Columns represent items (like movies or products).
Cells indicate whether a user has interacted with an item (e.g., watched a movie, bought a product).
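As a concrete toy example of such a matrix (hypothetical data, using NumPy):

```python
import numpy as np

# Hypothetical toy interaction matrix: 4 users x 5 items,
# 1 = the user interacted with the item, 0 = no interaction observed.
interactions = np.array([
    [1, 0, 1, 0, 0],  # user 0 interacted with items 0 and 2
    [0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
])

num_users, num_items = interactions.shape
print(num_users, num_items)  # 4 5
```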
In traditional collaborative filtering, the focus is on using this matrix to find patterns—e.g., users who liked item X also liked item Y. However, this method struggles with cold-start problems (when there’s not enough data) and can’t capture complex, non-linear user-item relationships.
Neural Collaborative Filtering (NCF)
Neural Collaborative Filtering builds on the traditional approach by using deep neural networks to learn complex patterns and interactions between users and items. Instead of relying on simple matrix factorization, NCF uses embeddings—vector representations of users and items that can capture more nuanced features.
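As a minimal illustration of ID-based embeddings (NumPy, with made-up table sizes; in a real system these tables are learned during training rather than randomly initialized):

```python
import numpy as np

rng = np.random.default_rng(42)
NUM_USERS, NUM_ITEMS, EMBED_DIM = 1000, 5000, 16

# Embedding tables: one learned vector per user ID and per item ID.
user_table = rng.normal(scale=0.1, size=(NUM_USERS, EMBED_DIM))
item_table = rng.normal(scale=0.1, size=(NUM_ITEMS, EMBED_DIM))

e_u = user_table[42]  # embedding lookup for user ID 42
e_i = item_table[7]   # embedding lookup for item ID 7
print(e_u @ e_i)      # NCF-style interaction between the two embeddings
```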
Two-Tower Architecture Overview
The two-tower architecture consists of two separate neural networks (or towers):
User Tower: A neural network that takes in user features (e.g., user ID, demographics, or past behavior) and generates a user embedding—a compressed representation of the user.
Item Tower: A separate neural network that takes in item features (e.g., item ID, category, or description) and generates an item embedding.
The architecture then combines these embeddings to compute a score, indicating how well a user matches an item.
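A minimal sketch of the two towers, assuming single linear layers with a tanh nonlinearity in place of deeper networks, and made-up feature dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 8
USER_FEATURES = 16  # hypothetical user feature dimension
ITEM_FEATURES = 12  # hypothetical item feature dimension

# Each "tower" here is one linear layer for brevity;
# real towers are typically multi-layer networks.
W_user = rng.normal(size=(USER_FEATURES, EMBED_DIM))
W_item = rng.normal(size=(ITEM_FEATURES, EMBED_DIM))

def user_tower(x_u):
    return np.tanh(x_u @ W_user)  # user embedding e_u

def item_tower(x_i):
    return np.tanh(x_i @ W_item)  # item embedding e_i

def score(x_u, x_i):
    # Dot product of the two embeddings gives the match score.
    return float(user_tower(x_u) @ item_tower(x_i))

x_u = rng.normal(size=USER_FEATURES)
x_i = rng.normal(size=ITEM_FEATURES)
print(score(x_u, x_i))
```

Note that the two towers never see each other's inputs; they only meet at the final dot product, which is what makes the architecture easy to serve at scale.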
Step 1: User and Item Embeddings
Let’s assume a user \(u\) is represented by a feature vector \(\mathbf{x}_u\) and an item \(i\) is represented by a feature vector \(\mathbf{x}_i\). The user tower transforms the user feature vector \(\mathbf{x}_u\) into a user embedding \(\mathbf{e}_u\), i.e.,
\(\mathbf{e}_u = f_{\text{user}}(\mathbf{x}_u)\)
Similarly, the item tower transforms the item feature vector \(\mathbf{x}_i\) into an item embedding \(\mathbf{e}_i\):
\(\mathbf{e}_i = f_{\text{item}}(\mathbf{x}_i)\)
Step 2: Computing the Interaction
Once we have the user embedding \(\mathbf{e}_u\) and the item embedding \(\mathbf{e}_i\), the next step is to calculate how well they match. The simplest way to do this is by using the dot product:
\(s_{ui} = \mathbf{e}_u \cdot \mathbf{e}_i\)
Here \(s_{ui}\) is the similarity score or match score between the user and the item. The dot product captures the interaction between the user and item embeddings.
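The dot product on hypothetical three-dimensional embeddings:

```python
import numpy as np

e_u = np.array([0.2, -0.5, 0.8])  # hypothetical user embedding
e_i = np.array([0.4,  0.1, 0.5])  # hypothetical item embedding

# s_ui = e_u . e_i : larger values mean a better user-item match.
s_ui = float(np.dot(e_u, e_i))
print(s_ui)  # 0.2*0.4 + (-0.5)*0.1 + 0.8*0.5 ≈ 0.43
```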
Step 3: Loss Function
To train the model, we need to compare the predicted score with the actual user behavior. If we are doing binary classification (e.g., did the user click on the item or not?), we can use a binary cross-entropy loss function:
\(\mathcal{L} = -\sum_{(u,i) \in D} \left[ y_{ui} \log \sigma(s_{ui}) + (1 - y_{ui}) \log\left(1 - \sigma(s_{ui})\right) \right]\)
where:
\(y_{ui}\) is the true label (1 if the user interacted with the item, 0 if they didn’t),
\(s_{ui}\) is the predicted similarity score,
\(\sigma\) is the sigmoid function to convert the score into a probability,
\(D\) is the dataset of user-item interactions.
This helps the model learn the optimal user and item embeddings by minimizing the difference between the predicted and actual interactions.
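A sketch of this loss in NumPy (averaged over the batch rather than summed, a common variant; the scores and labels below are made up):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def bce_loss(scores, labels):
    """Binary cross-entropy over predicted scores s_ui and labels y_ui."""
    p = sigmoid(scores)  # convert scores to probabilities
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

scores = np.array([2.0, -1.5, 0.3])  # hypothetical s_ui values
labels = np.array([1.0, 0.0, 1.0])   # observed interactions y_ui
print(bce_loss(scores, labels))
```

In practice the gradient of this loss is backpropagated through both towers, updating the user and item embeddings jointly.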
Benefits of the Two-Tower Architecture
The two-tower architecture has several benefits that make it an excellent choice for neural collaborative filtering:
1. Scalability
In large systems with millions of users and items, scalability is key. With the two-tower model, you can precompute item embeddings and store them. When a user makes a query, only the user embedding needs to be computed in real time, which can then be compared against the precomputed item embeddings to quickly generate recommendations.
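A sketch of this serving pattern, assuming precomputed item embeddings and a simple exact scoring scan (NumPy, hypothetical sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
EMBED_DIM = 8

# Item embeddings are computed offline (e.g. in a nightly batch job) and stored.
item_embeddings = rng.normal(size=(10_000, EMBED_DIM))

def recommend(user_embedding, k=5):
    """Score the user against all precomputed items; return top-k item ids."""
    scores = item_embeddings @ user_embedding  # one matrix-vector product
    return np.argsort(scores)[::-1][:k]

# Only this embedding is computed online, at request time.
user_embedding = rng.normal(size=EMBED_DIM)
top_items = recommend(user_embedding)
print(top_items)
```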
2. Flexibility
This architecture can easily incorporate additional features for both users and items, such as user demographics or item metadata (like genre or price). This flexibility allows the model to capture richer relationships than traditional collaborative filtering methods.
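For instance, a user tower’s input vector might be assembled by concatenating an ID embedding with side features (all values below are hypothetical):

```python
import numpy as np

# Hypothetical feature construction: the user tower's input concatenates
# a learned ID embedding with side features such as demographics.
id_embedding = np.array([0.1, -0.3, 0.7, 0.2])  # learned user-ID embedding
age_normalized = np.array([0.35])               # demographic feature
device_onehot = np.array([0.0, 1.0, 0.0])       # e.g. mobile device

x_u = np.concatenate([id_embedding, age_normalized, device_onehot])
print(x_u.shape)  # (8,)
```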
3. Efficient Retrieval
Once you have the embeddings, retrieving relevant items for a user can be done using techniques like Approximate Nearest Neighbor (ANN) search, which quickly finds the items whose embeddings are most similar to the user’s embedding.
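A brute-force stand-in for that retrieval step, using exact top-k selection; at production scale an ANN index (e.g. FAISS or ScaNN) replaces this full scan with sublinear approximate search:

```python
import numpy as np

rng = np.random.default_rng(2)
item_embeddings = rng.normal(size=(100_000, 16))
user_embedding = rng.normal(size=16)

# Exact top-k: argpartition finds the k best candidates in O(n),
# then a small argsort orders just those k by descending score.
k = 10
scores = item_embeddings @ user_embedding
candidates = np.argpartition(scores, -k)[-k:]
top_k = candidates[np.argsort(scores[candidates])[::-1]]
print(top_k)
```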
Conclusion
The two-tower architecture is a strong candidate for building recommendation engines, especially when scalability, efficiency, and flexibility are needed. By using deep learning to capture complex user-item relationships, this approach can outperform traditional collaborative filtering methods while remaining computationally efficient for large-scale systems.