Multi-modal Embedding Fusion-based Recommender

Abstract

Recommendation systems have lately been popularized globally, with primary use cases in online interaction systems and a significant focus on e-commerce platforms. We have developed an ML (machine learning)-based recommendation platform, which can be easily applied to almost any items and/or actions domain. Contrary to existing recommendation systems, our platform natively supports multiple types of interaction data with multiple modalities of metadata. This is achieved through multi-modal fusion of various data representations. We deployed the platform into multiple e-commerce stores of different kinds, e.g. food and beverages, shoes, fashion items, telecom operators. Here, we present our system, its flexibility and performance. We also show benchmark results on open datasets that significantly outperform state-of-the-art prior work.

CCS CONCEPTS

• Information systems → Recommender systems; • Applied computing → Online shopping; • Computing methodologies → Neural networks; • Human-centered computing → Human computer interaction (HCI).

1 Introduction

Recommender systems aim to suggest relevant items to users (items being movies to watch, texts to read, products to buy or anything else depending on industries). Indeed, the systems are present at almost every large e-commerce store or platform, spanning various sectors from garments, through jewellery to food.

There exist multiple frameworks and algorithms to build recommender systems, and the choice of the optimal approach strongly depends on the types of data available, the distributional properties of the data, modalities considered and business use cases [5, 10, 16, 20, 22]. It is usually impossible to adjust existing algorithms to include a new modality of data or a new type of attributes.

Hence, a vast majority of existing recommender systems consider only a single type of interaction, e.g. clicks or purchases - yet even in this simple scenario, the generalization of performance to various datasets seems doubtful [3].

Businesses using large datasets desire a system that goes beyond the currently used data range: one based on predictors derived from variables generated through automatic analysis of customers’ voices (audio) and observations of how customers interact with the merchant’s websites, as well as mobile and offline ecosystems. Unlike existing solutions, such a system is expected to use these variables by correlating them in real time with data from other channels, which would significantly increase the system’s effectiveness, as well as expand its functionality. Improving the effectiveness of customer behavior predictive analytics is a key challenge for many businesses.

We present our recommender system, which utilizes a multi-modal fusion of multiple interaction types (e.g. clicks, purchases, adding a product to cart) and multiple attribute modalities (audio, video, images, text, other behavioral data through time). Our system provides an efficient framework to combine, deploy and evaluate different recommendation algorithms and scenarios utilizing rich, multi-modal and multi-view data sources.

In this work we present our contributions to recommender systems in data science. We define the requirements for a next generation recommender system as follows:

  • multiple input interaction types (e.g. clicks, purchases, add-to-cart, geo-locations),
  • multiple input attribute modalities (e.g. text, image, video, other),
  • ease of adding new back-end algorithms,
  • effective deep learning models for visual search, recommendations with and without session information, which outperform state-of-the-art techniques,
  • specialized techniques to fuse multiple modalities,
  • high efficiency and scalability (services architecture),
  • convenient infrastructure for model evaluation and performance measurements.

In the following sections we describe our recommendation system architecture and the data workflow (section 3.1). The main features, multi-modal embeddings and their fusion technology, are sketched in sections 3.2 and 3.3, respectively. Subsequently we present a few tests against state-of-the-art (SotA) benchmarks (section 4). Finally we add a description of our interface, recommendation analytics and a few use cases from our production deployments (section 5).

2 Motivation and State-of-the-Art Review

Predictive analytics incorporated in recommender systems unleash the power of data for users and businesses. Systems which learn from data how to predict future choices and behaviors of individuals can bring significant competitive advantages. While perfect prediction is not possible in practice, sufficiently developed systems can bring benefits for both customers and businesses. This paper focuses mainly on the customer side.

The purpose of recommendation is to help users find the products they need, manage their personal budget efficiently and make purchase decisions faster. This is usually achieved by showing related offers and recommending similar products to the ones they have viewed, recommending the next products to consider or to complement a shopping cart.

There exist multiple established recommendation algorithms, ranging from simple heuristic-based methods (such as KNNs), through Collaborative Filtering to deep learning architectures [3, 11, 19, 21].

Different algorithms are useful in different input data settings, use cases and scenarios. Common similar-item recommendations based on text and numeric data involve preparing suggestions (i.e. other items or actions to take) considering the context of a single item. Personalized recommendations suggest products considering the context of users’ buying preferences and their behavioral profiles (based on long- or short-term history). In order to prepare these suggestions, the system analyzes page visits, transactional data and also product feeds (product metadata). There are also other types of recommendations, e.g. cross-sell, top products, last seen offers.

In practice the techniques are often mixed depending on environment and various factors, e.g. vendor domain, website construction, user history and current season or time of the day. They should be adjusted experimentally and measured constantly.

Thus, a system that is highly self-adjustable to the type and modality of data is crucial to cope with many deployments and to use recommendation techniques effectively. A set of recommendation scenarios (which we also use as default settings in our platform) is shown in Fig. 1.

3 Our Approach

3.1 System Overview

Our platform is adjusted to consume business customers’ standard format APIs for product feeds and our proprietary product catalog database infrastructure (see Fig. 2).

The system is based on the Reactive Microservices Architecture [2, 13], implementing its core principles: elasticity, scalability, fault tolerance, high availability, message-driven communication and real-time processing. Real-time processing is especially crucial in order to provide tailored, high-quality recommendations that take into account not only the latest changes of in-session user behavior, but also changes in system performance. Not only are scores and recommendations calculated at request time, but user representations are also updated and exposed to models after each event flowing through the event stream.

The conceptual diagram of the architecture is presented in Fig. 2. The system is accessible through an extensive API exposed by the recommendations facade. When a new request for a recommendation appears, before it is passed to the recommendation logic module, it is validated by the facade and enriched with business rules via recommendation campaigns. Rules may include the type of recommendation, the recommendation goal or filtering expressions formulated in our dedicated control language, the items query language (IQL).

The custom IQL query language provides a very flexible framework to build new recommendation scenarios based on item metadata and the recommendation request context. Fig. 3 shows a few examples of building recommendation filtering rules. IQL expressions are handled by an items filter, which filters candidate items based on the given constraints. To achieve high throughput and low latency, the items filter uses its own compressed binary representation of items, serving thousands of requests per second and filtering sets of a million or more items. For IQL expressions with low selectivity, transferring the data structure containing candidate item IDs over the network could be expensive, therefore a binary protocol between the filter and the logic module has been implemented.

The model which will handle the request is selected by the Optimizer. The Optimizer implements a form of the Thompson Sampling algorithm for multi-armed bandit problems, which makes it possible not only to easily A/B test new ideas and algorithms, but also to optimize the results of running recommendation campaigns. Finally, one of the models receives a request to score the available candidates and to update entity embeddings.
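To illustrate how Thompson Sampling can route requests between competing models, here is a minimal sketch of the Beta-Bernoulli variant. It is not the platform's Optimizer: the class name, the model names, the click-based reward and the simulated click-through rates are all assumptions made for the example.

```python
import random


class ThompsonSamplingOptimizer:
    """Beta-Bernoulli Thompson Sampling over candidate models (hypothetical helper)."""

    def __init__(self, model_names, seed=None):
        self.rng = random.Random(seed)
        # Each model keeps a Beta(successes + 1, failures + 1) posterior.
        self.successes = {name: 0 for name in model_names}
        self.failures = {name: 0 for name in model_names}

    def select_model(self):
        # Sample a plausible success rate per model and pick the best draw.
        draws = {
            name: self.rng.betavariate(self.successes[name] + 1,
                                       self.failures[name] + 1)
            for name in self.successes
        }
        return max(draws, key=draws.get)

    def record_feedback(self, name, clicked):
        # Assumed reward signal: whether the recommendation was clicked.
        if clicked:
            self.successes[name] += 1
        else:
            self.failures[name] += 1


def simulate(true_ctr, rounds=5000, seed=0):
    """Route simulated requests and count how often each model is chosen."""
    rng = random.Random(seed)
    optimizer = ThompsonSamplingOptimizer(list(true_ctr), seed=seed + 1)
    picks = {name: 0 for name in true_ctr}
    for _ in range(rounds):
        name = optimizer.select_model()
        picks[name] += 1
        optimizer.record_feedback(name, rng.random() < true_ctr[name])
    return picks


# Made-up click-through rates for two hypothetical models.
picks = simulate({"model_a": 0.02, "model_b": 0.10})
print(picks)
```

Because traffic shifts gradually towards the model with the better posterior, the same loop serves both as an A/B test and as an online optimization of a running campaign.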

Although most of the system works in real time, an offline part is also present, but it is limited to model training. Algorithms are trained on two main data sources. The first one is a data lake into which events of different types and origins are ingested through an event stream. To name a few event types: a screen view from a mobile app, a product added to cart on a web page, an offline transaction from a POS system, etc. The second source is a master item metadata database where items are kept along with their attributes and rich data types like images.

3.2 Multi-modal Embeddings

Our algorithms can be fed with various kinds of input data. The system analyzes long- and short-term interaction history of users and has a deep insight into item metadata. For this purpose we use a multi-step pipeline, starting with unsupervised learning. For images and texts, off-the-shelf unsupervised models may be used. For interaction data we identify graphs of user-entity interactions (e.g. user-product, user-brand, user-store) and compute multiple graph or network embeddings.

Figure 1: Diagram of various recommendation scenarios depending on the site within an e-commerce platform

Figure 2: A general architecture of our recommendation platform

We developed a custom method for massive-scale network embedding for networks with hundreds of billions of nodes and tens of billions of edges. The task of network embedding is to map a network or a graph into a low-dimensional embedding space, while preserving higher-order proximities between nodes. In our datasets nodes represent interacting entities, e.g. users, device IDs, cookies, products, brands, title words etc. Edges represent interactions, with a single type of interaction per input network, e.g. purchase, view, hover, search.

Similar network embedding approaches include Node2Vec, DeepWalk and RandNE [23]. These approaches exhibit several undesirable properties, which our method addresses. Thanks to its algorithm design and highly optimized implementation, our method allows for:

  • three orders of magnitude improvement in time complexity over Node2Vec and DeepWalk,
  • deterministic output – embedding the same network twice results in the same embeddings,
  • stable output with regards to small input perturbations – small changes in the dataset result in similar embeddings,
  • inductive property and dynamic updating – embeddings for new nodes can be created on the fly,
  • applicable to both networks and hyper-networks – support for multi-node edges.

The input data is constructed from raw interactions as an edge (hyperedge) list for both simple networks and hypernetworks. In the case of hypernetworks, where the cardinality of an edge is larger than 2, our algorithm performs implicit clique expansion in memory (to avoid excessive storage needs for an exploded input file). For very wide hyperedges, star expansion results in fewer edges and can be used instead, via an input file containing virtual interaction nodes.
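The difference between the two expansions can be seen on a single hyperedge. The sketch below is only illustrative: the function names, the node labels and the virtual node name `interaction_0` are hypothetical. A hyperedge with k nodes yields k(k-1)/2 edges under clique expansion but only k edges under star expansion.

```python
from itertools import combinations


def clique_expansion(hyperedge):
    """Connect every pair of nodes that take part in the hyperedge."""
    return list(combinations(hyperedge, 2))


def star_expansion(hyperedge, virtual_node):
    """Connect every node to one virtual node that stands for the hyperedge."""
    return [(virtual_node, node) for node in hyperedge]


# One interaction touching five entities (made-up labels).
basket = ["user_1", "shoe_7", "sock_3", "lace_9", "bag_2"]

clique_edges = clique_expansion(basket)
star_edges = star_expansion(basket, "interaction_0")

print(len(clique_edges), "edges after clique expansion")
print(len(star_edges), "edges after star expansion")
```

The quadratic growth of the clique expansion is why star expansion becomes attractive for very wide hyperedges.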

Figure 3: Items query language implemented and used in our recommendation system

Our custom method works as follows. First, we initialize node vectors (the Q matrix) pseudo-randomly via multiple independent hashings of node labels, mapping the hashes to a constant interval, which results in vectors sampled from a uniform (-1, 1) distribution. Thus we achieve deterministic sampling. Empirically we determine that a dimensionality of 1024 or 2048 is enough for most purposes. Then we calculate a Markov transition matrix (M) representing network connectivity. In the case of a hyper-network, we perform clique expansion, adding virtual edges. Final node embeddings are obtained by multiplying M ∗ Q iteratively and L2-normalizing the result in each intermediate step. The number of iterations depends on the distributional properties of the graph, with between 3 and 5 iterations being a good default range.
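The steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the optimized implementation: SHA-256 is an arbitrary choice of hash, M is assumed to be the row-normalized adjacency matrix of an undirected graph, and the dimensionality is kept at 8 for readability instead of 1024 or 2048. All function names are hypothetical.

```python
import hashlib
import math
from collections import defaultdict


def hash_init(node, dim):
    """Deterministic vector in [-1, 1): one independent hash per dimension."""
    vec = []
    for d in range(dim):
        digest = hashlib.sha256(f"{d}:{node}".encode()).digest()
        u = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
        vec.append(2.0 * u - 1.0)
    return vec


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def transition_matrix(edges):
    """Sparse row-normalized adjacency: M[a][b] = P(step from a to b)."""
    weights = defaultdict(lambda: defaultdict(float))
    for a, b in edges:
        weights[a][b] += 1.0
        weights[b][a] += 1.0
    M = {}
    for node, row in weights.items():
        total = sum(row.values())
        M[node] = {nbr: w / total for nbr, w in row.items()}
    return M


def embed(edges, dim=8, iterations=3):
    """Iterate Q <- L2-normalize(M * Q) starting from hashed vectors."""
    M = transition_matrix(edges)
    Q = {node: hash_init(node, dim) for node in M}
    for _ in range(iterations):
        new_Q = {}
        for node, row in M.items():
            mixed = [0.0] * dim
            for nbr, p in row.items():
                for d in range(dim):
                    mixed[d] += p * Q[nbr][d]
            new_Q[node] = l2_normalize(mixed)
        Q = new_Q
    return Q


edges = [("user_1", "shoe_7"), ("user_1", "sock_3"),
         ("user_2", "shoe_7"), ("user_2", "bag_2")]
Q = embed(edges)
print(sorted(Q))
```

Since the only source of randomness is the hash of the node label, running `embed` twice on the same edge list returns identical vectors, which is the deterministic-output property listed earlier.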

The algorithm is optimized for extremely large datasets:

  • the Markov transition matrix M is stored in COO (coordinate) format in RAM or in memory-mapped files on disk;
  • all operations are parallelized with respect to the embedding dimensions, because the dimensions of the vectors in Q are independent of each other;
  • the M ∗ Q multiplication is performed with dimension-level concurrency as well;
  • clique expansion for hyper-graphs is performed virtually, only filling the entries in the M matrix;
  • star expansion is performed explicitly, with a transient column for the virtual nodes in the input file.

The algorithm’s results are entity embeddings contained in the Q matrix. Creation of inductive embeddings (for new nodes) is possible from raw network data using the formula M′ ∗ Q, where M′ represents the links between existing and new nodes and Q represents the embeddings of existing nodes.
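As a small illustration of the M′ ∗ Q formula, the sketch below builds one row of M′ for a new node and multiplies it by existing embeddings. Two assumptions are made that the text does not state: the row of M′ is normalized by the new node's interaction counts, and the result is L2-normalized like the other embeddings. The function name and the toy vectors are hypothetical.

```python
import math


def induce_embedding(links, Q):
    """Embed one new node from its links to already embedded nodes.

    links: {existing_node: interaction_count}, i.e. one row of M' before
    normalization. Q: {existing_node: embedding vector}.
    """
    total = sum(links.values())
    dim = len(next(iter(Q.values())))
    vec = [0.0] * dim
    for node, count in links.items():
        for d in range(dim):
            vec[d] += (count / total) * Q[node][d]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


# Toy two-dimensional embeddings of existing products.
Q = {"shoe_7": [1.0, 0.0], "sock_3": [0.0, 1.0]}

# A new user who interacted three times with shoe_7 and once with sock_3.
new_user = induce_embedding({"shoe_7": 3, "sock_3": 1}, Q)
print(new_user)
```

No retraining is needed: the new vector is a weighted combination of vectors that already exist, so it can be computed on the fly.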

It is worth noting that the algorithm not only performs well on interaction networks, but also on short text data, especially product metadata. In this setting we consider words in a product title as a hyperedge. This corresponds to star-expansion, where product identifiers are virtual nodes linking title words.

However, our general pipeline can easily use embeddings calculated using the latest techniques of language modeling, e.g. ELMo or BERT embeddings, especially for longer texts. Another data source is visual data (shape, color, style, etc.), i.e. images. To prepare the visual data feed for our algorithm we use state-of-the-art deep learning neural networks [4, 15] customized for our use [1].

Indeed, any unsupervised learning method outputting dense embeddings can be considered as input to our general pipeline.

3.3 Embedding Fusion

Having unsupervised dense representations coming from multiple, possibly different algorithms, representing products or other entities the customers interact with, we need to aggregate them into fixed-size behavioral profiles for every user.

As most methods of representation learning assume nothing about embedding compositionality (with simple assumptions made by Bag-of-Words models), we develop a custom compositionality mechanism allowing meaningful summation of multiple items.

Our algorithm performs multiple feature space partitionings via vector quantization. The algorithm involves ideas derived from Locality Sensitive Hashing and the Count-Min Sketch algorithm, combined with geometric intuitions. Sparse representations resulting from this approach exhibit additive compositionality, due to Count-Sketch properties (for a set of items, the sketch of the set is equal to the sum of separate sketches).

All modalities and views of data (all embedding vectors) are processed in this way, and their sketches are concatenated.

One of the central advantages of the algorithm is the ability to squash representations of multiple objects into a much smaller joint representation, which we call a sketch, and which allows for easy and fast subsequent retrieval of participating objects, in a way analogous to Count-Min Sketch. For example, the purchase history of a user can be represented in a single sketch, the website browsing history in another sketch, and the two sketches concatenated.
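The following sketch illustrates the idea of building additive sketches from dense embeddings. It is a simplification under explicit assumptions: random hyperplanes (a standard Locality Sensitive Hashing construction) stand in for the paper's vector quantization, and the function names, item vectors and sketch sizes are hypothetical.

```python
import random


def make_partitions(dim, depth, bits, seed=0):
    """Create `depth` independent partitionings, each made of `bits` hyperplanes."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]
            for _ in range(depth)]


def bucket(vec, hyperplanes):
    """Bucket index = sign pattern of the vector against the hyperplanes."""
    code = 0
    for plane in hyperplanes:
        side = sum(p * x for p, x in zip(plane, vec)) >= 0.0
        code = (code << 1) | int(side)
    return code


def item_sketch(vec, partitions):
    """Depth x width count matrix with a single 1 per row."""
    width = 2 ** len(partitions[0])
    sketch = [[0] * width for _ in partitions]
    for row, planes in zip(sketch, partitions):
        row[bucket(vec, planes)] += 1
    return sketch


def add_sketches(a, b):
    """Sketch of a set of items = element-wise sum of the item sketches."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def flatten(sketch):
    return [v for row in sketch for v in row]


partitions = make_partitions(dim=4, depth=3, bits=2, seed=42)
shoe = [0.9, 0.1, -0.3, 0.2]    # made-up item embeddings
sock = [-0.2, 0.8, 0.5, -0.1]

# Purchase history holds two items, browsing history holds one.
purchases = add_sketches(item_sketch(shoe, partitions),
                         item_sketch(sock, partitions))
views = item_sketch(shoe, partitions)

# Concatenating the per-view sketches gives a fixed-size user profile.
user_profile = flatten(purchases) + flatten(views)
print(len(user_profile))
```

The profile length depends only on the sketch depth and width, not on how many items the user interacted with, which is what makes it usable as a fixed-size network input.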

Subsequently, sketches containing squashed user behavioral profiles serve as input to relatively shallow (1-5 layers) feed-forward neural networks. The output of the neural network is also structured as a sketch, with the same structure.

Training is done with a cross-entropy objective in a depth-independent way (output sketches are normalized to 1 across the width dimension). During inference, we perform a sketch readout operation, as in a classic Count-Min Sketch, replacing the minimum operation with a geometric mean, which effectively averages log-probabilities.
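A minimal sketch of such a readout is given below. It assumes that each candidate item is identified by one bucket index per sketch row; the raw output values, the bucket assignments and the small `eps` guard against log(0) are made up for the example.

```python
import math


def normalize_rows(sketch):
    """Normalize each row (depth level) to sum to 1 across the width dimension."""
    normalized = []
    for row in sketch:
        total = sum(row)
        normalized.append([v / total for v in row])
    return normalized


def readout(sketch, buckets, eps=1e-12):
    """Score an item as the geometric mean of its cells, one cell per row.

    Equivalent to exponentiating the average log-probability.
    """
    logs = [math.log(row[b] + eps) for row, b in zip(sketch, buckets)]
    return math.exp(sum(logs) / len(logs))


# Hypothetical network output with depth 3 and width 4.
output = normalize_rows([
    [1.0, 6.0, 2.0, 1.0],
    [5.0, 2.0, 2.0, 1.0],
    [3.0, 3.0, 3.0, 1.0],
])

score_a = readout(output, buckets=[1, 0, 2])  # item hashed to strong cells
score_b = readout(output, buckets=[0, 3, 3])  # item hashed to weak cells
print(score_a, score_b)
```

Unlike the minimum used in a classic Count-Min Sketch, the geometric mean lets every depth level contribute to the score, so a single noisy cell does not decide the ranking on its own.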

4 A few Experiments and Results on open datasets

As far as visual similarity is concerned, we tested our proprietary deep learning models on big open datasets commonly used in this field, i.e. DeepFashion and Street2Shop [14, 15]. Our models outperform the SotA both overall and in individual garment categories [1] (see Tab. 1).

For history/session based model comparison with SotA we used a framework published in [18] and our results are better or comparable to the results of the SotA methods depending on metrics considered (see Tab. 2).

Preliminary results on the MovieLens 20M dataset [8] regarding featured recommendations (without user history) show that our proprietary models are comparable to SotA in the field (P@20 above 20%; we used the benchmark published in [7]). Moreover, our algorithm offers significant speed benefits over other neural competitors. It takes 20 sec to train and 14 sec to return predictions for 6000 users and 4000 movies (around 23,000,000 user/movie combinations in total), compared to recent neural approaches, FastAI recommender [12] (901 sec/57 sec) and NCF [9] (790 sec/50 sec), while achieving comparable results [7] on the same hardware.
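For readers unfamiliar with the P@20 metric, precision at k is the share of the top k recommended items that the user actually found relevant. The snippet below is a generic sketch of that definition; the exact evaluation protocol of the benchmark in [7] may differ, and the item labels are made up.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are in the relevant set."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Toy example: two of the four recommended movies are relevant.
recommended = ["movie_a", "movie_b", "movie_c", "movie_d"]
relevant = {"movie_b", "movie_d", "movie_x"}

p_at_4 = precision_at_k(recommended, relevant, k=4)
print(p_at_4)
```

A P@20 above 20% therefore means that, on average, more than 4 of the 20 recommended movies are relevant to the user.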

Table 1: Comparison of performance on our models and currently published SotA in visual similarity research. Metrics are commonly used in the task [15]

Table 2: Comparison of performance on our models and currently published SotA in session-based recommenders research. Metrics are commonly used in the task [18]

5 Use Cases

In current production deployments our platform achieves 20% - 30% improvements in average order size (AOS) and 10% - 60% improvements in average order value (AOV) in comparison to the system without our custom ML-based recommendations, as calculated by A/B tests. The numbers vary significantly depending on quality of product and user data, as well as recommendation visibility and website structure.

In Fig. 4, 5 and 6 different recommendation scenarios in various product categories are shown: visually similar products and personalized recommendations based on user interactions in various e-commerce platforms.

In Tab. 3 we also provide data about user history and recommendations in the electronics category.

Figure 4: Recommendation box – visually similar dresses

Recommendation analytics in our system provide an easily customized interface to show aggregated results and plots (for an example see Fig. 7 and Fig. 8).

6 Discussion and Conclusion

In this paper we presented our custom platform and algorithms that can be fed with multimodal, multi-view data, fused and aggregated effectively. We showed use cases with various scenarios and data feeds. With our algorithms, described in previous sections, we achieved very good results in various e-commerce stores and we exceeded state-of-the-art results on open recommendation datasets. Deployment of our system in a new e-commerce store takes about one workday, thanks to a modular architecture easily adaptable to clients’ APIs and data feeds of different formats.

Given the elegant nature of sketch representations of any embeddings learned in an unsupervised way, allowing for compact representation and additive composability, future applications are numerous. Our focus for current work includes: product propensity models, demand forecasting, improved search personalization and recommendation of non-product entities (e.g. coupons, offers, brands). We are also extending our framework with interpretability functionality (XAI methods) to be even more useful in business contexts.

Figure 5: Recommendation box at main page of a service – personalized on user interactions in the same session

Figure 6: Recommendation box – cold start in a session based on previous user behavioral history

Table 3: Examples of recommendations for our personalized recommenders in electronic products. We can also see complementary products that fit very well to the viewed product.

Figure 7: Recommendation analytics interface – aggregated results

Figure 8: Recommendation analytics plot – viewed and clicked recommendations

References

[1] anonymous. 2020. –.
[2] Jonas Bonér. 2016. Reactive Microservices Architecture. O'Reilly Media, Inc. https://www.oreilly.com/library/view/reactive-microservices-architecture/9781491975664/ (book).
[3] Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. 2019. Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches. In Proceedings of the 13th ACM Conference on Recommender Systems (Copenhagen, Denmark) (RecSys ’19). Association for Computing Machinery, New York, NY, USA, 101–109. https://doi.org/10.1145/3298689.3347058
[4] Eric Dodds, Huy Nguyen, Simao Herdade, Jack Culpepper, Andrew Kae, and Pierre Garrigues. 2018. Learning Embeddings for Product Visual Search with Triplet Loss and Online Sampling. arXiv:1810.04652 [cs] (Oct. 2018). http://arxiv.org/abs/1810.04652 arXiv: 1810.04652.
[5] Wenjing Fu, Zhaohui Peng, Senzhang Wang, Yang Xu, and Jin Li. 2019. Deeply Fusing Reviews and Contents for Cold Start Users in Cross-Domain Recommendation Systems. Proceedings of the AAAI Conference on Artificial Intelligence 33 (Jul. 2019), 94–101. https://doi.org/10.1609/aaai.v33i01.330194
[6] Diksha Garg, Priyanka Gupta, Pankaj Malhotra, Lovekesh Vig, and Gautam Shroff. 2019. Sequence and Time Aware Neighborhood for Session-Based Recommendations: STAN. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Paris, France) (SIGIR’19). Association for Computing Machinery, New York, NY, USA, 1069–1072. https://doi.org/10.1145/3331184.3331322
[7] Scott Graham, Jun-Ki Min, and Tao Wu. 2019. Microsoft Recommenders: Tools to Accelerate Developing Recommender Systems. In Proceedings of the 13th ACM Conference on Recommender Systems (Copenhagen, Denmark) (RecSys ’19). Association for Computing Machinery, New York, NY, USA, 542–543. https://doi.org/10.1145/3298689.3346967
[8] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 5, 4, Article 19 (Dec. 2015), 19 pages. https://doi.org/10.1145/2827872
[9] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In Proceedings of the 26th International Conference on World Wide Web (Perth, Australia) (WWW ’17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 173–182. https://doi.org/10.1145/3038912.3052569
[10] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based Recommendations with Recurrent Neural Networks. arXiv:cs.LG/1511.06939
[11] Balázs Hidasi, Massimo Quadrana, Alexandros Karatzoglou, and Domonkos Tikk. 2016. Parallel Recurrent Neural Network Architectures for Feature-Rich Session-Based Recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems (Boston, Massachusetts, USA) (RecSys ’16). Association for Computing Machinery, New York, NY, USA, 241–248. https://doi.org/10.1145/2959100.2959167
[12] Jeremy Howard et al. 2018. fastai. https://github.com/fastai/fastai.
[13] Jonas Bonér, Dave Farley, Roland Kuhn, and Martin Thompson. 2014. Reactive Manifesto. https://www.reactivemanifesto.org.
[14] Zhanghui Kuang, Yiming Gao, Guanbin Li, Ping Luo, Yimin Chen, Liang Lin, and Wayne Zhang. 2019. Fashion Retrieval via Graph Reasoning Networks on a Similarity Pyramid. arXiv:1908.11754 [cs] (Aug. 2019). http://arxiv.org/abs/1908.11754 arXiv: 1908.11754.
[15] Michal Kucer and Naila Murray. 2019. A Detect-Then-Retrieve Model for Multi-Domain Fashion Item Retrieval. In CVPR Workshops.
[16] Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural Attentive Session-Based Recommendation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (Singapore, Singapore) (CIKM ’17). Association for Computing Machinery, New York, NY, USA, 1419–1428. https://doi.org/10.1145/3132847.3132926
[17] Malte Ludewig and Dietmar Jannach. 2018. Evaluation of session-based recommendation algorithms. User Modeling and User-Adapted Interaction 28, 4-5 (Oct 2018), 331–390. https://doi.org/10.1007/s11257-018-9209-6
[18] Malte Ludewig, Noemi Mauro, Sara Latifi, and Dietmar Jannach. 2019. Empirical Analysis of Session-Based Recommendation Algorithms. In RecSys.
[19] Meirui Wang, Pengjie Ren, Lei Mei, Zhumin Chen, Jun Ma, and Maarten de Rijke. 2019. A Collaborative Session-Based Recommendation Approach with Parallel Memory Modules. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Paris, France) (SIGIR’19). Association for Computing Machinery, New York, NY, USA, 345–354. https://doi.org/10.1145/3331184.3331210
[20] Zekai Wang, Hongzhi Liu, Yingpeng Du, Zhonghai Wu, and Xing Zhang. 2019. Unified Embedding Model over Heterogeneous Information Network for Personalized Recommendation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, 3813–3819. https://doi.org/10.24963/ijcai.2019/529
[21] Shu Wu, Yuyuan Tang, Yanqiao Zhu, Liang Wang, Xing Xie, and Tieniu Tan. 2018. Session-based Recommendation with Graph Neural Networks. CoRR abs/1811.00855 (2018). arXiv:1811.00855 http://arxiv.org/abs/1811.00855
[22] Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative Knowledge Base Embedding for Recommender Systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). Association for Computing Machinery, New York, NY, USA, 353–362. https://doi.org/10.1145/2939672.2939673
[23] Ziwei Zhang, Peng Cui, Haoyang Li, Xiao Wang, and Wenwu Zhu. 2018. Billion-scale network embedding with iterative random projection. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 787–796.

Authors:

Anna Wroblewska, Synerise and Faculty of Mathematics and Information Science, Warsaw University of Technology
Andrzej Michalowski, Synerise
Mikolaj Wieczorek, Synerise
Jacek Dabrowski, Synerise 
Michał Daniluk, Synerise
Michał Pastuszak, Synerise 
Barbara Rychalska, Warsaw University of Technology
Sylwia Sysko-Romanczuk, Warsaw University of Technology