Many unsupervised representation learning methods belong to the class of similarity learning models. While various modality-specific approaches exist for different types of data, a core property of many methods is that representations of similar inputs are close under some similarity function.

We propose EMDE (Efficient Manifold Density Estimator) - a framework utilizing arbitrary vector representations with the property of local similarity to succinctly represent smooth probability densities on Riemannian manifolds. Our approximate representation has the desirable properties of being fixed-size and having simple additive compositionality, thus being especially amenable to treatment with neural networks - both as input and output format, producing efficient conditional estimators.

We generalize and reformulate the problem of multi-modal recommendations as conditional, weighted density estimation on manifolds. Our approach allows for trivial inclusion of multiple interaction types, modalities of data as well as interaction strengths for any recommendation setting. Applying EMDE to both top-k and session-based recommendation settings, we establish new state-of-the-art results on multiple open datasets in both uni-modal and multi-modal settings.

We release the source code and our own real-world dataset of e-commerce product purchases, with special focus on modeling of the item cold-start problem.

## Abstract

Many unsupervised representation learning methods belong to the class of similarity learning models. While various modality-specific approaches exist for different types of data, a core property of many methods is that representations of similar inputs are close under some similarity function. We propose EMDE (Efficient Manifold Density Estimator) – a framework utilizing arbitrary vector representations with the property of local similarity to succinctly represent smooth probability densities on Riemannian manifolds. Our approximate representation has the desirable properties of being fixed-size and having simple additive compositionality, thus being especially amenable to treatment with neural networks – both as input and output format, producing efficient conditional estimators. We generalize and reformulate the problem of multi-modal recommendations as conditional, weighted density estimation on manifolds. Our approach allows for trivial inclusion of multiple interaction types, modalities of data as well as interaction strengths for any recommendation setting. Applying EMDE to both top-k and session-based recommendation settings, we establish new state-of-the-art results on multiple open datasets in both uni-modal and multi-modal settings. We release the source code and our own real-world dataset of e-commerce product purchases, with special focus on modeling of the item cold-start problem.

## 1 Introduction

An increasingly common problem setting in many domains is that predictions must be made from multi-modal interaction data, with many interaction types, object metadata coming from multiple domains, and having different types of attributes. Interactions often have one or more weight values attached, representing e.g. cardinality, strength, duration, or monetary value. Such a landscape is prevalent in applications from IoT sensor networks, through social networks to brick-and-mortar retail and e-commerce. A natural habitat for multimodal, multi-attribute interaction data is that of *recommender systems*. Their task is to suggest items which a user might find interesting, often in the setting of online stores or social media. There exist multiple recommendation-based sub-tasks such as session-based recommendation (where the input is an ordered collection of items based on a single user shopping session), or top-k recommendation (where the item collection is an unordered set of items which user found interesting, usually over a longer timespan). Usually, each of these sub- problems is solved by one specialized algorithm. Moreover, state-of-the-art collaborative filtering systems in both top-k and session-based recommendation usually offer no simple way to incorporate multi-modal item information i.e. images or text, and are limited to a single type of interaction. However, a joint multi-modal representation of a single object is a native construct of the human brain which enables us to form complex associations based on multi-level semantic similarity [36] [37], so performance gains can be expected from such representations.

In order to address these issues, we present EMDE (Efficient Manifold Density Estimator) – a one- pass algorithm for efficient, approximate representation of smooth probability densities on Rieman- nian manifolds. EMDE can utilize any locally-similar embedding to construct a structured repre- sentation of the vector embedding manifold (that we call *sketch*). It preserves local geometry, and allows estimation of arbitrary smooth probability densities out of samples scattered on the manifold.

In this paper, we generalize and reformulate recommendation tasks as conditional density estimation on manifolds allowing us to use EMDE. We then apply EMDE to both top-k and session-based rec- ommendation tasks – considering *multiple types of interactions *as well as *multi-modal input*. Sym- metrically we produce *multi-modal output*, which enables an elegant solution of the item cold-start problem (recommending items with no interactions, due to being freshly introduced into inventory). [9] and [26] observe the *phantom progress *problem in recommender systems: carefully tuned simple heuristics (such as nearest-neighbor methods) in practice often outperform complex deep learning models, while algorithm performance is heavily dependent on the dataset and chosen performance metrics. In response to this, we test our approach on a wide range of referential metrics and datasets. We show that our approach consistently outperforms both the most advanced models and simple heuristics alike.

Our contributions are the following:

**We introduce an efficient, approximate representation of smooth probability densities on Riemannian manifolds called EMDE.** It represents real-valued multisets of arbitrary high-dimensional vectors on a locally-similar embedding manifold in the form of *locality-sensitive sketches*. Repre- sentations produced by EMDE have constant size and are independent of the number of samples and original embeddings dimensions. Thus the size of downstream model, that utilizes this representa- tions, does not depend on the number of samples and embedding dimensionality. Flexibility comes from the ability to combine various modalities of input data without the necessity to create a joint embedding model, the ability to aggregate multiple samples with attached weights into a fixed-size structure, and from an efficient readout method.

**We establish new state-of-the-art results on several datasets for session-based and top-k sce- narios.** To this end, we introduce a generalization and reformulation of recommendation tasks as conditional density estimation on manifolds. Our formulation trivially allows the use of multi-modal data, multiple types of interactions, and arbitrary non-negative weights attached to interactions (e.g. counts, durations, monetary values), for all types of recommendation problems.

**We prove that EMDE naturally solves the item cold-start problem.** We also release a new dataset: *Online Sales*, aimed specifically at testing item cold-start scenario. The dataset is collected from real life user-item interactions in our system. This allows us to acquire naturally occurring cold start items, avoiding artificial manipulations of older datasets where cold start items are naturally few [13]. We also release the code for EMDE and our experiments.

## 2 Related Work

Sketching-based density estimators. [4] introduce methods for KDE based on locality-sensitive hashing (LSH). Subsequently [38] utilize a sketch-based structure for a compressed representation of multiple LSH partitions for KDE estimation. Both methods require a computation-heavy sampling procedure to arrive at density estimates. [5] introduce RACE - a LSH sketch-based method for KDE, which does not require sampling to arrive at density estimates. [6] further explores the technique of LSH sketching for approximate nearest-neighbor search on streaming data. In these methods, the manifold considered is Rn. We instead disentangle the notion of the manifold on which data lies from the densities to be estimated on the manifold. We do not use kernel functions, and our estimates are piecewise-constant. Our primary interest is in manifold probability densities as an input and output format for neural models.

**Session-based recommenders.** Following [26] we compare our method to both simple baselines like S-KNN, V-SKNN and recent deep-learning based models i.e. Gru4Rec [16], NARM [24], STAMP [24], NextItNet [47], and SR-GNN [46].

**Top-k recommenders.** [9] find out that among 18 recently published algorithms, only 7 could be reproduced, and 6 of them can often be outperformed with simple heuristics. Thus, we compare our method to MultVAE [25], as it proves to be the only method among the 18 that could be both reproduced and is not outperformed by simple methods. We also implement two recent state-of-the- art methods MacridVAE [29] and EASE [39].

**Content-based recommenders.** Methods with explicit side-information are often tuned to specific datasets or tasks (e.g. news recommendation) [10] [17]. To the best of our knowledge, there has been no attempt at a generalized multi-modal system in session-based setting. The more general top- k recommender systems ingest image [15], audio [42] or knowledge base [48] [19] side information, or a selected combination of modalities [2]. [45] learn a joint representation of content information and collaborative filtering ratings. [13] extend SVD to enable incorporation of categorical side data. Although some of them are very recent, none of these methods has been shown to outperform the top performing uni-modal models such as EASE or MacridVAE, their evaluation is often done against weaker baselines.

Our work differs from cited papers in several aspects: 1) we outperform the best currently known recommendation systems, 2) we impose no restrictions on the task (top-k or session-based), or dataset, 3) we make no assumptions as to the type and number of utilized data modalities 4) we support partial modality coverage and do not discard items with missing modalities, and 5) we support a cold-user scenario, where test set can contain unseen users.

## 3 Algorithm

Intuitively, EMDE is a form of a piecewise-constant probability density estimator which can work on arbitrary Riemannian manifolds locally embedded in Rn, while efficiently scaling to extremely large dimensionalities. Our solution is inspired by two well known algorithms: locality sensitive hashing (LSH) [20] and count-sketch / count-min sketch (CMS) [8]. We utilize vector representa- tions coming from upstream metric representation learning methods on any modality of data to per- form multiple partitionings of the embedding manifold, using data-dependent LSH methods. The partitionings define regions of the manifold analogous to CMS hash function buckets. When multi- ple such partitionings are combined, intersection of regions from independent partitionings allows to describe sub-regions much smaller than the resolution of each individual partition. This way of representation constitutes a compressed map of the manifold, allowing to store and accumulate val- ues assigned to local regions. Pointwise estimates can then be retrieved efficiently by aggregating stored values over regions overlapping the query point.

### 3.1 Unsupervised similarity learning and Riemannian manifolds

The goal of (deep) metric representation learning is to learn a function hθ(x) : X → Rn mapping inputs from the data manifold in X onto points in Rn which are metrically close if and only if they are semantically similar. In practice, hθ(X) ⊂ Rn forms a Riemannian manifold locally embedded in euclidean space. Our method requires that the aforementioned property of *local similarity *under a locally-euclidean metric holds at least approximately. Not all methods of deep representation learn- ing follow the metric learning paradigm, even though they are very popular in practice, optimizing e.g. skip-gram, masked-language-model, next sentence prediction, CPC, InfoNCE or DeepInfoMax objectives [31][11][43][18]. Nonetheless, [23] show that all the aforementioned self-supervised objectives correspond to InfoNCE, while [40] observe that InfoNCE has a direct formulation in terms of metric learning. Thus, most existing representation learning methods produce embeddings indirectly optimized for local similarity, and can be utilized by our method.

The starting point of our algorithm is a manifold M := hθ(X) locally embedded in R . Our goal is to create a compressed, approximate, piecewise-constant representation for any smooth probability density on the manifold, from point samples.

### 3.2 Efficient Manifold Density Estimator

Given samples from a data manifold M ⊂ Rn and parameters N, K ∈ N+ we first construct a density-dependentmappingfunctionV(x):M→(e1,...,eN)whereei ∈{1,...,2K}assigning inputs x ∈ M to multiple local regions of the manifold. To this end, we utilize a modified version of LSH algorithm we call Density-dependent LSH (DLSH). We start with choosing K random vectors ri, then for v ∈ M we let hashi(v) = sgn(v · ri − bi), drawing the bias value bi from Qi(U ∼ Unif[0, 1]), where Qi is the quantile function of {v · ri : v ∈ M}. In contrast to LSH, this scheme is density-dependent, cutting the the manifold into non-empty parts, thus avoiding unutilized regions. We combine K independent binary hash values into bit strings, which we interpret as short integers, giving a partitioning of the input manifold into 2K regions. We perform the procedure N times, resulting in a *sketch structure *of *width *2K and *depth *N. Local similarity of the data manifold allows V (x) to assign semantically similar inputs to the same sketch regions frequently. This effect captures the local metric prior induced by the underlying representation learning method which produced the vectors M. In practice M can represent a single view or modality of information, such as text, image, audio or network interaction embeddings.

An empty sketch is instantiated as 2-dimensional array N × 2K and can be indexed by the outputs of V . Filled with zeros, it represents a degenerate zero density. We add samples x ∈ H weighted by w ∈ R+ from a smooth probability measure on the manifold M, where f is the probability density function, by performing sketchsamples[V (x)] := sketchsamples[V (x)] + wx for all x ∈ Hand+

their corresponding weights w ∈ R . Our final representation is sketch := sketchsamples/ wx. From the definitions of sketchsamples and wx it is clear that both are additive, can be merged with simple point-wise summation and can be constructed incrementally in a streaming setting. We call values stored in the final array *sketch content*.

To get an un-normalized point estimate f of the probability density function f , for any z ∈ M, we let f(z) = GeometricMean(sketch[V (z)]). We verify the choice of geometric mean empirically, while strong theoretical arguments behind its suitability can be found in [12] and [21], when we notice that every sketch contains N probability mass function estimates over all 2K regions of the manifold M each, effectively forming an ensemble of probability densities.

Due to their additive compositionality and fixed-size representation, our sketch structures are a nat- ural fit for representing real-valued multisets of vectors (i.e. sets of vectors with attached weights). Sketch *structure*, defined by the mapping function V (x), captures local semantic similarity of data

– inputs metrically close on the underlying manifold have a high probability to share buckets in the sketch array. Sketch *content *refers to the accumulated density estimates. For example user’s purchased, viewed, and searched items can all be represented in sketches with visual item simi- larity *structure*, but represent different *content*. Multiple such *structure*-*content *combinations are possible, enabling numerous usage scenarios. Sketches with different *structure *and *content *can be concatenated, while retaining their favorable properties.

### 3.3 Conditional and multi-modal estimation

A simple feed-forward neural network can be used as a conditional density estimator, where sketches are both the input and output format. This allows for both multi-modal input and output. We set the loss function for conditional estimation to be Softmax + CrossEntropy, where both are applied only width-wise – independently for each level of depth for every sketch, and the resulting values averaged. This approach is consistent with sketches representing an ensemble of PMFs over regions of the manifold.

### 3.4 Recommendation as conditional density estimation on manifolds

An example research field where multiple modalities of data are present and multisets of events must be aggregated into compact representations, is that of *recommender systems*. Items can be described by their interactions with users (sometimes interactions of different types, e.g. click, purchase, favorite), by their names, attributes, images – all of which constitute different modalities of input. Users are represented by their interactions with items. Assuming each item is described in several modalities X1, . . . , Xp, and via upstream unsupervised or self-supervised representation learning methods, and those modalities are encoded on manifolds M1 , . . . , Mp , we can construct sketches structurally representing the manifolds and concatenate them.

Recommendation problems have a natural interpretation as density estimation – user-item interac- tions can be considered to be samples from a user’s interest distribution over the space of all items. For top-k recommendations, we treat both input and output as simple sets, which can be represented as a uniform probability distribution over all inputs in the set. So in the density representation, we set all weights w = 1. For session-based recommendations, when items can occur multiple times, with timestamps or sequential ordering, we can utilize weights w to reflect their relative importance, e.g. setting w = occurrences(item) or w = λposition(item) to reflect decaying interest in items seen long ago. It is worth noting that other non-negative scalars accompanying interactions, e.g. amount of money spent on an item, duration of time an item was viewed, etc. can be trivially incorporated with our framework if available – via the weights W either in place of, or in addition to the proposed sketches via concatenation.

We apply this form of multi-modal encoding to both input and target of the learning problem, using a simple feed-forward neural network to learn conditional estimation in between. The input may have more sketches, e.g. differentiating between most recent and historic interactions, while the output needs to have at least one modality to perform item score retrieval. Retrieving item scores is performed by applying the point estimate procedure f ( z ) to item vectors z ∈ M. We pick iitems with the highest scores as output recommendations. Contrary to conventional recommendation systems, provided that item embeddings are given beforehand, computational complexity of the training phase is independent of the number of items, and depends exclusively on the number and dimensions of input and output sketches – item scores are retrieved only during inference or validation with recommendation-specific metrics.

### 3.5 Simple baseline for locally similar network embedding

In order to disentangle performance of our framework from the quality of common upstream rep- resentation learning methods, we utilize an extremely simple network embedding method with the desired property of local similarity for both item attributes and interaction data.

Given an interaction network (e.g. between users and items) with edges E, we define the random walk transition matrix M, where where eij is the number of edges running from i to j, as Mij = eij/di for i j ∈ E ; 0 f o r i j ∈/ E .

We initialize the embedding matrix T0 ∈ RN ×d ∼ U (−1, 1), where d is the dimensionality. Then for q iterations, we perform: Ti = M · Ti−1; L2 normalize rows of Ti. Tq is our final embedding matrix. The method can be interpreted as iterated L2 normalized weighted averaging of neighbor- ing nodes’ representations. After just 1 iteration, nodes with similar 1-hop neighborhoods in the network will have similar representations. Further iterations extend the same principle to q-hop neighborhoods. As our experiments show, despite its extreme simplicity, the method works well enough for EMDE.

## 4 Experiments

We report results for unimodal EMDE (no multimodal data, only basic user-item interactions), and EMDE MM – configurations where selected multimodal channels are present. For each experiment, we carefully fine-tune the baselines or use the best configurations reported by the authors. We implement all algorithms in the same frameworks of [9] [27], keeping to their selected performance measures and datasets. We simplify our experimental setting by using modality embeddings obtained with our embedding algorithm described in 3.5, which is the same for representing both text and interaction networks (for text we create a graph of item-word edges). We leave experiments with elaborate embeddings such as BERT [11] for future research. We explore only a subset of all possible dataset modalities. In Tables 1 and 2 we show the results while detailed descriptions of the datasets can be found in the Appendix.

### 4.1 EMDE as a Session-Based Recommender System

We compare against benchmark methods from [27], adding one recent graph neural model [46].

**Datasets.** We conduct experiments on six datasets, three from the music domain (NOWP [35], 30MUS [41], AOTM) and three from the e-commerce domain (RETAIL, DIGI, RSC15). We use evaluation framework from [27], with already optimized hyper-parameters for other methods and a number of metrics evaluated on these datasets. Each dataset was split into five time-contiguous subsets to be able to make multiple measurements in order to minimize the risk of random effects.

**EMDE Configuration** On top of EMDE, we use a simple 3-layer feed forward network with leaky ReLU activations, batch normalization and residual connections between layers. Whenever possible, we incorporate multimodal item features such as additional metadata or data from different sources such as search queries, playlists, purchases etc. More details about training configuration for each dataset are available in Appendix 1.2.

**EMDE Performance.** As presentend in Table 1, EMDE significantly outperforms competing ap- proaches when multi-modal data is available (EMDE MM), and it is a very strong baseline in the uni-modal case (EMDE). [27] find that neural methods generally underperform compared to simple approaches (nearest-neighbors). In contrast, EMDE MM – a neural model – outperforms all competi- tors on RETAIL, DIGI and 30MU datasets by a large margin. We were unable to locate additional modalities of item data for 2 music datasets (AOTM and NOWP). This problem is rather artificial as in a practical setting each item is nearly always characterized with some multimodal data (i.e. prod- uct name and attributes at the very least). Still, EMDE achieves state-of-the-art results on NOWP and has the highest MRR on AOTM. Consistently with [27], the RSC15 dataset proves to be an outlier where EMDE is either the top performer or a close follower of NARM depending on metric.

### 4.2 EMDE as a top-k Recommender System

**Datasets.** We conduct experiments on two real-world, large-scale datasets: Netflix Prize [3] and MovieLens20M. The datasets contain movie ratings from viewers. Dataset characteristics and pre- processing are consistent with [9].

**Baselines.** We compare against recent state-of-the-art VAE-based neural models: MultVAE [25] and MacridVAE [29], and a non-neural algorithm EASE [39], as well as simpler baselines used by [9].

**EMDE Configuration** On top of EMDE, we use a simple one-hidden-layer feed-forward neural network with 12,000 neurons and leaky ReLU activations. As additional modalities we choose the interactions of users with disliked items (items which received a rating lower than 4). For Movie- Lens20M, we also experiment with textual descriptions of movies and lists of movie actors, treated as a network with our embedding method (edges between items and tokens, or items and cast mem- bers). We do not observe improvements with incorporation of these modalities, which we hypothe- size to be implicitly modeled by liked and disliked item interactions. For Netflix, there is no multi- modal data accessible apart from movie titles and movie production year (both have low relevance for user preferences). We put 80% of randomly shuffled user items into the input sketch, and the remaining 20% into the output sketch to reflect train/test split ratio of items for a single user.

**EMDE Performance**. We observe that our approach consistently and significantly outperforms the baselines especially for lower *k *values in the top-k recommended item rankings, which is consistent with CMS being a heavy-hitters estimator. The density of output sketches is determined by data distribution, e.g. the median user liked item count for Netflix is 60, so the median sketch density is 12 items (20% out of 60) - the sketch is not expected to decode items for much higher *k*. In practice, the very top recommended items are key for user satisfaction as they are given the most attention by users, considering the limitation of item display capabilities and user’s attention in the real world.

### 4.3 EMDE as a Multimodal Output Recommender System for cold item prediction

In the above experiments recommender output consisted of only one sketch with both structure and density based on user-item interactions. Nonetheless, we can utilize a concatentation of multi- modal sketches on the output. This approach permits additional functionalities, such as solving the problem of cold items. Such items can be few in mature datasets collected over a longer timespan (MovieLens20M only has 612 of them for 20,720 total items), so dataset manipulations to artificially sample cold items are necessary [13]. In practice however cold items can be numerous, especially in newly opened stores or quickly changing item stock. There are works which explicitly focus on the cold item scenario [2] [44], however we treat this functionality as a nice added benefit to other advantages of EMDE.

*Online Sales *Dataset. To truthfully reflect the scale of the problem in the real world top-k setting, we collect user-item interactions from a single store in our production system for one month. After filtering out users with less than 2 interactions, we obtain 12491 purchased items and 12958 users from which we sample 1500 users for both valid and test set. For the 3000 non-trainable users, we obtain 1675 cold items, which means that on average, 56% of users have bought a cold item. Additionally, as the second input/output modality, for each item we collect product titles, which are represented as a bag-of-words composed of hashed tokens. Whenever an item cannot be retrieved from the regular interactions sketch output, given no interactions in the training set, its score is retrieved from just the item title sketch. Non-cold items have their scores averaged from both in- teraction and title sketches. Our results show that this way many relevant cold-start items can be retrieved by multioutput EMDE (EMDE MM mout) (Table 2). Contrary to previous tendencies, here we obtain significant gains especially for large *k*. This is because densities in the titles sketch tend to be more uniform (i.e. lower for top results). We boost the scores of cold items by multiplying them by a constant of 1.3 estimated empirically on the validation set. Depending on practical need, the scores can be artificially boosted even further to suggest more recently added items.

Simple baselines generally outperform other neural methods on this dataset, probably because of the small size of this dataset which makes complex models hard to optimize. Especially MultVAE proved hard to converge, as the obtained results were close to 0 for all metrics irrespective of the parameter configuration.

### 4.4 Ablation studies

In order to understand the effects of crucial parameters for training and decoding of EMDE, we conduct additional experiments in session-based scenario. We run experiments on RETAIL dataset, treating the best configuration for this dataset as our baseline. We report the results for MRR@20 and P@20 in Table 3. Results for other metrics are available in the Appendix.

**Manifold partitioning.** In addition to our density-dependent LSH variant (DLSH) we verify the impact of partitioning the manifold with product quantization methods (PQ) which decompose high- dimensional space into the Cartesian product of low-dimensional subspaces which are quantized separately. We also analyze its enhanced and optimized version (OPQ) [14]. We can see that DLSH is a strong baseline, leading at MRR@20 in comparison to other coders. However, we note that OPQ achieves competitive results, which indicates potential for improvement in density-based manifold partitioning methods.

**Score ensembling.** We perform point estimates from the output sketch by ensembling independent probability mass functions across sketch depth using geometric mean. While arguments behind the choice of geometric mean for ensembling probability measures can be found in [12] and [21], we empirically confirm the choice, comparing with: minimum, arithmetic mean and harmonic mean.

**Sketch width and depth. **Considering the fact that we ensemble densities from all sketch regions depth-wise, the higher the *depth*, the lower is the variance of final values. Sketch *width *is responsible for the amount of regions the manifold is split into. Holding depth × width constant, we investigate the trade-off between them. Obtained results show that *width *of 128 achieves best results. In our experiments with other datasets, *width *of 128 seems to be a universally good value.

**Choice of manifold** We investigate how the choice of input embedding manifold influences performance of downstream recommendation tasks. We compare our multi-modal item sketches with uni-modal sketches built on item-user interaction embeddings, item attribute metadata embeddings, and random item embeddings without a metric prior. All sketches have the same dimensions for a fair comparison. Not only random item vectors have the lowest performance due to the lack of a metric prior, but the manifolds for item interaction similarity and item attribute similarity confer different benefits, outperforming each other in MRR@20 and P@20 respectively. The multi-modal sketch is a clear winner in both metrics.

## 5 Conclusions

We present an efficient multimodal density estimator on Riemannian manifolds, which achieves state-of-the-art results in recommendation system setting. EMDE is computionally efficient, scal- able and allows to efficiently encode multisets. In contrast to other recommendation systems, we can use one algorithm for session-based and top-k recommendations. Although we focused on rec- ommendations here, efficiency and flexibility of EMDE indicate that it could be applied to other machine learning tasks, which we leave for future research. In the next steps we plan to investigate EMDE interpretability, a reformulation for scalar field estimation and applications to cross-modality recommendation.

## 6 Broader Impact

We believe that our work has a generally positive overall impact. More accurate recommendations will allow consumers to receive suggestions more to their liking, reducing their time browsing shop inventories. It is known that exposition to too many irrelevant options leads to the *paradox of choice *– an action paralysis and poor final choice, followed by dissatisfaction [33]. With growing sizes of shop inventories, it is important to shield customers from this negative psychological effect. Greater customer exhibition to relevant items will also mean higher turnover in retail.

A negative effect might be that more shops will be forced to invest in appropriate hardware for running the more elaborate recommender systems. However, we believe that shops will be willing to invest as the overall effort of creating valid recommendations will be lowered by the use of an accurate digital recommender.

**References**

[1] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In *Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data*, SIGMOD ’93, page 207–216, New York, NY, USA, 1993. Association for Computing Machinery.

[2] Oren Barkan, Noam Koenigstein, Eylon Yogev, and Ori Katz. Cb2cf: A neural multiview content-to-collaborative filtering model for completely cold item recommendations. In *Pro- ceedings of the 13th ACM Conference on Recommender Systems*, RecSys ’19, page 228–236, New York, NY, USA, 2019. Association for Computing Machinery.

[3] James Bennett, Stan Lanning, and Netflix Netflix. The netflix prize. In *In KDD Cup and Workshop in conjunction with KDD*, 2007.

[4] M. Charikar and P. Siminelakis. Hashing-based-estimators for kernel density in high dimen- sions. In *2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS)*, pages 1032–1043, 2017.

[5] Benjamin Coleman and Anshumali Shrivastava. Sub-linear race sketches for approximate ker- nel density estimation on streaming data. pages 1739–1749, 04 2020.

[6] Benjamin Coleman, Anshumali Shrivastava, and Richard G. Baraniuk. Race: Sub-linear mem- ory sketches for approximate near-neighbor search on streaming data, 2019.

[7] ColinCooper,SangLee,TomaszRadzik,andYiannisSiantos.Randomwalksinrecommender systems: exact computation and simulations. pages 811–816, 04 2014.

[8] Graham Cormode and S. Muthukrishnan. An improved data stream summary: The count- min sketch and its applications. In Martín Farach-Colton, editor, *LATIN 2004: Theoretical Informatics*, pages 29–38, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.

[9] Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. Are we really making much progress? a worrying analysis of recent neural recommendation approaches. In *Proceed- ings of the 13th ACM Conference on Recommender Systems*, RecSys ’19, page 101–109, New York, NY, USA, 2019. Association for Computing Machinery.

[10] Gabriel de Souza Pereira Moreira, Felipe Ferreira, and Adilson Marques da Cunha. News session-based recommendations using deep neural networks. In *Proceedings of the 3rd Work- shop on Deep Learning for Recommender Systems*, DLRS 2018, page 15–23, New York, NY, USA, 2018. Association for Computing Machinery.

[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In *Proceedings of the 2019 Confer- ence of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)*, pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

[12] Pierre Dognin, Igor Melnyk, Youssef Mroueh, Jerret Ross, Cicero Dos Santos, and Tom Sercu. Wasserstein barycenter model ensembling, 2019.

[13] Evgeny Frolov and Ivan Oseledets. Hybridsvd: When collaborative information is not enough. In *Proceedings of the 13th ACM Conference on Recommender Systems*, RecSys ’19, page 331–339, New York, NY, USA, 2019. Association for Computing Machinery.

[14] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization for approximate nearest neighbor search. In *2013 IEEE Conference on Computer Vision and Pattern Recognition*, pages 2946–2953, 2013.

[15] Ruining He and Julian McAuley. Vbpr: Visual bayesian personalized ranking from implicit feedback. In *Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence*, AAAI’16, page 144–150. AAAI Press, 2016.

[16] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. Session-based recommendations with recurrent neural networks. In *4th International Conference on Learn- ing Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings*, 2016.

[17] Balázs Hidasi, Massimo Quadrana, Alexandros Karatzoglou, and Domonkos Tikk. Parallel re- current neural network architectures for feature-rich session-based recommendations. In *Pro- ceedings of the 10th ACM Conference on Recommender Systems*, RecSys ’16, page 241–248, New York, NY, USA, 2016. Association for Computing Machinery.

[18] Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Philip Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In *ICLR 2019*. ICLR, April 2019.

[19] Jin Huang, Wayne Xin Zhao, Hongjian Dou, Ji-Rong Wen, and Edward Y. Chang. Improving sequential recommendation with knowledge-enhanced memory networks. In *The 41st Interna- tional ACM SIGIR Conference on Research and Development in Information Retrieval*, SIGIR ’18, page 505–514, New York, NY, USA, 2018. Association for Computing Machinery.

[20] Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. *Conference Proceedings of the Annual ACM Symposium on Theory of Computing*, 604-613, 10 2000.

[21] Mitsuhiro Itoh and Hiroyasu Satoh. Geometric mean of probability measures and geodesics of fisher information metric, 2017.

[22] Iman Kamehkhosh, Dietmar Jannach, and Malte Ludewig. A comparison of frequent pattern techniques and a deep learning method for session-based recommendation. In *RecTemp@RecSys*, 2017.

[23] Lingpeng Kong, Cyprien de Masson d’Autume, Lei Yu, Wang Ling, Zihang Dai, and Dani Yo- gatama. A mutual information maximization perspective of language representation learning. In *International Conference on Learning Representations*, 2020.

[24] Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. Neural attentive session-based recommendation. In *Proceedings of the 2017 ACM on Conference on Informa- tion and Knowledge Management*, CIKM ’17, page 1419–1428, New York, NY, USA, 2017. Association for Computing Machinery.

[25] Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. Variational autoen- coders for collaborative filtering. In *Proceedings of the 2018 World Wide Web Conference*, WWW ’18, page 689–698, Republic and Canton of Geneva, CHE, 2018. International World Wide Web Conferences Steering Committee.

[26] MalteLudewigandDietmarJannach.Evaluationofsession-basedrecommendationalgorithms. *User Modeling and User-Adapted Interaction*, 28(4–5):331–390, December 2018.

[27] Malte Ludewig, Noemi Mauro, Sara Latifi, and Dietmar Jannach. Performance comparison of neural and non-neural approaches to session-based recommendation. In *Proceedings of the 13th ACM Conference on Recommender Systems*, pages 462–466, 2019.

[28] Malte Ludewig, Noemi Mauro, Sara Latifi, and Dietmar Jannach. Performance comparison of neural and non-neural approaches to session-based recommendation. In *Proceedings of the 13th ACM Conference on Recommender Systems*, RecSys ’19, page 462–466, New York, NY, USA, 2019. Association for Computing Machinery.

[29] Jianxin Ma, Chang Zhou, Peng Cui, Hongxia Yang, and Wenwu Zhu. Learning disentangled representations for recommendation. In *NeurIPS*, 2019.

[30] Fei Mi and Boi Faltings. Context tree for adaptive session-based recommendation. *CoRR*, abs/1806.03733, 2018.

[31] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed repre- sentations of words and phrases and their compositionality. In *Advances in neural information processing systems*, pages 3111–3119, 2013.

[32] Xia Ning and George Karypis. Slim: Sparse linear methods for top-n recommender systems. In *Proceedings of the 2011 IEEE 11th International Conference on Data Mining*, ICDM ’11, page 497–506, USA, 2011. IEEE Computer Society.

[33] Antti Oulasvirta, Janne P. Hukkinen, and Barry Schwartz. When more is less: The paradox of choice in search engine use. In *Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval*, SIGIR ’09, page 516–523, New York, NY, USA, 2009. Association for Computing Machinery.

[34] Bibek Paudel, Fabian Christoffel, Chris Newell, and Abraham Bernstein. Updatable, accurate, diverse, and scalable recommendations for interactive applications. *ACM Trans. Interact. Intell. Syst.*, 7(1), December 2016.

[35] Asmita Poddar, Eva Zangerle, and yi-hsuan Yang. nowplaying-rs: A new benchmark dataset for building context-aware music recommender systems. 07 2018.

[36] Rodrigo Quian, Alexander Kraskov, Christof Koch, and Itzhak Fried. Explicit encoding of multimodal percepts by single neurons in the human brain. *Current biology : CB*, 19:1308–13, 08 2009.

[37] Rodrigo Quian, Leila Reddy, Gabriel Kreiman, Christof Koch, and Itzhak Fried. Invariant visual representation by single neurons in the human brain. *Nature*, 435:1102–7, 07 2005.

[38] Paris Siminelakis, Kexin Rong, Peter Bailis, Moses Charikar, and Philip Levis. Rehashing kernel evaluation in high dimensions. In *International Conference on Machine Learning*, pages 5789–5798, 2019.

[39] Harald Steck. Embarrassingly shallow autoencoders for sparse data. In *The World Wide Web Conference*, WWW ’19, page 3251–3257, New York, NY, USA, 2019. Association for Com- puting Machinery.

[40] Michael Tschannen, Josip Djolonga, Paul K. Rubenstein, Sylvain Gelly, and Mario Lucic. On mutual information maximization for representation learning. In *International Conference on Learning Representations*, 2020.

[41] Roberto Turrin, Massimo Quadrana, Andrea Condorelli, Roberto Pagano, and Paolo Cre- monesi. 30music listening and playlists dataset. In *RecSys Posters*, 2015.

[42] Aaron van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep content-based music recommendation. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Wein- berger, editors, *Advances in Neural Information Processing Systems 26*, pages 2643–2651. Curran Associates, Inc., 2013.

[43] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2018.

[44] Manasi Vartak, Arvind Thiagarajan, Conrado Miranda, Jeshua Bratman, and Hugo Larochelle. A meta-learning perspective on cold-start recommendations for items. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, *Ad- vances in Neural Information Processing Systems 30*, pages 6904–6914. Curran Associates, Inc., 2017.

[45] Hao Wang, Naiyan Wang, and Dit-Yan Yeung. Collaborative deep learning for recommender systems. In *Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, KDD ’15, page 1235–1244, New York, NY, USA, 2015. Associ- ation for Computing Machinery.

[46] Shu Wu, Yuyuan Tang, Yanqiao Zhu, Liang Wang, Xing Xie, and Tieniu Tan. Session-based recommendation with graph neural networks. In *AAAI*, 2018.

[47] Fajie Yuan, Alexandros Karatzoglou, Ioannis Arapakis, Joemon M. Jose, and Xiangnan He. A simple but hard-to-beat baseline for session-based recommendations. *CoRR*, abs/1808.05163, 2018.

[48] Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. Collabora- tive knowledge base embedding for recommender systems. In *Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, KDD ’16, page 353–362, New York, NY, USA, 2016. Association for Computing Machinery.

## Authors:

**Jacek Dabrowski**, Synerise **Michał Daniluk**, Synerise **Konrad Gołuchowski**, Synerise **Barbara Rychalska**, Synerise and Warsaw University of Technology **Dominika Basaj**, Synerise and Warsaw University of Technology **Piotr Bąbel**, Synerise **Andrzej Michalowski**, Synerise

Read our latest papers about: A Strong Baseline for Fashion Retrieval with Person Re-Identification models >>>