Synerise, an artificial intelligence company that is one of the fastest growing in CEE and is known for its AI Growth Ecosystem, is open-sourcing the Cleora AI project, a machine learning tool that enables faster and hyper-easy production of graph embeddings for big graphs. Synerise has been working on the project for the last few years and has developed an easy-to-use framework that can be applied to any data set without limitations.
Cleora is a general-purpose model for efficient, scalable learning of stable and inductive entity embeddings for heterogeneous relational data. The framework embeds entities in n-dimensional spherical spaces utilizing extremely fast, stable and iterative random projections, which allows for unparalleled performance and scalability. The tool can embed extremely large graphs & hypergraphs on a single machine.
Types of data that can be embedded include, for example:
- Heterogeneous directed and undirected graphs,
- Heterogeneous undirected hypergraphs,
- Text and other categorical array data,
- Any combination of the above.
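The iterative scheme described above can be illustrated with a minimal sketch. It is not Synerise's Rust implementation. The hash-based initialization, the plain neighbor averaging, and the names `init_vector`, `l2_normalize` and `embed` are assumptions made for this example. It shows only the general idea: start from deterministic vectors, repeatedly average over neighbors, and project back onto the unit sphere.

```python
import hashlib
import math
from collections import defaultdict


def init_vector(entity, dim):
    """Deterministic starting vector in [-1, 1), derived from a hash of the entity id.

    Assumption: the real project may use a different hash and distribution.
    """
    vec = []
    for i in range(dim):
        digest = hashlib.sha256(f"{entity}:{i}".encode()).digest()
        value = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
        vec.append(2.0 * value - 1.0)
    return vec


def l2_normalize(vec):
    """Project a vector onto the unit sphere (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def embed(edges, dim=8, iterations=3):
    """Embed the nodes of an undirected graph given as (a, b) pairs."""
    neighbors = defaultdict(list)
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    emb = {node: init_vector(node, dim) for node in neighbors}
    for _ in range(iterations):
        new_emb = {}
        for node, nbrs in neighbors.items():
            # Average the neighbors' current vectors, then re-normalize.
            avg = [sum(emb[n][d] for n in nbrs) / len(nbrs) for d in range(dim)]
            new_emb[node] = l2_normalize(avg)
        emb = new_emb
    return emb


# Hypothetical user-product interactions.
edges = [("u1", "p1"), ("u1", "p2"), ("u2", "p2"), ("u2", "p3")]
embeddings = embed(edges)
```

Because nothing in the sketch depends on a random seed, running `embed(edges)` twice yields identical vectors.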
"At Synerise we believe that sharing knowledge and showing innovation in the open-source Cleora framework will help many companies to develop amazing, faster solutions in the AI field," says Barbara Rychalska, AI Research Scientist at Synerise.
Key technical features of Cleora embeddings
The embeddings produced by Cleora differ from those produced by Node2vec, Word2vec, DeepWalk and other systems in this class in a number of key properties:
- Efficiency - Cleora is two orders of magnitude faster than Node2vec or DeepWalk,
- Inductivity - as Cleora embeddings of an entity are defined only by interactions with other entities, vectors for new entities can be computed on-the-fly,
- Updatability - refreshing a Cleora embedding for an entity is a very fast operation allowing for real-time updates without retraining,
- Stability - all starting vectors for entities are deterministic, which means that Cleora embeddings on similar datasets will end up being similar. Methods like Word2vec, Node2vec or DeepWalk return different results with every run,
- Cross-dataset compositionality - thanks to stability of Cleora embeddings, embeddings of the same entity on multiple datasets can be combined by averaging, yielding meaningful vectors,
- Dim-wise independence - thanks to the process producing Cleora embeddings, every dimension is independent of the others. This property allows for an efficient, low-parameter method of combining multi-view embeddings with Conv1d layers,
- Extreme parallelism and performance - Cleora is written in Rust, utilizing thread-level parallelism for all calculations except input file loading. In practice this means that the embedding process is often faster than loading the input data.
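Two of the properties listed above, inductivity and cross-dataset compositionality, can be sketched in a few lines. This is an illustration under stated assumptions, not Cleora's API: the function names `embed_new_entity` and `combine_across_datasets` and the toy two-dimensional vectors are hypothetical, and the new entity's vector is taken to be the normalized mean of the vectors of the entities it interacted with.

```python
import math


def l2_normalize(vec):
    """Project a vector onto the unit sphere (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def embed_new_entity(interacted_with, embeddings):
    """Inductivity: derive a vector for an unseen entity from known neighbors."""
    known = [embeddings[e] for e in interacted_with if e in embeddings]
    if not known:
        raise ValueError("no known neighbors to derive an embedding from")
    return l2_normalize(mean_vector(known))


def combine_across_datasets(entity, *embedding_tables):
    """Compositionality: average one entity's vectors from several datasets."""
    views = [table[entity] for table in embedding_tables if entity in table]
    if not views:
        raise KeyError(entity)
    return l2_normalize(mean_vector(views))


# Hypothetical, hand-written vectors for illustration only.
product_embeddings = {"p1": [1.0, 0.0], "p2": [0.0, 1.0]}
new_user = embed_new_entity(["p1", "p2"], product_embeddings)

web_view = {"p1": [1.0, 0.0]}
store_view = {"p1": [0.0, 1.0]}
combined = combine_across_datasets("p1", web_view, store_view)
```

Averaging across datasets is only meaningful here because the starting vectors are assumed to be deterministic, so the same entity lands in comparable regions of the space in each dataset.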
The key usability features of Cleora embeddings from the end-user perspective can be summarized as:
- Heterogeneous relational tables can be embedded without any artificial data pre-processing,
- Mixed interaction and text data sets can be embedded with ease,
- The cold start problem for new entities is easily solved,
- Real-time updates of the embeddings do not require separate solutions,
- Multi-view embeddings work out-of-the-box,
- Incremental embeddings are stable with no need for re-alignment, rotations or other methods,
- Extremely large data sets are supported and can be embedded within seconds/minutes.
Key competitive advantages of Cleora:
- More than 197x faster than DeepWalk,
- ~4x-8x faster than PyTorch-BigGraph, depending on the use case,
- Star expansion, clique expansion, and no expansion support for hypergraphs,
- Quality of results outperforming or competitive with other embedding frameworks like PyTorch-BigGraph, GOSH, DeepWalk, LINE,
- Can embed extremely large graphs & hypergraphs on a single machine.
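For readers unfamiliar with the hypergraph terms in the list above, the sketch below shows the textbook definitions of clique expansion and star expansion. It is an assumption that Cleora's options follow these definitions exactly; the function names and the `hyperedge_<index>` hub naming are hypothetical. The third listed option, no expansion, is not covered here.

```python
from itertools import combinations


def clique_expansion(hyperedges):
    """Replace each hyperedge with an edge between every pair of its members."""
    edges = []
    for members in hyperedges:
        edges.extend(combinations(members, 2))
    return edges


def star_expansion(hyperedges):
    """Add one hub node per hyperedge and connect every member to that hub."""
    edges = []
    for index, members in enumerate(hyperedges):
        hub = f"hyperedge_{index}"  # hypothetical naming scheme for hub nodes
        edges.extend((hub, member) for member in members)
    return edges


# Hypothetical shopping baskets, each treated as one hyperedge.
baskets = [["milk", "bread", "eggs"], ["bread", "jam"]]
clique_edges = clique_expansion(baskets)
star_edges = star_expansion(baskets)
```

Clique expansion grows quadratically with hyperedge size, while star expansion grows linearly at the cost of one extra node per hyperedge, which is the usual trade-off between the two.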
Synerise is positioning Cleora at the top, alongside solutions from tech giants. The quality of its results outperforms or is competitive with other embedding frameworks. The scientific paper with the details will be published in two weeks.
The source code of the solution is already available on the Synerise GitHub account.
"A strong confirmation of the innovation of Cleora's algorithms is that we use them as a part of the Synerise AI Growth Ecosystem. We're leading the way not only in observing the reality happening around us, but also in making a real contribution to science and to the process of creating it," says Jacek Dąbrowski, Chief Artificial Intelligence Officer at Synerise.
Synerise uses Cleora internally, where it works together with Terrarium DB, processing billions of data points in real time and solving multi-modal challenges that involve graph data. The Cleora algorithm is flexible and can be applied to different segments of the market, including retail, banking, and telco behavioral data at scale (billions of entities, trillions of interactions). Cleora is also used to produce input embeddings for EMDE (Efficient Manifold Density Estimation), which was also created by the Synerise team.