A collection of $U(\in \mathbb{N})$ data vectors $\boldsymbol{X}_{\boldsymbol{i}}:=(\boldsymbol{x}_{i_1},\boldsymbol{x}_{i_2},\ldots,\boldsymbol{x}_{i_U}) \in \mathcal{X}^{U} \quad (\mathcal{X} \subset \mathbb{R}^p)$ is called a $U$-tuple, and the association strength $w_{\boldsymbol{i}}$ among the vectors of a tuple $\boldsymbol{X}_{\boldsymbol{i}}$ is termed as the hyperlink weight, that is assumed to be symmetric with respect to permutation of the entries in the index $\boldsymbol{i}=(i_1,i_2,\ldots,i_U)$. An example consisting of such tuples and hyperlink weights is co-authorship network, where data vector $\boldsymbol{x}_i$ represents the attributes in the researcher $i \in [n]$ such as number of publications in each journal, and the hyperlink weight $w_{\boldsymbol{i}} \in \{0,1,2,\ldots\} \subset \mathcal{S}( \subset \mathbb{R})$ represents the number of co-authored papers written by all the $U$ researchers indexed by $\boldsymbol{i}=(i_1,i_2,\ldots,i_U)$.

Okuno and Shimodaira (arXiv:1908.02573, submitted) proposes a general framework for hyper-relational learning named Bregman hyperlink regression (BHLR), which learns a user-specified symmetric similarity function $\mu_{\boldsymbol{\theta}}:\mathcal{X}^U \to \mathcal{S}(\subset \mathbb{R})$ such that it predicts the tuple's hyperlink weight from data vectors stored in the $U$-tuple. Nonlinear functions, such as neural networks, can be employed for the similarity function. BHLR is based on Bregman divergence (BD) and encompasses various existing methods in the following Table. We also demonstrate that, regardless of the choice of BD and $U \in \mathbb{N}$, the proposed BHLR is generally proved to possess the following properties (P-1) statistical consistency and (P-2) computational tractability. Consequently, BHLR including several existing methods, that have been examined only experimentally, is theoretically justified in a unified manner.

• (P-1) statistical consistency. Whereas some tuples in hyper-relational learning may be constrained as they may share some data vectors therein, our Theorem 1 generally proves that the similarity $\mu_{\hat{\boldsymbol{\theta}}_{\varphi,n}}(\boldsymbol{x}_{i_1},\ldots,\boldsymbol{x}_{i_U})$ estimated via BHLR asymptotically recovers the underlying true conditional expectation of the tuple's hyperlink weight, as the number of data vectors $n$ goes to infinity. Our Theorem 1 assumes that the similarity function $\mu_{\boldsymbol{\theta}}$ is correctly specified, i.e., $\exists \boldsymbol{\theta} \in \boldsymbol{\Theta}$ such that $\mu_{\boldsymbol{\theta}_*}=\mu_*$, but it is free from specifying the weight distribution.
• (P-2) computational tractability. Due to the non-negligible significant computational complexity for dealing with $O(n^U)$ hyperlink weights appeared in hyper-relational learning, we employ stochastic optimization algorithms using a novel generalized mini-batch sampling procedure for hyper-relations. The proposed procedure is a hyper-relational extension ($U \ge 2$) of negative-sampling used in skip-gram (Mikolov et al., 2013), that is often used for graph embedding $(U=2)$. Our numerical experiments empirically demonstrate that BHLR is efficiently computed by the stochastic optimization, and Theorem 2 also provides a theoretical guarantee for the entire optimization procedure, in the sense that the (non-stochastic) gradient of an objective function, evaluated at each step in the stochastic optimization, approaches $\boldsymbol{0}$ in probability as the number of iterations goes to infinity.