Thursday, 29 December 2016
CNN: deep symmetries
We annotate "Understanding deep convolutional networks", Mallat (Ma16).
Ma16 is an essential generalization of the approach to pattern recognition that Mallat has developed over the past 10 years, based on learning a shallow invariance (2 levels) obtained by composing elementary groups: translations, rotations, deformations.
Ma16 proposes a link between CNNs and semidirect products of symmetry groups.
‘The paper studies architectures as opposed to computational learning of network weights, which is an outstanding optimization issue’
def 1 (§5) : layer \( j \) of a CNN represents the signal \( x \) as \( x_j (u,k_j) \), where \( u \) is the translation variable and \( k_j \) is the channel index.
The linear operator \( W_j \) and the pointwise non-linearity \( \rho \) are linked by the defining relation :
$$ x_j=\rho W_j x_{j-1} $$
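As a concrete reading of def 1, here is a minimal NumPy sketch of one layer \( x_j = \rho W_j x_{j-1} \), indexed by the translation variable \( u \) and the channel \( k_j \). The shapes, names and circular boundary handling are my own choices, not taken from Ma16.

```python
# A minimal sketch of one layer x_j = rho(W_j x_{j-1}) (hypothetical shapes and names).
import numpy as np

rng = np.random.default_rng(0)

U, K_prev, K = 32, 3, 8          # spatial size, input channels k_{j-1}, output channels k_j
S = 5                            # filter support along the translation variable u
x_prev = rng.standard_normal((U, K_prev))        # x_{j-1}(u, k_{j-1})
w = rng.standard_normal((K, K_prev, S)) * 0.1    # filters of W_j

def relu(t):                     # the pointwise non-linearity rho
    return np.maximum(t, 0.0)

def layer(x, w):
    """x_j(u, k_j) = rho( sum_{k', s} x_{j-1}(u - s, k') w_{k_j, k'}(s) ), circular in u."""
    U, K_in = x.shape
    K_out, _, S = w.shape
    out = np.zeros((U, K_out))
    for k in range(K_out):
        for kp in range(K_in):
            for s in range(S):
                out[:, k] += np.roll(x[:, kp], s) * w[k, kp, s]
    return relu(out)

x_j = layer(x_prev, w)           # shape (U, K): the new representation x_j(u, k_j)
print(x_j.shape)
```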
def 2 (§7) : \( f(x) \) is the class of \( x \). We suppose there exists \( f_j \) such that \( f_j (x_j ) = f(x) \), and
\( \forall(x,x' ),|| x_{j-1}-x'_{j-1} || \geq \epsilon \quad if \quad f(x) \neq f(x') \)
to be compared with (in §2)
\( \forall(x,x' ),|| \Phi(x)-\Phi(x')|| \geq \epsilon \quad if \quad f(x) \neq f(x') \)
i.e., \( x_j \) are features, playing the same role as \( \Phi(x) \)
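A toy way to check this separation property on a given feature map \( \Phi \) (the data, names and threshold are illustrative, not from the paper):

```python
# Sketch of the separation condition of def 2: the smallest distance between features
# of differently-labelled points should stay above epsilon (toy data, hypothetical names).
import numpy as np

def separation_margin(features, labels):
    """min ||Phi(x) - Phi(x')|| over all pairs with f(x) != f(x')."""
    diff_class = labels[:, None] != labels[None, :]
    dists = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
    return dists[diff_class].min()

rng = np.random.default_rng(1)
features = rng.standard_normal((20, 8))     # stand-ins for Phi(x) or x_j
labels = np.array([0] * 10 + [1] * 10)      # stand-ins for the classes f(x)
print(separation_margin(features, labels))  # should stay >= epsilon for a good representation
```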
def 3 (§3) symmetries : We look for invertible operators which preserve the value of \( f \). A global symmetry is an invertible and often non-linear operator \( g \) from \( \Omega \) to \( \Omega \) , such that \( f(g.x) = f(x) \) for all \( x \in \Omega \). If \(g_1\) and \(g_2\) are global symmetries then \(g_1 g_2 \) is also a global symmetry, so products define groups of symmetries. Global symmetries are usually hard to find. We shall first concentrate on local symmetries. We suppose that there is a metric \( |g|_G \) which measures the distance between \(g\in G\) and the identity. A function \(f\) is locally invariant to the action of \(G \) if
\( \forall x \in \Omega , \exists C_x > 0 ,\quad \forall g \in G \quad with \quad |g|_G < C_x , \quad f(g.x) = f(x) \)
ex : translation + diffeomorphism : \( g.x(u) = x(u - g(u)) \quad with \quad g \in C^1 (R^n ) \).
other examples are given on p. 14
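A quick sketch of the translation + diffeomorphism action \( g.x(u) = x(u - g(u)) \) on a sampled 1D signal (my own toy example, using periodic interpolation):

```python
# Sketch of the action g.x(u) = x(u - g(u)) for a small C^1 displacement g (toy example).
import numpy as np

u = np.linspace(0.0, 1.0, 256)
x = np.sin(2 * np.pi * 3 * u)                 # a signal x(u)
g = 0.02 * np.sin(2 * np.pi * u)              # a small, smooth displacement field g(u)

gx = np.interp(u - g, u, x, period=1.0)       # (g.x)(u) = x(u - g(u)), periodic interpolation
print(np.abs(gx - x).max())                   # small |g|_G -> small deformation of x
```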
def 1+2+3 : symmetry \( \bar g \in G_{j-1} \) :
$$ f_{j-1} ( \bar g . x_{j-1} ) = f_{j-1} (x_{j-1} ) $$
\( \{ \bar g . x_{j-1} \}_{ \bar g \in G_{j-1} } \) is the orbit of \( x_{j-1} \).
parallel transport : Mallat lets \( G_j \) act via the coordinates \( P_j \) :
$$ g. x_j (v) = x_j (g.v) $$
Suppose \( g \in G_j \) is defined so that the following diagram commutes
$$ \begin{array}{ccc} x_{j-1} & \rightarrow & \bar g . x_{j-1} \\ \downarrow & & \downarrow \\ \rho W_j x_{j-1} & \rightarrow & g.[ \rho W_j x_{j-1} ] = \rho W_j [\bar g . x_{j-1}] \end{array} $$
Then \( g.x_j = g.[\rho W_j x_{j-1}] = \rho W_j [\bar g . x_{j-1} ] \)
but \( ||\rho W_j x_{j-1} -\rho W_j \bar g. x_{j-1} || < \epsilon \) since \( f_{j-1} (x_{j-1} ) = f_{j-1} ( \bar g .x_{j-1} ) \)
then \( || x_j-g.x_j || < \epsilon \)
so \( f_j (x_j )= f_j (g .x_j ) \)
we will see this forced commutation applied in prop 1
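For the translation group, this commutation can be checked numerically: shifting \( x_{j-1} \) and then applying \( \rho W_j \) gives the same result as applying \( \rho W_j \) and then shifting, because both the convolution and the pointwise \( \rho \) commute with translations (toy sketch, my own names):

```python
# Numerical check of the commutation diagram for the translation group (toy example):
# applying a shift g-bar to x_{j-1} and then rho W_j equals applying rho W_j and then the shift g.
import numpy as np

rng = np.random.default_rng(2)
x_prev = rng.standard_normal(64)
w = rng.standard_normal(7)

def rho_W(x, w):                              # circular convolution followed by ReLU
    return np.maximum(sum(np.roll(x, s) * w[s] for s in range(len(w))), 0.0)

shift = lambda x, t: np.roll(x, t)            # the group action (translation by t)

left  = shift(rho_W(x_prev, w), 5)            # g.[rho W_j x_{j-1}]
right = rho_W(shift(x_prev, 5), w)            # rho W_j [g-bar . x_{j-1}]
print(np.allclose(left, right))               # True: the diagram commutes for translations
```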
the same logic is used in « Learning stable group invariant representations with convolutional networks », Bruna, §3.3 :
$$ \begin{array}{r c l}
z^{n+1} (u,\lambda_1,\lambda_2 ) & = & (z^n (u,\cdot) \star \psi_{\lambda_2 })(\lambda_1) \\
&=& \int z^n (u,\lambda_1 - \lambda'_1) \psi_{\lambda_2} (\lambda'_1) d \lambda'_1 \\
g.z^{n+1} (u,\lambda_1,\lambda_2 ) &=& \int g.z^n (u,\lambda_1 - \lambda'_1) \psi_{\lambda_2} (\lambda'_1) d \lambda'_1 \\
&=& \int z^n (f(g,u),\lambda_1 - \lambda'_1 + \eta (g) ) \psi_{\lambda_2} (\lambda'_1) d \lambda'_1 \\
& = & z^{n+1} (f(g,u),\lambda_1+ \eta(g),\lambda_2 )
\end{array} $$
The new coordinates \( \lambda_2 \) are thus unaffected by the action of \( G \). As a consequence, this property enables a systematic procedure to generate invariance to groups of the form \( G = G_1 \rtimes G_2 \rtimes ...\rtimes G_s \), where \( H_1 \rtimes H_2 \) is the semidirect product of groups. In this decomposition, each factor \(G_i\) is associated with a range of convolutional layers, along the coordinates where the action of \( G_i \) is perceived.
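A toy numerical check of this mechanism: a circular convolution of \( z^n(u,\cdot) \) along \( \lambda_1 \) turns a shift of \( \lambda_1 \) by \( \eta(g) \) in the input into the same shift of \( \lambda_1 \) in \( z^{n+1} \), leaving \( \lambda_2 \) untouched (names and shapes are mine, not the paper's code):

```python
# Toy check of the Bruna-style computation above, with a circular convolution along lambda_1.
import numpy as np

rng = np.random.default_rng(3)
U, L1, L2 = 4, 16, 6
z = rng.standard_normal((U, L1))              # z^n(u, lambda_1)
psi = rng.standard_normal((L2, L1))           # filters psi_{lambda_2}(lambda_1)

def next_layer(z, psi):
    """z^{n+1}(u, lambda_1, lambda_2) = (z^n(u, .) * psi_{lambda_2})(lambda_1), circular in lambda_1."""
    out = np.zeros((z.shape[0], z.shape[1], psi.shape[0]))
    for l2 in range(psi.shape[0]):
        for lp in range(z.shape[1]):
            out[:, :, l2] += np.roll(z, lp, axis=1) * psi[l2, lp]
    return out

eta = 3                                                 # eta(g): the shift of lambda_1 induced by g
left  = next_layer(np.roll(z, eta, axis=1), psi)        # act on z^n, then convolve
right = np.roll(next_layer(z, psi), eta, axis=1)        # convolve, then shift lambda_1 of z^{n+1}
print(np.allclose(left, right))                         # True: lambda_2 is unaffected by the action of G
```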
to be compared with (in "convolution", Wikipedia) : Suppose that S is a linear operator acting on functions which commutes with translations : \( S( \tau_x f) = \tau_x (Sf) \quad \forall x\). Then S is given as convolution with a function (or distribution) \( g_S \); that is \(Sf = g_S \star f\). Thus any translation invariant operation can be represented as a convolution.
in our case, we want the operator \( \rho W_j \) to commute with \( G_j \), so that we can write it as a convolution on \( G_j \) (with \( u - v \) replaced by \( g^{-1} v \))
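A small illustration of the Wikipedia statement: a circulant matrix (i.e. a convolution with \( g_S \)) commutes with the shift operator, while a generic linear operator does not (toy check, my own construction):

```python
# An operator that commutes with translations acts as a convolution: here a circulant
# matrix commutes with the shift, while a generic matrix does not (toy numerical check).
import numpy as np

rng = np.random.default_rng(4)
n = 16
g_S = rng.standard_normal(n)
S_conv = np.array([np.roll(g_S, j) for j in range(n)]).T   # circulant matrix: S f = g_S * f
S_any = rng.standard_normal((n, n))                        # a generic linear operator

f = rng.standard_normal(n)
tau = lambda v: np.roll(v, 1)                              # translation by one sample

print(np.allclose(S_conv @ tau(f), tau(S_conv @ f)))       # True
print(np.allclose(S_any @ tau(f), tau(S_any @ f)))         # (almost surely) False
```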
Sifre (thesis) : ‘A major difference between the translation scattering and convolutional neural network as defined in (2.98) is that in (2.98), every output depth \( p_m \) is connected to every input depth \( p_{m-1} \). On the contrary, a scattering path \( p_{m}=(\theta_1,j_1,\ldots,\theta_{m},j_{m}) \) is connected to only one previous path, its ancestor \( p_{m-1}=(\theta_1,j_1,\ldots,\theta_{m-1},j_{m-1}) \). This implies that the translation invariance is built independently for different path, which can lead to information loss, as we shall explain in Section 4.2.’
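A toy contrast of the two connectivity patterns Sifre describes: a CNN-style layer where every output depth mixes all input depths, versus a scattering-style path that extends a single ancestor (the shapes, the `ancestor` map and the modulus non-linearity are my own illustrative choices, not the thesis code):

```python
# Connectivity contrast: full depth mixing (CNN) vs single-ancestor paths (scattering).
import numpy as np

rng = np.random.default_rng(5)
U, P_prev, P = 32, 4, 8
x_prev = rng.standard_normal((U, P_prev))             # x_{m-1}(u, p_{m-1})

# CNN-style: every output depth p_m sums over all input depths p_{m-1}
w_full = rng.standard_normal((P, P_prev))
x_cnn = np.maximum(x_prev @ w_full.T, 0.0)            # each p_m sees every p_{m-1}

# Scattering-style: each output path extends exactly one ancestor path
ancestor = rng.integers(0, P_prev, size=P)            # p_m -> its single ancestor p_{m-1}
w_path = rng.standard_normal(P)
x_scat = np.abs(x_prev[:, ancestor] * w_path)         # each p_m depends on one p_{m-1} only
print(x_cnn.shape, x_scat.shape)
```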
The essential result is Proposition 1, which shows that ‘hierarchical embedding implies that each \( W_j \) is a convolution on \( G_{j-1} \)’.
with (5) we get \( x_j = \rho W_j x_{j-1}\), that is, for \( v \in P_j \) :
$$ x_j (v) = \rho ( \sum_{v' \in P_{j-1} } x_{j-1} (v') w_{j,v} (v') ) $$
the idea is to parametrize \( v = \bar g. b, \quad b\in P_j / G_j , \quad \bar g \in G_j \) : we get a ‘paving’, or rather a ‘fibering’, of \( P_j \) along the orbits of \( G_j \), cf. Figure 4.
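To make this parametrization concrete in the simplest case where the group acts by translations and there is a single base point \( b \): transporting one filter along the orbit, \( w_{j,v}(v') = w_b(v'-v) \), makes the weight matrix circulant, i.e. \( W_j \) acts as a convolution on the group, which is the content of Proposition 1 (toy sketch, my own names):

```python
# Weight sharing along an orbit turns the generic layer of (the equation above) into a convolution.
import numpy as np

rng = np.random.default_rng(6)
n = 16
w_b = rng.standard_normal(n)                       # the single filter attached to the base point b
x_prev = rng.standard_normal(n)                    # x_{j-1}(v'), v' indexed by the group itself

# generic form: x_j(v) = rho( sum_{v'} x_{j-1}(v') w_{j,v}(v') ), with w_{j,v}(v') = w_b(v' - v)
W = np.array([np.roll(w_b, v) for v in range(n)])  # row v is the transported filter
x_j = np.maximum(W @ x_prev, 0.0)

# the same computation written as a (cross-)correlation on the translation group, via the FFT
corr = np.real(np.fft.ifft(np.conj(np.fft.fft(w_b)) * np.fft.fft(x_prev)))
print(np.allclose(x_j, np.maximum(corr, 0.0)))     # True: sharing the filter makes W_j a convolution
```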
if we postulate that \( G_j = G_{j-1} \rtimes H_j,\quad \bar g=(g,h) \), then the commutation (9) becomes, with \( h=e_{H_j} \) :
$$ \begin{array}{r c l}
g.x_j (v) &=& \rho ( \sum_{v' \in P_{j-1} } g.x_{j-1} (v') w_{j,b} (v') ) \\
&=& \rho ( \sum_{v' \in P_{j-1} } x_{j-1} (g.v') w_{j,b} (v') ) \\
&=& \rho ( \sum_{v' \in P_{j-1} } x_{j-1} (v') w_{j,b} (g^{-1} v') )
\end{array} $$
using \( g.x_{j-1} (v)=x_{j-1} (g.v) \) and a change of variable.
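Spelling out that change of variable in the last line (my own intermediate step, assuming the sum runs over a \( G_{j-1} \)-invariant index set):

$$ \sum_{v' \in P_{j-1}} x_{j-1}(g.v')\, w_{j,b}(v') = \sum_{v'' \in P_{j-1}} x_{j-1}(v'')\, w_{j,b}(g^{-1} v''), \qquad v'' = g.v' $$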
Mallat then writes : \( w_{j,\bar g . b} (v') = w_{j,(g,h).b} (v' ) = h.w_{j,h.b} (g^{-1} v' ) \)
which seems erroneous (Mallat in fact has \( w_{j,(g,l).b} (v' ) \)). Should we read \( h.w_{j,b} (g^{-1} v' ) \)? To me, \( w_{j,h.b} (g^{-1} v' ) \) is just a hypothesis.
the essential idea is, as in (9), the decoupling of levels \( j-1 \) and \( j \)
formally : \( \lambda_1 \leftrightarrow g, \quad \lambda_2 \leftrightarrow h.b \)
"The filters \( w_{j,h.b} \)can be optimized so that variations of \( x_j (g,h,b) \) along \( h \) captures a large variance of \( x_{j-1} \) within each class. Indeed, this variance is then reduced by the next \( \rho W_{j+1} \). The generators of \( H_j \) can be interpreted as principal symmetry generators, by analogy with the principal directions of a PCA"