Friday, 30 December 2016

learning as categorification IV


A paradigm of decision under uncertainty is the hierarchical / compositional approach, 'an idea that pervades almost all attempts to manage complexity', dixit Russell p. 426.
This paradigm is a priori distinct from (and encompasses) DL (deep learning) understood as a compression paradigm, CF "compositional learning" 4.c, to which we shall return.

We formally see a hierarchy as a tower of symmetries, generally 'orthogonal'.

We can therefore say that the standard category of learning is Grph, the category of graphs, CF Spivak.

Statistical learning (Stat) often treats features as given. Even in DL, the initial processing of the data (natural language ...) or the structure of the network (CNN ...) encodes the designer's priors.
We can distinguish three levels of learning, from the most idealized to the most realistic
a. Features are given
b. Features to be created
c. (Tower of) symmetries to discover
Real learning ranges from c to a.

We clearly distinguish here the notion of decision from the fitting problem of statistical learning: the compositional paradigm does not require the notion of fit.

Consider for example a family of simple models \( M( \lambda ^{\nu} ,\alpha ^ {\mu} ) \), where \( \lambda ^{\nu}, \nu < \nu_0 \) are scalars, and \( \alpha^{\mu} ,\mu < \mu_0 \) are ‘features’.
Suppose that a person 𝔭 has at his disposal a model \( M \), and is led, in different contexts, to use this model. We can distinguish 3 heuristics
𝔭 chooses a \( \lambda \) for each \( \alpha^\mu \) : \( \lambda^{f_\mu } \), then chooses among the \( \lambda^{f_{\mu} } * \alpha^{\mu} \)
𝔭 chooses an \( \alpha^{\mu_0} \), then chooses a \( \lambda^{\nu} * \alpha^{\mu_0} \)
𝔭 chooses a \( \lambda^{\nu}*\alpha^{\mu} \)
Which correspond respectively to the following three graphs \( \{ a, b, c \} \) :
\begin{array}{r c l}
\lambda & \rightarrow & \alpha \\
\alpha & \rightarrow & \lambda \\
\alpha & \simeq & \lambda \\
\end{array}
We can therefore consider the three corresponding models as objects of Grph, the category of graphs. In other words, we have 3 objects \( \{ a, b, c \} \) in Grph.
In Grph,  \( a \rightarrow b \) is a possible morphism: it is a kind of 'duality'.
There is, on the other hand, no morphism \( a \rightarrow c \) or \( b \rightarrow c \) in Grph.
Let us denote by \( C ^ 2 \) this category with three objects.
Note that if we worked in PrO, the category of preorders, we would have this 'duality' only as a functor between \( C ^ 2 \) and \( C^{2\,op} \) :  \( a \rightarrow b \) is not a morphism in PrO.
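As a minimal illustration (a sketch only: the encoding of each model as a directed graph on the two vertices λ and α is my own, hypothetical choice), the three objects and the edge-reversal 'duality' relating \( a \) and \( b \) can be written as:

```python
# Minimal sketch: the three models a, b, c as directed graphs on the
# vertices {"lambda", "alpha"}.  The encoding is illustrative only.

graphs = {
    "a": {("lambda", "alpha")},                       # lambda -> alpha
    "b": {("alpha", "lambda")},                       # alpha  -> lambda
    "c": {("lambda", "alpha"), ("alpha", "lambda")},  # alpha ~ lambda
}

def reverse(edges):
    """Edge reversal: the 'duality' sending a graph to its opposite."""
    return {(v, u) for (u, v) in edges}

assert reverse(graphs["a"]) == graphs["b"]   # a and b are dual to each other
assert reverse(graphs["c"]) == graphs["c"]   # c is self-dual
```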

Suppose now that 𝔭 gives itself more latitude in the choice of complexity, in the precise sense of the choice of \( (\nu_0, \mu_0) \): larger or smaller. I.e., if we symbolize this choice by an element \( \partial_{xy} \) of the group \( \partial \) of translations of \( Z^2 \), acting by \( (x, y) * (\nu_0, \mu_0) \mapsto (\nu_0 + x, \mu_0 + y) \), then 𝔭 has the possibility, for example, of increasing the number of \( \lambda \) : \( \partial_{20} ( \{\lambda_1, \lambda_2, \lambda_3\}, \alpha^{\mu} ) = ( \{\lambda_1, \lambda_2, \lambda_3, \lambda_4, \lambda_5\}, \alpha^{\mu}) \).
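A toy sketch of this action (my own hypothetical encoding: a model family is reduced to its pair of sizes \( (\nu_0, \mu_0) \), and \( \partial_{xy} \) translates that pair in \( Z^2 \)):

```python
# Toy sketch: a model family reduced to (nu0, mu0) -- the number of scalars
# lambda and the number of features alpha -- with the translation group of
# Z^2 acting on this pair.  Names are illustrative only.

def partial_op(x, y):
    """The element ∂_{xy}, acting by (nu0, mu0) -> (nu0 + x, mu0 + y)."""
    def act(nu0, mu0):
        return (nu0 + x, mu0 + y)
    return act

# ∂_{20}: keep the features, allow two more scalars lambda.
print(partial_op(2, 0)(3, 1))   # ({lambda_1..lambda_3}, alpha^mu) -> (5, 1)
```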
But 𝔭 must above all order its triplet \( (\lambda, \alpha, \partial) \).
We can have:
\begin{array}{r c l c l}
\partial & \leftarrow & \lambda & \leftarrow & \alpha, \quad \text{equivalent to} \quad \partial \lambda \leftarrow \alpha \\
\lambda & \leftarrow & \partial & \leftarrow & \alpha, \quad \text{equivalent to} \quad \lambda \leftarrow \partial \alpha \\
\partial & \leftarrow & \lambda & \leftarrow & \partial \leftarrow \alpha, \quad \text{equivalent to} \quad \partial \lambda \leftarrow \partial \alpha \\
\partial & \leftarrow & \alpha & \simeq & \lambda, \quad \text{equivalent to} \quad \partial \leftarrow \lambda \alpha \\
\end{array}
We have thus defined a category \( C ^ 3 \), still in Grph.

Each order corresponds to an 'environment symmetry', CF 'learning fallacy' and SGII:
\( \lambda \) dominance : scaling dominance
\( \alpha \) dominance : feature dominance
\( \partial \) dominance : complexity dominance
The general economics of decision theory is therefore obviously not the bias-variance tradeoff or penalization, but the hierarchization of the different symmetries of the field studied.
We can represent the passage from one domain to another via a morphism within Grph.
In the spirit of "μεταφορά", relating these models can help us discover the right symmetries of our domain.

Thursday, 29 December 2016

CNN : deep symmetries



We annotate "understanding deep convolutional networks", Mallat (Ma16)
Ma16 represents an essential generalization of the Mallat approach to pattern recognition over the past 10 years, which was based on learning a shallow invariance (2 levels) obtained by composing elementary groups: translations, rotations, deformations.
Ma16 proposes a link between CNN and semi-direct product of groups of symmetries.

‘The paper studies architectures as opposed to computational learning of network weights, which is an outstanding optimization issue’

def 1 (§5) : layer \( j \) of a CNN represents the signal \( x \) as \( x_j (u,k_j) \), where \( u\) is the translation variable and \( k_j \) is the channel index.
The linear operator \( W_j \) and the pointwise non-linearity \( \rho \) are linked by the defining relation :
$$ x_j=\rho W_j x_{j-1} $$

def 2 (§7) : \( f(x) \) is the class of \( x \). We suppose there exists \( f_j \) such that \( f_j (x_j ) = f(x) \), and
\( \forall(x,x' ),|| x_{j-1}-x'_{j-1} || \geq \epsilon  \quad if \quad f(x) \neq f(x') \)
to be compared with (in §2)
\( \forall(x,x' ),|| \Phi(x)-\Phi(x')|| \geq \epsilon \quad  if \quad f(x) \neq f(x') \)
i.e.,  \( x_j \) are features, playing the same role as \( \Phi(x) \)

def 3 (§3) symmetries : We look for invertible operators which preserve the value of \( f \). A global symmetry is an invertible and often non-linear operator \( g \) from \( \Omega \) to \( \Omega \) , such that \( f(g.x) = f(x) \) for all \( x \in \Omega \). If \(g_1\) and \(g_2\) are global symmetries then \(g_1 g_2 \) is also a global symmetry, so products define groups of symmetries. Global symmetries are usually hard to find. We shall first concentrate on local symmetries. We suppose that there is a metric \( |g|_G \) which measures the distance between \(g\in G\) and the identity. A function \(f\) is locally invariant to the action of \(G \) if
\( \forall x \in  \Omega , \exists C_x  > 0 ,\quad \forall g \in G \quad with \quad |g|_G  < C_x  , \quad f(g.x) = f(x) \)
ex : translation + diffeomorphism : \( g.x(u) = x(u - g(u)) \quad with \quad g \in C^1  (R^n  ) \).
other examples p14

def 1+2+3 : symmetry \( \bar g \in G_{j-1} \) :
$$ f_{j-1} ( \bar g . x_{j-1} ) = f_{j-1} (x_{j-1} ) $$
\( \{ \bar g . x_{j-1} \}_{ \bar g \in G_j } \) is the orbit of \( x_{j-1} \).
parallel transport : Mallat makes \( G_j \) act via the coordinates \(P_j\) :
$$ g. x_j (v) = x_j (g.v) $$
Suppose \( g \in G_j \) is defined so that the following diagram commutes:
\begin{array}{ccc} x_{j-1} & \rightarrow & \bar g x_{j-1} \\ \downarrow & & \downarrow \\ \rho W_j x_{j-1} & \rightarrow &  g.[ \rho W_j x_{j-1} ] = \rho W_j [\bar g . x_{j-1}] \end{array}

Then \( g.x_j = g.[\rho W_j x_{j-1}] = \rho W_j [\bar g . x_{j-1} ] \)
but \( ||\rho W_j x_{j-1} -\rho W_j \bar g. x_{j-1} || < \epsilon \) since \( f_{j-1} (x_{j-1} ) = f_{j-1}  ( \bar g .x_{j-1} ) \)
then \( || x_j-g.x_j || < \epsilon \)
so \( f_j (x_j )= f_j (g .x_j ) \)
we will see this forced commutation applied in prop 1
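A numerical sanity check of this commutation (a sketch under simplifying assumptions of mine: one 1-D channel, a circular convolution for \( W_j \), a ReLU for \( \rho \), and \( \bar g = g \) a cyclic shift):

```python
import numpy as np

# Sketch: with W_j a circular convolution and rho a pointwise non-linearity,
# the layer rho W_j commutes with translations; this is the commuting square
# g.[rho W_j x_{j-1}] = rho W_j [g . x_{j-1}] in the 1-D cyclic case.

rng = np.random.default_rng(0)
x = rng.normal(size=32)           # x_{j-1}: a 1-D signal
w = rng.normal(size=5)            # filter defining W_j
rho = lambda z: np.maximum(z, 0)  # pointwise non-linearity (ReLU)

def W(signal):                    # circular convolution with w
    n = len(signal)
    return np.array([sum(w[k] * signal[(i - k) % n] for k in range(len(w)))
                     for i in range(n)])

g = lambda signal: np.roll(signal, 3)   # the translation: cyclic shift by 3

lhs = g(rho(W(x)))                # g . [rho W_j x_{j-1}]
rhs = rho(W(g(x)))                # rho W_j [g . x_{j-1}]
assert np.allclose(lhs, rhs)      # the diagram commutes
```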
the same logic is used in « learning stable group invariant representations with conv net », Bruna, §3.3 :
 
\begin{array}{r c l}
z^{n+1} (u,\lambda_1,\lambda_2 ) & = & (z^n (u,\cdot) \star \psi_{\lambda_2 })(\lambda_1)  \\
 &=& \int z^n (u,\lambda_1 - \lambda'_1)\, \psi_{\lambda_2} (\lambda'_1)\, d \lambda'_1 \\
 g.z^{n+1} (u,\lambda_1,\lambda_2 ) &=& \int g.z^n (u,\lambda_1 - \lambda'_1)\, \psi_{\lambda_2} (\lambda'_1)\, d \lambda'_1 \\
 &=& \int z^n (f(g,u),\lambda_1 - \lambda'_1 + \eta (g) )\, \psi_{\lambda_2} (\lambda'_1)\, d \lambda'_1 \\
& = & z^{n+1} (f(g,u),\lambda_1+ \eta(g),\lambda_2 ) \\
\end{array}

The new coordinates \( \lambda_2 \) are thus unaffected by the action of \( G \). As a consequence, this property enables a systematic procedure to generate invariance to groups of the form \( G = G_1 \rtimes G_2 \rtimes ...\rtimes G_s \), where \( H_1  \rtimes H_2 \) is the semidirect product of groups. In this decomposition, each factor \(G_i\) is associated with a range of convolutional layers, along the coordinates where the action of \( G_i \) is perceived.
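To fix ideas on the semidirect product, here is a sketch with the simplest relevant instance (my choice of example, not Bruna's): the roto-translation group \( R^2 \rtimes SO(2) \) acting on plane coordinates.

```python
import numpy as np

# Sketch of the semidirect product R^2 ⋊ SO(2): an element is (t, theta),
# acting on a plane coordinate u by u -> R(theta) u + t.  The group law
# below is what makes the product 'semidirect': rotations act on translations.

def R(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def act(g, u):
    t, theta = g
    return R(theta) @ u + t

def compose(g1, g2):
    t1, th1 = g1
    t2, th2 = g2
    return (t1 + R(th1) @ t2, th1 + th2)   # (t1, th1) . (t2, th2)

g1 = (np.array([1.0, 0.0]), np.pi / 3)
g2 = (np.array([0.0, 2.0]), np.pi / 6)
u = np.array([0.5, -1.0])

# The group law is exactly what makes the action compatible with composition:
assert np.allclose(act(compose(g1, g2), u), act(g1, act(g2, u)))
```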

to be compared with (in "convolution", Wikipedia) : Suppose that S is a linear operator acting on functions which commutes with translations : \( S( \tau_x f) = \tau_x (Sf) \quad \forall x\). Then S is given as convolution with a function (or distribution) \( g_S \); that is \(Sf = g_S \star f\). Thus any translation invariant operation can be represented as a convolution.
in our case, we want the operator \( \rho W_j \) to commute with \(G_j\), so that we can write it as a convolution on \(G_j\) (the translation variable \( u - v \) being replaced by \( g^{-1} v \))
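A numerical illustration of this classical fact (a sketch in the finite cyclic case, with a circulant \( S \) built by hand and \( g_S \) recovered as the response to a discrete delta):

```python
import numpy as np

# Sketch of the statement in the finite cyclic case: if S commutes with
# (circular) translations, then S f = g_S * f with g_S = S(delta).

rng = np.random.default_rng(1)
n = 16
h = rng.normal(size=n)

# Build S as a circulant matrix, S[i, j] = h[(i - j) % n]; such an S
# commutes with cyclic shifts by construction.
S = np.array([np.roll(h, i) for i in range(n)]).T

delta = np.zeros(n); delta[0] = 1.0
g_S = S @ delta                        # impulse response of S

def circ_conv(a, b):
    n = len(a)
    return np.array([sum(a[k] * b[(i - k) % n] for k in range(n)) for i in range(n)])

f = rng.normal(size=n)
assert np.allclose(S @ f, circ_conv(g_S, f))              # S f = g_S * f
assert np.allclose(S @ np.roll(f, 4), np.roll(S @ f, 4))  # S commutes with shifts
```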

Sifre (thesis) : ‘A major difference between the translation scattering and convolutional neural network as defined in (2.98) is that in (2.98), every output depth \( p_m \) is connected to every input depth \( p_{m-1} \). On the contrary, a scattering path \( p_{m}=(\theta_1,j_1,\dots,\theta_{m},j_{m}) \) is connected to only one previous path, its ancestor \( p_{m-1}=(\theta_1,j_1,\dots,\theta_{m-1},j_{m-1}) \). This implies that the translation invariance is built independently for different paths, which can lead to information loss, as we shall explain in Section 4.2.’

The essential result is Proposition 1, which shows that ‘hierarchical embedding implies that each \( W_j \) is a convolution on \( G'_{j-1} \)’.
With (5) we get \( x_j = \rho W_j x_{j-1}\), i.e., if \( v \in P_j \)  :
$$ x_j (v) = \rho ( \sum_{v' \in P_{j-1} } x_{j-1} (v') w_{j,v} (v') ) $$
the idea is to parametrize \( v = \bar g. b,\ b\in P_j / G_j  , \quad  \bar g \in G_j \)  : we get a ‘paving’, or rather a ‘fibration’, of \( P_j \) along the orbits of \( G_j \), CF Figure 4.
If we postulate that \( G_j = G_{j-1} \rtimes H_j,\quad \bar g=(g,h) \), then the commutation (9) becomes, with \( h=e_{H_j } \):
\begin{array}{r c l}
g.x_j (v) &=& \rho ( \sum_{v' \in P_{j-1} } g.x_{j-1} (v') w_{j,b} (v') )  \\
  &=& \rho ( \sum_{v' \in P_{j-1} } x_{j-1} (g.v') w_{j,b} (v') )  \\
  &=& \rho ( \sum_{v' \in P_{j-1} } x_{j-1} (v') w_{j,b} (g^{-1} v') )  \\
  \end{array}
using \( g.x_{j-1} (v)=x_{j-1} (g.v) \) and a change of variable.
Then Mallat writes : \( w_{j,\bar b} (v') = w_{j,(g,h).b} (v' )=h.w_{j,h.b} (g^{-1} v' ) \)
which seems erroneous (Mallat in fact has \( w_{j,(g,l).b} (v' ) \)): should we read \( h.w_{j,b} (g^{-1} v' ) \)? For me \( w_{j,h.b} (g^{-1} v' ) \) is just a hypothesis.
The essential idea is as in (9): decoupling levels \( j-1 \) and \( j \).
Formally, \( \lambda _1 \leftrightarrow g, \lambda _2 \leftrightarrow h.b \)
"The filters  \( w_{j,h.b} \)can be optimized so that variations of \( x_j (g,h,b) \) along \( h \) captures a large variance of \( x_{j-1} \) within each class. Indeed, this variance is then reduced by the next \( \rho W_{j+1} \). The generators of \( H_j \) can be interpreted as principal symmetry generators, by analogy with the principal directions of a PCA"

Wednesday, 28 December 2016

learning as categorification III


1. Lin & Tegmark (LT): "why does deep and cheap learning work so well? "
a. A categorical approach: p7 Fig 3, p12 Table I
b. Proposes a duality cheap / deep
i. Cheap: 'simple polynomials which are sparse, symmetric and / or low-order play a special role in physics' + II.D: low polynomial order, locality, symmetry
ii. Deep: 'One of the most striking features of the physical world is its hierarchical structure.'
c. Is interested in some papers centered on: RG + deep linear nn

2. Symptomatically, the examples given do not correspond to self-learning-type DL, but to concatenation / composition of 'symmetries' in the sparse sense, CF remark 3

3. We have a category of sparse graphs SpGr, categories Phys (physics) and ImCl (image classification), and functors

Phys → SpGr
ImCl → SpGr

4. Remark 1: the old Kant question: is this 'special role' subjective or objective?
Do we really discover low dimensional symmetries or do we discover what we can discover?
a. Fundamental Ockham / generalization bias: CF "against Vapnik"
b. Computational stress (CF "μεταφορά ", 2): the 'vicious' circle of learning:
New symmetry → more data → new symmetry → ...

5. Remark 2: How is symmetry learned?
a. Laborans: over time [not on a particular dataset]
b. Various fields bring to light large classes of symmetry: fundamental physics, algebraic geometry, biology (cf PPI), AI, information systems, cognition, engineering, ... CF "reading Building Machines That Learn and Think Like People" (RB)

6. Remark 3: the heuristics-symmetries point of view:
a. To learn is to build a catalog à la Polya (CF RB) of good heuristics, that is to say good symmetries.
b. Distributivity [Bengio] ↔ sparsity [Bach] ↔ heuristic / symmetries
c. Deep is not a mysterious second / 'dual' dimension of learning: it is just another symmetry: recursivity / sequentiality
d. There is an equivalence between sequential learning and hierarchical learning, via a 'rotation' time ↔ space (depth)
e. In fine, the question is to see the notion of symmetry as much more general than its classical incarnations (groups, CF "SGII") or 'reductive' ones (distributivity / sparsity): category theory seems an interesting attempt in this direction. See also the heuristics towers in RB

μεταφορά III

1. Bias towards difference... where innovation rather means learning through comparison / differentiation (online learning of classification, cf Brown)

2. Remind μεταφορά :

Analogy ↔ functor

Now, on this point Cat does not help : analogy is your guide, but this operation is anything but automatic ...
[Reminder: initially the first stake of Cat consists in natural transformations ...]
CF Spivak's remarks in (ProMat) http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0023911
"It is important to note that ologs can be constructed on modeling and simulation, experimental studies, or theoretical considerations that essentially result in the understanding necessary to formulate the olog. This has been done for the proteins considered here on the basis of the results from earlier work which provided sufficient information to arrive at the formulation of the problem as shown in Figure 3"

3. In ProMat Spivak emphasizes hierarchical and functional aspects.
(Functors) such as Cat ~ Sch (CF Spivak 5.4), or of the type Phys → SpGr, CF "learning as categorification III", or of the type of those in "learning as categorification", or word2vec (words → linear spaces).

4. I nevertheless suspect that the most useful / deep functors are symmetries, in the sense that:
Symmetry ↔ structure

Structure taken in its mathematical meaning. These are rather few ...: linear spaces, groups ... the difficulty is to see one of these structures in the domain studied. Most of the time this is not obvious.


5. Enforcing comparison thus remains the objective:
a. ProMat
b. http://web.mit.edu/mbuehler/www/papers/BioNanoScience_2011_3.pdf
c. "The term 'log' (like a scientist's log book) alludes to the fact that such a study is never really complete, and that a study is only as valuable as it is connected in the network of human understanding. In this paper, we present the results of this study." Spivak
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0023911

But how to do it precisely is anything but obvious ...

6. An emerging paradigm greatly pushes the traditional boundaries of AI:
DeepMind: "I would like to see a science where an AI would be a research assistant doing all the tedious work of finding interesting articles, identifying a structure in a vast amount of data to get back to the human experts and scientists who could move faster "
This paradigm fits precisely in the vision of Spivak in 5.c. Note the presence of the word 'structure'. My best guess would be to take the term in its strong sense: mathematical (and not the sense in which DeepMind may hear it, its 'statistical' meaning: pattern)


7. In an inescapable race for abstraction, CF 4 in "learning as categorification III", one can see that AI, after having long been engaged in recognizing material forms, might seek to (learn to) recognize abstract forms: structures / categories. It is thus necessary to understand the evolution which leads from linear regression (linear spaces), to trees, then to NN, then to DNN, then to combinations of DNN (CF "SPEECH RECOGNITION WITH DEEP RECURRENT NEURAL NETWORKS", Graves)

Natural language / neuro economics

1. Universal economic constraint
a. Learning: a fable
i. Suppose an intelligence confronted with two vital tasks
1. learn
2. Decide under uncertainty
ii. It is subjected to a constraint of 'finitude': producing models has a complexity cost
iii. i + ii create a fundamental tradeoff: simplicity / generalization
b. The solution to 1.a.iii seems to have been (in our world) universally algorithmic
i. algorithm
ii. Mathematical: (logic of) categories, CF μεταφορά
iii. Natural Language (NL)
c. Here 'universal' undoubtedly has an economic logic: tradeoff tractability-expressiveness, remix of simplicity-generalization: CF SGI, II


2. NL between philosophy and AI
One finds surprisingly little trace of the economic problem that natural languages (NL) have to solve, whereas one would expect to find this concept as a preamble to any philosophy of language.
Exiting the essentialist conception of language took time ... one has to wait for Frege and Wittgenstein to leave it behind, CF Bouveresse.
Wittgenstein talks about language games both for NL and for maths. The notion of rule (convention) exhausts their 'philosophy'; an 'end point', W seems to say.
First-order logic (FOL) is 'combining the best of formal and natural languages', dixit Russell & Norvig's AI: a gratuitous and fortunately false claim, CF 'paradox of learning' or μεταφορά; but the applications of FOL are numerous, starting with our 'FOL for mining news'.

3. NL in cognition and evolution theory
a. Language represents a remarkable solution to 1.c: a machine to create models, flexible and constrained
b. The rules of use are limits to linguistic computation
c. They allow a compromise in the ability to say 'anything': between saying everything and saying nonsense
d. CF : 'Linguistic structure is an evolutionary trade-off between simplicity and expressivity', Kirby, and 'The origins of syntax in visually grounded robotic agents', Steels
e. Note that expressivity ≠ creativity (= inference calculus): Kirby's paper only deals with the first, not the second. Now the discovery of rules, more precisely structures, is what matters to us in L2L: CF SGII, μεταφορά

4. NL ~ Cat?
The equivalence NL ~ FOL is obviously false: language makes it easy to speak of sets of sets and so on. So we would rather venture the guess NL ~ Cat, especially with regard to 'creativity'
(RDF in Cat, in "Category for the sciences", Spivak, 6.2.2)

5. Neuro ~ Cat?
Beyond language, one can even wonder whether Cat could not constitute a paradigm for cognition in neuroscience

μεταφορά and ἀναλογία

1. Is maths magic?
Is it right to draw inspiration from it, as Jules Vuillemin wished when transposing it to philosophy? Or do mathematics / physics mimic a universal form of learning, of which language already offers the model, CF NL economics?
The modern mathematical point of view, that of structures (including functors and categories, CF 'on categories'), marks the triumph of algebra (CF algebraic geometry), i.e. rules / calculation algorithms. The constraints represented by these algorithms, the symmetries they encode, seem to represent a good compromise between tractability and demonstration power.
As already suggested in Symmetry Generalization I, II, in good science, tractability dominates the question of expressivity: tool conditions the explorable

2. the right point of view
As already discussed, CF 'against Vapnik' 13, learning is (the art of) finding the right point of view.
In a way, the right point of view trivializes the field studied: groups trivialize the resolution of algebraic equations, game theory trivializes most 'economic' problems, CF 'No equilibrium theorem'.
As if, as a matter of economy, calculation could not move us far away from where we start, or as if the point of arrival could hardly be more 'distant' than a 'rotation' of the point of departure. To be compared with the local / global exploration dilemma in optimization.
We find this trait throughout "récoltes et semailles" (RS); it is a well-known trademark of Grothendieck (RS p 669, Illusie https://lejournal.cnrs.fr/billets/grothendieck-and-dynamics-impressive, http://www.cnrs.fr/insmi/IMG/pdf/Alexandre-Grothendieck.pdf)
In the Category paradigm, the good point of view is that of comparison and morphisms

3. Poincaré and the analogy
« The mathematical facts worthy of being studied are those which, by their analogy with other facts, are capable of leading us to the knowledge of a mathematical law, in the same way that experimental facts lead us to physical knowledge. They are those which reveal unsuspected kinships between other facts, long known, but wrongly believed to be foreign to one another. »
CF : « L’analogie algébrique au fondement de l’analysis situs », Herreman, in ‘L'analogie dans la démarche scientifique : Perspective historique’

4. The learning engine (learning to learn L2L) is therefore the comparison
Μεταφορά: trans-port
Ἀναλογία: (according to) ratio: proportion

5. 'Comparison' and 'Analogy' are fundamental aspects of knowledge acquisition, in
'Category: An abstract setting for analogy and comparison', Brown & Porter

6. ex 1: the concept of 'symmetry', CF SGII, must be understood first in the sense of group symmetry, i.e. of a group morphism: a symmetry transports (rotates) a solid, or connects 2 positions of this solid, i.e. compares them

7. ex 2: homotopy / homology: from space to group
Here it is the comparison initiated by Poincaré / Betti between the topological spaces and the groups

8. ex 3: Galois theory: from fields (of algebraic equations) to groups
Galois theory compares (in its Dedekindian version) field extensions (arising from algebraic equations) and groups

9. The main point is perhaps less to be surprised to discover groups in topological spaces or algebraic extensions than to reaffirm Klein's point of view (and his Erlangen program) that one learns about an object when one connects it to others: when one studies its symmetries, or more generally its morphisms

10. enforcing comparison: this is the spirit of category theory

Friday, 23 December 2016

symmetry-generalization II

1. In 'Learning deep architectures for AI', Bengio revolves around the notion of symmetry without ever uttering the word!
Geometry appears 7 times, manifold 14 times
Generalization 80 times
Bengio insists heavily on the limitations of 'local' approaches (100 occurrences), and opposes to them 'distributed representations' (51 occurrences).
2. The Bach team at Cachan spent time on Vision looking for good priors. In hindsight, it missed the deep CNN
3. The ecological rationality of Gigerenzer masks the environment's symmetries
4. Mallat et al. have sought to combine sequential learning and 'groups': translations, rotations, weak deformations (diffeomorphisms), CF CNN : deep symmetries
5. Many groups are Lie groups: manifolds
6. Https://en.wikipedia.org/wiki/Symmetry_(physics)
7. Let S be a system endowed with certain articulations or degrees of freedom
Any transformation T whose effect on S is known:
T * S = S
provides strong constraints on (a model of) S
When the transformations T form a group, the map (T, S) ↦ T * S is called a group action
8. For example, since the Hamiltonian H of the electron around the hydrogen nucleus is spherically symmetric, H commutes with the 3 components of the angular momentum J on each eigenspace E of H, so that SO(3) [the group of 3D space rotations] acts on the ket 𝜓 solution of the Schrödinger equation:
 R(𝜃) H 𝜓(x) = H R(𝜃) 𝜓(x)
We obtain (more easily) the solution 𝜓 = Ylm(𝜃, 𝜙) fn(r), with l = 0, 1, ..., n−1 and m = −l, ..., l

9. An example of non-spatial / temporal symmetry in physics:
 a. Isospin (associated group: SU (2))
 b. Gauge symmetry: the local invariance constraint of the Dirac action implies the existence of the EM field and its interaction with the charged particles
10. On the other hand, the statistical instability of a relation y ~ x can be seen as an unknown transformation law (a numerical sketch follows this list):
given a set D on which it seems that y = 𝛃x,
with y and x the returns at t and t−1 of two instruments,
when a new set D' is presented, we find y = 𝛃' x.
Our law is therefore not invariant.
In reality, we lack a dimension, or variable z, which would allow us to see that
 y = e(z) x
The line y ~ x rotates along z.
In other words, a transformation of z corresponds to a transformation of the ratio y / x:
 z → z' 
 y/x → y'/x
Where does z come from ?
In the example of Stoikov, there is no coupling between activities at bid and ask, simply because the world is reduced to a single market maker, not to a market maker + insider system as in Kyle85
11. The symmetries impose strong constraints and considerably reduce the field of the possible
CF for physics Zee 'fearful symmetry' (eg p209)
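Here is the numerical sketch announced in 10 (illustrative data only: the 'true' relation is y = e(z)·x with a hidden z, and the fitted slope 𝛃 differs between D and D'):

```python
import numpy as np

# Sketch of point 10: y = e(z) * x with a hidden variable z.  Fitting y ~ beta x
# on two samples D and D' drawn at different values of z gives two different
# betas: the law y = beta x is not invariant, because a dimension (z) is missing.

rng = np.random.default_rng(2)

def sample(z, n=500):
    x = rng.normal(size=n)
    e = 1.0 + 0.8 * z                  # e(z): the slope really depends on z
    y = e * x + 0.1 * rng.normal(size=n)
    return x, y

def fit_beta(x, y):                    # least-squares slope of y ~ beta x
    return float(x @ y / (x @ x))

x1, y1 = sample(z=0.0)                 # data set D
x2, y2 = sample(z=1.0)                 # data set D'
print(fit_beta(x1, y1), fit_beta(x2, y2))   # ~1.0 vs ~1.8: beta is not invariant
```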

Symmetry-generalization conjecture

1. The bias-variance tradeoff is illusory: one always chooses bias over variance, because the 'human learner' has a natural bias towards the ability to generalize (CF Gigerenzer, Bengio)
2. learning, before being layered, is mainly sequential: CF 'AI exponential cognitive growth' etc: principle of innovation
3. Symmetry-generalization conjecture: a good learning path is such that, in step k
a. The symmetries \( S_k \) appear 'naturally'
b. And are sufficient to 'fix' the geometry of the transition k → k + 1
4. Einstein: "Everything Should Be Made as Simple as Possible, But Not Simpler"
5. Example 1: CNN for image: it is indeed the simplest natural solution under constraints of the symmetries of the problem
6. Example 2: Dirac's equation: from
a. Model simplicity (first order derivatives)
b. Physical invariance (Lorentz group)
 emerges a new geometry: a Clifford algebra
7. Example 3: machine reading: I (synonyms) → II \( x' = x_1 + x_2 \) → III \( x'' = x'_1 + x'_2 \) → IV ...
8. Example 4: Gigerenzer decision tree for classifying incoming heart attack patients
http://psy2.ucsd.edu/~mckenzie/ToddGigerenzer2000BBS.pdf
I suspect that this 'tree' actually reveals a layered logic

learning fallacy

1. The paradigm of statistical learning is the following: one declares
a. data D
b. A learner with degrees of freedom
c. We do a regularized fit via cross validation on D
2. This assumes that D is representative, but how to know that this is the case? Cross validation does not change the argument.
3. From Taleb: "Statistical regress argument (or the problem of the circularity of statistics): We need data to discover a probability distribution. How do we know if we have enough? From the probability distribution. If it is a Gaussian, then a few points of data will suffice. How do we know it is a Gaussian? From the data. So we need the data to tell us what probability distribution to assume, and we need a probability distribution to tell us how much data we need. This causes a severe regress argument, which is somewhat shamelessly circumvented by resorting to the Gaussian and its kin."
4. distinguishing between risk and uncertainty
See also Handbook of Game theory, Petyon & Zamir, p10-12 (Savage small world)
5. learning to distinguish a cat from a dog on 10^10 pictures has a better chance of succeeding, even with 10^4 parameters, than predicting a trend on 10^5 points with 10^2 parameters
6. approach 1 essentially overfits: learning is focused on one type of data, and even on a limited sample of this type of data. At best we have a perfect specialization, when the data is stable (eg: pattern recognition: DL / image). Learning can be long (DL:> 20 years ...)
In addition, 1 potentially suffers from self-justification bias (time-to-market is important), ecological maladjustment (if the law of decreasing returns is valid in the world), opportunity cost
7. the learning process itself is not the fit, but
a. The choice of the heuristic (= {learner, choice of metaparameters})
b. Beyond, the iterative learning of this exploration
8. 7.a and 7.b are today at best part of the AI program, certainly not of Machine Learning in its most common sense, which is pattern recognition (latest avatar: Deep Learning)
9. The power of generalization is the key concept. The bias / variance compromise of statistical learning theory is only an (anecdotal) modality of this concept as soon as D is not guaranteed to be representative (often the most realistic hypothesis)
10. This power of generalization can be seen as an embedding of D ⊂ D_ ⊂ D':
a. D_ area of ​​interest (financial Time Series ...)
b. D' Domain of generalization of the heuristic (RFIM? ...)
11. a heuristic is a rule of thumb acquired in the long term, in relation to an (noisy) environment, CF Gerd Gigerenzer, Henry Brighton "Homo Heuristicus: Why Biased Minds Make Better Inferences"
http://library.mpib-berlin.mpg.de/ft/gg/GG_Homo_2009.pdf
12. a heuristic tested on D comes from a meta-learning in a rather large (and not accessible) D' to guarantee its robustness.
13. the structure of the environment, its invariants, is a key element of the problem.
We can think (with Gigerenzer) that the natural human environment has favored the emergence of heuristics such as Take-the-best or Tallying (a minimal sketch of Take-the-best follows this list). "Cognitive science is increasingly stressing the senses in which the cognitive system performs remarkably well when generalizing from few observations, so much so that human performance is often as optimal" (in Gigerenzer)
The spontaneous use of these heuristics hides an essential prior: we already know that they work in the contexts where they are used.
14. In this sense, one can think of looking primarily at the 'natural' priors of the financial world
15. we can use "as if" a heuristic from D' hoping that its power of generalization fits D_
16. Anecdote (?): Harry Markowitz received the Nobel prize in economics for finding the optimal solution, the mean-variance portfolio. When he made his own retirement investments, however, he did not use his optimizing strategy, but instead relied on a simple heuristic: 1 / N, that is, allocate your money equally to each of N alternatives
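As announced in 13, here is a minimal sketch of Gigerenzer's Take-the-best heuristic (the cue names, values and their validity ordering are made up for illustration):

```python
# Minimal sketch of Take-the-best (Gigerenzer & Goldstein): to decide which
# of two objects scores higher on a criterion, go through the cues in order
# of (assumed known) validity and decide on the first cue that discriminates;
# ignore all remaining cues.  Cue values below are made up.

def take_the_best(obj_a, obj_b, cues_by_validity):
    for cue in cues_by_validity:            # cues ordered from most to least valid
        a, b = obj_a.get(cue), obj_b.get(cue)
        if a != b:                          # first discriminating cue decides
            return "A" if a > b else "B"
    return "guess"                          # no cue discriminates

cues = ["capital", "has_team", "on_motorway"]        # hypothetical binary cues
city_a = {"capital": 0, "has_team": 1, "on_motorway": 1}
city_b = {"capital": 0, "has_team": 0, "on_motorway": 1}
print(take_the_best(city_a, city_b, cues))            # -> "A", decided by 'has_team'
```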

Thursday, 22 December 2016

learning in physics

1. The concept of symmetry has become increasingly important in theoretical physics since the beginnings of electromagnetism and Relativity, with Lorentz, Poincaré, Hilbert and Einstein
The notion of least action has taken precedence over classical 'equations of motion' approaches
2. The physics of a phenomenon is summed up by the constraints of invariance or symmetry imposed on the action:
a. Ex 1 : Galilean symmetry: \( S = \int dt\, \left( \tfrac{1}{2} m (dq/dt)^2 - V(q) \right) \)
b. Ex 2 : Lorentzian invariance : \( S = \int - mc^2 \sqrt{\eta_{ls}\, dx^l dx^s} \)
c. Ex 3 : particle in a field : \( S = \int \left( - mc^2 \sqrt{\eta_{ls}\, dx^l dx^s} - e A_s dx^s \right) \)
d. Ex 4 : gauge invariance: equation of the EM field : \( S = \int d^4x\, F^{ls} F_{ls} \)
e. Ex 5 : Principle of Equivalence: Hilbert-Einstein action : \( S = \int d^4x\, R \sqrt{-g} \)
f. Ex 6 : Quantum Field Theory (QFT)…
g. Ex 7 : non-abelian gauge invariance: Yang-Mills lagrangian
h. Ex 8 : Supersymmetry…
3. A first attempt to make the Schrödinger equation Lorentz-invariant leads to the Klein-Gordon equation. The problem of the negative current density which emerges from this equation leads Dirac to try an equation of order 1, which has a solution only for a non-scalar field: on this occasion Dirac stumbles upon Clifford algebras without knowing it.
4. The theory of representations of the Lie algebra of the SO (3) group (rotation in space) is a valuable aid in the solution of the Schrödinger equation for the electron in a spherically symmetric potential.
5. The link between statistical physics and QFT arises naturally, via the formal equivalence between the partition function Z and the action S.
6. In QFT, for example for condensed matter theory (CondMat), we seek an effective representation, à la Landau-Ginzburg, guided by the reasonable symmetries of the Lagrangian.
7. In CondMat again, the Renormalization Group (RG) plays an important role in understanding the transition from one scale to another
8. Recently, it has been proposed to see a profound link between RG and Deep Learning "An exact mapping between the Variational Renormalization Group and Deep Learning", Mehta, Schwab (MS)
9. MS show that the unsupervised learning of a Deep Belief Network mimics the RG, except that the RG seeks to minimize the gap between the Free Energies of the two layers, whereas the DBN uses the Kullback-Leibler measure.
10. The perfect mapping between RG and DBN goes through the equivalence between T_λ and E, the energy or 'Hamiltonian' of the DBN.
11. The key to learning is therefore the lucky guess of E, which rests on a good understanding of its symmetries.
12. In the case of images, these symmetries are expressed very directly in the CNN.

Against Vapnik

1.       The theory of statistical risk misses the essential
2.       Paradox: on the one hand human capacities of generalization appear as the grail of machine learning, on the other hand one clings to a theory that remains, at order 0, at the foot of the inductive wall
3.       Rare are the real cases where the data constrains the model: anything fits, almost always
4.       Dreams of generalization: the human theorizes (has priors): sees symmetries
5.       Statistics is an historical fiction: CF Pascal, Taleb (mediocristan)
of no practical use for real problems
6.       Statistical reasoning is basically erroneous: at best it pretends to discover a theory actually already known, at worst it rambles (overfit)
The MAB approach, and beyond it Reinforcement Learning, is the only theoretical answer to this worm-eaten foundation.
See also Taleb’s convex heuristics, a logic of the decision
7.       Gigerenzer is one of the very few authors interested in human capacity for generalization, CF 'learning fallacy'
Do not be confused by the use of the statistical risk approach in the article: the penalty, or Occam's razor, is only one of the 2 'priors' of learning : the second being the search for symmetries.
8.       learning means theory, and more exactly a theoretical unscrewing: a tower of Representations / Theories {T(k)}
9.       The DL is an ersatz of this design
10.   T(k) encapsulates much more than the data which 'validates' it (CF, for example, the theory of gravitation and precession of the perihelion of mercury)
11.   In physics, T = L, the Lagrangian

12.   Most important is the innovation T(k)->T(k+1), based on certain symmetries that must be guessed

Wednesday, 21 December 2016

reading « Building Machines That Learn and Think Like People »

1.       "Building Machines That Learn and Think Like People", Lake (BM)
2.       By simple syntactic counting, five conceptual fields emerge in the article:
a.       Learning (420), learning-to-learn (26), cognit (218)
b.      Model: generative (41), generaliz (34), model (265), theor (51),
c.       Transfer (18), reuse (4), new (137), novel (26), re- (3)
d.      Concept: compositional (45), structur (59), abstrac (11), inductive (28), concept (92)
e.      Few data (23), fast (16), one-shot (12)
3.       Which reads: learning-to-learn is the human learning / cognitive mode, which builds cross-domain transferable models / theories, precisely thanks to abstract / conceptual approach, requiring little data to learn a new domain
4.       BM essentially gives two examples, which we identify with 2 categories:
a.       Lego (characters challenge): parts or components and their concatenation rules (CF cat Monoid in Spivak’s “Category Theory for the Sciences”)
b.      Agnt (Frostbite): agents, endowed with objectives, intentional and rational actors
5.       BM opposes:
a.       Deep Learning : learning by heart, agnostic, data-intensive and computational, non-transferable
b.      Human: conceptual, theoretical / modeling, reusable in other fields
6.       the DQN learning Frostbite undoubtedly has high-level features equivalent to agents (hostile or not), but it does not have the general conceptual grasp of what an agent is, which would allow it to move easily from one game to another (transfer), unlike a human
7.       What the human does is exactly to construct (more or less easily) a functor from a new (to him) domain towards a more or less abstract category; this is what happens in the functor
Characters challenge → Lego
This structural point of view, the one defended in the Erlangen program, puts relations between objects above the objects themselves, whereas Deep Learning does not explicitly distinguish objects and relations.
8.       As already said, 7 is not automatic, and this search goes through heuristics
9.       Problem-solving heuristics: the special case of maths:
a.       Polya "how to solve it"
b.      Terence Tao "solving mathematical problems: a personal perspective": 62 occurrences of the word "strategy" (in fact Tao prepared for the Olympics by reading Polya)
c.       "Learning mathematics using heuristic approach", Hoon
d.      "Methodix" collection
e.     
10.   Let us attach a category euPEDIa, and see it as essentially isomorphic to Grph.
For each student, facing a problem pb, there is thus an optimization of the sequence of heuristics {h(t)} adapted to his personality :
Learn: (student, pb) → {h(t)}
11.   Although the idea of a 'school according to Watson (India)' is gaining ground,

it is not obvious how to find an automatic learning procedure as in 10