Dao to learn: text representation and summary

La recherche d’information dans un texte est une des tâches classiques du text mining.

Un résumé peut consister à mettre en avant les concepts souvent répétés dans un texte : c’est une approche en fréquence absolue.

Ou l’on peut se baser sur une distribution de référence et mettre en avant les concepts du texte relativement plus fréquents que la référence.

On peut encore itérer cette approche, et regarder les n-grams relativement plus fréquents.

Mais ces n-grams peuvent être ou non constitués de concepts accolés dans la phrase (modulo les stop words que l’on aura pris soin de retirer). En général, on peut combiner ces deux cas.

Prenons à titre d’exemple le texte de Polya bien connu, ‘How to solve it’.

Par ordre de fréquence relative par rapport à un benchmark (ici wikipédia), on a les différents concepts et leurs associations de type \(x \rightarrow y\), là aussi dans l’ordre fréquentiel.

Par exemple \( auxiliari\_problem \rightarrow origin\_problem\) , etc.

On voit apparaitre des concepts qui sont bien connu des lecteurs de ce livre : auxiliari-problem, sign of progress, variate the problem, heuristic reasonning...

Polya donne à voir comme personne un art de la découverte qu’il n’hésite pas rapprocher des Grandes Traversées du XVe s.

Cette représentation du texte permet aussi de créer un résumé automatique, en ordonnant les phrases représentant le mieux les structures \( x_i \rightarrow Y_i=\{ Y_{j}^{i} \} \):

AUXILIARI_PROBLEM

The auxiliary problem was, as a special case, in fact much less ambitious than the original problem

To sum up, we used the less difficult, less ambitious, special, auxiliary problem as a stepping stone in solving the more difficult, more ambitious, general, original problem

ORIGIN_PROBLEM

Convertible reductions are, in a certain respect, more important and more desirable than other ways to introduce auxiliary problems, but auxiliary problems which are not equivalent to the original problem may also be very useful

SIGN_PROGRESS

The day before that memorable date on which they sighted the island of San Salvador, as the floating objects in the water became so frequent, they thought: "It looks Signs of Progress as if we were approaching some land”

Our undertaking may be important or unimportant, our problem of any kind when we are working intensely, we watch eagerly for signs of progress as Columbus and his companions watched for signs of approaching land

WORK_BACKWARD

Modern Heuristic : There are articles discussing methodical questions often important in elementary mathematics, as pappus, WORKING BACKWARDS (already quoted under 3) , reductio AD ABSURDUM AND INDIRECT PROOF, INDUCTION AND MATHEMATICAL INDUCTION, SETTING UP EQUATIONS, TEST BY DIMENSION, and WHY PROOFS

Analysis is neatly defined by pappus, and it is a useful term, describing a typical way of devising a plan, starting from the unknown (or the conclusion) and working backwards, toward the data (or the hypothesis)

LOOK_UNKNOWN

There are, however, questions and suggestions which are frequently helpful, as look at the unknown

There is a suggestion that puts our finger on an essential common point: Look at the unknown

VARIAT_PROBLEM

Variation of the problem may lead us to auxiliary ELEMENTS, or to the discovery of a more accessible auxiliary PROBLEM

Variation of the problem may lead to some appropriate auxiliary problem: // you cannot solve the proposed problem, try to solve first some related problem

DECOMPOS_RECOMBIN

Many questions aim at the variation of the problem by specified means, as going back to the definition, using analogy, generalization, SPECIALIZATION, DECOMPOSING AND RECOMBINING

There are certain modes of varying the problem which are typically useful, as going back to the definition, DECOMPOSING AND RECOMBINING, introducing AUXILIARY ELEMENTS, GENERALIZATION, SPECIALIZATION, and the use of ANALOGY

USE_RESULT

Using the result of the auxiliary problem we easily solve our original problem (we have to complete the parallelogram)

We may use the result of the auxiliary problem

DRAW_FIGUR

We start the detailed consideration of such a problem by drawing a figure containing the unknown and the data, all these elements being assembled as it is prescribed by the condition of the problem

HEURIST_REASON

It is concerned with the nature of heuristic reasoning and, by extension, with a kind of reasoning which is nondemonstrative although important and which we shall call, for lack of a better term, plausible reasoning

We could call the reasoning that underlies this kind of evidence "heuristic reasoning" or "inductive reasoning" or (if we wish to avoid stretching the meaning of existing terms) "plausible reasoning

BRIGHT_IDEA

A sudden advance toward the solution is called a bright idea, a good idea, a happy thought, a brain-wave (in German there is a more technical term, Einfalt)

" Bright idea, or "good idea," or "seeing the light," is a colloquial expression describing a sudden advance toward the solution

SPECIAL_CASE

This auxiliary problem is a special case of the original problem (the extreme special case in which one of the two ships is at rest)

INTRODUC_AUXILIARI_ELEMENT

In general, having recollected a formerly solved related problem and wishing to use it for our present one, we must often ask: Should we introduce some auxiliary element in order to make its use possible

We aim at such an effect when, thinking about the possible use of a formerly solved related problem, we ask: Should you introduce some auxiliary element in order to make its use possible

KNOW_RELAT_PROBLEM

Setting a routine problem, the teacher thrusts under the nose of the student an immediate and decisive answer to the question: Do you know a related problem

Let us go back to the situation as it presented itself at the beginning of section 10 when the question was asked: Do you know a related problem

ANALOG

We may vary the problem by decomposing and RECOMBiNiNG its elements, or by going back to the definition of certain of its terms, or we may use the great resources of generalization, specialization, and analogy

Aller plus loin dans la compression de la représentation peut se faire en projetant Y sur \( X= \{ x_i, \; i<N \} \), i.e. en restreignant les \(Y_{j}^{k}\) aux \( x_i, \; i<N \), se souvenant que les \( x_i \) sont ordonnés selon leur fréquence relative.

Plus précisément, on cherche les occurrences des \(x_i \) dans chaque \( \{Y_{j}^{k}, j<cut \} \), imposant donc que \(x_i \) soit ‘proche’ de \(x_k \) (dans les \( cut \) premiers \( \{ Y_{j}^{k} \}_j \) ). D’où une nouvelle table \( x_i \rightarrow Z^i = {x_k^i} \) où l’on ordonne sur \( \| Z^i \| \) : le concept \(x_0 \) est le plus ‘central’ en ce qu’il est le plus connecté.

On supprime ensuite les \( x_i \) quand ils se trouvent dans les \( Z^k, \; i>k \). On choisit enfin de supprimer les \( x_i \) si 50% de \( Z^i \) se trouvent dans un \( Z^k \; i>k\). On obtient pour ‘How to solve it’ (partant de près de 1000 \(x_i \), cut = 10) :

On constate que la première thématique, ‘origin_problem’, contient 169 voisins, dont : 'auxiliari_problem',

'use_result',

'special_case',

'simpler_analog_problem',

'introduc_auxiliari',

'less_ambiti',

'restat',

'devis',

'reconsid',

'tri',

'step',

'familiar',

'simpler',

'vari',

'deriv',

'easier',

'auxiliari',

'various',

'combin',

'passag',

'modifi',

qui sont bien les voisins attendus de la thématique Variation-Comparaison.

On trouve aussi la thématique ‘sign_progress’, qui pointe sur ‘progress_achiev’, ‘approach_land’, mais aussi sur un versant psychologique : ‘subconsci_work’, ‘suspect’, ‘mental’…

La thématique ‘part_condit’ renvoie à ‘decompos_recombin’, ‘general_special’, mais aussi sur ‘restat’, ‘tri’ qu’on retrouve dans le premier cluster ‘origin_problem’.

‘technic_term’ pointe sur ‘heuristic_reason’, ‘bright_idea’ ,’relat’.

‘auxiliari_element’ pointe sur 'auxiliari_problem', ‘variat_problem’,’analog’, ‘familiar’, ‘relat’, et semble donc assez proche de ‘origin_problem’.

Autre cluster intéressant ‘plausibl_reason’ pointe sur 'heurist_reason', 'point_view', 'heurist_syllog', 'induct', 'infer'.

On obtient un peu plus de clusters avec cut = 5 :

Utilisant le logiciel de représentation de graphe et clustering Delphi , on retrouve essentiellement les mêmes résultats. ci dessous la partie haute puis basse du même graphe.

Dao to learn

Friday, 9 March 2018

text representation and summary

No comments:

Post a Comment

Blog Archive

About Me