Friday, 9 March 2018

text representation and summary

La recherche d’information dans un texte est une des tâches classiques du text mining.
Un résumé peut consister à mettre en avant les concepts souvent répétés dans un texte : c’est une approche en fréquence absolue.
Ou l’on peut se baser sur une distribution de référence et mettre en avant les concepts du texte relativement plus fréquents que la référence.
On peut encore itérer cette approche, et regarder les n-grams relativement plus fréquents.
Mais ces n-grams peuvent être ou non constitués de concepts accolés dans la phrase (modulo les stop words que l’on aura pris soin de retirer). En général, on peut combiner ces deux cas.
Prenons à titre d’exemple le texte de Polya bien connu, ‘How to solve it’.
Par ordre de fréquence relative par rapport à un benchmark (ici wikipédia), on a les différents concepts et leurs associations  de type \(x \rightarrow y\), là aussi dans l’ordre fréquentiel.
Par exemple \( auxiliari\_problem \rightarrow origin\_problem\) , etc.
On voit apparaitre des concepts qui sont bien connu des lecteurs de ce livre : auxiliari-problem, sign of progress, variate the problem, heuristic reasonning...
Polya donne à voir comme personne un art de la découverte qu’il n’hésite pas rapprocher des Grandes Traversées du XVe s.
Cette représentation du texte permet aussi de créer un résumé automatique, en ordonnant les phrases représentant le mieux les structures \( x_i \rightarrow Y_i=\{ Y_{j}^{i} \} \):
AUXILIARI_PROBLEM
 The auxiliary problem was, as a special case, in fact much less ambitious than the original problem
 To sum up, we used the less difficult, less ambitious, special, auxiliary problem as a stepping stone in solving the more difficult, more ambitious, general, original problem
ORIGIN_PROBLEM
 Convertible reductions are, in a certain respect, more important and more desirable than other ways to introduce auxiliary problems, but auxiliary problems which are not equivalent to the original problem may also be very useful
SIGN_PROGRESS
 The day before that memorable date on which they sighted the island of San Salvador, as the floating objects in the water became so frequent, they thought: "It looks  Signs of Progress  as if we were approaching some land”
 Our undertaking may be important or unimportant, our problem of any kind when we are working intensely, we watch eagerly for signs of progress as Columbus and his companions watched for signs of approaching land
WORK_BACKWARD
  Modern Heuristic : There are articles discussing methodical questions often important in elementary mathematics, as pappus, WORKING BACKWARDS (already quoted under 3) , reductio AD ABSURDUM AND INDIRECT PROOF, INDUCTION AND MATHEMATICAL INDUCTION, SETTING UP EQUATIONS, TEST BY DIMENSION, and WHY PROOFS
 Analysis is neatly defined by pappus, and it is a useful term, describing a typical way of devising a plan, starting from the unknown (or the conclusion) and working backwards, toward the data (or the hypothesis)
LOOK_UNKNOWN
 There are, however, questions and suggestions which are frequently helpful, as look at the unknown
 There is a suggestion that puts our finger on an essential common point: Look at the unknown
VARIAT_PROBLEM
 Variation of the problem may lead us to auxiliary ELEMENTS, or to the discovery of a more accessible auxiliary PROBLEM
 Variation of the problem may lead to some appropriate auxiliary problem: // you cannot solve the proposed problem, try to solve first some related problem
DECOMPOS_RECOMBIN
 Many questions aim at the variation of the problem by specified means, as going back to the definition, using analogy, generalization, SPECIALIZATION, DECOMPOSING AND RECOMBINING
 There are certain modes of varying the problem which are typically useful, as going back to the definition, DECOMPOSING AND RECOMBINING, introducing AUXILIARY ELEMENTS, GENERALIZATION, SPECIALIZATION, and the use of ANALOGY
USE_RESULT
 Using the result of the auxiliary problem we easily solve our original problem (we have to complete the parallelogram)
 We may use the result of the auxiliary problem
DRAW_FIGUR
 We start the detailed consideration of such a problem by drawing a figure containing the unknown and the data, all these elements being assembled as it is prescribed by the condition of the problem
HEURIST_REASON
 It is concerned with the nature of heuristic reasoning and, by extension, with a kind of reasoning which is nondemonstrative although important and which we shall call, for lack of a better term, plausible reasoning
 We could call the reasoning that underlies this kind of evidence "heuristic reasoning" or "inductive reasoning" or (if we wish to avoid stretching the meaning of existing terms) "plausible reasoning

BRIGHT_IDEA
 A sudden advance toward the solution is called a bright idea, a good idea, a happy thought, a brain-wave (in German there is a more technical term, Einfalt)
" Bright idea, or "good idea," or "seeing the light," is a colloquial expression describing a sudden advance toward the solution
SPECIAL_CASE
 This auxiliary problem is a special case of the original problem (the extreme special case in which one of the two ships is at rest)
INTRODUC_AUXILIARI_ELEMENT
 In general, having recollected a formerly solved related problem and wishing to use it for our present one, we must often ask: Should we introduce some auxiliary element in order to make its use possible
 We aim at such an effect when, thinking about the possible use of a formerly solved related problem, we ask: Should you introduce some auxiliary element in order to make its use possible
KNOW_RELAT_PROBLEM
 Setting a routine problem, the teacher thrusts under the nose of the student an immediate and decisive answer to the question: Do you know a related problem
 Let us go back to the situation as it presented itself at the beginning of section 10 when the question was asked: Do you know a related problem
ANALOG
 We may vary the problem by decomposing and RECOMBiNiNG its elements, or by going back to the definition of certain of its terms, or we may use the great resources of generalization, specialization, and analogy


Aller plus loin dans la compression de la représentation peut se faire en projetant Y sur \( X= \{ x_i, \; i<N \} \), i.e. en restreignant les \(Y_{j}^{k}\) aux \( x_i, \; i<N \), se souvenant que les \( x_i \) sont ordonnés selon leur fréquence relative. 
Plus précisément, on cherche les occurrences des  \(x_i \) dans chaque  \( \{Y_{j}^{k}, j<cut \} \), imposant donc que  \(x_i \) soit ‘proche’ de  \(x_k \) (dans les \( cut \) premiers \( \{ Y_{j}^{k} \}_j \) ). D’où une nouvelle table \( x_i \rightarrow Z^i = {x_k^i} \) où l’on ordonne sur \( \| Z^i \| \) : le concept \(x_0 \) est le plus ‘central’ en ce qu’il est le plus connecté.
On supprime ensuite les \( x_i \) quand ils se trouvent dans les \(  Z^k, \; i>k \). On choisit enfin de supprimer les \( x_i \) si 50% de \( Z^i \)  se trouvent dans un \( Z^k  \; i>k\). On obtient pour ‘How to solve it’ (partant de près de  1000 \(x_i \), cut = 10) :

On constate que la première thématique, ‘origin_problem’, contient 169 voisins, dont : 'auxiliari_problem',
 'use_result',
 'special_case',
 'simpler_analog_problem',
 'introduc_auxiliari',
 'less_ambiti',
 'restat',
 'devis',
 'reconsid',
 'tri',
 'step',
 'familiar',
 'simpler',
 'vari',
 'deriv',
 'easier',
 'auxiliari',
 'various',
 'combin',
 'passag',
 'modifi',
qui sont bien les voisins attendus de la thématique Variation-Comparaison.

On trouve aussi la thématique ‘sign_progress’, qui pointe sur ‘progress_achiev’, ‘approach_land’, mais aussi sur un versant  psychologique : ‘subconsci_work’, ‘suspect’, ‘mental’…

La thématique ‘part_condit’ renvoie à ‘decompos_recombin’, ‘general_special’, mais aussi sur ‘restat’, ‘tri’ qu’on retrouve dans le premier cluster ‘origin_problem’.
‘technic_term’ pointe sur ‘heuristic_reason’, ‘bright_idea’ ,’relat’.

‘auxiliari_element’ pointe sur 'auxiliari_problem', ‘variat_problem’,’analog’, ‘familiar’,  ‘relat’, et semble donc assez proche de ‘origin_problem’.

Autre cluster intéressant ‘plausibl_reason’ pointe sur 'heurist_reason', 'point_view', 'heurist_syllog', 'induct', 'infer'.

On obtient un peu plus de clusters avec cut = 5 :


Utilisant le logiciel de représentation de graphe et clustering Delphi , on retrouve essentiellement les mêmes résultats. ci dessous la partie haute puis basse du même graphe.






No comments:

Post a Comment