A natural language approach to automated cryptanalysis of two-time pads

A natural language approach to automated cryptanalysis of two-time pads
of 10
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Related Documents
  A Natural Language Approach to Automated Cryptanalysisof Two-time Pads Joshua Mason Johns Hopkins University  josh@cs.jhu.eduKathryn Watkins Johns Hopkins University kwatkins@jhu.eduJason Eisner Johns Hopkins University  jason@cs.jhu.eduAdam Stubblefield Johns Hopkins University ABSTRACT While keystream reuse in stream ciphers and one-time padshas been a well known problem for several decades, the riskto real systems has been underappreciated. Previous tech-niques have relied on being able to accurately guess wordsand phrases that appear in one of the plaintext messages,making it far easier to claim that “an attacker would neverbe able to do  that  .” In this paper, we show how an adver-sary can automatically recover messages encrypted underthe same keystream if only the  type   of each message is known(e.g. an HTML page in English). Our method, which is re-lated to HMMs, recovers the most probable plaintext of thistype by using a statistical language model and a dynamicprogramming algorithm. It produces up to 99% accuracy onrealistic data and can process ciphertexts at 200ms per byteon a $2,000 PC. To further demonstrate the practical effec-tiveness of the method, we show that our tool can recoverdocuments encrypted by Microsoft Word 2002 [22]. Categories and Subject Descriptors E.3 [ Data ]: Data Encryption General Terms Security Keywords Keystream reuse, one-time pad, stream cipher 1 Introduction Since their discovery by Gilbert Vernam in 1917 [20], streamciphers have been a popular method of encryption. In astream cipher, the plaintext,  p , is exclusive- or ed ( xor ed)with a keystream,  k , to produce the ciphertext,  p ⊕ k  =  c .A special case arises when the keystream is truly random:the cipher is known as a one-time pad, proved unbreakableby Shannon [18]. Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee. CCS’06,  October 30–November 3, 2006, Alexandria, Virginia, USA.Copyright 2006 ACM 1-59593-518-5/06/0010 ... $ 5.00. It is well known that the security of stream ciphers restson never reusing the keystream  k  [9]. For if   k  is   reused to en-crypt two different plaintexts,  p  and  q  , then the ciphertexts  p ⊕ k  and  q  ⊕ k  can be  xor ed together to recover  p ⊕ q  . Thegoal of this paper is to complete this attack by recovering  p and  q   from  p ⊕ q  . We call this the “two-time pad problem.”In this paper we present an automated method for re-covering  p  and  q   given only the “type” of each file. Morespecifically, we assume that  p  and  q   are drawn from someknown probability distributions. For example,  p  might bea Word document and  q   might be a HTML web page. Theprobability distributions can be built from a large corpus of examples of each type (e.g. by mining the Web for docu-ments or web pages). Given the probability distributions,we then transform the problem of recovering  p  and  q   into a“decoding” problem that can be solved using some modifiedtechniques from the natural language processing community.Our results show that the technique is extremely effectiveon realistic datasets (more than 99% accuracy on some filetypes) while remaining efficient (200ms per recovered byte).Our attack on two-time pads has practical consequences.Proofs that keystream reuse leaks information hasn’t stoppedsystem designers from reusing keystreams. A small sam-pling of the systems so affected include Microsoft Office [22],802.11 WEP [3], WinZip [11], PPTP [17], and Soviet diplo-matic, military, and intelligence communications intercepted [2,21]. We do not expect that this problem will disappearany time soon: indeed, since NIST has endorsed the CTRmode for AES [7], effectively turning a block cipher intoa stream cipher, future systems that might otherwise haveused CBC with a constant IV may instead reuse keystreams.The WinZip vulnerability is already of this type.To demonstrate this practicality more concretely, we showthat our tool can be used to recover documents encryptedby Microsoft Word 2002. The vulnerability we focus onwas known before this work [22], but could not be exploitedeffectively. 1.1 Prior Work Perhaps the most famous attempt to recover plaintexts thathave been encrypted with the same keystream is the Na-tional Security Agency’s VENONA project [21]. The NSA’sforerunner, the Army’s Signal Intelligence Service, noticedthat some encrypted Soviet telegraph traffic appeared toreuse keystream material. The program to reconstruct themessages’ plaintext began in 1943 and did not end until1980. Over 3,000 messages were at least partially recovered.The project was partially declassified in 1995, and many of the decryptions were released to the public [2]. However, theciphertexts and cryptanalytic methods remain classified.  There is a “classical” method of recovering  p  and  q   from  p ⊕ q   when  p  and  q   are known to be English text. First guessa word likely to appear in the messages, say  the . Then, at-tempt to  xor  the  with each length-3 substring of   p  ⊕  q  .Wherever the result is something that “looks like” Englishtext, chances are that one of the messages has  the  in thatposition and the other message has the result of the  xor . Byrepeating this process many times, the cryptanalyst buildsup portions of plaintext. This method was somewhat for-malized by Rubin in 1978 [15].In 1996, Dawson and Nielsen [5] created a program thatuses a series of heuristic rules to automatically attemptthis style of decryption. They simplified matters by as-suming that the plaintexts used only 27 characters of the256-character ASCII set: the 26 English uppercase lettersand the space. Given  p ⊕ q  , this assumption allowed themto unambiguously recover non-coinciding spaces in  p  and  q  ,since in ASCII, an uppercase letter  xor ed with a space cannot be equal to any two uppercase letters  xor ed together.They further assumed that two characters that  xor ed to 0were both equal to space, the most common character. Todecode the words between the recovered spaces, they em-ployed lists of common words of various lengths (and a few“tricks”). They chose to test their system by running iton subsets of the same training data from which they hadcompiled their common-word lists (a preprocessed versionof the first 600,000 characters of the English Bible). Theycontinued adding new tricks and rules until they reached theresults shown in Figure 1. It is important to note that therules they added were  specifically designed to get good results on the examples they were using for testing   (hence are notguaranteed to work as well on other examples).We were able to re-attempt Dawson and Nielsen’s experi-ments on the King James Bible 1 using the new methodologydescribed in this paper  without any special tuning or tricks  .Dawson and Nielsen even included portions of all three testpassages they used, so our comparison is almost completelyapples-to-apples. Our results are compared with theirs inFigure 1. 2 Our Method Instead of layering on heuristic after heuristic to recover spe-cific types of plaintext, we instead take a more principledand general approach. Let  x  be the known  xor  of the twociphertexts. A feasible solution to the two-time pad prob-lem is a string pair (  p,q  ) such that  p ⊕ q   =  x . We assumethat  p  and  q   were independently drawn from known proba-bility distributions Pr 1  and Pr 2 , respectively. We then seekthe most probable of the feasible solutions: the (  p,q  ) thatmaximizes Pr 1 (  p ) · Pr 2 ( q  ).To define Pr 1  and Pr 2  in advance, we adopt a parametricmodel of distributions over plaintexts—known as a  language model  —and estimate its parameters from known plaintextsin each domain. For example, if   p  is known to be an Englishwebpage, we use a distribution Pr 1  that has previously beenfit against a  corpus   (naturally occurring collection) of En-glish webpages. The parametric form we adopt for Pr 1  andPr 2  is such that an exact solution to our search problem istractable.This kind of approach is widely used in the speech and nat-ural language processing community, where recovering the 1 We used the Project Gutenberg edition which matches theexcerpts from [5], available at most probable plaintext  p  given a speech signal  x  is actuallyknown as “decoding.” 2 We borrow some well-known tech-niques from that community: smoothed  n -gram languagemodels, along with dynamic programming (the “Viterbi de-coding” algorithm) to find the highest-probability path througha hidden Markov model [14]. 2.1 Smoothed  n -gram Language Models If the plaintext string  p  = (  p 1 ,p 2 ,...p  ) is known to havelength   , we wish Pr 1  to specify a probability distributionover strings of length   . In our experiments, we simply usean  n -gram character language model (taking  n  = 7), whichmeans definingPr 1 (  p ) =  Y i =1 Pr 1 (  p i  |  p i  . − n +1 ,...p i − 2 ,p i − 1 ) (1)where  i . −  n  denotes max( i  −  n, 0). In other words, thecharacter  p i  is assumed to have been chosen at random,where the random choice may be influenced arbitrarily bythe previous  n  −  1 characters (or  i  −  1 characters if   i <n ), but is otherwise independent of previous history. Thisindependence assumption is equivalent to saying that thestring  p  is generated from an ( n − 1)st order Markov process.Equation (1) is called an  n -gram model because the nu-merical factors in the product are derived, as we will see,from statistics on substrings of length  n . One obtains thesestatistics from a training corpus of relevant texts. Obvi-ously, in practice (and in our experiments) one must selectthis corpus without knowing the plaintexts  p  and  q  . 3 How-ever, one may have side information about the  type   of plain-text (“genre”). One can create a separate model for eachtype of plaintext that one wishes to recover (e.g. Englishcorporate email, Russian military orders, Klingon poetry inMicrosoft Word format). For example, our HTML languagemodel was derived from a training corpus that we built bysearching Google on common English words and crawlingthe search results. 2 That problem also requires knowing the distribution Pr( x |  p ), which characterizes how text strings tend to be renderedas speech. Fortunately, in our situation, the comparableprobability Pr( x  |  p,q  ) is simply 1, since the observed  x  isa deterministic function (namely  xor ) of   p,q  . Our methodcould easily be generalized for imperfect (noisy) eavesdrop-ping by modeling this probability differently. 3 It would be quite possible in future work, however, tochoose or build language models based on information about  p  and  q   that our methods themselves extract from  x . Asimple approach would try several choices of (Pr 1 , Pr 2 ) anduse the pair that maximizes the probability of observing  x .More sophisticated and rigorous approaches based on [1, 6]would use the imperfect decodings of   p  and  q   to  reestimate  the parameters of their respective language models, startingwith a generic language model and optionally iterating untilconvergence. Informally, the insight here is that the initialdecodings of   p  and  q  , particularly in portions of high confi-dence, carry useful information about (1) the genres of   p  and q   (e.g., English email), (2) the particular topics covered in  p and  q   (e.g., oil futures), and (3) the particular  n -grams thattend to recur in  p  and  q   specifically. For example, for (2), onecould use a search-engine query to retrieve a small corpusof documents that appear similar to the first-pass decod-ings of   p  and  q  , and use them to help build “story-specific”language models Pr 1  and Pr 2  [10] that better predict the  n -grams of documents on these topics and hence can retrievemore accurate versions of   p  and  q   on a second pass.  (a) Correct pair recovered Incorrect pair recovered Not decrypted[5] This work [5] This work [5] This work P  0 ⊕ P  1  62.7% 100.0% 17.8% 0% 20.5% 0% P  1 ⊕ P  2  61.5% 99.99% 17.6% 0.01% 20.9% 0% P  2 ⊕ P  0  62.6% 99.96% 17.9% 0.04% 19.5% 0%(b) Correct when keystream Incorrect when keystream Not decryptedused three times used three times[5] This work [5] This work [5] This work P  0  75.2% 100.0% 12.3% 0% 12.5% 0% P  1  76.3% 100.0% 11.4% 0% 12.3% 0% P  2  75.4% 100.0% 11.8% 0% 12.8% 0% Figure 1:  These tables show a comparison between previous work [5] and this work. All results presented for previouswork are directly from [5]. Both systems were trained on the exact same dataset (the first 600,000 characters of theKing James Version of the Bible, specially formatted as in [5] — all punctuation other than spaces were removedand all letters converted to upper case) and were tested on the same three plaintexts (those used in [5], which wereincluded in the training set). Unlike the prior work, our system was tuned  automatically   on the training set, and nottuned at all for the test set. (a) The first table shows the results of recovering the plaintexts from the listed xorcombinations. The reported percentages show the recovery status for the pair of characters in each plaintext position, without necessarily being in the correct plaintext  . For example, the recovered  P  0  could contain parts of   P  0  and parts of   P  1 .(b) The second table shows the results when the same keystream is used to encrypt all three files, and  P  0  ⊕  P  1  and P  1  ⊕  P  2  are fed as inputs to the recovery program  simultaneously  . Here the percentages show whether a character wascorrectly recovered in the  correct   file. It is tempting to define Pr 1 ( s  |  h , o , b , n , o , b ) as the frac-tion of occurrences of   hobnob  in the Pr 1  training corpus thatwere followed by  s : namely  c ( hobnobs ) /c ( hobnob? ), where c ( ... ) denotes count in the training corpus and  ?  is a wild-card. Unfortunately, even for a large training corpus, such afraction is often zero (an underestimate!) or undefined. Evenpositive fractions are unreliable if the denominator is small.One should use standard “smoothing” techniques from nat-ural language processing to obtain more robust estimatesfrom corpora of finite size.Specifically, we chose parametric Witten-Bell backoff smooth-ing, which is about the state of the art for  n -gram models[4]. This method estimates the 7-gram probability by inter-polating between the naive count ratio above and a recur-sively smoothed estimate of the 6-gram probability Pr 1 ( s  | o , b , n , o , b ). The latter, known as a “backed-off” estimate, isless vulnerable to low counts because shorter contexts suchas  obnob  pick up more (albeit less relevant) instances. Theinterpolation coefficient favors the backed-off estimate if ob-served 7-grams of the form  hobnob?  have a low count onaverage, indicating that the longer context  hobnob  is insuf-ficiently observed.Notice that the first factor in equation (1) is simply Pr(  p 1 ),which considers no contextual information at all. This is ap-propriate if   p  is an arbitrary packet that might come fromthe middle of a message. If we know that  p  starts at thebeginning of a message, we prepend a special character  bom to it, so that  p 1  =  bom . Since  p 2 ,...p n  are all conditionedon  p 1  (among other things), their choice will reflect thisbeginning-of-message context. Similarly, if we know that  p ends at the end of a message, we append a special charac-ter  eom , which will help us correctly reconstruct the finalcharacters of an unknown plaintext  p . Of course, for thesesteps to be useful, the messages in the training corpus mustalso contain  bom  and  eom  characters. Our experiments onlyused the  bom  character. 2.2 Finite-State Language Models Having estimated our probabilities, we can regard the 7-gram language model Pr 1  defined by equation (1) as a verylarge edge-labeled directed graph,  G 1 , which is illustratedin Figure 2d, Figure 2a. Each vertex or “state” of   G 1  repre-sents a context—not necessarily observed in training data—such as the 6-gram  not se .Sampling a string of length    from Pr 1  corresponds to arandom walk on  G 1 . When the random walk reaches somestate, such as  hobnob , it next randomly follows an outgo-ing edge; for instance, it chooses the edge labeled  s  withindependent probability Pr 1 ( s  |  h , o , b , n , o , b ). Followingthis edge generates the character  s  and arrives at a new6-gram context state  obnobs . Note that the  h  has beensafely forgotten since, by assumption, the choice of the  next  edge depends only on the 6 most recently generated char-acters. Our random walk is defined to start at the empty,0-gram context, representing ignorance; it proceeds imme-diately through 1-gram, 2-gram, ...contexts until it entersthe 6-gram contexts and continues to move among those.The probability of sampling a particular string  p  by thisprocess, Pr 1 (  p ), is the probability of the (unique) path la-beled with  p . (A path’s label is defined as the concatenationof its edges’ labels, and its probability is defined as the prod-uct of its edges’ probabilities.)In effect, we have defined Pr 1  using a probabilistic finite-state automaton. 4 In fact, our attack would work for anylanguage models Pr 1 , Pr 2  defined in this way, not just  n -gram language models. In the general finite-state case, dif-ferent states could remember different amounts of context—or non-local context such as a “region” in a document. Forexample,  n -gram probabilities might be significantly differ- 4 Except that  G 1  does not have final states; we simply stopafter generating    characters, where    is given. This is re-lated to our treatment of   bom  and  eom .  obnobrobnobsobnobthobnobstbbb (a) r16 17(b,c)(c,b)(d,e)(e,d)(r,s)(t,u) . . .(s,r)(u,t) . . .nconcrnconcsnconctinconcstccc (b) r (c) (b,c)(b,c)(b,c)obnobsnconcr17nconcuobnobt17obnobrnconcs17inconchobnob16(s,r)(r,s)(t,u) (d) Figure 2:  Fragments of the graphs built lazily by our algorithm. (a) shows  G 1 , which defines Pr 1 .  If   we are everin the state  hobnob  (a matter that is yet to be determined), then the next character is most likely to be  b ,  s , space,or punctuation—as reflected in arc probabilities not shown—though it could be anything. (b) similarly shows  G 2 . inconc  is most likely to be followed by  e ,  i ,  l ,  o ,  r , or  u . (c) shows  X  , a straight-line automaton that encodes theobserved stream  x  =  p xor q . The figure shows the unlikely case where  x  = ( ..., 1 , 1 , 1 , 1 , 1 , 1 , 1 ,... ) : thus all arcs in  X   arelabeled with  (  p i ,q i )  such that  p i  ⊕  q i  =  x i  = 1 . All paths have length  | x | . (d) shows  G x . This produces exactly the samepair sequences of length  | x |  as  X   does, but the arc probabilities now reflect the product of the two language models,requiring more and richer states.  (16 , hobnob , inconc )  is a reachable state in our example since  hobnob  ⊕  inconc  = 111111 .Of the 256 arcs  (  p 17 ,q 17 )  leaving this state, the only reasonably probable one is  s , r , since both factors of its probabilityPr 1 ( s  |  hobnob ) · Pr 2 ( r  |  inconc )  are reasonably large. Note, however, that our algorithm might choose a less probable arc(from this state or from a competing state also at time 16) in order to find the  globally   best path of   G x  that it seeks. ent in a message header vs. the rest of the message, or anHTML table vs. the rest of the HTML document. Beyondremembering the previous  n  −  1 characters of context, astate can remember whether the previous context includesa  <table>  tag that has not yet been closed with  </table> .Useful non-local properties of the context can be manuallyhard-coded into the FSA, or learned automatically from acorpus [1]. 2.3 Cross Product of Language Models We now move closer to our goal by consructing the jointdistribution Pr(  p,q  ). Recall our assumption that  p  and  q   aresampled  independently   from the genre-specific probabilitydistributions Pr 1  and Pr 2 . It follows that Pr(  p,q  ) = Pr 1 (  p ) · Pr 2 ( q  ). Replacing Pr 1 (  p ) by its definition (1) and Pr 2 ( q  ) byits similar definition, and rearranging the factors, it followsthatPr(  p,q  ) =  Y i =1 Pr(  p i ,q  i  |  p i  . − n +1 ,...p i − 2 ,p i − 1 ,q  i  . − n +1 ,...q  i − 2 ,q  i − 1 ) (2)wherePr(  p i ,q  i  |  p i  . − n +1 ,...,p i − 1 ,q  i  . − n +1 ,...,q  i − 1 )= Pr 1 (  p i  |  p i  . − n +1 ,...,p i − 1 ) · Pr 2 ( q  i  | q  i  . − n +1 ,...,q  i − 1 ) (3)We can regard equation (2) as defining an even largergraph,  G  (similar to Figure 2d), which may be constructedas the cross product of   G 1  = ( V  1 ,E  1 ) and  G 2  = ( V  2 ,E  2 ).That is,  G  = ( V  1  × V  2 ,E  ), where  E   contains the labelededge ( u 1 ,u 2 ) ( char 1 ,char 2 ) :  prob 1 · prob 2 −−−−−−−−−−−−−−→ ( v 1 ,v 2 ) iff   E  1  contains u 1 char 1  :  prob 1 −−−−−−→ v 1  and  E  2  contains  u 2 char 2  :  prob 2 −−−−−−→ v 2 . The weight  prob 1 ·  prob 2  of this edge is justified by (3). Again, we never explicitly   construct this enormous graph, which has morethan 256 14 edges (for our situation of   n  = 7 and a characterset of size 256).This construction of   G  is similar to the usual constructionfor intersecting finite-state automata [8], the difference beingthat we obtain a (weighted) automaton over character  pairs  would still apply even if, as suggested at the end of theprevious section, we used finite-state language models otherthan  n -gram models. It is known as the “same-length crossproduct construction.” 2.4 Constructing and Searching The Space of Feasible Solutions Given  x  of length   , the feasible solutions (  p,q  ) correspondto the paths through  G  that are compatible with  x . A path e 1 e 2  ...e   is compatible with  x  if for each 1  ≤  i  ≤   , theedge  e i  is labeled with some (  p i ,q  i ) such that  p i ⊕ q  i  =  x i .As a special case, if   p i  and/or  q  i  is known to be the specialcharacter  bom  or  eom , then  p i ⊕ q  i  is unconstrained (indeedundefined).  We now construct a new weighted graph,  G x , that rep-resents just the feasible paths through  G . All these pathshave length   , so  G x  will be acyclic. We will then find themost probable path in  G x  and read off its label (  p,q  ).The construction is simple.  G x , shown in Figure 2d, con-tains precisely all edges of the form( i − 1 ,  (  p i  . − n +1  ...,p i − 1 ) ,  ( q  i  . − n +1  ...,q  i − 1 )) ( p i ,q i ) :  prob −−−−−−→  ( i,  (  p i  . − n +2  ...,p i ) ,  ( q  i  . − n +2  ...,q  i ))(4)such that  p j ⊕ q  j  =  x j  for each  j  ∈ [ i . − n  + 1 ,i ] and  prob  =Pr 1 (  p i  |  p i  . − n +1 ,...p i − 2 ,p i − 1 ) · Pr 2 ( q  i  | q  i  . − n +1 ,...q  i − 2 ,q  i − 1 ). G x  may also be obtained in finite-state terms as follows.We represent  x  as a graph  X   (Figure 2c) with vertices 0,1, ...  . From vertex  i − 1 to vertex  i , we draw 256 edges, 5 labeled with the 256 (  p i ,q  i ) pairs that are compatible with x i , namely (0 , 0  ⊕  x i ) ,... (255 , 255  ⊕  x i ). We then com-pute  G x  = ( V  x ,E  x ) by intersecting  X   with the language-pair model  G  as one would intersect finite-state automata.This is like the cross-product construction of the previoussection, except that here, the edge set  E  x  contains ( i  − 1 ,u )  ( char 1 ,char 2 ) : 1 · prob −−−−−−−−−−−−−−→ ( i,v ) iff the edge set of   X   contains( i − 1) ( char 1 ,char 2 ) : 1 −−−−−−→  i  and  E   contains  u ( char 1 ,char 2 ) :  prob −−−−−−→  v .Using dynamic programming, it is now possible in  O (  )time to obtain our decoding by finding the best length-   pathof   G x  from the initial state (0,(),()). Simply run a single-source shortest-path algorithm to find the shortest path toany state of the form ( ,... ), taking the length of each edgeto be the negative logarithm of its probability, so that min-imizing the sum of lengths is equivalent to maximizing theproduct of probabilities. 6 It is not even necessary to use thefull Dijkstra’s algorithm with a priority queue, since  G x  isacyclic. Simply iterate over the vertices of   G x  in increas-ing order of   i , and compute the shortest path to each ver-tex ( i,... ) by considering its incoming arcs from vertices( i − 1 ,... ) and the shortest paths to  those   vertices. This isknown as the Viterbi algorithm; it is guaranteed to find theoptimal path.The trouble is the size of   G x . On the upside, because x j  constrains the pair (  p j ,q  j ) in equation (4), there are atmost   · 256 6 states and   · 256 7 edges in  G x  (not   · 256 12 and  · 256 14 ). Unfortunately, this is still an astronomical num-ber. It can be reduced somewhat if Pr 1  or Pr 2  places hardrestrictions on characters or character sequences in  p  and  q  ,so that some edges have probability 0 and can be omitted.As a simple example, perhaps it is known that each (  p j ,q  j )must be a pair of   printable   (or even alphanumeric) charac-ters for which  p j ⊕ q  j  =  x j . However, additional techniquesare usually needed.Our principal technique at present is to prune  G x  dras-tically, sacrificing the optimality guarantee of the Viterbialgorithm. In practice, as soon as we construct the states( i,... ) at time  i , we determine the shortest path from theinitial state to each, just as above. But we then keep onlythe 100 best of these time- i  states according to this metric(less greedy than keeping only the 1 best!), so that we needto construct at most 100 · 256 states at time  i  + 1. Theseare then evaluated and pruned again, and the decoding pro- 5 Each edge has weight 1 for purposes of weighted intersec-tion or weighted cross-product. This is directly related tofootnote 2. 6 Using logarithms also prevents underflow.ceeds. More sophisticated multi-pass or A* techniques arealso possible, although we have not implemented them. 7 2.5 Multiple Reuse If a keystream  k  is used more than twice, the method workseven better. Assume we now have three plaintexts to re-cover,  p ,  q  , and  r , and are given  p ⊕ q   and  p ⊕ r  (note that q   ⊕  r  adds no further information). A state of   G  or  G x now includes a triple of language model states, and an edgeprobability is a product of 3 language model probabilities.The Viterbi algorithm can be used as before to find thebest path through this graph given a pair of outputs (thosecorresponding to  p ⊕ q   and  p ⊕ r ). Of course, this techniquecan be extended beyond three plaintexts in a similar fashion. 3 Implementation Our implementation of the probabilistic plaintext recoverycan be separated cleanly into two distinct phases. First,language models are built for each of the types of plaintextthat will be recovered. This process only needs to occur onceper type of plaintext since the resulting model can be reusedwhenever a new plaintext of that type needs to be recovered.The second phase is the actual plaintext recovery.All our model building and cracking experiments wererun on a commodity Dell server (Dual Xeon 3.0 GHz, 8GBRAM) that cost under $2,000. The server runs a Linux ker-nel that supports the Xeon’s 64-bit extensions to the x86instruction set. The JVM used is BEA’s freely availableJRockit since Sun’s JVM does not currently support 64-bitmemory spaces on x86. 3.1 Building the Language Models To build the models, we used an open source natural lan-guage processing (NLP) package called LingPipe [4]. 8 Ling-Pipe is a Java package that provides an API for many com-mon NLP tasks such as clustering, spelling correction, andpart-of-speech tagging. We only used it to build a characterbased  n -gram model based on a large corpus of documents(see section 4 for details of the corpora used in our experi-ments). Internally, LingPipe stores the model as a trie withgreater length  n -grams nearer the leaves. We had LingPipe“compile” the model down to a simple lookup table basedrepresentation of the trie. Each row of the table, whichcorresponds to a single  n -gram, takes 18 bytes except forthe rows which correspond to leaf nodes (maximal length  n -grams) which take only 10 bytes. If an  n -gram is never seenin the training data, it will not appear in the table: insteadthe longest substring of the  n -gram that does appear in thetable will be used. The extra 8 bytes in these nodes specifieshow to compute the probability in this “backed-off” case.All probabilities in both LingPipe and our Viterbi implen-tation are computed and stored in log-space to avoid issueswith integer underflow. All of the language models used inthis paper have  n  = 7. The language models take several 7 If we used our metric to prioritize exploration of   G x  insteadof pruning it, we would obtain A* algorithms (known inthe speech and language processing community as “stackdecoders”). In the same A* vein, the metric’s accuracy canbe improved by considering right context as well as left: onecan add an estimate of the shortest path from ( i,... ) throughthe remaining ciphertext to the final state. Such estimatescan be batch-computed quickly by decoding  x  from end-to-beginning using smaller,  m -gram language models ( m < n ). 8 Available at:
Related Search
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks