
Text Processing in Python

FRONTMATTER -- ACKNOWLEDGMENTS:
-------------------------------------------------------------------

Portions of this book are adapted from my column _Charming Python_ and other writing first published by -IBM developerWorks-. I wish to thank IBM for publishing me, for granting permission to use this material, and most especially for maintaining such a general and useful resource for programmers.

+++

The Python community is a wonderfully friendly place. I made drafts of this book, while in progress, available on the Internet. I received numerous helpful and kind responses, many that helped make the book better than it would otherwise have been. In particular, the following folks made suggestions and contributions to the book while in draft form. I apologize to any correspondents I may have omitted from the list; your advice was appreciated even if momentarily lost in the bulk of my saved email.

+++

Sam Penrose
  UserDict string substitution hacks.
Roman Suzi
  More on string substitution hacks.
Samuel S. Chessman
  Helpful observations of various typos.
John W. Krahn
  Helpful observations of various typos.
Terry J. Reedy
  Found lots of typos and made good organizational suggestions.
Amund Tveit
  Pointers to word-based Huffman compression for Appendix B.
Pascal Oberndoerfer
  Suggestions about focus of parser discussion.
Bob Weiner
  Suggestions about focus of parser discussion.
Max M
  Thought provocation about XML and Unicode entities.
John Machin
  Nudging to improve sample regular expression functions.
Magnus Lie Hetland
  Called use of default static arguments "spooky code" and failed to appreciate the clarity of the '<>' operator.
Tim Andrews
  Found lots of typos in Chapters 3 and 2.
Marc-Andre Lemburg
  Wrote [mx.TextTools] in the first place and made helpful comments on my coverage of it.
Mike C. Fletcher
  Wrote [SimpleParse] in the first place and made helpful comments on my coverage of it.
Lorenzo M. Catucci
  Suggested glossary entries for CRC and hash.
David LeBlanc
  Various organizational ideas while in draft. Then he wound up acting as one of my technical reviewers and provided a huge amount of helpful advice on both content and organization.
Mike Dussault
  Found an error in combinatorial HOFs and made good suggestions on Appendix A.
Guillermo Fernandez
  Advice on clarifying explanations of compression techniques.
Roland Gerlach
  Typos are boundless, but a bit less for his email.
Antonio Cuni
  Found error in original Schwartzian sort example and another in 'map()'/'zip()' discussion.
Michele Simionato
  Acted as a nice sounding board for deciding on final organization of the appendices.
Jesper Hertel
  Was frustrated that I refused to take his well-reasoned advice for code conventions.
Andrew MacIntyre
  Did not comment on this book, but has maintained the OS/2 port of Python for several versions. This made my life easier by letting me test and write examples on my favorite machine.
Tim Churches
  A great deal of subversive entertainment, despite not actually fixing anything in this book.
Moshe Zadka
  Served as technical reviewer of this book in manuscript and brought both erudition and an eye for detail to the job.
Sergey Konozenko
  Boosted my confidence in final preparation with the enthusiasm he brought to his technical review--and even more so with the acuity with which he got my attempts to impose mental challenge on my readers.

FRONTMATTER -- PREFACE
-------------------------------------------------------------------

  Beautiful is better than ugly.
  Explicit is better than implicit.
  Simple is better than complex.
  Complex is better than complicated.
  Flat is better than nested.
  Sparse is better than dense.
  Readability counts.
  Special cases aren't special enough to break the rules.
  Although practicality beats purity.
  Errors should never pass silently.
  Unless explicitly silenced.
  In the face of ambiguity, refuse the temptation to guess.
  There should be one--and preferably only one--obvious way to do it.
  Although that way may not be obvious at first unless you're Dutch.
  Now is better than never.
  Although never is often better than *right* now.
  If the implementation is hard to explain, it's a bad idea.
  If the implementation is easy to explain, it may be a good idea.
  Namespaces are one honking great idea--let's do more of those!

  --Tim Peters, The Zen of Python

SECTION 1 -- What is Text Processing?
-------------------------------------------------------------------

At the broadest level text processing is simply taking textual information and -doing something- with it. This doing might be restructuring or reformatting it, extracting smaller bits of information from it, algorithmically modifying the content of the information, or performing calculations that depend on the textual information.

The lines between "text" and the even more general term "data" are extremely fuzzy; at an approximation, text is just data that lives in forms that people can themselves read--at least in principle, and maybe with a bit of effort. Most typically computer text is composed of sequences of bits that have a natural representation as letters, numerals, and symbols; most often such text is delimited (if delimited at all) by symbols and formatting that can be easily pronounced as "next datum."

The lines are fuzzy, but the data that seems least like text--and that, therefore, this particular book is least concerned with--is the data that makes up multimedia (pictures, sounds, video, animation, etc.) and data that makes up UI events (draw a window, move the mouse, open an application, etc.). Like I said, the lines are fuzzy, and some representations of the most nontextual data are themselves pretty textual. But in general, the subject of this book is all the stuff on the near side of that fuzzy line.

Text processing is arguably what most programmers spend most of their time doing.
The information that lives in business software systems mostly comes down to collections of words about the application domain--maybe with a few special symbols mixed in. Internet communications protocols consist mostly of a few special words used as headers, a little bit of constrained formatting, and message bodies consisting of additional wordish texts. Configuration files, log files, CSV and fixed-length data files, error files, documentation, and source code itself are all just sequences of words with bits of constraint and formatting applied. Programmers and developers spend so much time with text processing that it is easy to forget that that is what we are doing.

The most common text processing application is probably your favorite text editor. Beyond simple entry of new characters, text editors perform such text processing tasks as search/replace and copy/paste, which--given guided interaction with the user--accomplish sophisticated manipulation of textual sources. Many text editors go farther than these simple capabilities and include their own complete programming systems (usually called "macro processing"); in those cases where editors include Turing-complete macro languages, text editors suffice, in principle, to accomplish anything that the examples in this book can.

After text editors, a variety of text processing tools are widely used by developers. Tools like "File Find" under Windows, or "grep" on Unix (and other platforms), perform the basic chore of -locating- text patterns. "Little languages" like sed and awk perform basic text manipulation (or even nonbasic). A large number of utilities--especially in Unix-like environments--perform small custom text processing tasks: 'wc', 'sort', 'tr', 'md5sum', 'uniq', 'split', 'strings', and many others.

At the top of the text processing food chain are general-purpose programming languages, such as Python. I wrote this book on
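
The small utilities named above can each be approximated in a few lines of a general-purpose language. What follows is only a rough sketch, assuming Python 3; the function names 'grep', 'wc', and 'sort_uniq_count' are hypothetical helpers invented for this illustration, and they mimic just the core behavior of the corresponding tools rather than their full option sets.

```python
import re
from collections import Counter


def grep(pattern, lines):
    """Return the lines containing a match for the regular expression."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]


def wc(text):
    """Return (line count, word count, character count) for a string."""
    return len(text.splitlines()), len(text.split()), len(text)


def sort_uniq_count(lines):
    """Return sorted (line, count) pairs, roughly like 'sort | uniq -c'."""
    return sorted(Counter(lines).items())


# Hypothetical sample data standing in for a small log file.
sample = "error: disk full\ninfo: started\nerror: disk full\n"
lines = sample.splitlines()

print(grep(r"^error", lines))
print(wc(sample))
print(sort_uniq_count(lines))
```

Each helper takes plain strings or lists of strings and returns ordinary Python values, so the pieces can be combined freely, much as the command-line tools are combined with pipes.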