== name ==
Porter Word Stemmer
== category ==
Text Processing
== author ==
Steven M. Haflich mailto(smh@franz.com) wrote the CL version of the original stemmer algorithm written and maintained by Martin Porter.
== author-image ==
== short-description ==
A word stemmer takes a word as a string and removes common morphological endings to find the word's root stem.
== long-description ==
Stemmers are used in various information retrieval, indexing, and web scraping applications. For example, a search engine for a documentation set might want to collapse and index the words "compile", "compiler", "compiling", and "compilation" all into the same root "compil". Assuming uses of these words are related to the same concept, a search for any of them (actually, a search for the common stem) would find references to all of them.

Stemmers do sometimes collapse unrelated words into the same root. This causes false positives in searches. Stemmers work best where false positives do not greatly harm usability of the results.

Stemming a language like English (which is very irregular in its morphology) is particularly difficult, as the algorithm must find the right balance between handling many special cases and not collapsing too many unrelated words into the same stem. The stemming algorithm is a marvelous collection of transformational twists and tricks that must be executed in exactly the right way and in exactly the right order.

Martin Porter devised this particular stemmer in 1980, and he and others have since created versions of it in many programming languages. Steve Haflich coded the Common Lisp version a few years ago and offered it back to Porter, who maintains an official site for the stemmer. Good descriptions of the algorithm as well as a test suite are available on Porter's site. Don't expect to understand the stemming algorithm without some serious study, but it is quite fast and works really well.
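The indexing use case described in this section can be sketched roughly as follows. This is only an illustration, not part of stem.cl: the names build-stem-index and lookup-by-stem are hypothetical, and the stemmer is passed in as a function so the sketch does not assume how stem.cl was loaded. With the stemmer loaded, you would pass #'stem.

```lisp
;; Sketch only: BUILD-STEM-INDEX and LOOKUP-BY-STEM are hypothetical
;; helper names, not functions provided by stem.cl.
;; STEMMER is any function from a lower-case string to its stem;
;; after loading stem.cl this would be #'stem.

(defun build-stem-index (words stemmer)
  "Return an EQUAL hash table mapping each stem to the distinct WORDS that produced it."
  (let ((index (make-hash-table :test #'equal)))
    (dolist (word words index)
      ;; The stemmer is documented to work on lower-case strings only,
      ;; so downcase before stemming.
      (let ((key (funcall stemmer (string-downcase word))))
        (pushnew word (gethash key index) :test #'string=)))))

(defun lookup-by-stem (query index stemmer)
  "Return the list of indexed words that share a stem with QUERY, or NIL."
  (values (gethash (funcall stemmer (string-downcase query)) index)))
```

Assuming stem.cl is loaded, (build-stem-index '("compile" "compiler" "compiling" "compilation") #'stem) would file all four words under the key "compil", so a lookup for any one of them returns the whole group. The same mechanism is what produces false positives when unrelated words happen to share a stem.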
The Lisp implementation passes Porter's test suite and was tested by creating a useful word root index of the entire Allegro CL documentation set.
== examples ==
cl-user: (mapcar #'stem '("compile" "compiler" "compilers" "compiled" "compiling" "compilings" "compilation" "compilations"))
("compil" "compil" "compil" "compil" "compil" "compil" "compil" "compil")
cl-user: (stem "furious")
"furiou"
cl-user: (stem "furies")
"furi"
== instructions ==
Simply compile and load the single file "stem.cl". If you want it in some package other than the cl-user package, change the in-package form.

There is only a single function intended to be called externally: (stem string) returns another string that is the stem of the single argument. Exception: if the argument string is two characters long or shorter, the argument string itself is returned immediately, not a copy.

This stemmer works only on lower-case strings, and only on English words. There are stemmers for other languages elsewhere. Porter has since done newer work on a multi-language-capable grammar-driven stemmer. It would be great if someone wanted to port this newer work. See href(http://snowball.tartarus.org,http://snowball.tartarus.org).
== tutorial ==
Too simple for a tutorial -- see the Instructions section.
== home-url ==
http://www.tartarus.org/~martin/PorterStemmer
== doc-url ==
http://www.tartarus.org/~martin/PorterStemmer
== license ==
The authors Porter and Haflich (of Franz Inc.) treat this as open source code and no license is necessary.
== book ==
None.
== references ==
Martin Porter maintains a web page for the href(Porter Stemming Algorithm,http://www.tartarus.org/~martin/PorterStemmer/) with sources, documentation, and links to related resources.
== source-fooball ==
Download href(stem.cl,/source/stem.cl) from our site or from href(Martin Porter's site,http://www.tartarus.org/~martin/PorterStemmer). Further descriptions of the algorithm and test suites are available there.
== release-date ==
21 March, 2002.
== release-version ==
1.01
== status ==
Stable.
== history ==
See the "Version history" in the source.
== acl-dependencies ==
The stemmer consists entirely of very basic string-manipulation code and should work reliably in any Common Lisp. There is essentially no place for system dependencies to rear up.
== other-dependencies ==
None.
== platform ==
Should work in any version of Allegro CL, or any reasonably-conformant ANSI Common Lisp.
== ad ==