Offline Tool for finding identical/similar phrases in a text?
I'm searching for a tool which finds and lists identical phrases in a long text like a dissertation.
The goal is to find repeating texts which have been created accidentally by copy/paste.
It necessarily has to be an offline tool, as I don't want to use an online tool, where my text is processed and possibly stored on a server under someone else's control.
3 Comments
It depends on how much you want to get into it and how big the project is. If you only need to find identical phrases or text, most word processors support this, and many programming IDEs support regular-expression searching (http://en.wikipedia.org/wiki/Regular_expression).
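For exact repeats of the copy/paste kind, a small script can do the listing offline. The following is only a minimal sketch in Python (standard library only), not a feature of any word processor or IDE: it counts every run of n consecutive words and reports the runs that occur more than once. The function name repeated_phrases and the default n=5 are illustrative assumptions.

```python
import re
from collections import defaultdict


def repeated_phrases(text, n=5):
    """Return {phrase: [word positions]} for each n-word phrase seen more than once."""
    # Lowercase and keep only word characters so that case and punctuation
    # differences do not hide an otherwise identical phrase.
    words = re.findall(r"\w+", text.lower())
    positions = defaultdict(list)
    for i in range(len(words) - n + 1):
        positions[" ".join(words[i:i + n])].append(i)
    return {phrase: idx for phrase, idx in positions.items() if len(idx) > 1}


sample = (
    "The results are shown in the table below. "
    "As discussed earlier, the results are shown in the table below again."
)
for phrase, idx in repeated_phrases(sample, n=5).items():
    print(idx, phrase)
```

A longer repeated passage shows up as several overlapping n-word hits, so a larger n gives fewer, more meaningful results on a long document such as a dissertation.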
On the other hand, if you want to find reworded or similar paragraphs and excerpts and automate the process (rather than extracting substrings and searching for variations manually), then I would recommend you look at www.nltk.org/
NLTK is a Python toolkit that integrates a whole range of language manipulation, sorting and tagging tools. I realize it may be more in-depth than you want to go, but it is quite easy to learn and has very good, beginner-friendly documentation (www.nltk.org/book/).
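To give a rough idea of what automated similarity matching looks like, here is a minimal sketch that does not use NLTK at all. It substitutes difflib from the Python standard library, which compares character sequences and so only catches lightly edited copies, not genuine rewording. The function name similar_paragraphs, the 0.8 threshold and the assumption that paragraphs are separated by blank lines are all illustrative choices.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_paragraphs(text, threshold=0.8):
    """Return (i, j, ratio) for paragraph pairs whose similarity meets the threshold."""
    # Assumption: paragraphs are separated by blank lines.
    # Whitespace inside each paragraph is collapsed before comparing.
    paragraphs = [" ".join(p.split()) for p in text.split("\n\n") if p.strip()]
    hits = []
    for (i, a), (j, b) in combinations(enumerate(paragraphs), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower(), autojunk=False).ratio()
        if ratio >= threshold:
            hits.append((i, j, round(ratio, 2)))
    return hits


document = (
    "The sample was heated to 300 K and held for one hour.\n\n"
    "An unrelated paragraph about the literature review.\n\n"
    "The sample was heated to 350 K and held for one hour."
)
for i, j, ratio in similar_paragraphs(document):
    print(f"paragraphs {i} and {j} look alike (ratio {ratio})")
```

Every paragraph is compared with every other one, so the running time grows quadratically with the number of paragraphs; for a very long text it may be worth comparing only paragraphs that already share an exact phrase.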
I'll let others comment on more generic and perhaps more 'use out of the box' type tools, as I have no experience with those.
As @user3467349 states, regular expressions (regexes) are your (complicated) friend. There's quite a learning curve involved, but it's worth it if you have to do any significant amount of text searching or modification. Many tools support their use (and some use slightly different dialects for their syntax).
If you have access to a Linux system, where it will almost always be installed by default (I expect it is also installed on OS X, but I don't use that OS), or to a package of tools for your OS (such as Cygwin for Windows), then the place to start is the grep (global regular expression print) command. It can find almost anything once you know the basics of regexes. There is also a more obscure command, agrep, which specializes in finding things "like" other things, but I haven't used it yet.
Another alternative: if you have, or can install, the programming language Perl on your system (it should be available for almost all OSs), it supports its own dialect of regular expressions, which I believe is the most powerful version. It's easy to access Perl's regular-expression features without knowing a lot about the rest of the language.
If you do have access to a Linux system, then the command
info sed
will bring up instructions for using the sed tool (which you don't need at the moment). But, if you scroll way down in this help file there's a fairly detailed section explaining how to use regular expressions. This will transfer directly to using them with grep as well.
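As a small taste of what a regex can do for this particular problem, here is a hedged sketch using Python's re module (its dialect differs in details from the grep, sed and Perl ones mentioned above). The pattern uses a backreference to find a phrase of one to four words that is immediately followed by a copy of itself, a typical paste accident. The names DOUBLED and doubled_phrases are made up for the example.

```python
import re

# \b(\w+(?:\s+\w+){0,3}) captures a phrase of one to four words;
# \s+\1\b then requires the same phrase to follow immediately.
DOUBLED = re.compile(r"\b(\w+(?:\s+\w+){0,3})\s+\1\b", re.IGNORECASE)


def doubled_phrases(text):
    """Return the phrases that are immediately followed by a copy of themselves."""
    return [m.group(1) for m in DOUBLED.finditer(text)]


print(doubled_phrases(
    "We show that that the method works, and the method the method is fast."
))
# prints ['that', 'the method']
```

This only catches repeats that sit directly next to each other with identical spacing; repeats that are pages apart need a different approach, such as counting phrases across the whole text.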
This is one of the main functions of the program, ClicheCleaner, which highlights passages in your text that are either cliches, other overly-used common expressions, or phrases of your own that you have repeatedly used within the same document. ClicheCleaner includes a list of nearly 7000 unique cliches and common expressions that are compared against your text.
At present it only works on text files; a new release now in progress will allow editing of text, Word, and PDF files within the document.
It runs on all versions of Windows.
Disclosure: I am the author of this program.