Power searching using UNIX grep

UNIX grep is a command-line tool for searching for text strings inside files. (Do not confuse it with find, which matches file names and properties.) This blog post gives some hints on how to use grep to search files quickly and efficiently.
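To make the grep versus find distinction concrete, here is a small sketch run against a scratch directory. The file names and contents are made up for illustration: find matches on the file name, while grep matches on what is inside the file.

```shell
# Scratch tree with hypothetical file names, just for illustration
demo_dir="$(mktemp -d)"
printf 'hello world\n' > "$demo_dir/foobar.txt"
printf 'this line mentions foobar\n' > "$demo_dir/notes.txt"

# find looks at file names: only foobar.txt matches
by_name="$(find "$demo_dir" -type f -name '*foobar*')"

# grep looks at file contents: only notes.txt matches
# (-r recurses into the directory, -l lists matching file names only)
by_content="$(grep -rl 'foobar' "$demo_dir")"

echo "find: $by_name"
echo "grep: $by_content"
```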

Example of searching the Plone source tree for “content-core” examples in page template files

Some notes about grep

  • Grep can search multiple files and directory trees
  • Grep can be tuned to be faster
  • Grep output can be friendly and colorized

As with many UNIX tools, for legacy and backwards-compatibility reasons grep doesn’t do these things out of the box and simply gives you a plain, bare-bones interface.

1. Install GNU grep

GNU grep supports more options, such as better coloring, than the BSD grep shipped with BSD-based operating systems like OSX. You can install GNU grep from the grep package of Macports. See the ztanesh README for an example sudo port install command.

2. Searching multiple files

Below is an example of a case-insensitive (-i), recursive (-R) search of a folder that only includes (--include) .py files. That is, it searches all Python files in the source tree for the word “foobar”:

grep -Ri --include="*.py" foobar ~/code/mixnap/krusovice-src

3. Using colors

You can colorize parts of the grep output, such as the filename, line number, highlighted match and the lines around the result.

Below is my example of setting the GREP_COLORS environment variable:

GREP_OPTIONS="--color=always"
GREP_COLORS="ms=01;37:mc=01;37:sl=:cx=01;30:fn=35:ln=32:bn=32:se=36"

Note: Use GREP_COLORS, not the deprecated GREP_COLOR environment variable, as the former provides many more options.
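A rough sketch of how these settings might be applied, assuming GNU grep. For the variables to reach grep from a shell startup file they need to be exported; alternatively GREP_COLORS can be set for a single command, as below. Newer GNU grep releases treat GREP_OPTIONS as obsolete, so this sketch passes --color on the command line instead. The sample text and the shortened color string are made up for illustration.

```shell
# Assumes GNU grep; the input lines are hypothetical sample text
colored="$(printf 'one\nfoobar\nthree\n' |
    GREP_COLORS='ms=01;37:ln=32:se=36' grep --color=always --line-number 'foobar')"
plain="$(printf 'one\nfoobar\nthree\n' | grep --color=never --line-number 'foobar')"

printf '%s\n' "$colored"   # same line, wrapped in ANSI color escape sequences
printf '%s\n' "$plain"     # prints 2:foobar
```

With --color=always the escape sequences are emitted even when the output goes to a pipe or a file, which is handy with a pager but can confuse other tools reading the output.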

4. Search as ASCII

By default, grep decodes incoming text files using the encoding set in your environment variables, which takes CPU cycles. If you are searching for a plain ASCII match, as with programming language source code files, you can gain a lot of speed by disabling the decoding. Override the LC_CTYPE environment variable when running grep:

LC_CTYPE=POSIX grep....

This slowdown is a GNU grep bug and is fixed in 2.7.
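If you want to check whether the override changes anything on your system, a minimal sketch like the one below might help. It builds a throwaway ASCII file (the contents are hypothetical) and runs the same case-insensitive search with the environment's locale and with the POSIX locale. The results should be identical; only the running time may differ.

```shell
# Build a throwaway ASCII file to search (made-up contents)
sample="$(mktemp)"
i=0
while [ "$i" -lt 5000 ]; do
    printf 'line %d some source code text\n' "$i"
    i=$((i + 1))
done > "$sample"

# Same case-insensitive count, first with the current locale, then with POSIX.
# Prefix either grep command with `time` to compare them on your own system.
default_count="$(grep -ci 'TEXT' "$sample")"
posix_count="$(LC_CTYPE=POSIX grep -ci 'TEXT' "$sample")"

echo "$default_count $posix_count"
```

Note that if LC_ALL is set in your environment, it takes precedence over LC_CTYPE, so in that case you would need to set LC_ALL=POSIX instead.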

5. Show lines around the match

You can specify the --before-context and --after-context options, which show the text around the matching line. --line-number is also a very useful switch when dealing with source code files.
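Here is a small sketch of what the context options produce, using a made-up Python file. Matching lines are marked with a colon after the line number and context lines with a dash.

```shell
# Hypothetical source file, just for illustration
src="$(mktemp)"
printf 'import os\nimport sys\ndef foobar():\n    return os.getcwd()\nprint("done")\n' > "$src"

# One line of context before and after each match, with line numbers
context="$(grep --line-number --before-context=1 --after-context=1 'foobar' "$src")"
printf '%s\n' "$context"
# 2-import sys
# 3:def foobar():
# 4-    return os.getcwd()
```

The short forms -n, -B and -A do the same thing if you prefer less typing.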

6. ZSH shell search alias

This wraps it all together. We define a ZSH function, search, which gives us a shortcut for searching multiple files in a folder tree:

# Search for an ASCII string in multiple files in the current working directory
# E.g.
# search "foobar" "*.html"
# search "foobar" "*.html" myfolder
# By default we exclude dotted files and directories (.git, .svn)
function search() {

        if [[ ! -n "$1" ]] ; then
                echo "Usage: search \"pattern\" \"*.filemask\" \"path\""
                return
        fi

        # Did we get path arg
        if [[ ! -n "$3" ]] ;
        then
                search_path="."
        else
                search_path="$3"
        fi

        # LC_CTYPE=POSIX makes ASCII search about 20x faster
        # https://twitter.com/jlaurila/status/86750682094374912

        # We use specially tuned GREP colors - make sure you have GNU grep on OSX
        # https://github.com/miohtama/ztanesh/blob/master/README.rst

        GREP_COLORS="ms=01;37:mc=01;37:sl=:cx=01;30:fn=35:ln=32:bn=32:se=36" LC_CTYPE=POSIX \
        grep -Ri "$1" --line-number --before-context=3 --after-context=3 --color=always --include="$2" --exclude=".*" "$search_path"/*
}

This, and other ZSH goodies, are available in the ztanesh package on GitHub.

7. Turn off OS native file indexing

If you use grep as your primary search tool, I suggest you turn off your operating system’s search indexing, such as OSX Spotlight. It just takes up space and CPU cycles.

18 thoughts on “Power searching using UNIX grep”

  1. For what it’s worth, my default tool for doing this is ‘grin’ (see PyPI) and at least on my system, the speed comparison between it and grep (even when seemingly tuned) is unbelievable. I think my tests were somewhere around a few seconds for grin to complete and grep was still going over a minute later (on an SSD too). Unsure why this is the case, but it works.

    There’s also ack, which produces similar performance to grin.

  2. Mikko, I suggest that to a user, it doesn’t matter if it’s written in Perl or not. Unless you’re going to use the --output option, where you can use Perl’s regex variables to output portions of your match, you won’t even notice.

    Another neat little trick about ack: It knows about file types other than just by the file extension. Say you have a Python script that doesn’t end in “.py”. grep --include="*.py" won’t know to search it. So long as it starts with a proper shebang line of “#!/usr/bin/python” or similar, ack’s --python option will still know to search it.

    @David: I suspect that the speedup could be from grin and ack not searching version control files and other non-source code files.

  3. Andy: Why so serious? 🙂 Perl is not an obstacle – just joking around. I’ll try ack at some point, though grep caters to my needs pretty perfectly currently.

  4. Hi Mikko,

    Mac OS Lion seems to come with GNU grep:
    $ /usr/bin/grep --version
    grep (GNU grep) 2.5.1

    Another option to install newer versions of GNU grep is rudix, from a great Brazilian pythonist Rudá:
    http://rudix.org — pre-built packages
    sudo rudix install grep

    The version 2.5.1 shipped with OSX Lion doesn’t support --color="always".

    I also recommend installing coreutils for better color output in your terminal on OSX.
    sudo rudix install coreutils

  5. Tried Ack for some time but went back to grep. GREP_OPTIONS with the relevant exclude-dir(s) just feels better to me.

    I’d be interested in a real benchmark of grep/grin/ack though…

  6. Directories are directories only when looked at from the perspective of system programming (i.e. how C-level data structures behave).

    Folders are a much better real-life analogue, introduced to us by Windows 95. There is no reason to stick with bad analogues just because computer usage used to be closer to computer programming in the early days.

  7. Pingback: Links 4/6/2012: Linux Advocacy and More | Techrights

  8. Confucius say’s always label reality correctly.

    A folder in reality is a directory…

  9. I’ve been calling them “File Boxes” for two decades. Is that why nobody at work will have lunch with me?

  10. Admins and power users in Linux/Unix tend to call them directories because the commands we use have flags that refer to them as directories and in fact the whole OS refers to them as directories as well.

    Example:

    find -type d (d for directory, f would be for file)
    dirname (short for directory name)
    cd (short for change directory)

    Even a simple “ls” command will return the following error if it does not find what it’s looking for:

    $ ls foo
    ls: cannot access foo: No such file or directory

    So, while I can see people wanting to follow the Windows paradigm for calling directories “folders”, it is not only unpopular in Unix/Linux… it is incorrect if you are trying to stay in line with the platform you are in, and simply confusing if you don’t.

    Both make sense really. Directories are accurate for files in a file system because a directory symbolizes an indexable list that helps to locate a resource or file. This is a concept similar to an office complex or company’s directory that allows you to “look up” a suite or phone extension by a company or person’s name, or a telephone directory that helps you look up a phone number by a person’s name. All files are stored on disk and indexed by inode number in Linux. The directories are a way to separate and list files by name so that we can get to the inode number that describes to the OS where it is on disk. We never have to worry about remembering the inode number this way or even that it HAS an inode. Much like we never have to remember a suite number, we just look it up and there it is!

    However that said, a folder is also a good concept that correlates nicely too. In an office you would have folders that contain files and they are organized in a manner that allows you to find these files easier. Thousands of files can be stored in thousands of folders and yet if organized and labeled well we will get visual clues to draw our attention to the correct folder and eventually to the correct file. We don’t have to remember the exact folder as long as we can see the label (folder name) we can work out usually if the file is in it.

    They are simply two separate ways to address the same issue. How do we find our files more easily?

    Directories and Folders both have easy visual clues (folder names) and they both are designed to aid us in getting to our files faster.

    The only thing that makes it right or wrong to call them one or the other is being consistent with the OS you are using. IMHO “right” or “wrong” is even pretty harsh. More like what makes better sense and is less confusing.

    Hope this helps some of you that are confused by the whole folder/directory confusion. 🙂

  11. Pingback: Recursive Greps | Irreal

  12. Just my 2 cents, but I want to point out that FILES reside inside FOLDERS whereas a DIRECTORY tells where to find them. The sad fact was that Windows 95 had folders that really physically contained the file record, whereas Unix has always had directories that allow us to have nice things such as hard links 😉

  13. Pingback: Tracing and fixing Python buildout version conflicts and dependencies
