3 must-know Linux commands for text manipulation

You may be familiar with grep, sed, and awk, but you may not know everything they can do.

Sysadmins use an untold number of command-line tools, and you probably regularly use the three discussed in this article: grep, sed, and awk. But do you know all the ways you can use them to manipulate text? If not (or you're not sure), continue reading.

Before I get started, here are the origins of the commands' names:

  1. grep: According to Wikipedia, the name "comes from the ed command g/re/p (globally search for a regular expression and print matching lines), which has the same effect." ed is a "line-oriented text editor." Even for someone who likes the command line, editing files line by line seems too old-fashioned, but people had to start somewhere in ancient times.
  2. sed: The name comes from its main use, as a stream editor.
  3. awk: Its name comes from its authors' initials (Aho, Weinberger, and Kernighan). If the name Kernighan rings any bells (pun intended) for you, it is because this Canadian computer scientist contributed to the creation of Unix and co-authored the first book about the C language.

It's fun to trace the commands' family tree, but what really matters is how helpful they are for text manipulation.

In the following examples, I will use a file named quotes.txt to illustrate how to use the commands. Here are the contents of this file:

$ cat quotes.txt

"God does not play dice with the universe."
- Albert Einstein, The Born-Einstein Letters 1916-55

"Not only does God play dice but... he sometimes throws them where they cannot be seen."
- Stephen Hawking

"I regard consciousness as fundamental..."
- Max Planck

"The cosmos is within us. We are made of star-stuff. We are a way for the universe to know itself."
- Carl Sagan

"[T]he atoms or elementary particles themselves are not real; they form a world of potentialities or possibilities rather than one of things or facts."
- Werner Heisenberg

grep

The simplest way to use grep is:

$ grep universe quotes.txt

"God does not play dice with the universe."
"The cosmos is within us. We are made of star-stuff. We are a way for the universe to know itself."

This example provides the string to search for (universe) and the place to look for it (quotes.txt).

If there are spaces in the string you want to search for, you must put quotes around it:

$ grep "the universe" quotes.txt

"God does not play dice with the universe."
"The cosmos is within us. We are made of star-stuff. We are a way for the universe to know itself."

Some common variations when using grep are:

  • Ignore case: grep -i string-to-search filename
  • Search in multiple files: grep -i string-to-search *.txt
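To make these two variations concrete, here is a minimal sketch. It builds two throwaway files in a temporary directory (the file names and contents are made up for illustration, not taken from quotes.txt) and searches both at once.

```shell
# Throwaway files with hypothetical names, so the sketch is self-contained.
workdir=$(mktemp -d)
printf '%s\n' 'The Universe is vast.' 'Nothing to see here.' > "$workdir/a.txt"
printf '%s\n' 'A way for the universe to know itself.' > "$workdir/b.txt"

# -i ignores case, so "Universe" and "universe" both match.
# When grep is given more than one file, it prefixes each match with the file name.
matches=$(cd "$workdir" && grep -i universe *.txt)
echo "$matches"
# a.txt:The Universe is vast.
# b.txt:A way for the universe to know itself.
```

The file-name prefix is handy when you only need to know where a string lives before opening the file.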

You can search for a regular expression:

$ grep "191[0-9]" quotes.txt

- Albert Einstein, The Born-Einstein Letters 1916-55

If you want to enable extended regexp patterns to use symbols like +, ?, or |, add the -E flag to grep. The older egrep command is a shortcut for grep -E, but recent GNU grep releases treat it as obsolescent and print a warning, so grep -E is the safer choice. This also enables you to search for multiple strings:

$ grep -E -i "albe|hawk" quotes.txt

- Albert Einstein, The Born-Einstein Letters 1916-55
- Stephen Hawking
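For the other extended symbols mentioned above, a small sketch may help. The word list is invented for this example; only grep -E itself comes from the text.

```shell
# Invented sample words, one per line.
sample=$(printf '%s\n' 'color' 'colour' 'colouur' 'collar')

# ? makes the preceding character optional: matches color and colour.
optional=$(echo "$sample" | grep -E '^colou?r$')

# + means "one or more" of the preceding character: matches colour and colouur.
repeated=$(echo "$sample" | grep -E '^colou+r$')

# | separates alternatives: matches color and collar.
either=$(echo "$sample" | grep -E '^(color|collar)$')

printf '%s\n' "$optional" "--" "$repeated" "--" "$either"
```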

To show lines that include the word "universe" plus the next line (in order to include the author's name):

$ grep -i universe -A 1 quotes.txt

"God does not play dice with the universe."
- Albert Einstein, The Born-Einstein Letters 1916-55
--
"The cosmos is within us. We are made of star-stuff. We are a way for the universe to know itself."
- Carl Sagan

As you can probably guess, you could display more lines by passing a different number. Or you could show the lines before by using the flag -B.
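Here is a short sketch of those context flags on made-up input. It also uses -C, which is not covered above; GNU grep uses it to show lines on both sides of a match.

```shell
# Invented five-line input with a single matching line in the middle.
sample=$(printf '%s\n' 'line one' 'line two' 'TARGET' 'line four' 'line five')

# -B 1: the match plus one line before it.
before=$(echo "$sample" | grep -B 1 TARGET)

# -A 2: the match plus two lines after it.
after=$(echo "$sample" | grep -A 2 TARGET)

# -C 1: one line of context on each side of the match.
around=$(echo "$sample" | grep -C 1 TARGET)

echo "$around"
# line two
# TARGET
# line four
```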

So far, I've shown grep running alone, but it is very common to have it in a chain of commands:

$ echo "Authors who mentioned 'universe'"; cat quotes.txt | grep -i universe -A 1 | grep "^-"

Authors who mentioned 'universe'
- Albert Einstein, The Born-Einstein Letters 1916-55
- Carl Sagan

sed

My favorite use for sed is to replace strings in files. For example:

$ cat quotes.txt | sed 's/universe/Universe/g'

This will replace universe with Universe and send the result to stdout. The g flag means "replace all occurrences of the string in each line."

Some variations for this are:

  • Replace the string only if it's found in the first three lines:
    sed '1,3 s/universe/Universe/g' quotes.txt
  • Replace the n-th occurrence of a pattern in a line (for example, the second occurrence):
    sed 's/universe/Universe/2' quotes.txt
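The difference between these two variations is easier to see on input where the word repeats. This sketch uses four invented lines rather than quotes.txt.

```shell
# Invented input: the word appears three times on line 1 and once on line 4.
sample=$(printf '%s\n' 'universe universe universe' 'second line' 'third line' 'universe again')

# The 1,3 address limits the substitution to lines 1 through 3,
# so line 4 is left untouched.
ranged=$(echo "$sample" | sed '1,3 s/universe/Universe/g')

# The trailing 2 replaces only the second occurrence on each line;
# line 4 has a single occurrence, so it does not change.
second=$(echo "$sample" | sed 's/universe/Universe/2')

echo "$ranged" | sed -n '1p'   # Universe Universe Universe
echo "$second" | sed -n '1p'   # universe Universe universe
```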

These examples don't change the original file. If you want sed to change the file in place, use -i:

$ sed -i 's/universe/Universe/g' quotes.txt

If you use the -i flag, make sure that you know exactly what and how many occurrences will be affected, as it will modify the original file. To find out, you can run a grep and search for the pattern first.
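One possible way to do that check is sketched below. It works on a scratch copy with an invented name, counts the matches first, and then passes a suffix to -i so that sed keeps a backup of the original. The -c and -o flags for grep and the .bak suffix are my additions here, not something the examples above rely on.

```shell
# Scratch file with a hypothetical name, so nothing real is modified.
workdir=$(mktemp -d)
file="$workdir/quotes-copy.txt"
printf '%s\n' 'the universe' 'no match' 'a universe, the universe' > "$file"

# How many lines would change? (-c counts matching lines)
affected=$(grep -c universe "$file")

# How many individual occurrences? (-o prints each match on its own line)
occurrences=$(grep -o universe "$file" | wc -l)

# -i.bak edits the file in place and saves the original as quotes-copy.txt.bak.
sed -i.bak 's/universe/Universe/g' "$file"

# grep -c prints 0 and returns a non-zero status when nothing matches.
remaining=$(grep -c universe "$file" || true)

echo "lines: $affected, occurrences: $occurrences, left after edit: $remaining"
```

If the result is not what you expected, the .bak file lets you restore the original.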

awk

The awk utility is very powerful, offering many options for processing text files.

Most of the situations where I use awk involve processing files with a structure (columns) that is reasonably predictable, including the character used as a column separator.

When awk processes a file, it splits each line using the "field separator" (internal variable FS). By default, FS is a single space, which awk treats specially: fields are separated by runs of spaces and tabs, and leading and trailing whitespace is ignored. Each field is assigned to a positional variable: $1 contains the first field, $2 contains the second, and so forth, while $0 represents the full line.
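A quick sketch of the positional variables, using an invented input line. Note the run of several spaces before the last word: with the default separator, awk treats a run of blanks as a single separator.

```shell
# Invented input line; there are four spaces between "Sagan" and "astronomer".
line='Carl Sagan    astronomer'

first=$(echo "$line" | awk '{ print $1 }')   # Carl
third=$(echo "$line" | awk '{ print $3 }')   # astronomer
whole=$(echo "$line" | awk '{ print $0 }')   # the full line, spacing preserved

echo "$first / $third"
```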

You can also apply filters to each line. For example:

$ cat quotes.txt | awk '/universe/ { print NR " - " $0 }'

1 - "God does not play dice with the universe."
10 - "The cosmos is within us. We are made of star-stuff. We are a way for the universe to know itself."

The program passed to awk is enclosed in single quotes, which stops the shell from expanding variables such as $0 itself (it is like passing a mini-program to be interpreted):

  • The /universe/ part tells awk to select only the lines that match this pattern.
  • The "main" program goes between the curly brackets.
  • NR is the internal variable that contains the number of the current record, that is, the current line number.
  • I added the " - " string for aesthetics.

Some of the internal variables in awk are:

  • NR: The total number of input records seen so far by the command
  • NF: The number of fields in the current input record
  • FS: The input field separator (a space by default)
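To see NR and NF change from line to line, here is a minimal sketch on three invented lines. It also uses $NF, which is a common idiom for "the last field" (the value of NF used as a field number).

```shell
# Invented input with 3, 2, and 1 fields per line.
sample=$(printf '%s\n' 'a b c' 'd e' 'f')

# NR is the current record number, NF the number of fields in that record.
report=$(echo "$sample" | awk '{ print NR ": " NF " fields, last is " $NF }')
echo "$report"
# 1: 3 fields, last is c
# 2: 2 fields, last is e
# 3: 1 fields, last is f

# In an END block, NR holds the total number of records read.
total=$(echo "$sample" | awk 'END { print NR }')
echo "$total"   # 3
```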

Here is an example using a more "predictable" file format:

$ cat /etc/passwd | awk 'BEGIN { FS=":" } /nologin/ { print $1 }'

(output omitted)
...
redis
akmods
cjdns
haproxy
systemd-oom

In this last example:

  • BEGIN { FS=":" } sets the field separator to : instead of the default (space). The BEGIN block runs before the first line is read, so every line is split on the new separator.
  • /nologin/ selects only the lines that contain this pattern.
  • print $1 prints the first field in each line (considering that the separator is :).
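One detail worth checking here is when the separator is assigned. awk splits a record as it reads it, so a separator set inside a per-line action applies from the next record onward. The sketch below compares three ways of setting it on two made-up passwd-style lines (these are not real /etc/passwd entries).

```shell
# Two made-up, simplified passwd-style lines.
sample=$(printf '%s\n' 'redis:x:991:nologin' 'haproxy:x:992:nologin')

# Separator set in a BEGIN block, before any input is read.
with_begin=$(echo "$sample" | awk 'BEGIN { FS=":" } /nologin/ { print $1 }')

# -F: is the command-line shorthand for the same thing.
with_flag=$(echo "$sample" | awk -F: '/nologin/ { print $1 }')

# Separator set inside the action: in POSIX-conforming awks such as gawk,
# the first record has already been split on whitespace, so $1 is still
# the whole first line.
late_first=$(echo "$sample" | awk '/nologin/ { FS=":"; print $1 }' | head -n 1)

echo "$with_begin"
# redis
# haproxy
echo "$late_first"
```

For one-liners, -F: tends to be the shortest option; BEGIN is convenient when the program already needs other setup.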

Learn more

Those were some simple examples for using grep, sed, and awk.

If you read the man pages for each, you will notice plenty of additional parameters and uses for these handy commands.

For simple use cases and things you do only once in a while, it is always good to have tools like these in your toolbox.

If the required action is more complex, it is worth considering if these tools still make sense for you to use. For a corporate use case or managing "everything-as-code," I recommend using Ansible. Ansible modules have similar features that let you emulate the operations described above, with the advantage that Ansible modules usually have idempotency and that the full process will be documented somewhere (such as in your internal Git repo).


Roberto Nozaki

Roberto Nozaki (RHCSA/RHCE/RHCA) is an Automation Principal Consultant at Red Hat Canada where he specializes in IT automation with Ansible.
