Wednesday, June 3, 2015

How to match newlines in sed

Original: http://backreference.org/2009/12/23/how-to-match-newlines-in-sed/

Lots of sed newcomers ask why the following snippets of code, or some variation thereof, don't work (they actually work as expected; it's just that the results are not what they think should be):
# All these do NOT produce the expected result!
sed 's/\n//g'             # remove all newline characters
sed 's/PATTERN\n//'       # if the line ends in PATTERN, join it with the next line
sed 's/FOO\nBAR/FOOBAR/'  # if a line ends in FOO and the next starts with BAR, join them
To understand why those "don't work", it's necessary to look at how sed reads its input.
Basically, sed reads only one line at a time and, unless you perform special actions, there is always a single input line in the pattern space at any time. That line does NOT have a trailing newline character, because sed removes it. When the line is printed at the end of the cycle, sed adds back a newline character, but while the line is in the pattern space, there's simply no \n in it. Now it's easy to see why none of the above programs will do what you think: the lhs (left hand side) will never match what's in the pattern space, so no replacement will be performed. However, some commands do make sed put a newline into the pattern space.
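A quick way to check this is the l command, which prints the pattern space unambiguously and marks its end with $. The following is only a small sketch; the two-line sample input and the variable names are made up for illustration.

```shell
# Sample input (made up for this sketch): exactly two lines, "one" and "two".
# 'l' shows the pattern space with a $ at its end; no \n ever shows up in it.
shown=$(printf 'one\ntwo\n' | sed -n 'l')
printf '%s\n' "$shown"
# one$
# two$

# Consequently s/\n//g has nothing to match and the input passes through unchanged.
unchanged=$(printf 'one\ntwo\n' | sed 's/\n//g')
printf '%s\n' "$unchanged"
# one
# two
```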
So the next question is: how to do the things that the above programs wrongly attempted to do?
Three not-so-well-known commands that are useful for these applications are N, P and D.
  • N reads in another line of input and appends it to the current pattern space, separated by a newline;
  • P prints the contents of the pattern space, up to the first newline (or to the end if there is no newline);
  • D deletes the contents of the pattern space, up to the first newline (or to the end if there is no newline), and starts a new cycle. The latter means that any commands that come after the D in the sed program will not be executed if D itself is executed.
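To see each of the three commands in isolation, here is a minimal sketch on made-up sample input (the variable names are illustrative only). It uses l again to show what N leaves in the pattern space.

```shell
# N: the second line is appended after an embedded newline, which 'l' shows as \n.
joined=$(printf 'one\ntwo\n' | sed -n 'N;l')
printf '%s\n' "$joined"
# one\ntwo$

# P: prints only the part of the pattern space before the first newline.
first=$(printf 'one\ntwo\n' | sed -n 'N;P')
printf '%s\n' "$first"
# one

# N, P and D together: a two-line sliding window. P prints the older line,
# D drops it and restarts the cycle with the newer line still in the pattern space.
window=$(printf 'one\ntwo\nthree\n' | sed '$!N;P;D')
printf '%s\n' "$window"
# one
# two
# three
```

In the last command each line is printed exactly once even though autoprint is on, because D starts the next cycle without printing the pattern space.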
So let's put these commands to good use:
sed ':begin;$!N;s/\n//;tbegin'                   # deletes all newlines except the last; see also tr -d '\n'
sed ':begin;$!N;s/\n/ /;tbegin'                  # same as before, but replaces newlines with spaces; see also tr '\n' ' '
sed ':begin;$!N;s/\(PATTERN\)\n/\1/;tbegin;P;D'  # if the line ends in PATTERN, join it with the next line
sed ':begin;$!N;/PATTERN\n/s/\n//;tbegin;P;D'    # same as above
sed ':begin;$!N;s/FOO\nBAR/FOOBAR/;tbegin;P;D'   # if a line ends in FOO and the next starts with BAR, join them
The programs that join lines, above, keep joining lines as long as the conditions for joining with the next line are true. Note that the mentioned solutions based on tr are not exactly equivalent, in that they will remove or replace the very last newline of the input too, meaning that the output won't be terminated by \n.
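The difference in the final newline is easy to confirm by counting output bytes. This is just a sketch on made-up input; it assumes GNU sed, since the programs above put a semicolon right after a label, which other sed implementations may not accept.

```shell
# Sample input (made up): "a\nb\n", i.e. 4 bytes including two newlines.
# sed joins the lines but still terminates its output with a newline: "ab\n".
sed_bytes=$(printf 'a\nb\n' | sed ':begin;$!N;s/\n//;tbegin' | wc -c)

# tr deletes every newline, including the last one: "ab".
tr_bytes=$(printf 'a\nb\n' | tr -d '\n' | wc -c)

# $(( )) strips any padding that some wc implementations add around the count.
echo "sed: $((sed_bytes)) bytes, tr: $((tr_bytes)) bytes"
# sed: 3 bytes, tr: 2 bytes
```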
For more information, see the sed FAQ, especially this section, and the sed oneliners.
