Sunday, February 1, 2009

Extract a word from a file between two words

Extracting a set of words from a file is a common requirement.
A simple use case is extracting the set of jar entries from a build.xml.
If the file is properly formatted in columns, like /var/log/messages, we can easily extract the word with the awk command; otherwise we have to make careful use of the grep or egrep commands.
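
For the column-formatted case, here is a minimal sketch. It assumes the traditional syslog layout (month, day, time, host, program, message); the sample lines and the temporary file are made up for illustration and are not taken from a real /var/log/messages.

```shell
# Hypothetical sample in the traditional syslog layout:
# month day time host program: message
log=$(mktemp)
printf '%s\n' \
  'Feb  1 10:15:01 localhost sshd[2211]: Accepted password for demo' \
  'Feb  1 10:16:40 localhost crond[2290]: (root) CMD (run-parts /etc/cron.hourly)' > "$log"

# awk splits each line on runs of whitespace by default,
# so the program name is the fifth field
programs=$(awk '{ print $5 }' "$log")
echo "$programs"
rm -f "$log"
```

Because awk treats consecutive spaces as one separator, the padded day ("Feb  1") does not shift the field numbers.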

Here is a simple case where we want to extract a set of words from an XML file.
We extract all the servlet names from a web.xml file:

grep '<servlet-name>' web.xml | sed 's/<servlet-name>/~/' | sed 's/<\/servlet-name>/~/' | cut -d "~" -f2 | sort

The logic used is as follows

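1.Find the lines containing the word "<servlet-name>" in the file web.xml (grep '<servlet-name>' web.xml)
[bsurnida@localhost ~]$ grep '<servlet-name>' web.xml
<servlet-name>welcome</servlet-name>
<servlet-name>ServletErrorPage</servlet-name>
<servlet-name>IncludedServlet</servlet-name>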
IncludedServlet

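2.Replace the word "<servlet-name>" with a special symbol like ~ using sed (sed 's/<servlet-name>/~/').
[bsurnida@localhost ~]$ grep '<servlet-name>' web.xml | sed 's/<servlet-name>/~/'
~welcome</servlet-name>
~ServletErrorPage</servlet-name>
~IncludedServlet</servlet-name>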

3.Replace the word "</servlet-name>" with the same special symbol, again using sed (sed 's/<\/servlet-name>/~/').
[bsurnida@localhost ~]$ grep '<servlet-name>' web.xml | sed 's/<servlet-name>/~/' | sed 's/<\/servlet-name>/~/'
~welcome~
~ServletErrorPage~
~IncludedServlet~

4.Cut out the word that now sits between the special symbols using cut (cut -d "~" -f2).
[bsurnida@localhost ~]$ grep '<servlet-name>' web.xml | sed 's/<servlet-name>/~/' | sed 's/<\/servlet-name>/~/' | cut -d "~" -f2
welcome
ServletErrorPage
IncludedServlet

5.You can now order the result with sort, and use sort -u to keep only unique values.
[bsurnida@localhost ~]$ grep '<servlet-name>' web.xml | sed 's/<servlet-name>/~/' | sed 's/<\/servlet-name>/~/' | cut -d "~" -f2 | sort
ForwardedServlet
ForwardedServlet
IncludedServlet
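
The steps above can be tried end to end with the sketch below. The sample web.xml content is hypothetical, written only so the commands have something to run against. The second variant, a single sed call with a capture group, is an alternative I am suggesting and not part of the original pipeline.

```shell
# Hypothetical sample file; real web.xml files usually indent these lines
sample=$(mktemp)
cat > "$sample" <<'EOF'
<web-app>
  <servlet>
    <servlet-name>welcome</servlet-name>
  </servlet>
  <servlet-mapping>
    <servlet-name>welcome</servlet-name>
  </servlet-mapping>
  <servlet>
    <servlet-name>IncludedServlet</servlet-name>
  </servlet>
</web-app>
EOF

# The pipeline from the steps above, with sort -u to drop repeated names
names_pipeline=$(grep '<servlet-name>' "$sample" \
  | sed 's/<servlet-name>/~/' \
  | sed 's/<\/servlet-name>/~/' \
  | cut -d "~" -f2 \
  | sort -u)

# Alternative: one sed call that prints only the text captured between the tags
names_sed=$(sed -n 's/.*<servlet-name>\(.*\)<\/servlet-name>.*/\1/p' "$sample" | sort -u)

echo "$names_pipeline"
echo "$names_sed"
rm -f "$sample"
```

Both variants assume one tag pair per line. The ~ approach also assumes the file contains no ~ of its own on the matching lines; if it might, pick a different delimiter for both sed and cut.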
