Tuesday, November 15, 2011

Cleaning HTML, removing unwanted tags.

sed \
   -e 's/background-color:[ a-z]*; //g'  \
   -e 's/>/>\n/g' \
   inputfile > outputfile

In the above example, I remove any background-color declarations from the inline styles. Strictly speaking these are CSS properties rather than HTML tags.

I am using \ at the end of each line to split what would normally be a single-line command into multiple lines.

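The first -e expression removes anything matching background-color:[ a-z]*; followed by a space. The value part only matches lowercase letters and spaces, so named colors such as yellow are removed, but values such as #ffcc00 are not.

The second -e expression adds a new line after every >, so each HTML tag ends its line. This is for some web content that's just all packed into a single line. Note that \n in the replacement is treated as a newline by GNU sed; other versions of sed may not do this.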
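
To see what the two expressions do, here is a small sketch run against a made-up one-line file. The file names sample.html and cleaned.html and the HTML itself are just assumptions for illustration, and it assumes GNU sed.

```shell
# Hypothetical sample input: HTML packed into one line, with inline styles.
printf '%s\n' '<p style="background-color: yellow; color: red;">Hi</p><span style="background-color: white; font-size: 12px;">there</span>' > sample.html

# Same two expressions as in the command above.
# Assumes GNU sed, where \n in the replacement means a newline.
sed \
   -e 's/background-color:[ a-z]*; //g' \
   -e 's/>/>\n/g' \
   sample.html > cleaned.html

# Show the result: background-color is gone and each tag ends its line.
cat cleaned.html
```

With this input the first line of cleaned.html should be <p style="color: red;"> and the second Hi</p>. Because the last > on the line also gets a newline, the output ends with one extra empty line.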

