Convert Microsoft Word documents with Antiword and Bash scripts

Q I have a bunch of directories with some Microsoft Office Word files on a Gentoo system, and I need to use Antiword to change them to text files. I have written a script that does it for a given directory:

for i in `ls *.doc` ; do antiword $i >${i/doc/txt}; done

There are probably some bugs in the line (like going down subdirectories) but I will iron them out. My main problem is that some of the files have a space in their name, such as 'file 1.doc'. I end up with errors like:

file 'file' does not exist, cannot convert file '1.doc'

How could I get around this problem? It would also be useful to be able to delete the DOC files once they are successfully converted.

A You need to put quotes around the variables, so that Bash treats 'file 1.doc' as a single file and not as two files ('file' and '1.doc'). They must be double quotes, not single ones: Bash treats the contents of single quotes as literal text, whereas it expands variables within double quotes. You do not need to use 'ls', as '*.doc' will match files in the current directory by itself. It is also best to add '-i 1' to prevent Antiword outputting image data into your text file. Finally, it is safer to build the output name with "${i%.doc}.txt", which strips only the trailing '.doc'; ${i/doc/txt} replaces the first 'doc' anywhere in the name, so 'document.doc' would become 'txtument.doc'. Your command then becomes:

for i in *.doc ; do antiword -i 1 "${i}" >"${i%.doc}.txt"; done
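If you want to see the effect of the quoting for yourself, the following is a minimal sketch. The count_args function is a hypothetical helper invented for this illustration; it simply reports how many arguments the shell passed to it.

```shell
#!/bin/sh
# count_args is a hypothetical helper: it prints how many arguments it was given.
count_args() { echo "$#"; }

i='file 1.doc'

unquoted=$(count_args $i)      # the shell splits the value on the space
double=$(count_args "$i")      # double quotes keep the value as one argument
single=$(printf '%s' '$i')     # single quotes suppress expansion entirely

echo "unquoted: $unquoted argument(s)"
echo "double-quoted: $double argument(s)"
echo "single-quoted text: $single"
```

The unquoted call receives two arguments, which is exactly what produced the 'file' and '1.doc' errors, while the single-quoted form never looks at the variable at all.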

To recurse through directories, use find:

find . -name '*.doc' | while IFS= read -r i; do
antiword -i 1 "${i}" >"${i%.doc}.txt"
done
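Once find is involved, each name arrives with its directory path attached, so the way the output name is derived starts to matter. ${i/doc/txt} replaces the first 'doc' it finds anywhere in the path, whereas ${i%.doc}.txt touches only the trailing suffix. Here is a small sketch of the difference; the path is made up purely for illustration.

```shell
#!/usr/bin/env bash
# The path below is invented for this example.
i='./docs/file 1.doc'

first_match="${i/doc/txt}"    # replaces the first "doc", here in the directory name
suffix_only="${i%.doc}.txt"   # strips the trailing ".doc", then appends ".txt"

echo "pattern substitution: $first_match"
echo "suffix removal:       $suffix_only"
```

With the first form the output would be redirected to a file under a directory that probably does not exist, so the suffix form is the safer choice for recursive runs.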

You could also use find to remove the DOC files afterwards, thus:

find . -name '*.doc' -exec rm "{}" \;

This would remove all DOC files, even if Antiword failed to convert them. To convert the files and remove them after successful conversion, use this:

find . -name '*.doc' | while IFS= read -r i; do
antiword -i 1 "${i}" >"${i%.doc}.txt" &&
rm "${i}"; done

Find outputs a list of matching files, one per line; read takes each line in turn, and Antiword converts that file. The && means that the rm command is run only if the previous command (antiword) exited without error.
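If you expect to do this regularly, the same convert-then-delete logic can be wrapped in a function. What follows is only a sketch under a few assumptions: convert_tree is a hypothetical name, and it swaps the pipe into read for find's -exec with a small inline sh script, which avoids reading names line by line and so also copes with unusual characters in file names. It additionally removes the partial text file when a conversion fails, which goes slightly beyond what was asked.

```shell
#!/bin/sh
# convert_tree is a hypothetical helper name, not part of Antiword.
# It converts every .doc file below the given directory (default: the
# current one) to .txt and deletes each original only if antiword
# exited successfully.
convert_tree() {
    find "${1:-.}" -type f -name '*.doc' -exec sh -c '
        for f in "$@"; do
            out="${f%.doc}.txt"
            if antiword -i 1 "$f" >"$out"; then
                rm -- "$f"
            else
                echo "conversion failed, keeping: $f" >&2
                rm -f -- "$out"    # discard the empty or partial output
            fi
        done
    ' sh {} +
}

# Run only when a directory is passed on the command line.
if [ "$#" -gt 0 ]; then
    convert_tree "$1"
fi
```

Saved as, say, doc2txt.sh (again, just an example name), it would be run as 'sh doc2txt.sh ~/Documents'. Try it on a copy of your files first, since it deletes the originals.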
