Saturday, October 22, 2011

Free XLS/XLSX to CSV/TEXT file converter

I found the utility bundled in the perl module Spreadsheet::Read is pretty good. It actually uses Spreadsheet::ParseExcel or Spreadsheet::XLSX to read Microsoft Excel files. Its main utility is xlscat and a wrapper xls2csv. It can read both XLS (excel 2003 and before) and XLSX (excel 2007 and beyond) files but not XLSM (macro enabled excel workbook) files yet. It can output in different text file formats

usage: xlscat [-s ] [-L] [-n] [-A] [-u] [ Selection ] file.xls
              [-c | -m]                 [-u] [ Selection ] file.xls
               -i                            [ -S sheets ] file.xls
    Generic options:
       -v[#]       Set verbose level (xlscat)
       -d[#]       Set debug   level (Spreadsheet::Read)
       -u          Use unformatted values
       --noclip    Do not strip empty sheets and
                   trailing empty rows and columns
        -e     Set encoding for input and output
        -b     Set encoding for input
        -a     Set encoding for output
    Input CSV:
       --in-sep=c  Set input sep_char for CSV
    Input XLS:
       --dtfmt=fmt Specify the default date format to replace 'm-d-yy'
                   the default replacement is 'yyyy-mm-dd'

    Output Text (default):
       -s     Use separator . Default is TAB \t, \n allowed
       -L          Line up the columns
       -n          Number lines (prefix with column number)
       -A          Show field attributes in ANSI escapes
    Output Index only:
       -i          Show sheet names and size only
    Output CSV:
       -c          Output CSV, separator = ','
       -m          Output CSV, separator = ';'

    Output HTML:
       -H          Output HTML
    Selection:
       -S Only print sheets . 'all' is a valid set
                   Default only prints the first sheet
       -R    Only print rows    . Default is 'all'
       -C    Only print columns . Default is 'all'
       -F    Only fields e.g. -FA3,B16

Therefore to read rows 5-100 in first worksheet of file in.xlsx and output as semi-colon delimited file out.csv
xlscat -m -R 5-100 in.xlsx > out.csv
When there isn't a output format option, xlscat defaults to use "|" as delimiter and I found it's more convenient to use TAB (\t) as the delimiter. Therefore I changed the program of /usr/local/bin/xlscat at line 129 to
defined $opt_s or $opt_s = "\t";
After the above change, I can use the following command to read rows 50-100 in first worksheet of file in.xlsx and output as TAB (\t) delimited file out.csv
xlscat -R 50-100 in.xlsx > out.csv

No comments: