Splitting large CSV files with awk
I recently learned a useful technique for processing large text files that use a consistent separator: splitting them with awk.
awk is a text-processing programming language dating back to 1977. Although the program text passed to awk is written in a full programming language, awk is most often used directly from the command line or from shell scripts.
The command to split a file is short:
awk -F\, '{print>$1}' input.csv
This command contains a field separator specification (-F, here a comma, backslash-escaped just in case, though it can be any character) and a very small awk program: {print>$1}. For each input line, this program writes the line to a file named after its first field. Note that > in awk does not behave quite like the sh > operator: awk truncates the file the first time it opens it during a run, and every later print to the same file name in that run appends to it, so all lines sharing a first field end up in the same file. A file left over from a previous run is still overwritten. $1 represents the first column as determined by the field separator. To reference another column, use the corresponding number: $2, $3 and so on.
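To make the behavior concrete, here is a minimal sketch that runs the split command on a tiny made-up file. The sample rows, the scratch directory and the pre-existing "east" file are assumptions for illustration only and are not part of the original technique.

```shell
#!/bin/sh
# Work in a scratch directory so the generated files stay out of the way.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample data: the first column decides the output file name.
cat > input.csv <<'EOF'
east,alice,10
west,bob,20
east,carol,30
EOF

# A leftover file from an imagined earlier run. awk truncates it the first
# time it opens it, then appends for the rest of this run.
echo "stale line" > east

# The split command from the article, with the separator written as -F,
awk -F, '{ print > $1 }' input.csv

# Show each output file with its name as a prefix.
grep . east west
```

After the run, east should hold the two "east" rows without the stale line, and west should hold the single "west" row. If the first column has many distinct values, some awk implementations may run out of open files; a common workaround is to switch to >> and call close($1) after each print, at the cost of having to delete old output files yourself before rerunning.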