
Pandas techniques - part 1 - read_csv( )

For a data scientist, the first step is to import data into the workspace. In Python, most of us use the `read_csv` method for importing tabular data. I have written `read_csv( )` a thousand times, but I only ever passed the file path and was not aware of the other parameters. There are many parameters you can pass; here is the full documentation. The following are, I think, the most useful parameters for this method.

The data were created with the nano editor, and the cat command shows the text (CSV) file. For example,

cat datasets/example.csv
a,b,c,d,e
1,2,3,4,hello
5,6,7,8,python
9,10,11,12,pandas

That is the actual text file as seen in the terminal or in a text editor. Reading it with only the file path parameter gives:

pd.read_csv('datasets/example.csv')
   a   b   c   d       e
0  1   2   3   4   hello
1  5   6   7   8  python
2  9  10  11  12  pandas

Pandas imports the first line as the header by default. If we want to import only selected columns, we can pass the `usecols` parameter.

pd.read_csv('datasets/example.csv', usecols=['a','b','e'])
   a   b       e
0  1   2   hello
1  5   6  python
2  9  10  pandas

We can arrange the order of the columns by indexing the result with a list of column names in the desired order.

pd.read_csv('datasets/example.csv', usecols=['a','b','e'])[['e','b','a']]
        e   b  a
0   hello   2  1
1  python   6  5
2  pandas  10  9

If we do not want to use the first row as a header, we can set the `header` parameter to `None`.

pd.read_csv('datasets/example.csv', header=None)
   0   1   2   3       4
0  a   b   c   d       e
1  1   2   3   4   hello
2  5   6   7   8  python
3  9  10  11  12  pandas

If we import the data with no header, the column names will be set to numbers. If we want to specify the column names, we can use the `names` parameter.

pd.read_csv('datasets/example.csv', header=None,
            names=['col1','col2','col3','col4','col5'])
  col1 col2 col3 col4    col5
0    a    b    c    d       e
1    1    2    3    4   hello
2    5    6    7    8  python
3    9   10   11   12  pandas

We can set the index column with the `index_col` parameter.

pd.read_csv('datasets/example.csv', header=None,
            names=['col1','col2','col3','col4','col5'],
            index_col='col5')
       col1 col2 col3 col4
e         a    b    c    d
hello     1    2    3    4
python    5    6    7    8
pandas    9   10   11   12

If we want to form a hierarchical index from multiple columns, we can pass a list of column numbers or names.

cat datasets/2-example.csv
key1,key2,value1,value2
one,a,1,2
one,a,3,4
one,b,5,6
one,b,7,8
two,a,1,2
two,a,3,4
two,b,5,6
two,b,7,8

pd.read_csv('datasets/2-example.csv', index_col=['key1','key2'])
           value1  value2
key1 key2                
one  a          1       2
     a          3       4
     b          5       6
     b          7       8
two  a          1       2
     a          3       4
     b          5       6
     b          7       8

In some cases, a table might not use a fixed single-character delimiter, separating values with spaces, tabs, or other patterns instead. In that case, you can pass a different character, or even a regular expression, as the `sep` parameter.

cat datasets/3-example.csv
	AAA	BBB	CCC
aaa	123	234	345
bbb	124	454	545
ccc	454	785	975

Here the values are separated by tabs, so I pass `sep='\t'`.

pd.read_csv('datasets/3-example.csv', sep='\t')
  Unnamed: 0  AAA  BBB  CCC
0        aaa  123  234  345
1        bbb  124  454  545
2        ccc  454  785  975
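To illustrate the regular-expression case mentioned above, here is a minimal sketch with inline data (invented for illustration) where the columns are separated by varying runs of whitespace:

```python
import io
import pandas as pd

# Inline data stands in for a file on disk; the field widths vary,
# so the columns are separated by runs of whitespace rather than a
# single fixed character.
data = "AAA  BBB   CCC\naaa  123   234\nbbb  124   454\n"

# sep also accepts a regular expression; r'\s+' matches one or more
# whitespace characters (spaces or tabs).
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
print(df)
```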

Sometimes, the data contains additional comment lines; for example,

cat datasets/4-example.csv
# this is an example how skip rows
# for pd.read_csv() method.
a,b,c,d,e
# this is also interrupted line
1,2,3,4,hello
5,6,7,8,python
2,4,6,8,pandas

You can skip the first, second, and fourth rows of that file with `skiprows`.

pd.read_csv('datasets/4-example.csv', skiprows=[0,1,3])
   a  b  c  d       e
0  1  2  3  4   hello
1  5  6  7  8  python
2  2  4  6  8  pandas

Or you can use the `comment` parameter, passing the character that marks a comment.

pd.read_csv('datasets/4-example.csv', comment='#')

The output is the same. Handling missing values is important: missing values may appear as an empty string or be marked by sentinel values. By default, values such as '#N/A', '-1.#IND', '#QNAN', '-NaN', '-nan', 'NA', 'NaN', 'n/a', 'null' are interpreted as NaN. For example,

cat datasets/5-example.csv

The NA, NULL, and -1.#IND values will be interpreted as NaN.

pd.read_csv('datasets/5-example.csv')
       A    B  C    D    E
0    one  1.0  2  3.0  NaN
1    two  4.0  5  NaN  6.0
2  three  NaN  5  6.0  7.0
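To see the default sentinel handling in isolation, here is a minimal sketch with inline data (the exact contents of 5-example.csv are not shown above, so these values are assumed):

```python
import io
import pandas as pd

# Three of pandas' default NA sentinels; all are converted to NaN
# without any extra arguments.
data = "A,B\none,NA\ntwo,NULL\nthree,-1.#IND\n"
df = pd.read_csv(io.StringIO(data))
print(df["B"].isna().sum())  # 3
```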

You can mark other values as missing by passing the `na_values` parameter, which takes a list or set of strings.

pd.read_csv('datasets/5-example.csv', na_values = ['one','6.0'])

The values 'one' and '6.0' are treated as missing, and the output looks like this.

       A    B  C    D    E
0    NaN  1.0  2  3.0  NaN
1    two  4.0  5  NaN  NaN
2  three  NaN  5  NaN  7.0

You can also specify the missing-value markers per column by passing a dict.

pd.read_csv('datasets/5-example.csv', na_values={'A': ['one','three'],'C':[5]})

Then the values 'one' and 'three' in column 'A' and the value 5 in column 'C' will be treated as missing values.

     A    B    C    D    E
0  NaN  1.0  2.0  3.0  NaN
1  two  4.0  NaN  NaN  6.0
2  NaN  NaN  NaN  6.0  7.0

When processing very large files, or when you are not yet sure which arguments or columns to use, you can read only a small piece of the data or iterate through the file in smaller chunks. Here I create a random data set with 2500 rows and 4 columns.

arr = np.random.randn(10000).reshape(2500, 4)
pd.DataFrame(arr).to_csv('datasets/6-example.csv', header=False, index=False)
pd.read_csv('datasets/6-example.csv', nrows=5, header=None)

Setting the `nrows` parameter to 5 imports only the first 5 rows of the data.

          0         1         2         3
0  0.671909 -0.700201 -1.961235 -0.295577
1  1.278847 -0.275689 -0.282427 -0.086346
2 -0.232476 -1.221327 -0.687825 -0.107018
3  0.421533  1.217264  0.040075 -1.483822
4  1.158106 -0.894026 -0.058134 -0.075550

Or you can use the `chunksize` parameter to specify the number of rows per chunk.

chunk = pd.read_csv('datasets/6-example.csv', 
        header=None, chunksize=5)
<pandas.io.parsers.TextFileReader object at 0x7f8d2e0605e0>

A TextFileReader object is returned, and you can iterate over the pieces of the file according to the chunk size. For example, `next(chunk)` gives you the first chunk of 5 rows, as above. It is an iterator, so you can also iterate over it in a loop.
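A minimal sketch of looping over the chunks, using inline data in place of the article's file:

```python
import io
import pandas as pd

# Twelve rows of inline data stand in for 6-example.csv.
data = "".join(f"{i},{i + 1},{i + 2},{i + 3}\n" for i in range(12))

total_rows = 0
# With chunksize set, read_csv returns an iterator that yields
# DataFrames of at most 5 rows each, keeping memory use bounded.
with pd.read_csv(io.StringIO(data), header=None, chunksize=5) as reader:
    for chunk in reader:
        total_rows += len(chunk)

print(total_rows)  # 12
```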

There are many more parameters for the `read_csv` method.

Here is a summary of the arguments above.

`filepath_or_buffer` : String indicating a file system location, URL, or file-like object

`sep` or `delimiter` : Character sequence or regular expression to use to split fields in each row

`header` : Row number to use as column names. Defaults to 0 (the first row); None if there is no header row

`index_col` : Column numbers or names to use as the row index in the result; can be a single name/number or a list for a hierarchical index

`names` : List of column names for the result; combine with header=None

`skiprows` : Number of rows at the beginning of the file to ignore, or a list of row numbers (starting from 0) to skip

`na_values` : Sequence of values to replace with NA

`comment` : Character(s) used to split comments off the end of lines

`nrows` : Number of rows to read from the beginning of the file

`chunksize` : For iteration, the size of file chunks
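As a closing sketch, several of these arguments can be combined in a single call; the file contents here are invented for illustration:

```python
import io
import pandas as pd

# An invented messy file: a comment line, no header row, and a
# sentinel value (-999) marking missing data.
data = (
    "# generated nightly\n"
    "1,2,-999,hello\n"
    "5,-999,7,python\n"
)
df = pd.read_csv(
    io.StringIO(data),
    comment="#",                    # drop the comment line
    header=None,                    # the file has no header row
    names=["a", "b", "c", "word"],  # supply column names instead
    na_values=[-999],               # treat the sentinel as NaN
    index_col="word",               # index by the last column
)
print(df)
```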

