Accessing data is a necessary first step for using most of the tools in this book. I’ll focus on data input and output using pandas, though there are numerous tools in other libraries to help with reading and writing data in various formats.
Input and output typically fall into a few main categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources like web APIs.
pandas features a number of functions for reading tabular data as a DataFrame
object. Table 6-1 summarizes some of
them, though read_csv and read_table are likely the ones you’ll use the
most.
I’ll give an overview of the mechanics of these functions, which are meant to convert text data into a DataFrame. The optional arguments for these functions may fall into a few categories (a combined example follows the list):
Indexing: Can treat one or more columns as the returned DataFrame’s index, and whether to get column names from the file, from the user, or not at all.
Type inference and data conversion: Includes the user-defined value conversions and custom lists of missing value markers.
Datetime parsing: Includes the ability to combine date and time information spread over multiple columns into a single column in the result.
Iterating: Support for iterating over chunks of very large files.
Unclean data issues: Skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.
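To make these categories concrete, here is a minimal sketch that touches most of them at once. It uses a small hypothetical in-memory CSV rather than one of this chapter’s example files, and iteration is demonstrated later in this section:

import io
import pandas as pd

# Hypothetical in-memory CSV: a date split across two columns, a quoted
# number with a thousands separator, and an 'n/a' missing-value marker
raw = io.StringIO('trade_date,trade_time,amount,note\n'
                  '2020-01-01,09:30:00,"1,250",ok\n'
                  '2020-01-02,10:15:00,"2,500",n/a\n')

frame = pd.read_csv(raw,
                    parse_dates=[['trade_date', 'trade_time']],  # datetime parsing: combine two columns
                    thousands=',',                               # unclean data: strip thousands separators
                    na_values=['n/a'])                           # data conversion: extra missing-value marker
frame = frame.set_index('trade_date_trade_time')                 # indexing: promote a column to the index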
Because of how messy data in the real world can be, some of the data
loading functions (especially read_csv) have grown very
complex in their options over time. It’s normal to feel overwhelmed by the
number of different parameters (read_csv has over 50 as
of this writing). The online pandas documentation has many examples about
how each of them works, so if you’re struggling to read a particular file,
there might be a similar enough example to help you find the right
parameters.
Some of these functions, like pandas.read_csv,
perform type inference, because the column data types are not part of the data
format. That means you don’t necessarily have to specify which columns are
numeric, integer, boolean, or string. Other data formats, like HDF5,
Feather, and msgpack, have the data types stored in the format.
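If you do need to override the inference for particular columns, read_csv accepts a dtype argument. A minimal sketch, again with hypothetical in-memory data:

import io
import pandas as pd

# Without dtype, '007' would be inferred as the integer 7; forcing str
# preserves the leading zeros
raw = io.StringIO('code,value\n007,1.5\n042,2.5\n')
frame = pd.read_csv(raw, dtype={'code': str})
frame.dtypes   # code is object (string), value is float64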
Handling dates and other custom types can require extra effort. Let’s start with a small comma-separated (CSV) text file:
In [8]: !cat examples/ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
Here I used the Unix cat shell
command to print the raw contents of the file to the screen. If you’re
on Windows, you can use type instead
of cat to achieve the same
effect.
Since this is comma-delimited, we can use read_csv to read it into a DataFrame:
In [9]: df = pd.read_csv('examples/ex1.csv')

In [10]: df
Out[10]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
We could also have used read_table and specified the delimiter:
In [11]: pd.read_table('examples/ex1.csv', sep=',')
Out[11]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
A file will not always have a header row. Consider this file:
In [12]: !cat examples/ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
To read this file, you have a couple of options. You can allow pandas to assign default column names, or you can specify names yourself:
In [13]: pd.read_csv('examples/ex2.csv', header=None)
Out[13]: 
   0   1   2   3      4
0  1   2   3   4  hello
1  5   6   7   8  world
2  9  10  11  12    foo

In [14]: pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
Out[14]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
Suppose you wanted the message
column to be the index of the returned DataFrame. You can either indicate
you want the column at index 4 or named 'message' using the index_col argument:
In [15]: names = ['a', 'b', 'c', 'd', 'message']

In [16]: pd.read_csv('examples/ex2.csv', names=names, index_col='message')
Out[16]: 
         a   b   c   d
message               
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12
In the event that you want to form a hierarchical index from multiple columns, pass a list of column numbers or names:
In [17]: !cat examples/csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16

In [18]: parsed = pd.read_csv('examples/csv_mindex.csv',
   ....:                      index_col=['key1', 'key2'])

In [19]: parsed
Out[19]: 
           value1  value2
key1 key2                
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16
In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to separate fields. Consider a text file that looks like this:
In [20]: list(open('examples/ex3.txt'))
Out[20]: 
['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']
While you could do some munging by hand, the fields here are separated by a variable amount of whitespace. In cases like this, you can pass a regular expression as a delimiter for read_table. Here the separator can be expressed by the regular expression \s+, so we have:
In [21]: result = pd.read_table('examples/ex3.txt', sep='\s+')

In [22]: result
Out[22]: 
            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491
Because there was one fewer column name than the number of data columns, read_table infers that the first column should be the DataFrame’s index in this special case.
The parser functions have many additional arguments to help you
handle the wide variety of exceptional file formats that occur (see a
partial listing in Table 6-2). For
example, you can skip the first, third, and fourth rows of a file with
skiprows:
In [23]: !cat examples/ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

In [24]: pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])
Out[24]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
Handling missing values is an important and frequently nuanced part
of the file parsing process. Missing data is usually either not present (empty string) or
marked by some sentinel value. By
default, pandas uses a set of commonly occurring sentinels, such as
NA and NULL:
In [25]: !cat examples/ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo

In [26]: result = pd.read_csv('examples/ex5.csv')

In [27]: result
Out[27]: 
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo

In [28]: pd.isnull(result)
Out[28]: 
   something      a      b      c      d  message
0      False  False  False  False  False     True
1      False  False  False   True  False    False
2      False  False  False  False  False    False
The na_values option can take
either a list or set of strings to consider missing values:
In [29]: result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])

In [30]: result
Out[30]: 
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
Different NA sentinels can be specified for each column in a dict:
In [31]: sentinels = {'message': ['foo', 'NA'], 'something': ['two']}

In [32]: pd.read_csv('examples/ex5.csv', na_values=sentinels)
Out[32]: 
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       NaN  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN
Table 6-2 lists some frequently
used options in pandas.read_csv and
pandas.read_table.
When processing very large files or figuring out the right set of arguments to correctly process a large file, you may only want to read in a small piece of a file or iterate through smaller chunks of the file.
Before we look at a large file, we make the pandas display settings more compact:
In [33]: pd.options.display.max_rows = 10
Now we have:
In [34]: result = pd.read_csv('examples/ex6.csv')

In [35]: result
Out[35]: 
           one       two     three      four key
0     0.467976 -0.038649 -0.295344 -1.824726   L
1    -0.358893  1.404453  0.704965 -0.200638   B
2    -0.501840  0.659254 -0.421691 -0.057688   G
3     0.204886  1.074134  1.388361 -0.982404   R
4     0.354628 -0.133116  0.283763 -0.837063   Q
...        ...       ...       ...       ...  ..
9995  2.311896 -0.417070 -1.409599 -0.515821   L
9996 -0.479893 -0.650419  0.745152 -0.646038   E
9997  0.523331  0.787112  0.486066  1.093156   K
9998 -0.362559  0.598894 -1.843201  0.887292   G
9999 -0.096376 -1.012999 -0.657431 -0.573315   0

[10000 rows x 5 columns]
If you want to only read a small number of rows (avoiding reading
the entire file), specify that with nrows:
In [36]: pd.read_csv('examples/ex6.csv', nrows=5)
Out[36]: 
        one       two     three      four key
0  0.467976 -0.038649 -0.295344 -1.824726   L
1 -0.358893  1.404453  0.704965 -0.200638   B
2 -0.501840  0.659254 -0.421691 -0.057688   G
3  0.204886  1.074134  1.388361 -0.982404   R
4  0.354628 -0.133116  0.283763 -0.837063   Q
To read a file in pieces, specify a chunksize as a number of rows:
In [37]: chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)

In [38]: chunker
Out[38]: <pandas.io.parsers.TextFileReader at 0x7fb741fe7160>
The TextFileReader object returned by read_csv allows you to iterate over the parts of the file according to the chunksize. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column like so:
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)

tot = pd.Series([])
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
We have then:
In [40]: tot[:10]
Out[40]: 
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64
TextFileReader is also equipped with a get_chunk method that enables you to read pieces of an arbitrary size.
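A minimal sketch (the reader created above has already been consumed by the loop, so we create a fresh one first):

chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
piece = chunker.get_chunk(10)   # DataFrame holding the next 10 rows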
Data can also be exported to a delimited format. Let’s consider one of the CSV files read before:
In [41]: data = pd.read_csv('examples/ex5.csv')

In [42]: data
Out[42]: 
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
Using DataFrame’s to_csv
method, we can write the data out to a comma-separated
file:
In [43]: data.to_csv('examples/out.csv')

In [44]: !cat examples/out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
Other delimiters can be used, of course (writing to sys.stdout so it
prints the text result to the console):
In [45]: import sys

In [46]: data.to_csv(sys.stdout, sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo
Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value:
In [47]: data.to_csv(sys.stdout, na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
With no other options specified, both the row and column labels are written. Both of these can be disabled:
In [48]: data.to_csv(sys.stdout, index=False, header=False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
You can also write only a subset of the columns, and in an order of your choosing:
In [49]: data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0
Series also has a to_csv method:
In [50]: dates = pd.date_range('1/1/2000', periods=7)

In [51]: ts = pd.Series(np.arange(7), index=dates)

In [52]: ts.to_csv('examples/tseries.csv')

In [53]: !cat examples/tseries.csv
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6
It’s possible to load most forms of tabular data from disk using functions
like pandas.read_table. In some
cases, however, some manual processing may be necessary. It’s not
uncommon to receive a file with one or more malformed lines that trip
up read_table. To
illustrate the basic tools, consider a small CSV file:
In [54]: !cat examples/ex7.csv
"a","b","c"
"1","2","3"
"1","2","3"
For any file with a single-character delimiter, you can use
Python’s built-in csv module. To
use it, pass any open file or file-like object to csv.reader:
import csv
f = open('examples/ex7.csv')

reader = csv.reader(f)
Iterating through the reader like a file yields lists of values with any quote characters removed:
In [56]: for line in reader:
   ....:     print(line)
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']
From there, it’s up to you to do the wrangling necessary to put the data in the form that you need it. Let’s take this step by step. First, we read the file into a list of lines:
In [57]: with open('examples/ex7.csv') as f:
   ....:     lines = list(csv.reader(f))
Then, we split the lines into the header line and the data lines:
In [58]: header, values = lines[0], lines[1:]
Then we can create a dictionary of data columns using a dictionary
comprehension and the expression zip(*values), which
transposes rows to columns:
In [59]: data_dict = {h: v for h, v in zip(header, zip(*values))}

In [60]: data_dict
Out[60]: {'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
CSV files come in many different flavors. To define a new format
with a different delimiter, string quoting convention, or line
terminator, we define a simple subclass of csv.Dialect:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL
reader = csv.reader(f, dialect=my_dialect)
We can also give individual CSV dialect parameters as keywords to
csv.reader without having to define a
subclass:
reader = csv.reader(f, delimiter='|')
The possible options (attributes of csv.Dialect) and what they do can be found in Table 6-3.
| Argument | Description |
|---|---|
| delimiter | One-character string to separate fields; defaults to ','. |
| lineterminator | Line terminator for writing; defaults to '\r\n'. Reader ignores this and recognizes cross-platform line terminators. |
| quotechar | Quote character for fields with special characters (like a delimiter); default is '"'. |
| quoting | Quoting convention. Options include csv.QUOTE_ALL (quote all fields), csv.QUOTE_MINIMAL (only fields with special characters like the delimiter), csv.QUOTE_NONNUMERIC, and csv.QUOTE_NONE (no quoting). See Python's documentation for full details. Defaults to QUOTE_MINIMAL. |
| skipinitialspace | Ignore whitespace after each delimiter; default is False. |
| doublequote | How to handle the quoting character inside a field; if True, it is doubled (see online documentation for full detail and behavior). |
| escapechar | String to escape the delimiter if quoting is set to csv.QUOTE_NONE; disabled by default. |
For files with more complicated or fixed multicharacter
delimiters, you will not be able to use the csv module. In those cases, you’ll have to
do the line splitting and other cleanup using string’s split method or the regular expression method re.split.
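A rough sketch of the manual approach, assuming a hypothetical file whose fields are separated by the multicharacter delimiter '||':

import re

# examples/weird_delim.txt is a hypothetical path; each line looks like 'a||b||c'
with open('examples/weird_delim.txt') as f:
    rows = [re.split(r'\|\|', line.rstrip('\n')) for line in f]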
To write delimited files manually, you can
use csv.writer. It accepts an open,
writable file object and the same dialect and format options as csv.reader:
with open('mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))
JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more free-form data format than a tabular text form like CSV. Here is an example:
obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""
JSON is very nearly valid Python code with the exception of its
null value null and
some other nuances (such as disallowing trailing commas at the end of
lists). The basic types are objects (dicts), arrays (lists), strings,
numbers, booleans, and nulls. All of the keys in an object must be
strings. There are several Python libraries for reading and writing JSON
data. I’ll use json here, as it is
built into the Python standard library. To convert a JSON string to
Python form, use json.loads:
In [62]: import json

In [63]: result = json.loads(obj)

In [64]: result
Out[64]: 
{'name': 'Wes',
 'pet': None,
 'places_lived': ['United States', 'Spain', 'Germany'],
 'siblings': [{'age': 30, 'name': 'Scott', 'pets': ['Zeus', 'Zuko']},
  {'age': 38, 'name': 'Katie', 'pets': ['Sixes', 'Stache', 'Cisco']}]}
json.dumps, on the other hand, converts a Python object back to JSON:
In [65]: asjson = json.dumps(result)
How you convert a JSON object or list of objects to a DataFrame or some other data structure for analysis will be up to you. Conveniently, you can pass a list of dicts (which were previously JSON objects) to the DataFrame constructor and select a subset of the data fields:
In [66]: siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])

In [67]: siblings
Out[67]: 
    name  age
0  Scott   30
1  Katie   38
The pandas.read_json function can automatically convert JSON datasets in specific arrangements into a Series or DataFrame. For example:
In [68]: !cat examples/example.json
[{"a": 1, "b": 2, "c": 3},
 {"a": 4, "b": 5, "c": 6},
 {"a": 7, "b": 8, "c": 9}]
The default options for pandas.read_json assume
that each object in the JSON array is a row in the table:
In [69]: data = pd.read_json('examples/example.json')

In [70]: data
Out[70]: 
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
For an extended example of reading and manipulating JSON data (including nested records), see the USDA Food Database example in Chapter 7.
If you need to export data from pandas to JSON, one way is to use
the to_json methods on Series and DataFrame:
In [71]: print(data.to_json())
{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}

In [72]: print(data.to_json(orient='records'))
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
Python has many libraries for reading and writing data in the ubiquitous HTML and XML formats. Examples include lxml, Beautiful Soup, and html5lib. While lxml is comparatively much faster in general, the other libraries can better handle malformed HTML or XML files.
pandas has a built-in function, read_html,
which uses libraries like lxml and Beautiful Soup to
automatically parse tables out of HTML files as DataFrame objects. To
show how this works, I downloaded an HTML file (used in the pandas
documentation) from the United States FDIC government agency showing bank failures.1 First, you must install some additional libraries used by
read_html:
conda install lxml
pip install beautifulsoup4 html5lib
If you are not using conda, pip install lxml
will likely also work.
The pandas.read_html function has a number of
options, but by default it searches for and attempts to parse all
tabular data contained within <table> tags. The
result is a list of DataFrame objects:
In [73]: tables = pd.read_html('examples/fdic_failed_bank_list.html')

In [74]: len(tables)
Out[74]: 1

In [75]: failures = tables[0]

In [76]: failures.head()
Out[76]: 
                      Bank Name             City  ST   CERT  \
0                   Allied Bank         Mulberry  AR     91   
1  The Woodbury Banking Company         Woodbury  GA  11297   
2        First CornerStone Bank  King of Prussia  PA  35312   
3            Trust Company Bank          Memphis  TN   9956   
4    North Milwaukee State Bank        Milwaukee  WI  20364   

                 Acquiring Institution        Closing Date       Updated Date  
0                         Today's Bank  September 23, 2016  November 17, 2016  
1                          United Bank     August 19, 2016  November 17, 2016  
2  First-Citizens Bank & Trust Company         May 6, 2016  September 6, 2016  
3           The Bank of Fayette County      April 29, 2016  September 6, 2016  
4  First-Citizens Bank & Trust Company      March 11, 2016      June 16, 2016  
Because failures has many columns, pandas
inserts a line break character \.
As you will learn in later chapters, from here we could proceed to do some data cleaning and analysis, like computing the number of bank failures by year:
In [77]: close_timestamps = pd.to_datetime(failures['Closing Date'])

In [78]: close_timestamps.dt.year.value_counts()
Out[78]: 
2010    157
2009    140
2011     92
2012     51
2008     25
       ... 
2004      4
2001      4
2007      3
2003      3
2000      2
Name: Closing Date, Length: 15, dtype: int64
XML (eXtensible Markup Language) is another common structured data format supporting hierarchical, nested data with metadata. The book you are currently reading was actually created from a series of large XML documents.
Earlier, I showed the pandas.read_html
function, which uses either lxml or Beautiful Soup under the hood to
parse data from HTML. XML and HTML are structurally similar, but XML
is more general. Here, I will show an example of how to use lxml to
parse data from a more general XML format.
The New York Metropolitan Transportation Authority (MTA) publishes a number of data series about its bus and train services. Here we’ll look at the performance data, which is contained in a set of XML files. Each train or bus service has a different file (like Performance_MNR.xml for the Metro-North Railroad) containing monthly data as a series of XML records that look like this:
<INDICATOR>
  <INDICATOR_SEQ>373889</INDICATOR_SEQ>
  <PARENT_SEQ></PARENT_SEQ>
  <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
  <INDICATOR_NAME>Escalator Availability</INDICATOR_NAME>
  <DESCRIPTION>Percent of the time that escalators are operational
  systemwide. The availability rate is based on physical observations performed
  the morning of regular business days only. This is a new indicator the agency
  began reporting in 2009.</DESCRIPTION>
  <PERIOD_YEAR>2011</PERIOD_YEAR>
  <PERIOD_MONTH>12</PERIOD_MONTH>
  <CATEGORY>Service Indicators</CATEGORY>
  <FREQUENCY>M</FREQUENCY>
  <DESIRED_CHANGE>U</DESIRED_CHANGE>
  <INDICATOR_UNIT>%</INDICATOR_UNIT>
  <DECIMAL_PLACES>1</DECIMAL_PLACES>
  <YTD_TARGET>97.00</YTD_TARGET>
  <YTD_ACTUAL></YTD_ACTUAL>
  <MONTHLY_TARGET>97.00</MONTHLY_TARGET>
  <MONTHLY_ACTUAL></MONTHLY_ACTUAL>
</INDICATOR>
Using lxml.objectify, we
parse the file and get a reference to the root node of the XML file
with getroot:
from lxml import objectify

path = 'datasets/mta_perf/Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()
root.INDICATOR returns a
generator yielding each <INDICATOR> XML element. For each
record, we can populate a dict of tag names (like YTD_ACTUAL) to data values (excluding a few
tags):
data = []

skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ',
               'DESIRED_CHANGE', 'DECIMAL_PLACES']

for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)
Lastly, convert this list of dicts into a DataFrame:
In [81]: perf = pd.DataFrame(data)

In [82]: perf.head()
Out[82]: 
            AGENCY_NAME            CATEGORY  \
0  Metro-North Railroad  Service Indicators   
1  Metro-North Railroad  Service Indicators   
2  Metro-North Railroad  Service Indicators   
3  Metro-North Railroad  Service Indicators   
4  Metro-North Railroad  Service Indicators   

                                         DESCRIPTION FREQUENCY  \
0  Percent of commuter trains that arrive at thei...         M   
1  Percent of commuter trains that arrive at thei...         M   
2  Percent of commuter trains that arrive at thei...         M   
3  Percent of commuter trains that arrive at thei...         M   
4  Percent of commuter trains that arrive at thei...         M   

                         INDICATOR_NAME INDICATOR_UNIT MONTHLY_ACTUAL  \
0  On-Time Performance (West of Hudson)              %           96.9   
1  On-Time Performance (West of Hudson)              %             95   
2  On-Time Performance (West of Hudson)              %           96.9   
3  On-Time Performance (West of Hudson)              %           98.3   
4  On-Time Performance (West of Hudson)              %           95.8   

  MONTHLY_TARGET PERIOD_MONTH PERIOD_YEAR YTD_ACTUAL YTD_TARGET  
0             95            1        2008       96.9         95  
1             95            2        2008         96         95  
2             95            3        2008       96.3         95  
3             95            4        2008       96.8         95  
4             95            5        2008       96.6         95  
XML data can get much more complicated than this example. Each tag can have metadata, too. Consider an HTML link tag, which is also valid XML:
from io import StringIO

tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()
You can now access any of the fields (like href) in the tag or the link text:
In [84]: root
Out[84]: <Element a at 0x7fb73962a088>

In [85]: root.get('href')
Out[85]: 'http://www.google.com'

In [86]: root.text
Out[86]: 'Google'
One of the easiest ways to store data (also known as
serialization) efficiently in binary format is using Python’s built-in
pickle serialization. pandas objects all have a to_pickle method
that writes the data to disk in pickle format:
In [87]: frame = pd.read_csv('examples/ex1.csv')

In [88]: frame
Out[88]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo

In [89]: frame.to_pickle('examples/frame_pickle')
You can read any “pickled” object stored in a file by using the
built-in pickle directly, or even more conveniently
using pandas.read_pickle:
In [90]: pd.read_pickle('examples/frame_pickle')
Out[90]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
pickle is only recommended as a
short-term storage format. The problem is that it is hard to guarantee
that the format will be stable over time; an object pickled today may
not unpickle with a later version of a library. We have tried to
maintain backward compatibility when possible, but at some point in the
future it may be necessary to “break” the pickle format.
pandas has built-in support for two more binary data formats: HDF5 and MessagePack. I will give some HDF5 examples in the next section, but I encourage you to explore different file formats to see how fast they are and how well they work for your analysis. Some other storage formats for pandas or NumPy data include:
bcolz: A compressible column-oriented binary format based on the Blosc compression library.
Feather: A cross-language column-oriented file format I designed with the R programming community’s Hadley Wickham. Feather uses the Apache Arrow columnar memory format (a brief round-trip sketch follows this list).
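As a quick illustration of the Feather round trip (not part of this chapter’s examples; it assumes the pyarrow package is installed and uses a hypothetical output path):

frame = pd.read_csv('examples/ex1.csv')
frame.to_feather('examples/ex1.feather')   # hypothetical output path
pd.read_feather('examples/ex1.feather')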
HDF5 is a well-regarded file format intended for storing large quantities of scientific array data. It is available as a C library, and it has interfaces available in many other languages, including Java, Julia, MATLAB, and Python. The “HDF” in HDF5 stands for hierarchical data format. Each HDF5 file can store multiple datasets and supporting metadata. Compared with simpler formats, HDF5 supports on-the-fly compression with a variety of compression modes, enabling data with repeated patterns to be stored more efficiently. HDF5 can be a good choice for working with very large datasets that don’t fit into memory, as you can efficiently read and write small sections of much larger arrays.
While it’s possible to directly access HDF5 files using either the PyTables or h5py libraries, pandas provides a high-level interface that simplifies storing Series and DataFrame objects. The HDFStore class works like a dict and handles the low-level details:
In [92]: frame = pd.DataFrame({'a': np.random.randn(100)})

In [93]: store = pd.HDFStore('mydata.h5')

In [94]: store['obj1'] = frame

In [95]: store['obj1_col'] = frame['a']

In [96]: store
Out[96]: 
<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5
/obj1                frame        (shape->[100,1])
/obj1_col            series       (shape->[100])
/obj2                frame_table  (typ->appendable,nrows->100,ncols->1,indexers->[index])
/obj3                frame_table  (typ->appendable,nrows->100,ncols->1,indexers->[index])
Objects contained in the HDF5 file can then be retrieved with the same dict-like API:
In [97]: store['obj1']
Out[97]: 
           a
0  -0.204708
1   0.478943
2  -0.519439
3  -0.555730
4   1.965781
..       ...
95  0.795253
96  0.118110
97 -0.748532
98  0.584970
99  0.152677

[100 rows x 1 columns]
HDFStore supports two storage schemas,
'fixed' and 'table'. The latter is
generally slower, but it supports query operations using a special
syntax:
In [98]: store.put('obj2', frame, format='table')

In [99]: store.select('obj2', where=['index >= 10 and index <= 15'])
Out[99]: 
           a
10  1.007189
11 -1.296221
12  0.274992
13  0.228913
14  1.352917
15  0.886429

In [100]: store.close()
put is an explicit version of the store['obj2'] = frame assignment, but it allows us to set other options like the storage format.
The pandas.read_hdf function gives you a shortcut to these tools:
In [101]: frame.to_hdf('mydata.h5', 'obj3', format='table')

In [102]: pd.read_hdf('mydata.h5', 'obj3', where=['index < 5'])
Out[102]: 
          a
0 -0.204708
1  0.478943
2 -0.519439
3 -0.555730
4  1.965781
If you are processing data that is stored on remote servers, like Amazon S3 or HDFS, using a different binary format designed for distributed storage like Apache Parquet may be more suitable. Python tooling for Parquet and other such storage formats is still developing, so I do not write about them in this book.
If you work with large quantities of data locally, I would encourage you to explore PyTables and h5py to see how they can suit your needs. Since many data analysis problems are I/O-bound (rather than CPU-bound), using a tool like HDF5 can massively accelerate your applications.
HDF5 is not a database. It is best suited for write-once, read-many datasets. While data can be added to a file at any time, if multiple writers do so simultaneously, the file can become corrupted.
pandas also supports reading tabular data stored in Excel 2003 (and higher)
files using either the ExcelFile class or
pandas.read_excel function. Internally these tools use the add-on packages
xlrd and openpyxl to read
XLS and XLSX files, respectively. You may need to install
these manually with pip or conda.
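For example, assuming a pip-based environment:

pip install xlrd openpyxl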
To use ExcelFile, create an
instance by passing a path to an xls
or xlsx file:
In [104]: xlsx = pd.ExcelFile('examples/ex1.xlsx')
Data stored in a sheet can then be read into a DataFrame with pandas.read_excel:
In [105]: pd.read_excel(xlsx, 'Sheet1')
Out[105]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
If you are reading multiple sheets in a file, then it is faster to
create the ExcelFile, but you can also simply pass
the filename to pandas.read_excel:
In [106]: frame = pd.read_excel('examples/ex1.xlsx', 'Sheet1')

In [107]: frame
Out[107]: 
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
To write pandas data to Excel format, you must first create an
ExcelWriter, then write data to it using
pandas objects’ to_excel method:
In [108]: writer = pd.ExcelWriter('examples/ex2.xlsx')

In [109]: frame.to_excel(writer, 'Sheet1')

In [110]: writer.save()
You can also pass a file path to to_excel and
avoid the ExcelWriter:
In [111]: frame.to_excel('examples/ex2.xlsx')
Many websites have public APIs providing data feeds via JSON or some other
format. There are a number of ways to access these APIs from Python; one
easy-to-use method that I recommend is the requests
package.
To find the last 30 GitHub issues for pandas on GitHub, we can
make a GET HTTP request
using the add-on requests library:
In [113]: import requests

In [114]: url = 'https://api.github.com/repos/pandas-dev/pandas/issues'

In [115]: resp = requests.get(url)

In [116]: resp
Out[116]: <Response [200]>
The Response object’s json method
will return a dictionary containing JSON parsed into native
Python objects:
In [117]: data = resp.json()

In [118]: data[0]['title']
Out[118]: 'BUG: rank with +-inf, #6945'
Each element in data is a
dictionary containing all of the data found on a GitHub issue page (except
for the comments). We can pass data
directly to DataFrame and extract fields of interest:
In [119]: issues = pd.DataFrame(data, columns=['number', 'title',
   .....:                                      'labels', 'state'])

In [120]: issues
Out[120]: 
    number                                              title  \
0    17903                        BUG: rank with +-inf, #6945   
1    17902  Revert "ERR: Raise ValueError when setting sca...   
2    17901  Wrong orientation of operations between DataFr...   
3    17900  added 'infer' option to compression in _get_ha...   
4    17898  Last day of month should group with that month...   
..     ...                                                ...   
25   17854      Adding an integer-location based "get" method   
26   17853  BUG: adds validation for boolean keywords in D...   
27   17851  BUG: duplicate indexing with embedded non-orde...   
28   17850     ImportError: No module named 'pandas.plotting'   
29   17846  BUG: Ignore division by 0 when merging empty d...   

                                               labels state  
0                                                  []  open  
1   [{'id': 35818298, 'url': 'https://api.github.c...  open  
2                                                  []  open  
3                                                  []  open  
4   [{'id': 76811, 'url': 'https://api.github.com/...  open  
..                                                ...   ...  
25  [{'id': 35818298, 'url': 'https://api.github.c...  open  
26  [{'id': 42670965, 'url': 'https://api.github.c...  open  
27  [{'id': 76811, 'url': 'https://api.github.com/...  open  
28  [{'id': 31932467, 'url': 'https://api.github.c...  open  
29  [{'id': 76865106, 'url': 'https://api.github.c...  open  

[30 rows x 4 columns]
With a bit of elbow grease, you can create some higher-level interfaces to common web APIs that return DataFrame objects for easy analysis.
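As a sketch of what such a wrapper might look like (the helper name github_issues_frame and its structure are my own, not part of pandas or requests):

import pandas as pd
import requests

def github_issues_frame(repo):
    # Hypothetical helper: fetch the most recent open issues for a GitHub
    # repository and return selected fields as a DataFrame
    url = 'https://api.github.com/repos/{}/issues'.format(repo)
    resp = requests.get(url)
    resp.raise_for_status()
    return pd.DataFrame(resp.json(),
                        columns=['number', 'title', 'labels', 'state'])

issues = github_issues_frame('pandas-dev/pandas')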
In a business setting, most data may not be stored in text or Excel files. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative databases have become quite popular. The choice of database is usually dependent on the performance, data integrity, and scalability needs of an application.
Loading data from SQL into a DataFrame is fairly straightforward,
and pandas has some functions to simplify the process. As an example, I’ll
create a SQLite database using Python’s built-in sqlite3
driver:
In [121]: import sqlite3

In [122]: query = """
   .....: CREATE TABLE test
   .....: (a VARCHAR(20), b VARCHAR(20),
   .....:  c REAL,        d INTEGER
   .....: );"""

In [123]: con = sqlite3.connect('mydata.sqlite')

In [124]: con.execute(query)
Out[124]: <sqlite3.Cursor at 0x7fb7361b4b90>

In [125]: con.commit()
Then, insert a few rows of data:
In [126]: data = [('Atlanta', 'Georgia', 1.25, 6),
   .....:         ('Tallahassee', 'Florida', 2.6, 3),
   .....:         ('Sacramento', 'California', 1.7, 5)]

In [127]: stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

In [128]: con.executemany(stmt, data)
Out[128]: <sqlite3.Cursor at 0x7fb7396d25e0>

In [129]: con.commit()
Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return a list of tuples when selecting data from a table:
In [130]: cursor = con.execute('select * from test')

In [131]: rows = cursor.fetchall()

In [132]: rows
Out[132]: 
[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]
You can pass the list of tuples to the DataFrame constructor, but
you also need the column names, contained in the cursor’s description attribute:
In [133]: cursor.description
Out[133]: 
(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [134]: pd.DataFrame(rows, columns=[x[0] for x in cursor.description])
Out[134]: 
             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5
This is quite a bit of munging that you’d rather not repeat each
time you query the database. The SQLAlchemy project is a popular
Python SQL toolkit that abstracts away many of the common
differences between SQL databases. pandas has a read_sql function that enables you to read data easily from a general
SQLAlchemy connection. Here, we’ll connect to the same SQLite database
with SQLAlchemy and read data from the table created before:
In [135]: import sqlalchemy as sqla

In [136]: db = sqla.create_engine('sqlite:///mydata.sqlite')

In [137]: pd.read_sql('select * from test', db)
Out[137]: 
             a           b     c  d
0      Atlanta     Georgia  1.25  6
1  Tallahassee     Florida  2.60  3
2   Sacramento  California  1.70  5
Getting access to data is frequently the first step in the data analysis process. We have looked at a number of useful tools in this chapter that should help you get started. In the upcoming chapters we will dig deeper into data wrangling, data visualization, time series analysis, and other topics.
1 For the full list, see https://www.fdic.gov/bank/individual/failed/banklist.html.