How to Read a Txt File in Pandas

27. Reading and Writing Data in Pandas

By Bernd Klein. Last modified: 01 Feb 2022.

All the powerful data structures like the Series and the DataFrames would avail to nothing if the Pandas module didn't provide powerful functionalities for reading in and writing out data. It is not only a matter of having functions for interacting with files. To be useful to data scientists, it also needs functions which support the most important data formats, such as:

Delimiter-separated files, e.g. csv
Microsoft Excel files
HTML
XML
JSON

Delimiter-separated Values

Most people take csv files as a synonym for delimiter-separated values files. They overlook the fact that csv is an acronym for "comma separated values", and that the delimiter is not a comma in many situations. Pandas also uses "csv" in contexts in which "dsv" would be more appropriate.

Delimiter-separated values (DSV) store two-dimensional arrays of data (for example strings) by separating the values in each row with delimiter characters defined for this purpose. This way of storing data is often used in combination with spreadsheet programs, which can read in and write out data as DSV. DSV files are also used as a general data exchange format.

We call a text file a "delimited text file" if it contains text in DSV format.

For example, the file dollar_euro.txt is a delimited text file and uses tabs (\t) as delimiters.
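To make the idea of a delimiter concrete, here is a minimal sketch using Python's standard csv module. The two strings are in-memory stand-ins for delimited text files (they are not the real dollar_euro.txt), and the helper name parse is only chosen for this illustration. The same two-dimensional data comes out no matter which delimiter character was used to store it.

```python
import csv
import io

# In-memory stand-ins for two delimited text files holding the same data:
# one uses tabs, the other semicolons as the delimiter.
tab_text = "Year\tAverage\n2016\t0.901696\n2015\t0.901896\n"
semicolon_text = tab_text.replace("\t", ";")


def parse(text, delimiter):
    """Split delimited text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


rows_from_tabs = parse(tab_text, "\t")
rows_from_semicolons = parse(semicolon_text, ";")

print(rows_from_tabs)
print(rows_from_tabs == rows_from_semicolons)
```

Only the delimiter argument changes between the two calls; the parsed rows are identical.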

Reading CSV and DSV Files

Pandas offers two ways to read in CSV or DSV files, to be precise:

DataFrame.from_csv
read_csv

There is no big difference between those two functions: they have different default values in some cases, and read_csv has more parameters. We will focus on read_csv, because DataFrame.from_csv was kept inside Pandas only for reasons of backwards compatibility. Note that DataFrame.from_csv was deprecated in pandas 0.21 and removed in pandas 1.0, so current versions provide only read_csv.

    import pandas as pd

    exchange_rates = pd.read_csv("/data1/dollar_euro.txt", sep="\t")
    print(exchange_rates)

OUTPUT:

        Year   Average  Min USD/EUR  Max USD/EUR  Working days
    0   2016  0.901696     0.864379     0.959785           247
    1   2015  0.901896     0.830358     0.947688           256
    2   2014  0.753941     0.716692     0.823655           255
    3   2013  0.753234     0.723903     0.783208           255
    4   2012  0.778848     0.743273     0.827198           256
    5   2011  0.719219     0.671953     0.775855           257
    6   2010  0.755883     0.686672     0.837381           258
    7   2009  0.718968     0.661376     0.796495           256
    8   2008  0.683499     0.625391     0.802568           256
    9   2007  0.730754     0.672314     0.775615           255
    10  2006  0.797153     0.750131     0.845594           255
    11  2005  0.805097     0.740357     0.857118           257
    12  2004  0.804828     0.733514     0.847314           259
    13  2003  0.885766     0.791766     0.963670           255
    14  2002  1.060945     0.953562     1.165773           255
    15  2001  1.117587     1.047669     1.192748           255
    16  2000  1.085899     0.962649     1.211827           255
    17  1999  0.939475     0.848176     0.998502           261

As we can see, read_csv automatically used the first line as the names for the columns. It is possible to give other names to the columns. For this purpose, we have to skip the first line by setting the parameter "header" to 0 and we have to assign a list with the column names to the parameter "names":

    import pandas as pd

    # one name per column in the file; with fewer names than columns,
    # pandas would silently turn the first column into the index
    exchange_rates = pd.read_csv("/data1/dollar_euro.txt",
                                 sep="\t",
                                 header=0,
                                 names=["year", "average", "min", "max", "days"])
    print(exchange_rates)

OUTPUT:

        year   average       min       max  days
    0   2016  0.901696  0.864379  0.959785   247
    1   2015  0.901896  0.830358  0.947688   256
    2   2014  0.753941  0.716692  0.823655   255
    3   2013  0.753234  0.723903  0.783208   255
    4   2012  0.778848  0.743273  0.827198   256
    5   2011  0.719219  0.671953  0.775855   257
    6   2010  0.755883  0.686672  0.837381   258
    7   2009  0.718968  0.661376  0.796495   256
    8   2008  0.683499  0.625391  0.802568   256
    9   2007  0.730754  0.672314  0.775615   255
    10  2006  0.797153  0.750131  0.845594   255
    11  2005  0.805097  0.740357  0.857118   257
    12  2004  0.804828  0.733514  0.847314   259
    13  2003  0.885766  0.791766  0.963670   255
    14  2002  1.060945  0.953562  1.165773   255
    15  2001  1.117587  1.047669  1.192748   255
    16  2000  1.085899  0.962649  1.211827   255
    17  1999  0.939475  0.848176  0.998502   261
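The interplay of "header" and "names" can be tried out without the data file. The following sketch is only an illustration: it feeds a two-row, in-memory excerpt of the exchange-rate data to read_csv through io.StringIO, which stands in for the file. It also shows what happens when the list passed to "names" is one entry shorter than the number of columns.

```python
import io

import pandas as pd

# In-memory excerpt standing in for the tab-delimited file (two data rows only).
text = (
    "Year\tAverage\tMin USD/EUR\tMax USD/EUR\tWorking days\n"
    "2016\t0.901696\t0.864379\t0.959785\t247\n"
    "2015\t0.901896\t0.830358\t0.947688\t256\n"
)

# header=0 tells pandas that line 0 is a header row to be replaced;
# names supplies one new label for each of the five columns.
renamed = pd.read_csv(
    io.StringIO(text),
    sep="\t",
    header=0,
    names=["year", "average", "min", "max", "days"],
)
print(renamed)

# With only four names for five columns, pandas does not raise an error.
# It uses the first column (the years) as the index instead.
shifted = pd.read_csv(
    io.StringIO(text),
    sep="\t",
    header=0,
    names=["average", "min", "max", "days"],
)
print(shifted)
```

Counting the names against the columns of the file is therefore worth the effort: a list that is too short shifts the labels instead of failing loudly.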

Exercise 1

The file "countries_population.csv" is a csv file containing the population numbers of all countries (July 2014). The delimiter of the file is a space, and commas are used to separate groups of thousands in the numbers. The method 'head(n)' of a DataFrame can be used to give out only the first n rows or lines. Read the file into a DataFrame.

Solution:

    pop = pd.read_csv("/data1/countries_population.csv",
                      header=None,
                      names=["Country", "Population"],
                      index_col=0,
                      quotechar="'",
                      sep=" ",
                      thousands=",")
    print(pop.head(5))

OUTPUT:

                    Population
    Country
    China           1355692576
    India           1236344631
    European Union   511434812
    United States    318892103
    Indonesia        253609643
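The roles of "sep", "quotechar" and "thousands" in this solution can be checked on a few lines of in-memory text. The layout of the text below is an assumption about how the file looks (space-delimited, country names containing spaces wrapped in single quotes, commas as thousands separators); the figures are taken from the output above.

```python
import io

import pandas as pd

# Assumed file layout, reproduced in memory for illustration only.
text = (
    "China 1,355,692,576\n"
    "India 1,236,344,631\n"
    "'European Union' 511,434,812\n"
)

pop = pd.read_csv(
    io.StringIO(text),
    header=None,                      # the data has no header line
    names=["Country", "Population"],
    index_col=0,                      # use the country names as the index
    quotechar="'",                    # keeps 'European Union' together as one field
    sep=" ",
    thousands=",",                    # strip the commas and parse the numbers as integers
)
print(pop)
print(pop["Population"].dtype)
```

Without thousands=",", the population column would be read as strings such as "1,355,692,576" rather than as numbers.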

Writing csv Files

We can create csv (or dsv) files with the method "to_csv". Before we do this, we will prepare some data which we will then write to a file. We have two csv files with population data for various countries: countries_male_population.csv contains the figures of the male populations, and countries_female_population.csv correspondingly the numbers for the female populations. We will create a new csv file with the sum:

    column_names = ["Country"] + list(range(2002, 2013))
    male_pop = pd.read_csv("/data1/countries_male_population.csv",
                           header=None,
                           index_col=0,
                           names=column_names)
    female_pop = pd.read_csv("/data1/countries_female_population.csv",
                             header=None,
                             index_col=0,
                             names=column_names)
    population = male_pop + female_pop

	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012
Country
Australia	19640979.0	19872646	20091504	20339759	20605488	21015042	21431781	21874920	22342398	22620554	22683573
Austria	8139310.0	8067289	8140122	8206524	8265925	8298923	8331930	8355260	8375290	8404252	8443018
Belgium	10309725.0	10355844	10396421	10445852	10511382	10584534	10666866	10753080	10839905	10366843	11035958
Canada	NaN	31361611	31372587	31989454	32299496	32649482	32927372	33327337	33334414	33927935	34492645
Czech Republic	10269726.0	10203269	10211455	10220577	10251079	10287189	10381130	10467542	10506813	10532770	10505445
Denmark	5368354.0	5383507	5397640	5411405	5427459	5447084	5475791	5511451	5534738	5560628	5580516
Finland	5194901.0	5206295	5219732	5236611	5255580	5276955	5300484	5326314	5351427	5375276	5401267
France	59337731.0	59630121	59900680	62518571	62998773	63392140	63753140	64366962	64716310	65129746	65394283
Germany	82440309.0	82536680	82531671	82500849	82437995	82314906	82217837	82002356	81802257	81751602	81843743
Greece	10988000.0	11006377	11040650	11082751	11125179	11171740	11213785	11260402	11305118	11309885	11290067
Hungary	10174853.0	10142362	10116742	10097549	10076581	10066158	10045401	10030975	10014324	9985722	9957731
Iceland	286575.0	288471	290570	293577	299891	307672	315459	319368	317630	318452	319575
Ireland	3882683.0	3963636	4027732	4109173	4209019	4239848	4401335	4450030	4467854	4569864	4582769
Italy	56993742.0	57321070	57888245	58462375	58751711	59131287	59619290	60045068	60340328	60626442	60820696
Japan	127291000.0	127435000	127620000	127687000	127767994	127770000	127771000	127692000	127510000	128057000	127799000
Korea	47639618.0	47925318	48082163	48138077	48297184	48456369	48606787	48746693	48874539	49779440	50004441
Luxembourg	444050.0	448300	451600	455000	469086	476187	483799	493500	502066	511840	524853
Mexico	101826249.0	103039964	104213503	103001871	103946866	104874282	105790725	106682518	107550697	108396211	115682867
Netherlands	16105285.0	16192572	16258032	16305526	16334210	16357992	16405399	16485787	16574989	16655799	16730348
New Zealand	3939130.0	4009200	4062500	4100570	4139470	4228280	4268880	4315840	4367740	4405150	4433100
Norway	4524066.0	4552252	4577457	4606363	4640219	4681134	4737171	4799252	4858199	4920305	4985870
Poland	38632453.0	38218531	38190608	38173835	38157055	38125479	38115641	38135876	38167329	38200037	38538447
Portugal	10335559.0	10407465	10474685	10529255	10569592	10599095	10617575	10627250	10637713	10636979	10542398
Slovak Republic	5378951.0	5379161	5380053	5384822	5389180	5393637	5400998	5412254	5424925	5435273	5404322
Spain	40409330.0	41550584	42345342	43038035	43758250	44474631	45283259	45828172	45989016	46152926	46818221
Sweden	8909128.0	8940788	8975670	9011392	9047752	9113257	9182927	9256347	9340682	9415570	9482855
Switzerland	7261210.0	7313853	7364148	7415102	7459128	7508739	7593494	7701856	7785806	7870134	7954662
Turkey	NaN	70171979	70689500	71607500	72519974	72519974	70586256	71517100	72561312	73722988	74724269
United Kingdom	58706905.0	59262057	59699828	60059858	60412870	60781346	61179260	61595094	62026962	62498612	63256154
United States	277244916.0	288774226	290810719	294442683	297308143	300184434	304846731	305127551	307756577	309989078	312232049

    population.to_csv("/data1/countries_total_population.csv")
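What to_csv actually writes can be inspected without touching the file system. The sketch below is an illustration on a two-country, two-year excerpt of the table above: called without a path, to_csv returns the CSV text as a string, and reading that text back shows one detail worth knowing about the round trip.

```python
import io

import pandas as pd

# Small excerpt of the population table, built by hand for illustration.
population = pd.DataFrame(
    {2003: [19872646, 8067289], 2004: [20091504, 8140122]},
    index=pd.Index(["Australia", "Austria"], name="Country"),
)

# Without a path argument, to_csv returns the CSV text instead of writing a file.
csv_text = population.to_csv()
print(csv_text)

# Read the text back, using the first column as the index again.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)

# The year labels were integers before writing; after reading they are strings.
print(restored.columns.tolist())
```

The values survive the round trip unchanged, but the column labels come back as the strings "2003" and "2004", so code that selects columns by integer year has to convert them first.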

We want to create a new DataFrame with all the data, i.e. female, male and complete population. This means that we have to introduce a hierarchical index. Before we do it on our DataFrame, we will introduce this problem in a simple example:

    import pandas as pd

    shop1 = {"foo": {2010: 23, 2011: 25}, "bar": {2010: 13, 2011: 29}}
    shop2 = {"foo": {2010: 223, 2011: 225}, "bar": {2010: 213, 2011: 229}}

    shop1 = pd.DataFrame(shop1)
    shop2 = pd.DataFrame(shop2)
    both_shops = shop1 + shop2
    print("Sales of shop1:\n", shop1)
    print("\nSales of both shops\n", both_shops)

OUTPUT:

    Sales of shop1:
           foo  bar
    2010   23   13
    2011   25   29

    Sales of both shops
           foo  bar
    2010  246  226
    2011  250  258

    shops = pd.concat([shop1, shop2], keys=["one", "two"])
    shops

		foo	bar
one	2010	23	13
one	2011	25	29
two	2010	223	213
two	2011	225	229

We want to swap the hierarchical indices. For this we will use 'swaplevel':

    # swaplevel returns a new DataFrame, so the result has to be assigned
    shops = shops.swaplevel()
    shops.sort_index(inplace=True)
    shops

		foo	bar
2010	one	23	13
2010	two	223	213
2011	one	25	29
2011	two	225	229

We will get back to our initial problem with the population figures. We will apply the same steps to those DataFrames:

    pop_complete = pd.concat([population.T, male_pop.T, female_pop.T],
                             keys=["total", "male", "female"])
    df = pop_complete.swaplevel()
    df.sort_index(inplace=True)
    df[["Austria", "Australia", "France"]]

	Country	Austria	Australia	France
2002	female	4179743.0	9887846.0	30510073.0
	male	3959567.0	9753133.0	28827658.0
	total	8139310.0	19640979.0	59337731.0
2003	female	4158169.0	9999199.0	30655533.0
	male	3909120.0	9873447.0	28974588.0
	total	8067289.0	19872646.0	59630121.0
2004	female	4190297.0	10100991.0	30789154.0
	male	3949825.0	9990513.0	29111526.0
	total	8140122.0	20091504.0	59900680.0
2005	female	4220228.0	10218321.0	32147490.0
	male	3986296.0	10121438.0	30371081.0
	total	8206524.0	20339759.0	62518571.0
2006	female	4246571.0	10348070.0	32390087.0
	male	4019354.0	10257418.0	30608686.0
	total	8265925.0	20605488.0	62998773.0
2007	female	4261752.0	10570420.0	32587979.0
	male	4037171.0	10444622.0	30804161.0
	total	8298923.0	21015042.0	63392140.0
2008	female	4277716.0	10770864.0	32770860.0
	male	4054214.0	10660917.0	30982280.0
	total	8331930.0	21431781.0	63753140.0
2009	female	4287213.0	10986535.0	33208315.0
	male	4068047.0	10888385.0	31158647.0
	total	8355260.0	21874920.0	64366962.0
2010	female	4296197.0	11218144.0	33384930.0
	male	4079093.0	11124254.0	31331380.0
	total	8375290.0	22342398.0	64716310.0
2011	female	4308915.0	11359807.0	33598633.0
	male	4095337.0	11260747.0	31531113.0
	total	8404252.0	22620554.0	65129746.0
2012	female	4324983.0	11402769.0	33723892.0
	male	4118035.0	11280804.0	31670391.0
	total	8443018.0	22683573.0	65394283.0

    df.to_csv("/data1/countries_total_population.csv")
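A file written from a DataFrame with a hierarchical index stores both index levels as ordinary leading columns, so they have to be named again when the file is read back. The following sketch illustrates this on the 2002 figures for Austria and Australia from the table above. The level names "year" and "group" are assumptions made for this example, and the CSV text is kept in memory instead of being written to /data1.

```python
import io

import pandas as pd

# Two-level index (year, group) for one year, built by hand for illustration.
index = pd.MultiIndex.from_product(
    [[2002], ["female", "male", "total"]], names=["year", "group"]
)
df = pd.DataFrame(
    {
        "Austria": [4179743.0, 3959567.0, 8139310.0],
        "Australia": [9887846.0, 9753133.0, 19640979.0],
    },
    index=index,
)

# Both index levels are written as the first two columns of the CSV text.
csv_text = df.to_csv()
print(csv_text)

# index_col=[0, 1] rebuilds the two-level index from those two columns.
restored = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])
print(restored.loc[(2002, "total"), "Austria"])
```

With index_col=0 alone, only the year would become the index and "group" would stay behind as a regular data column.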

Exercise 2

Read in the dsv file (csv) bundeslaender.txt. Create a new file with the columns 'land', 'area', 'female', 'male', 'population' and 'density' (inhabitants per square kilometre).
Print out the rows where the area is greater than 30000 and the population is greater than 10000.
Print the rows where the density is greater than 300.

    lands = pd.read_csv('/data1/bundeslaender.txt', sep=" ")
    print(lands.columns.values)

OUTPUT:

    ['land' 'area' 'male' 'female']

    # swap the columns of our DataFrame:
    lands = lands.reindex(columns=['land', 'area', 'female', 'male'])
    lands[:2]

	land	area	female	male
0	Baden-Württemberg	35751.65	5465	5271
1	Bayern	70551.57	6366	6103

    lands.insert(loc=len(lands.columns),
                 column='population',
                 value=lands['female'] + lands['male'])
    lands[:3]

	land	area	female	male	population
0	Baden-Württemberg	35751.65	5465	5271	10736
1	Bayern	70551.57	6366	6103	12469
2	Berlin	891.85	1736	1660	3396

    lands.insert(loc=len(lands.columns),
                 column='density',
                 value=(lands['population'] * 1000 / lands['area']).round(0))
    lands[:4]

	land	area	female	male	population	density
0	Baden-Württemberg	35751.65	5465	5271	10736	300.0
1	Bayern	70551.57	6366	6103	12469	177.0
2	Berlin	891.85	1736	1660	3396	3808.0
3	Brandenburg	29478.61	1293	1267	2560	87.0

    print(lands.loc[(lands.area > 30000) & (lands.population > 10000)])

OUTPUT:

                      land      area  female  male  population  density
    0    Baden-Württemberg  35751.65    5465  5271       10736    300.0
    1               Bayern  70551.57    6366  6103       12469    177.0
    9  Nordrhein-Westfalen  34085.29    9261  8797       18058    530.0
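The last part of the exercise, the rows with a density greater than 300, follows the same pattern. Since bundeslaender.txt is not reproduced here, the sketch below rebuilds the first four rows shown above by hand; everything else mirrors the solution so far.

```python
import pandas as pd

# The first four rows of the table above, entered by hand for illustration.
lands = pd.DataFrame(
    {
        "land": ["Baden-Württemberg", "Bayern", "Berlin", "Brandenburg"],
        "area": [35751.65, 70551.57, 891.85, 29478.61],
        "population": [10736, 12469, 3396, 2560],
    }
)

# The population figures are multiplied by 1000, as in the solution above,
# so the density is given in inhabitants per square kilometre.
lands["density"] = (lands["population"] * 1000 / lands["area"]).round(0)

# A boolean mask selects the rows where the condition holds.
dense = lands.loc[lands["density"] > 300]
print(dense)
```

Of these four rows only Berlin is selected. Baden-Württemberg has a rounded density of exactly 300.0 and is therefore excluded by the strict comparison.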

Reading and Writing Excel Files

It is also possible to read and write Microsoft Excel files. The Pandas functionalities to read and write Excel files use the modules 'xlrd' (for the old .xls format) and 'openpyxl' (for .xlsx files). These modules are not automatically installed by Pandas, so you may have to install them manually!

We will use a simple Excel document to demonstrate the reading capabilities of Pandas. The document sales.xls contains two sheets, one called 'week1' and the other one 'week2'.
An Excel file can be read in with the Pandas function "read_excel". This is demonstrated in the following example Python code:

    excel_file = pd.ExcelFile("/data1/sales.xls")
    sheet = pd.read_excel(excel_file)
    sheet

	Weekday	Sales
0	Monday	123432.980000
1	Tuesday	122198.650200
2	Wednesday	134418.515220
3	Thursday	131730.144916
4	Friday	128173.431003

The document "sales.xls" contains two sheets, but we have only been able to read in the first one with "read_excel". A complete Excel document, which can consist of an arbitrary number of sheets, can be completely read in like this:

    docu = {}
    for sheet_name in excel_file.sheet_names:
        docu[sheet_name] = excel_file.parse(sheet_name)

    for sheet_name in docu:
        print("\n" + sheet_name + ":\n", docu[sheet_name])

OUTPUT:

    week1:
          Weekday          Sales
    0     Monday  123432.980000
    1    Tuesday  122198.650200
    2  Wednesday  134418.515220
    3   Thursday  131730.144916
    4     Friday  128173.431003

    week2:
          Weekday          Sales
    0     Monday  223277.980000
    1    Tuesday  234441.879000
    2  Wednesday  246163.972950
    3   Thursday  241240.693491
    4     Friday  230143.621590

We will now calculate the average sales numbers of the two weeks:

    average = docu["week1"].copy()
    average["Sales"] = (docu["week1"]["Sales"] + docu["week2"]["Sales"]) / 2
    print(average)

OUTPUT:

         Weekday          Sales
    0     Monday  173355.480000
    1    Tuesday  178320.264600
    2  Wednesday  190291.244085
    3   Thursday  186485.419203
    4     Friday  179158.526297
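Adding the two "Sales" columns works because both sheets list the weekdays in the same order. As an alternative sketch (not the approach used above), the sheets can be stacked and averaged per weekday, which also works for more than two sheets. The two small DataFrames below are hand-built stand-ins for the first two rows of each sheet, so no Excel file is needed to run the example.

```python
import pandas as pd

# Hand-built stand-ins for the first two rows of the sheets 'week1' and 'week2'.
docu = {
    "week1": pd.DataFrame(
        {"Weekday": ["Monday", "Tuesday"], "Sales": [123432.98, 122198.6502]}
    ),
    "week2": pd.DataFrame(
        {"Weekday": ["Monday", "Tuesday"], "Sales": [223277.98, 234441.879]}
    ),
}

# Stack all sheets on top of each other, then average the sales per weekday.
# sort=False keeps the weekdays in their order of appearance.
stacked = pd.concat(list(docu.values()))
average = stacked.groupby("Weekday", sort=False)["Sales"].mean().reset_index()
print(average)
```

Grouping by the weekday name matches rows by their label rather than by their position, so a sheet with its rows in a different order would still be averaged correctly.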

We will save the DataFrame 'average' in a new document, with 'week1' and 'week2' as additional sheets as well:

    writer = pd.ExcelWriter('/data1/sales_average.xlsx')
    docu['week1'].to_excel(writer, sheet_name='week1')
    docu['week2'].to_excel(writer, sheet_name='week2')
    average.to_excel(writer, sheet_name='average')
    # close() writes the workbook to disk; the separate save() method
    # was removed in pandas 2.0
    writer.close()

Source: https://python-course.eu/numerical-programming/reading-and-writing-data-in-pandas.php
