Difference between revisions of "Previews Parsing"
Gskluzacek (Talk | contribs) (added get filename logic) |
Gskluzacek (Talk | contribs) (added get files sections) |
||
| Line 418: | Line 418: | ||
Reformat & return the filename as: "cof_YYYY_MM.ext" | Reformat & return the filename as: "cof_YYYY_MM.ext" | ||
return 'cof_{}_{}.{}'.format(yyyy, mm, ext) | return 'cof_{}_{}.{}'.format(yyyy, mm, ext) | ||
| + | |||
| + | ==== get_text_file ==== | ||
| + | |||
| + | pass in: url, file_path, validate_only=False | ||
| + | |||
| + | create a get request to fetch the file | ||
| + | with requests.get(url) as req: | ||
| + | |||
| + | Checks to make sure the Content-Type is plain/text | ||
| + | if req.headers.get('Content-Type') != 'text/plain': | ||
| + | raise Exception(...) | ||
| + | |||
| + | sets the request encoding to the Apparent Encoding (The apparent encoding, provided by the [https://pypi.org/project/chardet/ chardet library]) | ||
| + | req.encoding = req.apparent_encoding | ||
| + | |||
| + | Validates text is greater than 100 characters and validates the first line or first and second lines | ||
| + | if len(text) < 100: | ||
| + | raise Exception("...) | ||
| + | |||
| + | if req.text[:8] != 'PREVIEWS' and req.text[:4] != 'PAGE' and req.text[2:10] != 'PREVIEWS': | ||
| + | raise Exception(...) | ||
| + | |||
| + | Save the file if the validate_only flag is set to false: | ||
| + | if not validate_only: | ||
| + | fn = path.join(file_path, get_file_name(req)) | ||
| + | with open(fn, 'wt', encoding='UTF-8') as fh: | ||
| + | fh.write(text) | ||
| + | |||
| + | ==== get_pdf_file ==== | ||
| + | |||
| + | pass in: url, file_path, validate_only=False | ||
| + | |||
| + | create a get request to fetch the file | ||
| + | with requests.get(url) as req: | ||
| + | |||
| + | Checks to make sure the Content-Type is application/pdf | ||
| + | if req.headers.get('Content-Type') != 'application/pdf': | ||
| + | raise Exception(...) | ||
| + | |||
| + | Validates text is greater than 100 characters and validates the first line | ||
| + | if len(text) < 100: | ||
| + | raise Exception("...) | ||
| + | |||
| + | if req.content[:4] != b'%PDF': | ||
| + | raise Exception(...) | ||
| + | |||
| + | Save the file if the validate_only flag is set to false: | ||
| + | if not validate_only: | ||
| + | fn = path.join(file_path, get_file_name(req)) | ||
| + | with open(fn, 'wb') as fh: | ||
| + | fh.write(req.content) | ||
=== Parsing of Loaded Data === | === Parsing of Loaded Data === | ||
Revision as of 17:32, 22 October 2018
Contents
Previews Parsing
Purpose
To take the previews order form and parse its contents into database tables
The Previews Web Site
The home page is located at previewsworld.com
They now have a digital version of previews which you can view on the web site or on a mobile app. Each issue is $3.99
Customer Order Form (COF)
The Customer Order Form (COF) can be downloaded in Text or PDF format from the Archive page. The have issues as far back as Jan 2012 on the archive page itself. However, it is posible to request the COFs between JAN 2010 and DEC 2011, by manually entering the URL.
The the URL for each COF follows the format below:
https://www.previewsworld.com/Catalog/CustomerOrderForm/<format>/<MONYY>
where <format> is either PDF or TXT
and <MONYY> is the 3 letter month abbreviation and the 2 digit year.
Current Trends
The current trend is
- LATIN-1 (or US-ASCII) encoding of the text
- No blank lines before the FILE-HEADER line
- No leading or trailing white space for the FILE-HEADER line
- The month is spelled out
- VOL. used as the abbreviation for Volume Number
- Issue Number not left padded with Zeros
- Two (2) blank lines after the FILE-HEADER line and before the PAGE line
- No leading space for the PAGE line
- No tailing 5 TAB characters for first PAGE line
Text Encoding
Nearly all texted format COFs are encoded with the US-ASCII (56*) encoding or LATIN-1, also known as ISO-8859-1 (43*) encoding with the following exceptions (* as of 2018/11 with 107 total issues from 2010/01 to 2018/11):
- Windows 1252 (6*): FEB15, NOV14, AUG14, FEB14, DEC12, MAY12
- UTF-8 with BOM (1*): DEC15
- UTF-16 \[LE] (1*): JUN17
Note:
- See differences between US-ASCII, LATIN-1 and WINDOWS-1256 Encodings below
File Layout General
The layout (with some exceptions) generally consists of a FILE-HEADER Line on the first line of the file, followed by some number of blank lines, followed by a PAGE line. As of 2018/11, with 119 issues from 2009/01 to 2018/11:
- The majority (64) have 2 blank lines between the FILE-HEADER line and the first PAGE line (see item 2 below for AUG13; see item 3 below for NOV14)
- 37 have 1 blank line (see item 2 below for JAN2013, DEC12; see item 4 below for MAR10)
- 16 have 3 blank lines
- 1 has 1 blank line
- 1 did not have any FILE-HEADER line (see item 1 below for AUG17)
exceptions:
- AUG17 - no FILE-HEADER line
- AUG13, JAN13, DEC12 - OTHER line types after the header line but before the first PAGE line
- NOV14 - PAGE line with with PAGE specified as 'AG' instead of 'PAGE'
- MAR10 - 1 BLANK line before the HEADER line
FILE-HEADER Line Format
FILE-HEADER line format
- Constant: 'PREVIEWS' starting in column 1
- Followed by 1 space character
- Followed by either the Month Name or Month Abbreviation (3 characters)
- Followed by 1 space
- Constant: either 'VOL' (with or without a trailing period) or 'V'
- Followed by 1 space if VOL or VOL. or no spaces if V
- Followed by the Volume Number: a 2 digit number that is equal to the issue year minus 1990 (yr - 1990 = vol_nbr)
- Followed by 1 space
- Followed by a pound sign '#'
- Followed by the Issue Number (no intervening spaces): a 1 or 2 digit number, for issues in 2009, with values less than 10, are left padded with a zero (see notes below)
- No trailing spaces
notes:
- JUL14 - wrong issue number of 6 given, it should have been 7
- JAN10 - wrong Month abbreviation of DEC given, it should have been JAN
- AUG10 - wrong Month Name of JULY given, it should have been AUGUST
- JAN12 - wrong Month Name of DECEMBER given, it should have been JANUARY
- JAN13 - wrong Month Name of DECEMBER given, it should have been JANUARY
PAGE Line Format
PAGE line format
- Constant: 'PAGE' starting in column 1
- Followed by 1 space character
- Followed by the Page Number (not left padded with zeroes)
- Followed by either no trailing white space (first PAGE line of the file) or 5 trailing TAB characters (all other PAGE lines in the file) (see notes below)
notes:
- JAN13 - this issue has non blank lines between the FILE-HEADER line and the PAGE line, which looks to be the reason why its first PAGE line has 5 trailing TAB characters. So its probably correct to assume that this issue is missing the first PAGE line.
- JUL13 - the first PAGE line for this issue contains 5 trailing TAB characters.
- NOV14 - PAGE line with with PAGE specified as 'AG' instead of 'PAGE'
File Locations
I have down loaded some of the text format COFs and have them located here JAN 2009 thru APR 2013
I also downloaded some of the PDF format COFs which are located here: JAN 2009 thru DEC 2010, SEP 2011, JAN 2012 thru OCT 2012 and JAN 2013 thru APR 2013
I also have a full compliment (both PDF and TEXT format from JAN09 to NOV18) of files on my local 5K iMac in the following directory: /Users/gregskluzacek/Documents/Development/Python/PreviewsParsing/downloads
Data Analysis
the following provides a summary of the FILE HEADER line and first PAGE line of each file
| Month | Year | File Orig | File New | Encoding | Blanks Before | Blanks After | Month Abbr | 'VOL. ', 'VOL ', 'V' | Iss Nbr Padded | Page Hdr / Other |
|---|---|---|---|---|---|---|---|---|---|---|
| 2018 | 11 | NOV18_COF.txt | cof_2018_11.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2018 | 10 | OCT18_COF.txt | cof_2018_10.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2018 | 09 | SEP18_COF.txt | cof_2018_09.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2018 | 08 | AUG18_COF.txt | cof_2018_08.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2018 | 07 | JUL18_COF.txt | cof_2018_07.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2018 | 06 | JUN18_COF.txt | cof_2018_06.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2018 | 05 | MAY18_COF.txt | cof_2018_05.txt | LATIN-1 | 0 | 3 | name | VOL. (space) | no | |
| 2018 | 04 | APR18_COF.txt | cof_2018_04.txt | LATIN-1 | 0 | 4 | name | VOL. (space) | no | 4 blank lines after FILE HEADER line |
| 2018 | 03 | MAR18_COF.txt | cof_2018_03.txt | US-ASCII | 0 | 3 | name | VOL. (space) | no | |
| 2018 | 02 | FEB18_COF.txt | cof_2018_02.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2018 | 01 | JAN18_COF.txt | cof_2018_01.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2017 | 12 | DEC17_COF.txt | cof_2017_12.txt | LATIN-1 | 0 | 3 | name | VOL. (space) | no | |
| 2017 | 11 | NOV17_COF.txt | cof_2017_11.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2017 | 10 | OCT17_COF.txt | cof_2017_10.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2017 | 09 | SEP17_COF.txt | cof_2017_09.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2017 | 08 | AUG17_COF.txt | cof_2017_08.txt | US-ASCII | no file header line | no file header line | no file header line | no file header line | no file header line | no file header line |
| 2017 | 07 | JUL17_COF.txt | cof_2017_07.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2017 | 06 | JUN17_COF.txt | cof_2017_06.txt | UTF-16-LE | 0 | 2 | name | VOL. (space) | no | |
| 2017 | 05 | MAY17_COF.txt | cof_2017_05.txt | LATIN-1 | 0 | 3 | name | VOL. (space) | no | |
| 2017 | 04 | APR17_COF.txt | cof_2017_04.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2017 | 03 | MAR17_COF.txt | cof_2017_03.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2017 | 02 | FEB17_COF.txt | cof_2017_02.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2017 | 01 | JAN17_COF.txt | cof_2017_01.txt | US-ASCII | 0 | 3 | name | VOL. (space) | no | |
| 2016 | 12 | DEC16_COF.txt | cof_2016_12.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2016 | 11 | NOV16_COF.txt | cof_2016_11.txt | US-ASCII | 0 | 3 | name | VOL. (space) | no | |
| 2016 | 10 | OCT16_COF.txt | cof_2016_10.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2016 | 09 | SEP16_COF.txt | cof_2016_09.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2016 | 08 | AUG16_COF.txt | cof_2016_08.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2016 | 07 | JUL16_COF.txt | cof_2016_07.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2016 | 06 | JUN16_COF.txt | cof_2016_06.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2016 | 05 | MAY16_COF.txt | cof_2016_05.txt | US-ASCII | 0 | 3 | name | VOL. (space) | no | |
| 2016 | 04 | APR16_COF.txt | cof_2016_04.txt | US-ASCII | 0 | 3 | name | VOL. (space) | no | |
| 2016 | 03 | MAR16_COF.txt | cof_2016_03.txt | US-ASCII | 0 | 3 | name | VOL. (space) | no | |
| 2016 | 02 | FEB16_COF.txt | cof_2016_02.txt | US-ASCII | 0 | 3 | name | VOL. (space) | no | |
| 2016 | 01 | JAN16_COF.txt | cof_2016_01.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2015 | 12 | DEC15_COF.txt | cof_2015_12.txt | UTF-8-SIG | 0 | 2 | name | VOL. (space) | no | |
| 2015 | 11 | NOV15_COF.txt | cof_2015_11.txt | US-ASCII | 0 | 3 | name | VOL. (space) | no | |
| 2015 | 10 | OCT15_COF.txt | cof_2015_10.txt | LATIN-1 | 0 | 3 | name | VOL. (space) | no | |
| 2015 | 09 | SEP15_COF.txt | cof_2015_09.txt | LATIN-1 | 0 | 3 | name | VOL. (space) | no | |
| 2015 | 08 | AUG15_COF.txt | cof_2015_08.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2015 | 07 | JUL15_COF.txt | cof_2015_07.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2015 | 06 | JUN15_COF.txt | cof_2015_06.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2015 | 05 | MAY15_COF.txt | cof_2015_05.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2015 | 04 | APR15_COF.txt | cof_2015_04.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2015 | 03 | MAR15_COF.txt | cof_2015_03.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2015 | 02 | FEB15_COF.txt | cof_2015_02.txt | WINOWS-1252 | 0 | 2 | name | VOL. (space) | no | |
| 2015 | 01 | JAN15_COF.txt | cof_2015_01.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2014 | 12 | DEC14_COF.txt | cof_2014_12.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2014 | 11 | NOV14_COF.txt | cof_2014_11.txt | WINOWS-1252 | 0 | 2 | name | VOL. (space) | no | PAGE mis-specified as AG |
| 2014 | 10 | OCT14_COF.txt | cof_2014_10.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2014 | 09 | SEP14_COF.txt | cof_2014_09.txt | US-ASCII | 0 | 1 | name | VOL. (space) | no | |
| 2014 | 08 | AUG14_COF.txt | cof_2014_08.txt | WINOWS-1252 | 0 | 2 | name | VOL. (space) | no | |
| 2014 | 07 | JUL14_COF.txt | cof_2014_07.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | incorrect issue number of 6, should be 7 |
| 2014 | 06 | JUN14_COF.txt | cof_2014_06.txt | US-ASCII | 0 | 3 | name | VOL. (space) | no | |
| 2014 | 05 | MAY14_COF.txt | cof_2014_05.txt | LATIN-1 | 0 | 1 | name | VOL. (space) | no | |
| 2014 | 04 | APR14_COF.txt | cof_2014_04.txt | LATIN-1 | 0 | 2 | abr | VOL. (space) | no | |
| 2014 | 03 | MAR14_COF.txt | cof_2014_03.txt | US-ASCII | 0 | 2 | abr | VOL. (space) | no | |
| 2014 | 02 | FEB14_COF.txt | cof_2014_02.txt | WINOWS-1252 | 0 | 1 | abr | VOL. (space) | no | |
| 2014 | 01 | JAN14_COF.txt | cof_2014_01.txt | US-ASCII | 0 | 2 | abr | VOL. (space) | no | |
| 2013 | 12 | DEC13_COF.txt | cof_2013_12.txt | LATIN-1 | 0 | 2 | abr | VOL. (space) | no | |
| 2013 | 11 | NOV13_COF.txt | cof_2013_11.txt | US-ASCII | 0 | 2 | abr | VOL. (space) | no | |
| 2013 | 10 | OCT13_COF.txt | cof_2013_10.txt | US-ASCII | 0 | 1 | abr | VOL. (space) | no | |
| 2013 | 09 | SEP13_COF.txt | cof_2013_09.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2013 | 08 | AUG13_COF.txt | cof_2013_08.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | 1 non blank line before PAGE line |
| 2013 | 07 | JUL13_COF.txt | cof_2013_07.txt | US-ASCII | 0 | 1 | name | VOL. (space) | no | first PAGE line has 5 trailing TAB characters |
| 2013 | 06 | JUN13_COF.txt | cof_2013_06.txt | US-ASCII | 0 | 1 | name | VOL. (space) | no | |
| 2013 | 05 | MAY13_COF.txt | cof_2013_05.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2013 | 04 | APR13_COF.txt | cof_2013_04.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2013 | 03 | MAR13_COF.txt | cof_2013_03.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2013 | 02 | FEB13_COF.txt | cof_2013_02.txt | US-ASCII | 0 | 1 | name | VOL. (space) | no | |
| 2013 | 01 | JAN13_COF.txt | cof_2013_01.txt | US-ASCII | 0 | 1 | name | VOL. (space) | no | 2 non blank lines before PAGE line; incorrect month of DECEMBER, should be JANUARY; first PAGE line has 5 trailing TAB characters |
| 2012 | 12 | DEC12_COF.txt | cof_2012_12.txt | WINOWS-1252 | 0 | 1 | name | VOL. (space) | no | 25 non blank lines before PAGE line |
| 2012 | 11 | NOV12_COF.txt | cof_2012_11.txt | US-ASCII | 0 | 1 | name | VOL. (space) | no | |
| 2012 | 10 | OCT12_COF.txt | cof_2012_10.txt | US-ASCII | 0 | 1 | name | VOL. (space) | no | |
| 2012 | 09 | SEP12_COF.txt | cof_2012_09.txt | US-ASCII | 0 | 1 | name | VOL. (space) | no | |
| 2012 | 08 | AUG12_COF.txt | cof_2012_08.txt | US-ASCII | 0 | 1 | name | VOL. (space) | no | |
| 2012 | 07 | JUL12_COF.txt | cof_2012_07.txt | US-ASCII | 0 | 3 | name | VOL. (space) | no | |
| 2012 | 06 | JUN12_COF.txt | cof_2012_06.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2012 | 05 | MAY12_COF.txt | cof_2012_05.txt | WINOWS-1252 | 0 | 2 | name | VOL. (space) | no | |
| 2012 | 04 | APR12_COF.txt | cof_2012_04.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2012 | 03 | MAR12_COF.txt | cof_2012_03.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2012 | 02 | FEB12_COF.txt | cof_2012_02.txt | LATIN-1 | 0 | 1 | name | VOL. (space) | no | |
| 2012 | 01 | JAN12_COF.txt | cof_2012_01.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | incorrect month of DECEMBER, should be JANUARY |
| 2011 | 12 | DEC11_COF.txt | cof_2011_12.txt | LATIN-1 | 0 | 1 | name | VOL. (space) | no | |
| 2011 | 11 | NOV11_COF.txt | cof_2011_11.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2011 | 10 | OCT11_COF.txt | cof_2011_10.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2011 | 09 | SEP11_COF.txt | cof_2011_09.txt | US-ASCII | 0 | 3 | name | VOL. (space) | no | |
| 2011 | 08 | AUG11_COF.txt | cof_2011_08.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2011 | 07 | JUL11_COF.txt | cof_2011_07.txt | LATIN-1 | 0 | 2 | name | VOL. (space) | no | |
| 2011 | 06 | JUN11_COF.txt | cof_2011_06.txt | US-ASCII | 0 | 2 | name | VOL. (space) | no | |
| 2011 | 05 | MAY11_COF.txt | cof_2011_05.txt | US-ASCII | 0 | 1 | name | VOL. (space) | no | |
| 2011 | 04 | APR11_COF.txt | cof_2011_04.txt | LATIN-1 | 0 | 1 | name | VOL. (space) | no | |
| 2011 | 03 | MAR11_COF.txt | cof_2011_03.txt | LATIN-1 | 0 | 1 | name | VOL. (space) | no | |
| 2011 | 02 | FEB11_COF.txt | cof_2011_02.txt | LATIN-1 | 0 | 1 | name | VOL. (space) | no | |
| 2011 | 01 | JAN11_COF.txt | cof_2011_01.txt | LATIN-1 | 0 | 1 | name | VOL. (space) | no | |
| 2010 | 12 | DEC10_COF.txt | cof_2010_12.txt | LATIN-1 | 0 | 1 | name | V (no space) | no | |
| 2010 | 11 | NOV10_COF.txt | cof_2010_11.txt | LATIN-1 | 0 | 2 | name | V (no space) | no | |
| 2010 | 10 | OCT10_COF.txt | cof_2010_10.txt | US-ASCII | 0 | 1 | name | V (no space) | no | |
| 2010 | 09 | SEP10_COF.txt | cof_2010_09.txt | LATIN-1 | 0 | 2 | name | V (no space) | no | |
| 2010 | 08 | AUG10_COF.txt | cof_2010_08.txt | US-ASCII | 0 | 2 | name | V (no space) | no | incorrect month of JULY, should be AUGUST |
| 2010 | 07 | JUL10_COF.txt | cof_2010_07.txt | LATIN-1 | 0 | 2 | name | V (no space) | no | |
| 2010 | 06 | JUN10_COF.txt | cof_2010_06.txt | US-ASCII | 0 | 2 | name | V (no space) | no | |
| 2010 | 05 | MAY10_COF.txt | cof_2010_05.txt | US-ASCII | 0 | 2 | name | V (no space) | no | |
| 2010 | 04 | APR10_COF.txt | cof_2010_04.txt | LATIN-1 | 0 | 1 | abr | V (no space) | no | |
| 2010 | 03 | MAR10_COF.txt | cof_2010_03.txt | US-ASCII | 1 | 1 | abr | V (no space) | no | 1 blank line before FILE HEADER line |
| 2010 | 02 | FEB10_COF.txt | cof_2010_02.txt | LATIN-1 | 0 | 1 | abr | V (no space) | no | |
| 2010 | 01 | JAN10_COF.txt | cof_2010_01.txt | LATIN-1 | 0 | 1 | abr | V (no space) | no | incorrect month abbr of DEC, should be JAN |
| 2009 | 12 | DEC09_COF.txt | cof_2009_12.txt | unknown | 0 | 1 | abr | VOL (space no period) | no | |
| 2009 | 11 | NOV09_COF.txt | cof_2009_11.txt | unknown | 0 | 2 | abr | VOL (space no period) | no | |
| 2009 | 10 | OCT09_COF.txt | cof_2009_10.txt | unknown | 0 | 1 | abr | VOL (space no period) | no | |
| 2009 | 09 | SEP09_COF.txt | cof_2009_09.txt | unknown | 0 | 1 | abr | VOL (space no period) | yes | |
| 2009 | 08 | AUG09_COF.txt | cof_2009_08.txt | unknown | 0 | 1 | abr | VOL (space no period) | yes | |
| 2009 | 07 | JUL09_COF.txt | cof_2009_07.txt | unknown | 0 | 1 | abr | VOL (space no period) | yes | |
| 2009 | 06 | JUN09_COF.txt | cof_2009_06.txt | unknown | 0 | 1 | abr | VOL (space no period) | yes | |
| 2009 | 05 | MAY09_COF.txt | cof_2009_05.txt | unknown | 0 | 1 | abr | VOL (space no period) | yes | |
| 2009 | 04 | APR09_COF.txt | cof_2009_04.txt | unknown | 0 | 1 | abr | VOL (space no period) | yes | |
| 2009 | 03 | MAR09_COF.txt | cof_2009_03.txt | unknown | 0 | 1 | abr | VOL (space no period) | yes | |
| 2009 | 02 | FEB09_COF.txt | cof_2009_02.txt | unknown | 0 | 1 | abr | VOL (space no period) | yes | |
| 2009 | 01 | JAN09_COF.txt | cof_2009_01.txt | unknown | 0 | 1 | abr | VOL (space no period) | yes |
High Level Functions
File Loader
Main
set the base URL and Path to store the downloaded files
url_base = 'https://www.previewsworld.com/Catalog/CustomerOrderForm/{}/{}' path_name = '/Users/gregskluzacek/Documents/Development/Python/PreviewsParsing/downloads'
Set the latest date (current month) to process and the number of months to process (working backwards).
beg_dt = datetime(2018, 11, 1) months = 108
Iterate over each month
for i in range(1, months + 1):
set the date string of the URL
dt_str = beg_dt.strftime('%b%y').upper()
set the URLs for the TEXT and PDF formates
url = url_base.format('TXT', dt_str, True)
url = url_base.format('PDF', dt_str, True)
in separate try except blocks call
get_text_file(url, path_name)
get_pdf_file(url, path_name)
at the end of the loop, decrement the date variables
beg_dt = beg_dt - relativedelta(months=1)
get_file_name
Passed a requests request-response object, and tries to get the Content-Disposition header from the response
hdr = r.headers['Content-Disposition']
The Content-Disposition header has the format of: "attachment; filename=MMMYY_COF.ext" where:
- MMM - is the 3 letter Month Abbreviation,
- YY - is the 2 digit year, and
- ext - is the file extension, either txt or pdf
Parse the 'attachment;' portion of the header and validate that its does equal 'attachment'
vals_lvl_1 = hdr.split(';')
if len(vals_lvl_1) != 2 or vals_lvl_1[0] != 'attachment':
raise Exception(...)
Parse the filename key value and validate the key does equal 'filename'
vals_lvl_2 = vals_lvl_1[1].split('=')
if len(vals_lvl_2) != 2 or vals_lvl_2[0].strip() != 'filename':
raise Exception(...)
Get the first 3 characters (the Month Abbreviation) and try to convert it to a valid numeric month
orig_file = vals_lvl_2[1]
month_txt = orig_file[:3]
try:
month_num = time.strptime(month_txt, '%b').tm_mon
except ValueError as err:
raise Exception(...)
1) Convert the numeric month into a left padded, zero filled string (e.g., 01, 02, 03... 09, 10, 11, 12) -- 2) add the century onto the beginning of the 2 digit year -- 3) and get the file extension (txt or pdf)
mm = str(month_num).zfill(2) yyyy = '20' + orig_file[3:5] ext = orig_file[-3:]
Reformat & return the filename as: "cof_YYYY_MM.ext"
return 'cof_{}_{}.{}'.format(yyyy, mm, ext)
get_text_file
pass in: url, file_path, validate_only=False
create a get request to fetch the file
with requests.get(url) as req:
Checks to make sure the Content-Type is plain/text
if req.headers.get('Content-Type') != 'text/plain':
raise Exception(...)
sets the request encoding to the Apparent Encoding (The apparent encoding, provided by the chardet library)
req.encoding = req.apparent_encoding
Validates text is greater than 100 characters and validates the first line or first and second lines
if len(text) < 100:
raise Exception("...)
if req.text[:8] != 'PREVIEWS' and req.text[:4] != 'PAGE' and req.text[2:10] != 'PREVIEWS':
raise Exception(...)
Save the file if the validate_only flag is set to false:
if not validate_only:
fn = path.join(file_path, get_file_name(req))
with open(fn, 'wt', encoding='UTF-8') as fh:
fh.write(text)
get_pdf_file
pass in: url, file_path, validate_only=False
create a get request to fetch the file
with requests.get(url) as req:
Checks to make sure the Content-Type is application/pdf
if req.headers.get('Content-Type') != 'application/pdf':
raise Exception(...)
Validates text is greater than 100 characters and validates the first line
if len(text) < 100:
raise Exception("...)
if req.content[:4] != b'%PDF':
raise Exception(...)
Save the file if the validate_only flag is set to false:
if not validate_only:
fn = path.join(file_path, get_file_name(req))
with open(fn, 'wb') as fh:
fh.write(req.content)
Parsing of Loaded Data
Differences Between Encodings
US-ASCII
Basic character set which uses codes from hex 00 thru 7F. Codes between hex 80 and FF are undefined
| HEX | Char | Description |
|---|---|---|
| 00 | NUL | Null |
| 01 | SOH | Start Of Heading |
| 02 | STX | Start Of Text |
| 03 | ETX | End Of Text |
| 04 | EOT | End of Transmission |
| 05 | ENQ | Enquiry |
| 06 | ACK | Acknowledgement |
| 07 | BEL | Bell |
| 08 | BS | Backsapce |
| 09 | HT | Horizontal Tab |
| 0A | LF | Line Feed |
| 0B | VT | Vertical Tab |
| 0C | FF | Form Feed |
| 0D | CR | Carriage Return |
| E0 | SO | Shift Out |
| 0F | SI | Shift In |
| 10 | DLE | Data Link Escape |
| 11 | DC1 | Xon (device control 1) |
| 12 | DC2 | (device control 2) |
| 13 | DC3 | Xoff (device control 3) |
| 14 | DC4 | (device control 4) |
| 15 | NAK | Negative Acknowledgement |
| 16 | SYN | Synchronous Idle |
| 17 | ETB | End Of Transmission Block |
| 18 | CAN | Cancel |
| 19 | EM | End Of Medium |
| 1A | SUB | Substitute |
| 1B | ESC | Escape |
| 1C | FS | File separator |
| 1D | GS | Group Separator |
| 1E | RS | Record Separator |
| 1F | US | Unit Separator |
| 20 | SP | Space |
| 21 | ! | Exclamation Point |
| 22 | " | Double Quote |
| 23 | # | Pound Sign |
| 24 | $ | Dollar Sign (currency) |
| 25 | % | Per-Cent |
| 26 | & | Ampersand |
| 27 | ' | Single Quote (Apostrophe) |
| 28 | ( | Parentheses Left |
| 29 | ) | Parentheses Right |
| 2A | * | Asterisk |
| 2B | + | Plus Sign |
| 2C | , | Comma |
| 2D | - | Dash or Minus Sign (math) |
| 2E | . | Period |
| 2F | / | Forward Slash |
| 30 | 0 | |
| 31 | 1 | |
| 32 | 2 | |
| 33 | 3 | |
| 34 | 4 | |
| 35 | 5 | |
| 36 | 6 | |
| 37 | 7 | |
| 38 | 8 | |
| 39 | 9 | |
| 3A | : | Colon |
| 3B | ; | Semi Colon |
| 3C | < | Less Than Sign (math) |
| 3D | = | Equal Sign (math) |
| 3E | > | Greater Than Sign (math) |
| 3F | ? | Question Mark |
| 40 | @ | At Sign (at the rate of) |
| 41 | A | |
| 42 | B | |
| 43 | C | |
| 44 | D | |
| 45 | E | |
| 46 | F | |
| 47 | G | |
| 48 | H | |
| 49 | I | |
| 4A | J | |
| 4B | K | |
| 4C | L | |
| 4D | M | |
| 4E | N | |
| 4F | O | |
| 50 | P | |
| 51 | Q | |
| 52 | R | |
| 53 | S | |
| 54 | T | |
| 55 | U | |
| 56 | V | |
| 57 | W | |
| 58 | X | |
| 59 | Y | |
| 5A | Z | |
| 5B | [ | Square bracket Left |
| 5C | \ | Backslash |
| 5D | ] | Square bracket Right |
| 5E | ^ | Caret |
| 5F | _ | Underscore |
| 60 | ` | Grave Accent |
| 61 | a | |
| 62 | b | |
| 63 | c | |
| 64 | d | |
| 65 | e | |
| 66 | f | |
| 67 | g | |
| 68 | h | |
| 69 | i | |
| 6A | j | |
| 6B | k | |
| 6C | l | |
| 6D | m | |
| 6E | n | |
| 6F | o | |
| 70 | p | |
| 71 | q | |
| 72 | r | |
| 73 | s | |
| 74 | t | |
| 75 | u | |
| 76 | v | |
| 77 | w | |
| 78 | x | |
| 79 | y | |
| 7A | z | |
| 7B | { | Curly Brace Left |
| 7C | Pipe | |
| 7D | } | Curly Brace Right |
| 7E | ~ | Tilde |
| 7F | DEL | Delete |
LATIN-1
Also known as ISO-8859-1, extends the US-ASCII encoding by adding additional characters from hex A0 thru FF
| HEX | Char | Description |
|---|---|---|
| A0 | NBSP | Non Breaking Space |
| A1 | ¡ | Inverted Exclamation Point |
| A2 | ¢ | Cent Sign (currency) |
| A3 | £ | Pound Sign (currency) |
| A4 | ¤ | Unspecified Currency Sign |
| A5 | ¥ | Yen Sign (currency) |
| A6 | ¦ | Vertical Bar |
| A7 | § | Section Sign |
| A8 | ¨ | Diaeresis |
| A9 | © | Copyright Symbol |
| AA | ª | Ordinal indicator |
| AB | « | Angle Quote Double Left |
| AC | ¬ | Negation (Logical Compliment) |
| AD | SHY | Soft Hyphen |
| AE | ® | Registered Trademark Symbol |
| AF | ¯ | Macron |
| B0 | ° | Degree Symbol |
| B1 | ± | Plus Minus Symbol |
| B2 | ² | Superscript 2 |
| B3 | ³ | Superscript 3 |
| B4 | ´ | Acute Accent |
| B5 | µ | Micro |
| B6 | ¶ | Paragraph Mark |
| B7 | · | Interpunct (Centered Dot) |
| B8 | ¸ | Cedilla |
| B9 | ¹ | Superscript 1 |
| BA | º | Ordinal indicator |
| BB | » | Angle Quote Double Right |
| BC | ¼ | Fraction One Quarter |
| BD | ½ | Fraction One Half |
| BE | ¾ | Fraction Three Quareters |
| BF | ¿ | Inverted Question Mark |
| C0 | À | A Grave (Upper Case) |
| C1 | Á | A Acute (Upper Case) |
| C2 | Â | A Circumflex (Upper Case) |
| C3 | Ã | A Tilde (Upper Case) |
| C4 | Ä | A Diaeresis (Upper Case) |
| C5 | Å | A Overring (Upper Case) |
| C6 | Æ | AE Ligature (Upper Case) |
| C7 | Ç | C Cedilla (Upper Case) |
| C8 | È | E Grave (Upper Case) |
| C9 | É | E Acute (Upper Case) |
| CA | Ê | E Circumflex (Upper Case) |
| CB | Ë | E Diaeresis (Upper Case) |
| CC | Ì | I Grave (Upper Case) |
| CD | Í | I Acute (Upper Case) |
| CE | Î | I Circumflex (Upper Case) |
| CF | Ï | I Diaeresis (Upper Case) |
| D0 | Ð | Eth or EDH (TH) (Upper Case) |
| D1 | Ñ | N Tilde (Upper Case) |
| D2 | Ò | O Grave (Upper Case) |
| D3 | Ó | O Acute (Upper Case) |
| D4 | Ô | O Circumflex (Upper Case) |
| D5 | Õ | O Tilde (Upper Case) |
| D6 | Ö | O Diaeresis (Upper Case) |
| D7 | × | Multiplication Sign (math) (Upper Case) |
| D8 | Ø | O vowel (foreign) (Upper Case) |
| D9 | Ù | U Grave (Upper Case) |
| DA | Ú | U Acute (Upper Case) |
| DB | Û | U Circumflex (Upper Case) |
| DC | Ü | U Diaeresis (Upper Case) |
| DD | Ý | Y Acute (Upper Case) |
| DE | Þ | Thorn (TH) |
| DF | ß | Eszett (German) |
| E0 | à | A Grave (lower case) |
| E1 | á | A Acute (lower case) |
| E2 | â | A Circumflex (lower case) |
| E3 | ã | A Tilde (lower case) |
| E4 | ä | A Diaeresis (lower case) |
| E5 | å | A Overring (lower case) |
| E6 | æ | AE Ligature (lower case) |
| E7 | ç | C Cedilla (lower case) |
| E8 | è | E Grave (lower case) |
| E9 | é | E Acute (lower case) |
| EA | ê | E Circumflex (lower case) |
| EB | ë | E Diaeresis (lower case) |
| EC | ì | I Grave (lower case) |
| ED | í | I Acute (lower case) |
| EE | î | I Circumflex (lower case) |
| EF | ï | I Diaeresis (lower case) |
| F0 | ð | Eth or EDH (TH) (lower case) |
| F1 | ñ | N Tilde (lower case) |
| F2 | ò | O Grave (lower case) |
| F3 | ó | O Acute (lower case) |
| F4 | ô | O Circumflex (lower case) |
| F5 | õ | O Tilde (lower case) |
| F6 | ö | O Diaeresis (lower case) |
| F7 | ÷ | Division Sign (math) (lower case) |
| F8 | ø | O vowel (foreight) (lower case) |
| F9 | ù | U Grave (lower case) |
| FA | ú | U Acute (lower case) |
| FB | û | U Circumflex (lower case) |
| FC | ü | U Diaeresis (lower case) |
| FD | ý | Y Acute (lower case) |
| FE | þ | Thorn (TH) (lower case) |
| FF | ÿ | Y Diaeresis (lower case) |
WINDOWS-1552
Further extends the US-ASCII enconding by adding 27 additional characters between hex 80 thru 9F (5 characters remain undefined).
| HEX | Char | Description |
|---|---|---|
| 80 | € | Euro Sign |
| 82 | ‚ | Smart Quote Single (low) |
| 83 | ƒ | Forin |
| 84 | „ | Smart Quote Double (low) |
| 85 | … | Elipse |
| 86 | † | Dagger Single Cross |
| 87 | ‡ | Dagger Double Cross |
| 88 | ˆ | Circumflex |
| 89 | ‰ | Per-Mili (like Per-Cent) |
| 8A | Š | |
| 8B | ‹ | Angle Quote Single Left |
| 8C | Œ | OE ligagure (upper case) |
| 8E | Ž | |
| 91 | ‘ | Smart Quote Single Left |
| 92 | ’ | Smart Quote Single Right |
| 93 | “ | Smart Quote Duble Left |
| 94 | ” | Smart Quote Duble Right |
| 95 | • | Bullet |
| 96 | – | Dash (longer thicker) |
| 97 | — | Dash (longer thicker) |
| 98 | ˜ | Tilde |
| 99 | ™ | Trade Mark |
| 9A | š | |
| 9B | › | Angle Quote Single Right |
| 9C | œ | OE ligagure (lower case) |
| 9E | ž | |
| 9F | Ÿ |