Difference between revisions of "Previews Parsing"
Gskluzacek (Talk | contribs) (added links to char sets)
Gskluzacek (Talk | contribs) (new section)
Revision as of 10:11, 22 October 2018
Previews Parsing
Purpose
To take the previews order form and parse its contents into database tables
The Previews Web Site
The home page is located at previewsworld.com
They now have a digital version of Previews, which you can view on the web site or in a mobile app. Each issue is $3.99.
Customer Order Form (COF)
The Customer Order Form (COF) can be downloaded in Text or PDF format from the Archive page. They have issues as far back as Jan 2012 on the archive page itself. However, it is possible to request the COFs between JAN 2010 and DEC 2011 by manually entering the URL.
The URL for each COF follows the format below:
https://www.previewsworld.com/Catalog/CustomerOrderForm/<format>/<MONYY>
where <format> is either PDF or TXT
and <MONYY> is the 3 letter month abbreviation and the 2 digit year.
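The URL format above can be sketched as a small helper. This is only an illustration: the function name `cof_url` is made up for this example, and it assumes the month abbreviation is upper case (as in the issue labels used on this page, e.g. NOV14).

```python
# Month abbreviations in calendar order; upper case is an assumption based on
# the issue labels used elsewhere on this page (FEB15, NOV14, ...).
MONTH_ABBR = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
              "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

BASE_URL = "https://www.previewsworld.com/Catalog/CustomerOrderForm"


def cof_url(year: int, month: int, fmt: str = "TXT") -> str:
    """Build the COF URL for an issue (hypothetical helper, not site API)."""
    fmt = fmt.upper()
    if fmt not in ("PDF", "TXT"):
        raise ValueError("fmt must be 'PDF' or 'TXT'")
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # <MONYY>: 3 letter month abbreviation followed by the 2 digit year
    monyy = f"{MONTH_ABBR[month - 1]}{year % 100:02d}"
    return f"{BASE_URL}/{fmt}/{monyy}"


print(cof_url(2018, 11))         # text format, November 2018
print(cof_url(2010, 1, "pdf"))   # PDF format, January 2010
```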
Current Trends
The current trend is
- LATIN-1 (or US-ASCII) encoding of the text
- No blank lines before the FILE-HEADER line
- No leading or trailing white space for the FILE-HEADER line
- The month is spelled out
- VOL. used as the abbreviation for Volume Number
- Issue Number not left padded with Zeros
- Two (2) blank lines after the FILE-HEADER line and before the PAGE line
- No leading space for the PAGE line
- No trailing 5 TAB characters for first PAGE line
Text Encoding
Nearly all text format COFs are encoded with either the US-ASCII (56*) encoding or the LATIN-1 (43*) encoding, also known as ISO-8859-1, with the following exceptions (* as of 2018/11 with 107 total issues from 2010/01 to 2018/11):
- Windows 1252 (6*): FEB15, NOV14, AUG14, FEB14, DEC12, MAY12
- UTF-8 with BOM (1*): DEC15
- UTF-16 [LE] (1*): JUN17
Note:
- See differences between US-ASCII, LATIN-1 and WINDOWS-1252 Encodings below
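One possible way to guess which of the encodings listed above a downloaded file uses is sketched below. It is a heuristic, not a definitive detector: the function name `guess_cof_encoding` is hypothetical, and it assumes the UTF-8 and UTF-16 files start with a byte order mark (the page confirms a BOM only for the UTF-8 issue). LATIN-1 and Windows 1252 are told apart only by whether any byte falls in hex 80 thru 9F.

```python
import codecs


def guess_cof_encoding(data: bytes) -> str:
    """Return a Python codec name for the raw bytes of a text format COF."""
    # Files with a byte order mark (assumed present for UTF-8 and UTF-16)
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Only codes hex 00 thru 7F: plain US-ASCII
    if all(b < 0x80 for b in data):
        return "ascii"
    # Any byte in hex 80 thru 9F suggests Windows 1252 rather than LATIN-1
    if any(0x80 <= b <= 0x9F for b in data):
        return "cp1252"
    return "latin-1"


raw = b"PREVIEWS \x93sample\x94"
encoding = guess_cof_encoding(raw)
print(encoding, raw.decode(encoding))
```

The returned name can be passed straight to `bytes.decode` or to the `encoding` argument of `open`.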
File Layout General
The layout (with some exceptions) generally consists of a FILE-HEADER Line on the first line of the file, followed by some number of blank lines, followed by a PAGE line. As of 2018/11, with 119 issues from 2009/01 to 2018/11:
- The majority (64) have 2 blank lines between the FILE-HEADER line and the first PAGE line (see item 2 below for AUG13; see item 3 below for NOV14)
- 37 have 1 blank line (see item 2 below for JAN13, DEC12; see item 4 below for MAR10)
- 16 have 3 blank lines
- 1 has 1 blank line
- 1 did not have any FILE-HEADER line (see item 1 below for AUG17)
exceptions:
- AUG17 - no FILE-HEADER line
- AUG13, JAN13, DEC12 - OTHER line types after the header line but before the first PAGE line
- NOV14 - PAGE line with PAGE specified as 'AG' instead of 'PAGE'
- MAR10 - 1 BLANK line before the HEADER line
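The general layout described above (FILE-HEADER line, some blank lines, then the first PAGE line) can be checked with a short function like the one below. This is a minimal sketch: the name `blank_lines_after_header` is hypothetical, and the sample header text is constructed from the format rules on this page, not copied from a real file. Files matching any of the listed exceptions come back as `None` so they can be handled separately.

```python
def blank_lines_after_header(lines):
    """Count blank lines between the FILE-HEADER line and the first PAGE line.

    `lines` is a list of lines without line terminators. Returns None when the
    file does not follow the general layout (no header on line 1, or some
    other line type before the first PAGE line).
    """
    if not lines or not lines[0].startswith("PREVIEWS"):
        return None  # e.g. missing header, or a blank line before the header
    count = 0
    for line in lines[1:]:
        if line.strip() == "":
            count += 1
        elif line.startswith("PAGE"):
            return count
        else:
            return None  # OTHER line type (or malformed PAGE line) found first
    return None  # no PAGE line at all


sample = ["PREVIEWS NOVEMBER VOL. 28 #11", "", "", "PAGE 1"]
print(blank_lines_after_header(sample))
```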
FILE-HEADER Line Format
FILE-HEADER line format
- Constant: 'PREVIEWS' starting in column 1
- Followed by 1 space character
- Followed by either the Month Name or Month Abbreviation (3 characters)
- Followed by 1 space
- Constant: either 'VOL' (with or without a trailing period) or 'V'
- Followed by 1 space if VOL or VOL. or no spaces if V
- Followed by the Volume Number: a 2 digit number that is equal to the issue year minus 1990 (yr - 1990 = vol_nbr)
- Followed by 1 space
- Followed by a pound sign '#'
- Followed by the Issue Number (no intervening spaces): a 1 or 2 digit number. For issues in 2009, values less than 10 are left padded with a zero (see notes below)
- No trailing spaces
notes:
- JUL14 - wrong issue number of 6 given, it should have been 7
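The FILE-HEADER rules above translate fairly directly into a regular expression. The sketch below is one way to write it; the names `FILE_HEADER_RE` and `parse_file_header` are made up for this example, the month is matched loosely as a run of letters, and the sample lines are constructed from the rules rather than taken from real files.

```python
import re

FILE_HEADER_RE = re.compile(
    r"PREVIEWS "              # constant starting in column 1, then 1 space
    r"(?P<month>[A-Za-z]+) "  # month name or 3 character abbreviation, 1 space
    r"(?:VOL\.? |V)"          # 'VOL' or 'VOL.' plus 1 space, or 'V' with none
    r"(?P<volume>\d{2}) "     # 2 digit volume number, 1 space
    r"#(?P<issue>\d{1,2})"    # pound sign directly followed by issue number
)


def parse_file_header(line):
    """Parse a FILE-HEADER line (without line terminator); None if no match."""
    match = FILE_HEADER_RE.fullmatch(line)  # fullmatch rejects trailing spaces
    if match is None:
        return None
    volume = int(match.group("volume"))
    return {
        "month": match.group("month"),
        "volume": volume,
        "year": volume + 1990,  # from yr - 1990 = vol_nbr
        "issue": int(match.group("issue")),
    }


print(parse_file_header("PREVIEWS NOVEMBER VOL. 28 #11"))
print(parse_file_header("PREVIEWS JAN V19 #01"))
```

Because the issue number in the header can be wrong (see JUL14), it may be safer to cross-check it against the month rather than trust it.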
PAGE Line Format
PAGE line format
- Constant: 'PAGE' starting in column 1
- Followed by 1 space character
- Followed by the Page Number (not left padded with zeroes)
- Followed by either no trailing white space (first PAGE line of the file) or 5 trailing TAB characters (all other PAGE lines in the file) (see notes below)
notes:
- JAN13 - this issue has non-blank lines between the FILE-HEADER line and the PAGE line, which looks to be the reason why its first PAGE line has 5 trailing TAB characters. So it's probably correct to assume that this issue is missing the first PAGE line.
- JUL13 - the first PAGE line for this issue contains 5 trailing TAB characters.
- NOV14 - PAGE line with PAGE specified as 'AG' instead of 'PAGE'
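The PAGE line rules above can be sketched the same way. The names `PAGE_RE` and `parse_page_line` are hypothetical. The function reports whether the 5 trailing TAB characters are present instead of enforcing which PAGE line may have them, since JAN13 and JUL13 break that rule. A line such as the NOV14 'AG' line simply fails to match.

```python
import re

PAGE_RE = re.compile(
    r"PAGE "                   # constant starting in column 1, then 1 space
    r"(?P<page>0|[1-9]\d*)"    # page number, not left padded with zeroes
    r"(?P<tabs>\t{5})?"        # optional 5 trailing TAB characters
)


def parse_page_line(line):
    """Return (page_number, has_trailing_tabs), or None if not a PAGE line."""
    match = PAGE_RE.fullmatch(line)
    if match is None:
        return None
    return int(match.group("page")), match.group("tabs") is not None


print(parse_page_line("PAGE 1"))
print(parse_page_line("PAGE 12\t\t\t\t\t"))
print(parse_page_line("AG 1"))
```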
File Locations
I have downloaded some of the text format COFs and have them located here: JAN 2009 thru APR 2013
I also downloaded some of the PDF format COFs which are located here: JAN 2009 thru DEC 2010, SEP 2011, JAN 2012 thru OCT 2012 and JAN 2013 thru APR 2013
I also have a full complement (both PDF and TEXT format from JAN09 to NOV18) of files on my local 5K iMac in the following directory: /Users/gregskluzacek/Documents/Development/Python/PreviewsParsing/downloads
High Level Functions
File Loader
Parsing of Loaded Data
Differences Between Encodings
US-ASCII
Basic character set which uses codes from hex 00 thru 7F. Codes between hex 80 and FF are undefined.
LATIN-1
Also known as ISO-8859-1, extends the US-ASCII encoding by adding additional characters from hex A0 thru FF.
WINDOWS-1252
Further extends the LATIN-1 encoding by adding 27 additional characters from hex 80 thru 9F (5 codes in that range remain undefined).
| Header text | Header text | Header text |
|---|---|---|
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
| Example | Example | Example |
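The practical effect of these differences can be seen by decoding the same byte with each codec. The sketch below uses Python's built-in `ascii`, `latin-1` and `cp1252` codecs; the helper name `decode_three_ways` is made up for this example. Note that Python's `latin-1` codec maps hex 80 thru 9F to the C1 control codes rather than raising an error, so only `ascii` and `cp1252` can fail here.

```python
def decode_three_ways(raw: bytes):
    """Decode raw bytes with each codec; None marks a decoding failure."""
    results = {}
    for name in ("ascii", "latin-1", "cp1252"):
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# hex E9: defined in LATIN-1 and Windows 1252, undefined in US-ASCII
print(decode_three_ways(b"\xe9"))
# hex 93: a left double quotation mark in Windows 1252 only
print(decode_three_ways(b"\x93"))
# hex 81: one of the 5 codes left undefined in Windows 1252
print(decode_three_ways(b"\x81"))
```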