Difference between revisions of "Previews Parsing"
Gskluzacek (Talk | contribs) (added note for AUG10) |
Gskluzacek (Talk | contribs) (added note for JAN12) |
Revision as of 12:25, 22 October 2018
Previews Parsing
Purpose
To take the previews order form and parse its contents into database tables
The Previews Web Site
The home page is located at previewsworld.com
They now have a digital version of Previews, which you can view on the web site or in a mobile app. Each issue is $3.99.
Customer Order Form (COF)
The Customer Order Form (COF) can be downloaded in Text or PDF format from the Archive page. They have issues as far back as JAN 2012 on the archive page itself. However, it is possible to request the COFs between JAN 2010 and DEC 2011 by manually entering the URL.
The URL for each COF follows the format below:
https://www.previewsworld.com/Catalog/CustomerOrderForm/<format>/<MONYY>
where <format> is either PDF or TXT
and <MONYY> is the 3 letter month abbreviation and the 2 digit year.
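The URL format above can be sketched as a small helper. This is only an illustration: the function name `cof_url` and the `MONTH_ABBRS` list are hypothetical, and it assumes the month abbreviation is upper case (as in the issue identifiers used on this page, e.g. JAN12).

```python
BASE_URL = "https://www.previewsworld.com/Catalog/CustomerOrderForm"

# Assumption: the site expects upper-case 3 letter month abbreviations.
MONTH_ABBRS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
               "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def cof_url(year, month, fmt="TXT"):
    """Build the COF URL for a 4 digit year, a month (1-12) and a format."""
    fmt = fmt.upper()
    if fmt not in ("PDF", "TXT"):
        raise ValueError("format must be PDF or TXT")
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # <MONYY> = 3 letter month abbreviation + 2 digit year
    return f"{BASE_URL}/{fmt}/{MONTH_ABBRS[month - 1]}{year % 100:02d}"
```

For example, `cof_url(2010, 8, "pdf")` would build the PDF URL for AUG10, one of the issues that is not listed on the archive page.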
Current Trends
The current trend is
- LATIN-1 (or US-ASCII) encoding of the text
- No blank lines before the FILE-HEADER line
- No leading or trailing white space for the FILE-HEADER line
- The month is spelled out
- VOL. used as the abbreviation for Volume Number
- Issue Number not left padded with Zeros
- Two (2) blank lines after the FILE-HEADER line and before the PAGE line
- No leading space for the PAGE line
- No trailing 5 TAB characters for the first PAGE line
Text Encoding
Nearly all text format COFs are encoded as US-ASCII (56*) or LATIN-1, also known as ISO-8859-1 (43*), with the following exceptions (* as of 2018/11, with 107 total issues from 2010/01 to 2018/11):
- Windows 1252 (6*): FEB15, NOV14, AUG14, FEB14, DEC12, MAY12
- UTF-8 with BOM (1*): DEC15
- UTF-16 [LE] (1*): JUN17
Note:
- See differences between US-ASCII, LATIN-1 and WINDOWS-1252 Encodings below
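One possible way to guess which of the encodings listed above a downloaded file uses is sketched below. It is a heuristic, not a reliable detector: the function name `guess_cof_encoding` is hypothetical, it assumes the UTF-8 and UTF-16 files start with a byte order mark, and it treats any byte in the hex 80 thru 9F range as a sign of Windows 1252.

```python
import codecs


def guess_cof_encoding(data: bytes) -> str:
    """Return a Python codec name for the raw bytes of a text format COF."""
    # Assumption: the UTF-8 / UTF-16 issues begin with a byte order mark.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Only codes hex 00 thru 7F: plain US-ASCII.
    if all(b < 0x80 for b in data):
        return "ascii"
    # Codes hex 80 thru 9F are printable only in Windows 1252.
    if any(0x80 <= b <= 0x9F for b in data):
        return "cp1252"
    # Otherwise only hex A0 thru FF is used beyond ASCII: LATIN-1.
    return "latin-1"
```

The result can be passed straight to `bytes.decode`, e.g. `data.decode(guess_cof_encoding(data))`.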
File Layout General
The layout (with some exceptions) generally consists of a FILE-HEADER Line on the first line of the file, followed by some number of blank lines, followed by a PAGE line. As of 2018/11, with 119 issues from 2009/01 to 2018/11:
- The majority (64) have 2 blank lines between the FILE-HEADER line and the first PAGE line (see item 2 below for AUG13; see item 3 below for NOV14)
- 37 have 1 blank line (see item 2 below for JAN13, DEC12; see item 4 below for MAR10)
- 16 have 3 blank lines
- 1 has 1 blank line
- 1 did not have any FILE-HEADER line (see item 1 below for AUG17)
exceptions:
- AUG17 - no FILE-HEADER line
- AUG13, JAN13, DEC12 - OTHER line types after the header line but before the first PAGE line
- NOV14 - PAGE line with PAGE specified as 'AG' instead of 'PAGE'
- MAR10 - 1 BLANK line before the HEADER line
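A rough sketch of how the general layout could be checked for a single file is shown below. The function name `blank_lines_before_first_page` is hypothetical; it takes the lines of a file (line endings already removed) and returns `None` for any file that does not follow the general layout, which would include the exceptions listed above.

```python
def blank_lines_before_first_page(lines):
    """Count the blank lines between the FILE-HEADER line and the first PAGE line.

    Returns None when the first line is not a FILE-HEADER line, when a
    non-blank, non-PAGE line appears first, or when no PAGE line is found.
    """
    if not lines or not lines[0].startswith("PREVIEWS "):
        return None
    count = 0
    for line in lines[1:]:
        if line.strip() == "":
            count += 1
        elif line.startswith("PAGE "):
            return count
        else:
            # Some OTHER line type sits between the header and the PAGE line.
            return None
    return None
```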
FILE-HEADER Line Format
FILE-HEADER line format
- Constant: 'PREVIEWS' starting in column 1
- Followed by 1 space character
- Followed by either the Month Name or Month Abbreviation (3 characters)
- Followed by 1 space
- Constant: either 'VOL' (with or without a trailing period) or 'V'
- Followed by 1 space if VOL or VOL. or no spaces if V
- Followed by the Volume Number: a 2 digit number that is equal to the issue year minus 1990 (yr - 1990 = vol_nbr)
- Followed by 1 space
- Followed by a pound sign '#'
- Followed by the Issue Number (no intervening spaces): a 1 or 2 digit number; for issues in 2009, values less than 10 are left padded with a zero (see notes below)
- No trailing spaces
notes:
- JUL14 - wrong issue number of 6 given, it should have been 7
- JAN10 - wrong Month abbreviation of DEC given, it should have been JAN
- AUG10 - wrong Month Name of JULY given, it should have been AUGUST
- JAN12 - wrong Month Name of DECEMBER given, it should have been JANUARY
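The FILE-HEADER format described above could be matched with a regular expression along these lines. This is a sketch under assumptions: the names `HEADER_RE` and `parse_file_header` are hypothetical, and the month is assumed to be upper case. The year is derived from the volume number (vol_nbr + 1990). The header values are returned as found, so the wrong month or issue number in the issues noted above would still need to be corrected separately.

```python
import re

MONTHS = ["JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE",
          "JULY", "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER"]

# 'PREVIEWS', month, then 'VOL ' / 'VOL. ' / 'V', 2 digit volume, ' #', issue
HEADER_RE = re.compile(
    r"PREVIEWS (?P<month>[A-Z]+) (?:VOL\.? |V)(?P<vol>\d{2}) #(?P<issue>\d{1,2})"
)


def parse_file_header(line):
    """Parse a FILE-HEADER line into a dict, or return None if it does not match."""
    m = HEADER_RE.fullmatch(line.rstrip("\r\n"))
    if m is None:
        return None
    month_nbr = None
    for nbr, name in enumerate(MONTHS, start=1):
        # Accept either the full Month Name or its 3 character abbreviation.
        if m.group("month") in (name, name[:3]):
            month_nbr = nbr
            break
    if month_nbr is None:
        return None
    vol_nbr = int(m.group("vol"))
    return {
        "month": month_nbr,
        "year": vol_nbr + 1990,  # yr - 1990 = vol_nbr
        "volume": vol_nbr,
        "issue": int(m.group("issue")),
    }
```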
PAGE Line Format
PAGE line format
- Constant: 'PAGE' starting in column 1
- Followed by 1 space character
- Followed by the Page Number (not left padded with zeroes)
- Followed by either no trailing white space (first PAGE line of the file) or 5 trailing TAB characters (all other PAGE lines in the file) (see notes below)
notes:
- JAN13 - this issue has non-blank lines between the FILE-HEADER line and the PAGE line, which looks to be the reason why its first PAGE line has 5 trailing TAB characters. So it's probably correct to assume that this issue is missing the first PAGE line.
- JUL13 - the first PAGE line for this issue contains 5 trailing TAB characters.
- NOV14 - PAGE line with PAGE specified as 'AG' instead of 'PAGE'
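The PAGE line format above could be recognized in a similar way. Again this is only a sketch: `PAGE_RE` and `parse_page_line` are hypothetical names, and page numbers are assumed to start at 1. A line such as the NOV14 'AG' line will not match and would need special handling.

```python
import re

# 'PAGE', 1 space, page number without leading zeroes, optional 5 trailing TABs
PAGE_RE = re.compile(r"PAGE (?P<page>[1-9]\d*)(?P<tabs>\t{5})?")


def parse_page_line(line):
    """Return (page_number, has_trailing_tabs), or None if not a PAGE line."""
    # Strip only the line ending so that trailing TAB characters are kept.
    m = PAGE_RE.fullmatch(line.rstrip("\r\n"))
    if m is None:
        return None
    return int(m.group("page")), m.group("tabs") is not None
```

The second value of the result tells the first PAGE line of a file (no trailing TABs) apart from the later ones (5 trailing TABs).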
File Locations
I have downloaded some of the text format COFs and have them located here: JAN 2009 thru APR 2013
I also downloaded some of the PDF format COFs which are located here: JAN 2009 thru DEC 2010, SEP 2011, JAN 2012 thru OCT 2012 and JAN 2013 thru APR 2013
I also have a full complement (both PDF and TEXT format from JAN09 to NOV18) of files on my local 5K iMac in the following directory: /Users/gregskluzacek/Documents/Development/Python/PreviewsParsing/downloads
High Level Functions
File Loader
Parsing of Loaded Data
Differences Between Encodings
US-ASCII
Basic character set which uses codes from hex 00 thru 7F. Codes between hex 80 and FF are undefined
| HEX | Char | Description |
|---|---|---|
| 00 | NUL | Null |
| 01 | SOH | Start Of Heading |
| 02 | STX | Start Of Text |
| 03 | ETX | End Of Text |
| 04 | EOT | End of Transmission |
| 05 | ENQ | Enquiry |
| 06 | ACK | Acknowledgement |
| 07 | BEL | Bell |
| 08 | BS | Backspace |
| 09 | HT | Horizontal Tab |
| 0A | LF | Line Feed |
| 0B | VT | Vertical Tab |
| 0C | FF | Form Feed |
| 0D | CR | Carriage Return |
| 0E | SO | Shift Out |
| 0F | SI | Shift In |
| 10 | DLE | Data Link Escape |
| 11 | DC1 | Xon (device control 1) |
| 12 | DC2 | (device control 2) |
| 13 | DC3 | Xoff (device control 3) |
| 14 | DC4 | (device control 4) |
| 15 | NAK | Negative Acknowledgement |
| 16 | SYN | Synchronous Idle |
| 17 | ETB | End Of Transmission Block |
| 18 | CAN | Cancel |
| 19 | EM | End Of Medium |
| 1A | SUB | Substitute |
| 1B | ESC | Escape |
| 1C | FS | File separator |
| 1D | GS | Group Separator |
| 1E | RS | Record Separator |
| 1F | US | Unit Separator |
| 20 | SP | Space |
| 21 | ! | Exclamation Point |
| 22 | " | Double Quote |
| 23 | # | Pound Sign |
| 24 | $ | Dollar Sign (currency) |
| 25 | % | Per-Cent |
| 26 | & | Ampersand |
| 27 | ' | Single Quote (Apostrophe) |
| 28 | ( | Parentheses Left |
| 29 | ) | Parentheses Right |
| 2A | * | Asterisk |
| 2B | + | Plus Sign |
| 2C | , | Comma |
| 2D | - | Dash or Minus Sign (math) |
| 2E | . | Period |
| 2F | / | Forward Slash |
| 30 | 0 | |
| 31 | 1 | |
| 32 | 2 | |
| 33 | 3 | |
| 34 | 4 | |
| 35 | 5 | |
| 36 | 6 | |
| 37 | 7 | |
| 38 | 8 | |
| 39 | 9 | |
| 3A | : | Colon |
| 3B | ; | Semi Colon |
| 3C | < | Less Than Sign (math) |
| 3D | = | Equal Sign (math) |
| 3E | > | Greater Than Sign (math) |
| 3F | ? | Question Mark |
| 40 | @ | At Sign (at the rate of) |
| 41 | A | |
| 42 | B | |
| 43 | C | |
| 44 | D | |
| 45 | E | |
| 46 | F | |
| 47 | G | |
| 48 | H | |
| 49 | I | |
| 4A | J | |
| 4B | K | |
| 4C | L | |
| 4D | M | |
| 4E | N | |
| 4F | O | |
| 50 | P | |
| 51 | Q | |
| 52 | R | |
| 53 | S | |
| 54 | T | |
| 55 | U | |
| 56 | V | |
| 57 | W | |
| 58 | X | |
| 59 | Y | |
| 5A | Z | |
| 5B | [ | Square bracket Left |
| 5C | \ | Backslash |
| 5D | ] | Square bracket Right |
| 5E | ^ | Caret |
| 5F | _ | Underscore |
| 60 | ` | Grave Accent |
| 61 | a | |
| 62 | b | |
| 63 | c | |
| 64 | d | |
| 65 | e | |
| 66 | f | |
| 67 | g | |
| 68 | h | |
| 69 | i | |
| 6A | j | |
| 6B | k | |
| 6C | l | |
| 6D | m | |
| 6E | n | |
| 6F | o | |
| 70 | p | |
| 71 | q | |
| 72 | r | |
| 73 | s | |
| 74 | t | |
| 75 | u | |
| 76 | v | |
| 77 | w | |
| 78 | x | |
| 79 | y | |
| 7A | z | |
| 7B | { | Curly Brace Left |
| 7C | \| | Pipe |
| 7D | } | Curly Brace Right |
| 7E | ~ | Tilde |
| 7F | DEL | Delete |
LATIN-1
Also known as ISO-8859-1, extends the US-ASCII encoding by adding additional characters from hex A0 thru FF
| HEX | Char | Description |
|---|---|---|
| A0 | NBSP | Non Breaking Space |
| A1 | ¡ | Inverted Exclamation Point |
| A2 | ¢ | Cent Sign (currency) |
| A3 | £ | Pound Sign (currency) |
| A4 | ¤ | Unspecified Currency Sign |
| A5 | ¥ | Yen Sign (currency) |
| A6 | ¦ | Broken Vertical Bar |
| A7 | § | Section Sign |
| A8 | ¨ | Diaeresis |
| A9 | © | Copyright Symbol |
| AA | ª | Feminine Ordinal Indicator |
| AB | « | Angle Quote Double Left |
| AC | ¬ | Negation (Logical Complement) |
| AD | SHY | Soft Hyphen |
| AE | ® | Registered Trademark Symbol |
| AF | ¯ | Macron |
| B0 | ° | Degree Symbol |
| B1 | ± | Plus Minus Symbol |
| B2 | ² | Superscript 2 |
| B3 | ³ | Superscript 3 |
| B4 | ´ | Acute Accent |
| B5 | µ | Micro |
| B6 | ¶ | Paragraph Mark |
| B7 | · | Interpunct (Centered Dot) |
| B8 | ¸ | Cedilla |
| B9 | ¹ | Superscript 1 |
| BA | º | Masculine Ordinal Indicator |
| BB | » | Angle Quote Double Right |
| BC | ¼ | Fraction One Quarter |
| BD | ½ | Fraction One Half |
| BE | ¾ | Fraction Three Quarters |
| BF | ¿ | Inverted Question Mark |
| C0 | À | A Grave (Upper Case) |
| C1 | Á | A Acute (Upper Case) |
| C2 | Â | A Circumflex (Upper Case) |
| C3 | Ã | A Tilde (Upper Case) |
| C4 | Ä | A Diaeresis (Upper Case) |
| C5 | Å | A Overring (Upper Case) |
| C6 | Æ | AE Ligature (Upper Case) |
| C7 | Ç | C Cedilla (Upper Case) |
| C8 | È | E Grave (Upper Case) |
| C9 | É | E Acute (Upper Case) |
| CA | Ê | E Circumflex (Upper Case) |
| CB | Ë | E Diaeresis (Upper Case) |
| CC | Ì | I Grave (Upper Case) |
| CD | Í | I Acute (Upper Case) |
| CE | Î | I Circumflex (Upper Case) |
| CF | Ï | I Diaeresis (Upper Case) |
| D0 | Ð | Eth or EDH (TH) (Upper Case) |
| D1 | Ñ | N Tilde (Upper Case) |
| D2 | Ò | O Grave (Upper Case) |
| D3 | Ó | O Acute (Upper Case) |
| D4 | Ô | O Circumflex (Upper Case) |
| D5 | Õ | O Tilde (Upper Case) |
| D6 | Ö | O Diaeresis (Upper Case) |
| D7 | × | Multiplication Sign (math) |
| D8 | Ø | O vowel (foreign) (Upper Case) |
| D9 | Ù | U Grave (Upper Case) |
| DA | Ú | U Acute (Upper Case) |
| DB | Û | U Circumflex (Upper Case) |
| DC | Ü | U Diaeresis (Upper Case) |
| DD | Ý | Y Acute (Upper Case) |
| DE | Þ | Thorn (TH) (Upper Case) |
| DF | ß | Eszett (German) |
| E0 | à | A Grave (lower case) |
| E1 | á | A Acute (lower case) |
| E2 | â | A Circumflex (lower case) |
| E3 | ã | A Tilde (lower case) |
| E4 | ä | A Diaeresis (lower case) |
| E5 | å | A Overring (lower case) |
| E6 | æ | AE Ligature (lower case) |
| E7 | ç | C Cedilla (lower case) |
| E8 | è | E Grave (lower case) |
| E9 | é | E Acute (lower case) |
| EA | ê | E Circumflex (lower case) |
| EB | ë | E Diaeresis (lower case) |
| EC | ì | I Grave (lower case) |
| ED | í | I Acute (lower case) |
| EE | î | I Circumflex (lower case) |
| EF | ï | I Diaeresis (lower case) |
| F0 | ð | Eth or EDH (TH) (lower case) |
| F1 | ñ | N Tilde (lower case) |
| F2 | ò | O Grave (lower case) |
| F3 | ó | O Acute (lower case) |
| F4 | ô | O Circumflex (lower case) |
| F5 | õ | O Tilde (lower case) |
| F6 | ö | O Diaeresis (lower case) |
| F7 | ÷ | Division Sign (math) |
| F8 | ø | O vowel (foreign) (lower case) |
| F9 | ù | U Grave (lower case) |
| FA | ú | U Acute (lower case) |
| FB | û | U Circumflex (lower case) |
| FC | ü | U Diaeresis (lower case) |
| FD | ý | Y Acute (lower case) |
| FE | þ | Thorn (TH) (lower case) |
| FF | ÿ | Y Diaeresis (lower case) |
WINDOWS-1252
Further extends the LATIN-1 encoding by adding 27 additional characters from hex 80 thru 9F (5 codes remain undefined).
| HEX | Char | Description |
|---|---|---|
| 80 | € | Euro Sign |
| 82 | ‚ | Smart Quote Single (low) |
| 83 | ƒ | Florin |
| 84 | „ | Smart Quote Double (low) |
| 85 | … | Ellipsis |
| 86 | † | Dagger Single Cross |
| 87 | ‡ | Dagger Double Cross |
| 88 | ˆ | Circumflex |
| 89 | ‰ | Per-Mille (like Per-Cent) |
| 8A | Š | S Caron (Upper Case) |
| 8B | ‹ | Angle Quote Single Left |
| 8C | Œ | OE Ligature (Upper Case) |
| 8E | Ž | Z Caron (Upper Case) |
| 91 | ‘ | Smart Quote Single Left |
| 92 | ’ | Smart Quote Single Right |
| 93 | “ | Smart Quote Double Left |
| 94 | ” | Smart Quote Double Right |
| 95 | • | Bullet |
| 96 | – | En Dash (longer than a hyphen) |
| 97 | — | Em Dash (longer than an En Dash) |
| 98 | ˜ | Tilde |
| 99 | ™ | Trade Mark |
| 9A | š | S Caron (lower case) |
| 9B | › | Angle Quote Single Right |
| 9C | œ | OE Ligature (lower case) |
| 9E | ž | Z Caron (lower case) |
| 9F | Ÿ | Y Diaeresis (Upper Case) |
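The practical difference between the three encodings can be seen by decoding single bytes with Python's built-in codecs. The helper `decode_byte` below is a hypothetical name used only for this illustration; `ascii`, `latin-1` and `cp1252` are the standard Python codec names for the encodings in the tables above.

```python
def decode_byte(value, encoding):
    """Decode one byte value with the given codec; None if it is undefined."""
    try:
        return bytes([value]).decode(encoding)
    except UnicodeDecodeError:
        return None


# Codes in hex 80 thru 9F that Python's cp1252 codec leaves undefined.
UNDEFINED_IN_CP1252 = [b for b in range(0x80, 0xA0)
                       if decode_byte(b, "cp1252") is None]

# Hex 93 is a left double smart quote in Windows 1252, an invisible
# control character in LATIN-1, and not valid at all in US-ASCII.
smart_quote = decode_byte(0x93, "cp1252")
control_char = decode_byte(0x93, "latin-1")
not_ascii = decode_byte(0x93, "ascii")
```

This is why a Windows 1252 file read as LATIN-1 does not raise an error but silently turns smart quotes, dashes and similar characters into control characters.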