From Komic Box Docs
Revision as of 12:20, 22 October 2018

Previews Parsing

Purpose

To take the previews order form and parse its contents into database tables

The Previews Web Site

The home page is located at previewsworld.com

They now have a digital version of Previews, which you can view on the web site or on a mobile app. Each issue is $3.99.

Customer Order Form (COF)

The Customer Order Form (COF) can be downloaded in Text or PDF format from the Archive page. They have issues as far back as JAN 2012 on the archive page itself. However, it is possible to request the COFs between JAN 2010 and DEC 2011 by manually entering the URL.

The URL for each COF follows the format below:

   https://www.previewsworld.com/Catalog/CustomerOrderForm/<format>/<MONYY>

where <format> is either PDF or TXT

and <MONYY> is the 3 letter month abbreviation and the 2 digit year.
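
As an illustration of the URL format above, here is a minimal Python sketch that builds the URL for a given issue. The function name `cof_url` and its parameters are hypothetical (they are not part of any existing code), and it assumes the month abbreviation and format are upper case, as in the examples on this page.

```python
BASE_URL = "https://www.previewsworld.com/Catalog/CustomerOrderForm"
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def cof_url(fmt: str, year: int, month: int) -> str:
    """Build a COF URL from a format ('PDF' or 'TXT'), a 4 digit year and a month (1-12)."""
    fmt = fmt.upper()
    if fmt not in ("PDF", "TXT"):
        raise ValueError("format must be PDF or TXT")
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    # <MONYY>: 3 letter month abbreviation followed by the 2 digit year
    return f"{BASE_URL}/{fmt}/{MONTHS[month - 1]}{year % 100:02d}"
```

For example, `cof_url("TXT", 2018, 11)` produces the URL ending in `/TXT/NOV18`.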

Current Trends

The current trend is

  • LATIN-1 (or US-ASCII) encoding of the text
  • No blank lines before the FILE-HEADER line
  • No leading or trailing white space for the FILE-HEADER line
  • The month is spelled out
  • VOL. used as the abbreviation for Volume Number
  • Issue Number not left padded with Zeros
  • Two (2) blank lines after the FILE-HEADER line and before the PAGE line
  • No leading space for the PAGE line
  • No trailing 5 TAB characters for first PAGE line

Text Encoding

Nearly all text format COFs are encoded as US-ASCII (56*) or LATIN-1, also known as ISO-8859-1 (43*), with the following exceptions (* counts as of 2018/11, with 107 total issues from 2010/01 to 2018/11):

  • Windows 1252 (6*): FEB15, NOV14, AUG14, FEB14, DEC12, MAY12
  • UTF-8 with BOM (1*): DEC15
  • UTF-16 [LE] (1*): JUN17

Note:

  • See differences between US-ASCII, LATIN-1 and WINDOWS-1252 Encodings below
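
Since the encoding varies from issue to issue, a loader has to guess it before decoding. The following is only a sketch of one possible heuristic, not the actual loader: the function name `detect_cof_encoding` is hypothetical, it assumes the UTF-16 file starts with a byte order mark, and it treats any byte in the hex 80 thru 9F range as a sign of Windows 1252.

```python
import codecs


def detect_cof_encoding(raw: bytes) -> str:
    """Guess a Python codec name for the raw bytes of a text format COF."""
    # A byte order mark identifies the UTF-8 and UTF-16 files
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Only codes hex 00 thru 7F: plain US-ASCII
    if all(b < 0x80 for b in raw):
        return "ascii"
    # Codes hex 80 thru 9F are printable only in Windows 1252 (assumption)
    if any(0x80 <= b <= 0x9F for b in raw):
        return "cp1252"
    return "latin-1"
```

The returned name can be passed straight to `bytes.decode()`. The "utf-8-sig" and "utf-16" codecs strip the byte order mark while decoding.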

File Layout General

The layout (with some exceptions) generally consists of a FILE-HEADER Line on the first line of the file, followed by some number of blank lines, followed by a PAGE line. As of 2018/11, with 119 issues from 2009/01 to 2018/11:

  • The majority (64) have 2 blank lines between the FILE-HEADER line and the first PAGE line (see item 2 below for AUG13; see item 3 below for NOV14)
  • 37 have 1 blank line (see item 2 below for JAN13, DEC12; see item 4 below for MAR10)
  • 16 have 3 blank lines
  • 1 has 1 blank line
  • 1 did not have any FILE-HEADER line (see item 1 below for AUG17)

exceptions:

  1. AUG17 - no FILE-HEADER line
  2. AUG13, JAN13, DEC12 - OTHER line types after the header line but before the first PAGE line
  3. NOV14 - PAGE line with PAGE specified as 'AG' instead of 'PAGE'
  4. MAR10 - 1 BLANK line before the HEADER line
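
To check a downloaded file against the general layout, something like the following sketch could be used. The function name `layout_summary` and the keys it returns are hypothetical. It deliberately applies the strict layout (header on the first line, 'PAGE ' in column 1), so the exceptions listed above show up as mismatches instead of being handled.

```python
def layout_summary(text: str) -> dict:
    """Summarize the lines between the FILE-HEADER line and the first PAGE line."""
    lines = text.splitlines()
    # Strict check: the FILE-HEADER must be the very first line of the file
    has_header = bool(lines) and lines[0].startswith("PREVIEWS ")
    blank_lines = 0
    other_lines = 0
    found_page = False
    for line in lines[1 if has_header else 0:]:
        if line.startswith("PAGE "):
            found_page = True
            break
        if line.strip():
            other_lines += 1   # OTHER line types before the first PAGE line
        else:
            blank_lines += 1
    return {
        "has_header": has_header,
        "blank_lines": blank_lines,
        "other_lines": other_lines,
        "found_page": found_page,
    }
```

A file following the current trend would report `has_header` as True, 2 blank lines and 0 other lines.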

FILE-HEADER Line Format

FILE-HEADER line format

  • Constant: 'PREVIEWS' starting in column 1
  • Followed by 1 space character
  • Followed by either the Month Name or Month Abbreviation (3 characters)
  • Followed by 1 space
  • Constant: either 'VOL' (with or without a trailing period) or 'V'
  • Followed by 1 space if VOL or VOL. or no spaces if V
  • Followed by the Volume Number: a 2 digit number that is equal to the issue year minus 1990 (yr - 1990 = vol_nbr)
  • Followed by 1 space
  • Followed by a pound sign '#'
  • Followed by the Issue Number (no intervening spaces): a 1 or 2 digit number; for issues in 2009, values less than 10 are left padded with a zero (see notes below)
  • No trailing spaces

notes:

  • JUL14 - wrong issue number of 6 given, it should have been 7
  • JAN10 - wrong Month abbreviation of DEC given, it should have been JAN
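
The FILE-HEADER format above maps fairly directly onto a regular expression. The sketch below is one possible way to parse it; the names `HEADER_RE` and `parse_file_header` are hypothetical. The `issue_matches_month` flag rests on an assumption inferred from the notes above (that the issue number equals the month number), and it would flag headers like the JUL14 and JAN10 ones.

```python
import re

# PREVIEWS <month> VOL|VOL.|V <vol> #<issue>, with no trailing spaces
HEADER_RE = re.compile(
    r"PREVIEWS (?P<month>[A-Za-z]+) (?:VOL\.? |V)(?P<vol>\d{2}) #(?P<issue>\d{1,2})"
)
MONTH_NAMES = ("JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE", "JULY",
               "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER")


def parse_file_header(line: str):
    """Return the parsed header fields as a dict, or None if the line does not match."""
    match = HEADER_RE.fullmatch(line)
    if match is None:
        return None
    month_text = match.group("month").upper()
    # Accept either the full month name or its 3 character abbreviation
    month = next(
        (n for n, name in enumerate(MONTH_NAMES, start=1)
         if month_text in (name, name[:3])),
        None,
    )
    if month is None:
        return None
    volume = int(match.group("vol"))
    issue = int(match.group("issue"))
    return {
        "month": month,
        "year": volume + 1990,  # yr - 1990 = vol_nbr
        "volume": volume,
        "issue": issue,
        "issue_matches_month": issue == month,  # assumption, see lead-in
    }
```

For example, a header of `PREVIEWS NOVEMBER VOL. 28 #11` parses to month 11, year 2018, volume 28 and issue 11.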

PAGE Line Format

PAGE line format

  • Constant: 'PAGE' starting in column 1
  • Followed by 1 space character
  • Followed by the Page Number (not left padded with zeroes)
  • Followed by either no trailing white space (first PAGE line of the file) or 5 trailing TAB characters (all other PAGE lines in the file) (see notes below)

notes:

  • JAN13 - this issue has non-blank lines between the FILE-HEADER line and the PAGE line, which looks to be the reason why its first PAGE line has 5 trailing TAB characters. So it is probably correct to assume that this issue is missing the first PAGE line.
  • JUL13 - the first PAGE line for this issue contains 5 trailing TAB characters.
  • NOV14 - PAGE line with PAGE specified as 'AG' instead of 'PAGE'
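
The PAGE line format can be sketched the same way. The names `PAGE_RE` and `parse_page_line` are hypothetical, and the pattern assumes page numbers start at 1. The returned flag tells whether the 5 trailing TAB characters were present, which helps spot issues like JAN13 and JUL13.

```python
import re

# PAGE <number>, optionally followed by exactly 5 TAB characters
PAGE_RE = re.compile(r"PAGE (?P<page>[1-9]\d*)(?P<tabs>\t{5})?")


def parse_page_line(line: str):
    """Return (page_number, has_trailing_tabs), or None if this is not a PAGE line."""
    match = PAGE_RE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        return None
    return int(match.group("page")), match.group("tabs") is not None
```

A malformed line such as the NOV14 one ('AG' instead of 'PAGE') returns None and would need special handling.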

File Locations

I have downloaded some of the text format COFs and have them located here: JAN 2009 thru APR 2013

I also downloaded some of the PDF format COFs, which are located here: JAN 2009 thru DEC 2010, SEP 2011, JAN 2012 thru OCT 2012 and JAN 2013 thru APR 2013

I also have a full complement of files (both PDF and TEXT format, from JAN09 to NOV18) on my local 5K iMac in the following directory: /Users/gregskluzacek/Documents/Development/Python/PreviewsParsing/downloads

High Level Functions

File Loader

Parsing of Loaded Data

Differences Between Encodings

US-ASCII

Basic character set which uses codes from hex 00 thru 7F. Codes from hex 80 thru FF are undefined.

HEX Char Description
00 NUL Null
01 SOH Start Of Heading
02 STX Start Of Text
03 ETX End Of Text
04 EOT End of Transmission
05 ENQ Enquiry
06 ACK Acknowledgement
07 BEL Bell
08 BS Backspace
09 HT Horizontal Tab
0A LF Line Feed
0B VT Vertical Tab
0C FF Form Feed
0D CR Carriage Return
0E SO Shift Out
0F SI Shift In
10 DLE Data Link Escape
11 DC1 Xon (device control 1)
12 DC2 (device control 2)
13 DC3 Xoff (device control 3)
14 DC4 (device control 4)
15 NAK Negative Acknowledgement
16 SYN Synchronous Idle
17 ETB End Of Transmission Block
18 CAN Cancel
19 EM End Of Medium
1A SUB Substitute
1B ESC Escape
1C FS File separator
1D GS Group Separator
1E RS Record Separator
1F US Unit Separator
20 SP Space
21 ! Exclamation Point
22 " Double Quote
23 # Pound Sign
24 $ Dollar Sign (currency)
25 % Per-Cent
26 & Ampersand
27 ' Single Quote (Apostrophe)
28 ( Parentheses Left
29 ) Parentheses Right
2A * Asterisk
2B + Plus Sign
2C , Comma
2D - Dash or Minus Sign (math)
2E . Period
2F / Forward Slash
30 0
31 1
32 2
33 3
34 4
35 5
36 6
37 7
38 8
39 9
3A : Colon
3B ; Semi Colon
3C < Less Than Sign (math)
3D = Equal Sign (math)
3E > Greater Than Sign (math)
3F ? Question Mark
40 @ At Sign (at the rate of)
41 A
42 B
43 C
44 D
45 E
46 F
47 G
48 H
49 I
4A J
4B K
4C L
4D M
4E N
4F O
50 P
51 Q
52 R
53 S
54 T
55 U
56 V
57 W
58 X
59 Y
5A Z
5B [ Square bracket Left
5C \ Backslash
5D ] Square bracket Right
5E ^ Caret
5F _ Underscore
60 ` Grave Accent
61 a
62 b
63 c
64 d
65 e
66 f
67 g
68 h
69 i
6A j
6B k
6C l
6D m
6E n
6F o
70 p
71 q
72 r
73 s
74 t
75 u
76 v
77 w
78 x
79 y
7A z
7B { Curly Brace Left
7C | Pipe
7D } Curly Brace Right
7E ~ Tilde
7F DEL Delete

LATIN-1

Also known as ISO-8859-1, extends the US-ASCII encoding by adding additional characters from hex A0 thru FF

HEX Char Description
A0 NBSP Non Breaking Space
A1 ¡ Inverted Exclamation Point
A2 ¢ Cent Sign (currency)
A3 £ Pound Sign (currency)
A4 ¤ Unspecified Currency Sign
A5 ¥ Yen Sign (currency)
A6 ¦ Broken Vertical Bar
A7 § Section Sign
A8 ¨ Diaeresis
A9 © Copyright Symbol
AA ª Feminine Ordinal Indicator
AB « Angle Quote Double Left
AC ¬ Negation (Logical Complement)
AD SHY Soft Hyphen
AE ® Registered Trademark Symbol
AF ¯ Macron
B0 ° Degree Symbol
B1 ± Plus Minus Symbol
B2 ² Superscript 2
B3 ³ Superscript 3
B4 ´ Acute Accent
B5 µ Micro
B6 ¶ Paragraph Mark
B7 · Interpunct (Centered Dot)
B8 ¸ Cedilla
B9 ¹ Superscript 1
BA º Masculine Ordinal Indicator
BB » Angle Quote Double Right
BC ¼ Fraction One Quarter
BD ½ Fraction One Half
BE ¾ Fraction Three Quarters
BF ¿ Inverted Question Mark
C0 À A Grave (Upper Case)
C1 Á A Acute (Upper Case)
C2 Â A Circumflex (Upper Case)
C3 Ã A Tilde (Upper Case)
C4 Ä A Diaeresis (Upper Case)
C5 Å A Overring (Upper Case)
C6 Æ AE Ligature (Upper Case)
C7 Ç C Cedilla (Upper Case)
C8 È E Grave (Upper Case)
C9 É E Acute (Upper Case)
CA Ê E Circumflex (Upper Case)
CB Ë E Diaeresis (Upper Case)
CC Ì I Grave (Upper Case)
CD Í I Acute (Upper Case)
CE Î I Circumflex (Upper Case)
CF Ï I Diaeresis (Upper Case)
D0 Ð Eth or EDH (TH) (Upper Case)
D1 Ñ N Tilde (Upper Case)
D2 Ò O Grave (Upper Case)
D3 Ó O Acute (Upper Case)
D4 Ô O Circumflex (Upper Case)
D5 Õ O Tilde (Upper Case)
D6 Ö O Diaeresis (Upper Case)
D7 × Multiplication Sign (math)
D8 Ø O with Stroke (Upper Case)
D9 Ù U Grave (Upper Case)
DA Ú U Acute (Upper Case)
DB Û U Circumflex (Upper Case)
DC Ü U Diaeresis (Upper Case)
DD Ý Y Acute (Upper Case)
DE Þ Thorn (TH) (Upper Case)
DF ß Eszett (German)
E0 à A Grave (lower case)
E1 á A Acute (lower case)
E2 â A Circumflex (lower case)
E3 ã A Tilde (lower case)
E4 ä A Diaeresis (lower case)
E5 å A Overring (lower case)
E6 æ AE Ligature (lower case)
E7 ç C Cedilla (lower case)
E8 è E Grave (lower case)
E9 é E Acute (lower case)
EA ê E Circumflex (lower case)
EB ë E Diaeresis (lower case)
EC ì I Grave (lower case)
ED í I Acute (lower case)
EE î I Circumflex (lower case)
EF ï I Diaeresis (lower case)
F0 ð Eth or EDH (TH) (lower case)
F1 ñ N Tilde (lower case)
F2 ò O Grave (lower case)
F3 ó O Acute (lower case)
F4 ô O Circumflex (lower case)
F5 õ O Tilde (lower case)
F6 ö O Diaeresis (lower case)
F7 ÷ Division Sign (math)
F8 ø O with Stroke (lower case)
F9 ù U Grave (lower case)
FA ú U Acute (lower case)
FB û U Circumflex (lower case)
FC ü U Diaeresis (lower case)
FD ý Y Acute (lower case)
FE þ Thorn (TH) (lower case)
FF ÿ Y Diaeresis (lower case)

WINDOWS-1252

Further extends the US-ASCII encoding by adding 27 additional characters from hex 80 thru 9F (5 codes remain undefined).

HEX Char Description
80 € Euro Sign
82 ‚ Smart Quote Single (low)
83 ƒ Florin
84 „ Smart Quote Double (low)
85 … Ellipsis
86 † Dagger Single Cross
87 ‡ Dagger Double Cross
88 ˆ Circumflex
89 ‰ Per-Mille (like Per-Cent)
8A Š S Caron (Upper Case)
8B ‹ Angle Quote Single Left
8C Œ OE Ligature (Upper Case)
8E Ž Z Caron (Upper Case)
91 ‘ Smart Quote Single Left
92 ’ Smart Quote Single Right
93 “ Smart Quote Double Left
94 ” Smart Quote Double Right
95 • Bullet
96 – En Dash (longer)
97 — Em Dash (longest)
98 ˜ Tilde
99 ™ Trade Mark
9A š S Caron (lower case)
9B › Angle Quote Single Right
9C œ OE Ligature (lower case)
9E ž Z Caron (lower case)
9F Ÿ Y Diaeresis (Upper Case)
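
The practical difference between the three encodings shows up when the same bytes are decoded with each one. The sketch below uses Python's built-in "ascii", "latin-1" and "cp1252" codecs; the helper name `decode_three_ways` is hypothetical.

```python
def decode_three_ways(raw: bytes) -> dict:
    """Decode the same bytes as US-ASCII, LATIN-1 and WINDOWS-1252.

    A value of None means the bytes are not valid in that encoding.
    """
    results = {}
    for codec in ("ascii", "latin-1", "cp1252"):
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None
    return results
```

For example, hex 93 fails as US-ASCII, decodes to an unprintable control code as LATIN-1, and decodes to a left double smart quote as WINDOWS-1252. Hex 81 is one of the 5 undefined WINDOWS-1252 codes, so Python's "cp1252" codec rejects it.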