The challenge for day 4 is more suited for databases. The input is fuzzy, but with a CTE or two I should be able to clean it up. Solved the first part already with the kid earlier, but we decided to skip part two: mainly because it involved more complex string operations (best suited for regex operations), which is something the kid hasn’t learned yet.
Task 1: The airport is busy, and the passport scanners do not function correctly. Wondering why every airport seems to be the same mess, but oh well. Anyway, being someone who can do some IT support they ask you to basically hack this thing on the fly and implement passport verification. If that would be possible on a real-world airport … In addition to all the IT problems there, some people arrive with an ID which was issued by some “North Pole” organization which is not an official government body. But because everyone is waiting in line the people in charge decide to accept this ID as well. Only problem: this ID does not have a “Country ID” code. Your task, if you accept it: check the input data and if only the cid (Country ID) is missing, then accept this non-passport as valid document.
Task 2: Lines are getting longer and longer, and there are still problems with some passports. Your task is to apply additional validation on the fuzzy data which comes from the scanners. The task does not say what happens to people with invalid ID documents, but one can only assume that they cannot board the plane.
Preparation
As usual I read the raw data into a table. Like yesterday I also create an ID column along the way. Verification for the ID column was posted in yesterday’s post, and I’m not going to repeat the steps here.
|
|
Loading the data:
|
|
I have 1138
entries in the table, but this says nothing about the number of scanned passports and other ID cards: some lines are empty, some data groups are in one line, some in multiple lines:
|
|
As a preparation I somehow need to cleanup that data. I can go over the data with a Windowing Function, use LAG() to look back into the table and find everything before an empty line. That approach has just a small problem: the last line is not empty, therefore I miss the last group of data. As I’m not allowed to manipulate the data (it’s read-only), I have to make sure I have an empty line at the end. I’m using a UNION SELECT
which adds an empty line at the end. Since I have an identifier (the ID column), I add the empty line with the highest ID + 1
from the data table:
|
|
To have everything in one query, I move this into a first stage of a CTE:
|
|
Good, now I have the groups, apply the LAG()
and extract each group. The query extracts the start and end line numbers for each data group, but not yet the data groups itself:
|
|
That’s it for the preparation part. Let’s move on to solve the task at hand.
Task 1
The actual data extraction happens in a join between the data table and the line numbers from the data groups CTE:
|
|
This query extracts start and end position, as well as the key and the value for each field. Since day4_data_groups
provides the beginning and end, it’s easy now to apply this to the line numbers and join this data, before splitting it up by spaces (for the key:value groups) and use SUBSTRING() to split the keys and values apart. Since each key is exactly three chars, a SUBSTRING(particle FROM 1 FOR 3)
can extract the key and a SUBSTRING(particle FROM 5)
extracts the remaining data as value. Sample data from this query:
|
|
This gives me all the key:value pairs for each scan, need to find out which scans are valud. The task says that either all 8 keys must be present, or if cid
is missing than the scan is a “North Pole” ID and should be accepted. Therefore I only need to see if 7 specific keys are in the scan:
|
|
And finally I need to know how many valid scans are in the list. The total number of scans is in day4_data_groups
, that is the cleaned up list of scans. The number of invalid scans is the last query, which becomes another CTE: day4_invalid_passports
.
|
|
And my result is:
|
|
Task 2
The second task expands on task 1, and asks to confirm more parameters. This is basically plents of text processing, slicing, dicing and more regex, depending on which subtask.
- Birth Year (byr) - four digits; at least
1920
and at most2002
: use a BETWEEN on the integer value - Issue Year (iyr) - four digits; at least
2010
and at most2020
: use aBETWEEN
on the integer value - Expiration Year (eyr) - four digits; at least 2020 and at most
2030
: use aBETWEEN
on the integer value - Height (hgt) - a number followed by either
cm
orin
: use a regex to check for numbers, followed by a string, then check the string - Height (hgt) - if
cm
, the number must be at least150
and at most193
: replace thecm
string, check the remaining value as integer - Height (hgt) - if
in
, the number must be at least59
and at most76
: replace thein
string, check the remaining value as integer - Hair Color (hcl) - a
#
followed by exactly six characters0-9
ora-f
: perfect suited for a regex checking for six characters - Eye Color (ecl) - exactly one of:
amb
blu
brn
gry
grn
hzl
oth
: use IN (more exactly:NOT IN
, because of the way the query is built) to check if the value matches the options - Passport ID (pid) - a nine-digit number, including leading zeroes: another perfect use case for a regex checking for nine digits
- Country ID (cid) - ignored, missing or not: no change, still don’t have to check this one
Here is the expanded query with all the checks included:
|
|
And I have 198
valid scans:
|
|