Advent of Code 2020: "Passport Processing" - Day 4
The challenge for day 4 is more suited for databases. The input is fuzzy, but with a CTE or two I should be able to clean it up. Solved the first part already with the kid earlier, but we decided to skip part two: mainly because it involved more complex string operations (best suited for regex operations), which is something the kid hasn't learned yet.
Task 1: The airport is busy, and the passport scanners do not function correctly. Wondering why every airport seems to be the same mess, but oh well. Anyway, being someone who can do some IT support they ask you to basically hack this thing on the fly and implement passport verification. If that would be possible on a real-world airport ... In addition to all the IT problems there, some people arrive with an ID which was issued by some "North Pole" organization which is not an official government body. But because everyone is waiting in line the people in charge decide to accept this ID as well. Only problem: this ID does not have a "Country ID" code. Your task, if you accept it: check the input data and if only the cid (Country ID) is missing, then accept this non-passport as valid document.
Task 2: Lines are getting longer and longer, and there are still problems with some passports. Your task is to apply additional validation on the fuzzy data which comes from the scanners. The task does not say what happens to people with invalid ID documents, but one can only assume that they cannot board the plane.
Preparation
As usual I read the raw data into a table. Like yesterday I also create an ID column along the way. Verification for the ID column was posted in yesterday's post, and I'm not going to repeat the steps here.
DROP TABLE IF EXISTS day4;
CREATE TABLE day4 (
id SERIAL NOT NULL PRIMARY KEY,
value TEXT
);
Loading the data:
COPY day4 (value) FROM STDIN;
eyr:2028 iyr:2016 byr:1995 ecl:oth
pid:543685203 hcl:#c0946f
hgt:152cm
cid:252
hcl:#733820 hgt:155cm
iyr:2013 byr:1989 pid:728471979
ecl:grn eyr:2022
hgt:171cm
iyr:2013 pid:214368857 hcl:#cfa07d byr:1986 eyr:2028 ecl:grn
hgt:167cm cid:210 ecl:brn pid:429131951 hcl:#cfa07d eyr:2029 iyr:2010
byr:1945
hcl:#888785 iyr:2015
hgt:170cm pid:893805464 ecl:amb byr:1966 eyr:2028
hgt:170cm ecl:amb
hcl:#c0946f eyr:2020 iyr:2016 pid:725010548
byr:1928
...
\.
I have 1138 entries in the table, but this says nothing about the number of scanned passports and other ID cards: some lines are empty, some data groups are in one line, some in multiple lines:
SELECT COUNT(*) FROM day4;
count
-------
1138
(1 row)
As a preparation I somehow need to cleanup that data. I can go over the data with a Windowing Function, use LAG() to look back into the table and find everything before an empty line. That approach has just a small problem: the last line is not empty, therefore I miss the last group of data. As I'm not allowed to manipulate the data (it's read-only), I have to make sure I have an empty line at the end. I'm using a UNION Select which adds an empty line at the end. Since I have an identifier (the ID column), I add the empty line with the highest ID + 1 from the data table:
SELECT id, value
FROM day4
UNION ALL
SELECT (SELECT MAX(id) FROM day4) + 1, ''::TEXT
ORDER BY id;
To have everything in one query, I move this into a first stage of a CTE:
WITH day4_data AS (
SELECT id, value
FROM day4
UNION ALL
SELECT (SELECT MAX(id) FROM day4) + 1, ''::TEXT
ORDER BY id
)
SELECT * FROM day4_data;
Good, now I have the groups, apply the LAG() and extract each group. The query extracts the start and end line numbers for each data group, but not yet the data groups itself:
WITH day4_data AS (
SELECT id, value
FROM day4
UNION ALL
SELECT (SELECT MAX(id) FROM day4) + 1, ''::TEXT
ORDER BY id
),
day4_data_groups AS (
SELECT COALESCE(LAG(id) OVER (ORDER BY id), 0) + 1 AS dg_begin,
id - 1 AS dg_end
FROM day4_data
WHERE value = ''
)
SELECT * FROM day4_data_groups;
That's it for the preparation part. Let's move on to solve the task at hand.
Task 1
The actual data extraction happens in a join between the data table and the line numbers from the data groups CTE:
WITH day4_data AS (
SELECT id, value
FROM day4
UNION ALL
SELECT (SELECT MAX(id) FROM day4) + 1, ''::TEXT
ORDER BY id
),
day4_data_groups AS (
SELECT COALESCE(LAG(id) OVER (ORDER BY id), 0) + 1 AS dg_begin,
id - 1 AS dg_end
FROM day4_data
WHERE value = ''
),
day4_data_elements AS (
SELECT dg.dg_begin,
dg.dg_end,
SUBSTRING(particle FROM 1 FOR 3) AS key,
SUBSTRING(particle FROM 5) AS value
FROM day4_data_groups dg
JOIN day4_data ON day4_data.id BETWEEN dg.dg_begin AND dg.dg_end
CROSS JOIN regexp_split_to_table(day4_data.value, ' ') as _(particle)
)
SELECT * FROM day4_data_elements;
This query extracts start and end position, as well as the key and the value for each field. Since "day4_data_groups" provides the beginning and end, it's easy now to apply this to the line numbers and join this data, before splitting it up by spaces (for the key:value groups) and use SUBSTRING() to split the keys and values apart. Since each key is exactly three chars, a SUBSTRING(particle FROM 1 FOR 3) can extract the key and a SUBSTRING(particle FROM 5) extracts the remaining data as value. Sample data from this query:
dg_begin | dg_end | key | value
----------+--------+-----+------------
1 | 4 | eyr | 2028
1 | 4 | iyr | 2016
1 | 4 | byr | 1995
1 | 4 | ecl | oth
1 | 4 | pid | 543685203
1 | 4 | hcl | #c0946f
1 | 4 | hgt | 152cm
1 | 4 | cid | 252
6 | 8 | hcl | #733820
6 | 8 | hgt | 155cm
6 | 8 | iyr | 2013
6 | 8 | byr | 1989
6 | 8 | pid | 728471979
6 | 8 | ecl | grn
6 | 8 | eyr | 2022
This gives me all the key:value pairs for each scan, need to find out which scans are valud. The task says that either all 8 keys must be present, or if "cid" is missing than the scan is a "North Pole" ID and should be accepted. Therefore I only need to see if 7 specific keys are in the scan:
SELECT dg.dg_end
FROM day4_data_groups dg,
(VALUES ('byr'), ('iyr'), ('eyr'), ('hgt'), ('hcl'), ('ecl'), ('pid')) as required(key)
WHERE NOT EXISTS(SELECT
FROM day4_data_elements de
WHERE dg.dg_end = de.dg_end
AND de.key = required.key)
GROUP BY dg.dg_end
And finally I need to know how many valid scans are in the list. The total number of scans is in "day4_data_groups", that is the cleaned up list of scans. The number of invalid scans is the last query, which becomes another CTE: "day4_invalid_passports".
SELECT (SELECT COUNT(*) FROM day4_data_groups) - (SELECT COUNT(*) FROM day4_invalid_passports);
And my result is:
?column?
----------
256
(1 row)
Task 2
The second task expands on task 1, and asks to confirm more parameters. This is basically plents of text processing, slicing, dicing and more regex, depending on which subtask.
- Birth Year (byr) - four digits; at least 1920 and at most 2002: use a BETWEEN on the integer value
- Issue Year (iyr) - four digits; at least 2010 and at most 2020: use a BETWEEN on the integer value
- Expiration Year (eyr) - four digits; at least 2020 and at most 2030: use a BETWEEN on the integer value
- Height (hgt) - a number followed by either cm or in: use a regex to check for numbers, followed by a string, then check the string
- Height (hgt) - if cm, the number must be at least 150 and at most 193: replace the 'cm' string, check the remaining value as integer
- Height (hgt) - if in, the number must be at least 59 and at most 76: replace the 'in' string, check the remaining value as integer
- Hair Color (hcl) - a # followed by exactly six characters 0-9 or a-f: perfect suited for a regex checking for six characters
- Eye Color (ecl) - exactly one of: amb blu brn gry grn hzl oth: use IN (more exactly: NOT IN, because of the way the query is built) to check if the value matches the options
- Passport ID (pid) - a nine-digit number, including leading zeroes: another perfect use case for a regex checking for nine digits
- Country ID (cid) - ignored, missing or not: no change, still don't have to check this one
Here is the expanded query with all the checks included:
WITH day4_data AS (
SELECT id, value
FROM day4
UNION ALL
SELECT (SELECT MAX(id) FROM day4) + 1, ''::TEXT
ORDER BY id
),
day4_data_groups AS (
SELECT COALESCE(LAG(id) OVER (ORDER BY id), 0) + 1 AS dg_begin,
id - 1 AS dg_end
FROM day4_data
WHERE value = ''
),
day4_data_elements AS (
SELECT dg.dg_end,
SUBSTRING(particle FROM 1 FOR 3) AS key,
SUBSTRING(particle FROM 5) AS value
FROM day4_data_groups dg,
day4_data d,
regexp_split_to_table(d.value, ' ') as _(particle)
WHERE d.id BETWEEN dg.dg_begin AND dg.dg_end
),
day4_invalid_passports AS (
SELECT dg.dg_end
FROM day4_data_groups dg,
(VALUES ('byr'), ('iyr'), ('eyr'), ('hgt'), ('hcl'), ('ecl'), ('pid')) as required(key)
WHERE NOT EXISTS(SELECT
FROM day4_data_elements de
WHERE dg.dg_end = de.dg_end
AND de.key = required.key)
OR EXISTS (SELECT
FROM day4_data_elements de
WHERE dg.dg_end = de.dg_end
AND de.key = 'byr'
AND CAST(de.value AS INT4) NOT BETWEEN 1920 AND 2002)
OR EXISTS (SELECT
FROM day4_data_elements de
WHERE dg.dg_end = de.dg_end
AND de.key = 'iyr'
AND CAST(de.value AS INT4) NOT BETWEEN 2010 AND 2020)
OR EXISTS (SELECT
FROM day4_data_elements de
WHERE dg.dg_end = de.dg_end
AND de.key = 'eyr'
AND CAST(de.value AS INT4) NOT BETWEEN 2020 AND 2030)
OR EXISTS (SELECT
FROM day4_data_elements de
WHERE dg.dg_end = de.dg_end
AND de.key = 'hgt'
AND SUBSTRING(de.value, '^\d+(.+)$') NOT IN ('cm', 'in'))
OR EXISTS (SELECT
FROM day4_data_elements de
WHERE dg.dg_end = de.dg_end
AND de.key = 'hgt'
AND de.value ~ '^\d+cm$'
AND CAST(REPLACE(de.value, 'cm', '') AS INT4) NOT BETWEEN 150 AND 193)
OR EXISTS (SELECT
FROM day4_data_elements de
WHERE dg.dg_end = de.dg_end
AND de.key = 'hgt'
AND de.value ~ '^\d+in$'
AND CAST(REPLACE(de.value, 'in', '') AS INT4) NOT BETWEEN 59 AND 76)
OR EXISTS (SELECT
FROM day4_data_elements de
WHERE dg.dg_end = de.dg_end
AND de.key = 'hcl'
AND de.value !~ '^#[0-9a-f]{6}$')
OR EXISTS (SELECT
FROM day4_data_elements de
WHERE dg.dg_end = de.dg_end
AND de.key = 'ecl'
AND de.value NOT IN ('amb', 'blu', 'brn', 'gry', 'grn', 'hzl', 'oth'))
OR EXISTS (SELECT
FROM day4_data_elements de
WHERE dg.dg_end = de.dg_end
AND de.key = 'pid'
AND de.value !~ '^[0-9]{9}$')
GROUP BY dg.dg_end
)
SELECT (SELECT COUNT(*) FROM day4_data_groups) - (SELECT COUNT(*) FROM day4_invalid_passports);
And I have 198 valid scans:
?column?
----------
198
(1 row)
Comments
Display comments as Linear | Threaded