Advent of Code 2020: "Custom Customs" - Day 6
The flight is approaching a regional airport, but you still have to fill out a customs form. Since you are good with hacking things together, you walk around the plane and gather all the answers from all groups on the plane. There are 26 yes-or-no questions, marked from a to z.
Task 1: Count how many questions are answeres "yes" in each group. Multiple "yes" answers for the same question do not count. The data is provided as blocks, kind of like two days before from the passport scanners. I can re-use some of the code - which will turn out to be a major headache!
Task 2: Of course someone wrote the wrong instructions down! It's up to you to fix the mess now. You have to find out how many groups have the same question answered by anyone in the group.
Preparation
Load the data, same table structure as day 4:
DROP TABLE IF EXISTS day6;
CREATE TABLE day6 (
id SERIAL NOT NULL PRIMARY KEY,
value TEXT
);
And load the actual data:
COPY day6 (value) FROM STDIN;
qtmdwspah
sqwdamhpt
phwdaqsmt
stmdqwhap
pqawdhtms
bgsickuztovfwa
yiozauvgfsbtkwc
zygijavmtfkcuwobs
fvsuwtiadzrncboklg
...
frhdaek
fpuwosv
qm
m
smg
\.
This time I included a part from the end as well, which will become important later.
I have 2148 rows in the table, but again that does not mean anything as the data is grouped:
SELECT COUNT(*) FROM day6;
count
-------
2148
(1 row)
I also will copy the CTE from day 4 which parse the data into groups, and clean it up.
WITH day6_data AS (
SELECT id, value
FROM day6
UNION ALL
SELECT (SELECT MAX(id) FROM day6) + 1, ''::TEXT
ORDER BY id
),
day6_data_groups AS (
SELECT COALESCE(LAG(id) OVER (ORDER BY id), 0) + 1 AS dg_begin,
id - 1 AS dg_end
FROM day6_data
WHERE value = ''
),
day6_data_elements AS (
SELECT dg.dg_begin,
dg.dg_end,
particle
FROM day6_data_groups dg
JOIN day6_data ON day6_data.id BETWEEN dg.dg_begin AND dg.dg_end
CROSS JOIN regexp_split_to_table(day6_data.value, '') as _(particle)
)
SELECT * FROM day6_data_elements;
I made a mistake in this CTE when I copied it from day 4: the "SELECT MAX(id) FROM day6" part to find out the last row and add an empty row I forgot to change. It stayed "SELECT MAX(id) FROM day4" for a while - I had the entire calculation for the number of answers figured out, but kept getting a result which the website reported as "too low".
When I peeked into "day6_data" I couldn't see much, and missed that the empty line at the end was missing. Only when I debugged "day6_data_elements" I found that it ends at the second last group: "frhdaek" + "fpuwosv" and is not showing the last group of data. What happened is that the wrong query selected a number from "day4", which is lower than the number of entries in the table for "day6". Which means I got two rows with the same ID, and the ORDER BY in "day6_data" sorted the empty row somewhere in the middle. Not helpful!
Once I fixed that problem I got the correct result.
The "day6_data" and "day6_data_groups" CTEs are the same as for day4 (with changed names), the "day6_data_elements" CTE is a bit different:
SELECT dg.dg_begin,
dg.dg_end,
particle
FROM day6_data_groups dg
JOIN day6_data ON day6_data.id BETWEEN dg.dg_begin AND dg.dg_end
CROSS JOIN regexp_split_to_table(day6_data.value, '') as _(particle)
I'm no longer selecting a SUBSTRING(), but just "particle", that is the one character. And the regexp_split_to_table() plits on '' instead of ' '.
Task 1
Using the single characters in "particle" and the position in "dg_begin" (or "dg_end", doesn't matter) I have to figure out a count over all the distinct characters for one group. PostgreSQL offers DISTINCT ON which can be used to return the first row among equal rows. For this I specify "dg_begin" and "particle" as match, and then the ORDER BY has to include the same columns as well - otherwise the same rows are not ordered together and can't be eliminated.
SELECT DISTINCT ON (dg_begin, particle) dg_begin, particle
FROM day6_data_elements
ORDER BY dg_begin, particle
This produces a list of rows where each character in a group only appears once:
dg_begin | particle
----------+----------
1 | a
1 | d
1 | h
1 | m
1 | p
1 | q
1 | s
1 | t
1 | w
7 | a
7 | b
7 | c
7 | d
7 | f
7 | g
7 | i
7 | j
7 | k
7 | l
7 | m
7 | n
7 | o
7 | r
7 | s
7 | t
7 | u
7 | v
7 | w
7 | y
7 | z
At this point I don't need the "dg_begin" column anymore, and can just count the "c" column:
WITH day6_data AS (
SELECT id, value
FROM day6
UNION ALL
SELECT (SELECT MAX(id) FROM day6) + 1, ''::TEXT
ORDER BY id
),
day6_data_groups AS (
SELECT COALESCE(LAG(id) OVER (ORDER BY id), 0) + 1 AS dg_begin,
id - 1 AS dg_end
FROM day6_data
WHERE value = ''
),
day6_data_elements AS (
SELECT dg.dg_begin,
dg.dg_end,
particle
FROM day6_data_groups dg
JOIN day6_data ON day6_data.id BETWEEN dg.dg_begin AND dg.dg_end
CROSS JOIN regexp_split_to_table(day6_data.value, '') as _(particle)
)
SELECT COUNT(particle) c
FROM (
SELECT DISTINCT ON (dg_begin, particle) dg_begin, particle
FROM day6_data_elements
ORDER BY dg_begin, particle
) AS count;
c
------
6532
(1 row)
That's already my final result of "yes" answers on the custom forms. With the wrong subselect in "day6_data" I ended up with "6528", which is too low.
Task 2
To solve this question I need to know the number of questions which are answered in each group, and the number of people which answered questions.
Can re-use the first three CTE from task 1, and need to further aggregate "day6_data_elements": the number of groups tells me the number of questions.
day6_element_count AS (
SELECT de.dg_begin,
de.dg_end,
particle,
COUNT(*) AS count
FROM day6_data_elements de
GROUP BY de.dg_begin, de.dg_end, particle
)
The number of "day6_data_groups" shows how many people answered questions. This grouped together answers the question of how many questions got answered by all people:
SELECT COUNT(*) AS count
FROM day6_element_count ec,
day6_data_groups dg
WHERE ec.dg_end = dg.dg_end
AND ec.count = 1 + dg.dg_end - dg.dg_begin;
The full query for task 2:
WITH day6_data AS (
SELECT id, value
FROM day6
UNION ALL
SELECT (SELECT MAX(id) FROM day6) + 1, ''::TEXT
ORDER BY id
),
day6_data_groups AS (
SELECT COALESCE(LAG(id) OVER (ORDER BY id), 0) + 1 AS dg_begin,
id - 1 AS dg_end
FROM day6_data
WHERE value = ''
),
day6_data_elements AS (
SELECT dg.dg_begin,
dg.dg_end,
particle
FROM day6_data_groups dg
JOIN day6_data ON day6_data.id BETWEEN dg.dg_begin AND dg.dg_end
CROSS JOIN regexp_split_to_table(day6_data.value, '') as _(particle)
),
day6_element_count AS (
SELECT de.dg_begin,
de.dg_end,
particle,
COUNT(*) AS count
FROM day6_data_elements de
GROUP BY de.dg_begin, de.dg_end, particle
)
SELECT COUNT(*) AS count
FROM day6_element_count ec,
day6_data_groups dg
WHERE ec.dg_end = dg.dg_end
AND ec.count = 1 + dg.dg_end - dg.dg_begin;
Obviously the result can only be as high as the result from task 1, in the best case. More likely it is lower:
count
-------
3427
(1 row)
Comments
Display comments as Linear | Threaded