Advent of Code 2020: "Custom Customs" - Day 6

Posted by ads' corner on Sunday, 2020-12-06
Posted in [Sql]

The flight is approaching a regional airport, but you still have to fill out a customs form. Since you are good with hacking things together, you walk around the plane and gather all the answers from all groups on the plane. There are 26 yes-or-no questions, marked from a to z.

Task 1: Count how many questions are answeres yes in each group. Multiple yes answers for the same question do not count. The data is provided as blocks, kind of like two days before from the passport scanners. I can re-use some of the code - which will turn out to be a major headache!

Task 2: Of course someone wrote the wrong instructions down! It’s up to you to fix the mess now. You have to find out how many groups have the same question answered by anyone in the group.

Preparation

Load the data, same table structure as day 4:

1
2
3
4
5
6
DROP TABLE IF EXISTS day6;

CREATE TABLE day6 (
    id      SERIAL NOT NULL PRIMARY KEY,
    value   TEXT
);

And load the actual data:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
COPY day6 (value) FROM STDIN;
qtmdwspah
sqwdamhpt
phwdaqsmt
stmdqwhap
pqawdhtms

bgsickuztovfwa
yiozauvgfsbtkwc
zygijavmtfkcuwobs
fvsuwtiadzrncboklg

...

frhdaek
fpuwosv

qm
m
smg
\.

This time I included a part from the end as well, which will become important later.

I have 2148 rows in the table, but again that does not mean anything as the data is grouped:

1
2
3
4
5
SELECT COUNT(*) FROM day6;
 count
-------
  2148
(1 row)

I also will copy the CTE from day 4 which parse the data into groups, and clean it up.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
WITH day6_data AS (
    SELECT id, value
      FROM day6
     UNION ALL
    SELECT (SELECT MAX(id) FROM day6) + 1, ''::TEXT
  ORDER BY id
),
day6_data_groups AS (
    SELECT COALESCE(LAG(id) OVER (ORDER BY id), 0) + 1 AS dg_begin,
           id - 1 AS dg_end
      FROM day6_data
     WHERE value = ''
),
day6_data_elements AS (
    SELECT dg.dg_begin,
           dg.dg_end,
           particle
      FROM day6_data_groups dg
      JOIN day6_data ON day6_data.id BETWEEN dg.dg_begin AND dg.dg_end
CROSS JOIN regexp_split_to_table(day6_data.value, '') as _(particle)
)
SELECT * FROM day6_data_elements;

I made a mistake in this CTE when I copied it from day 4: the SELECT MAX(id) FROM day6 part to find out the last row and add an empty row I forgot to change. It stayed SELECT MAX(id) FROM day4 for a while - I had the entire calculation for the number of answers figured out, but kept getting a result which the website reported as “too low”.

When I peeked into day6_data I couldn’t see much, and missed that the empty line at the end was missing. Only when I debugged day6_data_elements I found that it ends at the second last group: frhdaek + fpuwosv and is not showing the last group of data. What happened is that the wrong query selected a number from “day4”, which is lower than the number of entries in the table for “day6”. Which means I got two rows with the same ID, and the ORDER BY in day6_data sorted the empty row somewhere in the middle. Not helpful!

Once I fixed that problem I got the correct result.

The day6_data and day6_data_groups CTEs are the same as for day4 (with changed names), the day6_data_elements CTE is a bit different:

1
2
3
4
5
6
7
    SELECT dg.dg_begin,
           dg.dg_end,
           particle
      FROM day6_data_groups dg
      JOIN day6_data ON day6_data.id BETWEEN dg.dg_begin AND dg.dg_end
CROSS JOIN regexp_split_to_table(day6_data.value, '') as _(particle)

I’m no longer selecting a SUBSTRING(), but just particle, that is the one character. And the regexp_split_to_table() splits on '' instead of ' ‘.

Task 1

Using the single characters in particle and the position in dg_begin (or dg_end, doesn’t matter) I have to figure out a count over all the distinct characters for one group. PostgreSQL offers DISTINCT ON which can be used to return the first row among equal rows. For this I specify dg_begin and particle as match, and then the ORDER BY has to include the same columns as well - otherwise the same rows are not ordered together and can’t be eliminated.

1
2
3
  SELECT DISTINCT ON (dg_begin, particle) dg_begin, particle
    FROM day6_data_elements
ORDER BY dg_begin, particle

This produces a list of rows where each character in a group only appears once:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
 dg_begin | particle
----------+----------
        1 | a
        1 | d
        1 | h
        1 | m
        1 | p
        1 | q
        1 | s
        1 | t
        1 | w
        7 | a
        7 | b
        7 | c
        7 | d
        7 | f
        7 | g
        7 | i
        7 | j
        7 | k
        7 | l
        7 | m
        7 | n
        7 | o
        7 | r
        7 | s
        7 | t
        7 | u
        7 | v
        7 | w
        7 | y
        7 | z

At this point I don’t need the dg_begin column anymore, and can just count the c column:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
WITH day6_data AS (
    SELECT id, value
      FROM day6
     UNION ALL
    SELECT (SELECT MAX(id) FROM day6) + 1, ''::TEXT
  ORDER BY id
),
day6_data_groups AS (
    SELECT COALESCE(LAG(id) OVER (ORDER BY id), 0) + 1 AS dg_begin,
           id - 1 AS dg_end
      FROM day6_data
     WHERE value = ''
),
day6_data_elements AS (
    SELECT dg.dg_begin,
           dg.dg_end,
           particle
      FROM day6_data_groups dg
      JOIN day6_data ON day6_data.id BETWEEN dg.dg_begin AND dg.dg_end
CROSS JOIN regexp_split_to_table(day6_data.value, '') as _(particle)
)
  SELECT COUNT(particle) c
    FROM (
  SELECT DISTINCT ON (dg_begin, particle) dg_begin, particle
    FROM day6_data_elements
ORDER BY dg_begin, particle
         ) AS count;
1
2
3
4
  c
------
 6532
(1 row)

That’s already my final result of yes answers on the custom forms. With the wrong subselect in day6_data I ended up with 6528, which is too low.

Task 2

To solve this question I need to know the number of questions which are answered in each group, and the number of people which answered questions.

Can re-use the first three CTE from task 1, and need to further aggregate day6_data_elements: the number of groups tells me the number of questions.

1
2
3
4
5
6
7
8
day6_element_count AS (
    SELECT de.dg_begin,
           de.dg_end,
           particle,
           COUNT(*) AS count
      FROM day6_data_elements de
  GROUP BY de.dg_begin, de.dg_end, particle
)

The number of day6_data_groups shows how many people answered questions. This grouped together answers the question of how many questions got answered by all people:

1
2
3
4
5
SELECT COUNT(*) AS count
  FROM day6_element_count ec,
       day6_data_groups dg
 WHERE ec.dg_end = dg.dg_end
   AND ec.count = 1 + dg.dg_end - dg.dg_begin;

The full query for task 2:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
WITH day6_data AS (
    SELECT id, value
      FROM day6
     UNION ALL
    SELECT (SELECT MAX(id) FROM day6) + 1, ''::TEXT
  ORDER BY id
),
day6_data_groups AS (
    SELECT COALESCE(LAG(id) OVER (ORDER BY id), 0) + 1 AS dg_begin,
           id - 1 AS dg_end
      FROM day6_data
     WHERE value = ''
),
day6_data_elements AS (
    SELECT dg.dg_begin,
           dg.dg_end,
           particle
      FROM day6_data_groups dg
      JOIN day6_data ON day6_data.id BETWEEN dg.dg_begin AND dg.dg_end
CROSS JOIN regexp_split_to_table(day6_data.value, '') as _(particle)
),
day6_element_count AS (
    SELECT de.dg_begin,
           de.dg_end,
           particle,
           COUNT(*) AS count
      FROM day6_data_elements de
  GROUP BY de.dg_begin, de.dg_end, particle
)
SELECT COUNT(*) AS count
  FROM day6_element_count ec,
       day6_data_groups dg
 WHERE ec.dg_end = dg.dg_end
   AND ec.count = 1 + dg.dg_end - dg.dg_begin;

Obviously the result can only be as high as the result from task 1, in the best case. More likely it is lower:

1
2
3
4
 count
-------
  3427
(1 row)

Categories: [Sql]