The flight is approaching a regional airport, but you still have to fill out a customs form. Since you are good with hacking things together, you walk around the plane and gather all the answers from all groups on the plane. There are
26 yes-or-no questions, marked from
Task 1: Count how many questions are answeres
yes in each group. Multiple
yes answers for the same question do not count. The data is provided as blocks, kind of like two days before from the passport scanners. I can re-use some of the code - which will turn out to be a major headache!
Task 2: Of course someone wrote the wrong instructions down! It’s up to you to fix the mess now. You have to find out how many groups have the same question answered by anyone in the group.
Load the data, same table structure as day 4:
And load the actual data:
This time I included a part from the end as well, which will become important later.
2148 rows in the table, but again that does not mean anything as the data is grouped:
I also will copy the CTE from day 4 which parse the data into groups, and clean it up.
I made a mistake in this CTE when I copied it from day 4: the
SELECT MAX(id) FROM day6 part to find out the last row and add an empty row I forgot to change. It stayed
SELECT MAX(id) FROM day4 for a while - I had the entire calculation for the number of answers figured out, but kept getting a result which the website reported as “too low”.
When I peeked into
day6_data I couldn’t see much, and missed that the empty line at the end was missing. Only when I debugged
day6_data_elements I found that it ends at the second last group:
fpuwosv and is not showing the last group of data. What happened is that the wrong query selected a number from “day4”, which is lower than the number of entries in the table for “day6”. Which means I got two rows with the same ID, and the ORDER BY in
day6_data sorted the empty row somewhere in the middle. Not helpful!
Once I fixed that problem I got the correct result.
day6_data_groups CTEs are the same as for day4 (with changed names), the
day6_data_elements CTE is a bit different:
I’m no longer selecting a
SUBSTRING(), but just
particle, that is the one character. And the
regexp_split_to_table() splits on '' instead of ' ‘.
Using the single characters in
particle and the position in
dg_end, doesn’t matter) I have to figure out a count over all the distinct characters for one group. PostgreSQL offers DISTINCT ON which can be used to return the first row among equal rows. For this I specify
particle as match, and then the
ORDER BY has to include the same columns as well - otherwise the same rows are not ordered together and can’t be eliminated.
This produces a list of rows where each character in a group only appears once:
At this point I don’t need the
dg_begin column anymore, and can just count the
That’s already my final result of
yes answers on the custom forms. With the wrong subselect in
day6_data I ended up with
6528, which is too low.
To solve this question I need to know the number of questions which are answered in each group, and the number of people which answered questions.
Can re-use the first three CTE from task 1, and need to further aggregate
day6_data_elements: the number of groups tells me the number of questions.
The number of
day6_data_groups shows how many people answered questions. This grouped together answers the question of how many questions got answered by all people:
The full query for task 2:
Obviously the result can only be as high as the result from task 1, in the best case. More likely it is lower: