The flight is approaching a regional airport, but you still have to fill out a customs form. Since you are good with hacking things together, you walk around the plane and gather all the answers from all groups on the plane. There are 26
yesorno questions, marked from a
to z
.
Task 1: Count how many questions are answeres yes
in each group. Multiple yes
answers for the same question do not count. The data is provided as blocks, kind of like two days before from the passport scanners. I can reuse some of the code  which will turn out to be a major headache!
Task 2: Of course someone wrote the wrong instructions down! It’s up to you to fix the mess now. You have to find out how many groups have the same question answered by anyone in the group.
Preparation
Load the data, same table structure as day 4:


And load the actual data:


This time I included a part from the end as well, which will become important later.
I have 2148
rows in the table, but again that does not mean anything as the data is grouped:


I also will copy the CTE from day 4 which parse the data into groups, and clean it up.


I made a mistake in this CTE when I copied it from day 4: the SELECT MAX(id) FROM day6
part to find out the last row and add an empty row I forgot to change. It stayed SELECT MAX(id) FROM day4
for a while  I had the entire calculation for the number of answers figured out, but kept getting a result which the website reported as “too low”.
When I peeked into day6_data
I couldn’t see much, and missed that the empty line at the end was missing. Only when I debugged day6_data_elements
I found that it ends at the second last group: frhdaek
+ fpuwosv
and is not showing the last group of data. What happened is that the wrong query selected a number from “day4”, which is lower than the number of entries in the table for “day6”. Which means I got two rows with the same ID, and the ORDER BY in day6_data
sorted the empty row somewhere in the middle. Not helpful!
Once I fixed that problem I got the correct result.
The day6_data
and day6_data_groups
CTEs are the same as for day4 (with changed names), the day6_data_elements
CTE is a bit different:


I’m no longer selecting a SUBSTRING()
, but just particle
, that is the one character. And the regexp_split_to_table()
splits on '' instead of ' ‘.
Task 1
Using the single characters in particle
and the position in dg_begin
(or dg_end
, doesn’t matter) I have to figure out a count over all the distinct characters for one group. PostgreSQL offers DISTINCT ON which can be used to return the first row among equal rows. For this I specify dg_begin
and particle
as match, and then the ORDER BY
has to include the same columns as well  otherwise the same rows are not ordered together and can’t be eliminated.


This produces a list of rows where each character in a group only appears once:


At this point I don’t need the dg_begin
column anymore, and can just count the c
column:




That’s already my final result of yes
answers on the custom forms. With the wrong subselect in day6_data
I ended up with 6528
, which is too low.
Task 2
To solve this question I need to know the number of questions which are answered in each group, and the number of people which answered questions.
Can reuse the first three CTE from task 1, and need to further aggregate day6_data_elements
: the number of groups tells me the number of questions.


The number of day6_data_groups
shows how many people answered questions. This grouped together answers the question of how many questions got answered by all people:


The full query for task 2:


Obviously the result can only be as high as the result from task 1, in the best case. More likely it is lower:

