Crossword clue analysis

2019/02/05 --Confused codec slurs row leads to blog post!

I very much enjoy solving the crossword set by Cyclops in each issue of Private Eye magazine. I use the excellent AlphaCross app and load AcrossLite .puz files from Private Eye's website.

Private Eye's website has over 300 Cyclops puzzles archived, which I thought was enough to do some interesting analysis. They can all be downloaded with wget:

wget --no-parent -A .puz -r -nd http://www.private-eye.co.uk/pictures/crossword/download/\?O=D

Using the excellent Python puzpy library we can dump all the clues from all of the puzzles:

import os

import puz

# Dump every clue from every .puz file in the cyclops directory.
for filename in os.listdir("cyclops"):
    if filename.endswith(".puz"):
        print(filename)
        p = puz.read(os.path.join("cyclops", filename))

        numbering = p.clue_numbering()

        # Each entry is a dict; the clue text is under the 'clue' key.
        for clue in numbering.across:
            print(clue['clue'])

        for clue in numbering.down:
            print(clue['clue'])

Then we do some manual clean-up, such as filtering out "see 1ac" cross-references:

grep -i -P -v '^\(?see ?\d' cyclops.list

This gives us a corpus of 8765 Cyclops clues to work with.
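The same filter can be sketched in Python, which is handy if the clean-up is folded into the dump script. This is only an illustration: the function name and the sample lines are my own, and the pattern is simply the grep one carried over.

```python
import re

# Same pattern as the grep above: optional "(", then "see",
# an optional space, and a digit, matched case-insensitively.
CROSS_REFERENCE = re.compile(r"^\(?see ?\d", re.IGNORECASE)


def drop_cross_references(lines):
    """Return only the lines that are not 'see 1ac'-style cross-references."""
    return [line for line in lines if not CROSS_REFERENCE.search(line)]


# Toy input, not the real clue list.
sample = ["See 1ac", "(see 12 down)", "Prison unrest (4)"]
print(drop_cross_references(sample))  # ['Prison unrest (4)']
```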

So we can make a wordcloud:

[cyclops-wordcloud.png]
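A word cloud is really just a picture of word frequencies, so the numbers behind it can be approximated with the standard library. This is a rough sketch, not the tool that drew the image above: the tokenizer is crude and there is no stop-word removal, so common words like "a" and "of" will dominate on the full corpus.

```python
import re
from collections import Counter


def word_frequencies(clues, top=10):
    """Count lower-cased words across all clues and return the most common."""
    counts = Counter()
    for clue in clues:
        # Keep runs of letters and apostrophes; this drops
        # enumerations such as "(4)" and other punctuation.
        counts.update(re.findall(r"[a-z']+", clue.lower()))
    return counts.most_common(top)


# Three clues taken from the corpus, just to show the shape of the output.
sample = [
    "Prison unrest (4)",
    "Big noise behind Private Eye (6)",
    "Drop behind a human dynamo (4-4)",
]
print(word_frequencies(sample, top=1))  # [('behind', 2)]
```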

If we pipe the clue list through these commands:

 sort | uniq -d 

we can identify how often Cyclops has been lazy and re-used clues:

350 417 20% of semen applied to hair highlight (6)                                          
325 426 A perv iffy about a French johnny-come-lately (7)                                   
334 618 Big noise behind Private Eye (6)                                                    
345 484 Concealments of balls snatched by Bristol supporters? (5-3)                         
382 462 Drop behind a human dynamo (4-4)                                                    
317 391 Embrace, taking drugs and sit round university (7)                                  
318 440 Government enterprise stifles political understanding (7)                           
412 439 Kiss unusually large copper on behind (8)                                           
348 402 Labour's last uncertainty: time for a split (4)                                     
420 465 Land of Bush senior: a major power no more (4)                                      
326 456 Members backing 'the old man' and party getting suddenly dropped as a punishment (9)
316 492 365 Mostly worried about nothing following business procedure to divert motions (9) 
374 483 Openly gay and keen to produce public disapproval (6)                               
404 464 Pants and takes the line of least resistance? (6)                                   
331 414 Penny might come first while being penetrated by good bloke (7)                     
324 395 Pornographic, dual controlled model? (5)                                            
367 458 Prison unrest (4)                                                                   
465 484 Result of having come from club out of basement (5)                                 
318 403 Second nature to put topless, backless bodice on oneself? (5,3)                     
391 455 Self-conceit: can Sir Dicky miss out? (10)                                          
339 477 Snoop hard at it getting stiff (8)                                                  
357 454 Something for a religious arse steeped in drink to heave (4,2)                      
352 421 Start to wank off metalworker (senior) (5)                                          
321 437 Vain fantasist: one to catch in Private Eye (7)                                     

The three-digit numbers are the numbers of the crosswords in which each clue appeared. One clue was reused twice, appearing in three different crosswords, and crossword 484 contained two previously used clues.
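Note that sort | uniq -d only reports the repeated lines themselves, so the crossword numbers have to be tracked separately. One way to do that in Python is sketched below; the clues_by_puzzle mapping (crossword number to list of clues) and the function name are my own assumptions about how the data might be held, not something produced by the script above.

```python
from collections import defaultdict


def find_reused_clues(clues_by_puzzle):
    """Map each clue seen in more than one puzzle to its sorted puzzle numbers."""
    seen = defaultdict(set)
    for number, clues in clues_by_puzzle.items():
        for clue in clues:
            # Strip whitespace so trailing spaces do not hide a repeat.
            seen[clue.strip()].add(number)
    return {
        clue: sorted(numbers)
        for clue, numbers in seen.items()
        if len(numbers) > 1
    }


# A tiny hand-built sample using two of the repeats listed above.
sample = {
    334: ["Big noise behind Private Eye (6)"],
    367: ["Prison unrest (4)"],
    458: ["Prison unrest (4)"],
    618: ["Big noise behind Private Eye (6)"],
}
for clue, numbers in sorted(find_reused_clues(sample).items()):
    print(*numbers, clue)
```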

We also have enough text that we can attempt to build a Markov chain model. A Markov chain records, for each word (or short run of words) in the input, how likely each possible next word is, based on how often it followed in the source text. The upshot is that you can get a computer to produce novel content that mimics the style of whatever text was input.
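To make the idea concrete, here is a minimal first-order sketch in plain Python, where the next word depends only on the current one. It is an illustration of the technique under that simplifying assumption, not the code used for this post, and the function names are my own.

```python
import random
from collections import defaultdict


def build_chain(lines):
    """Map each word to the list of words seen immediately after it."""
    chain = defaultdict(list)
    for line in lines:
        words = line.split()
        # None marks both the start and the end of a line.
        for current, following in zip([None] + words, words + [None]):
            chain[current].append(following)
    return chain


def generate(chain, rng=random, max_words=30):
    """Walk the chain from the start marker until the end marker or max_words."""
    words = []
    word = rng.choice(chain[None])
    while word is not None and len(words) < max_words:
        words.append(word)
        # Followers seen more often appear more often in the list,
        # so a uniform choice reproduces the observed probabilities.
        word = rng.choice(chain[word])
    return " ".join(words)


# Toy corpus; with so little text the output will mostly echo the input.
chain = build_chain(["Prison unrest (4)", "Prison reform behind schedule (6)"])
print(generate(chain))
```

Storing repeated followers in a list is wasteful but keeps the sketch short; a real library would store counts and usually condition on more than one preceding word.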

I used another excellent Python project called markovify:

import markovify

# Get raw text as string.
with open("cyclops.list.clean") as f:
    text = f.read()

# Build the model.
text_model = markovify.NewlineText(text)

for i in range(20):
    print(text_model.make_sentence())

Which gives us output like this:

River Republican - prepare for the President?
Right sort of day for innocence
Politician's words that he might well have to make hole bigger
Overpaid bank chief liable to produce a tight session?
Electorate embraces Gordon at first appearance in Sun circulation terms, much of it
Cameron's Europe policy apparently is too much booze, with inside of penis turning green initially?
Require foreign article and end of June
Stampede here to fall behind after Gordon just started on about speed of light - stupid cretin, nutter!
Leader of Tories is implicated in email set fiasco - love the old lady - disgrace!
Laugh like a drug
Schmaltz rejected to get behind
Taking Boris's lead, needed tip - what a disaster!
Pissed member before sex check
Fully attentive like our Brian, said to be squeezed by tight chinos - it's part of Gordon Brown's outdated tax - misconceived shit
Abandoned as useless, Corbyn and his dismissive expression
House might be kept stable!
Corporation owner stuffing model in middle of geek's facial hair
Stimulating nice feel outside the church is what Eye readers got in June, so a massive descent
Stupid management relied on by Ray's cooked sole
Reject half of Australia with electronic insert
Ace body-builder's reported muscles makes you want to penetrate slag

In case it is not clear, these are not real clues that could be solved! But they are indeed distinctly Cyclopean.

If you want more, I loaded up a database with 10,000 precomputed ersatz Cyclops clues: Cyclops Clue Generator

