MagSearch
Find an article in a magazine

Introduction

As a regular reader of the Raspberry Pi magazines MagPi and HackSpace, I miss a good search facility. Sometimes you know that you read about an interesting project in the past, but you cannot find it again without browsing through all the magazines (and even then you may not find it). Or you want to learn about a new subject and want to know whether there are any interesting articles or tutorials on it. Judging by several questions in the ‘Your Letters’ section of the MagPi, other readers also struggle to find the article they are looking for (see e.g. MagPi97 page 92).

There have been a few attempts by readers to create an index. Currently the best is from ‘HugoD’ on the Raspberry Pi forum, who has made a Word document with titles and one-line descriptions per article for MagPi issues 1 up to (at least) 99. However, titles often contain little information and do not always give a good description of the content.

This led to the idea of building an index of keywords from the text in the digital magazines. With the Linux application pdftotext (https://www.xpdfreader.com), a PDF file can be converted to plain text, page by page. Each page of text can subsequently be processed by a script.
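The page-by-page conversion could be driven from Python roughly as follows. This is a minimal sketch, not the code used in MagSearch: the function names are hypothetical, and it assumes that pdftotext is on the PATH. The options -f and -l select the first and last page to convert, and "-" as the output file sends the text to standard output.

```python
import subprocess

def pdftotext_command(pdf_path, page):
    """Build the pdftotext call that extracts a single page to stdout."""
    # -f / -l: first and last page to convert; "-" means write to stdout.
    return ["pdftotext", "-f", str(page), "-l", str(page), pdf_path, "-"]

def page_text(pdf_path, page):
    """Return the plain text of one page (requires pdftotext to be installed)."""
    result = subprocess.run(pdftotext_command(pdf_path, page),
                            capture_output=True, text=True, check=True)
    return result.stdout
```

Calling page_text("MagPi47.pdf", 58) would then return the text of page 58, provided that file exists in the current directory.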

Not only articles are processed but also program listings and text in images and figures. It is therefore also possible to find an article based on keywords occurring in the code of a program. This offers more flexibility in searching compared to titles only.

PDF to text conversion is not flawless. Sometimes words are incomplete, e.g. the first character is missing. Sometimes part of a word is encoded. It is also possible that the text contains words which are not visible in the PDF (hidden behind an image, or possibly left-overs from PDF processing). The incomplete words are not found by the search option of a PDF reader either, while the hidden words are found by the PDF reader as well, so these problems do not seem to be caused by the pdftotext application.

The text is processed in the following steps. First the text is stripped of any special characters (+, -, @, ...). Because this would make it impossible to find C++, C++ is first converted to Cpp. Next, the page is converted to a list of lower-case words. Words with fewer than 3 characters are removed, except for some reserved words such as ‘c’, ‘3d’ or ‘ai’. These reserved words are defined in the file parser.yml. Words which carry no meaning of their own (called ‘stopwords’) are removed as well; these are also defined in parser.yml. Finally a frequency table is built, which indicates how many times a specific word occurs on a page of text. The assumption is that the higher the frequency, the more important the word. Parts of the code have been taken from the Programming Historian website, which has been the inspiration for building this application.
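The steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual MagSearch code: the function name is hypothetical, and the two small word sets merely stand in for the much longer lists kept in parser.yml.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the lists defined in parser.yml.
RESERVED = {"c", "3d", "ai"}
STOPWORDS = {"the", "and", "for", "with", "this", "that"}

def page_frequencies(text, reserved=RESERVED, stopwords=STOPWORDS, min_len=3):
    """Turn the raw text of one page into a word -> count table."""
    # Protect C++ before the special characters are stripped.
    text = re.sub(r"c\+\+", "Cpp", text, flags=re.IGNORECASE)
    # Replace every run of non-alphanumeric characters by a space.
    text = re.sub(r"[\W_]+", " ", text)
    words = text.lower().split()
    # Drop short words (unless reserved) and stopwords.
    kept = [w for w in words
            if (len(w) >= min_len or w in reserved) and w not in stopwords]
    return Counter(kept)

freq = page_frequencies("The C++ tutorial: a C program for the Pi, with AI and 3D. C++ again!")
```

In this example ‘cpp’ is counted twice, ‘c’, ‘ai’ and ‘3d’ survive because they are reserved, and ‘pi’ is dropped because it has fewer than 3 characters and is not in the reserved set used here.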

Download

The free software can be downloaded as a .zip file from this website. The .zip file can also be obtained from my GitHub site. It consists of several Python programs, YAML files and the databases for the MagPi (up to number 127) and HackSpace (up to number 64) magazines. With this software one can enter a combination of keywords (currently only AND combinations) and display the search results (magazines and pages). If you want to open a result directly, the digital version of that magazine has to be present on the computer. The digital versions of the magazines are not included in this download, however, so you have to download them yourself from the MagPi and/or HackSpace website.

The zip file contains at least the following files:

Programs in the .zip file
File              Description
mag_gui.py        Main Tk GUI application
mag_class.py      Class definitions for making databases and searching
config.yml        Configuration file with database names and paths
parser.yml        Lists of stopwords (English and Dutch) and reserved words
magpi.db          Database file for the MagPi magazines (dictionary)
hackspace.db      Database file for the HackSpace magazines (dictionary)
manual.pdf        Short help file, can be called from the GUI
about_image.png   About image
LICENSE           License file (GNU General Public License)
README            Short description of the program

Installation

The software is primarily intended for Linux systems and has been tested on a Raspberry Pi 4. Although searching in magazines will also work on Windows systems, it is currently not possible (disabled) to make or update the databases on Windows. This is primarily caused by the application pdftotext. The installation of the software is described for the Raspberry Pi.

Before starting the installation, it is a good habit to update the system first:

    sudo apt-get update
    sudo apt-get upgrade

Python needs a few modules: PyPDF2, PyYaml and keyboard. The first two can be installed using the Thonny IDE or using pip3. The module keyboard needs to be installed as root with:

    sudo pip3 install keyboard

For displaying a magazine, the application evince is used. This PDF viewer can be installed via the Linux desktop (Preferences - Add/remove Software - evince) or with:

    sudo apt-get install evince

The command line application pdftotext (based on Poppler) is installed via Preferences - Add/remove Software - pdftotext or with:

    sudo apt-get install pdftotext

Possibly this application has already been installed on your system. If apt-get cannot find a package called pdftotext, note that the tool is normally shipped as part of the poppler-utils package.

Create a directory /home/pi/MagSearch and copy the .zip file into this directory. Unpack the zip file and copy the files from its MagSearch folder (MagSearch-main if downloaded from GitHub) into this directory.

Optionally, place the downloaded PDF magazines in the corresponding subfolders (e.g. /home/pi/MagPi or /home/pi/HackSpace). The paths to these folders are defined in config.yml under magazines - [mag] - directory. Note that this is not required for searching; it is only needed for opening the results and for making and/or updating the databases.

Open the command line in /home/pi/MagSearch and run python3 mag_gui.py or ./mag_gui.py. It might be necessary to first make the file executable with chmod +x mag_gui.py.

Graphical User Interface

The GUI consists of a menu bar at the top, a text window, a search field and some radio buttons.

Figure (gui.png): The Graphical Interface of MagSearch

Menu bar
The menu bar has the options ‘File’, ‘Magazine’ and ‘Help’.

Radio buttons
The radio buttons determine how a keyword is matched. The selected option applies to all keywords simultaneously. There are 3 options: ‘Exact’, ‘Start’ and ‘Any’ (see Searching).

Search field
The search field can be used to search the magazines for an AND combination of keywords or to open a magazine at a specific page. Keywords are separated by spaces. A keyword can optionally be preceded by an exclamation mark (!) to force an exact match with a word, or by a dollar sign ($) to force a match at the beginning of a word. This overrules the option set by the radio buttons. The search starts when you press the ‘Enter’ key or click the right mouse button. All magazines will be searched for the keywords. By preceding the keywords with ‘@’ followed by a number, the search can be restricted to a single magazine. For example, assuming that the magazine MagPi has been selected, the command ‘@100 !c tutorial’ will scan MagPi100 for the keywords ‘c’ and ‘tutorial’, with an exact match for ‘c’. To open an individual magazine at a certain page, enter ‘#number page’ in the search field.
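The syntax of the search field can be illustrated with a small parser. This is only a sketch of how such an entry might be interpreted: the function name and the dictionary it returns are hypothetical, and opening at page 1 when no page is given after ‘#number’ is an assumption.

```python
def parse_query(query, default_mode="start"):
    """Split a search-field entry into an action and its arguments."""
    tokens = query.split()
    # '#number page' opens a magazine instead of searching.
    if tokens and tokens[0].startswith("#"):
        issue = int(tokens[0][1:])
        page = int(tokens[1]) if len(tokens) > 1 else 1  # assumed default
        return {"action": "open", "issue": issue, "page": page}

    issue = None
    # '@number' restricts the search to a single magazine.
    if tokens and tokens[0].startswith("@"):
        issue = int(tokens[0][1:])
        tokens = tokens[1:]

    keywords = []
    for token in tokens:
        if token.startswith("!"):      # exact match for this keyword
            keywords.append((token[1:].lower(), "exact"))
        elif token.startswith("$"):    # match at the start of a word
            keywords.append((token[1:].lower(), "start"))
        else:                          # follow the radio button setting
            keywords.append((token.lower(), default_mode))
    return {"action": "search", "issue": issue, "keywords": keywords}
```

With this sketch, the entry ‘@100 !c tutorial’ yields a search restricted to issue 100 with the keywords (‘c’, exact) and (‘tutorial’, start), while ‘#47 12’ yields a request to open issue 47 at page 12.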

Text field
The text field will display the results of the search. Basically the result is a series of lines with one magazine per line. On a line the page numbers are indicated in bold with the total number of matches on that page between brackets. Right-clicking with the mouse will open a window displaying all keywords on that page which occur more than once. Left-clicking will open the magazine at that page (assuming that the digital versions of the magazine are available).

Database management
This is activated with ‘File - Database’, which opens a small window. Important: database management is currently only possible on Linux systems; on Windows systems it is disabled. For database management the digital versions of the magazines have to be available on the system. A new magazine is added to the database by placing it in the corresponding folder, entering the magazine number in the ‘Start’ field and pressing ‘Add’. Similarly, a series of magazines can be added by using the ‘Start’ and ‘End’ fields. The contents of an existing database can only be overwritten when the corresponding option is checked. When database management is started, it checks whether any new digital magazines have been added to the folders and, if so, updates the numbers in ‘Start’ and ‘End’. The status bar will then indicate that the database is not up to date. Press ‘Add’ to update.
Figure (database.png)
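The check for new magazines could work along the following lines. This is a sketch under assumptions: the function name is hypothetical, the magazines are assumed to be stored as base name plus number with a .pdf extension, and the set of issue numbers already in the database is passed in directly.

```python
import re

def new_issues(filenames, base, indexed):
    """Return the issue numbers found in a folder but missing from the database."""
    # Match e.g. MagPi01.pdf or MagPi127.pdf and capture the number.
    pattern = re.compile(rf"^{re.escape(base)}(\d+)\.pdf$", re.IGNORECASE)
    found = set()
    for name in filenames:
        match = pattern.match(name)
        if match:
            found.add(int(match.group(1)))
    return sorted(found - set(indexed))

missing = new_issues(["MagPi01.pdf", "MagPi126.pdf", "MagPi127.pdf", "notes.txt"],
                     "MagPi", indexed=[1])
```

The first and last element of the returned list would then be natural values for the ‘Start’ and ‘End’ fields. In practice the file names would come from something like os.listdir on the magazine folder.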

Finding articles

The MagSearch utility does not generate a table of contents. For each magazine and each page it generates a list of (almost all) relevant words and how often they occur on that page. Finding an article therefore depends on choosing some relevant keywords.
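A per-magazine, per-page word list maps naturally onto nested dictionaries. The file list above describes the database files as dictionaries, but their exact layout is not documented here, so the nesting below (issue number, then page number, then word) and the helper name are assumptions made for illustration.

```python
from collections import Counter

def add_page(database, issue, page, words):
    """Store the word counts of one page under database[issue][page]."""
    database.setdefault(issue, {})[page] = dict(Counter(words))
    return database

# Toy data, not taken from a real magazine.
db = {}
add_page(db, 47, 58, ["c", "introduction", "part", "c"])
add_page(db, 47, 59, ["compiler", "c"])
```

After these two calls, db[47][58]["c"] holds the number of times ‘c’ occurs on page 58 of issue 47.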

Suppose we want to find all articles of the ‘C learning course’. The character ‘C’ is a reserved word and so will be present in the database. However, ‘C’ does not always mean the language C. The character can also be used for numbering or annotation, or occur with a different meaning, as in ‘control C’. There are also a lot of words which either start with a ‘C’ or contain the character ‘C’. So at the very least we only want to find exact matches of ‘C’. This can be done by putting an exclamation mark before the word or character: !c (see Searching). But we definitely need more keywords to find the course.

There are about 10 consecutive articles about programming in C (MagPi47 to MagPi58). By means of the database I have determined which keywords occur on all starting pages of these articles. This yielded the following list:

This gives a suggestion on how to choose your keywords. Keywords belonging to the author are not very helpful (unless you know the person very well), except maybe the name (e.g. ‘mike’, ‘cook’). The keywords at the top of the page are very interesting because these are also used in the Table of Contents at the start of the magazine. Categories used by the MagPi are for example ‘Cover Feature’, ‘Regulars’, ‘Project Showcases’, ‘Your Projects’, ‘Tutorials’, ‘The Big Feature’, ‘Reviews’, ‘Community’. If you know (or guess) the category of your article, it is a good idea to include it in the search. Finally, add some good keywords which describe the content of the article. In this case ‘c’ and ‘introduction’ give a good description, while the word ‘part’ indicates that it is part of a series.
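Determining which keywords occur on all starting pages of a series amounts to a set intersection over the word tables of those pages. The sketch below shows the idea; the function name is hypothetical and the two pages are invented toy data, not the real frequency tables of the C course.

```python
def common_keywords(pages):
    """Return the words that occur on every one of the given pages."""
    word_sets = [set(freq) for freq in pages]
    if not word_sets:
        return set()
    return set.intersection(*word_sets)

# Invented word tables for two starting pages of a series.
start_pages = [
    {"c": 5, "introduction": 1, "part": 1, "pointers": 3},
    {"c": 4, "introduction": 1, "part": 1, "strings": 2},
]
shared = common_keywords(start_pages)
```

Words that are specific to one instalment (here ‘pointers’ and ‘strings’) drop out, and what remains are candidates for finding the whole series.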

Tips for other keywords

Searching

A search is started by entering a number of space separated keywords in the entry field (Search) of the GUI and pressing ‘Return’. All keywords are connected by means of an AND function. By default the search option is set to ‘Start’ which means that keywords are matched at the start of a word. Instead of ‘Return’ you can also press the right mouse button in the entry field.

There are currently three options to match keywords: a) exact, b) start and c) any. These search options can be selected in the GUI and they apply to all the keywords in the entry field. It is possible to overrule these options for a single keyword. We have already seen that an exclamation mark before the keyword forces an exact match for that keyword. In the same way, a ‘$’ forces a match at the start of a word.
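The three match options and the AND combination can be sketched as follows. Several points are assumptions rather than documented behaviour: that ‘any’ means a match anywhere inside a word, that the number of hits on a page is the sum of the counts of all matching words, and the function names themselves. Each keyword is given as a (keyword, mode) pair.

```python
def word_matches(keyword, word, mode):
    """Check one indexed word against one keyword."""
    if mode == "exact":
        return word == keyword
    if mode == "start":
        return word.startswith(keyword)
    return keyword in word  # "any": assumed to mean anywhere in the word

def page_hits(freq, keywords):
    """Return the total hits on a page, or 0 if any keyword is missing (AND)."""
    total = 0
    for keyword, mode in keywords:
        hits = sum(n for word, n in freq.items()
                   if word_matches(keyword, word, mode))
        if hits == 0:
            return 0  # one keyword without a match fails the whole page
        total += hits
    return total

# Toy word table for a single page.
page = {"c": 3, "code": 2, "tutorial": 1, "basic": 1}
```

On this toy page an exact search for ‘c’ only counts the word ‘c’ itself, a start search also counts ‘code’, and an any search additionally counts ‘basic’.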

A line of search results consists of the magazine number (e.g. MagPi47) in blue followed by a series of pages (bold) and total number of hits (between brackets). The result with the most pages is at the top of the list.

Figure (searchresult.png)
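The ordering of the result lines can be sketched as a sort on the number of matching pages per magazine. The function name, the input layout (issue number mapped to a page-to-hits dictionary) and the plain-text line format are assumptions; the GUI itself uses colour and bold text instead.

```python
def format_results(results, base="MagPi"):
    """Return one line per magazine, the magazine with the most pages first."""
    ordered = sorted(results.items(), key=lambda item: len(item[1]), reverse=True)
    lines = []
    for issue, pages in ordered:
        # Page number followed by the number of hits in brackets.
        cells = " ".join(f"{page}({hits})" for page, hits in sorted(pages.items()))
        lines.append(f"{base}{issue:02d}: {cells}")
    return lines

# Toy search result: issue -> {page: hits}.
lines = format_results({47: {58: 4}, 48: {57: 2, 56: 3}})
```

Here issue 48 comes first because it has two matching pages against one for issue 47.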

Left-clicking on the blue magazine number will open the magazine at page one. Left-clicking on any of the page numbers will open the PDF at that page. Right-clicking on a page number will open a window which shows all keywords with more than 1 hit on that page. The words are sorted by decreasing number of occurrences, so the most important words come first. This gives a crude idea of the content of that page, although it is not very readable.
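The keyword overview shown on a right-click can be reproduced from a page's frequency table with a filter and a sort. A minimal sketch, assuming a word-to-count dictionary per page; the function name and the alphabetical tie-breaking are not taken from MagSearch.

```python
def page_keywords(freq, min_hits=2):
    """List (word, count) pairs with at least min_hits, most frequent first."""
    frequent = [(word, n) for word, n in freq.items() if n >= min_hits]
    # Sort by decreasing count; ties are broken alphabetically (an assumption).
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))

overview = page_keywords({"c": 5, "pointer": 3, "array": 3, "part": 1})
```

The word ‘part’ is left out because it occurs only once on this toy page.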

The advantage of MagSearch is that you can quickly open a magazine at a page from the search result. In this way you can immediately see whether it is what you were looking for or not, which allows for efficient browsing through the magazines. To do so, however, you must have downloaded and stored the digital magazines on your computer (see Installation). The magazine files must follow a specific naming convention: a base name (specified in config.yml), followed by the issue number. Numbers 1 to 9 have a leading 0, e.g. MagPi01, MagPi09, MagPi10, MagPi100, ...
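The naming convention corresponds to zero-padding the number to at least two digits. The helper names below are hypothetical, and the .pdf suffix is an assumption about how the files are stored.

```python
from pathlib import Path

def magazine_name(base, number):
    """Return the expected name without extension, e.g. MagPi01 or MagPi100."""
    return f"{base}{number:02d}"

def magazine_path(directory, base, number, suffix=".pdf"):
    """Return the expected location of a magazine file (suffix is assumed)."""
    return Path(directory) / (magazine_name(base, number) + suffix)
```

For example, magazine_name("MagPi", 9) gives MagPi09, while issue 100 simply becomes MagPi100.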

Sometimes you see a series of increasing page numbers in the search result. This indicates a multi-page article (which could be very interesting). Pressing control-B or the middle mouse button selects the longest series per magazine. By default the series is allowed to skip one page (defined in parser.yml). When more than one page is skipped, the pages are no longer considered a continuous series.

Figure (enter.png): Enter or right mouse button
Figure (ctrlb.png): ctrl-B or middle mouse button
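Selecting the longest series of pages can be sketched as a single pass over the sorted page numbers. The function name is hypothetical, and the interpretation of the skip setting (at most max_skip pages may be missing between two neighbouring pages of a series) is an assumption about the value defined in parser.yml.

```python
def longest_series(pages, max_skip=1):
    """Return the longest run of pages with at most max_skip pages skipped between neighbours."""
    best, current = [], []
    for page in sorted(pages):
        # Start a new run when too many pages lie between this page and the previous one.
        if current and page - current[-1] - 1 > max_skip:
            current = []
        current.append(page)
        if len(current) > len(best):
            best = list(current)
    return best

series = longest_series([12, 58, 59, 61, 62, 70])
```

With the default setting, page 60 may be skipped, so pages 58 to 62 form one series, while the isolated pages 12 and 70 are left out.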
Sometimes it is also helpful to look at the results with the largest hit numbers (the numbers between brackets).

Happy hunting.

8bitBBCFan - contact vzon2005@gmail.com
Last updated: 15 March 2023
