Image Corpus of Malaysian PEPs

Building an image corpus of politically exposed persons (PEPs) to support facial recognition in investigations

Khairil Yusof
Feb 3, 2019
Images of Mohd Irwan Serigar bin Abdullah, in Google Photos

I’ve written before about using Google Photos for facial recognition. For it to be more useful, it needs an extensive corpus of images, so that faces in any given photograph are tagged automatically.

Malaysian ministers and senior public officials automatically tagged in a publicity photo of the East Coast Rail Link (ECRL) project

I’ve now started a personal project to slowly build a corpus of images, complete with metadata on description, source and licensing. When it becomes more substantial and useful, it will be made freely available for download online.
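As a rough illustration of what one entry in such a corpus might look like, here is a minimal sketch in Python. The description, source and licensing fields come from the project goals above; the `ImageRecord` class, the `person` and `file` fields, the folder layout and the sample values are assumptions made up for this example, not the project's actual format.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ImageRecord:
    """One photo in the corpus (hypothetical layout, for illustration only)."""
    person: str        # name of the person shown in the photo
    file: str          # path of the image within the corpus (assumed layout)
    description: str   # what the photo shows
    source: str        # where the photo was taken from
    licence: str       # licensing terms that apply to the photo


def to_json(records):
    """Serialise records to JSON text, keeping non-ASCII names readable."""
    return json.dumps([asdict(r) for r in records], ensure_ascii=False, indent=2)


# Placeholder values, not real corpus data.
example = ImageRecord(
    person="Example Official",
    file="example-official/001.jpg",
    description="Portrait from an annual report",
    source="Example agency annual report",
    licence="unknown",
)
corpus_json = to_json([example])
print(corpus_json)
```

Keeping the metadata in a plain JSON file next to the images would make the corpus easy to publish and to load from other tools, though any structured format would do.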

Government Reports as Sources of Photos

While public figures are easy to find online, image searches for senior public officials who avoid the limelight return few results, if any. Information on and photos of many senior public officials, including technocrats and special officers, are harder to find, especially after a regime change or when word of a possible corruption scandal starts to leak out.

Limited Google Image search results for Wan Ahmad Shihab Wan Ismail, Special Officer to Prime Minister Najib Razak

Colourful government annual reports are a good source not only of information for investigations, but also of photos for our facial recognition needs. We only need to extract about five or more photos of our PEP of interest.
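To keep track of who still needs more photos, a small check like the following could help. This is only a sketch: the threshold of five comes from the rule of thumb above, while the function name, the list-of-labels input and the sample names are hypothetical.

```python
from collections import Counter

MIN_PHOTOS = 5  # rule of thumb from the text: about five or more photos per person


def people_needing_photos(labels, minimum=MIN_PHOTOS):
    """Return {person: photos still needed} for everyone below the minimum.

    `labels` holds one person name per photo already collected.
    """
    counts = Counter(labels)
    return {person: minimum - n for person, n in counts.items() if n < minimum}


# Placeholder labels: six photos of one official, two of another.
labels = ["Official A"] * 6 + ["Official B"] * 2
shortfall = people_needing_photos(labels)
print(shortfall)
```

People who have no photos at all never appear in `labels`, so they would need to be tracked separately, for example from a list of names of interest.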

In the following example, there are enough photos from one government annual report to extract a small corpus of photos of a senior public official.

Extracting Faces from MARA Annual Report 2012 for facial recognition training

If you have copies of Malaysian government agency annual reports, especially those from 2012–2015 that are no longer available, please share them with Sinar Project’s Malaysian Government Document Archive project, which aims to keep a searchable public digital archive of as many government reports of interest as possible.