SciCAR 2024

Finding leads in big data lakes: State of the art of the FollowTheMoney toolkit
28.09.2024, Seminarraum I
Language: English

“FollowTheMoney” is three things: a research concept, the name of a newsroom in the Netherlands, and a technical implementation of the concept. This workshop covers the third one: the toolkit around the Python library “followthemoney”, originally developed by the OCCRP as the foundation for Aleph. It lets data journalists and scientists work with structured data made up of entities such as Persons and Companies and the relations between them. Deduplicating similar data points, and detecting similarities based on statistical regression models, plays an important role.
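To make the entity idea concrete, here is a minimal sketch in plain Python. It deliberately does not use the actual followthemoney API: the `Entity` class, the `candidate_duplicates` helper, the sample data and the 0.85 threshold are all hypothetical names and values chosen for illustration. A simple string similarity from the standard library stands in for the statistical regression models mentioned above.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from itertools import combinations


@dataclass
class Entity:
    """Hypothetical stand-in for an entity: an id, a schema, multi-valued properties."""
    id: str
    schema: str  # e.g. "Person", "Company", or a relation such as "Ownership"
    properties: dict[str, list[str]] = field(default_factory=dict)

    @property
    def name(self) -> str:
        names = self.properties.get("name", [])
        return names[0] if names else ""


def normalize(name: str) -> str:
    """Lowercase, drop dots and commas, collapse whitespace."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def name_similarity(a: Entity, b: Entity) -> float:
    """Naive similarity score in [0, 1]; NOT the toolkit's regression model."""
    if a.schema != b.schema or not a.name or not b.name:
        return 0.0
    return SequenceMatcher(None, normalize(a.name), normalize(b.name)).ratio()


def candidate_duplicates(entities: list[Entity], threshold: float = 0.85):
    """Return (id, id, score) for every pair scoring at or above the threshold."""
    pairs = []
    for a, b in combinations(entities, 2):
        score = name_similarity(a, b)
        if score >= threshold:
            pairs.append((a.id, b.id, round(score, 2)))
    return pairs


# Illustrative data only. The relation is itself an entity that references
# the ids of the entities it connects.
entities = [
    Entity("p1", "Person", {"name": ["Jane Doe"]}),
    Entity("p2", "Person", {"name": ["Jane  DOE"]}),
    Entity("c1", "Company", {"name": ["Acme Holdings Ltd."]}),
    Entity("o1", "Ownership", {"owner": ["p1"], "asset": ["c1"]}),
]

print(candidate_duplicates(entities))
```

Comparing every pair is quadratic in the number of entities, so a sketch like this only works for small samples; large datasets need some form of candidate blocking or indexing before scoring.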

Over the past years, the toolkit has been developed and extended well beyond the scope of Aleph. Researchers now have a diverse and adaptable toolchain for “FollowTheMoney” investigations, with or without Aleph. Further tools build on top of it for scraping, transforming, deduplicating, analyzing, storing and searching data, and even for presenting or visualizing it in publications.

Thanks to recent work by OpenSanctions.org and investigativedata.io, the toolkit now allows creating a "data lake" with all the datasets relevant to a research group or organization. The methods cover retrieving, parsing, storing and deduplicating huge datasets, enriching them with other sources, and keeping them up to date and easy to use for all kinds of investigations.

Examples of data projects using the "FollowTheMoney" stack are followthegrant.org (tracking conflicts of interest and industry influence on science), opensecuritydata.eu (companies, organizations or projects that receive European Union security and military funding) and https://spendengerichte.correctiv.org (German court donations).

In this workshop we will walk through the current tool stack, show which tool to use for which task, and work together with the attendees on real-world examples built around useful datasets.

Simon Wörpel is an independent investigative data journalist, researcher and leak librarian. He specializes in document processing, data engineering and data analysis for journalistic investigations. Simon works for various non-profit organizations, newspapers and media outlets in Germany. He advises research teams and implements software tools for data mining, document processing and data analysis that enable data-driven investigative journalism.

From 2015 to 2019 he worked as a data journalist and newsroom developer at the German non-profit investigative newsroom CORRECTIV. There he built document databases and security infrastructure for collaborative cross-border investigations like “The CumEx Files” and “Grand Theft Europe”. Both of these projects involved dealing with highly sensitive material and orchestrating secure communication and data sharing between dozens of journalists from different countries.

Simon originally attended journalism school and studied economics and politics in Cologne, Germany.
