Save the Data - .docx im Jahre 2133

The future of data in the cantonal administration of the Canton of Bern

Much of the data produced in the cantonal administration is created in .docx format. PDF is the main competitor, but the bulk of the administration's data production is in the OOXML format (.docx).

Although .docx is the most frequently used format, it is not accepted as an archival format. The various Word versions and certain features (e.g. automatically updated fields) can cause content to change. It is therefore recommended to convert every Word document to PDF/A-2u for archiving. The conversion, however, leads to information loss.

According to Wikipedia, Word is "by far the most widely used word processing program in the world". It has also been on the market for 40 years. Will Word exist for another 40 years? The possibility exists, and so we ask whether documents could be archived as .docx, that is, in their original format.

In this challenge we ask for software that analyses .docx documents and describes the elements that are critical with regard to the mutability of .docx. In addition, we would like a new, archive-ready Word document to be created.

Different versions of Microsoft Word OOXML:

Microsoft Word 2007

Microsoft Word 2010

Microsoft Word 2013

Microsoft Word 2016


Microsoft Word 1.0 - Internet Archive

Further data for the challenge is available in the Slack channel.

Docx Date Converter App

This simple application lets users select a .docx file to be analyzed. The analysis checks whether the document contains any variable fillable fields, replaces them in the XML source with a static value, and delivers a set of relevant metadata values for the analyzed document. The first version focuses only on replacing variable date fields, which Microsoft Word would otherwise fill in automatically when the document is opened. Date values are replaced by a static date at the time of analysis. The application thus converts .docx files that contain a dynamic date field and saves them with a static date, the date on which the .docx was created. It is a solution to the challenge of preserving the integrity of .docx files in archival systems.
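As a rough illustration of the replacement step described above, the sketch below rewrites the XML of the main document part (`word/document.xml`) using only Python's standard library. It is not the project's actual code: the function names (`freeze_date_fields`, `freeze_paragraph`, `static_run`) are hypothetical, and it assumes that fields are direct children of a paragraph and are not nested. It covers both the simple form (`w:fldSimple`) and the complex form (`w:fldChar` begin/separate/end).

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def w(name):
    """Return the qualified ElementTree name of a w: element or attribute."""
    return f"{{{W_NS}}}{name}"


def is_date_instruction(instruction):
    """True if the field instruction starts with the DATE keyword."""
    tokens = instruction.split()
    return bool(tokens) and tokens[0].upper() == "DATE"


def static_run(template_run, static_text):
    """Build a plain run holding static_text, reusing the template's formatting."""
    run = ET.Element(w("r"))
    if template_run is not None:
        properties = template_run.find(w("rPr"))
        if properties is not None:
            run.append(properties)
    ET.SubElement(run, w("t")).text = static_text
    return run


def freeze_paragraph(paragraph, static_text):
    """Replace DATE fields that are direct children of one paragraph."""
    replaced = 0
    field_runs, instruction, result_run = [], "", None
    for child in list(paragraph):
        if child.tag == w("fldSimple"):
            # Simple field: the instruction is an attribute of the wrapper.
            if is_date_instruction(child.get(w("instr"), "")):
                new_run = static_run(child.find(w("r")), static_text)
                paragraph.insert(list(paragraph).index(child), new_run)
                paragraph.remove(child)
                replaced += 1
            continue
        if child.tag != w("r"):
            continue
        marker = child.find(w("fldChar"))
        kind = marker.get(w("fldCharType")) if marker is not None else None
        if kind == "begin":
            # Complex field: collect every run up to the matching "end".
            field_runs, instruction, result_run = [child], "", None
            continue
        if not field_runs:
            continue  # ordinary run outside any field
        field_runs.append(child)
        instruction += "".join(t.text or "" for t in child.iter(w("instrText")))
        if kind is None and result_run is None and child.find(w("t")) is not None:
            result_run = child  # first run of the cached field result
        if kind == "end":
            if is_date_instruction(instruction):
                position = list(paragraph).index(field_runs[0])
                paragraph.insert(position, static_run(result_run, static_text))
                for run in field_runs:
                    paragraph.remove(run)
                replaced += 1
            field_runs = []
    return replaced


def freeze_date_fields(document_xml, static_text):
    """Return (new_xml, number_of_replaced_fields) for one XML part."""
    root = ET.fromstring(document_xml)
    replaced = sum(freeze_paragraph(p, static_text) for p in list(root.iter(w("p"))))
    return ET.tostring(root, encoding="unicode"), replaced
```

Since a .docx file is a ZIP package, a full converter would read `word/document.xml` (and any header or footer parts) with `zipfile`, pass each through `freeze_date_fields`, and write a new package. ElementTree re-serialises namespace declarations, so the output should be opened in Word before it is trusted; a library such as lxml or python-docx may round-trip the XML more faithfully.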

The solution consists of three parts:

  • Front-End (React, Electron)

  • API (Python)

  • Back-End (Python)
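On the Python side, the metadata part of the analysis could look roughly like the sketch below. Which metadata values the app actually reports is not stated, so the selection in `FIELDS` is an assumption, as is the function name `read_core_properties`. The sketch reads the core properties from `docProps/core.xml`, the conventional location of that part inside the .docx ZIP package.

```python
import zipfile
import xml.etree.ElementTree as ET

NAMESPACES = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Assumed selection of "relevant" metadata; adjust to what the archive needs.
FIELDS = {
    "title": "dc:title",
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(docx_file):
    """Return selected core properties of a .docx (path or binary file object).

    Properties that are absent from the package are reported as None.
    """
    with zipfile.ZipFile(docx_file) as package:
        if "docProps/core.xml" not in package.namelist():
            return {}
        root = ET.fromstring(package.read("docProps/core.xml"))
    return {
        key: root.findtext(path, default=None, namespaces=NAMESPACES)
        for key, path in FIELDS.items()
    }
```

The `created` value returned here is one possible source for the static date mentioned above, under the assumption that the package's creation timestamp is the date that should be preserved.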

Docx Date Converter

The Challenge

A large portion of the data produced within the cantonal administration is created in the .docx format. Although .docx is the most frequently used format, it is not permitted for archival, because differences between Word versions and certain functions (e.g., automatically updated fields) can lead to mutable content. It is therefore recommended to convert every Word document to PDF/A-2u for archival purposes, but this conversion leads to information loss.

We ask the question: "Could we archive documents in .docx, i.e., in their original format, given that Word has been a prominent document format for 40 years and might continue to be for the next 40 years?"

This challenge aims to develop software that performs an analysis of .docx documents, identifies critical elements concerning the mutability of .docx, and creates a new archive-friendly Word document.
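The analysis step could start by listing which field types a document contains at all. The sketch below is one hypothetical way to do that (the names `field_instructions` and `scan_docx` are not from the project): it walks every XML part under `word/` in the package and counts the leading keyword of each field instruction, such as `DATE`, `TIME` or `PAGE`. Deciding which of these keywords count as critical for archiving is left open here.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def field_instructions(part_xml):
    """Yield the instruction string of every field in one XML part."""
    root = ET.fromstring(part_xml)
    open_fields = []  # one text buffer per complex field that is still open
    for element in root.iter():
        if element.tag == W + "fldSimple":
            yield element.get(W + "instr", "")
        elif element.tag == W + "fldChar":
            kind = element.get(W + "fldCharType")
            if kind == "begin":
                open_fields.append([])
            elif kind == "end" and open_fields:
                yield "".join(open_fields.pop())
        elif element.tag == W + "instrText" and open_fields:
            # Word may split one instruction across several runs.
            open_fields[-1].append(element.text or "")


def scan_docx(docx_file):
    """Count field keywords in all word/*.xml parts of a .docx package."""
    counts = Counter()
    with zipfile.ZipFile(docx_file) as package:
        for name in package.namelist():
            if not (name.startswith("word/") and name.endswith(".xml")):
                continue
            for instruction in field_instructions(package.read(name)):
                tokens = instruction.split()
                if tokens:
                    counts[tokens[0].upper()] += 1
    return counts
```

Called as `scan_docx("example.docx")` (a placeholder path), this returns a `Counter` that a report could compare against a list of keywords considered volatile.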

Running the Software

To run the software, you need to have both Python (for the Flask backend) and Node.js (for the React frontend) installed on your machine.

Setup

  1. Clone this repository:

    git clone git@github.com:LucaKern/Save-the-Data---.docx-im-Jahre-2133.git
    
  2. Navigate to the project directory:

    cd Save-the-Data---.docx-im-Jahre-2133
    
  3. Install the Node.js dependencies:

    npm install
    
  4. Navigate to the backend directory:

    cd ./backend
    
  5. Create a virtual environment:

    python -m venv venv
    
  6. Activate the virtual environment:

    • On Windows:

      venv\Scripts\activate
      
    • On Unix or macOS:

      source venv/bin/activate
      
  7. Install the Python dependencies:

    pip install -r requirements.txt
    
  8. Navigate to the Electron app:

    cd ../frontend
    

Running the App

To start the application, run the following command:

    npm start

Further Case Analysis

DOCX files, commonly used for document storage and sharing, may not be optimal for long-term archiving due to several factors:

  1. Compatibility: DOCX is a file format associated with Microsoft Office applications. While it enjoys widespread support presently, there is no assurance of its future dominance. If you rely on specific software to access DOCX files, there is a risk of obsolescence or incompatibility with newer systems, hindering retrieval of archived documents.

  2. Backward Compatibility: With each new release, Microsoft Office introduces changes to the DOCX format. Opening older DOCX files with newer Office versions can result in formatting errors, missing content, or altered document layout. Such issues pose challenges when accessing archived files after a significant duration.

  3. Data Corruption: Over time, files may experience corruption or degradation due to hardware failures, software bugs, or storage media issues. Due to the dependencies and complexity of the DOCX format, it is relatively more susceptible to integrity problems. Repairing corrupted DOCX files can be challenging, potentially leading to data loss or incomplete retrieval of archived documents.

  4. Long-Term Storage Standards: For long-term archiving, employing open, standardized file formats is recommended. Formats like PDF/A-2u, designed specifically for archiving, ensure document preservation and accessibility over extended periods. These formats are independent of specific software vendors, enhancing compatibility and longevity.

While converting important documents to standardized formats like PDF/A-2u is generally advised for long-term archiving, the widespread use of DOCX and potential future developments around it could contribute to its acceptance as an archival standard. Factors such as popularity, backward compatibility efforts, support from Microsoft, industry acceptance, and technological advancements might enhance the reliability of DOCX as an archiving format. However, until industry-wide acceptance and preservation efforts validate its suitability, relying on standardized formats like PDF/A-2u for long-term archiving remains the prudent approach.
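Whichever format is archived, the data-corruption point above can be monitored mechanically. As a minimal sketch (the function name `fixity_report` and the report keys are assumptions, not part of the project), the following records a SHA-256 checksum of the file and asks Python's `zipfile` module whether the .docx package is still a readable ZIP archive with intact member checksums.

```python
import hashlib
import zipfile


def fixity_report(path):
    """Return a checksum and a basic ZIP integrity result for one file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    report = {
        "sha256": digest.hexdigest(),
        "readable_zip": False,
        "first_bad_member": None,
    }
    try:
        with zipfile.ZipFile(path) as package:
            # testzip() returns the name of the first corrupt member, or None.
            report["first_bad_member"] = package.testzip()
            report["readable_zip"] = True
    except zipfile.BadZipFile:
        report["readable_zip"] = False
    return report
```

Comparing the stored `sha256` value with a freshly computed one at a later date shows whether the file has changed at all; the ZIP check only narrows down where damage occurred.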

Event finished

13.05.2023 15:00

updated readme (@Luca Kern)

push image (@Luca Kern)

Readme (@Luca Kern)

added screenshot (@Luca Kern)

Delete frontend/dist directory

added files and screenshot (@Luca Kern)

first commit (@Luca Kern)

Get

13.05.2023 07:42

Repository updated

13.05.2023 07:42 ~ BrunoRodrigues

changed to Electron should be working now with API

12.05.2023 15:59 ~ LucaKern

Coming soon

12.05.2023 14:08 ~ MilagrosWernicke

dummy function (@Luca Kern)

Get

12.05.2023 12:20

new readme.md (@Luca Kern)

updated README.md (@Luca Kern)

Update README.md

added README.md (@Luca Kern)

first commit (@Luca Kern)

Get

12.05.2023 11:19

Joined the team

12.05.2023 11:19 ~ Jonas

I've looked into Microsoft Purview, the python-docx library, found a nice technical (XML) intro to .docx, and grabbed me a copy of 💾 Microsoft Word 1.0

12.05.2023 10:39 ~ oleg

Joined the team

12.05.2023 10:37 ~ LucaKern

Find

12.05.2023 10:35

We are defining deliverables and working strategies and identifying further issues.

12.05.2023 10:34 ~ BrunoRodrigues

Ask

12.05.2023 10:34

For the documentation: https://kost-ceco.ch/cms/ooxml.html

12.05.2023 10:34 ~ MilagrosWernicke

Joined the team

12.05.2023 10:29 ~ BrunoRodrigues

Event started

12.05.2023 09:00

Ask

09.05.2023 12:48

Joined the team

10.03.2023 11:10 ~ MilagrosWernicke

Ask

01.03.2023 11:04

Challenge posted

01.03.2023 11:04 ~ Felix
 
All participants, sponsors, partners, volunteers and staff of our hackathon are required to agree to the Hack Code of Conduct. The organisers will enforce this code throughout the event. We expect the cooperation of all participants to ensure a safe environment for everyone. For more details on how the event is run, please see the guidelines on our wiki.

Creative Commons Licence: Unless otherwise stated, the content of this website is licensed under Creative Commons Attribution 4.0 International.

Data Hackdays BE 2023