October 7, 2022

InsiderPlays


MeitY releases draft guidelines on data anonymization for public comments

Update on 06/09/2022 at 1:36 pm: The draft guidelines were taken down from the eGov standards website on September 6th. A PDF of the guidelines can be accessed here.

Original story published on 01/09/2022 at 3:03 pm:

A new set of draft data anonymisation guidelines for e-governance projects is open for public consultation until September 21st, 2022. The guidelines suggest various techniques and SOPs that e-governance projects can adopt to anonymise the data they gather (and then harness it for other projects). They also aim to support the implementation of data anonymisation provisions in policies and laws enacted by the government.

Who prepared this?: The draft report was commissioned by the Ministry of Electronics & Information Technology (MeitY) and prepared by the Standardization Testing Quality Certification (STQC) Directorate and Centre for Development of Advanced Computing (C-DAC). A full list of policymakers involved in framing the guidelines is available in Annexure 2 of the uploaded PDF. 

How to participate?: Email your feedback to Shubhanshu Gupta, Principal Technical Officer at CDAC: shubhanshug[at]cdac[dot]in. Remember to copy the following email address when making your submission: headits[at]stqc[dot]gov[dot]in.





Why anonymise e-governance-related data?: The draft guidelines are clear in their belief that data can play a role in empowering both e-governance and the nation. Emphasising that ‘data-tech’ is now considered a good and central to international dialogue and collaborations, the draft goes on to add that government entities are the ‘most extensive’ data fiduciaries in India. This makes them responsible for protecting the privacy of the reams of citizen data collected through large-scale, interconnected e-governance projects. Data anonymisation—or removing identifiable attributes from a data value to protect a person’s identity—is one such method, according to the guidelines. It protects citizens while also allowing anonymised data to be used for limited purposes in other e-governance projects.

Does anonymisation work?: The draft guidelines add that data anonymisation is not a silver bullet for privacy, arguing that it should be one component of a larger privacy-by-design approach to an e-governance project’s operations. However, the draft offers a muted response to arguments that data can never truly be anonymised (and privacy risks mitigated), musing that time and technological advancements will offer more robust solutions to these issues.

When was this released?: In July 2022, according to the PDF uploaded online. However, brief news reports on the draft guidelines—and a similar set of recommendations for mobile security—prominently appeared around August 30th, 2022. As a result, some commentators have recently questioned MeitY’s muted marketing of the guidelines and the consultation period. 

One step further for non-personal data policy?: The draft guidelines appear to be the next step in India’s policy tryst with harnessing the ‘non-personal data’ (data not related to an individual, or anonymised personal data) of citizens. Since 2020, India has been flirting with the idea of managing and sharing non-personal data to tap its ‘social and public value’ and improve digital innovation and entrepreneurship. These plans then found their way into India’s now-withdrawn proposed data protection laws, although there are now rumours that it may be separately regulated from personal data. Then came the National Data Governance Framework Policy released by MeitY in May 2022, which seeks to build ‘a vast repository of anonymised, non-personal data obtained from government ministries, departments and organisations, alongside anonymised data voluntarily disclosed by private entities’.

Why it matters: While the government purportedly seeks to improve the quality of business and governance through such non-personal data policy frameworks, they could harm hard-earned competitive edges in India’s tech sector, while still raising privacy concerns. Also, while the 2022 draft guidelines want to help the government build ‘a vibrant, diverse and large base of datasets for research and innovation whilst maintaining informational privacy’, whether the State is competent to do so, or whether datasets filled with questionable data can indeed improve governance, largely remains to be seen.


Who Are the Stakeholders Involved in the Anonymisation of Data?

Professional users: People who use the anonymised data captured or processed by an e-governance organisation (the application that captures this data is called the ‘owner application’). These can include citizens, call centres, third-party service providers, researchers, and data analysers. It can also include departments or applications that have to access the data produced by the source application (or where inter-system data sharing is necessary).

Processors: Teams involved in processing the data captured by the owner application. They convert raw data into anonymised data. They include development and testing teams, production support, and system administrators. 

Auditors and reviewers: Responsible for ensuring the anonymity of processed data through rigorous testing. They can include compliance officers, legal staff, or external auditors. 

Data principals: The users, or citizens, whose personal data is processed by e-governance projects during service delivery. 

What kinds of data processing does this document pertain to?

Types of organisational data processing can be classified as follows:

  • Purpose-based data processing, or data processing undertaken for a clearly defined purpose and undertaken with the data principal’s explicit consent;
  • Processing to fulfil ‘lawful disclosure’ requests;
  • Data sharing with processors or third parties for processing;
  • Processing to integrate services and products with other tech ecosystems;
  • Additional processing carried out by organisations to collaborate, improve competitiveness and services, or cross-sell.

‘What, When and How to Anonymise’: What does the draft say?

There’s no one-size-fits-all approach: Determining what data to anonymise, at which stage of the data processing cycle, and how depends on an organisation’s objectives and emerging regulatory regimes and standards. There is a need to consolidate diverse practices and develop standards for data anonymisation.

However, the report does clarify that keeping in mind the principle of ‘data minimisation’, anonymisation should ideally happen as soon as possible in the lifecycle of data collection.


What is the recommended SOP for anonymising data?

The 15-step SOP for organisations undertaking data anonymisation suggests:

  • Step 1: Determine which datasets require anonymisation. Consider data collected from all possible sources.
  • Step 2: Devise a release model or policy on how the anonymised data will be released and to whom. Decide whether this dataset will be publicly available, or shared with controlled groups.
  • Step 3: Identify the teams required within the organisation to perform anonymisation. Identify their roles and responsibilities.
  • Step 4: Determine which data directly identifies an individual (direct identifiers like phone numbers, and interestingly, Aadhaar) and which data indirectly does so (quasi-identifiers like sexual orientation or religious belief). This will help decide which data should be anonymised and the techniques to do so.
  • Step 5: First mask—or anonymise—direct identifiers. This keeps the dataset free from re-identification risks.
  • Step 6: Conduct threat modelling for quasi-identifiers—that is, identify what information could be revealed through them.
  • Step 7: Determine the re-identification risk threshold based on the anonymisation techniques deployed.
  • Step 8: Determine the anonymisation techniques for quasi-identifiers, and document the process.
  • Step 9: Import sample data from the original database and document the same.
  • Step 10: Based on steps 6-9, perform a trial anonymisation and assess whether the results meet risk-limitation expectations. Review and correct errors, and ensure that risk is below the re-identification threshold.
  • Step 11: Now, anonymise all quasi-identifiers across the dataset.
  • Step 12: Stop to evaluate the actual identification risks for the anonymised data again.
  • Step 13: Compare this risk with the threshold laid out by policymakers—if the risk exceeds the threshold, re-evaluate and repeat the testing.
  • Step 14: Determine access controls for sharing anonymised data. Data owners should ensure that the parties they share the information with use it for a limited purpose and that it is not misused. Organisations receiving data should confirm that they will not attempt to re-identify it.
  • Step 15: Document the anonymisation procedure. This will help auditors identify potential flaws in anonymisation too.
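Steps 5, 11, and 12 above can be sketched in code. The following is a minimal, illustrative Python example—the column names, the generalisation rules, and the group-size risk proxy are my assumptions, not taken from the draft:

```python
# Illustrative sketch of the draft's SOP: mask direct identifiers,
# generalise quasi-identifiers, then re-check re-identification risk.
# Column names and the risk measure are hypothetical examples.

records = [
    {"phone": "9876543210", "age": 34, "pincode": "110001"},
    {"phone": "9123456780", "age": 36, "pincode": "110002"},
    {"phone": "9988776655", "age": 52, "pincode": "560001"},
]

def mask_direct(record):
    # Step 5: mask direct identifiers outright.
    out = dict(record)
    out["phone"] = "X" * len(record["phone"])
    return out

def generalise_quasi(record):
    # Steps 8/11: generalise quasi-identifiers
    # (age -> decade band, pincode -> first 3 digits).
    out = dict(record)
    low = (record["age"] // 10) * 10
    out["age"] = f"{low}-{low + 9}"
    out["pincode"] = record["pincode"][:3] + "XXX"
    return out

def smallest_group_size(rows, quasi_keys):
    # Step 12: a crude risk proxy -- the size of the smallest group
    # sharing the same quasi-identifier values (higher is safer).
    groups = {}
    for r in rows:
        key = tuple(r[k] for k in quasi_keys)
        groups[key] = groups.get(key, 0) + 1
    return min(groups.values())

anonymised = [generalise_quasi(mask_direct(r)) for r in records]
risk_k = smallest_group_size(anonymised, ["age", "pincode"])
```

Here the third record ends up alone in its group (`risk_k` of 1), so under steps 12–13 an organisation would generalise further or suppress that record before release.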

The Draft also recommends conducting a risk assessment post-release of the anonymised data. It broadly recommends putting systems in place to report data privacy incidents to concerned stakeholders within specified timeframes. To minimise these events, the Draft emphasises training e-governance officials in data anonymisation techniques across the life cycle of data processing (collection, processing/usage, archival, deletion/destruction).

How can anonymised data’s privacy be measured, according to the draft?

Through approaches like K-anonymity (and related ones like L-diversity and T-closeness). In essence, these measures ensure that the risk threshold for an anonymised dataset has not been passed. Using these functions can help data processors evaluate how resistant their data anonymisation techniques are to re-identification attacks. They can’t guarantee privacy—but can help measure the likelihood of preserving it.
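K-anonymity itself is simple to compute: a dataset is k-anonymous if every combination of quasi-identifier values appears at least k times. A small sketch (the sample rows are hypothetical):

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest 'equivalence class', i.e. the
    minimum number of rows sharing the same quasi-identifier values."""
    counts = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    return min(counts.values())

rows = [
    {"age_band": "30-39", "pincode": "110XXX", "diagnosis": "A"},
    {"age_band": "30-39", "pincode": "110XXX", "diagnosis": "B"},
    {"age_band": "40-49", "pincode": "560XXX", "diagnosis": "A"},
    {"age_band": "40-49", "pincode": "560XXX", "diagnosis": "C"},
]

k = k_anonymity(rows, ["age_band", "pincode"])  # every class has 2 rows
```

L-diversity and T-closeness build on the same grouping idea but additionally constrain the sensitive values (here, `diagnosis`) within each group.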

How should ‘specialised data’ be anonymised?

Specialised data can include audio, video, and images, says the Draft. It can be anonymised and protected through cryptographic methods, such as:

  • Homomorphic encryption: Identifying data is replaced with an encrypted value. This is a type of randomised encryption.
  • Order-preserving encryption: A form of non-randomised, symmetric encryption, it can be used to replace an identifying attribute with an encrypted value. 
  • Homomorphic secret sharing: Identifiable information, or a ‘secret’ value, is broken down into packets called shares. A mathematical operation is then performed on the shares to reconstruct the original secret, or data. In this case, the technique can be used to replace identifying data with shares. 
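The secret-sharing idea can be illustrated with simple additive sharing, where shares can even be added together without revealing the underlying values. This is a minimal sketch, not the scheme the draft specifies; the modulus and share count are my choices:

```python
import random

# A large Mersenne prime as the modulus -- an illustrative choice.
PRIME = 2**61 - 1

def split_secret(secret, n_shares):
    """Additive secret sharing: break a value into n shares that sum to
    the secret mod PRIME. No proper subset of shares reveals anything."""
    shares = [random.randrange(PRIME) for _ in range(n_shares - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# A hypothetical 12-digit identifying number, split into 3 shares.
identifier = 123456789012
shares = split_secret(identifier, 3)
restored = reconstruct(shares)

# The homomorphic property: adding shares pairwise yields shares of the
# sum, computed without ever reconstructing the inputs.
a_shares = split_secret(100, 3)
b_shares = split_secret(23, 3)
sum_shares = [(x + y) % PRIME for x, y in zip(a_shares, b_shares)]
```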

Non-cryptographic methods for anonymising specialised data include:

  • Privacy assessment measures: like K-anonymity, L-diversity, and T-closeness.
  • Permutation: Data values are reordered within the dataset without changing their actual value.
  • Masking: Removing the unique identifiers of a data value.
  • Differential privacy: This method defines privacy ‘mathematically’, and is used in the context of machine learning and statistical analysis. It ensures that anyone viewing the output of a differentially private analysis will draw essentially the same inference about a person’s private information whether or not that person’s data was included in the analysis.
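A common way to achieve differential privacy for simple statistics is the Laplace mechanism: add noise scaled to the query’s sensitivity divided by the privacy budget ε. A minimal sketch for a private count (the dataset and ε value are illustrative; the draft does not prescribe a mechanism):

```python
import math
import random

def dp_count(values, predicate, epsilon):
    """Differentially private count via the Laplace mechanism.
    A count query has sensitivity 1, so noise scale is 1/epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    # Sample Laplace(0, 1/epsilon) noise by inverting its CDF.
    u = random.random() - 0.5
    noise = -(1 / epsilon) * math.copysign(1, u) * math.log(1 - 2 * abs(u))
    return true_count + noise

ages = [34, 36, 52, 41, 29]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=1.0)
```

Smaller ε values add more noise (stronger privacy, less accuracy); larger ε values converge on the true count.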

What other data anonymisation techniques does the draft suggest?

Attribute Suppression

  • What is it?: Removing an entire chunk of identifiable data—or an ‘attribute’—from a dataset when it is not needed for the analysis.
  • When can it be used?: When an anonymised dataset doesn’t require that specific attribute, or when reidentification of that attribute is unnecessary.
  • Pros?: With the possibility of re-identification low, this is one of the ‘simplest and strongest’ techniques to anonymise data, claims the draft.
  • Cons?: Because the attribute is removed entirely and cannot be recovered, future business requirements may be harmed.
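In code, attribute suppression is just dropping the named columns. A small sketch with hypothetical field names:

```python
def suppress_attributes(rows, attributes):
    """Attribute suppression: remove the named columns entirely,
    keeping everything else untouched."""
    return [
        {k: v for k, v in row.items() if k not in attributes}
        for row in rows
    ]

rows = [
    {"name": "Asha", "age": 34, "city": "Delhi"},
    {"name": "Ravi", "age": 52, "city": "Pune"},
]
suppressed = suppress_attributes(rows, {"name"})
```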

Character Masking

  • What is it?: Masking specific characters of a data value with a constant symbol, while still conveying some information about the original value. For example, presenting a 12-digit Aadhaar number 123456789012 as XXXXXXXX9012.
  • When can it be used?: When partially hiding the true value of a data value is sufficient to ensure anonymity.
  • Pros?: It allows data subjects, who own the data, to recognise the information collected on them.
  • Cons?: Some subject expertise is required to mask certain characters of a data value. Even then, re-identification can be easy, says the draft.
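Character masking is straightforward to implement. A sketch of the draft’s Aadhaar-style example, keeping only the last four characters visible (the helper and its defaults are mine):

```python
def mask_value(value, visible_suffix=4, symbol="X"):
    """Replace all but the last `visible_suffix` characters with a
    constant symbol, e.g. 123456789012 -> XXXXXXXX9012."""
    hidden = max(len(value) - visible_suffix, 0)
    return symbol * hidden + value[hidden:]

masked = mask_value("123456789012")
```

Note the draft’s caveat in practice: which characters to keep visible is a judgement call, and keeping too many can make re-identification easy.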

Pseudonymisation/Coding

  • What is it?: Replacing identifiable data with a pseudonym. The original data is securely maintained and can be retrieved to map to the pseudonymised value.
  • When can it be used?: When no information from the original data value can be shown. It can also be useful for cases where the anonymised data needs to be both irreversible (in that the original data values are discarded) and reversible at the same time (in that the original database is maintained).
  • Pros?: This is a good technique for when one-to-one re-identification is required, argues the report.
  • Cons?: The original data needs to be securely stored and managed.
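A minimal pseudonymisation sketch: random tokens replace identifiers, while a mapping table (which would need to be stored securely, as the draft notes) allows one-to-one re-identification. The class and token format are illustrative:

```python
import secrets

class Pseudonymiser:
    """Replace identifiers with random pseudonyms, keeping a mapping
    so the original value can be recovered when authorised."""

    def __init__(self):
        self._forward = {}  # original -> pseudonym
        self._reverse = {}  # pseudonym -> original (store securely!)

    def pseudonymise(self, value):
        if value not in self._forward:
            token = "P-" + secrets.token_hex(4)
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def reidentify(self, token):
        return self._reverse[token]

p = Pseudonymiser()
code = p.pseudonymise("asha@example.com")
```

If the mapping table is discarded, the same process becomes irreversible—which is the dual use the draft describes.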

Data Swapping

  • What is it?: Attributes are shuffled within the dataset so that they don’t correspond to their original data points. The original data is anonymised, while still allowing for accurate analyses. This is an irreversible anonymisation technique, with the report claiming that retrieving original data is almost impossible.
  • When can it be used?: If the underlying data points need to be preserved post anonymisation. Also, useful if relationship analyses within a record are not required.
  • Pros?: Doesn’t need to be applied to all attributes—can be performed on some, while leaving the others untouched.
  • Cons?: When swapping, there is a chance of the same values being swapped for each other, reducing the randomness (and anonymity cover). Also, high-powered computation is required to swap large datasets.
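Swapping a single attribute can be sketched as a shuffle of that column, leaving every other attribute in place (the column names are illustrative):

```python
import random

def swap_attribute(rows, attribute, rng=None):
    """Data swapping: shuffle one attribute's values across records so
    they no longer line up with their original rows. The multiset of
    values is preserved, so aggregate analyses stay accurate."""
    rng = rng or random.Random()
    values = [row[attribute] for row in rows]
    rng.shuffle(values)
    return [{**row, attribute: v} for row, v in zip(rows, values)]

rows = [{"id": i, "salary": s} for i, s in enumerate([30, 40, 50, 60])]
swapped = swap_attribute(rows, "salary", random.Random(42))
```

As the draft’s “Cons” notes, a plain shuffle may leave some values in their original positions; a production implementation would need to check for and reject such fixed points.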

Record Suppression

  • What is it?: Removing an entire record in a dataset. Affects multiple attributes, instead of just one.
  • When can it be used?: When the record doesn’t serve the analyses, but still contains identifying information.
  • Pros?: Easy to implement and a strong anonymisation method, according to the draft.
  • Cons?: Can impact statistics and thus the data analysis itself.
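Record suppression drops whole rows rather than columns. One illustrative policy—my example, not the draft’s—is to drop any row whose quasi-identifier group is too small:

```python
from collections import Counter

def suppress_records(rows, is_risky):
    """Record suppression: drop entire rows flagged as identifying."""
    return [row for row in rows if not is_risky(row)]

rows = [
    {"age_band": "30-39"},
    {"age_band": "30-39"},
    {"age_band": "90-99"},  # unique -> identifying on its own
]
counts = Counter(r["age_band"] for r in rows)
kept = suppress_records(rows, lambda r: counts[r["age_band"]] < 2)
```

The dropped unique row illustrates the draft’s “Cons”: the released dataset’s statistics (e.g. the age distribution) are now skewed.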

Generalisation

  • What is it?: Replacing individual data values with the broader range they fall into. While this value may be less precise, it is ‘semantically consistent’.
  • When can it be used?: For values that can be generalised and still useful for specific data analyses. Or, when attempting to discern a broader trend.
  • Pros?: The data’s truthfulness is preserved, claims the report, leading to high utility for broader analyses.
  • Cons?: At the same time, high generalisation, although privacy-protecting, can also raise its own problems. For example, identifying the level of generalisation can be difficult. Linkage risks persist too.
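Generalisation replaces a precise value with the band it falls into. A sketch for ages, with a configurable band width (the helper is mine):

```python
def generalise_age(age, band=10):
    """Replace an exact age with its band, e.g. 34 -> '30-39'.
    Less precise, but 'semantically consistent' with the original."""
    low = (age // band) * band
    return f"{low}-{low + band - 1}"

banded = [generalise_age(a) for a in [34, 36, 52]]
```

The draft’s caveat shows up directly in the `band` parameter: too narrow and linkage risks persist, too wide and the data loses analytical value.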

Data Perturbation

  • What is it?: Adding ‘noise’ to the dataset to protect the confidentiality of individual records. This is done by modifying or replacing data values for sensitive data with randomised values.
  • When can it be used?: For ‘quasi-identifiers’—or values which can be identified if combined with others. It should be used if the goal is to allow users to access important data without compromising individual privacy.
  • Pros?: ‘Does not require knowledge of the distribution of other records in dataset,’ says the draft. 
  • Cons?: The technique is ineffectual if data accuracy is needed. Also, a ‘small base’ can lead to overall weak anonymisation.
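A simple perturbation sketch: add zero-mean noise to each numeric value, so individual records are obscured while aggregate statistics stay roughly intact (the Gaussian noise model and scale are my illustrative choices):

```python
import random

def perturb(values, scale=2.0, rng=None):
    """Data perturbation: add zero-mean Gaussian noise to each value.
    Exact values are hidden; means and trends survive approximately."""
    rng = rng or random.Random()
    return [v + rng.gauss(0, scale) for v in values]

ages = [34, 36, 52, 41, 29]
noisy_ages = perturb(ages, scale=2.0, rng=random.Random(7))
```

The draft’s “small base” caveat is visible here: with only five records, even modest noise can noticeably distort the dataset’s statistics.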

Synthetic Data

  • What is it?: Artificially generated data that attempts to approximate the original dataset. It builds a statistical model based on sampled data from the original set, which is then used for analyses.
  • When can it be used?: When system testing using large datasets—where the actual data cannot be used, but the sample used should be ‘realistic’ and comparable to the actual set. Or, when no connection needs to be established between the anonymised and real data.
  • Pros?: Helpful when system testing. Can be publicly published and shared as the risk of re-identification is reduced.
  • Cons?: The data’s ‘truthfulness’ is lost.
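A toy version of synthetic data generation: fit a simple statistical model to the real sample, then draw new values from it. The normal model and the income figures below are illustrative assumptions, not from the draft:

```python
import random
import statistics

def synthesise(sample, n, rng=None):
    """Fit a normal model to the real sample and draw n synthetic
    values -- realistic in distribution, but tied to no real person."""
    rng = rng or random.Random()
    mu = statistics.mean(sample)
    sigma = statistics.stdev(sample)
    return [rng.gauss(mu, sigma) for _ in range(n)]

real_incomes = [30_000, 42_000, 38_000, 55_000, 47_000]
synthetic = synthesise(real_incomes, 100, random.Random(1))
```

This captures the draft’s trade-off: the synthetic values are “realistic and comparable” in aggregate, but no individual synthetic record is “truthful”.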

Data Aggregation

  • What is it?: Data is aggregated and summarised for data analyses on relationships and patterns between the data values.
  • When can it be used?: When the data needs to be summarised to perform statistical analyses. Or, as the draft colourfully states, data aggregation tools can be used when looking to perform analyses beyond the ‘two-dimensional’ rows and columns of Microsoft Excel.
  • Pros?: As the data is aggregated, the draft claims that re-identification risks do not exist.
  • Cons?: The data’s ‘utility’ may be hampered, especially for applications that require data values for analysis.
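In practice, aggregation means releasing only group-level summaries, never the underlying rows. A minimal sketch with hypothetical district-income records:

```python
from collections import defaultdict

def aggregate(rows, group_key, value_key):
    """Data aggregation: summarise records into per-group counts and
    means; only these aggregates -- not the rows -- are released."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        g: {"count": len(v), "mean": sum(v) / len(v)}
        for g, v in groups.items()
    }

rows = [
    {"district": "A", "income": 30},
    {"district": "A", "income": 50},
    {"district": "B", "income": 40},
]
summary = aggregate(rows, "district", "income")
```

The draft’s “Cons” is visible in the output: once summarised, per-record values (needed for some analyses) are gone for good.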

This post is released under a CC-BY-SA 4.0 license. Please feel free to republish on your site, with attribution and a link. Adaptation and rewriting, though allowed, should be true to the original.
