Graph-Driven Right-to-Be-Deleted Automation with Neo4j and Spark

Authors

  • Radhakrishnan Pachyappan, VDart Technologies, USA
  • Srikanth Gorle, Foot Locker, USA
  • Amsa Selvaraj, Amtech Analytics, USA

Keywords:

data lineage, GDPR, Neo4j, Spark, graph traversal, data deletion, privacy compliance, distributed query processing

Abstract

This study presents a graph-based architecture for automating the data erasure requirements of GDPR Article 17 in large, lineage-rich analytical ecosystems. The proposed Neo4j property graph models inter-table derivations and transformation links, giving a persistent record of lineage attributes. Apache Spark propagates validated data subject deletion requests across this graph. Distributed graph traversals follow dependency chains to find every table and column derived from a data subject's identifier. To resolve deletions transitively in complex pipelines, the method handles updates, joins, and aggregations. A conflict resolution mechanism preserves the meaning of data that is shared. We evaluate the approach for latency, storage overhead, and scalability using real-world data volumes and query topologies. The results indicate that graph-driven propagation simplifies compliance and lets privacy authorities and data stewards generate automated audit trails.
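
The transitive resolution step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a plain Python dictionary stands in for the Neo4j property graph, a single-process breadth-first search stands in for the Spark-distributed traversal, and every table and column name, along with the helpers `derived_columns` and `affected_tables`, is hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: source column -> columns derived from it.
# In the paper's setting these would be relationships in a Neo4j property
# graph; a plain dict stands in for the graph here.
LINEAGE = {
    "raw.users.user_id": ["staging.orders.user_id", "staging.sessions.user_id"],
    "staging.orders.user_id": ["marts.order_totals.user_id"],
    "staging.sessions.user_id": ["marts.engagement.user_id"],
    "marts.order_totals.user_id": [],
    "marts.engagement.user_id": [],
    "raw.products.sku": ["marts.catalog.sku"],
}


def derived_columns(lineage, start):
    """Return every column transitively derived from `start`, in BFS order."""
    seen = {start}  # guards against cycles and repeated visits
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def affected_tables(columns):
    """Group fully qualified names (schema.table.column) by schema.table."""
    tables = {}
    for col in columns:
        table, _, column = col.rpartition(".")
        tables.setdefault(table, []).append(column)
    return tables


# Columns that a deletion request keyed on raw.users.user_id must reach.
targets = derived_columns(LINEAGE, "raw.users.user_id")
print(affected_tables(targets))
```

The traversal reaches only columns downstream of the subject identifier, so the unrelated `raw.products.sku` branch is left untouched. A production system would also have to decide how to treat aggregated rows, which this sketch does not attempt.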

Published

23-01-2023

How to Cite

[1]
Radhakrishnan Pachyappan, Srikanth Gorle, and Amsa Selvaraj, “Graph-Driven Right-to-Be-Deleted Automation with Neo4j and Spark”, Los Angeles J Intell Syst Pattern Rec, vol. 3, pp. 565–598, Jan. 2023, Accessed: Mar. 07, 2026. [Online]. Available: https://lajispr.org/index.php/publication/article/view/85