Zingg - Open source deduplication, data mastering and entity resolution

  •        0
  

We aggregate and tag open source projects. We have collections of more than one million projects. Check out the projects section.



As enterprises struggle to save and analyze more and more data, data quality suffers. Real world data contains multiple records belonging to the same customer. These records can be in single or multiple systems and they have variations across fields which makes it hard to combine them together, especially with growing data volumes. This hurts customer analytics - establishing lifetime value, loyalty programs or marketing channels is impossible when the base data is not linked. No AI algorithm for segmentation can produce right results when there are multiple copies of the same customer lurking in the data. No warehouse can live up to its promise if the dimension tables have duplicates.


With a modern data stack and DataOps, we have established patterns for E and L in ELT for building data warehouses, datalakes and deltalakes. However, the T - getting data ready for analytics still needs a lot of effort. Modern tools like DBT are actively and successfully addressing this. What is also needed is a quick and scalable way to build the single source of truth of core business entities post Extraction and pre or post Loading. With Zingg, the analytics engineer and the data scientist can quickly integrate data silos and build unified views at scale!


Zingg integrates different records of an entity like customer, patient, supplier, product etc in same or disparate data sources. Zingg is useful for

  • Building unified and trusted views of customers and suppliers across multiple systems
  • Large Scale Entity Resolution for AML, KYC and other fraud and compliance scenarios
  • Deduplication and data quality
  • Identity Resolution
  • Integrating data silos during mergers and acquisitions
  • Data enrichment from external sources
  • Establishing customer households

Zingg is a no code ML based tool for data unification. It scales well to enterprise data volumes and entity variety. It works for English as well as Chinese, Thai, Japanese, Hindi and other languages.


Reference:

   

Building Zingg - open source data mastering, deduplication and entity resolution with ML.

Subscribe to our newsletter.

We will send mail once in a week about latest updates on open source tools and technologies. subscribe our newsletter



Related Articles

Top 10 AI development tools which you should know in 2020

  • artificial-Intelligence neural-networks frameworks

It is a fact the 2020 is not going the way we expected to be but when it comes to technology breakthrough we can say 2020 will be the heir of greatness. <br />Speaking of technical breakthroughs we have got artificial intelligence which is known to be taking over the mankind like a wildfire. Everything around us is connected through AI be it shopping travelling or even reading. Every other activity of ours is transforming into a whole new extent.

Read More


Getting started with Apache OpenNLP

  • apache-opennlp machine-learning java nlp natural-language-processing

The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. OpenNLP also includes entropy and perceptron based machine learning. . It contains several components for natural language processing pipeline like sentence detector, tokenizer, name finder, document categorizer, part-of-speech tagger, chunker, parser, co-reference resolution.

Read More


Microsoft acquires GitHub. Is it Good or Bad?

  • microsoft github news

Microsoft announced that it is acquiring GitHub for 7.5 Billion dollars. GitHub is a most used Software developer and Code hosting platform. It hosts more than 80 million code repository and more than 20 million developers collaborate in GitHub. In addition to managing code repositories, GitHub has developed many tools to increase the productivity of developers. Almost 70% of open source projects hosted in GitHub. Microsoft is a big corporate leader and it buying a open source code hosting company, Is it good or bad?

Read More


PySpark: Installation, RDDs, MLib, Broadcast and Accumulator

  • pyspark spark python rdd big-data

We knew that Apace Spark- the most famous parallel computing model or processing the massive data set is written in Scala programming language. The Apace foundation offered a tool to support the Python in Spark which was named PySpark. The PySpark allows us to use RDDs in Python programming language through a library called Py4j. This article provides basic introduction about PySpark, RDD, MLib, Broadcase and Accumulator.

Read More


Column database vs OLAP

  • business-intelligence olap column-database

OLAP (Online Analytical Processing), Reporting, Data mining related tasks are usually done by Business intelligence products. They do powerful Extraction, Transformation and Loading (ETL) the data and provides various reports. They use relational database as its back end. How could they generate better reports? Will column DB do a better job?

Read More



Apache OpenNLP - Document Classification

  • opennlp natural-language-processing nlp document-classification

Apache OpenNLP is a library for natural language processing using machine learning. In this article, we will explore document/text classification by training with sample data and then execute to get its results. We will use plain training model as one example and then training using Navie Bayes Algorithm.

Read More


Coursera - Take the World's Best Courses, Online, For Free

  • free course academic entrepreneurship

Coursera is a social entrepreneurship company that partners with the top universities in the world to offer courses online for anyone to take, for free. Their courses include in various categories like Biology, Business management, Computer science, Robotics, Artificial Intelligence, Finance, Nutrition, Law, Mathematics, Medicine, Genetics, Data analytics and lot more.

Read More


10 sites to get the large data set or data corpus for free

  • search test-data large-data-set data-corpus dataset

You may require GBs of data to do performance or load testing. How your app behaves when there is loads of data. You need to know the capacity of your application. This is the frequently asked question from the sales team "The customer is having 100GB of data and he wants to know whether our product will handle this? If so how much RAM / Disk storage required?". This article has pointers to the large data corpus.

Read More


How hashmap works in Java. My style of learning.

  • java hashmap opensource-learning

This is the most frequently asked questions in the interview. Googling will throw many links related to this topic. How to learn the implementation of hash map? My style of learning using open source learning technique.

Read More


Getting Started on Angular 7

  • angular ui-ux front-end-framework

Angular is a platform for building responsive web, native desktop and native mobile applications. Angular client applications are built using HTML, CSS and Typescript. Typescript is a typed superset of Javascript that compiles to plain Javascript. Angular core and optional modules are built using Typescript. Code has been licensed as MIT License.

Read More


RESTEasy Advanced Guide - Filters and Interceptors

  • resteasy rest-api filters interceptors java

RESTEasy is JAX-RS 2.1 compliant framework for developing rest applications. It is a JBoss project that provides various frameworks to help you build RESTful Web Services and RESTful Java applications. It is a fully certified and portable implementation of the JAX-RS 2.1 specification, a JCP specification that provides a Java API for RESTful Web Services over the HTTP protocol.

Read More


EdX - The Future of Online Education from MIT and Harvard

  • free course academic

EdX is an online learning platform founded by Harvard University and the Massachusetts Institute of Technology (MIT). Along with offering online courses, the institutions will use edX to research how students learn and how technology can transform learning–both on-campus and worldwide. EdX is based in Cambridge, Massachusetts and is governed by MIT and Harvard.

Read More


RESTEasy - A guide to implement CRUD Rest API

  • resteasy rest-api java framework

RESTEasy is a JBoss project that provides various frameworks to help you build RESTful Web Services and RESTful Java applications. It is a fully certified and portable implementation of the JAX-RS 2.1 specification, a JCP specification that provides a Java API for RESTful Web Services over the HTTP protocol. It is licensed under the Apache 2.0 license.

Read More


RESTEasy Advanced guide - File Upload

  • resteasy rest-api file-upload java

RESTEasy is a JBoss project that provides various frameworks to help you build RESTful Web Services and RESTful Java applications. It is a fully certified and portable implementation of the JAX-RS 2.1 specification, a JCP specification that provides a Java API for RESTful Web Services over the HTTP protocol. It is licensed under the ASL 2.0.

Read More


Lucene / Solr as NoSQL database

  • lucene solr no-sql nosql document-store

Lucene and Solr are most popular and widely used search engine. It indexes the content and delivers the search result faster. It has all capabilities of NoSQL database. This article describes about its pros and cons.

Read More


An Introduction to the UnQLite Embedded NoSQL Database Engine

  • database nosql embedded key-value-store

UnQLite is an embedded NoSQL database engine. It's a standard Key/Value store similar to the more popular Berkeley DB and a document-store database similar to MongoDB with a built-in scripting language called Jx9 that looks like Javascript. Unlike most other NoSQL databases, UnQLite does not have a separate server process. UnQLite reads and writes directly to ordinary disk files. A complete database with multiple collections is contained in a single disk file. The database file format is cross-platform, you can freely copy a database between 32-bit and 64-bit systems or between big-endian and little-endian architectures.

Read More


Advantages and Disadvantages of using Hibernate like ORM libraries

  • database orm

Traditionally Programmers used ODBC, JDBC, ADO etc to access database. Developers need to write SQL queries, process the result set and convert the data in the form of objects (Data model). I think most programmers would typically write a function to convert the object to query and result set to object. To overcome these difficulties, ORM provides a mechanism to directly use objects and interact with the database.

Read More


Git vs Subversion

  • version-control subversion git

Git and Subversion are most popular and widely used version control system. What is the best situation to choose them? It is important to know its pros and cons, evaluate your requirement and choose the right one.

Read More


Open-Source Software Projects - Typical Common Reasons for Failure

  • opensource failures mistakes

Failure is often the catalyst for success. You need to make mistakes to truly learn and grow as a tech worker. Sometimes, you don’t need to make the mistakes to learn. Learning from other’s mistakes can be just as informative and productive. Paying attention to what others have done can help you avoid the same kind of failure. Here are a few common mistakes that authors make that lead to the failure of open-source software projects.

Read More


How Debuggers Work

  • golang debug

The first thing I do when I create a project is to create the debugger launch config at the `.vscode` folder. Debuggers help me to avoid putting print statements and building the program again. I always wondered how a debugger can stop the program on the line number I want and be able to inspect variables. Debugger workings have always been dark magic for me. At last, I managed to learn dark art by reading several articles and groking the source code of delve.

Read More







We have large collection of open source products. Follow the tags from Tag Cloud >>


Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.